...

| Shared Resource | Use | Notes |
|---|---|---|
| batch machines | LSF general queues | fell, hequ, kiso, dole, bullet |
| interactive login machines | light load, short-running, development/test | rhel6-64, rhel5-64, centos7 |
| NFS disks | site-wide, medium performance | Fermi group and user space; in general, NOT backed up |
| AFS disks | global, medium performance | $HOME directories for all users; these areas are backed up |
| xroot disks | site-wide, high performance | Fermi storage for bulk data: read by everyone, write restricted to pipeline and science groups |
| network facilities | switches & fabric | |
It is your responsibility to assess the impact of any significant computing project you wish to run, to ensure it will not unduly stress the system or make it unusable for other users. Such an assessment may start with running successively larger numbers of jobs while carefully monitoring the impact on key servers.  In addition, there are some known problems that one must take care to avoid entirely.  This document attempts to provide some hints on preparing your batch jobs and assessing their impact.

...

Tip: Start faster

Please also see this page to learn how to get your batch jobs to start running sooner.

...


...

Known problems to avoid

PFILE (and other) Simultaneous File Writing Conflicts

Parameter files ("PFILES") are used by the Fermi ScienceTools and FTools to store default and last-used options for the commands within these packages.  Normally, these small files are stored in $HOME/pfiles and are written each time a command is invoked.  If multiple jobs attempt to access these files simultaneously, an unfortunate and painful conflict will result.  Not only will your jobs fail to give reliable results, but this sort of activity is very demanding on file servers and can cause severely degraded performance for all users.


Warning

This problem may also occur with other files, typically a "dot file" or "dot directory" in your $HOME directory. Therefore, it is good practice to redefine $HOME to point to a non-shared scratch directory for all projects requiring multiple, simultaneous batch jobs.
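A minimal sketch of this per-job isolation is shown below. The paths are illustrative (here /tmp stands in for your local scratch area), and LSB_JOBID is the job ID that LSF sets inside a batch job, with the shell's process ID as a fallback:

```shell
#!/bin/sh
# Per-job isolation sketch: give each job its own HOME and PFILES.
# LSB_JOBID is set by LSF inside a batch job; $$ is a fallback so
# the sketch also works outside LSF.
SCRATCH=${TMPDIR:-/tmp}                          # stand-in for local scratch
JOB_DIR="$SCRATCH/${USER:-user}_job_${LSB_JOBID:-$$}"
mkdir -p "$JOB_DIR/pfiles"

# Point HOME and PFILES at the job-local area so no two jobs
# touch the same dot files or parameter files.
export HOME="$JOB_DIR"
export PFILES="$JOB_DIR/pfiles"
```

With this in place at the top of a batch script, every job writes its dot files and parameter files to its own directory instead of the shared $HOME.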

...


Therefore, PFILES should be written to directories which are unique for each job, e.g.,

...

One can sometimes get a traceback from a truncated core dump file, but not much more, so the favored approach is to disable core files completely.
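In a shell-based batch script, core files can be disabled with the shell's ulimit builtin before the application starts:

```shell
# Disable core dumps for this shell and everything it launches;
# put this near the top of your batch job script.
ulimit -c 0
```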


...

Minimizing stress on file servers

...

Info

Any large task of more than a few tens of batch jobs must be ramped up slowly, to allow monitoring of the relevant servers for adverse impact. Additionally, some tasks may require you to trickle in jobs rather than submitting them as a single large batch, to prevent overloading of, for example, the Gleam and/or ScienceTools code server.
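Such a trickle can be scripted around the LSF bsub command. In the sketch below, the job count, batch size, and pause length are illustrative assumptions, myjob.sh is a placeholder script, and the leading echo makes it a dry run until you are ready to submit for real:

```shell
#!/bin/sh
# Trickle jobs into LSF in small batches instead of one big submit.
DRYRUN=echo      # remove "echo" to actually submit
NJOBS=100        # total jobs (illustrative)
BATCH=25         # submit this many, then pause to check the servers
count=0
while [ "$count" -lt "$NJOBS" ]; do
    $DRYRUN bsub ./myjob.sh "$count"        # myjob.sh is a placeholder
    count=$((count + 1))
    if [ $((count % BATCH)) -eq 0 ] && [ "$count" -lt "$NJOBS" ]; then
        $DRYRUN sleep 600                   # ~10 min to monitor server load
    fi
done
```

During each pause, check the load on the relevant file and code servers before letting the next batch go in.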

 

...



General Guidelines for Using Remote File Servers

...

| operation | local scratch | xroot | NFS | AFS | Notes |
|---|---|---|---|---|---|
| writing large files (>100MB) | ✓ | ✓ | ✗ | ✗ | |
| reading large files (>100MB) | ✓ | ✓ | ✓ | ✗ | |
| writing small files (<100MB) | ✓ | ✓ | ✓ | ✓ | okay only in small numbers (NFS/AFS) |
| reading small files (<100MB) | ✓ | ✓ | ✓ | ✗ | |
| copying whole files (local<->remote) | ✓ | ✓ | ✓ | ✓ | typically at job start and/or end |
| frequently creating/opening/closing/deleting files | ✓ | ✗ | ✗ | ✗ | best to avoid this completely |
| frequently stat'ing files | ✓ | ✗ | ✗ | ✗ | |
| multiple jobs writing to the same file | ✗ | ✗ | ✗ | ✗ | don't do this! |

(✓ = acceptable, ✗ = avoid)

...

  1. xroot is the repository for Fermi on-orbit data and Monte Carlo data. It is readable by anyone, but typically not writable except by pipeline accounts.
  2. NFS refers to the collection of servers dedicated for Fermi use. Typically one server (machine) has multiple disks attached, so that stressing a server can cause a problem for multiple groups of users.
  3. AFS is the filesystem used for user $HOME directories and a relatively small collection of Fermi group areas. It is a world-accessible file system (given proper access credentials) and caches files on local machines.

...


Finally, keep track of how much space is available on the directories/partitions you write to.  Writing to a 100% full partition is known to cause a lot of stress on the file server.  It is easy to check the available space: cd to the directory of interest and then issue the "df -h ." command, which will report the size and remaining space on that partition.  (Less frequently, one may encounter a different limit: inodes.  Check your inode usage with the "df -hi ." command; this quota counts the total number of files, directories, and symlinks.)
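For example (the directory below is a stand-in for whichever one you write to):

```shell
cd /tmp        # substitute the directory your jobs write to
df -h .        # size, used, and available space on this partition
df -hi .       # inode usage (files + directories + symlinks)
```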

...