...

(Skip to the bottom of this page for a concise reminder of all Best Practices.)

...

Shared Resources

When you run batch jobs, you will be sharing various computing resources with your Fermi colleagues and the rest of SLAC.  

...

Tip: Start faster

Please also see this page to learn how to get your jobs to start running sooner.

 

...

Known problems to avoid

PFILE (and other) Simultaneous File Writing Conflicts

...

One can sometimes get a traceback from a truncated core dump file, but not much more, so the favored approach is to disable core files completely.
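For jobs launched from a shell script, the usual way to disable core files is the shell's ulimit builtin; a minimal sketch:

```shell
# Disable core dump files for this shell and everything it launches,
# so a crashing job cannot litter scratch or network disk with cores.
ulimit -c 0

# Confirm the new limit took effect.
ulimit -c    # prints "0"
```

Put this near the top of the job script so every child process inherits the limit.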

 

...

Minimizing stress on file servers

The single largest point of stress is I/O overload on the file servers. There are many ways to stress a server, and different servers "run out of steam" at different thresholds depending on the number of CPUs, CPU speed, amount of memory, I/O subsystem, number of network connections, disk speed/buffering/controller details, RAID striping, etc.

The number of simultaneous jobs that may be run without causing severe stress will vary depending upon exactly what the jobs are doing.  For example, some jobs perform heavy I/O at the beginning and end, while others perform I/O continuously.  Every job is a bit different and so requires its own assessment.
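One simple way to make that assessment is to time each phase of the job separately and see whether the I/O is concentrated at the start and end or spread throughout. The sketch below simulates an I/O-heavy stage-in with dd followed by a "compute" phase; the phases and sizes are placeholders, not a real Fermi job:

```shell
# Crude I/O-profile assessment: time each phase of a job separately.
# dd here just stands in for an I/O-heavy stage-in phase.
WORK=$(mktemp -d)

start=$(date +%s)
dd if=/dev/zero of="$WORK/stage-in.dat" bs=1048576 count=8 2>/dev/null  # "stage in"
mid=$(date +%s)
sleep 1                                                                 # "compute"
end=$(date +%s)

echo "I/O phase: $((mid - start))s, compute phase: $((end - mid))s"
rm -rf "$WORK"
```

Timing a real job this way shows how many copies can safely run at once against the same server.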

 

General Guidelines for Using Remote File Servers

The most basic rule is to avoid prolonged I/O to a remote file server.  (The one exception is xroot, which seems able to handle very large loads of this type.)  This includes file reading and writing as well as directory operations such as creating, opening, closing, and deleting files.  A good way to design your job is to copy needed input files to local scratch space for reading, write output data products to local scratch, and then copy them to a remote file system at job completion.
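The stage-in / compute / stage-out pattern described above can be sketched as follows. Here temporary directories stand in for the remote server and local scratch, and the "compute" step is a trivial transformation; in a real job the remote path would be an NFS area and the computation your own program:

```shell
# Simulated stage-in / compute / stage-out pattern.
# "REMOTE" stands in for an NFS-served area; "SCRATCH" for local scratch.
REMOTE=$(mktemp -d)
SCRATCH=$(mktemp -d)

echo "input data" > "$REMOTE/input.dat"

# Stage in: one-shot copy from the remote server to local scratch.
cp "$REMOTE/input.dat" "$SCRATCH/"

# Compute against local disk only -- no sustained remote I/O.
tr 'a-z' 'A-Z' < "$SCRATCH/input.dat" > "$SCRATCH/output.dat"

# Stage out: one-shot copy back to the remote server, then clean up scratch.
cp "$SCRATCH/output.dat" "$REMOTE/"
rm -rf "$SCRATCH"

cat "$REMOTE/output.dat"    # prints "INPUT DATA"
```

The remote server sees only two brief bursts of I/O, at job start and job end, no matter how long the compute phase runs.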

...

  1. xroot is the repository for Fermi on-orbit data and Monte Carlo data. It is readable by anyone, but typically not writable except by pipeline accounts.
  2. NFS refers to the collection of servers dedicated for Fermi use. Typically one server (machine) has multiple disks attached, so that stressing a server can cause a problem for multiple groups of users.
  3. AFS is the filesystem used for user $HOME directories and a relatively small collection of Fermi group areas. It is a world-accessible file system (given proper access credentials) and caches files on local machines.

 

Local Scratch Space

Local scratch directories are available on all SLAC linux machines. They vary in size from several GB to several hundred GB. This space is shared by all users logged into a given host.  On batch machines, it is vitally important to clean up this space at the end of your job; otherwise it will fill up over time (and this has happened).  Common practice is to create a directory named after your username and put all files in there.  Note that if you use the Fermi pipeline to manage your job flow, you will need to devise a 'stream-dependent' method of naming your scratch sub-directories to prevent jobs running on the same host from overwriting each other's files.
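One way to get stream-dependent naming plus guaranteed cleanup is to key the scratch sub-directory on the batch system's job id and remove it on exit. The sketch below assumes an LSF-style batch system (LSB_JOBID), with the shell PID as a fallback; the scratch base path here is a stand-in for the host's real scratch area:

```shell
# Per-user, per-stream scratch naming so concurrent jobs on the same host
# cannot overwrite each other's files. LSB_JOBID is LSF's job-id variable
# (an assumption about the batch system); the shell PID fallback is still
# unique per process on a host.
SCRATCH_BASE=${SCRATCH_BASE:-$(mktemp -d)}   # stand-in for the host's scratch area
STREAM=${LSB_JOBID:-$$}
SCRATCH="$SCRATCH_BASE/${USER:-$(id -un)}/$STREAM"
mkdir -p "$SCRATCH"

# Reclaim the space even if the job exits abnormally.
trap 'rm -rf "$SCRATCH"' EXIT

echo "$SCRATCH"
```

The trap fires on normal completion, errors, and most signals, so the batch host does not slowly fill with orphaned scratch directories.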

...

Finally, note that all linux machines have a /tmp disk partition.  It is strongly recommended that /tmp NOT be used, because of the danger of it filling up, which can cause the machine to crash.


Monitoring remote file servers

First, one must identify the server holding all of the job's needed input and future output files.
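One portable way to identify the server behind a given directory is df, whose Filesystem column reads server:/export for an NFS mount (and a device path for a local disk); a sketch:

```shell
# Print the file server (or device) backing a given directory.
# -P forces one-line-per-filesystem output; NR==2 picks the data line.
df -P "$HOME" | awk 'NR==2 {print $1}'
```

Run this against each input and output path the job touches to build the list of servers worth watching.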

...