Best Practices for Using the SLAC Batch System

Version of 7/29/2014


(Skip to the bottom of this page for a concise reminder of all Best Practices.)

...

Shared Resources

When you run batch jobs, you will be sharing various computing resources with your Fermi colleagues and the rest of SLAC.  

...

Tip: Start faster

Please also see this page to learn how to get your jobs to start running sooner.

 

...

Known problems to avoid

PFILE (and other) Simultaneous File Writing Conflicts

Parameter files ("PFILES") are used by the Fermi ScienceTools and FTools to store default and last-used options for the commands within these packages.  Normally these small files are stored in $HOME/pfiles and are rewritten each time a command is invoked.  If multiple jobs access these files simultaneously, an unfortunate and painful conflict will result: not only will your jobs fail to give reliable results, but this sort of activity is very demanding on file servers and can severely degrade performance for all users.

...

  • Create a unique directory in /scratch for your batch job,

    mkdir -p  /scratch/<userid>/${LSB_JOBID}
     
  • Define this directory as your $HOME and then go there prior to running any ScienceTools/Ftools/etc.,

    export HOME=/scratch/<userid>/${LSB_JOBID}
    cd ${HOME}

    This will automatically take care of PFILES being unique for your job and avoid overloading the /nfs user disk with large numbers of opens and closes.  Create any new files in $HOME and then copy anything you wish to save at the end of your job.
     
  • Clean up the scratch directory at the end of the job (after you have copied out anything you want to save),

    rm -rf /scratch/<userid>/${LSB_JOBID}

    Note: Cleaning up the scratch directory is critical! Scratch files left behind accumulate and will eventually fill the /scratch partition.  (On the batch machines, the local scratch spaces are cleaned up periodically, but on far too long a time-scale to prevent them from filling up!)
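The three steps above can be combined into a short bash sketch. The function names and the fall-back to the process ID when $LSB_JOBID is unset (useful only for testing outside LSF) are illustrative choices, not part of the ScienceTools:

```shell
#!/bin/bash
# Sketch of the per-job scratch setup described above.  Under LSF,
# $LSB_JOBID is set automatically; the $$ fallback is for testing only.

setup_job_scratch() {
    local base="${1:-/scratch}"                        # scratch root on the batch host
    JOBDIR="${base}/${USER:-$(id -un)}/${LSB_JOBID:-$$}"
    mkdir -p "$JOBDIR"                                 # unique per-job directory
    export HOME="$JOBDIR"                              # PFILES default to $HOME/pfiles
    cd "$HOME"
}

cleanup_job_scratch() {
    cd /                                               # step out before deleting
    rm -rf "$JOBDIR"                                   # leave /scratch clean
}
```

In a real job you would call setup_job_scratch, run your ScienceTools/FTools commands, copy results to their final destination, and finish with cleanup_job_scratch.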


Core files

Core files may be produced if your executable crashes.  These files can be huge and are typically written to $HOME which, unless you have redefined it as described in the preceding section, resides on a shared NFS or AFS server.  Imagine the load on a file server if hundreds of jobs suddenly crash and all attempt to write multi-GB core dumps simultaneously.  Not only can this bring a file server to its knees, the problem can be very difficult to halt once it starts.  For this reason, and because hundreds of core files are almost never useful, we strongly urge that core dumps be disabled, or severely limited in size, for all batch jobs (other than individual test jobs, where a core dump may actually be of some use).  Place the appropriate line(s) from the table below at the beginning of your batch scripts.

...

One can sometimes get a traceback from a truncated core dump file, but not much more, so the favored approach is to disable core files completely.
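For bash/sh-family batch scripts, disabling core dumps is a single line near the top of the script:

```shell
#!/bin/bash
# Disable core dumps entirely for this job (bash/sh syntax; under
# csh/tcsh the equivalent is "limit coredumpsize 0").
ulimit -c 0

# Any crash in this script or its child processes now produces no core file.
ulimit -c     # prints the current limit: 0
```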

 

...

Minimizing stress on file servers

The single largest source of stress is I/O overload on the file servers. There are many ways to stress a server, and different servers will "run out of steam" at different thresholds depending on the number of CPUs, CPU speed, amount of memory, I/O subsystem, number of network connections, disk speed/buffering/controller details, RAID striping, etc.

The number of simultaneous jobs that may be run without causing severe stress will vary depending upon exactly what the jobs are doing.  For example, some jobs perform heavy I/O at the beginning and end, while others perform I/O continuously.  Every job is a bit different and so requires its own assessment.

 

General Guidelines for Using Remote File Servers

The most basic rule is to avoid prolonged I/O to a remote file server.  (The one exception is xroot, which seems able to handle very large loads of this type.)  This includes file reading and writing as well as directory operations such as creating, opening, closing, and deleting files.  A good way to design your job is to copy needed input files to local scratch space for reading, write output data products to local scratch, and then copy them to a remote file system at job completion.
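A minimal sketch of that stage-in / compute / stage-out pattern. The function, the `LOCAL_SCRATCH` variable, and the `wc -l` stand-in for a real analysis step are all illustrative; on a batch host, `LOCAL_SCRATCH` should point at your per-job /scratch directory:

```shell
#!/bin/bash
# Illustrative stage-in / compute / stage-out pattern.  All I/O during
# the compute step is local; the remote file system sees exactly one
# read (the input copy) and one write (the output copy).

stage_and_run() {
    local input="$1" dest="$2"
    local work
    work=$(mktemp -d "${LOCAL_SCRATCH:-/tmp}/stage.XXXXXX")

    cp "$input" "$work/"                                 # stage input to local scratch
    ( cd "$work" && wc -l <"$(basename "$input")" >output.txt )   # stand-in analysis
    cp "$work/output.txt" "$dest"                        # single write to remote FS
    rm -rf "$work"                                       # clean up local scratch
}
```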

...

  1. xroot is the repository for Fermi on-orbit data and Monte Carlo data. It is readable by anyone, but typically not writable except by pipeline accounts.
  2. NFS refers to the collection of servers dedicated for Fermi use. Typically one server (machine) has multiple disks attached, so that stressing a server can cause a problem for multiple groups of users.
  3. AFS is the filesystem used for user $HOME directories and a relatively small collection of Fermi group areas. It is a world-accessible file system (given proper access credentials) and caches files on local machines.

 

Local Scratch Space

Local scratch directories are available on all SLAC linux machines. They vary in size from a few GB to several hundred GB. This space is shared by all users logged into a given host.  On batch machines it is vitally important to clean up this space at the end of your job, or it will fill up over time (and this has happened).  Common practice is to create a directory named after your username and put all files there.  Note that if you use the Fermi pipeline to manage your job flow, you will need to devise a 'stream-dependent' method of naming your scratch sub-directories to prevent jobs running on the same host from overwriting each other's files.
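One hypothetical naming scheme: combine your user id, a stream identifier exported by the pipeline (here assumed to be called `PIPELINE_STREAM`), and the LSF job ID, so two streams landing on the same host cannot collide:

```shell
#!/bin/bash
# Hypothetical scratch-naming scheme for pipeline jobs.  PIPELINE_STREAM
# is a placeholder for whatever stream identifier your pipeline task
# exports; LSB_JOBID is set by LSF (the PID fallback is for testing only).
SCRATCH_DIR="/scratch/${USER:-$(id -un)}/${PIPELINE_STREAM:-stream0}-${LSB_JOBID:-$$}"
echo "$SCRATCH_DIR"
```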

...

Finally, note that all linux machines have a /tmp disk partition.  It is strongly recommended that /tmp NOT be used, because of the danger of its filling up, which can cause the machine to crash.


Monitoring remote file servers

First, one must identify the server holding all of the job's needed input and future output files.
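One quick way to do this from the command line (illustrative; the `server_of` helper is our own name) is to ask `df` which filesystem a path lives on. For NFS mounts the first column of `df -P` output has the form server:/export; for AFS paths, the `fs whereis` command reports the hosting file server instead:

```shell
#!/bin/bash
# Print the device or server:/export backing a given path.  For NFS
# mounts the first df column names the server; for AFS use "fs whereis".
server_of() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

server_of /nfs/farm/g/glast    # e.g. somenfsserver:/export/glast
```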

...

The Ganglia pages display strip charts of CPU usage and I/O rates as a function of time.  Here is a historical image of the first few plots from the Ganglia page of wain025, the Fermi user/group space disk server.  The third plot shows CPU time.  User jobs should avoid driving the "System CPU" above about 50%: greater loads begin to cause file-access delays, then timeouts, and eventually the machine will hang or crash.  Now take a look at wain025's performance at this very moment.  In addition to the top five plots (as shown in the historical image), you can scroll down to see activity on each user and group partition on the server.  Near the very bottom, keep an eye out for the "nfs_server_badcalls" plot; experience has shown that any entries at all in this plot are a very bad sign.

What to be alert for...
  • CPU utilization > 50%  (especially "System CPU")
  • NFS disk I/O > 30 MB/s
  • AFS disk I/O > 5-10 MB/s
  • xroot disk I/O >> 200 MB/s (wains only)

...

Summary

  • Store analysis code and scripts in your AFS home directories (which are backed up)
  • Assessment.  For every new task, assess its impact on key servers to ensure they will not be overloaded
  • File staging.  Files that remain open for the duration of the job (either reading or writing) should be located in local scratch space.  Copy needed input files to local scratch at the beginning of your job; write output data products to their final destinations at the end of the job.
  • Submitting jobs.  
    • Never submit a large number (~>50) of jobs without first assessing their impact on key shared resources.
    • If your jobs are known to produce a large I/O load only during the start-up phase, then submit jobs in small batches, wait for those to run and pass the start-up phase and only then submit another small batch, etc.
    • If you are planning a large batch operation of, say, more than 50 simultaneous jobs, please inform and coordinate with SAS management (Richard Dubois).
  • PFILES. Arrange that the parameter files for ScienceTools, FTools, etc. be stored in a directory unique to the batch job.
  • Core dumps.  Completely disable core dumps.
  • Cleanup. Be sure to perform a cleanup on the local scratch space after your jobs have completed!