...

Shared Resource            | Use                                          | Notes
batch machines             | LSF general queues                           | fell, hequ, kiso, dole, bullet
interactive login machines | light load, short-running, development/test  | rhel6-64, rhel5-64
NFS disks                  | site-wide, medium performance                | Fermi group and user space; in general, NOT backed up
AFS disks                  | global, medium performance                   | $HOME directories for all users; these areas are backed up
xroot disks                | site-wide, high performance                  | Fermi storage for bulk data (read by everyone; write restricted to the pipeline and science groups)

network facilities         | switches & fabric                            |

...

Tip: Start faster

Please also see this page to learn how to get your batch jobs to start running sooner.

...

operation                                          | local scratch | xroot | NFS | AFS | Notes
writing large files (>100MB)                       | yes           | yes   | no  | no  |
reading large files (>100MB)                       | yes           | yes   | yes | no  |
writing small files (<100MB)                       | yes           | yes   | yes | yes | okay only in small numbers (NFS/AFS)
reading small files (<100MB)                       | yes           | yes   | yes | no  |
copying whole files (local<->remote)               | yes           | yes   | yes | yes | typically at job start and/or end
frequently creating/opening/closing/deleting files | yes           | no    | no  | no  | best to avoid this completely
frequently stat'ing files                          | yes           | no    | no  | no  |
multiple jobs writing to the same file             | no            | no    | no  | no  | don't do this!
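
When a job needs a bulk data file that lives on xroot, the usual pattern implied by the table is to copy it to local scratch once at job start rather than reading it remotely throughout the job. A minimal sketch follows; the redirector host, file path, and scratch location are placeholders, not the actual Fermi xrootd endpoints:

Code Block
# Stage a bulk data file from xroot into local scratch at job start (sketch).
xrdcp root://some-xrootd-redirector.slac.stanford.edu//glast/path/to/file.root /scratch/$USER/
# ... the job then reads the local copy instead of the remote file ...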

...

Finally, keep track of how much space is available on the directories/partitions you write to.  Writing to a 100% full partition is known to cause a lot of stress on the file server.  It is easy to check the available space: cd to the directory of interest and issue the "df -h ." command, which reports the size and remaining space of that partition.  (Less frequently, one may hit a different limit: inodes.  Check your inode usage with the "df -hi ." command; this quota counts the total of files + directories + symlinks.)
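
For example (the directory below is just a placeholder for whatever output area your jobs write to):

Code Block
# Check both space and inode headroom before launching jobs that write here.
cd /nfs/farm/g/glast/u55/yourname    # placeholder output directory
df -h .                              # size, used, and available space on this partition
df -hi .                             # inode usage: files + directories + symlinks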

Local Scratch Space

Local scratch directories are available on all SLAC linux machines. They vary in size from a few GB to several hundred GB. This space is shared by all users logged into a given host.  On batch machines, it is vitally important to clean up this space at the end of your job, or it will fill up over time (and this has happened).  Common practice is to create a directory named after your username and put all your files in there.  Note that if you use the Fermi pipeline to manage your job flow, you will need a 'stream-dependent' scheme for naming your scratch sub-directories so that jobs running on the same host do not overwrite each other's files.
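
A minimal sketch of this pattern follows; the /scratch mount point and the use of the LSF job ID as the unique, stream-dependent suffix are assumptions, so adapt them to your batch setup:

Code Block
# Create a per-user, per-job scratch directory and remove it when the job exits.
SCRATCH=/scratch/$USER/job_${LSB_JOBID:-$$}   # LSB_JOBID is set by LSF; fall back to the shell PID
mkdir -p "$SCRATCH"
trap 'rm -rf "$SCRATCH"' EXIT                 # cleanup runs even if the job fails
cd "$SCRATCH"
# ... copy inputs here, run the job, copy outputs to their final destination ...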

...

  1. 'cd' to the location holding the existing or future file
  2. 'df .'  this will give you one of the following answers (a combined check is sketched after this list):
    1. AFS

      Code Block
      Filesystem     1K-blocks  Used Available Use% Mounted on
      AFS              9000000     0   9000000   0% /afs

      This tells you the directory lives in AFS.  To find the actual server, follow up with the command 'fs whereis .', e.g.,

      Code Block
      File . is on host afs03.slac.stanford.edu 

      The server is afs03

    2. old NFS (single wain-class server)

      Code Block
      Filesystem           1K-blocks    Used Available Use% Mounted on
      wain025:/g.glast.u55/dragon
                            10485760 3731456   6754304  36% /nfs/farm/g/glast/u55/dragon

      This is an NFS location and the server is wain025

    3. new NFS (GPFS as the underlying file system with clustered NFS 'heads', as of May 2015)

      Code Block (GPFS)
      Filesystem            Size  Used Avail Use% Mounted on
      fermi-cnfslb1:/gpfs/slac/fermi/fs2/u
                             10G  3.6G  6.5G  36% /nfs/farm/g/glast/u

      This is also an NFS location, but now one must monitor four different servers: fermi-gpfs03, fermi-gpfs04, fermi-cnfs01, fermi-cnfs02

    4. local

      Code Block
      Filesystem     1K-blocks    Used Available Use% Mounted on
      /dev/sda1       18145092 8530464   8692904  50% /

      This is a local disk; it will not be visible to any batch jobs
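
Putting these steps together, a quick check might look like the following sketch (the directory is a placeholder, and the 'fs' command is only available where the OpenAFS client is installed):

Code Block
# Identify which kind of file system (and which server) backs a directory.
cd /nfs/farm/g/glast/u55/yourname    # placeholder: the directory of interest
df -h .                              # the Filesystem column names the NFS server, a local device, or AFS
if df . | grep -q '^AFS'; then
    fs whereis .                     # AFS only: report the actual AFS file server
fi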

...

  • Store analysis code and scripts in your AFS home directories (which are backed up)
  • Assessment.  For every new task, assess its impact on key servers to ensure they will not be overloaded
  • File staging.  Files that remain open for the duration of the job (either reading or writing) should be located in local scratch space.  Copy needed input files to local scratch at the beginning of your job; write output data products to their final destinations at the end of the job.
  • Submitting jobs.  
    • Never submit a large number (~>50) of jobs without first assessing their impact on key shared resources.
    • If your jobs are known to produce a large I/O load only during the start-up phase, then submit jobs in small batches: wait for those to run and pass the start-up phase, and only then submit another small batch, and so on.
    • If you are planning a large batch operation of, say, more than 50 simultaneous jobs, please inform and coordinate with SAS management (Richard Dubois).
  • PFILES. Arrange for the parameter files for the ScienceTools, FTools, etc. to be stored in a directory unique to the batch job (see the sketch after this list).
  • Avoid Disk Thrashing
    • Completely disable core dumps
    • Avoid unnecessary file open() and close() operations, as well as file creates/deletes.
    • Avoid writing to a full disk partition.
  • Cleanup. Be sure to clean up the local scratch space after your jobs have completed!
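
Pulling the points above together, a well-behaved batch job might be structured as in the sketch below. All paths are placeholders, and the use of $HEADAS/syspfiles in PFILES assumes a standard HEASoft/FTools setup; adjust both to your own environment.

Code Block
ulimit -c 0                                        # completely disable core dumps
SCRATCH=/scratch/$USER/job_${LSB_JOBID:-$$}        # per-user, per-job local scratch area
mkdir -p "$SCRATCH/pfiles"
trap 'rm -rf "$SCRATCH"' EXIT                      # always clean up local scratch on exit
export PFILES="$SCRATCH/pfiles;$HEADAS/syspfiles"  # parameter files in a job-unique directory
cp /nfs/farm/g/glast/u55/yourname/input.fits "$SCRATCH/"   # stage inputs in once, at job start
cd "$SCRATCH"
# ... run the ScienceTools/FTools analysis against the local copies ...
cp results.fits /nfs/farm/g/glast/u55/yourname/output/     # stage outputs out once, at job end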