Overview

Ideally, job management would be a system living in ASE that "abstracted out" the concept of determining whether a job is:

  • running
  • completed
  • exited with error

It would ideally also abstract out the directory structure a user wants for his/her jobs. Since that would be a significant amount of work, here we just summarize what has been done, in the hope that other people can re-use and perhaps improve it over time.

Scripts described below for job management can be found in /afs/slac/g/suncat/share/scripts. Less user-friendly scripts can be found in the "devel" subdirectory. Anyone in the g-suncat afs group can put scripts in these directories (type "pts mem g-suncat" to see if you're in the list ... if not, email cpo). If you don't want someone to modify scripts that you have put there, remember to "chmod -w" the file.

Andy Peterson Job Management System

Andy has two reusable scripts in the above directory: "running" and "rundirs". Change the username within each python script to yours. Both have help enabled (e.g., "running -h"). These are for dealing with jobs in the queue.

He also has scripts for determining whether jobs are done. These are not as "clean" – that is, they won't just work for you the way the two scripts above should – so they live in the "devel" subdirectory. The one he uses for the hundreds of alloys he screens is called "checkdone.py", and he organizes its output with "analyze". Both have the "-h" option enabled; they give an idea of how he approached the problem, but they probably won't work directly for you.

More Info from Andy on Job Submission

The basic python tool I use in submitting large numbers of jobs is:

from string import Template

With that, you first read in a template file with ${keyword} in all the places you want to make substitutions, and then use the Template class to make the substitutions. The ${keyword} can be as simple as a lattice constant or as complex as several lines of (ASE) code; a minimal sketch appears at the end of this section. Within the directory you looked at (20110110_alloys), take a look at the script:

makescripts-OH.py

That has the procedure for all of my runs with OH on something like 1400 surfaces. This script is obviously a bit complicated. I copied a much simpler implementation of the same concept into the directory below. Check it out and let me know if it makes sense to you:

/a/suncatfs1/u1/aap/temp/adam-example
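For orientation, here is a minimal, self-contained sketch of the Template approach (the keywords, file names, and values are made up for illustration; this is not Andy's actual script):

from string import Template

# A toy template; in practice this would be read from a file, e.g.:
#   template_text = open('template.py').read()
template_text = """a = ${latticeconstant}
metal = '${metal}'
"""

# Substitute a value for every ${keyword} in the template.
script = Template(template_text).substitute(latticeconstant='3.92',
                                            metal='Pt')

# Write the filled-in script; it could then be submitted to the queue.
open('run-Pt.py', 'w').write(script)

Note that Template.substitute() raises a KeyError if any ${keyword} is left without a value; safe_substitute() would leave unmatched keywords in place instead.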

Some Utilities from AJ Medford

Just in case anyone is interested, I have written a few more simple commands for managing large numbers of jobs. They aren't that pretty, but they are functional and I have found them useful. The commands can be found in my development folder (/afs/slac.stanford.edu/g/suncat/vol3/scripts/devel/ajmedfor) and a brief summary is:

parseErr: a tremendously simple script which parses error output from jobs and removes duplicate lines. This makes it much easier to read error files written by multiple cores. The argument should be an error file.
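As a rough illustration of the behavior described above (a guess, not AJ's actual code), printing each line of an error file only once, in order of first appearance, takes just a few lines of python:

#!/usr/bin/env python
# Hypothetical sketch: deduplicate lines of an error file, preserving
# the order in which lines first appear.
import sys

seen = set()
for line in open(sys.argv[1]):
    if line not in seen:
        seen.add(line)
        sys.stdout.write(line)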

jobInfo: lists all jobs along with their status, run time, and submission directory. With no argument it shows only the name of the submission directory; if you pass any argument it shows the full absolute path of the submission directory. It is significantly slower than bjobs, but gives a lot more information. You can probably figure out how to hack it into doing your bidding if you want something more detailed.

jobDir: this command takes a job's ID as an argument and returns the directory of that job. If you couple it with an alias like: alias bcd 'cd `jobDir !*`' then you can automatically change to a job's directory by typing "bcd jobID".
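A minimal sketch of how such a command might work (assuming LSF's "bjobs -l" output contains a "CWD <...>" field; this is not necessarily how jobDir is actually implemented):

#!/usr/bin/env python
# Hypothetical sketch: print the working directory of an LSF job.
# Assumes `bjobs -l <jobid>` output includes a field like "CWD </some/path>".
import re
import subprocess
import sys

out = subprocess.Popen(['bjobs', '-l', sys.argv[1]],
                       stdout=subprocess.PIPE).communicate()[0]
# bjobs -l wraps long lines; join the indented continuation lines
# before searching for the CWD field.
text = re.sub(r'\n\s+', '', out.decode())
match = re.search(r'CWD <([^>]+)>', text)
if match:
    print(match.group(1))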

resub: takes a job ID as the first argument. If nothing else is supplied, it moves to that job's directory and resubmits the job using the gpaw-bsub command and the original name of the submission script. You can optionally supply a different submission command as the second argument (e.g. 'gpaw-ver-bsub 21'), and a different file name as the third argument. Say you originally submitted 'run_k441.py' but this time you want to submit using 'run_k881.py'; then you could do: resub jobID gpaw-bsub run_k881.py
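A rough sketch of the idea (gpaw-bsub is the site submission command; the jobDir helper and the way the original script name is recovered are assumptions, not a description of AJ's code):

#!/usr/bin/env python
# Hypothetical sketch of a resub-style command.
import os
import subprocess
import sys

jobid = sys.argv[1]
submit_cmd = sys.argv[2] if len(sys.argv) > 2 else 'gpaw-bsub'
# Assume a jobDir-like helper that prints the job's submission directory.
jobdir = subprocess.check_output(['jobDir', jobid]).decode().strip()
os.chdir(jobdir)
# Assume the original submission script name is either passed as the
# third argument or recovered some other way (e.g. recorded at submit time).
script = sys.argv[3] if len(sys.argv) > 3 else 'run.py'
subprocess.call(submit_cmd.split() + [script])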

massCommand: this allows you to issue a command to a large number of jobs. The first argument is required and is the command to issue; it should take a job ID as its argument (bkill, btop, bbot, etc). If no other argument is passed, the command is issued to all jobs. The second argument is a "filter word": if it is supplied without any additional arguments, the filter word must be contained in the absolute path of the submission directory, or else the command will not be issued to that job. The optional -f flag gives the "field" to check for the filter word in. It defaults to filepath, but the other options are submissiondir (the directory the job was submitted from), filename (the name of the submission script), command (the actual submission command, pam -g...), runtime (the time in hours that the job has been running), id (job ID), and status (RUN/PEND). The optional -c flag gives the "condition": the operator used to compare the filter word and the field. The default is "in", but you can supply anything which python would understand. A few examples (some untested); a sketch of the filtering logic follows them:

massCommand bkill #kill all jobs
massCommand bbot Ir #move all jobs with Ir in their absolute path to the bottom of the queue
massCommand btop COOH -f filepath #move all jobs with 'COOH' in the filepath to the top of the queue
massCommand bkill CH2 -f submissiondir -c 'not in' #kill all jobs withOUT 'CH2' in the submission directory
massCommand resub 49 -f runtime -c '<' #resubmit all jobs which have been running for more than 49 hours (using the custom resub command)
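A minimal sketch of the filtering logic described above (a guess at the mechanism, not AJ's code; the job-listing helper and field names are assumptions):

#!/usr/bin/env python
# Hypothetical sketch of massCommand-style filtering.
import subprocess

def list_jobs():
    # Assumed helper: a real implementation would parse bjobs output into
    # one dict per job, with keys matching the fields described above
    # ('id', 'status', 'filepath', 'runtime', ...).
    return []

def mass_command(command, word=None, field='filepath', condition='in'):
    for job in list_jobs():
        value = job[field]
        # Build and evaluate e.g. "'Ir' in '/path/with/Ir'"; anything
        # python understands ('in', 'not in', '<', ...) works as the
        # condition. Numeric fields such as runtime would need word and
        # value converted to numbers before comparing.
        if word is None or eval(repr(word) + ' ' + condition + ' ' + repr(value)):
            subprocess.call([command, str(job['id'])])

# e.g. mass_command('bbot', word='Ir')   # like: massCommand bbot Ir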

Other Thoughts

  • Heine has also said Jacapo has something related to job management. I took a quick look but didn't spot it.