Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The Batch Submission System consists of two applications. The lsf Daemon and the lsf callback application. The daemon is setup to run on fermilnx-v03 (a virtual machine) via trscron (trscron is used instead of regular cron because the LSF processes need AFS tokens and normal cron doesn't provide them). The daemon runs for an hour and is then retriggered.

It checks the database for new jobs. Jobs are submitted to lsf LSF in suspend mode. For each job, there are three lsf LSF tasks submitted. The first tasks task is the actual command to be executed. The second task is a call to the lsf callback application with the condition that it be executed as soon as the first task starts execution. The third task is a call to the lsf callback application with the condition that it be executed as soon as the first task finishes execution. All lsf LSF tasks are submitted and initially put into suspend mode. If all three tasks are submitted successfully, the jobs are changed from suspend to pending mode so they actually start execution. If at any time anything goes wrong, the lsf LSF tasks are killed and the run in the table is marked as failed. The job is put back into the queue to be attempted at a later time as a new run.

The lsf callback application is called once when the lsf LSF task starts and once when it finishes. When started, it updates the run table with the information about the start time. Additionally, it checkes the callback table to determine if the higher layers wish to be notified when the job has started. If so, it calls the callback command to notify the higher levels. When called for a task that has finished. The callback system checks the lsf LSF output, parses some of the information, like the return code, and stores that information in the run, job, and output tables. Additionally, it checks for any callback information for the higher layers.

...