...

To use this system, choose an experiment from https://pswww.slac.stanford.edu/lgbk/lgbk/experiments.

1.1)

...

Workflow Definitions

Under the Workflow dropdown, select Definitions. The Workflow Definitions tab is where scripts are defined; each script is registered with a unique name, and any number of scripts can be registered. To register a new script, click the + button in the top right-hand corner.

...

These are the parameters that will be passed to the executable. Any number of name/value pairs can be associated with a job definition; these are made available to the executable as environment variables. For example, in the example above, a bash script can obtain the value of the parameter LOOP_COUNT (100) using ${LOOP_COUNT}. In addition, the following environment variables are made available to the script:

  • EXPERIMENT - The name of the experiment; in the example shown above, diadaq13.
  • RUN_NUM - The run number for the job.
  • JID_UPDATE_COUNTERS - This is a URL that can be used to update the progress of the job. These updates are also automatically reflected in the UI. 
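
As a sketch of how these variables might be consumed, a hypothetical Python job script could read them from the environment as follows. The variable names come from the list above; the fallback defaults are only for running the script by hand outside the workflow system.

```python
import os

# Standard variables provided by the workflow system (see the list above).
# The defaults below are only fallbacks for manual runs outside the system.
experiment = os.environ.get("EXPERIMENT", "unknown")
run_num = int(os.environ.get("RUN_NUM", "0"))

# URL for posting progress updates; unset when run outside the system.
update_url = os.environ.get("JID_UPDATE_COUNTERS")

print(f"Processing run {run_num} of experiment {experiment}")
```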

1.1.4) Location

This defines where the analysis is done. While many experiments prefer to use the psana cluster at SLAC, some prefer to perform their analysis at HPC facilities like NERSC.

...

Use the edit/trash icons to edit or delete a job definition.

1.2)

...

Workflow Control

Under the Workflow dropdown, select Batch Processing, then select the Control tab to create and check the status of your batch analysis jobs. The Control tab is where job definitions defined in the Definitions tab may be applied to experiment runs. An entry is created automatically for jobs that are triggered automatically. To manually trigger a job, select a job definition from the drop-down menu in the Job column.

A job can be triggered any number of times; each execution has a separate row in the UI.

Once a job definition is applied to a run, it will appear as shown above. In this case, the example job has finished, as shown by the DONE status (other statuses are described below). The last two columns also warrant some explanation.

1.2.1) Status

These are the different statuses that a job can have:

  • START - Pending submission to the HPC workload management infrastructure.
  • SUBMITTED - The job has been submitted to the HPC workload management infrastructure. A job may stay in the SUBMITTED state for some time, depending on how busy the queues are.
  • RUNNING - The job is currently running. One can get the job details and the log file; the job can also be killed.
  • EXITED - The job has finished unsuccessfully. The log files may have some additional information.
  • DONE - The job has finished successfully. The job details and log files may be available, though most HPC workload management systems delete this information after some time.

1.2.2) Actions

There are four actions that can be applied to a job execution:

  • Kill - Attempts to kill the job via the appropriate kill command in the HPC workload management infrastructure (e.g., bkill). A green success message will appear near the top right of the page if the job is killed successfully, and a red failure message will appear if it is not.
  • Log file - Returns the log file for the job. If no log file can be found, it returns blank.
  • Job details - Returns details for the current job by invoking the appropriate job details command in the HPC workload management infrastructure.
  • Delete - Deletes the job execution from the run. Note: this does not kill the job; it only removes it from the webpage.

1.2.3) Report

This is a customizable column that can be used by the script executable to report progress. The executable reports progress by posting JSON to a URL that is available as the environment variable JID_UPDATE_COUNTERS. The counters shown in the screenshot above were posted in a for loop in a Python script using the following syntax:

Code Block
{'counters' : {'Example Counter' : [i + 1, 'red'], 'Random Char' : rand_char}}
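
For illustration, a hypothetical version of that loop might look as follows. The payload format matches the snippet above; `requests` (a third-party library, used elsewhere in this document) is only imported when the JID_UPDATE_COUNTERS URL is actually set.

```python
import os
import random
import string

def make_payload(i, rand_char):
    # Counter payload in the format shown above: a counter value plus a
    # display color, and a free-form single-character counter.
    return {'counters': {'Example Counter': [i + 1, 'red'],
                         'Random Char': rand_char}}

url = os.environ.get("JID_UPDATE_COUNTERS")
for i in range(100):
    payload = make_payload(i, random.choice(string.ascii_uppercase))
    if url:
        import requests  # third-party; only needed under the workflow system
        requests.post(url, json=payload)
```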

...

COUNTERS.

For example, to update the status of the job using bash, one can use

Code Block
languagebash
curl -s -XPOST ${JID_UPDATE_COUNTERS} -H "Content-Type: application/json" -d '[ {"key": "LoopCount", "value": "'"${i}"'" } ]'

In Python, one can use 

Code Block
languagepy
import os
import requests
requests.post(os.environ["JID_UPDATE_COUNTERS"], json=[ {"key": "LoopCount", "value": "75" } ])
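
Putting the pieces together, a hypothetical loop that posts one key/value update per iteration (in the same format as the examples above) might look like this; the work done per iteration is a placeholder.

```python
import os

def loop_count_update(i):
    # One update in the key/value format accepted by JID_UPDATE_COUNTERS.
    return [{"key": "LoopCount", "value": str(i)}]

url = os.environ.get("JID_UPDATE_COUNTERS")
for i in range(1, 101):
    # ... one unit of analysis work would go here ...
    if url:
        import requests  # third-party; only needed under the workflow system
        requests.post(url, json=loop_count_update(i))
```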

2) Job Definition Script

The following example scripts live at /reg/g/psdm/web/ws/test/apps/release/logbk_batch_client/test/submit.sh and /reg/g/psdm/web/ws/test/apps/release/logbk_batch_client/test/submit.py.

...