The Automatic Run Processor (or ARP, for short, because I want that to catch on) is a web service that allows for automatic workflows and for the easier submission of batch jobs via a web interface. A script that submits the batch job (to allow for more customization in this command) is all that is needed for this system to work.
To use this system, choose an experiment from https://pswww.slac.stanford.edu/lgbk/lgbk/experiments.
Under the Workflow dropdown, select Definitions. The Workflow Definitions tab is the location where scripts are defined; scripts are registered with a unique name. Any number of scripts can be registered. To register a new script, please click on the + button on the top right hand corner.
...
A unique name given to a registered script; the same script can be registered under different names with different parameters.
The absolute path to the batch script. An example can be seen here. This script must contain the batch job submission command (sbatch for SLURM). It gives the user the ability to customize the batch submission. Overall, it can act as a wrapper for the code that will do the analysis on the data along with submitting the job.
The parameters that will be passed to the executable as command line arguments. These parameters can be used as parameters to the sbatch command to specify the queue, number of cores, etc., or they can be used to customize the script execution. In addition, details of the batch job are made available as environment variables.
ARP_UPDATE_COUNTERS - This is a URL that can be used to update the progress of the job. These updates are also automatically reflected in the UI. In previous releases, this was called JID_UPDATE_COUNTERS.
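As a minimal sketch of how a script might pick up both kinds of input, the hypothetical helper below (`read_job_context` is not part of the ARP) gathers the command line parameters together with the EXPERIMENT and RUN_NUM environment variables. It reads the counters URL from JID_UPDATE_COUNTERS, the variable name used in the examples on this page; that choice is an assumption.

```python
import os
import sys
from typing import Dict, List


def read_job_context(argv: List[str], env: Dict[str, str]) -> Dict[str, object]:
    """Collect what a script receives: parameters plus job details."""
    return {
        # Parameters from the workflow definition arrive as command line arguments.
        "params": list(argv[1:]),
        # Details of the batch job arrive as environment variables.
        "experiment": env.get("EXPERIMENT"),
        "run_num": env.get("RUN_NUM"),
        # Assumption: the counters URL is read under the name used in this page's examples.
        "update_url": env.get("JID_UPDATE_COUNTERS"),
    }


if __name__ == "__main__":
    context = read_job_context(sys.argv, dict(os.environ))
    for name, value in context.items():
        print(f"{name}={value}")
```

Passing `argv` and `env` in explicitly, rather than reading the globals inside the function, makes the helper easy to try outside of a batch job.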
This defines where the analysis is done. While many experiments prefer to use the SLAC psana cluster (SLAC) or the SRCF (SRCF_FFB) to perform their analysis, others prefer to use HPC facilities like NERSC.
This defines the event in the data management system that kicks off the job submission.
If the job is automatically triggered, it will be executed as this user. If the job is manually triggered, it will be executed as the user triggering the job manually. This is set when creating the job definition and cannot be changed.
Under the Workflow dropdown, select Control to create and check the status of your analysis jobs. The Control tab is where job definitions defined in the Definitions tab may be applied to experiment runs. An entry is automatically created for jobs that are triggered automatically. To manually trigger a job, in the drop-down menu of the Job column, select the job. A job can be triggered any number of times; each execution has a separate row in the UI.
These are the different statuses that a job can have -
There are four different actions which can be applied to a script. They do the following if pressed:
...
- Delete the job execution from the run. Note: this does not kill the job, it only removes it from the webpage.
This is a customizable column which can be used by the script executable to report progress. The script executable reports progress by posting JSON to a URL that is available as the environment variable JID_UPDATE_COUNTERS.
For example, to update the status of the job using bash, one can use
```bash
curl -s -XPOST "${JID_UPDATE_COUNTERS}" -H "Content-Type: application/json" -d '[ {"key": "<b>LoopCount</b>", "value": "'"${i}"'" } ]'
```
...
```python
import os
import requests

requests.post(os.environ["JID_UPDATE_COUNTERS"], json=[ {"key": "<b>LoopCount</b>", "value": "75" } ])
```
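When a script reports several counters, it can help to keep the payload construction separate from the HTTP call. The sketch below is one possible arrangement, not part of the ARP: `build_counters` and `post_counters` are hypothetical helper names, the bold `<b>` markup simply mirrors the examples above, and the standard library's `urllib` is used instead of `requests` only to keep the sketch dependency-free.

```python
import json
import urllib.request
from typing import Dict, List


def build_counters(progress: Dict[str, object]) -> List[Dict[str, str]]:
    """Turn {"name": value} pairs into the list of key/value objects shown above."""
    return [{"key": f"<b>{name}</b>", "value": str(value)} for name, value in progress.items()]


def post_counters(update_url: str, progress: Dict[str, object], timeout: float = 10.0) -> int:
    """POST the counters as JSON and return the HTTP status code."""
    body = json.dumps(build_counters(progress)).encode("utf-8")
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Inside a job, this could be called as `post_counters(os.environ["JID_UPDATE_COUNTERS"], {"LoopCount": 75})`.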
...
The executable script used in the workflow definition should be used primarily to set up the environment and submit the analysis script to the HPC workload management infrastructure. For example, a simple executable script that uses SLURM's sbatch to submit the analysis script is available here - /reg/g/psdm/tutorials/batchprocessing/arp_submit.sh
```bash
#!/bin/bash
source /reg/g/psdm/etc/psconda.sh
ABS_PATH=/reg/g/psdm/tutorials/batchprocessing
sbatch --nodes=2 --partition=psanaq --time=5 --output="arp_example_${RUN_NUM}_%j.log" $ABS_PATH/arp_actual.py "$@"
```
This script will submit /reg/g/psdm/tutorials/batchprocessing/arp_actual.py to SLURM. arp_actual.py will be passed the parameters as command line arguments and will inherit the EXPERIMENT, RUN_NUM and JID_UPDATE_COUNTERS environment variables.
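For readers who prefer to see the submission spelled out piece by piece, the following sketch assembles the same sbatch invocation as an argument list. It is illustrative only: `build_sbatch_command` is a hypothetical helper, the defaults copy the flags from the submit script above, and the command is printed rather than executed.

```python
import os
import shlex
import sys
from typing import List


def build_sbatch_command(script: str, params: List[str], run_num: str,
                         partition: str = "psanaq", nodes: int = 2,
                         minutes: int = 5) -> List[str]:
    """Assemble the sbatch argument list used to submit the analysis script."""
    return [
        "sbatch",
        f"--nodes={nodes}",
        f"--partition={partition}",
        f"--time={minutes}",
        # %j is expanded by SLURM to the job id.
        f"--output=arp_example_{run_num}_%j.log",
        script,
        # Workflow definition parameters are forwarded unchanged, like "$@" in bash.
        *params,
    ]


if __name__ == "__main__":
    command = build_sbatch_command(
        "/reg/g/psdm/tutorials/batchprocessing/arp_actual.py",
        sys.argv[1:],
        os.environ.get("RUN_NUM", "0"),
    )
    print(shlex.join(command))
```

Building the command as a list avoids quoting problems when a parameter contains spaces.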
Log files:
→ If the --output parameter is not specified to sbatch, then SLURM will store the log output in /reg/d/psdm/dia/diadaq13/scratch/<slurm_job_id>.out
→ In the example above, the log output will be sent to the default working folder for the job, which is the scratch folder, but the file name will be generated using the run number and the job id. For example, the log file for run 25, job id 409327, will be sent to /reg/d/psdm/dia/diadaq13/scratch/arp_example_25_409327.log
→ To avoid cluttering the scratch folder, one can use an absolute path in the --output option to specify an alternate location for the job log files. See the "filename pattern" section in the sbatch man page for more details.
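To locate a log file after the fact, the naming rule above can be reproduced in a few lines. This is a sketch under stated assumptions: `expected_log_path` is a hypothetical helper, and it expands only the `%j` (job id) pattern, not the other filename patterns that sbatch supports.

```python
import posixpath


def expected_log_path(work_dir: str, run_num: int, job_id: int,
                      pattern: str = "arp_example_{run_num}_%j.log") -> str:
    """Predict where SLURM writes the log for a given run and job id."""
    # Fill in the run number first, then expand %j the way sbatch would.
    name = pattern.format(run_num=run_num).replace("%j", str(job_id))
    # A relative name lands in the job's working folder; an absolute one is used as-is.
    return posixpath.join(work_dir, name)


print(expected_log_path("/reg/d/psdm/dia/diadaq13/scratch", 25, 409327))
```

With an absolute pattern, the working folder is ignored, which matches the behaviour described for keeping logs out of the scratch folder.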
...
This Python script is the code that will do the analysis and whatever else is necessary on the run data. Since this is just an example, the Python script, arp_actual.py, doesn't get that involved. It is shown below.
```python
#!/usr/bin/env python

import os
import sys
import requests
import time
import datetime
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

logger.debug("In the arp_actual script - current time is %s", datetime.datetime.now().strftime("%c"))

# Log the full environment so the inherited job details are visible in the log file
for k, v in sorted(os.environ.items()):
    logger.debug("%s=%s", k, v)

## Fetch the URL to post progress updates
update_url = os.environ.get('JID_UPDATE_COUNTERS')
logger.debug("The URL to post updates is %s", update_url)

## These are the parameters that are passed in
logger.debug("The parameters passed into the script are %s", " ".join(sys.argv))

# The first parameter, if present and numeric, overrides the default loop count
loop_count = 20
try:
    loop_count = int(sys.argv[1])
except (IndexError, ValueError):
    pass

## Run a loop, sleep a second, then POST
for i in range(loop_count):
    time.sleep(1)
    logger.debug("Posting for step %s", i)
    requests.post(update_url, json=[
        {"key": "<strong>Counter</strong>", "value": "<span style='color: red'>{0}</span>".format(i+1)},
        {"key": "<strong>Current time</strong>", "value": "<span style='color: blue'>{0}</span>".format(datetime.datetime.now().strftime("%c"))}])

logger.debug("Done with job execution")
```
The logger.debug statements are sent to the job's log file. Note that one can form sbatch commands where the log output is not sent to a log file and is instead sent as an email. Part of an example log file output is shown below.
```
DEBUG:__main__:In the arp_actual script - current time is Thu Apr 16 11:12:40 2020
...
DEBUG:__main__:EXPERIMENT=diadaq13
...
DEBUG:__main__:JID_UPDATE_COUNTERS=https://pswww.slac.stanford.edu/ws/jid_slac/jid/ws/replace_counters/5e98a01143a11e512cb7c8ca
...
DEBUG:__main__:RUN_NUM=26
...
DEBUG:__main__:The parameters passed into the script are /reg/g/psdm/tutorials/batchprocessing/arp_actual.py 100
DEBUG:__main__:Posting for step 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): pswww.slac.stanford.edu:443
DEBUG:urllib3.connectionpool:https://pswww.slac.stanford.edu:443 "POST /ws/jid_slac/jid/ws/replace_counters/5e98a01143a11e512cb7c8ca HTTP/1.1" 200 195
DEBUG:__main__:Posting for step 1
...
DEBUG:__main__:Done with job execution
...
```
Is it possible to submit more than one job per run?
...
Can a submitted job submit other subjobs?
...
...
ws-kerb/batch_manager/ws/logbook/batches_status/<experiment_id>.
For example, this gives the batch processing status for experiment id 302