The Automatic Run Processor (or ARP, for short, because I want that to catch on) is a web service that allows for automated workflows and easier submission of batch jobs via a web interface. All that is needed for this system to work is a script that submits the batch job; keeping the submission in a script leaves room to customize the submission command.

Table of Contents

1) Webpage

To use this system, choose an experiment from https://pswww.slac.stanford.edu/lgbk/experiments.

1.1) Workflow Definitions

Under the Workflow dropdown, select Definitions. The Workflow Definitions tab is where scripts are defined; scripts are registered with a unique name. Any number of scripts can be registered. To register a new script, click on the + button in the top right hand corner.


1.1.1) Name

A unique name given to a registered script; the same script can be registered under different names with different parameters.


1.1.2) Executable

The absolute path to the batch script. This script must contain the batch job submission command (sbatch for SLURM), which gives the user the ability to customize the batch submission. Overall, it acts as a wrapper around the code that will do the analysis on the data, setting up the environment and submitting the job (a full example is shown in section 2.1 below).

1.1.3) Parameters

The parameters that will be passed to the executable as command line arguments. These parameters can be used as arguments to the sbatch command to specify the queue, number of cores, etc., or they can be used to customize the script execution. In addition, details of the batch job are made available as environment variables (a short sketch after this list shows how a script can use them):

  • EXPERIMENT - The name of the experiment; for example, diadaq13.
  • RUN_NUM - The run number for the job.
  • ARP_UPDATE_COUNTERS - This is a URL that can be used to update the progress of the job. These updates are also automatically reflected in the UI. In previous releases, this was called JID_UPDATE_COUNTERS.

  • ARP_JOB_ID - The id for this job execution; this is an internal identifier used in API calls to the other data management systems, for example to update counters or in Airflow integrations.
  • ARP_ROOT_JOB_ID - If using Airflow or other workflow engines, this is the identifier of the initial job in the DAG.
  • ARP_LOCATION - The data management location that this job is running at; for example, S3DF or NERSC.
  • ARP_SLURM_ACCOUNT - The SLURM account to be used in sbatch calls (if applicable).
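
For illustration, here is a minimal Python sketch (not part of the ARP itself; the counter name and value are made up) showing how a submitted script can read these variables and post a progress counter back to the ARP. The JSON format matches the curl and Python examples in the Report section below.

Code Block
languagepy
import os
import requests

# Job details provided by the ARP as environment variables
experiment = os.environ.get("EXPERIMENT")
run_num = os.environ.get("RUN_NUM")
# ARP_UPDATE_COUNTERS was called JID_UPDATE_COUNTERS in previous releases
update_url = os.environ.get("ARP_UPDATE_COUNTERS") or os.environ.get("JID_UPDATE_COUNTERS")

# Post a simple key/value counter; it will show up in the Report column of the UI
requests.post(update_url, json=[{"key": "Processing run", "value": "{0} run {1}".format(experiment, run_num)}])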

1.1.4) Location

This defines where the analysis is done. While many experiments use the SLAC psana cluster (SLAC) or the SRCF (SRCF_FFB) to perform their analysis, others prefer to use HPC facilities like NERSC.

1.1.5) Trigger

This defines the event in the data management system that kicks off the job submission.

  • Manually triggered - The user will manually trigger the job in the Workflow Control tab.
  • Start of a run - When the DAQ starts a new run.
  • End of a run - When the DAQ closes a run.
  • First file transfer - When the data movers indicate that the first registered file for a run has been transferred and is available at the job location.
  • All files transferred - When the data movers indicate that all registered files for a run have been transferred and are available at the job location.


1.1.6) As user

If the job is automatically triggered, it will be executed as this user. If the job is manually triggered, it will be executed as the user triggering the job. This is set when creating the job definition and cannot be changed.

1.1.7) Edit/delete job

Use the edit/trash icons to edit or delete a job definition.

1.2) Workflow Control

Under the Workflow dropdown, select Control to create and check the status of your analysis jobs. The Control tab is where job definitions defined in the Definitions tab may be applied to experiment runs. An entry is created automatically for jobs that are triggered automatically. To manually trigger a job, select the job in the drop-down menu of the Job column. A job can be triggered any number of times; each execution has a separate row in the UI.

1.2.1) Status

These are the different statuses that a job can have:

  • START - Pending submission to the HPC workload management infrastructure.
  • SUBMITTED - The job has been submitted to the HPC workload management infrastructure. A job may stay in the SUBMITTED state for some time depending on how busy the queues are.
  • RUNNING - The job is currently running. One can get job details and the log file. The job can also potentially be killed.
  • EXITED - The job has finished unsuccessfully. The log files may have some additional information. 
  • DONE - The job has finished successfully. The job details and log files may be available; most HPC workload management systems delete this information after some time.

1.2.2) Actions

There are four different actions which can be applied to a job execution. They do the following if pressed:

Kill - Attempt to kill the job. A green success message will appear near the top right of the page if the job is killed successfully, and a red failure message will appear if it is not.

Log file - Returns the log file for the job. If there is no log file or no log file could be found, it returns blank.

Job details - Returns details for the current job by invoking the appropriate job details command in the HPC workload management infrastructure.

Delete - Deletes the job execution from the run. Note: this does not kill the job; it only removes it from the webpage.

1.2.3) Report

This is a customizable column which can be used by the script executable to report progress. The script executable reports progress by posting JSON to a URL that is available in the environment variable JID_UPDATE_COUNTERS. The values may contain HTML markup, so the color of the output can also be customized (see the arp_actual.py example below).



For example, to update the status of the job using bash, one can use 


Code Block
languagebash
curl -s -XPOST ${JID_UPDATE_COUNTERS} -H "Content-Type: application/json" -d '[ {"key": "<b>LoopCount</b>", "value": "'"${i}"'" } ]'

In Python, one can use 

Code Block
languagepy
import os
import requests
requests.post(os.environ["JID_UPDATE_COUNTERS"], json=[ {"key": "<b>LoopCount</b>", "value": "75" } ])

2) Examples.

2.1) arp_submit.sh

The executable script used in the workflow definition should be used primarily to set up the environment and submit the analysis script to the HPC workload management infrastructure. For example, a simple executable script that uses SLURM's sbatch to submit the analysis script is available at /reg/g/psdm/tutorials/batchprocessing/arp_submit.sh. This script is shown below.

Code Block
#!/bin/bash

source /reg/g/psdm/etc/psconda.sh
ABS_PATH=/reg/g/psdm/tutorials/batchprocessing
sbatch --nodes=2 --partition=psanaq --time=5 --output="arp_example_${RUN_NUM}_%j.log" $ABS_PATH/arp_actual.py "$@"

This script will submit /reg/g/psdm/tutorials/batchprocessing/arp_actual.py, which will be passed the parameters as command line arguments and will inherit the EXPERIMENT, RUN_NUM and JID_UPDATE_COUNTERS environment variables.
Log files:
→ If the --output parameter is not specified to sbatch, SLURM will store the log output in /reg/d/psdm/dia/diadaq13/scratch/<slurm_job_id>.out
→ In the example above, the log output will be sent to the default working folder for the job (the scratch folder), but the file name will be generated using the run number and the job id. For example, the log file for run 25 with job id 409327 will be sent to /reg/d/psdm/dia/diadaq13/scratch/arp_example_25_409327.log
→ To avoid cluttering the scratch folder, one can use an absolute path in the --output option to specify an alternate location for the job log files. See the "filename pattern" section in the sbatch man page for more details.


2.2) arp_actual.py

This Python script is the code that will do the analysis and whatever is necessary on the run data. Since this is just an example, arp_actual.py doesn't get that involved. It is shown below.

Code Block
#!/usr/bin/env python
import os
import sys
import requests
import time
import datetime
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

logger.debug("In the arp_actual script - current time is %s", datetime.datetime.now().strftime("%c"))

for k, v in sorted(os.environ.items()):
    logger.debug("%s=%s", k, v)

## Fetch the URL to POST progress updates to
update_url = os.environ.get('JID_UPDATE_COUNTERS')
logger.debug("The URL to post updates is %s", update_url)

## These are the parameters that are passed in
logger.debug("The parameters passed into the script are %s", " ".join(sys.argv))

loop_count = 20
try:
    loop_count = int(sys.argv[1])
except Exception:
    pass

## Run a loop, sleep a second, then POST an update
for i in range(loop_count):
    time.sleep(1)
    logger.debug("Posting for step %s", i)
    requests.post(update_url, json=[
        {"key": "<strong>Counter</strong>", "value": "<span style='color: red'>{0}</span>".format(i + 1)},
        {"key": "<strong>Current time</strong>", "value": "<span style='color: blue'>{0}</span>".format(datetime.datetime.now().strftime("%c"))}
    ])

logger.debug("Done with job execution")
 

2.3) Log File Output

The logger.debug statements are sent to the job's log file. Note that one can form sbatch commands where the log output is not sent to a log file and is instead sent as an email. Part of an example log file output is shown below.

No Format
DEBUG:__main__:In the arp_actual script - current time is Thu Apr 16 11:12:40 2020
...
DEBUG:__main__:EXPERIMENT=diadaq13
...
DEBUG:__main__:JID_UPDATE_COUNTERS=https://pswww.slac.stanford.edu/ws/jid_slac/jid/ws/replace_counters/5e98a01143a11e512cb7c8ca
...
DEBUG:__main__:RUN_NUM=26
...
DEBUG:__main__:The parameters passed into the script are /reg/g/psdm/tutorials/batchprocessing/arp_actual.py 100
DEBUG:__main__:Posting for step 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): pswww.slac.stanford.edu:443
DEBUG:urllib3.connectionpool:https://pswww.slac.stanford.edu:443 "POST /ws/jid_slac/jid/ws/replace_counters/5e98a01143a11e512cb7c8ca HTTP/1.1" 200 195
DEBUG:__main__:Posting for step 1
...
DEBUG:__main__:Done with job execution
...

3) Frequently Asked Questions (FAQ)

Is it possible to submit more than one job per run?

...

Can a submitted job submit other subjobs?

  • Yes, in a standard LSF/SLURM fashion, BUT the ARP will not know about the subjobs.  Only jobs submitted through the ARP webpage are known to the ARP.

...

  • The ARP keeps track of the hashtags for each run and the associated LSF/SLURM jobid.  That information allows the ARP to kill jobs.

...

  • The JSON values are displayed in the ARP webpage automatically. To access them programmatically, use the kerberos endpoint; see API access to the LCLS2 eLog for more details. A minimal sketch is shown below.
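
As a rough sketch, programmatic access uses Kerberos authentication via the krtc package; the web service URL below (a batches_status endpoint for experiment id 302) is illustrative and may not match the current API, so check the API access to the LCLS2 eLog page for the authoritative endpoint.

Code Block
languagepy
import requests
from krtc import KerberosTicket
from urllib.parse import urlparse

# Illustrative URL; see the API access to the LCLS2 eLog page for the current endpoint
ws_url = "https://pswww.slac.stanford.edu/ws-kerb/batch_manager/ws/logbook/batches_status/302"

# Build Kerberos (SPNEGO) auth headers for the host and fetch the batch status JSON
krbheaders = KerberosTicket("HTTP@" + urlparse(ws_url).hostname).getAuthHeaders()
r = requests.get(ws_url, headers=krbheaders)
print(r.json())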