Why use the pipeline to run Monte-Carlo jobs?
The pipeline offers a number of advantages when running large numbers of Monte-Carlo jobs:
- Provides an easy way to submit a large number of jobs.
- Maintains a record of all jobs run, including links to log files and other files produced during the job.
- Makes it easy to rerun any jobs that fail due to quirks in the SLAC batch system.
- Makes it easy to register output datasets in the data catalog, which in turn makes it easy to keep track of what MC data is available.
- Provides a web interface to allow the status of jobs to be monitored from anywhere.
Example Monte-Carlo task
Normally, to run a set of Monte-Carlo jobs it is necessary to define a pipeline "Task". A task consists of an arbitrary graph of batch jobs and "scriptlets" to be run; however, a typical Monte-Carlo task consists of just two steps:
!TaskImageServlet.gif!
In this example the two steps are:
- runMonteCarlo - a batch job that executes the simulation program and generates the output file.
- register-ds - a scriptlet that runs only if the batch job finishes successfully and registers the output file in the data catalog.
Generating a large MC event sample typically requires running many MC jobs, each of which generates some number of events. In the pipeline this is achieved by creating many streams within a given task. In this example Monte-Carlo task, each stream runs one batch job followed by one registration scriptlet.
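To make the stream idea concrete, here is a minimal Python sketch of how a target sample size might be split into streams. It is only an illustration: the helper name plan_streams is hypothetical and not part of the pipeline, and it assumes stream ids start at 0 and that the event count can be set per stream.

```python
import math


def plan_streams(total_events, events_per_job):
    """Return one (stream_id, events) pair per stream needed to reach total_events."""
    n_streams = math.ceil(total_events / events_per_job)
    plan = []
    for stream_id in range(n_streams):
        # The last stream only generates whatever is still missing.
        remaining = total_events - stream_id * events_per_job
        plan.append((stream_id, min(events_per_job, remaining)))
    return plan


if __name__ == "__main__":
    for stream_id, events in plan_streams(25000, 10000):
        print("stream", stream_id, "generates", events, "events")
```

Each pair would correspond to one stream, that is, one batch job followed by one registration scriptlet.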
Defining a task
To create a pipeline task it is necessary to write an XML configuration file. The key elements of the XML configuration file for the task above (with some details initially left out) are shown here:
{code:title=EXOMCTemplate.xml}
<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://glast-ground.slac.stanford.edu/pipeline"
          xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
          xs:schemaLocation="http://glast-ground.slac.stanford.edu/pipeline http://srs.slac.stanford.edu/Pipeline-II/schemas/2.0/pipeline.xsd">
   <task name="EXOMCBackground" type="EXO" version="1.3">
      <notation>A generic task for running EXO MC backgrounds</notation>
      <variables>
         <var name="EXODIR">/nfs/slac/g/exo</var>
         <var name="EXOBASE">${EXODIR}/software/builds/trunk</var>
         <var name="BATCHOPTIONS">-R "select[-rhel30] rusage[scratch=1]"</var>
         <var name="CORE_LIMIT">1024</var>
         <var name="MAXEVENTS">10000</var>
         <var name="PRINTMODULO">${MAXEVENTS/100}</var>
         <var name="MAXCPU">${MAXEVENTS/10}</var>
         <var name="MAXMEM">1000</var>
         <var name="OUTPUT_DIR">/nfs/slac/g/exo/exo_data/data/MC/backgrounds/test</var>
         <var name="OUTPUT_FORMAT">MC-background-%06d.root</var>
         <var name="OUTPUT_NAME">${format(pipeline.stream,OUTPUT_FORMAT)}</var>
         <var name="OUTPUT_FILE">${OUTPUT_DIR}/${OUTPUT_NAME}</var>
         <var name="DATACAT_DIR">EXO/Test</var>
         <var name="DATACAT_GROUP">MyGroup</var>
      </variables>
      <process name="runMonteCarlo">
         <job batchOptions="${BATCHOPTIONS}" maxCPU="${MAXCPU}" maxMemory="${MAXMEM}">
            ...
         </job>
      </process>
      <process name="register-ds">
         <notation>Register datasets created in this task</notation>
         <script>
            ...
         </script>
         <depends>
            <after process="runMonteCarlo"/>
         </depends>
      </process>
   </task>
</pipeline>
{code}
In this file the <task> element defines the name and version number of the task, as well as the task type. The <notation> element just holds a comment describing the task. The <variables> section defines a set of variables which will be used elsewhere in the task. The values given to the variables are defaults which can be overridden for any specific stream when the stream is created. Note that variables can be defined in terms of other variables by using the ${expression} syntax.
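As an illustration of how defaults, per-stream overrides and ${...} references fit together, the following Python sketch mimics the substitution for a handful of the variables above. It is not the pipeline's own evaluator: the function name resolve_variables is hypothetical, only plain ${NAME} references are handled, and the format(pipeline.stream,OUTPUT_FORMAT) call is approximated with Python's % operator.

```python
import re

# Plain ${NAME} references only; arithmetic and format() are out of scope here.
_REFERENCE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

# Defaults copied from the <variables> section of the example task.
TASK_DEFAULTS = {
    "EXODIR": "/nfs/slac/g/exo",
    "EXOBASE": "${EXODIR}/software/builds/trunk",
    "OUTPUT_DIR": "/nfs/slac/g/exo/exo_data/data/MC/backgrounds/test",
    "OUTPUT_FORMAT": "MC-background-%06d.root",
    "OUTPUT_FILE": "${OUTPUT_DIR}/${OUTPUT_NAME}",
}


def resolve_variables(stream, overrides=None):
    """Return fully expanded variables for one stream (illustrative only)."""
    raw = dict(TASK_DEFAULTS)
    raw.update(overrides or {})  # per-stream values win over task defaults
    # Stand-in for ${format(pipeline.stream,OUTPUT_FORMAT)}.
    raw.setdefault("OUTPUT_NAME", raw["OUTPUT_FORMAT"] % stream)
    resolved = {}

    def expand(name, active=()):
        if name in active:
            raise ValueError("circular reference involving " + name)
        if name not in resolved:
            resolved[name] = _REFERENCE.sub(
                lambda match: expand(match.group(1), active + (name,)), raw[name]
            )
        return resolved[name]

    for name in raw:
        expand(name)
    return resolved


if __name__ == "__main__":
    print(resolve_variables(7)["OUTPUT_FILE"])
    print(resolve_variables(7, {"OUTPUT_DIR": "/tmp/mc"})["OUTPUT_FILE"])
```

Overriding OUTPUT_DIR for one stream changes OUTPUT_FILE for that stream only, which is the behaviour the defaults-plus-overrides scheme is meant to give.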
The two steps of the task are each defined using a <process> element. The first <process> contains a <job> element indicating it is a batch job (the body of the job is omitted for the moment). Note again how variables defined earlier in the task are used to provide values for some parameters using the ${expression} syntax. The second <process> contains a <script> element indicating that it is a scriptlet (again the body of the scriptlet is omitted for the time being). The <depends> element indicates that the scriptlet should only run after the batch job successfully completes.
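The effect of <depends> can be pictured as ordering a small dependency graph. The sketch below is a simplified model, not the pipeline's scheduler: it uses Python's standard graphlib module (Python 3.9 or later), and the function run_stream and the status strings are hypothetical names chosen for illustration.

```python
from graphlib import TopologicalSorter

# Process name -> set of processes it must run after (from the <depends> element).
DEPENDS = {
    "runMonteCarlo": set(),
    "register-ds": {"runMonteCarlo"},
}


def run_stream(depends, execute):
    """Run each process once its prerequisites succeeded; skip it otherwise.

    execute(name) is a caller-supplied function returning True on success.
    """
    status = {}
    for name in TopologicalSorter(depends).static_order():
        if all(status.get(dep) == "SUCCESS" for dep in depends.get(name, ())):
            status[name] = "SUCCESS" if execute(name) else "FAILED"
        else:
            status[name] = "SKIPPED"
    return status


if __name__ == "__main__":
    # Pretend the batch job fails: the registration scriptlet is never run.
    print(run_stream(DEPENDS, lambda name: name != "runMonteCarlo"))
```

With only two processes the ordering is trivial, but the same model extends to the arbitrary graphs of jobs and scriptlets that a task may contain.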
Now let's look at the parts which we initially left out. First, the body of the batch job, which by default is written as a bash script:
{code}
ulimit -c ${CORE_LIMIT} # Limit core dumps
set -e # exit on error

# Create a scratch area to write the output to
export SCRATCH_DIR=/scratch/exo/${PIPELINE_PROCESSINSTANCE}
mkdir -p ${SCRATCH_DIR}

gotEXIT()
{
 rm -rf ${SCRATCH_DIR}
}
trap gotEXIT EXIT

source ${EXOBASE}/setup.sh

cat > background.exo <<EOF
use exosim rec toutput
/exosim/macro background.mac
/exosim/filter true
printmodulo ${PRINTMODULO}
/exosim/initial_seed ${PIPELINE_STREAM}
maxevents ${MAXEVENTS}
/toutput/file ${SCRATCH_DIR}/output.root
begin
exit
EOF

cat > background.mac <<EOF
/digitizer/wireNoise 800.000000
/digitizer/APDNoise 2000.000000
/digitizer/LXeEnergyRes 0.015000
/event/LXeEventsOnly true
/event/digitizeWires true
/event/digitizeAPDs true
/gps/pos/type Volume
/gps/pos/shape Cylinder
/gps/pos/halfz 72.5 cm
/gps/pos/radius 75.0 cm
/gps/pos/centre 0.0 0.0 0.0 cm
/gps/pos/confine HFE
/gps/energy 0 keV
/gps/particle ion
/gps/ion 19 40 0 0
/grdm/analogueMC 1
EOF

EXOAnalysis background.exo

mkdir -p ${OUTPUT_DIR}
cp -pv ${SCRATCH_DIR}/output.root ${OUTPUT_FILE}
{code}
There are a few points worth noting:
- All of the variables defined earlier in the task are passed to the batch job as environment variables and can be referred to using the bash ${VARIABLE} syntax.
- The bash script creates a scratch folder at the start of the job, and registers a cleanup trap to delete the scratch area at the end of the job. The output data is written to the scratch area and copied to its final location only if the job completes successfully. This is recommended practice, since if many batch jobs write simultaneously to the same NFS file server it will likely become overloaded and fail horribly.
- The .mac file and .exo file required for running EXOAnalysis are generated on the fly, substituting in settings from the variables defined earlier. Note also the use of some special pipeline variables:
  - ${PIPELINE_STREAM} -- an id assigned when each stream is created. The id is unique within this task; it normally starts at 0 and increments for each subsequent stream.
  - ${PIPELINE_PROCESSINSTANCE} -- similar to the pipeline stream id, this is a unique id associated with a stream, but it is unique across all tasks, and is normally a large ugly number.
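The scratch-area pattern described above is not specific to bash. As a rough sketch, assuming Python 3 and a hypothetical helper named run_with_scratch (not part of the pipeline), the same idea of writing locally, copying on success and always cleaning up looks like this:

```python
import shutil
import tempfile
from pathlib import Path


def run_with_scratch(produce, final_path, scratch_root=None):
    """Write output in a scratch directory, then copy it to final_path on success.

    produce(path) is a caller-supplied function that writes the output file.
    """
    final_path = Path(final_path)
    # TemporaryDirectory removes the scratch area on exit, like the bash EXIT trap.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        scratch_file = Path(scratch) / "output.root"
        produce(scratch_file)  # if this raises, nothing is copied
        final_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(scratch_file, final_path)
    return final_path


if __name__ == "__main__":
    target = Path(tempfile.mkdtemp()) / "MC-background-000000.root"
    run_with_scratch(lambda path: path.write_text("fake events"), target)
    print("wrote", target)
```

The shared file server sees a single copy per job instead of many small writes spread over the lifetime of the job.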
Finally, let's look at the body of the scriptlet which is used to register the output dataset. The scriptlet is written in Python:
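{code}
# Arbitrary meta-data to attach to the dataset
attributes = {'sCreator':'tonyj','nThings':0.1}
# Describe the new dataset using the task variables defined earlier
dsNew = datacatalog.newDataset(OUTPUT_NAME, "root", "EXOROOT", DATACAT_DIR, DATACAT_GROUP, "SLAC", OUTPUT_FILE)
# Register the dataset, together with its meta-data, in the data catalog
datacatalog.registerDataset(dsNew, attributes)
{code}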
Again a few things worth noting:
- The datacatalog allows arbitrary meta-data to be associated with datasets. In this case the meta-data is defined as a Python dictionary.
- The newDataset method has many arguments which need some explanation.
Putting everything together, the full XML file for the task is:
{code:title=EXOMCBackground.xml}
<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://glast-ground.slac.stanford.edu/pipeline"
          xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
          xs:schemaLocation="http://glast-ground.slac.stanford.edu/pipeline http://srs.slac.stanford.edu/Pipeline-II/schemas/2.0/pipeline.xsd">
   <task name="EXOMCBackground" type="EXO" version="1.5">
      <notation>A generic task for running EXO MC backgrounds</notation>
      <variables>
         <var name="EXODIR">/nfs/slac/g/exo</var>
         <var name="EXOBASE">${EXODIR}/software/builds/trunk</var>
         <var name="BATCHOPTIONS">-R "select[-rhel30] rusage[scratch=1]"</var>
         <var name="CORE_LIMIT">1024</var>
         <var name="MAXEVENTS">10000</var>
         <var name="PRINTMODULO">${MAXEVENTS/100}</var>
         <var name="INITIALSEED">${pipeline.stream%100000}</var>
         <var name="MAXCPU">${MAXEVENTS/10}</var>
         <var name="MAXMEM">1000</var>
         <var name="OUTPUT_DIR">/nfs/slac/g/exo/exo_data/data/MC/backgrounds/test</var>
         <var name="OUTPUT_FORMAT">MC-background-%06d.root</var>
         <var name="OUTPUT_NAME">${format(pipeline.stream,OUTPUT_FORMAT)}</var>
         <var name="OUTPUT_FILE">${OUTPUT_DIR}/${OUTPUT_NAME}</var>
         <var name="DATACAT_DIR">EXO/Test</var>
         <var name="DATACAT_GROUP">MyGroup</var>
      </variables>
      <process name="runMonteCarlo">
         <job batchOptions="${BATCHOPTIONS}" maxCPU="${MAXCPU}" maxMemory="${MAXMEM}"><![CDATA[
ulimit -c ${CORE_LIMIT} # Limit core dumps
set -e # exit on error

# Create a scratch area to write the output to
export SCRATCH_DIR=/scratch/exo/${PIPELINE_PROCESSINSTANCE}
mkdir -p ${SCRATCH_DIR}

gotEXIT()
{
 rm -rf ${SCRATCH_DIR}
}
trap gotEXIT EXIT

source ${EXOBASE}/setup.sh

cat > background.exo <<EOF
use exosim rec toutput
/exosim/macro background.mac
/exosim/filter true
printmodulo ${PRINTMODULO}
/exosim/initial_seed ${INITIALSEED}
/exosim/run_number ${PIPELINE_STREAM}
maxevents ${MAXEVENTS}
/toutput/file ${SCRATCH_DIR}/output.root
begin
exit
EOF

cat > background.mac <<EOF
/digitizer/wireNoise 800.000000
/digitizer/APDNoise 2000.000000
/digitizer/LXeEnergyRes 0.015000
/event/LXeEventsOnly true
/event/digitizeWires true
/event/digitizeAPDs true
/gps/pos/type Volume
/gps/pos/shape Cylinder
/gps/pos/halfz 72.5 cm
/gps/pos/radius 75.0 cm
/gps/pos/centre 0.0 0.0 0.0 cm
/gps/pos/confine HFE
/gps/energy 0 keV
/gps/particle ion
/gps/ion 19 40 0 0
/grdm/analogueMC 1
EOF

EXOAnalysis background.exo

mkdir -p ${OUTPUT_DIR}
cp -pv ${SCRATCH_DIR}/output.root ${OUTPUT_FILE}
]]>
         </job>
      </process>
      <process name="register-ds">
         <notation>Register datasets created in this task</notation>
         <script><![CDATA[
attributes = {'sCreator':'tonyj','nThings':0.1}
dsNew = datacatalog.newDataset(OUTPUT_NAME, "root", "EXOROOT", DATACAT_DIR, DATACAT_GROUP, "SLAC", OUTPUT_FILE)
datacatalog.registerDataset(dsNew, attributes)
]]>
         </script>
         <depends>
            <after process="runMonteCarlo"/>
         </depends>
      </process>
   </task>
</pipeline>
{code}