Processing Data in Batch Mode using LCSim XML
...
Overview
This tutorial explains how to run org.lcsim in a batch computing environment, such as from a Unix command line or from a shell script that could be run on the grid.
If you have not gotten here by following the LCSim Tutorials, you might want to back up and review them as necessary.
Setup
Follow the instructions for building lcsim software using maven2.
The user provides what is typically called a "steering" file in HEP, which specifies all the parameters of the batch job. These steering files may have the extension .xml, but .lcsim is recommended instead to avoid ambiguity.
Running from the Command Line
Before starting you need to install org.lcsim on your local machine.
You can now run lcsim from the command line using the java command.
No Format |
---|
cd trunk/distribution # where is your lcsim?
java -server -jar ./target/lib/lcsim-distribution-[VERSION]-bin.jar myjob.lcsim
|
Replace [VERSION] with your lcsim build version so that the path points to the actual "bin" jar in your target directory.
The myjob.lcsim argument is an example name of a file in the lcsim reconstruction XML format.
No Format |
---|
java -server -jar ./target/lib/lcsim-1.11-SNAPSHOT.jar ./myJob.xml
|
Simple Example
The JobManager class processes your job, which is written in an XML format.
Here is a simple example which will print the event number.
No Format |
---|
<lcsim>
<inputFiles>
<file>./myEvents.slcio</file>
</inputFiles>
<control>
<numberOfEvents>100</numberOfEvents>
</control>
<execute>
<driver name="EventMarkerDriver"/>
</execute>
<drivers>
<driver name="EventMarkerDriver"
type="org.lcsim.job.EventMarkerDriver">
<eventInterval>1</eventInterval>
</driver>
</drivers>
</lcsim>
|
The inputFiles section is a list of one or more LCIO input file paths that will be processed. There are multiple ways to specify input files (covered below).
The control section sets the job's run parameters. Here we set the maximum numberOfEvents to 100.
The execute section is a list of drivers to be executed in order. The name attribute of each driver element must correspond to a driver defined in the drivers section.
Finally, the drivers section describes the drivers that will be run on the input file. Certain types of Driver parameters can be set in this section. Here the interval for event printing is set as eventInterval, which is an integer.
The signature of the corresponding Driver method looks like this.
No Format |
---|
public void setEventInterval(int eventInterval);
|
The JobManager is able to convert XML parameters into calls to these simple setters using JavaBeans conventions. All Java primitive types are accepted, as are 1D arrays of these types. The method must take exactly one argument.
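The conversion from an XML element to a setter call can be pictured with plain Java reflection. The following is only a minimal sketch under stated assumptions, not the actual JobManager code: the class SetterBindingSketch, its nested ExampleDriver, and the helper methods are hypothetical names introduced for illustration, and only int parameters are handled.

```java
import java.lang.reflect.Method;

/** Sketch of binding an XML parameter to a setter; not the real JobManager implementation. */
final class SetterBindingSketch {

    /** Hypothetical stand-in for a Driver with a single int parameter. */
    static final class ExampleDriver {
        private int eventInterval;

        public void setEventInterval(int eventInterval) {
            this.eventInterval = eventInterval;
        }

        public int getEventInterval() {
            return eventInterval;
        }
    }

    /** Builds the setter name for an XML element name, e.g. eventInterval -> setEventInterval. */
    static String setterName(String elementName) {
        return "set" + Character.toUpperCase(elementName.charAt(0)) + elementName.substring(1);
    }

    /** Looks up a public setter taking one int and calls it with the parsed element text. */
    static void applyIntParameter(Object driver, String elementName, String text) throws Exception {
        Method setter = driver.getClass().getMethod(setterName(elementName), int.class);
        setter.invoke(driver, Integer.parseInt(text.trim()));
    }

    public static void main(String[] args) throws Exception {
        ExampleDriver driver = new ExampleDriver();
        // Equivalent of <eventInterval>1</eventInterval> inside the <driver> element.
        applyIntParameter(driver, "eventInterval", "1");
        System.out.println(driver.getEventInterval()); // prints 1
    }
}
```

A setter with two arguments, or one whose name does not start with "set", would simply not be found by this lookup, which mirrors the single-argument restriction described above.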
LCSim Command Line Options
In the rest of this documentation, the runnable jar is referred to as lcsim-distribution-bin.jar, but the actual jar will have the version number in its name.
Running the jar without any arguments will print the usage instructions.
No Format |
---|
java -jar lcsim-distribution-bin.jar [options] steeringFile.xml
usage:
-D Define a variable with form [name]=[value]
-n Set the max number of events to process.
-p Load a properties file containing variable definitions
-q Turn on quiet mode.
-s Set the number of events to skip.
-v Turn on verbose mode
-w Rewrite the XML file with variables resolved
-x Perform a dry run which does not process events
|
These options should be mostly self-explanatory.
Variable Definitions
The LCSim XML format allows variables to be defined using the -D switch or within properties files specified by the -p option.
For instance, an LCIO input file could be defined using a variable.
No Format |
---|
<file>${inputFile}</file>
|
Then this file could be specified at the command line.
No Format |
---|
java -jar lcsim-distribution-bin.jar -DinputFile=myInputFile.slcio steeringFile.xml
|
This variable could also be set in a properties file.
No Format |
---|
java -jar lcsim-distribution-bin.jar -pmySettings.prop steeringFile.xml
|
The file mySettings.prop could contain the following.
No Format |
---|
inputFile=myInputFile.slcio
|
An unlimited number of definitions and properties files can be used.
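To make the substitution step concrete, here is a minimal sketch of how ${name} placeholders could be resolved from a set of definitions. It is an illustration only: the class VariableSubstitutionSketch and its resolve method are hypothetical, and failing on an undefined variable is an assumption rather than documented LCSim behavior.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of ${name} substitution in steering file text; not the LCSim implementation. */
final class VariableSubstitutionSketch {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} in the text with its value from the variables map. */
    static String resolve(String text, Map<String, String> variables) {
        Matcher matcher = VARIABLE.matcher(text);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = variables.get(name);
            if (value == null) {
                // Assumption: an undefined variable is treated as an error.
                throw new IllegalArgumentException("Undefined variable: " + name);
            }
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        Map<String, String> variables = new HashMap<>();
        variables.put("inputFile", "myInputFile.slcio"); // as given by -DinputFile=myInputFile.slcio
        System.out.println(resolve("<file>${inputFile}</file>", variables));
        // prints <file>myInputFile.slcio</file>
    }
}
```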
LCSim XML Format
The pseudo-XML below shows all possible elements in the LCSim format.
No Format |
---|
<lcsim xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
xs:noNamespaceSchemaLocation="http://www.lcsim.org/schemas/lcsim/1.0/lcsim.xsd">
<inputFiles>
<fileUrl />
<file />
<fileSet>
<file />
</fileSet>
<fileList />
<fileUrlList />
<fileRegExp />
</inputFiles>
<control>
<dryRun>true</dryRun>
<logFile>/path/to/mylog.txt</logFile>
<cacheDirectory>/path/to/mycache/</cacheDirectory>
<skipEvents>1</skipEvents>
<numberOfEvents>1000</numberOfEvents>
<verbose>true</verbose>
<printDriverStatistics>true</printDriverStatistics>
<printSystemProperties>true</printSystemProperties>
<printUserClassPath>true</printUserClassPath>
<printDriversDetailed>true</printDriversDetailed>
</control>
<classpath>
<jarUrl />
<jar />
<directory />
</classpath>
<define>
<anExampleVariable>1234</anExampleVariable>
</define>
<execute>
<driver name="ExampleDriver" />
</execute>
<drivers>
<driver name="ExampleDriver" type="org.lcsim.example.ExampleDriver">
<exampleParam>1234</exampleParam>
<exampleArrayParam>1 2 3 4</exampleArrayParam>
<exampleArray2DParam>1 2 3 4; 5 6 7 8</exampleArray2DParam>
</driver>
</drivers>
</lcsim>
|
The format is completely described by the LCSim XML Schema. At run time, the schema is not read from the internet but from an embedded resource in the LCSim jar file. If your XML file does not follow this format, the job will fail, and a stack trace will be printed with information about the error.
Input Files
The <inputFiles> section contains a list of local or remote files to be processed. It may contain any mixture of the elements described below, but it may not be empty, and it must resolve to at least one input file or the job will fail.
file
The <file> element is a relative or absolute path to a file on the local file system.
No Format |
---|
<inputFiles>
<file>/path/to/local/datafile.slcio</file>
</inputFiles>
|
Alternatively, it may be a publicly accessible URL.
No Format |
---|
<inputFiles>
<file>ftp://example.org/datafile.slcio</file>
</inputFiles>
|
Some batch systems may not support remote file access via a URL. Check with your administrator.
These remote files will be downloaded to the cache directory, which is ~/.cache, by default. A different local cache directory can be specified using the <cacheDirectory> tag in the <control> section.
fileSet
Sets of files on the local filesystem with the same base directory can be specified by using the <fileSet> element.
No Format |
---|
<fileSet baseDir="/my/data/dir">
<file>events1.slcio</file>
<file>events2.slcio</file>
</fileSet>
|
When processing these files, the base directory "/my/data/dir" will be prepended to each file name to make a complete file path.
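The path construction can be sketched in a few lines of Java. The class FileSetSketch and its resolve method are hypothetical names used only for illustration; the sketch assumes that each entry is a plain file name relative to baseDir.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of resolving <fileSet> entries against baseDir; not the LCSim implementation. */
final class FileSetSketch {

    /** Prepends the base directory to each file name. */
    static List<File> resolve(String baseDir, List<String> fileNames) {
        List<File> files = new ArrayList<>();
        for (String name : fileNames) {
            files.add(new File(baseDir, name.trim()));
        }
        return files;
    }

    public static void main(String[] args) {
        // Same values as the <fileSet> example above.
        List<File> files = resolve("/my/data/dir", Arrays.asList("events1.slcio", "events2.slcio"));
        for (File file : files) {
            System.out.println(file.getPath());
        }
    }
}
```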
fileList
The <fileList> element should point to a text file containing a list of files, one per line.
For instance, say that you had a local text file at /example/mylciofiles.txt containing paths to local LCIO files.
No Format |
---|
/my/data/dir/events1.slcio
/my/data/dir/events2.slcio
|
This can be fed into LCSim using this XML code.
No Format |
---|
<fileList>/example/mylciofiles.txt</fileList>
|
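Reading such a list file amounts to collecting one path per line. The sketch below shows one way this could be done; the class FileListSketch is hypothetical, and skipping blank lines and trimming whitespace are assumptions, not documented LCSim behavior.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of reading a <fileList> text file; not the LCSim implementation. */
final class FileListSketch {

    /** Returns the non-blank lines of the list file, one input path per line. */
    static List<String> readFileList(Path listFile) throws IOException {
        List<String> paths = new ArrayList<>();
        for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) { // assumption: blank lines are ignored
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Write a temporary list file with the same content as the example above.
        Path listFile = Files.createTempFile("mylciofiles", ".txt");
        Files.write(listFile,
                Arrays.asList("/my/data/dir/events1.slcio", "/my/data/dir/events2.slcio"),
                StandardCharsets.UTF_8);
        System.out.println(readFileList(listFile));
        Files.delete(listFile);
    }
}
```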
fileRegExp
The <fileRegExp> element will include files that match a regular expression.
Here is an example that would match files similar to input1.slcio, input2.slcio, etc. in the current directory.
No Format |
---|
<fileRegExp baseDir=".">input*[0-9].slcio</fileRegExp>
|
See http://docs.oracle.com/javase/tutorial/essential/regex/ for more information about regular expressions in Java.
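Because the pattern is a Java regular expression and not a shell glob, it is worth checking what it actually matches. The sketch below applies a pattern to a list of file names with java.util.regex. The class FileRegExpSketch is hypothetical, and the use of a full match against the file name is an assumption about how LCSim applies the expression. Under that assumption, the pattern from the example matches input1.slcio but not input10.slcio, since [0-9] matches exactly one digit and the unescaped "." matches any single character. A stricter pattern such as input[0-9]+\.slcio accepts both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of selecting file names with a Java regular expression; not the LCSim implementation. */
final class FileRegExpSketch {

    /** Returns the names that match the regular expression in full. */
    static List<String> select(String regex, List<String> fileNames) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String name : fileNames) {
            if (pattern.matcher(name).matches()) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("input1.slcio", "input10.slcio", "output1.slcio");
        // The pattern from the example above.
        System.out.println(select("input*[0-9].slcio", names));    // prints [input1.slcio]
        // A stricter pattern: one or more digits and a literal dot.
        System.out.println(select("input[0-9]+\\.slcio", names));  // prints [input1.slcio, input10.slcio]
    }
}
```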
Job Control
The <control> section contains parameters that control the batch job, including the number of events to run and whether various debugging output should be printed.
dryRun
Setting <dryRun> to true means that the job manager will create the drivers but will not run the job. This can be used to check that your driver setup and arguments are correct. No events will be processed when this argument is set to true.
logFile
The <logFile> element is used to specify a log file location. If no log file is specified, the job output goes to the terminal screen. The text needs to point to a valid path on the local file system.
cacheDirectory
The <cacheDirectory> specifies the root directory to be used for caching remote data files.
numberOfEvents
The <numberOfEvents> is the total number of events that will be run before the job ends. All events will be processed if this argument is left blank or if it is set to a negative number.
skipEvents
The <skipEvents> argument tells the job manager to skip a number of events up-front before processing the rest.
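The interplay of <skipEvents> and <numberOfEvents> can be summarized in a small calculation. This is a sketch under an explicit assumption: skipped events are taken not to count toward numberOfEvents, which the text above does not state either way. The class EventWindowSketch is hypothetical.

```java
/** Sketch of how skipEvents and numberOfEvents could bound a job; not the LCSim implementation. */
final class EventWindowSketch {

    /**
     * Returns how many events would be processed.
     * A negative numberOfEvents means "process everything that is left".
     */
    static int countProcessed(int eventsInInput, int skipEvents, int numberOfEvents) {
        int available = Math.max(0, eventsInInput - Math.max(0, skipEvents));
        if (numberOfEvents < 0) {
            return available;
        }
        // Assumption: skipped events do not count toward numberOfEvents.
        return Math.min(available, numberOfEvents);
    }

    public static void main(String[] args) {
        System.out.println(countProcessed(1000, 1, 100)); // prints 100
        System.out.println(countProcessed(50, 10, -1));   // prints 40
    }
}
```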
The "print" tags can also be set to true to print out additional information about the job: <printDriverStatistics>, <printSystemProperties>, <printUserClassPath>, <printDriversDetailed>, and <printInputFiles>. The meaning of each should be self-explanatory.
The following will turn on all verbose output but turn off the printing of the system properties.
No Format |
---|
<control>
<verbose>true</verbose>
<printSystemProperties>false</printSystemProperties>
</control>
|
The settings of individual "print" commands will always override the verbose setting for that particular print out.
verbose
The <verbose> tag should be set to true to enable verbose debugging output when the XML input file is processed. This turns on all of the "print" elements described above, which can still be turned off individually by setting them to false after verbose has been turned on.
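The override rule can be written as a one-line decision: an explicit "print" setting wins, and verbose only supplies the default. The sketch below is illustrative only. The class PrintSettingSketch is hypothetical, and treating an absent element as null with a default of false when verbose is off is an assumption.

```java
/** Sketch of combining <verbose> with an individual "print" element; not the LCSim implementation. */
final class PrintSettingSketch {

    /**
     * Returns whether a given printout is enabled.
     * The explicit argument is null when the print element is absent from the XML.
     */
    static boolean effective(Boolean explicit, boolean verbose) {
        return explicit != null ? explicit : verbose;
    }

    public static void main(String[] args) {
        // <verbose>true</verbose> with <printSystemProperties>false</printSystemProperties>
        System.out.println(effective(Boolean.FALSE, true)); // prints false
        // <verbose>true</verbose> with no explicit print element
        System.out.println(effective(null, true));          // prints true
    }
}
```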
Variable Definitions
The job manager has very limited support for "free" variable definitions, using the <define> block.
At the moment, this is limited to single doubles, which can include expressions to be evaluated.
Here is an example of a simple double parameter.
No Format |
---|
<define>
<aDoubleParam>1.1</aDoubleParam>
</define>
|
Variables defined here can be included in expressions by using their name.
No Format |
---|
<define>
<aDoubleParam1>1.1</aDoubleParam1>
<aDoubleParam2>2.2</aDoubleParam2>
<aDoubleParam3>aDoubleParam1 + aDoubleParam2</aDoubleParam3>
</define>
|
Variables defined here are also available when passing values to Drivers (covered in the next section).
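To illustrate how definitions can build on one another, the sketch below evaluates the three <define> entries above in order, using a small hand-written evaluator. It is a stand-in for illustration and not the library LCSim uses: the class DefineExpressionSketch is hypothetical, and it supports only +, -, *, /, parentheses, plain decimal numbers (no exponent notation), and previously defined variable names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Tiny recursive-descent evaluator used as a stand-in; not the LCSim expression engine. */
final class DefineExpressionSketch {
    private final String text;
    private final Map<String, Double> variables;
    private int pos;

    private DefineExpressionSketch(String text, Map<String, Double> variables) {
        this.text = text;
        this.variables = variables;
    }

    /** Evaluates an expression, looking up names in the already defined variables. */
    static double evaluate(String expression, Map<String, Double> variables) {
        DefineExpressionSketch parser = new DefineExpressionSketch(expression, variables);
        double value = parser.sum();
        parser.skipSpaces();
        if (parser.pos != parser.text.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + parser.pos);
        }
        return value;
    }

    // sum := product (('+' | '-') product)*
    private double sum() {
        double value = product();
        while (true) {
            if (accept('+')) value += product();
            else if (accept('-')) value -= product();
            else return value;
        }
    }

    // product := factor (('*' | '/') factor)*
    private double product() {
        double value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor := '-' factor | '(' sum ')' | number | variableName
    private double factor() {
        if (accept('-')) return -factor();
        if (accept('(')) {
            double value = sum();
            if (!accept(')')) throw new IllegalArgumentException("Missing ')'");
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < text.length() && isTokenChar(text.charAt(pos))) pos++;
        String token = text.substring(start, pos);
        if (token.isEmpty()) throw new IllegalArgumentException("Expected a value at position " + pos);
        if (Character.isDigit(token.charAt(0)) || token.charAt(0) == '.') return Double.parseDouble(token);
        Double value = variables.get(token);
        if (value == null) throw new IllegalArgumentException("Undefined variable: " + token);
        return value;
    }

    private static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '.' || c == '_';
    }

    private boolean accept(char expected) {
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == expected) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        // Definitions are evaluated in document order, so later ones can use earlier ones.
        Map<String, Double> variables = new LinkedHashMap<>();
        variables.put("aDoubleParam1", evaluate("1.1", variables));
        variables.put("aDoubleParam2", evaluate("2.2", variables));
        variables.put("aDoubleParam3", evaluate("aDoubleParam1 + aDoubleParam2", variables));
        System.out.println(variables.get("aDoubleParam3"));
    }
}
```

In this sketch a definition can only refer to variables defined before it, so the order of the entries in the <define> block matters.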
Class Path
The classpath section is for adding external jar files that contain Driver classes.
Here is an example pointing to a (nonexistent) jar at a URL.
No Format |
---|
<classpath>
<jarUrl>http://www.example.org/something/myjar.jar</jarUrl>
</classpath>
|
The same thing can be done with local jar files and directories.
No Format |
---|
<classpath>
<jar>/path/to/myjar.jar</jar>
<directory>/path/to/myclassfiles</directory>
</classpath>
|
LCSim cannot determine the dependencies of the jar files listed here, so all required dependencies must also be listed in this section.
...
Driver Execution
The <execute> section specifies the order in which the drivers will be called for each event. Each <driver> tag must have a unique name attribute value that matches the name of a driver defined in the <drivers> section (see next section).
...
The <drivers> section contains definitions for all drivers that will be called in the job. These drivers need to be defined in the LCSim package jar or any of the jars in the <classpath>.
Driver Arguments
LCSim can convert simple XML text into arguments for Driver methods. Only simple method signatures with a single argument are supported, and only a limited number of types are included in this binding.
Here is a table of supported parameter types.
type | array1d | array2d | expression |
---|---|---|---|
int | yes | yes | yes |
String | yes | no | no |
double | yes | yes | yes |
float | yes | no | yes |
boolean | yes | no | no |
Hep3Vector | no | no | no |
File | no | no | no |
URL | no | no | no |
Types with a "yes" in the array1d or array2d columns support arrays of those dimensions. Arrays beyond two dimensions are not supported and would need to be read in manually by user code, perhaps using a method with a File or URL argument. Types that support expression evaluation have a "yes" in the expression column.
Driver Example
The easiest way to understand how the driver parameter conversion works is to study an example.
Here is an example Driver class with a number of setter methods.
No Format |
---|
package org.lcsim.example;

public class MyDriver {
    public void setX(int x);
    public void setX1(int[] x1);
    public void setX2(int[][] x2);
    public void setFile(File f);
    public void setUrl(URL url);
    public void setVector(Hep3Vector vec);
}
|
In a real Driver, these methods would be implemented to set private variables to the passed arguments, but this is left out of the example for brevity.
This is the corresponding XML code in <drivers> that would pass values to each of these methods.
No Format |
---|
<driver name="MyDriver" type="org.lcsim.example.MyDriver">
    <x>1</x>
    <x1>1 2 3</x1>
    <x2>1 2 3; 4 5 6</x2>
    <file>/path/to/a/file.txt</file>
    <url>http://example.org/file.txt</url>
    <vector>1.0 2.0 3.0</vector>
</driver>
|
There are several important things to notice in this example.
The set methods are matched to parameter names by removing the "set" string from the method name and making the first letter of the parameter lower case. The Driver set methods must begin with "set", or they will be ignored and not matched with any input parameters.
The elements of array arguments are space delimited, so String arguments should not contain spaces.
The rows in 2D arrays are separated by semicolons.
In the above example, integers are used for the 1D and 2D arrays, but other types support arrays, also. See the types table above for specifics.
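The delimiter rules for arrays can be made concrete with a short parsing sketch. The class ArrayParameterSketch and its methods are hypothetical and only cover int values. The sketch shows the space and semicolon conventions described above, not the actual LCSim conversion code.

```java
import java.util.Arrays;

/** Sketch of parsing space- and semicolon-delimited array text; not the LCSim implementation. */
final class ArrayParameterSketch {

    /** Parses space-delimited integers, e.g. "1 2 3". */
    static int[] parse1D(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] tokens = trimmed.split("\\s+");
        int[] values = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Integer.parseInt(tokens[i]);
        }
        return values;
    }

    /** Parses rows separated by semicolons, e.g. "1 2 3; 4 5 6". */
    static int[][] parse2D(String text) {
        String[] rows = text.split(";");
        int[][] values = new int[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            values[i] = parse1D(rows[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse1D("1 2 3")));            // prints [1, 2, 3]
        System.out.println(Arrays.deepToString(parse2D("1 2 3; 4 5 6"))); // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```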
Running a Specific LCSim Release
When an LCSim release is made, a zip file is created containing the LCSim jar and all its dependencies. Running a specific version of LCSim from the command line is as simple as downloading this zip file, unzipping it, and using the java command to run the jar with your XML input.
Retrieve the dependencies zip file for the version you want to run.
No Format |
---|
wget http://www.lcsim.org/maven2/org/lcsim/lcsim/1.4/lcsim-1.4-deps.zip
|
You can also paste this URL into your browser, and a prompt should show asking whether to download it. (Specifics depend on your browser.)
Now, unzip the dependencies zip file. All the jars will show up in a directory called lib/ in your current directory.
No Format |
---|
unzip lcsim-1.4-deps.zip
|
This uses the command-line unzip utility, but a zip program with a GUI such as WinZip or WinRAR would work fine, too.
We're ready to run this version of lcsim. This step requires Java 1.5 or greater to be installed and accessible from your command terminal.
No Format |
---|
java -server -jar ./lib/lcsim-1.4.jar ./myJob.xml
|
Expression Evaluation
Simple expression evaluation is supported for a limited set of the supported parameter types, including int, double, and float, plus 1D or 2D arrays of these types. Supported symbols include *, /, +, (, ), and -, which have their usual mathematical meaning, plus trig functions like sin and cos. Variables created in <define> can also be accessed by their name. Expressions may have units, also.
The GNU JEL library provides this capability. Refer to its documentation for further information on the expression format.
Units
LCSim supports the named units defined by CLHEP's SystemOfUnits.
The names of the units are the same, but the actual values may not be the same. For instance, in LCSim, the basic energy unit is GeV, whereas it is MeV in CLHEP.
Refer to the LCSim SystemOfUnits documentation to see which units are defined.
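The practical consequence of a GeV-based system is that a value written in MeV is stored as a thousandth of the same number written in GeV. The sketch below illustrates this with a few constants. The class UnitsSketch and the numeric values are assumptions derived only from the statement that GeV is the basic energy unit; the authoritative definitions are in the LCSim SystemOfUnits class.

```java
/** Sketch of named energy units in a GeV-based system; values are assumptions, not LCSim constants. */
final class UnitsSketch {

    // Assumed definitions: GeV is the base unit, the others are scaled relative to it.
    static final double GeV = 1.0;
    static final double MeV = 1.0e-3 * GeV;
    static final double TeV = 1.0e3 * GeV;

    /** Converts a value held in internal (GeV-based) units into the requested unit. */
    static double in(double internalValue, double unit) {
        return internalValue / unit;
    }

    public static void main(String[] args) {
        double threshold = 500.0 * MeV; // multiplying by the unit converts to internal units
        System.out.println(in(threshold, GeV));
    }
}
```

Multiplying by a unit on input and dividing by a unit on output keeps user code independent of which base unit the framework happens to use.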
Guidelines for Creating Compatible Drivers
Drivers that will be accessed via an LCSim XML file need to follow these guidelines.
- The Driver class must be public.
- The Driver class must have a public constructor that takes no arguments.
- The Driver's constructor should not do any initialization. It should instead use the detectorChanged() or startOfData() methods, which are called after all input parameters are processed.
- The set methods to be accessed in the XML should always be of the form
No Format |
---|
public void set[ParameterName]([type] [varName])
|
Set methods not of this form will not be accessible as XML parameters.
- The use of sub-drivers is discouraged due to these being inaccessible by the XML format, though it is still possible to use them. Any dependence of a child Driver on its parent's XML input parameters can be handled by using the startOfData() method to add a new child Driver instance.
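The guidelines above can be put together in a short skeleton. Since this sketch has to compile on its own, it uses a minimal stand-in named Driver instead of the real LCSim base class. The stand-in, the class names ParentDriver and ChildDriver, and the threshold parameter are all hypothetical and only illustrate the pattern: a public no-argument constructor, a single-argument setter, and a child Driver created in startOfData() once the XML parameters have been applied.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of an XML-compatible Driver; the Driver base class here is only a stand-in. */
final class CompatibleDriverSketch {

    /** Minimal stand-in for the LCSim Driver base class, so that the sketch is self-contained. */
    static class Driver {
        private final List<Driver> children = new ArrayList<>();

        protected void startOfData() {
        }

        protected void add(Driver child) {
            children.add(child);
        }

        List<Driver> children() {
            return children;
        }
    }

    /** Child driver that depends on a parameter owned by its parent. */
    public static class ChildDriver extends Driver {
        private final double threshold;

        public ChildDriver(double threshold) {
            this.threshold = threshold;
        }

        public double getThreshold() {
            return threshold;
        }
    }

    /** Parent driver following the guidelines above. */
    public static class ParentDriver extends Driver {
        private double threshold;

        /** Public no-argument constructor that does no initialization. */
        public ParentDriver() {
        }

        /** Would be matched to a <threshold> element in the XML. */
        public void setThreshold(double threshold) {
            this.threshold = threshold;
        }

        /** Runs after all XML parameters are set, so the child sees the final value. */
        @Override
        protected void startOfData() {
            add(new ChildDriver(threshold));
        }
    }

    public static void main(String[] args) {
        ParentDriver parent = new ParentDriver();
        parent.setThreshold(2.5); // what the job manager would do for <threshold>2.5</threshold>
        parent.startOfData();
        ChildDriver child = (ChildDriver) parent.children().get(0);
        System.out.println(child.getThreshold()); // prints 2.5
    }
}
```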
How to Run a Specific Release
You do not need to build lcsim yourself in order to run a specific release. Searching the SLAC Nexus Repository for lcsim-distribution releases displays a table that includes download links. The bin.jar links are the runnable jars, which can be downloaded to your machine and run as per the above instructions. Each release is also tagged in CVS, for example lcsim-1_4, so checking it out and rebuilding it yourself is another possibility (not covered here).