Reason for the change

  • In April, the computing division at IN2P3 informed us that it is moving from its current batch system, BQS, to a new system, Sun Grid Engine (often also denoted SGE, GE or gridEngine).
  • This change will effectively replace all current BQS workers with SGE workers by the end of 2011, so the jobControl daemon has to be updated to accommodate it.
  • Details of the current BQS implementation are provided in P2 Architecture at Lyon; details regarding CCIN2P3 are in the CCIN2P3 Pipeline pages.

Urgency

  • High: Lyon will switch fully to SGE sooner rather than later, and we will lose our computing resources if we do not comply with the changes.

Details

  • A new GridEngineJobControlService module has been added to the current org-glast-jobcontrol package. This class is largely a duplicate of the existing BQSJobControlService module, except that the submission parameters are changed to meet SGE's needs. All changes to this code are committed to CVS.
  • In addition, SGE does not provide the same commands for querying job status. Therefore a Python wrapper, ge-qselect, has been written that uses native SGE commands but produces output identical to that of the BQS qselect command. This code is under SVN version control, and its latest version lives on ccglast.in2p3.fr in /glast_data/Pipeline2/gridEngine/ge-qselect
  • The new daemon is loaded together with the current BQS implementation, following the procedure described in the CCIN2P3 Pipeline pages (see link above). I have created a small wrapper script, bsub-all.sh, that starts the daemon and registers it along with the BQS service. In addition, Lyon has opened port 1097 for communication with our pipeline infrastructure.
  • To accommodate the change from BQS to SGE without rewriting major parts of the pipeline wrapper scripts, a number of SGE environment variables need to be mapped onto the old BQS variable names. See the taskconfig.xml of LYON-TEST-AG-GR-v17r35p14 for details:
    taskconfig.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <pipeline
       xmlns="http://glast-ground.slac.stanford.edu/pipeline"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://glast-ground.slac.stanford.edu/pipeline http://glast-ground.slac.stanford.edu/Pipeline-II/schemas/2.0/pipeline.xsd">
    
       <task name="LYON-TEST-AG-GR-v17r35p14"
             version="1.1"
             type="LYON">
    
          <notation>
               more elaborate test case for GridEngine JobControl Daemon
          </notation>
    
          <variables>
    <!-- IN2P3 version -->
    
                <var name="GPL_SITE">IN2P3</var>
                <var name="BATCHSYSTEM">SGE</var>
                <var name="GPL_TASKROOT">/sps/glast/Pipeline2/MC-tasks/${pipeline.task}</var>
                <var name="GLASTROOT">/afs/in2p3.fr/group/glast/glastpro</var>
                <var name="GPL_SCRIPTS">${GLASTROOT}/ground/PipelineConfig/GPL/python</var>
                <var name="GPL2">${GLASTROOT}/ground/PipelineConfig/GPLtools/prod/</var>
                <var name="GPL_XROOTD_DIR">/afs/in2p3.fr/group/glast/glastpro/xroot/bin</var>
                <var name="GPL2_MESSAGELVL">DEBUG</var>
    
                <var name="GPL_BATCHCPULIMIT">1000</var> <!-- cputime in actual seconds... -->
                <var name="GPL_BATCHVMLIMIT">4000</var> <!-- that is the memory max for now -->
                <var name="GPL_BATCHSCRATCHLIMIT">4096M</var>            <!-- need scratch space in order to define TMPBATCH -->
                <var name="PIPELINE_LOGFILE">logFile.txt</var>
          </variables>
          <process name="runMonteCarlo">
                  <job batchOptions=" -l fsize=${GPL_BATCHSCRATCHLIMIT}" maxCPU="${GPL_BATCHCPULIMIT}" maxMemory="${GPL_BATCHVMLIMIT}" >
                    <!-- Need to redefine old BQS variables that are different in GE to work with wrapper scripts -->
                    echo "CAST OLD ENV VARS TO BQS FAKE"
                    export QSUB_HOME=${SGE_CWD_PATH}
                    export QSUB_HOST=${SGE_CELL}
                    export QSUB_SHELL=${SGE_O_CSHELL}
                    export QSUB_USER=${SGE_O_LOGNAME}
                    export QSUB_WORKDIR=${SGE_O_WORKDIR}
                    export QSUB_REQNAME=${JOB_ID}
                    export QSUB_REQID=${SGE_O_HOST}
                    export TMPBATCH=${TMPDIR}
                    ### cool that's what we needed i guess SZ
    ...
    </process>
    </task>
    </pipeline>
    
  • In addition, the runMonteCarlo.py script needs a minor modification to support the logscan. See the aforementioned task for details.
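
The status-query wrapper described above has to turn native SGE output into something the existing daemon understands. The actual ge-qselect source and its exact output format are not reproduced on this page, so the following is only a minimal sketch of the idea, assuming the default plain-text column layout of qstat. The function names (classify, parse_qstat), the sample text (including the queue name "long"), and the status names other than PENDING and RUNNING are hypothetical and chosen for illustration.

```python
# Hypothetical sketch, not the real ge-qselect: translate plain `qstat`
# output into coarse job states such as PENDING and RUNNING.

def classify(state):
    """Map an SGE state string (e.g. 'qw', 'r', 'Eqw') to a coarse status.

    The grouping below is an assumption made for this example.
    """
    if "E" in state:
        return "FAILED"      # error states such as Eqw
    if "d" in state:
        return "DELETING"    # job is being deleted
    if "s" in state.lower() or "T" in state:
        return "SUSPENDED"   # s, S and T are suspension states
    if "r" in state or "t" in state:
        return "RUNNING"     # running or transferring to a host
    if "q" in state or "w" in state or "h" in state:
        return "PENDING"     # queued, waiting or on hold
    return "UNKNOWN"


def parse_qstat(text):
    """Return a list of (job_id, status, host) tuples from qstat text."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header line, the dashed separator and blank lines.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        # Running jobs carry a "queue@host" column; pending jobs leave it empty.
        host = None
        if len(fields) > 7 and "@" in fields[7]:
            host = fields[7].split("@", 1)[1]
        jobs.append((job_id, classify(state), host))
    return jobs


# Made-up sample in the default qstat layout (not captured from a real run).
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.55500 mcjob      glast        r     06/01/2011 12:00:05 long@ccwsge0467.in2p3.fr           1
      2 0.00000 mcjob      glast        qw    06/01/2011 12:00:10                                    1
"""

if __name__ == "__main__":
    for job in parse_qstat(SAMPLE):
        print(job)
```

A real wrapper would obtain the text by running qstat itself and would then print each job in whatever layout the BQS qselect command uses. Both steps are left out here because they depend on details not given on this page.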

Test Conditions

  • On glastlnx12, the current package org-glast-jobcontrol-1.10-SNAPSHOT.jar is available in the glast home directory and can be put on the CLASSPATH for use from Jython. A sample test that does a few counts and sends back an email is contained in a Jython script in the same directory.
    GridEngineTest.jy
    glast@glastlnx12 $ setenv CLASSPATH org-glast-jobcontrol-1.10-SNAPSHOT.jar
    glast@glastlnx12 $ jython GridEngineTest.jy
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/u/gl/glast/org-glast-jobcontrol-1.10-SNAPSHOT.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/u/gl/glast/org-glast-jobcontrol-1.10-SNAPSHOT.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/resources.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/resources.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/rt.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/rt.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/jsse.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/jsse.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/jce.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/jce.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/charsets.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/charsets.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/sunjce_provider.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/sunjce_provider.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/sunpkcs11.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/sunpkcs11.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/dnsns.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/dnsns.jar'
    *sys-package-mgr*: processing new jar, '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/localedata.jar'
    *sys-package-mgr*: can't write cache file for '/afs/slac.stanford.edu/package/java/i386_linux2/jdk1.6.0_26/jre/lib/ext/localedata.jar'
    *sys-package-mgr*: can't write index file
    cycle  0  status:  Job 2217825 PENDING null  sleep for 15 seconds
    cycle  1  status:  Job 2217825 PENDING null  sleep for 15 seconds
    cycle  2  status:  Job 2217825 PENDING null  sleep for 15 seconds
    cycle  3  status:  Job 2217825 PENDING null  sleep for 15 seconds
    cycle  4  status:  Job 2217825 RUNNING ccwsge0467.in2p3.fr  sleep for 15 seconds
    cycle  5  status:  Job 2217825 RUNNING ccwsge0467.in2p3.fr  sleep for 15 seconds
    cycle  6  status:  Job 2217825 RUNNING ccwsge0467.in2p3.fr  sleep for 15 seconds
    cycle  7  status:  Job 2217825 RUNNING ccwsge0467.in2p3.fr  sleep for 15 seconds
    cycle  8  status:  Job 2217825 DONE ccwsge0467.in2p3.fr  sleep for 15 seconds
    cycle  9  status:  Job 2217825 DONE ccwsge0467.in2p3.fr  sleep for 15 seconds
    done.
    
  • In addition, Tony has updated the development pipeline with the new Java class, and tests can be run through its web interface.
  • Currently there are two tasks that are identical clones of task AG-GR-v17r35p14-IRFS76BK-allE: BQS-TEST-AG-GR-v17r35p14 and LYON-TEST-AG-GR-v17r35p14. Both tasks live in the development version of the pipeline but are stored in the usual MC-Tasks directory, i.e. /nfs/farm/g/glast/u44/IN2P3/MC-Tasks/
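
The test output above shows a simple poll-and-sleep cycle against the job control service. The Jython test script itself is not reproduced on this page, so the sketch below only illustrates that pattern in plain Python. The name poll_until_done and the get_status callable are hypothetical stand-ins for whatever client call the real script uses. The 15-second interval and the limit of ten cycles are taken from the log.

```python
import time


def poll_until_done(get_status, interval=15, max_cycles=10, sleep=time.sleep):
    """Poll get_status() until it reports DONE or max_cycles is reached.

    get_status is a hypothetical callable returning a (status, host) tuple;
    the real test script talks to the job control service instead.
    """
    history = []
    for cycle in range(max_cycles):
        status, host = get_status()
        history.append(status)
        print("cycle %d status: %s %s" % (cycle, status, host))
        if status == "DONE":
            break  # stop early instead of polling a finished job
        sleep(interval)
    return history


if __name__ == "__main__":
    # Canned status sequence so the sketch runs without a batch system.
    states = iter([("PENDING", None),
                   ("RUNNING", "ccwsge0467.in2p3.fr"),
                   ("DONE", "ccwsge0467.in2p3.fr")])
    poll_until_done(lambda: next(states), sleep=lambda seconds: None)
```

Unlike the logged run, which keeps cycling after the job reports DONE, this sketch returns as soon as DONE is seen. Passing the sleep function as a parameter keeps the loop testable without waiting.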