Blog

Included is a fix for PII-424 (Logfiles from Lyon not shown).

Also included is a fix for a bug where the 2.1 schema allowed jobsites longer than 64 characters (64 is the database limit).

Jira:

https://jira.slac.stanford.edu/browse/SSC-349

Timeline

I plan to release the new version to dev on March 19. I plan to release the new version to Prod on March 26.

Motivation:

1.4.1 adds support for cancelProcessInstance and parameterized jobsites in a batch process, and fixes bugs.

Notable new features:

1. 'Split mode' has been implemented as follows:

A process element in your XML definition can now contain a "site" attribute. This attribute overrides the default site, which is typically resolved through the task type. The site can be static (e.g. SLAC, SLACMC, SLACRAW), a simple expression using a variable, or an even more complex expression. Currently the expressions are limited to 64 characters, but we'll probably raise that if needed.

Examples include:

Static jobsite resolution:

 <process name="jobsiteProcess" site="SLAC">

Variable resolution [1]:

 <process name="jobsiteProcess" site="${jobsite}">

Complex conditional (if pipeline.stream is even, use SLAC; otherwise use SLACMC):

 <process name="jobsiteProcess" site="${pipeline.stream %2 == 0 ? &quot;SLAC&quot; : &quot;SLACMC&quot;}">

[1] The jobsite variable can be declared with a default in the XML file, as you would with any variable, and it can also be overridden at stream creation time.

2. cancelProcessInstance support in the linemode client.

This is a limited, fine-grained tool for killing a specific submitted or running job; it is not the broad tool that cancelStream will need to be. I've been working on cancelStream, but I couldn't get a robust, fast, tested version ready in time for this release.

Example:

 pipeline cancelProcessInstance TaskName/SubTask processName 0.2 
  • where 0.2 is the streampath. The full taskpath, processname, and streampath are required. It can only kill one job at a time (but it's fast!)

3. New MBean for overriding the total number of submitted and running jobs, in order to limit job submission. You can set this per site and/or per task.

A user can add an override on the JMX admin page.

Whereas the old maxjobs setting is really a "max submitted", the MaxJobsTotal and MaxJobsTotalPerTask overrides limit the maximum number of submitted AND running jobs.

For instance, you could set MaxJobs to 200 and MaxJobsTotal to 1000. The pipeline would then never allow more than 200 jobs in the submitted state, and it would not accept any further submissions if there were, say, 100 submitted jobs and 900 running jobs (for a total of 1000).
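
As a rough illustration of the semantics only (this is a sketch, not the server's code, and max_jobs/max_jobs_total are not the real MBean attribute names), the check amounts to:

# Hypothetical illustration of the two limits; attribute names are made up.
def may_submit(submitted, running, max_jobs=200, max_jobs_total=1000):
    """Return True if one more job may be submitted for a given site/task."""
    if submitted >= max_jobs:                  # old-style limit: submitted jobs only
        return False
    if submitted + running >= max_jobs_total:  # new limit: submitted AND running jobs
        return False
    return True

# 100 submitted + 900 running already totals 1000, so nothing more is accepted:
assert not may_submit(100, 900)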

4. Jython scriptlets can now use sys.exit(), exiting with a code of 0 for success or any other code for failure (see the sketch after this list).
5. Stream creation should be much quicker in most circumstances. 
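
As an example of the new sys.exit() support in item 4, a scriptlet might end like this (the file path and the check are purely illustrative, not part of any real task):

# Illustrative Jython scriptlet; the path below is hypothetical.
import os
import sys

if not os.path.exists("/scratch/example/input.root"):
    sys.exit(1)   # any non-zero code marks the scriptlet as failed

sys.exit(0)       # 0 marks success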

Components to be upgraded:

Here is a list of the changes that will occur:

  • Stored procedures and Server move to version 1.4.1
  • Client:
    • org-srs-pipeline-client-1.1.0.jar -> org-srs-pipeline-client-1.1.1.jar

Related Jira Issues:

https://jira.slac.stanford.edu/browse/SSC-346

Issues relevant to this release:

https://jira.slac.stanford.edu/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+PII+AND+fixVersion+%3D+%221.4.1%22+AND+status+%3D+Resolved+ORDER+BY+priority+DESC&mode=hide1

Backward Compatibility:

Fully backward compatible

Upgrade Procedure:

Deploy new jars. Deploy stored procedures

Back-out Procedure:

Redeploy old jars. Redeploy old stored procedures

Background

The latest version of our login filter had a bug that broke POST requests after a minute of web inactivity, which visibly affected the Pipeline, merit skimmer, and astro skimmer. That bug has been fixed.

Versions

Pipeline front end from 3.2.1 to 3.2.2.

Merit skimmer from 1.8 to 1.8.1.

Astroserver from 1.4.5 to 1.4.6.

https://jira.slac.stanford.edu/browse/SSC-341

Background

Max is deploying some new versions of the login filter/group manager. The current applications will no longer work with the new version of the group manager, so applications needed to be updated. The datacatalog and pipeline versions in subversion have had some bug fixes. 

I had to back out the previous release of the pipeline front end ( https://confluence.slac.stanford.edu/display/ds/2012/05/30/CCB+Request+-+Release+Pipeline+web+front+end+3.1 ) after some problems with Oracle were found. Those problems have since been fixed. The version to be released now is 3.2.1. That earlier release had some large changes (most notably, caching for the main page to dramatically speed up load time), and those changes are included in this version.

The datacatalog has had very few changes since the last release. Its version will be 1.10.

Versions

Pipeline front end from 3.0 to 3.2.1.

Datacatalog front end from 1.10-SNAPSHOT (Actually just 1.9 with a fix) to 1.10.

Everything is backed up in case we need to back out.

https://confluence.slac.stanford.edu/display/ds/2012/08/27/CCB+Request+to+Update+Login+Filter+and+CAS+authentication

https://jira.slac.stanford.edu/browse/SSC-338

There are two parts to this request:

  1. Update the login filter for all Fermi Web applications
  2. Update the version of CAS

Login Filter update

The login filter is what allows Fermi web applications to communicate with the Single-Sign-On CAS server. 

In its original design it made use of a domain-specific cookie to handle the propagation of login/logout actions across Fermi web applications. Since then, the login filter has come into use by other experiments, and the domain-specific cookie provides a barrier across which the SSO information cannot propagate.

The logic of the filter has been updated to span sub-domains: for example, it will be possible to carry the login/logout information across glast-ground.slac.stanford.edu, srs.slac.stanford.edu, and exo-data.slac.stanford.edu.

The overall logic is more sound and will be easier to explain in future security reviews.

This change is NOT backward compatible. This means that when going from an application with the new login filter to one with the old filter (or vice versa), the login/logout information will not be carried over, and a user might have to log in or out again.

For the change to be effective, the login filter for all Fermi applications (and other experiments as well) will have to be updated.

CAS Server update

We are currently using a very old version of the CAS server that provides authentication against the SLAC Kerberos server.

We have a new version of CAS that now interfaces with the SLAC Crowd server. This means that SLAC users will still be able to log in with their Unix/Windows credentials, but in addition the Single-Sign-On will be extended to Confluence and Jira. In other words, logging in to Jira or Confluence will authenticate users against the new Login Filter. The reverse is also true: logging in to one of the Fermi web applications will automatically carry over to Confluence and Jira.

It will also be possible to authenticate non-SLAC users against their Confluence account. This might be of no interest to Fermi, but other experiments might like this feature.

The API for the new CAS server is backward compatible with the old one, so it should be easy to put the old CAS server back in production if needed.

Motivation:

Lots of bug fixes, plus some new features: caching for the main (index.jsp) page for faster load times on average, and code syntax highlighting with line numbering for viewing scriptlets.

Components to be upgraded:

org-srs-pipeline-web-3.0 to org-srs-pipeline-web-3.1

Related Jira Issues:

Resolved bugs
SSC-323

Backward Compatibility:

No backward compatibility issues should occur.

Upgrade Procedure:

1. Deploy to tomcat

Back-out Procedure:

The current version can be restored if needed.

Motivation:

The current 10g client is buggy and won't allow OCI/wallet connections to the astro database for the new Pass 7 Oracle account. The 11g drivers work fine, but this change affects all web applications, as we can't use two different drivers simultaneously.

In Summary:

Drivers will be updated from 10g to 11g.

Components to be upgraded:

Symlinks in the tomcat common directories will be updated for all tomcat webservers.

Related Jira Issues:

SSC-317

Backward Compatibility:

We tested on various applications without seeing any complications.

Upgrade Procedure:

Change symlinks to the new driver, restart tomcat server.

Back-out Procedure:

Change symlinks back to the old driver, restart tomcat server.

Reason for change

This is a request to upgrade the xrootd client tools from version 20091008-2019v1 to v3.1.1. The new version contains critical fixes to xrdcp.

Testing

The new version has been installed in the Fermi xrootd application area and is available through the TEST link. All tools (xrdcp, xrd.pl, xrdls, and xrdprel) were tested against the test and production xrootd servers running on rhel5 and rhel6, on both 32- and 64-bit architectures.

Rollback

The client version is rolled back by recreating the link to the old version.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-314

Details

The current xrdcp version can return a zero exit code (success) when it writes to an xrootd server whose disk is full: the server sends an error, but xrdcp continues to write. The new version will fail immediately.
The new version will also pass the file size to the server, which the server will use to select the disk partition the file will be written to.
All client tools will be updated: xrd.pl, xrdls, xrdcp, xrdprel and xrd.

In order to update the version, the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD has to be changed to point to dist/v3.1.1/@sys. This is an atomic operation and clients should not fail because of this change.
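
For the record, one way to make such a link swap atomic is to create the new link under a temporary name and rename it over PROD. This is only a sketch of that idea, not the actual upgrade procedure:

# Hypothetical sketch of an atomic symlink swap; not the actual upgrade command.
import os

def repoint(link, new_target):
    """Point 'link' at 'new_target' with no window where the link is missing."""
    tmp = link + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)   # create the replacement link under a temp name
    os.rename(tmp, link)          # rename() atomically replaces the old link

repoint("/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD", "dist/v3.1.1/@sys")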

Motivation:

The 1.4.0 stored procedures had a bug which prevented certain rollback operations. That bug has been patched with a one-line change to the code in the 1.4.0 version of the stored procedures.

In Summary:

Components to be upgraded:

org-srs-pipeline-sp-1.4.0 -> org-srs-pipeline-sp-1.4.0.1

Related Jira Issues:

https://jira.slac.stanford.edu/browse/PII-419
https://jira.slac.stanford.edu/browse/SSC-310

Backward Compatibility:

No backward compatibility issues will occur.

Upgrade Procedure:

  • Stop pipeline
  • Upload stored procedures (~ 2 minutes)
  • Start pipeline

Back-out Procedure:

The current version can be restored if needed.

Motivation:

1.4.0 has been a unification and refactoring of code between the 1.4 branch and the 1.3 branch.  The 1.4 branch was originally split off for support of EXO and other experiments, but features were added to the 1.3 branch that weren't propagated to the 1.4 branch.  These features and bug fixes have been merged between the two branches.  Furthermore, from Fermi's perspective, we have completely divorced the pipeline from glast now.

The proposed date for the request is the 18th of January (a Wednesday).

In Summary:
  • The 1.4.0 release includes a new interface for scripts. It also has some fixes in the original interface where certain methods weren't implemented correctly.
  • The naming changes mentioned above (glast to srs), and a move to Subversion
  • The stored procedures have been split out into their own module as well.
  • Jython has been upgraded from 2.1 to 2.5
  • Maintainability!

Components to be upgraded:

The 1.4.0 branch affects multiple modules in different ways.  For most modules, this means simply that we had to edit the import statements to import from the correct namespace (now org.srs.pipeline.*).  For others, like the runquality and datacatalog plugins, we had to modify a file in META-INF that denotes the plugin.  Similarly, we also updated org-glast-jobcontrol to be org-srs-jobcontrol.

Here is a list of the changes that will occur:

  • Stored Procedures for the Pipeline (new module split out of the Pipeline server): Stored procedures updated with new naming, moved into subversion.
  • Client:
    • org-glast-pipeline-client-1.0.2.jar -> org-srs-pipeline-client-1.1.0.jar
  • Plugins: 
    • org-glast-datacat-client-2.3.2.jar -> org-srs-datacat-client-2.4.0.jar
    • org-glast-datacat-sp-2.2.1.jar -> org-srs-datacat-client-sp-2.2.2.jar
    • org-glast-runquality-jar-1.3.4.jar -> org-glast-runquality-jar-1.4.0.jar
    • org-glast-historyprocessing-1.0-SNAPSHOT.jar -> org-glast-historyprocessing-1.0.jar
  • Web: Pipeline front end is affected through the update to the pipeline client
    • org-glast-pipeline-web-2.8.5.jar -> org-srs-pipeline-web-3.0.0.jar
  • JobControl:  JobControl has also been moved to srs naming and subversion.
    • org-glast-jobcontrol-1.10.jar -> org-srs-jobcontrol-2.0.0.jar

Related Jira Issues:

https://jira.slac.stanford.edu/browse/PII/fixforversion/12283
https://jira.slac.stanford.edu/browse/SSC-308

Backward Compatibility:

No known compatibility issues should arise with this release.

Upgrade Procedure:

Migrating Pipeline II systems from 1.3 to 1.4

The procedure is somewhat complicated, but most of the steps are executed while the pipeline is online. I expect the upgrade of prod to be completed successfully within 15 minutes.

Back-out Procedure:

The current version can be restored if needed.

Motivation:

  • The 1.3.6 release includes features that will allow Level-1 processing to move off of the AFS buffer machines and onto XRootD. This will allow for increased performance and reliability.
  • There is also a minor bug-fix which addresses a problem with incorrect pipeline-file versions being retrieved during rollback operations.

Components affected:

The 1.3.6 release is a new version of the org-pipeline-server module only. No dependent libraries have changed and there are no database structure or stored procedure changes.

Related Jira Issues:

https://jira.slac.stanford.edu/browse/SSC-268

https://jira.slac.stanford.edu/browse/PII/fixforversion/12140

Backward Compatibility:

The release is fully backward compatible. Although the 1.3.6 release involves new features added to the jython-scriptlet and variable-expression-parsing interfaces, all previous features remain available.

Upgrade Procedure:

The upgrade requires only a simple restart of the pipeline server application.

Back-out Procedure:

If problems are discovered in the 1.3.6 release, the current release version (1.3.5) can be restored with a simple restart of the pipeline server application. However, any new tasks using the functionality available in 1.3.6 will fail to execute if a back-out of the new version is performed.

Computing Newsletter 200912

Computing Newsletter

Disk Space

We continue to add about 5TB of new data a week, bringing the total size of LAT data at SLAC to 600TB (400TB Level 1 output, 180TB MC, and 20TB reprocessed data). Four new 30TB file servers arrived at SLAC on Monday, and hopefully by the time you read this the first one will have been installed, averting the need to store older data only on tape. Although the LAT data is all kept on RAID arrays with multiple redundant drives, we have also acquired an additional 250TB of tapes so that we can continue to keep all data backed up in case of unanticipated disk problems.

New user disk space (too successful?)

The new user disk space, /afs/slac.stanford.edu/g/glast/users, is gaining users and usage. Currently, 3 TB of disk space is allocated (of which about 1/2 is actually used) amongst 167 users. In addition, there are nine science groups with allocations totaling 345 GB. From this perspective, the new user space is a success.

However, a usage pattern for user space has emerged that is stressing the server. Submitting 100s of simultaneous batch jobs can cause the server to become non-responsive which, in turn, causes batch jobs to stall and eventually fail. In addition, interactive users attempting to access this space will be unsuccessful. The SLAC Computing Division has been alerted of this issue with hopes for improved service. In the meantime, please be aware of the possibility that one can overload that server and affect other users. Batch jobs should be limited to prevent such overloading conditions. This can be done by dribbling in batch jobs a few at a time while monitoring the server. When the CPU utilization exceeds ~50%, you are entering the danger zone. General guidelines for using the SLAC batch system in any substantial way (>50 simultaneous jobs) should also be read to avoid making common mistakes which can unnecessarily burden the file servers.

Pass 7.2 reprocessing

The first 14 months of survey data was reprocessed in October and November with the new Pass 7.2 event classification. The data sample extends from run 239557414 (2008-08-04 15:43:34 UT) through run 277596392 (2009-10-18 22:06:32 UT), spanning 6581 runs and including over 14 billion events. The C&A group is currently evaluating this reprocessed data and, depending on their findings, there may be another reprocessing cycle early next year. See the C&A pages for additional details: Pass7.2 planning, and Analysis User Forum.

Computer Security Training (SLAC)

An email went out earlier this month to all Fermi LAT collaborators who had not yet completed the required computer security training. We were informed by the SLAC Cyber Security Team that beginning in January 2010, all non-SLAC employees who had not completed this training would have their SLAC computer accounts disabled. (The deadline for SLAC employees was July 2009.) These accounts are used for access to the glast-ground web site including data access, interactive logins (Linux), email access, and access to a variety of other web-based services. Don't get stuck!

It is already the case (since Oct 2009) that users who need their passwords reset by an administrator must have first completed this training.

For more information on the course, or if you have questions, contact Marilyn Cariola in SLAC Computer Security at 650-926-2820 (email mcariola@slac.stanford.edu).

LAT Workbook developments

The LAT Workbook continues to evolve, expand and be updated. Some highlights since the last newsletter include: LAT GRBanalysis (new); User and group disk space; Using the SLAC batch farm; pylikelihood analysis (updated); Science Tools environment setup (update, including new SCons section); new astroserver examples; ASP Data Viewer help (updated). View a full chronicle of the updates, or, better yet, just browse through the Workbook

Good-bye and good luck to Navid!

From an email from Richard last month:

A bittersweet tale: Navid will be leaving us at the end of November to take up a position in Earth Sciences at Goddard. He did his Masters with them and will carry on with a PhD.

Navid has been a key player in our success to date and we can barely thank him enough for all he has done. Wish him the best...

In the meantime, we will be trying to figure out how to fill his shoes. No easy task given the breadth and depth of his contributions. He is doing a mind meld with Confluence now. At least he will be around to answer questions!

Sniff,
Richard

SCons On the Way

(Joanne Bogart)

SAS is in the process of changing build systems from CMT to SCons. This has been going on a long time (first investigation was over two years ago!) and isn't done yet, but the end is nearly in sight. ScienceTools builds are already being generated with the new SCons Release Manager as well as the CMT Release Manager; GlastRelease will take a few months longer. Should you care? That depends on how you use SAS software. End users who use pre-built binaries will see at most small differences in set up procedures. Those who build from sources, e.g. because their platform is unsupported, will need to know something about how SCons works and how we've chosen to use it. Developers need the same kind of understanding of SCons machinery - the equivalent of requirements files, GlastPolicy package, and so forth - as they currently have for CMT, and may also benefit by learning about SCons analogs of MRvcmt and the CMT Release Manager web pages. Look for more information in a future Newsletter as this transition nears production and documentation becomes available.

Possible topics for this newsletter...

Level 1 developments

New astroserver release

Possible topics for the next newsletter (April 2010?)...

HPSS update

Pass 8 reprocessing

Reason for change

This is a request to upgrade the xrootd client tools from version 20090202-1402v2 to 20091008-2019v1. The main reason for the update is that the new version runs on rhel5 whereas for the old version certain commands did not work properly on rhel5.

Testing

The new version has been installed in the Fermi xrootd application area and is available through the DEV link. All tools xrdcp, xrd.pl, xrdls and xrdprel were tested against the test and production xrootd server running on rhel3, rhel4-32, rhel4-64, rhel5-64 and rhel5-32.

Rollback

The client version is rolled back by recreating the link to the old version.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-232

Details

Some of the client tools do not work on rhel5. The reason is that there is a mismatch between the bitness of the executable and the xrootd libraries. For example, perl is a 32-bit executable on all architectures, but on rhel5 perl attempts to load the 64-bit xrootd libraries, which fails.

The new version addresses the problem in the following way:

  • rhel3 (32-bit) and rhel4-64 (64-bit) xrootd versions are installed.
  • if xrdcp or xrd.pl is executed, the 32-bit version of the xrootd release is used.
  • for xrdls the bitness is obtained from the system (which matches the one for ls)
  • for xrdprel the bitness is guessed from the command that is run with the preload library. If the bitness cannot be guessed (shell scripts, for example), the system's default is used
  • the options -32 or -64 are available to force use of the 32-bit or 64-bit xrootd release respectively.

Besides addressing the bitness issue, a new xrootd version will be used: 20091008-2019. There are a few improvements for the xrootd client and xrdcp. In particular, the annoying message about the xrootd client version is not printed anymore.

In order to update the version, the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD has to be changed to point to dist/20091008-2019v1/@sys. This is an atomic operation and clients should not fail because of this change.

The pipeline can now be scheduled for shutdown by creating a file in the pipeline installation directory. The file must be named "shutdown_schedule" and must contain exactly 2 lines, both of which are dates in the form understood by the unix date command. When the first date has passed, the monitor (which runs every 5 minutes in cron) will shut down the pipeline and not restart it until the second date has passed or the file has been removed. (The second date could also be changed to the current time in order to force the monitor to restart the pipeline on its next execution.)

As an example, the following file will be used to turn off the pipeline during the Sept 30th computing center 1st-floor power outage:

[dflath@glastlnx13 prod]$ pwd
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod
[dflath@glastlnx13 prod]$ cat shutdown_schedule
Wed Sep 30 04:25:00 PDT 2009
Wed Sep 30 17:00:00 PDT 2009
[dflath@glastlnx13 prod]$
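
For reference, the monitor's decision amounts to something like the sketch below. This is only an illustration with hypothetical function names; the real monitor is a cron script and presumably relies on the unix date command itself, which is what the subprocess call mimics here:

# Hypothetical sketch of the scheduled-shutdown check; not the monitor's actual code.
import os
import subprocess
import time

def _epoch(date_string):
    """Use GNU date to turn a date(1)-style line into epoch seconds."""
    out = subprocess.check_output(["date", "-d", date_string, "+%s"])
    return int(out.decode().strip())

def pipeline_should_be_down(install_dir):
    """True if the current time falls between the two dates in shutdown_schedule."""
    path = os.path.join(install_dir, "shutdown_schedule")
    if not os.path.exists(path):
        return False                      # no file: keep the pipeline running
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    start, stop = _epoch(lines[0]), _epoch(lines[1])
    return start <= time.time() <= stop   # restart once the second date has passed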

See:

https://jira.slac.stanford.edu/browse/PII-398

And:

https://jira.slac.stanford.edu/browse/SSC-228

Reason for change

We would like to upgrade the xrootd server version for the Fermi xrootd cluster from 20090202-1402 to 20090721-0636.
The main reasons for the change are an improvement in the xrootd server and a configuration change:

  1. Better handling of sendfile error recovery. In the old version some sendfile errors caused the server to disconnect the client. In the new version the server recovers from the sendfile errors and does not disconnect the client. Disconnecting the client is not fatal as after a timeout the client notices the disconnect and reconnects again, but it will slow down the client.
  2. Allow production accounts to remove directories below /glast/Scratch/. So far this option has been available only for the test xrootd setup.

Testing

As with every xrootd version, basic tests were done: reading from and writing to xrootd, and testing the client admin interface (rm, stat, checksum, ...).

The new version has been installed as a test version on the Fermi xrootd cluster which allows access to the glast data. Tests were performed to read and write to the new version. Reprocessing test jobs were successfully run against the server and the new version was also used for L1 tests.

The test xrootd has been set up for directory removal (rmdir). It has been successfully used for some production testing.

Rollback

To switch the servers back to the old version, the production link has to be set to the old version and all xrootd servers must be restarted.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-227

Details

To allow production accounts (glastraw, glastxrw, glastmc and glast) to remove directory trees, the xrootd forward method is used. The redirector will be configured to forward an rmdir request to all data servers. Upon such a request, each data server will execute a script that first checks whether a directory is eligible for removal and then removes all files and directories below the specified directory (a sketch of such a script follows the list below). The xrootd configuration changes are:

  1. On the redirector allow forwarding of the rmdir command
  2. On the data servers specify the application that is called to remove directories. Only directories below /glast/Scratch will be allowed for removal.
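
As a rough sketch only (this is not the deployed script), the eligibility check plus removal described above could look like the following, assuming the path seen by the data server matches the /glast/Scratch/ namespace:

# Hypothetical sketch of the directory-removal helper; not the actual script.
import os
import shutil
import sys

ALLOWED_PREFIX = "/glast/Scratch/"

def remove_tree(path):
    """Refuse anything outside /glast/Scratch/, then remove the whole tree."""
    path = os.path.normpath(path)
    if not path.startswith(ALLOWED_PREFIX):
        sys.exit("refusing to remove %s: not below %s" % (path, ALLOWED_PREFIX))
    shutil.rmtree(path)   # removes all files and directories below the directory

if __name__ == "__main__":
    remove_tree(sys.argv[1])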

To deploy a new xrootd version the following steps are required:

  1. Update the xrootd config
  2. Stop the redirector
  3. Restart the data servers with the new version
  4. Start the redirectors with the new version

The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected and avoids the chance that a file is not found because a data server is being restarted. The clients will wait while the xrootds are down and reconnect once the data servers and redirectors are up.