Blog from September, 2009

The pipeline can now be scheduled for shutdown by creating a file in the pipeline installation directory.  The file must be named "shutdown_schedule" and must contain exactly two lines, each a date in a form understood by the unix date command.  Once the first date has passed, the monitor (which runs every 5 minutes in cron) will shut down the pipeline and not restart it until the second date has passed or the file has been removed.  (The second date can also be changed to the current time to force the monitor to restart the pipeline on its next execution.)

As an example, the following file will be used to turn off the pipeline during the Sept 30th computing center 1st-floor power outage:

[dflath@glastlnx13 prod]$ pwd
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod
[dflath@glastlnx13 prod]$ cat shutdown_schedule
Wed Sep 30 04:25:00 PDT 2009
Wed Sep 30 17:00:00 PDT 2009
[dflath@glastlnx13 prod]$
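For reference, below is a minimal sketch of the kind of check the monitor performs on each cron run. This is illustrative only and is not the actual monitor code; it assumes GNU date and uses placeholder start/stop commands for the pipeline.

  #!/bin/bash
  # Illustrative sketch only -- not the real monitor.
  SCHEDULE=/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/shutdown_schedule
  if [ -f "$SCHEDULE" ]; then
      start=$(date -d "$(sed -n 1p "$SCHEDULE")" +%s)   # first line: start of shutdown window
      end=$(date -d "$(sed -n 2p "$SCHEDULE")" +%s)     # second line: end of shutdown window
      now=$(date +%s)
      if [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]; then
          echo "inside shutdown window: stopping the pipeline"
          # stop-pipeline.sh   (placeholder for the real shutdown action)
          exit 0
      fi
  fi
  echo "outside shutdown window (or no schedule file): making sure the pipeline is running"
  # start-pipeline.sh          (placeholder for the real restart action)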

See:

https://jira.slac.stanford.edu/browse/PII-398

And:

https://jira.slac.stanford.edu/browse/SSC-228

Reason for change

We would like to upgrade the xrootd server version for the Fermi xrootd cluster from 20090202-1402 to 20090721-0636.
The main reasons for the change are an improvement in the xrootd server and a configuration change:

  1. Better handling of sendfile error recovery. In the old version some sendfile errors caused the server to disconnect the client. In the new version the server recovers from sendfile errors and does not disconnect the client. Disconnecting the client is not fatal, since after a timeout the client notices the disconnect and reconnects, but it does slow the client down.
  2. Allow production accounts to remove directories below /glast/Scratch/. So far this option has been available only for the test xrootd setup.

Testing

As with every xrootd version, basic tests were done: reading from and writing to xrootd, and exercising the client admin interface (rm, stat, checksum, ...).
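For illustration, the basic checks look roughly like the following; the redirector host and paths are placeholders, and the exact admin-client invocations may differ slightly.

  xrdcp /tmp/testfile root://<redirector>//glast/Scratch/xrootd-test/testfile        # write test
  xrdcp root://<redirector>//glast/Scratch/xrootd-test/testfile /tmp/testfile.back   # read test
  xrd <redirector> stat /glast/Scratch/xrootd-test/testfile                          # admin interface: stat
  xrd <redirector> rm /glast/Scratch/xrootd-test/testfile                            # admin interface: rm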

The new version has been installed as a test version on the Fermi xrootd cluster, which allows access to the glast data. Read and write tests were performed against the new version. Reprocessing test jobs were run successfully against the server, and the new version was also used for L1 tests.

The test xrootd has been set up for directory removal (rmdir). It has been successfully used for some production testing.
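For example, a directory removal against the test setup would look something like this; the host and path are placeholders, and the exact client command may differ.

  xrd <test-redirector> rmdir /glast/Scratch/<production-account>/old-run-dir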

Rollback

To switch the servers back to the old version, the production link has to be pointed at the old version and all xrootd servers have to be restarted.
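Sketched in shell, the rollback amounts to something like the following; the install area, link name and restart mechanism are placeholders, not the actual cluster layout.

  cd /opt/xrootd                        # illustrative install area
  ln -sfn xrootd-20090202-1402 prod     # point the production link back at the old version
  # then restart xrootd on every data server and redirector so they pick up the old build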

CCB Request

https://jira.slac.stanford.edu/browse/SSC-227

Details

To allow production accounts (glastraw, glastxrw, glastmc and glast) to remove directory trees, the xrootd forward method is used. The redirector will be configured to forward an rmdir request to all data servers. Upon receiving such a request, each data server will execute a script that first checks whether the directory is eligible for removal and then removes all files and directories below the specified directory (a sketch of such a script follows the list below). The xrootd configuration changes are:

  1. On the redirector allow forwarding of the rmdir command
  2. On the data servers specify the application that is called to remove directories. Only directories below /glast/Scratch will be allowed for removal.
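The sketch below shows what such a removal script could look like; the real script name, the mapping from the xrootd path to the local filesystem path, and any additional eligibility checks are assumptions here.

  #!/bin/bash
  # Illustrative removal helper: refuse anything outside /glast/Scratch, then remove the tree.
  dir="$1"
  case "$dir" in
      /glast/Scratch/?*) ;;                              # only trees below /glast/Scratch are eligible
      *) echo "refusing to remove '$dir'" >&2; exit 1 ;;
  esac
  rm -rf -- "$dir"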

To deploy a new xrootd version, the following steps are required:

  1. Update the xrootd config
  2. Stop the redirector
  3. Restart the data servers with the new version
  4. Start the redirectors with the new version

The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected to a data server that is in the middle of restarting, which could otherwise make a file appear to be missing. The clients will wait while the xrootds are down and reconnect once the data servers and redirectors are back up.
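As a sketch, the restart sequence could be driven like this; the host names and the service commands are placeholders, and the actual start/stop mechanism on the cluster may differ.

  # 1. update the xrootd config in the production area
  # 2. stop the redirectors
  for h in redirector1 redirector2; do ssh "$h" '/etc/init.d/xrootd stop'; done
  # 3. restart the data servers with the new version
  for h in dataserver01 dataserver02 dataserver03; do ssh "$h" '/etc/init.d/xrootd restart'; done
  # 4. start the redirectors with the new version
  for h in redirector1 redirector2; do ssh "$h" '/etc/init.d/xrootd start'; done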

Pipeline 1.3.5 is built against a patched version of the Data-Handling-Common library which allows database connections to be removed from the connection pool as they age (and replaced with freshly created connections).  It also contains monitoring and run-time configuration capabilities.
The patch has been tested in DEV and works as expected.
This is intended to address the memory leak we see on the Oracle server, which slows down the pipeline software once the application has been running for some time.  Since the Oracle memory usage goes back down when the pipeline application is restarted, we feel that the problem is probably in the long-lived, cached connections.

Jira CCB Request:  https://jira.slac.stanford.edu/browse/SSC-224