Blog from September, 2008

Reason for change

Server version update

We would like to upgrade the xrootd server version for the glast xrootd cluster from version
20080513-1222 to 20080828-1632. The main changes between these two versions are:

  1. Allow removing files via the redirector from data servers that are filled (GXR-37@JIRA)
  2. Allow collection of xrootd statistics by Ganglia (GXR-38@JIRA)
  3. Option to suppress reverse DNS lookups.
  4. Support for sendfile(). Improves read performance and lowers xrootd's memory usage.
  5. Fixes a bug that could cause the redirector cmsd to crash if a suspended data server
    reconnects too quickly.

The full CVS ChangeLog is available at http://xrootd.slac.stanford.edu/download/20080828-1632/ChangeLog_to_v20080513-1222

Client version update

The xrootd client tools are installed in /afs/slac.stanford.edu/g/glast/applications/xrootd. We would also like to update the PROD version from 20080513-1222 to 20080728-0933.
The main changes for the 20080728-0933 version are:

  1. Support for reading FITS files using the xrootd preload library.
  2. Fixes for xrootd bugs that caused increased memory usage during large file transfers.

We also would like to link the FITS version to PROD.

Testing

As with every xrootd version, basic tests were done: reading from and writing to xrootd, and exercising the client admin interface (rm, stat, checksum, ...).

The new version has been installed as a test version on the glast xrootd cluster which allows access to the glast data. MC jobs were run successfully against it.

The xrootd client version has been installed as the TEST, DEV and FITS versions in /afs/slac.stanford.edu/g/glast/applications/xrootd. The FITS and DEV versions have been successfully used for production activities.

Rollback

To switch the servers back to the old version, the xrootd configuration has to be reverted to the old version and the old server version restarted.

The client version is rolled back by recreating the link to the old version.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-141

Details

Server version upgrade

The following changes to the config file, xrootd.cf, are needed:

  1. use load balancing between xrootd and cmsd
  2. turn off reverse DNS lookup
  3. allow the data server to login to xrootd without authentication (needed to gather statistics with Ganglia)

Restart of xrootd

  1. Stop the redirectors
  2. Restart the data servers
  3. Start the redirectors

The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected during the restart, which removes the chance that a file is reported as not found because its data server is being restarted. The clients will wait during the restart and then reconnect to the data servers and redirectors.
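The ordering of the restart steps can be sketched as follows. This is only an illustration of the sequence: the function name, the host names and the stop/start callables are hypothetical, and the real procedure uses the site's own xrootd start and stop scripts.

```python
def restart_cluster(redirectors, data_servers, stop, start):
    """Restart a cluster in the order described above.

    stop and start are caller-supplied callables that take a host name.
    Returns the list of (action, host) steps in the order they were run.
    """
    steps = []
    # 1. Stop the redirectors so that no client is redirected mid-restart.
    for host in redirectors:
        stop(host)
        steps.append(("stop", host))
    # 2. Restart each data server.
    for host in data_servers:
        stop(host)
        steps.append(("stop", host))
        start(host)
        steps.append(("start", host))
    # 3. Start the redirectors again once every data server is back.
    for host in redirectors:
        start(host)
        steps.append(("start", host))
    return steps


if __name__ == "__main__":
    # Hypothetical host names; the callables only print what would be done.
    for action, host in restart_cluster(
        ["redirector1"],
        ["data1", "data2"],
        stop=lambda h: print("stopping", h),
        start=lambda h: print("starting", h),
    ):
        pass
```

Passing the stop and start actions in as callables keeps the ordering logic separate from whatever mechanism actually controls the daemons.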

Update the client

  1. Change the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD to point to 20080728-0933.
  2. Have FITS linked to PROD
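Switching a version link can be sketched as below. This is a minimal illustration, not the procedure actually used: the helper name repoint_link is hypothetical, the demonstration runs in a scratch directory rather than in AFS, and the atomicity of the rename is what a local POSIX filesystem provides (behaviour on AFS is an assumption). The same helper covers the rollback, by pointing PROD back at the old version directory.

```python
import os
import tempfile


def repoint_link(link_path, target):
    """Make link_path a symlink to target, replacing any existing link."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Create the new link under a temporary name in the same directory.
    tmp_path = os.path.join(link_dir, ".tmp-" + os.path.basename(link_path))
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target, tmp_path)
    # Rename it over the old link, so readers never see a missing link.
    os.replace(tmp_path, link_path)


if __name__ == "__main__":
    # Demonstration in a temporary directory with the two version names.
    with tempfile.TemporaryDirectory() as base:
        for version in ("20080513-1222", "20080728-0933"):
            os.mkdir(os.path.join(base, version))
        prod = os.path.join(base, "PROD")
        repoint_link(prod, "20080513-1222")       # state before the update
        repoint_link(prod, "20080728-0933")       # the update itself
        repoint_link(os.path.join(base, "FITS"), "PROD")  # FITS follows PROD
        print("PROD ->", os.readlink(prod))
```

Linking FITS to PROD rather than to a version directory means a later change of PROD, including a rollback, is picked up by FITS without a second step.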

Actions:

  • Install org-glast-runquality-web (version 1.2) library to PROD.
  • Install org-glast-datahandling-common (version 1.4) library to PROD.
  • Install org-glast-datacat-client (version 2.3) library to PROD.
  • Install org-glast-datacat-sp (version 2.2) library to PROD.
    • Upload org-glast-datacat-sp (version 2.2) stored procedures to PROD.
  • Install org-glast-pipeline-server (version 1.3) library to PROD.
    • Upload org-glast-pipeline-server (version 1.3) stored procedures to PROD.
  • Modify PROD pipeline startup script to use version 1.3
  • Apply patches to Database tables, adding new columns with default values to support new features.
    • this is non-destructive and does not have to be reverted in the event of a version back-out
  • Restart PROD

Estimated Length of Outage to Perform Upgrade

Approximately 20 minutes total, based on the time it took to add the new table columns in DEV.

Motivation:

  • Support for L1 Data reprocessing
  • Support in Pipeline Jython scripts for newest dataset-find routine (with significant improvements in meta-data search capabilities and bugfixes to same)
  • Ability to auto-retry failed processes
  • New ability to adjust zombie-process reaping delay
  • New command line feature enables users to deal with 'zombie' processes when the reaper is unable to (previously this required a developer's direct intervention).

Rollback Procedure:

Because of the new stored procedures that will be installed, the back-out procedure is slightly more involved than usual:

  1. Shut-down pipeline server
  2. Return server start-up script to version 1.2.5
  3. Re-upload Pipeline stored procedures v1.2.5
  4. Re-upload Data Catalog stored procedures v2.1
  5. Restart pipeline server

Note that the additional steps are quite simple to perform and only extend the outage for a back-out from ~5 minutes to ~10.

Updates that would not require backing out (with justification):

  1. Columns added to the Process and ProcessInstance tables for auto-retry support need not be removed, because they are not used by the previous pipeline version and will be ignored.
  2. PFE (Pipeline Front End) need not be reverted from 2.7 back to 2.6, because its additional features supporting process auto-retry keep working as long as the columns are not removed, and the schema additions are optional and backward-compatible.
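Why an added column with a default value can stay in place after a back-out can be illustrated with a small sketch. Everything here is an assumption made for the example: it uses SQLite only because it ships with Python (it is not the production database), the Process table is reduced to two columns, and the column name max_retries is hypothetical.

```python
import sqlite3


def make_old_schema():
    """Create a stand-in for the pre-upgrade Process table with one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Process (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO Process (name) VALUES ('task-a')")
    return conn


def apply_patch(conn):
    """Add a column with a default value; existing rows are left intact."""
    conn.execute("ALTER TABLE Process ADD COLUMN max_retries INTEGER DEFAULT 0")


def old_version_queries(conn):
    """Statements that name only the original columns, as older code would."""
    conn.execute("INSERT INTO Process (name) VALUES ('task-b')")
    return conn.execute("SELECT id, name FROM Process ORDER BY id").fetchall()


if __name__ == "__main__":
    conn = make_old_schema()
    apply_patch(conn)
    # The old-style statements still work; the new column takes its default.
    print(old_version_queries(conn))
    print(conn.execute("SELECT name, max_retries FROM Process ORDER BY id").fetchall())
```

Code that names its columns explicitly never sees the new column, which is the sense in which the schema addition is ignored by the previous version.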

Associated Jira:

SSC-135@JIRA

Details

Pipeline 1.3


DataCat Client 2.3



Pipeline Front End 2.7

