Blog

1.3.5 is built against a patched version of the Data-Handling-Common library which allows database connections to be removed from the connection pool as they age (and replaced with freshly created connections). It also contains monitoring and run-time configuration capabilities.
The patch has been tested in DEV and works as expected.
This is intended to address the memory leak we see on the Oracle server, which slows down the pipeline software when the application has been running for some time. Since the Oracle memory usage goes back down when the pipeline application is restarted, we feel that the problem is probably in the long-lived, cached connections.
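
To illustrate the idea of retiring pooled connections as they age, here is a minimal sketch in Python. It is not the Data-Handling-Common implementation; the class name `AgingPool`, its methods, and the counters are hypothetical and only show the general technique (age check on acquire and release, a limit that can be changed at run time, and simple counters for monitoring).

```python
import time
from collections import deque


class AgingPool:
    """Toy connection pool that retires connections older than max_age_s.

    Hypothetical illustration only; not the Data-Handling-Common code.
    """

    def __init__(self, factory, max_age_s, clock=time.monotonic):
        self._factory = factory      # callable that opens a new connection
        self._max_age_s = max_age_s  # age limit in seconds
        self._clock = clock          # injectable clock, handy for testing
        self._idle = deque()         # (connection, created_at) pairs
        self.created = 0             # monitoring counters
        self.retired = 0

    def set_max_age(self, seconds):
        # Run-time configuration: change the age limit without a restart.
        self._max_age_s = seconds

    def acquire(self):
        # Drop idle connections that are too old, then reuse or create one.
        while self._idle:
            conn, born = self._idle.popleft()
            if self._clock() - born < self._max_age_s:
                return conn, born
            self._close(conn)
        self.created += 1
        return self._factory(), self._clock()

    def release(self, conn, born):
        # Keep the connection unless it aged out while it was in use.
        if self._clock() - born < self._max_age_s:
            self._idle.append((conn, born))
        else:
            self._close(conn)

    def _close(self, conn):
        self.retired += 1
        close = getattr(conn, "close", None)
        if close:
            close()
```

With a real driver, `factory` would be whatever function opens a database connection; old connections are then closed and replaced gradually instead of living for the whole lifetime of the application.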

Jira CCB Request:  https://jira.slac.stanford.edu/browse/SSC-224

Details are in the following page:

http://confluence.slac.stanford.edu/x/_gRzAw

I intend to perform the clean-up and foreign-key creation during the all-day Computing Center power outage on Tuesday, August 25th (6am-5pm).

Oracle support insists that we use a Java 1.6 driver with our Java 1.6 application before they will give us more help on the PGA usage problem.

We will be moving to ojdbc6.jar version 11.1.0.7 and the associated native libraries provided in the Oracle client software.

SSC jira is here:

https://jira.slac.stanford.edu/browse/SSC-208

Jira Release info for pipeline project is here:

https://jira.slac.stanford.edu/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=ASC&sorter/field=priority&pid=10360&fixfor=11960

Reason for change

This is a request to upgrade the xrootd client tools from version 20080728-0933v1 to 20090202-1402v2.
In the new version, xrdcp is able to overwrite a file that is located on a data server that has no space left. The current xrdcp fails in this case.

Testing

The new version has been installed in the Fermi xrootd application area and is available through the DEV link. All tools (xrdcp, xrd.pl, xrdls and xrdprel) were tested against the test and production xrootd servers.

Rollback

The client version is rolled back by recreating the link to the old version.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-202

Details

The current xrdcp version fails to overwrite a file that is on a data server that has no free space left, because the redirector will not redirect the client. The new xrdcp version, however, first removes the file and then writes it to a new server.

The other client tools have not been changed, except for xrd.pl, which gained an option to remove a directory tree. This option is currently not applicable to the production xrootd.

In order to update the version the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD has to be changed to point to dist/20090202-1402v2/@sys
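
Since both the update and the rollback come down to re-pointing one symbolic link, the step can be sketched as below. This is a minimal sketch that assumes the link lives on a POSIX-like file system where renaming over an existing link is allowed; the helper name `repoint_link` and the `.tmp` suffix are hypothetical. It returns the previous target so that it can be kept for a rollback.

```python
import os


def repoint_link(link_path, new_target):
    """Point link_path at new_target and return the previous target (or None)."""
    old_target = os.readlink(link_path) if os.path.islink(link_path) else None
    tmp = link_path + ".tmp"        # hypothetical temporary name
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)     # build the new link next to the old one
    os.replace(tmp, link_path)      # rename it over the old link in one step
    return old_target
```

Calling `repoint_link(link, old_target)` with the value returned by the first call restores the previous version.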

Reason for change

The xrootd redirectors are configured to forward a file remove request to all of their data servers. Therefore we would like to configure the redirectors so that clients have to authenticate themselves and only one production account is authorized to remove files.

Testing

The Fermi xrootd test setup was configured to use authentication/authorization for the redirectors and data servers:
1) only glastxrw was allowed to remove files (through redirector or data server)
2) all clients were allowed to read/write files if connected to the redirector
3) only Fermi users were allowed to read files from the data servers
4) only Fermi production accounts were allowed to write files

These rules were tested using four accounts: a Fermi user (read-only access), a production account, the account that has privileges to remove files, and a non-Fermi user account.

Rollback

The configuration can be rolled back by using the previous xrootd configuration and authorization. A restart of the xrootd redirectors is needed.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-199

Details

Authentication and authorization are required for all of the xrootd data servers in order to restrict access to the Fermi data to Fermi members only. Write and remove privileges are granted to production accounts only. No restrictions were needed for the redirectors, as all they did was redirect clients to the data servers.

The redirectors have been reconfigured so that they are able to remove files, and therefore authentication and authorization have to be enabled on them.
The same authentication scheme as used for the data servers will be used and the authorization will be very simple:
1. All users are allowed to read and write files (this is later restricted by the data servers)
2. Only glastxrw is allowed to remove files

For the data servers we would like to change the authorization so that only the glastxrw user is able to remove files (so far other production accounts are also allowed).
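
The two layers of rules can be sketched as a small decision function. This is only an illustration of the policy described here, not the xrootd authorization file syntax. The function names are hypothetical, the account sets are passed in by the caller, and the sketch assumes that production accounts may also read.

```python
REMOVE_ACCOUNT = "glastxrw"   # the only account allowed to remove files


def redirector_allows(user, op):
    # Redirector: everyone may read and write; only glastxrw may remove.
    if op in ("read", "write"):
        return True
    return op == "remove" and user == REMOVE_ACCOUNT


def data_server_allows(user, op, fermi_users, production_accounts):
    # Data server: reads for Fermi members (assumed to include production
    # accounts), writes for production accounts, removes for glastxrw only.
    if op == "read":
        return user in fermi_users or user in production_accounts
    if op == "write":
        return user in production_accounts
    return op == "remove" and user == REMOVE_ACCOUNT


def allowed(user, op, fermi_users, production_accounts):
    # A request that goes through the redirector has to pass both layers.
    return redirector_allows(user, op) and data_server_allows(
        user, op, fermi_users, production_accounts)
```

The redirector layer is deliberately permissive for reads and writes because the data servers apply the stricter check afterwards.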

After changing the authorization files and the xrootd config file, the xrootd processes on the redirectors have to be restarted in order to activate the changes.
The data servers do not need to be restarted, as they reread the authorization file periodically.

2.3.3 fixes a bug that prevents (large) dataset searches issued from the linemode client from completing.

See https://jira.slac.stanford.edu/browse/SSC-197 for details.

The current 10.2.0 driver used by applications needing passwordless (wallet) Oracle database access has a bug which prevents it from connecting to Oracle from a client machine which has been up for more than 200 days (32-bit Linux kernel version 2.6) or 248 days (32-bit Linux kernel version 2.4).

A patched version of the client tools has been made available which fixes this problem.

Change details:

We have a central configuration script (actually two, one each for Unix sh and csh) which requires a 1-line change to point everything to the new driver.

Rollback details:

Backing out the change is only a matter of restoring the original pointer.

Testing:

I've tested both clients on both 32-bit and 64-bit linux machines and the results are consistent.

CCB Request:

https://jira.slac.stanford.edu/browse/SSC-187

Reason for change

This is a request to change the xrootd server version and have a minor update of the client tools.

Server version update

We would like to upgrade the xrootd server version for the Fermi xrootd cluster from 20080828-1632 to 20090202-1402. The main change between these two versions is:

  1. Improved and fixed handling of checksum requests by the server. This fixes issues that could cause checksum requests to hang and cause high CPU usage by the xrootd server.

Due to this issue the crawler is currently not using the production xrootd but the test xrootd that runs the new version.

Client version update

The first time an xrootd client connects to a cluster, it tries FirstConnectMaxCnt times to connect before it fails. The default for this number is 150, but xrd.pl overrides it and sets it to 10. Therefore xrd.pl fails after about 3.3 min (the wait between connection attempts is 20 sec), whereas with the default setting the client fails only after 50 min. This is important because during an outage, which typically lasts 5-30 min, we stop the redirector to keep clients from being redirected, and with the short wait time xrd.pl might fail.
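
The arithmetic behind these numbers is simply the number of attempts times the wait between attempts. A small worked example, using only the values quoted above (the function names are made up for illustration):

```python
RETRY_WAIT_S = 20   # wait between connection attempts, in seconds


def minutes_until_failure(first_connect_max_cnt, wait_s=RETRY_WAIT_S):
    # Total time a client keeps retrying before it gives up.
    return first_connect_max_cnt * wait_s / 60.0


def survives_outage(first_connect_max_cnt, outage_min):
    # True if the client is still retrying when the outage ends.
    return minutes_until_failure(first_connect_max_cnt) > outage_min


print(round(minutes_until_failure(10), 1))   # 3.3  (xrd.pl setting)
print(minutes_until_failure(150))            # 50.0 (default setting)
print(survives_outage(10, 5))                # False
print(survives_outage(150, 30))              # True
```

With 10 attempts even the shortest typical outage (5 min) outlasts the client, while the default of 150 attempts covers the longest typical outage (30 min).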

Testing

As with every xrootd version, basic tests were done: reading from and writing to xrootd, and testing the client admin interface (rm, stat, checksum,...).

The new version has been installed as a test version on the Fermi xrootd cluster, which allows access to the glast data. The production crawler has been using this version for more than a month.
Skimmer jobs were also run successfully against this version.

The fix to the timeout for xrd.pl has been tested. It has been verified that it waits the expected time if an xrootd server is not available.

Rollback

To switch the servers back to the old version the xrootd configuration has to be reverted to the old version followed by a restart of the old version.

The client version is rolled back by recreating the link to the old version.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-185

Details

Server version upgrade

cmsd logfile name change

At the time of the restart I would like to change the logfile name for the cmsd from olbdlog to cmsdlog. This requires changing the name in StartXrd.cf.glast:

  1. CMSLOGFN=cmsdlog

Restart of xrootd

  1. Stop the redirectors
  2. Restart the data servers with the new version
  3. Start the redirectors

The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected and avoids the chance that a file is not found because a data server is being restarted. The clients will wait while the xrootds are down and reconnect once the data servers and redirectors are up.

Update the client

  1. Change the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD from
    @sys/20080728-0933/ to dist/20080728-0933v1/@sys

This update will not change the xrootd client binaries; it will only change the xrd.pl Perl script.

The directory structure of the xrootd application changed. Instead of keeping releases in afs-sysname/release-name they are now in dist/release-name/afs-sysname (e.g.: i386_rhel30/20080728-0933v1 to dist/20080728-0933v1/i386_rhel30).
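
The layout change amounts to swapping the two path components and prefixing `dist`. A minimal sketch of that mapping (the function name `to_dist_layout` is hypothetical and it assumes a relative path with exactly two components):

```python
def to_dist_layout(old_path):
    """Map 'afs-sysname/release-name' to 'dist/release-name/afs-sysname'."""
    sysname, release = old_path.strip("/").split("/")
    return "/".join(["dist", release, sysname])


print(to_dist_layout("i386_rhel30/20080728-0933v1"))
# dist/20080728-0933v1/i386_rhel30
```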

These small changes to the pipeline make it possible to perform reprocessing, and make it possible for the run status to default to good. They also add a feature to time out database connections, to see whether this fixes the problem with gradual pipeline slowdown requiring frequent restarts.

The timeout can be easily turned off, and the new version can be easily backed out if any problems occur. The associated JIRA is SSC-168@JIRA.

Pipeline 1.3.2

DataCat Client 2.3.2, DataCat Stored Procedures 2.2.1, DataHandling Common 1.5.1

Run Quality 1.3.2

These small changes to the pipeline make it possible to perform reprocessing, and make it possible for the run status to default to good. They also add a feature to time out database connections, to see whether this fixes the problem with gradual pipeline slowdown requiring frequent restarts.

The timeout can be easily turned off, and the new version can be easily backed out if any problems occur. The associated JIRA is SSC-156@JIRA.

Pipeline 1.3.1

Pipeline Front-End 2.8

Run Quality 1.3

Reason for change

Server version update

We would like to upgrade the xrootd server version for the glast xrootd cluster from version 20080513-1222 to 20080828-1632. The main changes between these two versions are:

  1. Allow removing files via the redirector from data servers that are filled (GXR-37@JIRA)
  2. Allow collection of xrootd statistics by Ganglia (GXR-38@JIRA)
  3. Option to suppress reverse DNS lookups.
  4. Support for sendfile(). Improves read performance and lowers xrootd's memory usage.
  5. Fixes a bug that could cause the redirector cmsd to crash if a suspended data server reconnects too quickly.

The full cvs Changelog is available at http://xrootd.slac.stanford.edu/download/20080828-1632/ChangeLog_to_v20080513-1222

Client version update

The xrootd client tools are installed in /afs/slac.stanford.edu/g/glast/applications/xrootd. We also would like to update the PROD version from 20080513-1222 to 20080728-0933.
The main changes for the 20080728-0933 version are:

  1. Support for reading FITS files using the xrootd preload library.
  2. Fixes for bugs in xrootd that caused increased memory usage for large file transfers.

We also would like to link the FITS version to PROD.

Testing

As with every xrootd version, basic tests were done: reading from and writing to xrootd, and testing the client admin interface (rm, stat, checksum,...).

The new version has been installed as a test version on the glast xrootd cluster which allows access to the glast data. MC jobs were run successfully against it.

The xrootd client version has been installed as TEST, DEV and FITS version in /afs/slac.stanford.edu/g/glast/applications/xrootd. The FITS and DEV versions have been successfully used for production activities.

Rollback

To switch the servers back to the old version the xrootd configuration has to be reverted to the old version followed by a restart of the old version.

The client version is rolled back by recreating the link to the old version.

CCB Request

https://jira.slac.stanford.edu/browse/SSC-141

Details

Server version upgrade

The following changes to the config file, xrootd.cf, are needed:

  1. use load balancing between xrootd and cmsd
  2. turn off DNS reverse lookup
  3. allow the data server to login to xrootd without authentication (needed to gather statistics with Ganglia)

Restart of xrootd

  1. Stop the redirectors
  2. Restart the data servers
  3. Start the redirectors

The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected and avoids the chance that a file is not found because a data server is being restarted. The clients will wait during the restart and reconnect to the data servers and redirectors.

Update the client

  1. Change the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD to point to 20080728-0933.
  2. Have FITS linked to PROD

Actions:

  • Install org-glast-runquality-web (version 1.2) library to PROD.
  • Install org-glast-datahandling-common (version 1.4) library to PROD.
  • Install org-glast-datacat-client (version 2.3) library to PROD.
  • Install org-glast-datacat-sp (version 2.2) library to PROD.
    • Upload org-glast-datacat-sp (version 2.2) stored procedures to PROD.
  • Install org-glast-pipeline-server (version 1.3) library to PROD.
    • Upload org-glast-pipeline-server (version 1.3) stored procedures to PROD.
  • Modify PROD pipeline startup script to use version 1.3
  • Apply patches to Database tables, adding new columns with default values to support new features.
    • this is non-destructive and does not have to be reverted in the event of a version back-out
  • Restart PROD

Estimated Length of Outage to Perform Upgrade

Approximately 20 minutes total based on length of time it took to add the new table-columns in DEV.

Motivation:

  • Support for L1 Data reprocessing
  • Support in Pipeline Jython scripts for newest dataset-find routine (with significant improvements in meta-data search capabilities and bugfixes to same)
  • Ability to auto-retry failed processes
  • New ability to adjust zombie-process reaping delay
  • New command line feature enables users to deal with 'zombie' processes when the reaper is unable to (previously this required a developer's direct intervention).

Rollback Procedure:

Because of the new stored procedures that will be installed, the back-out procedure is slightly more involved than usual:

  1. Shut-down pipeline server
  2. Return server start-up script to version 1.2.5
  3. Re-upload Pipeline stored procedures v1.2.5
  4. Re-upload Data Catalog stored procedures v2.1
  5. Restart pipeline server

Note that the additional steps are quite simple to perform and only extend the outage for a back-out from ~5 minutes to ~10.

Updates that would not require backing out (with justification):

  1. Columns added to the Process and ProcessInstance tables for auto-retry support need not be removed, as they are not used by the previous pipeline version and will be ignored.
  2. PFE (Pipeline Front End) need not be reverted from 2.7 back to 2.6, because the additional features supporting process auto-retry will work if the columns are not removed, and the schema additions are optional and backward-compatible.

Associated Jira:

SSC-135@JIRA

Details

Pipeline 1.3

DataCat Client 2.3


Pipeline Front End 2.7



We need to apply a patch to the pipeline server to fix PII-353.

The problem is that currently all files sent with jobs are stored in a single unix directory. Currently there are 30924 files (up from 29000 at the end of last week). When that reaches 32765 the pipeline will cease to work.
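
To get a feeling for how much headroom is left, the numbers above can be turned into a rough estimate. The elapsed time of 7 days is an assumption (the text only says "end of last week"), as is a constant growth rate, so the result is only indicative.

```python
LIMIT = 32765      # file count at which the pipeline stops working
CURRENT = 30924    # files in the directory now
PREVIOUS = 29000   # approximate count at the end of last week


def days_until_limit(limit, current, previous, elapsed_days):
    # Linear extrapolation of the recent growth rate.
    remaining = limit - current
    rate_per_day = (current - previous) / elapsed_days
    return remaining / rate_per_day


remaining = LIMIT - CURRENT   # 1841 files of headroom
growth = CURRENT - PREVIOUS   # 1924 files added since last week

# elapsed_days=7 is an assumed value, not taken from the text.
estimate = days_until_limit(LIMIT, CURRENT, PREVIOUS, elapsed_days=7)
print(remaining, growth, round(estimate, 1))
```

Under that assumption the remaining headroom is smaller than the growth seen since last week, so the limit would be reached in less than a week.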

A fix has been applied in DEV and after limited testing appears to work. The fix simply adds a non-flat structure to the fileStore (as we do for other directories in the pipeline). A script is available to transform the fileStore from the old flat structure to the new hierarchical structure.
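
The actual layout used by the fix is described in the Jira issue; the sketch below only illustrates the general technique of spreading files from one flat directory over a tree of subdirectories. The bucketing scheme (two levels derived from an MD5 hash of the file name) and the function names are assumptions made for this example, not the pipeline's scheme.

```python
import hashlib
import os
import shutil


def bucket_for(name, levels=2):
    # Derive a nested subdirectory such as 'ab/cd' from the file name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return os.path.join(*[digest[2 * i:2 * i + 2] for i in range(levels)])


def migrate_flat_store(root):
    """Move every regular file directly under root into its bucket directory."""
    moved = 0
    for name in sorted(os.listdir(root)):
        src = os.path.join(root, name)
        if not os.path.isfile(src):
            continue  # skip bucket directories that already exist
        dest_dir = os.path.join(root, bucket_for(name))
        os.makedirs(dest_dir, exist_ok=True)
        shutil.move(src, os.path.join(dest_dir, name))
        moved += 1
    return moved
```

With two hex characters per level, no directory in this sketch ever holds more than 256 subdirectories, and a file can be found again by recomputing `bucket_for(name)`. Running the migration a second time moves nothing.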

More details are available here:

https://jira.slac.stanford.edu/browse/PII-353

There will be about a 15-minute outage of the pipeline (to reorganize the file store).

Actions:

* Install org-glast-datahandling-common (version 2.2.1) to PROD.

* Install org-glast-pipeline-server (version 1.2.4) to PROD.

* Modify PROD pipeline startup script to use version 1.2.4

* Restart PROD

Reason:

Fixes a bug in parsing HalfPipe MET variables.

Rollback Procedure:

Simple -- Return server start-up script to version 1.2.3

Associated Jira:

SSC-74@JIRA

Details

Pipeline 1.2.4

Actions:

  • Install org.glast.datacat.sp package (version 2.1) to PROD.

Reason:

Fixes a bug that prevents meta-data replacement from working in current DataCat Web Front End release.

Rollback Procedure:

Re-upload Data Catalog Stored procedures tagged v2r0. 

Associated Jira:

SSC-73@JIRA

Details

Pipeline 1.2.4
