1.3.5 is built against a patched version of the Data-Handling-Common library which allows database connections to be removed from the connection pool as they age (and replaced with freshly created connections). It also adds monitoring and run-time configuration capabilities.
The patch has been tested in DEV and works as expected.
This is intended to address the memory leak we see on the Oracle server, which slows down the pipeline software when the application has been running for some time. Since Oracle memory usage goes back down when the pipeline application is restarted, we suspect the problem lies in the long-lived, cached connections.
Jira CCB Request: https://jira.slac.stanford.edu/browse/SSC-224
Details are in the following page:
http://confluence.slac.stanford.edu/x/_gRzAw
I intend to perform the clean-up and foreign-key creation during the all-day Computing Center power outage on Tuesday, August 25th (6am-5pm).
Oracle support insists that we use a Java 1.6 driver with our Java 1.6 application before they will give us further help on the PGA usage problem.
We will be moving to ojdbc6.jar version 11.1.0.7 and the associated native libraries provided in the Oracle client software.
The SSC JIRA issue is here:
https://jira.slac.stanford.edu/browse/SSC-208
Jira Release info for pipeline project is here:
This is a request to upgrade the xrootd client tools from version 20080728-0933v1 to 20090202-1402v2.
In the new version, xrdcp is able to overwrite a file located on a data server that has no space left; the current xrdcp fails in this case.
The new version has been installed in the Fermi xrootd application area and is available through the DEV link. All tools (xrdcp, xrd.pl, xrdls, and xrdprel) were tested against the test and production xrootd servers.
The client version is rolled back by recreating the link to the old version.
https://jira.slac.stanford.edu/browse/SSC-202
The current xrdcp version fails to overwrite a file that is on a data server with no free space left, because the redirector will not redirect the client. The new xrdcp version, however, first removes the file and then writes it to a new server.
The other client tools are unchanged, except xrd.pl, which gained an option to remove a directory tree; this option is not currently applicable to the production xrootd.
To update the version, the link /afs/slac.stanford.edu/g/glast/applications/xrootd/PROD has to be changed to point to dist/20090202-1402v2/@sys.
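The link switch above can be sketched as follows. This is a demonstration in a scratch directory, not the real AFS path; the @sys component is dropped here, and the exact ln invocation used in production is an assumption:

```shell
# Demonstration of the PROD link switch in a scratch directory (the real
# location is /afs/slac.stanford.edu/g/glast/applications/xrootd).
XRD_APP=$(mktemp -d)
mkdir -p "$XRD_APP/dist/20080728-0933v1" "$XRD_APP/dist/20090202-1402v2"
ln -s dist/20080728-0933v1 "$XRD_APP/PROD"        # current PROD link

# Upgrade: repoint PROD at the new release in one step
ln -sfn dist/20090202-1402v2 "$XRD_APP/PROD"

# Rollback would recreate the link to the old version the same way:
# ln -sfn dist/20080728-0933v1 "$XRD_APP/PROD"
readlink "$XRD_APP/PROD"
```

The -n flag matters: without it, ln would create the new link inside the directory the old link points to instead of replacing the link itself.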
The xrootd redirectors are configured to forward a file-remove request to all of their data servers. We would therefore like to configure the redirectors so that clients have to authenticate themselves and only one production account is authorized to remove files.
The Fermi xrootd test setup was configured to use authentication/authorization for the redirectors and data servers:
1) only glastxrw was allowed to remove files (through redirector or data server)
2) all clients were allowed to read/write files if connected to redirector
3) only Fermi users were allowed to read files from the data servers
4) only Fermi production accounts were allowed to write files
These rules were tested using four accounts: a Fermi user (read-only access), a production account, the account that has privileges to remove files, and a non-Fermi user account.
The configuration can be rolled back by using the previous xrootd configuration and authorization. A restart of the xrootd redirectors is needed.
https://jira.slac.stanford.edu/browse/SSC-199
Authentication and authorization are required for all of the xrootd data servers in order to restrict access to the Fermi data to Fermi members only. Write and remove privileges are granted to production accounts only. No restrictions were needed for the redirectors, as all they did was redirect clients to the data servers.
The redirectors have been reconfigured so that they are able to remove files, and therefore authentication and authorization have to be enabled on them.
The same authentication scheme as used for the data servers will be used and the authorization will be very simple:
1. All users are allowed to read and write files (this is later restricted by the data servers)
2. Only glastxrw is allowed to remove files
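The two rules above might look roughly like the following in an xrootd authorization database file. This is a hedged sketch only: the path, the template-user rule, and the exact privilege letters are assumptions, and the real file is not shown in the request (only glastxrw is named in it):

```
# Hypothetical sketch of the redirector authorization file.
# Privilege letters assumed: r=read, w=write, a=all (including delete).
# Rule 1: all users may read and write under the (assumed) /glast tree
u * /glast rw
# Rule 2: only glastxrw gets full privileges, including file removal
u glastxrw /glast a
```

The actual deployed file should be taken from the existing data-server configuration, per the note that the same scheme is reused.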
For the data servers we would like to change the authorization so that only the glastxrw user is able to remove files (so far other production accounts are also allowed).
After changing the authorization files and the xrootd config file, the xrootd processes on the redirectors have to be restarted in order to activate the changes.
The data servers do not need to be restarted, as they reread the authorization file periodically.
2.3.3 fixes a bug that prevents (large) dataset searches issued from the linemode client from completing.
See https://jira.slac.stanford.edu/browse/SSC-197 for details.
The current 10.2.0 driver used by applications needing passwordless (wallet) Oracle database access has a bug which prevents it from connecting to Oracle from a client machine that has been up for more than 200 days (32-bit Linux kernel 2.6) or 248 days (32-bit Linux kernel 2.4).
A patched version of the client tools has been made available which fixes this problem.
We have a central configuration script (actually two, one each for Unix sh and csh) which requires a one-line change to point everything to the new driver.
Backing out the change is only a matter of restoring the original pointer.
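The one-line pointer change might look like the following in the sh variant of the script. This is a hypothetical sketch: the variable name and both paths are assumptions, not the real script contents:

```shell
# Hypothetical sketch of the single pointer line in the central sh setup
# script. Variable name and paths are assumptions; only the "change one
# line, back out by restoring it" pattern is from the request.
ORACLE_CLIENT_HOME=/afs/slac.stanford.edu/g/glast/oracle/10.2.0-patched
# was: ORACLE_CLIENT_HOME=/afs/slac.stanford.edu/g/glast/oracle/10.2.0

# Everything downstream derives from the one pointer above.
export LD_LIBRARY_PATH="$ORACLE_CLIENT_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Backing out is then exactly the restoration of the original pointer line.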
I've tested both clients on both 32-bit and 64-bit Linux machines and the results are consistent.
This is a request to change the xrootd server version and have a minor update of the client tools.
We would like to upgrade the xrootd server version for the Fermi xrootd cluster from 20080828-1632 to 20090202-1402. The main changes between these two versions are:
Because of this issue, the crawler is currently not using the production xrootd but the test xrootd that runs the new version.
The first time an xrootd client connects to a cluster, it tries FirstConnectMaxCnt times to connect before it fails. The default for this number is 150, but for xrd.pl it is overridden and set to 10. Therefore a client will fail after about 3.3 minutes (the wait between connection attempts is 20 seconds), whereas with the default setting the client would fail only after 50 minutes. This is important because during an outage, which typically lasts 5-30 minutes, we stop the redirector to keep clients from being redirected, and with the short wait time xrd.pl might fail.
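The retry-window arithmetic above (attempts times the 20-second wait between attempts) can be checked directly; both figures come from the text:

```shell
# Retry window = attempts x wait-between-attempts (values from the text).
WAIT=20               # seconds between connection attempts
DEFAULT_ATTEMPTS=150  # FirstConnectMaxCnt default
XRDPL_ATTEMPTS=10     # xrd.pl override

default_minutes=$(( DEFAULT_ATTEMPTS * WAIT / 60 ))  # 50 minutes
xrdpl_minutes=$((  XRDPL_ATTEMPTS  * WAIT / 60 ))    # 3 (about 3.3) minutes
echo "$default_minutes $xrdpl_minutes"
```

So with the xrd.pl override, a client gives up well within the 5-30 minute outage window, which is exactly the failure mode described.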
As with every xrootd version, basic tests were done reading from and writing to xrootd, and testing the client admin interface (rm, stat, checksum, ...).
The new version has been installed as a test version on the Fermi xrootd cluster, which allows access to the GLAST data. The production crawler has been using this version for more than a month, and skimmer jobs were also run successfully against it.
The fix to the timeout for xrd.pl has been tested; it has been verified that it will wait the expected time if an xrootd server is not available.
To switch the servers back to the old version, the xrootd configuration has to be reverted and the old version restarted.
The client version is rolled back by recreating the link to the old version.
https://jira.slac.stanford.edu/browse/SSC-185
At the same time as the restart, I would like to change the logfile name for the cmsd from olbdlog to cmsdlog; this requires changing the name in StartXrd.cf.glast.
The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected and reduces the chance that a file is not found because a data server is being restarted. The clients will wait while the xrootds are down and reconnect once the data servers and redirectors are up.
This update will not change the xrootd client binaries; it only changes the xrd.pl Perl script.
The directory structure of the xrootd application has changed: instead of keeping releases in afs-sysname/release-name, they are now in dist/release-name/afs-sysname (e.g., i386_rhel30/20080728-0933v1 becomes dist/20080728-0933v1/i386_rhel30).
These small changes to the pipeline make it possible to perform reprocessing and allow the run status to default to good. The release also adds a feature to time out database connections, to see whether this fixes the gradual pipeline slowdown that requires frequent restarts.
The timeout can easily be turned off, and the new version can easily be backed out if any problems occur. The associated JIRA issue is SSC-168.
These small changes to the pipeline make it possible to perform reprocessing and allow the run status to default to good. The release also adds a feature to time out database connections, to see whether this fixes the gradual pipeline slowdown that requires frequent restarts.
The timeout can easily be turned off, and the new version can easily be backed out if any problems occur. The associated JIRA issue is SSC-156.
We would like to upgrade the xrootd server version for the glast xrootd cluster from version
20080513-1222 to 20080828-1632. The main changes between these two versions are:
The full cvs Changelog is available at http://xrootd.slac.stanford.edu/download/20080828-1632/ChangeLog_to_v20080513-1222
The xrootd client tools are installed in /afs/slac.stanford.edu/g/glast/applications/xrootd. We also would like to update the PROD version from 20080513-1222 to 20080728-0933.
The main changes for the 20080728-0933 version are:
We also would like to link the FITS version to PROD.
As with every xrootd version, basic tests were done reading from and writing to xrootd, and testing the client admin interface (rm, stat, checksum, ...).
The new version has been installed as a test version on the glast xrootd cluster which allows access to the glast data. MC jobs were run successfully against it.
The xrootd client version has been installed as TEST, DEV and FITS version in /afs/slac.stanford.edu/g/glast/applications/xrootd. The FITS and DEV versions have been successfully used for production activities.
To switch the servers back to the old version, the xrootd configuration has to be reverted and the old version restarted.
The client version is rolled back by recreating the link to the old version.
https://jira.slac.stanford.edu/browse/SSC-141
The restart should take less than five minutes. Stopping the redirectors first prevents clients from being redirected and reduces the chance that a file is not found because a data server is being restarted. The clients will wait during the restart and reconnect to the data servers and redirectors.
Approximately 20 minutes total, based on the length of time it took to add the new table columns in DEV.
Because of the new stored procedures that will be installed the back-out procedure is slightly more involved than usual:
Note that the additional steps are quite simple to perform and only extend the outage for a back-out from ~5 minutes to ~10.
Updates that would not require backing out (with justification):
We need to apply a patch to the pipeline server to fix PII-353.
The problem is that all files sent with jobs are currently stored in a single Unix directory. There are currently 30924 files (up from 29000 at the end of last week); when the count reaches 32765, the pipeline will cease to work.
A fix has been applied in DEV and, after limited testing, appears to work. The fix simply adds a non-flat structure to the fileStore (as we do for other directories in the pipeline). A script is available to transform the fileStore from the old flat structure to the new hierarchical one.
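The transformation might look roughly like the following. This is a hypothetical sketch run in a scratch directory: the real script, file names, and bucketing scheme are not shown in the request, and the hash-prefix layout here is an assumption:

```shell
# Hypothetical sketch: move a flat fileStore into a two-level hierarchy.
# The real script and layout are not shown in the request; the md5
# hash-prefix scheme (256 buckets) is an assumption for illustration.
STORE=$(mktemp -d)
touch "$STORE/job-10001.tar" "$STORE/job-10002.tar"   # stand-in job files

for f in "$STORE"/*.tar; do
    name=$(basename "$f")
    sub=$(printf '%s' "$name" | md5sum | cut -c1-2)   # pick a bucket
    mkdir -p "$STORE/$sub"
    mv "$f" "$STORE/$sub/$name"
done
```

Spreading 30000+ files over 256 buckets keeps each directory around 120 entries, comfortably below the limit that triggered the problem.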
More details are available here:
https://jira.slac.stanford.edu/browse/PII-353
This will cause about a 15-minute outage of the pipeline (to reorganize the file store).
* Install org-glast-datahandling-common (version 2.2.1) to PROD.
* Install org-glast-pipeline-server (version 1.2.4) to PROD.
* Modify PROD pipeline startup script to use version 1.2.4
* Restart PROD
Fixes a bug in parsing HalfPipe MET variables.
Simple -- Return server start-up script to version 1.2.3
Fixes a bug that prevents metadata replacement from working in the current DataCat Web Front End release.
Re-upload Data Catalog Stored procedures tagged v2r0.