
Actions:

  • Remove Data Catalog folders for tasks that are being purged from disk.
  • Remove Task definitions from Pipeline Database.
  • Compile a list of, and delete, empty folders and groups from the Data Catalog (one way to find them is sketched below).
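
One way the empty folders might be found, as a minimal sketch: the query below is hypothetical, since the actual Data Catalog schema (folder and dataset table names, columns) is not shown in this entry.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmptyFolderScan {
        public static void main(String[] args) throws Exception {
            // args[0]: JDBC URL for the data catalog database
            Connection conn = DriverManager.getConnection(args[0]);
            // A folder is empty if it contains no datasets and no subfolders
            // (table and column names are illustrative only)
            String sql =
                "SELECT f.folder_id, f.path FROM folder f " +
                "WHERE NOT EXISTS (SELECT 1 FROM dataset d WHERE d.folder_id = f.folder_id) " +
                "AND NOT EXISTS (SELECT 1 FROM folder c WHERE c.parent_id = f.folder_id)";
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // candidates for deletion; review the list before removing anything
                    System.out.println(rs.getLong(1) + "  " + rs.getString(2));
                }
            } finally {
                conn.close();
            }
        }
    }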

Reason:

Monte Carlo datasets are being removed from disk to make room for real data, so the associated data catalog entries need to be cleaned up. Cleaning up other empty folders and groups will also make the data catalog tree easier to browse and will make some queries faster.

Rollback Procedure:

None.

Associated Jira:

SSC-68@JIRA

Reason for change

The pipeline release contains a bug fix that remedies a problem with streams being left in the 'waiting' state after they have successfully completed processing. It also depends on the new version of the Data Catalog Client library, which has improved (~20x faster) dataset searching. The speed increase is obtained by using the new versioned tables directly instead of the views that were created to allow existing software to continue to work with the old dataset model.
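
The entry does not show the query change itself; the sketch below illustrates the idea, with all table, column, and view names hypothetical rather than taken from the real Data Catalog schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DatasetSearchSketch {
        // Old path: search through the compatibility view, which re-creates
        // the flat pre-versioning dataset model with a join on every query.
        static final String VIA_VIEW =
            "SELECT dataset_name, file_path FROM dataset_view WHERE task_name = ?";

        // New path: query the versioned tables directly, using an indexed
        // latest-version flag instead of re-deriving it through the view.
        static final String VIA_TABLES =
            "SELECT d.dataset_name, v.file_path " +
            "FROM dataset d JOIN dataset_version v ON v.dataset_id = d.dataset_id " +
            "WHERE d.task_name = ? AND v.is_latest = 1";

        static void search(Connection conn, String task) throws Exception {
            try (PreparedStatement ps = conn.prepareStatement(VIA_TABLES)) {
                ps.setString(1, task);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                    }
                }
            }
        }
    }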

Test Procedure

These changes have been tested in DEV and TEST. ASP-like tasks were used to test the dataset-searching code against the PROD data catalog, and L1Proc has been used to verify that the status-setting code changes work.

Rollback procedure

This upgrade requires that the pipeline stored procedures be reinstalled and that the new libraries be used. To roll back, we can simply reinstall the previous (1.2.2) version of the stored procedures and back out to the 1.2.2 pipeline release (2.0.1 datacat-client release).

Related JIRA

SSC-66@JIRA

Details

DataCatalog 2.0.2


Pipeline 1.2.3


Reason for change

We would like to deploy a new xrootd version, switching from 20080513-1222 to 20080531-0714p1.
The new version contains a few fixes:

  1. Allow removing files via the redirector from data servers that are filled (GXR-37@JIRA)
  2. Allow collection of xrootd statistics by Ganglia (GXR-38@JIRA)
  3. Fix a bug that could cause the redirector cmsd to crash if a suspended data server reconnects too quickly

Beyond that, the new version contains only minor changes.

Testing

The standard tests for a new xrootd release have been performed: reading and writing files with xrdcp, and exercising the client admin tools (stat, removing files, and creating and removing directories).

The new version has been installed as a test version on the glast xrootd cluster, which allows access to the glast data. Tests were done to access the data using the test xrootd.

Rollback

The old xrootd version can be re-activated and restarted. The xrootd configuration file has to be rolled back. The rollback will take about five minutes.

CCB Jira

SSC-67@JIRA

Details

Only the xrootd version running on the xrootd cluster will be changed. The version that clients are using out of the application area (/afs/slac.stanford.edu/g/glast/applications/xrootd/PROD) will not be changed.

No outage is required to install the new version. The procedure will be:

  1. Update the xrootd configuration file (for ganglia monitoring)
  2. Activate the new version on all xrootd servers (just updates a link that points to the new version)
  3. Restart the redirectors
  4. Restart the data servers

The restart itself will take a couple of minutes and should not affect any clients.


Reason for Change:

The previous versions of the pipeline server contained an old version of the RunQuality plugin; this release upgrades it to 1.0. We will also release the pipeline front-end, which has been in use in DEV for some time, with a few bug fixes and enhancements for monitoring and diagnostics.

Test Procedure:

Changes have been tested in DEV.

Rollback Procedure:

Normal rollback procedure.

Related JIRA

SSC-58@JIRA


Reason for Change:

Database connections are being left open when creating DataCatClient objects. The open connections are exhausting the supply of sockets on the server machine.
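
A sketch of the failure mode and the fix, in plain JDBC (the real DataCatClient internals are not shown in this entry, so this is illustrative only):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class ConnectionLeakFix {
        // The failure mode: open a connection per client object and rely on
        // the caller (or the garbage collector) to close it. Each forgotten
        // close() pins one socket on the server until the process exits.
        static Connection leaky(String url) throws Exception {
            return DriverManager.getConnection(url);
        }

        // The fix: scope the connection and guarantee close() on every path,
        // so the socket is returned even when a lookup throws.
        static void withConnection(String url) throws Exception {
            Connection conn = DriverManager.getConnection(url);
            try {
                // ... perform data catalog lookups here ...
            } finally {
                conn.close();
            }
        }
    }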

Test Procedure:

Install to TEST and DEV, and have Maria Elana test L1Proc there to verify that the connections are being closed correctly.

Rollback Procedure:

Restart the server using org-glast-pipeline-server-1.2.jar.

Related JIRA

SSC-57@JIRA


Reason for change

This new release incorporates changes required to support multiple "versions" of files, as required by Level 1 processing. Currently L1 uses an arbitrary naming scheme which is not understood by the data catalog or by tools like the data skimmer; this results in multiple copies of the same event being skimmed. This release fixes this problem and also simplifies the bookkeeping required by L1Proc.

The database will be converted to the new structure. Based on the time taken to convert the DEV database and the number of datasets in PROD, we estimate that the conversion on PROD will take about 8 hours. Database views have been created so that tools which care only about the most recent version of a file, and which do not update the database, will not see any change.
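
A hedged sketch of what such a compatibility view could look like; the names are illustrative, since the real schema is not given here.

    import java.sql.Connection;
    import java.sql.Statement;

    public class CompatViewSketch {
        // Expose only the newest version of each dataset, so that read-only
        // tools keep seeing the flat, one-row-per-dataset model.
        static final String CREATE_VIEW =
            "CREATE OR REPLACE VIEW dataset_view AS " +
            "SELECT d.dataset_id, d.dataset_name, v.file_path " +
            "FROM dataset d JOIN dataset_version v ON v.dataset_id = d.dataset_id " +
            "WHERE v.version_id = (SELECT MAX(v2.version_id) " +
            "FROM dataset_version v2 WHERE v2.dataset_id = d.dataset_id)";

        static void install(Connection conn) throws Exception {
            try (Statement st = conn.createStatement()) {
                st.execute(CREATE_VIEW);
            }
        }
    }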

Test Procedure

These changes have been tested on a DEV version of the server. We have tested ASP, MC and L1 tasks using the new release.

Rollback procedure

This update requires changes to the data catalog database. They can be rolled back if a problem is found immediately, but rolling back after a substantial number of datasets have been registered in the new schema will be difficult.

Related JIRA

SSC-51@JIRA

Details

DataCatalog 2.0

type key summary assignee reporter priority status resolution created updated due

Unable to locate Jira server for this macro. It may be due to Application Link configuration.

Pipeline 1.2

type key summary assignee reporter priority status resolution created updated due

Unable to locate Jira server for this macro. It may be due to Application Link configuration.

This release also incorporates the performance improvements made in release 1.1.1, and minor changes to the job control daemon documented separately.

Data Crawler 1.3


Reason for change

The Job Control daemon is used by Pipeline II to submit jobs to LSF and BQS (at Lyon). This minor change adds a management interface to support monitoring the daemons, and adds improved error reporting when an error occurs during job submission. Neither of these changes will have any effect on operations, but they will make tracking down problems easier.
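
The entry does not say how the management interface is implemented; for a Java daemon, JMX would be a typical choice, so the following is purely an illustrative sketch under that assumption.

    import java.lang.management.ManagementFactory;
    import javax.management.ObjectName;

    // Attributes exposed to jconsole or any other JMX client.
    interface JobControlMonitorMBean {
        int getSubmittedJobs();
        int getFailedSubmissions();
        String getLastError();
    }

    public class JobControlMonitor implements JobControlMonitorMBean {
        private volatile int submitted, failed;
        private volatile String lastError = "";

        public int getSubmittedJobs()     { return submitted; }
        public int getFailedSubmissions() { return failed; }
        public String getLastError()      { return lastError; }

        // Called once at daemon startup to make the counters visible remotely.
        public void register() throws Exception {
            ManagementFactory.getPlatformMBeanServer()
                .registerMBean(this, new ObjectName("jobcontrol:type=Monitor"));
        }
    }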

Test Procedure

These changes have been tested on a DEV version of the server.

Rollback procedure

It will be easy to rollback these changes should any problem occur.

Related JIRA

SSC-46@JIRA


Reason for change

The group manager has been updated with very minor changes to support better interaction with the Shift Schedule application. Since the shift schedule app is to be announced tomorrow, these changes have already been released, although they can be trivially backed out if need be.

Test Procedure

These changes have all been tested in the dev server http://glast-tomcat03.slac.stanford.edu:8080/GroupManager

Rollback procedure

It will be easy to rollback these changes should any problem occur.

Related JIRA

SSC-45@JIRA



Reason for change

The current production version is 20071101-0808p2. We would like to install the new production candidate version 20080513-1222.

The new xrootd version allows access to the GLAST data to be restricted, which is a requirement; the current production version does not support access control. The new xrootd version also has various improvements and fixes that make the system more reliable:

  • access control for the data, which is required
  • a fix for checksum problems that could crash the server
  • better handling of redirection for existing files
  • other small bug fixes

Testing

The production candidate version 20080513-1222 and its predecessor version 20080410-0747 have been deployed on the glast xrootd cluster, running in parallel to the production xrootd daemons. This allows using the new version on the glast xrootd cluster without interfering with the production version.

An update from 20080410-0747 to 20080513-1222 was needed because a critical bug was found (https://jira.slac.stanford.edu/browse/GXR-33) that caused the server to crash when re-reading the updated authorization file. Otherwise there are only small bug fixes between 20080410-0747 and 20080513-1222 that don't affect GLAST.
The test setup on the glast xrootd cluster used the configuration that would be used in production; in particular, authorization was turned on. Only glast users are allowed to read data, and certain production accounts are allowed to write to xrootd.

The basic functionality of the new xrootd version has been examined in a test cluster. The testing included reading and writing with xrdcp, stat-ing files, and checksumming files.

All other tests were first done against version 20080410-0747 and then repeated for the production candidate version 20080513-1222 on the glast xrootd cluster:

The main tests were:

  • running the crawler
  • MC-simulation
  • skimmer jobs
  • reading and writing with xrdcp
  • reading from xrootd with root version 5.18.00(b)

and all tests passed.

The tests that required the xrootd client tools (xrdcp to transfer files; xrd.pl to stat, checksum, and remove files) used the tools from the new version.

Rollback

It is possible to re-activate the old xrootd version; no data will be lost.

CCB Jira

SSC-43@JIRA

Details

Deploying the new xrootd version requires restarting xrootd on all glast xrootd servers, including the redirectors (glast-rdr). In addition, the client admin tools in /afs/slac/g/glast/applications/xrootd/PROD have to be updated, as the current version does not support authorization.

The steps are:

  1. Install the new client admin tools:
    move /afs/slac/g/glast/applications/xrootd/TEST to /afs/slac/g/glast/applications/xrootd/PROD.
    The new version works with the old xrootd servers, so this move doesn't cause any disruption to clients.
  2. Stop xrootd on all data servers
  3. Stop xrootd on the redirectors (glastlnx04/05)
  4. Activate the new version on the redirectors and restart them.
  5. Activate the new version on the data servers and restart them.

The whole procedure should take less than five minutes; clients will retry access to xrootd during that time and will not fail. As a precaution, users will be informed when the restart will happen, in order to reduce the load on xrootd.

Reason for change

To comply with SLAC's new database password policy, we must change the database passwords every 6 months. We have switched to using Oracle wallet to make this process as painless as possible. Oracle wallet will enable us to change the Oracle password without any application downtime.

We have one remaining application (the data catalog crawler) which needs to be updated to use Oracle wallet. We can then change the last Oracle password used by Pipeline II and the data catalog. This must be done by April 30.
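
For reference, a minimal sketch of what a wallet-based connection looks like from Java, assuming the Oracle thin JDBC driver with the PKI jars on the classpath; the wallet directory and TNS alias below are placeholders, not our actual configuration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class WalletConnect {
        public static Connection open() throws Exception {
            // No username or password appears in code or config files: the
            // credentials live in the wallet, keyed by the TNS alias, so a
            // password change touches only the wallet itself.
            Properties props = new Properties();
            props.setProperty("oracle.net.wallet_location",
                "(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/path/to/wallet)))");
            return DriverManager.getConnection("jdbc:oracle:thin:/@GLAST_PROD", props);
        }
    }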

Test Procedure

The change to the data catalog crawler has been tested on the DEV database. We have previously tested changing passwords with Oracle wallet and do not anticipate any problems. We will change the password on the TEST, DEV and PROD databases (in that order).

Rollback procedure

It will be easy to rollback to the old version of the data catalog crawler should any problems occur.

Related JIRA

SSC-34@JIRA

Data Catalog Crawler version 1.2


Data Catalog Client 1.1.1


Reason for change

Miscellaneous small changes to improve the robustness of the tomcat servers and to improve logging to aid in diagnosing problems.

Test Procedure

These changes have all been tested in the dev server http://glast-tomcat03.slac.stanford.edu:8080/

Rollback procedure

It will be easy to rollback these changes should any problem occur.

Related JIRA

SSC-32@JIRA

Pipeline Server 1.1


Reason for change

The group manager has been updated to re-enable automatic synchronization with the Stanford GLAST database (broken when a new password scheme was introduced on campus) and to remove hard-wired Oracle passwords. In addition, this version produces a file containing a list of all glast users, which is required by xrootd to restrict read access to the data to GLAST users. Finally, extra information is copied into the SLAC copy of the database which will be useful for contacting people during operations (pager #, home phone #, etc.).
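
A minimal sketch of the user-list export; the table name and the one-account-per-line file format are assumptions, since the entry specifies neither.

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class GlastUserExport {
        // Dump the GLAST user list to the flat file read by xrootd's
        // authorization layer.
        public static void export(Connection conn, String path) throws Exception {
            try (PrintWriter out = new PrintWriter(path);
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT account_name FROM glast_users")) {
                while (rs.next()) {
                    out.println(rs.getString(1)); // one account per line (assumed format)
                }
            }
        }
    }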

Test Procedure

These changes have all been tested in the dev server http://glast-tomcat03.slac.stanford.edu:8080/GroupManager

Rollback procedure

It will be easy to rollback these changes should any problem occur.

Related JIRA

SSC-23@JIRA

Pipeline Server 1.1


Reason for change

There are no functional changes in this release; it only includes a few performance enhancements implemented during testing of the new Oracle database. In addition, we will reconfigure the mail delivery from batch jobs to go via glast's own SMTP mail server instead of SLAC's exchange server, to reduce the load on exchange and isolate us from any exchange problems/outages.

It would be good to get these changes in before the L1 stress testing starts, as they may impact performance (hopefully for the better).

Test Procedure

These changes have all been extensively tested in the TEST and DEV pipelines.

Rollback procedure

Any of these changes can quickly and easily be backed out should unanticipated problems appear when the code is moved to production. Backing out the e-mail delivery change will need to be coordinated with Teresa Downey in SCCS, but can be done with one hour's notice.

Related JIRA

SSC-21@JIRA

Pipeline Server 1.1


Reason for change

Increase the disk space of the GLAST Xrootd cluster. The new server, wain017, has 32TB of disk space; adding it will increase the total disk space of the xrootd cluster from about 75TB to 107TB.

Testing

An xrootd server has been run on wain017, and files were written to and read from the server without any problems. Checksumming of files has also been tested.

Rollback

The xrootd on wain017 could be stopped, but files that were written to the server would have to be copied to the other servers. This is only possible if a small amount of data has been written to wain017, as the other glast xrootd servers have only a small amount of free space available.

CCB Jira

SSC-20@JIRA
SSC-38@JIRA
The SSC-38 Jira is for wain018 - wain021; the same procedure used for wain017 will be employed to add those servers to the GLAST Xrootd cluster.

Details

The procedure for adding wain017 to the xrootd cluster is:

  • Add wain017 as a read-only server.
  • Check reading from wain017 using the GLAST redirector.
  • Restart the xrootd on wain017 as writable.

wain017 is configured the same as the other wain xrootd servers:

  • the same xrootd config file is used.
  • the same scripts for checksumming
  • the same xrootd version (20071101-0808p2)

Reason for change

Over the last few months we have noticed that our existing oracle servers are being pushed to 100% CPU utilization by the load we are putting on them, often resulting in poor performance of the pipeline server and web applications. We have purchased two new servers which will provide the following benefits:

  1. Supports 64 simultaneous threads of execution (up from the current 2). This will allow us to support the expected load from many people using the web interfaces at the same time as we are performing data processing.
  2. Faster and more reliable RAID 10 disks to improve IO performance
  3. Two redundant servers to provide failover in case one server fails

The new servers are running the same OS and Oracle versions as our current production setup, so we do not anticipate any compatibility problems with the new servers. We have done extensive testing of the performance and compatibility of the new servers as detailed below.

Oracle hardware details

Testing

Scalability Testing

We have performed tests to verify that we can really use all of the available threads in parallel. We see good scaling of total throughput as we add extra parallel threads.

Performed 64 units of work in 696,047ms using 1 threads
Performed 64 units of work in 348,767ms using 2 threads
Performed 64 units of work in 223,745ms using 3 threads
Performed 64 units of work in 169,165ms using 4 threads
Performed 64 units of work in 139,516ms using 5 threads
Performed 64 units of work in 118,259ms using 6 threads
Performed 64 units of work in 103,766ms using 7 threads
Performed 64 units of work in 89,952ms using 8 threads
Performed 64 units of work in 85,995ms using 9 threads
Performed 64 units of work in 77,336ms using 10 threads
Performed 64 units of work in 73,032ms using 11 threads
Performed 64 units of work in 69,325ms using 12 threads
Performed 64 units of work in 60,347ms using 13 threads
Performed 64 units of work in 61,213ms using 14 threads
Performed 64 units of work in 60,428ms using 15 threads
Performed 64 units of work in 51,031ms using 16 threads
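
The harness itself is not included here; the sketch below shows the shape such a test could take, with the unit of work standing in for a fixed slice of database activity.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ScalingHarness {
        static final int UNITS = 64;

        public static void main(String[] args) throws Exception {
            for (int threads = 1; threads <= 16; threads++) {
                ExecutorService pool = Executors.newFixedThreadPool(threads);
                List<Callable<Void>> work = new ArrayList<>();
                for (int i = 0; i < UNITS; i++) {
                    work.add(() -> {
                        // one fixed-size unit of database work goes here
                        return null;
                    });
                }
                long start = System.currentTimeMillis();
                pool.invokeAll(work); // blocks until every unit has completed
                long elapsed = System.currentTimeMillis() - start;
                pool.shutdown();
                System.out.printf("Performed %d units of work in %,dms using %d threads%n",
                        UNITS, elapsed, threads);
            }
        }
    }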

Stress Testing

We have done extensive testing of the new database configuration using the pipeline II test server. We have run 15,000 real jobs and over 100,000 simulated jobs (when simulating jobs we did not actually submit any real batch jobs, but provided the same load to the pipeline server and database as when we run real batch jobs). At the same time as we were running the pipeline server, we also ran various data ingest jobs to simulate the load of storing trending data into the database. The pipeline and trending ingest are the most database-intensive activities that we perform.

We were able to run 1500 simulated MC jobs continuously for prolonged periods of time, and were able to ingest an orbit's worth of trending data in a little over 1 minute.

Failover Testing

We have tested the ability to fail over to the backup database if the primary database fails, and to resync the primary and secondary databases. This procedure will be used if the primary database becomes inoperable for an extended period due to hardware or software failure. Currently failover is a manual operation requiring an Oracle admin to designate the backup server as "primary". No change is required to GLAST software to switch over to the backup database.

Switchover methodology

We propose to switch over to the new Oracle databases on Monday, April 14. We will perform the following steps:

Starting midnight Sunday

  1. full backup of glast-oracle01 (to NFS disk)

Starting 8am Monday

  1. Shutdown glast applications
  2. Shutdown the glast-oracle01 database. All glast database access will be lost at this time.
  3. Restore into glast-oracle03
  4. Start glast-oracle03 as the primary database
  5. Switch glast-oracle01 to be a DNS alias for glast-oracle03
  6. Backup glast-oracle03 to NFS disk

Approximately 8pm Monday

  1. glast database access restored; glast applications can be restarted.
  2. Restore backup to glast-oracle04
  3. Bring up glast-oracle04 as physical standby to glast-oracle03

Tuesday 8am

  1. Short DB outage to switch glast-oracle03 to max availability mode