Reason for change

Over the last few months our existing Oracle servers have regularly been pushed to 100% CPU utilization by the load we put on them, often resulting in poor performance of the pipeline server and web applications. We have purchased two new servers, which will provide the following benefits:

  1. Support for 64 simultaneous threads of execution (up from the current 2). This will allow us to handle the expected load from many people using the web interfaces while we are also performing data processing.
  2. Faster and more reliable RAID 10 disks to improve I/O performance.
  3. Two redundant servers to provide failover in case one server fails.

The new servers are running the same OS and Oracle versions as our current production setup, so we do not anticipate any compatibility problems with the new servers. We have done extensive testing of the performance and compatibility of the new servers as detailed below.

Oracle hardware details

Testing

Scalability Testing

We have performed tests to verify that we can use the available threads in parallel. The results below, for 1 to 16 threads, show good scaling of total throughput as parallel threads are added.

Performed 64 units of work in 696,047ms using 1 threads
Performed 64 units of work in 348,767ms using 2 threads
Performed 64 units of work in 223,745ms using 3 threads
Performed 64 units of work in 169,165ms using 4 threads
Performed 64 units of work in 139,516ms using 5 threads
Performed 64 units of work in 118,259ms using 6 threads
Performed 64 units of work in 103,766ms using 7 threads
Performed 64 units of work in 89,952ms using 8 threads
Performed 64 units of work in 85,995ms using 9 threads
Performed 64 units of work in 77,336ms using 10 threads
Performed 64 units of work in 73,032ms using 11 threads
Performed 64 units of work in 69,325ms using 12 threads
Performed 64 units of work in 60,347ms using 13 threads
Performed 64 units of work in 61,213ms using 14 threads
Performed 64 units of work in 60,428ms using 15 threads
Performed 64 units of work in 51,031ms using 16 threads
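
The scaling in the figures above can be quantified as speedup (single-thread time divided by N-thread time) and parallel efficiency (speedup divided by N). The following is a minimal sketch of that arithmetic; the timings are copied from the log above, while the names `elapsed_ms` and `scaling_table` are illustrative and not part of the test harness.

```python
# Elapsed times in ms for 64 units of work, copied from the test log above
# and keyed by thread count.
elapsed_ms = {
    1: 696_047, 2: 348_767, 3: 223_745, 4: 169_165,
    5: 139_516, 6: 118_259, 7: 103_766, 8: 89_952,
    9: 85_995, 10: 77_336, 11: 73_032, 12: 69_325,
    13: 60_347, 14: 61_213, 15: 60_428, 16: 51_031,
}


def scaling_table(elapsed):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    baseline = elapsed[1]
    rows = []
    for threads in sorted(elapsed):
        speedup = baseline / elapsed[threads]  # how many times faster than 1 thread
        rows.append((threads, speedup, speedup / threads))
    return rows


for threads, speedup, efficiency in scaling_table(elapsed_ms):
    print(f"{threads:2d} threads: speedup {speedup:5.2f}x, efficiency {efficiency:4.0%}")
```

By this measure the 16-thread run is roughly 13.6 times faster than the single-thread run, a parallel efficiency of about 85%.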

Stress Testing

We have done extensive testing of the new database configuration using the pipeline II test server. We have run 15,000 real jobs and over 100,000 simulated jobs (when simulating jobs we did not submit any real batch jobs, but placed the same load on the pipeline server and database as when running real batch jobs). While the pipeline server was running we also ran various data ingest jobs to simulate the load of storing trending data into the database. The pipeline and trending ingest are the most database-intensive activities that we perform.

We were able to run 1500 simulated MC jobs continuously for prolonged periods of time, and were able to ingest one orbit's worth of trending data in a little over 1 minute.

Failover Testing

We have tested the ability to fail over to the backup database if the primary database fails, and to resync the primary and secondary databases. This procedure will be used if the primary database becomes inoperable for an extended period due to hardware or software failure. Currently failover is a manual operation requiring an Oracle admin to designate the backup server as "primary". No change is required to GLAST software to switch over to the backup database.

Switchover methodology

We propose to switch over to the new Oracle databases on Monday April 14. We will perform the following steps:

Starting midnight Sunday

  1. Full backup of glast-oracle01 (to NFS disk)

Starting 8am Monday

  1. Shut down GLAST applications
  2. Shut down glast-oracle01 database. <b>All GLAST database access will be lost at this time.</b>
  3. Restore into glast-oracle03
  4. Start glast-oracle01 as primary database
  5. Switch glast-oracle01 to be a DNS alias for glast-oracle03
  6. Back up glast-oracle03 to NFS disk
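
One way the DNS alias change in step 5 could be checked before applications are restarted is to confirm that both names resolve to the same address. The sketch below is illustrative only: the helper name `alias_points_to` is hypothetical, and it assumes the short host names resolve from the machine where the check is run.

```python
import socket


def alias_points_to(alias, target, resolve=socket.gethostbyname):
    """Return True if both host names resolve to the same IPv4 address.

    `resolve` is injectable so the check can be exercised without DNS access;
    by default it uses the standard library resolver.
    """
    return resolve(alias) == resolve(target)


# Example call (assumes both names resolve from this host):
#   alias_points_to("glast-oracle01", "glast-oracle03")
```

Comparing resolved addresses rather than inspecting DNS records keeps the check independent of how the alias is implemented.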

Approximately 8pm Monday

  1. GLAST database access restored. <b>GLAST applications can be restarted.</b>
  2. Restore backup to glast-oracle04
  3. Bring up glast-oracle04 as physical standby to glast-oracle03

Tuesday 8am

  1. Short DB outage to switch glast-oracle03 to max availability mode