Overview

Oracle crash

SCCS restarted the database in ~30 minutes. The crash was due to running out of shared pool space. The shared pool is used, among other things, to cache recently parsed queries so that they can be reused. Non-optimal queries in the DataQualityMonitoring application seem to have been responsible; the queries were fixed immediately.
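
For reference, one common cause of shared pool exhaustion is issuing many near-identical queries with literal values, since each variant is hard-parsed and cached separately; passing the values as bind variables lets a single cached statement be reused. A minimal cx_Oracle sketch of that pattern (the connection string, table and column names are hypothetical placeholders, not the actual DataQualityMonitoring queries):

    # Minimal sketch: reuse one parsed statement via bind variables instead of
    # embedding literals, so the shared pool caches a single cursor.
    # Connection string, table and column names are hypothetical.
    import cx_Oracle

    conn = cx_Oracle.connect("dqm_user/secret@//dbhost:1521/DQMDB")
    cur = conn.cursor()

    # Bad: a new literal per run id means a new hard parse and a new shared-pool entry.
    # cur.execute("SELECT time_bin, value FROM trending WHERE run_id = %d" % run_id)

    # Better: one statement text, values passed as bind variables.
    sql = "SELECT time_bin, value FROM trending WHERE run_id = :run_id"
    for run_id in (1001, 1002, 1003):
        cur.execute(sql, {"run_id": run_id})
        rows = cur.fetchall()
        print(run_id, len(rows))

    cur.close()
    conn.close()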

We learned that we have to e-mail db-admin to report database outages.

Web Server Crashes

There were a few (5?) crashes of the web server on which the DataQualityMonitoring application is deployed, all due to the server running out of memory.

The DataQualityMonitoring application needs to be fine-tuned:

When multiple runs are selected, histograms are added and trending quantities are fetched from the database for the resulting time period.
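
One way to keep the memory footprint bounded when many runs are selected is to accumulate the histograms one run at a time instead of holding every run's histograms in memory at once. A minimal sketch of that idea, assuming each run's histograms can be obtained as numpy arrays (the loader function, histogram name and run ids are hypothetical):

    # Minimal sketch: accumulate histograms one run at a time so that at most one
    # run's histograms are resident in memory, regardless of how many runs are selected.
    # load_histograms() and the run list are hypothetical placeholders.
    import numpy as np

    def load_histograms(run_id):
        """Placeholder: return a dict of histogram name -> bin contents for one run."""
        rng = np.random.default_rng(run_id)
        return {"hit_multiplicity": rng.poisson(5.0, size=100).astype(float)}

    def sum_histograms(run_ids):
        totals = {}
        for run_id in run_ids:          # stream runs instead of loading them all
            for name, bins in load_histograms(run_id).items():
                if name in totals:
                    totals[name] += bins
                else:
                    totals[name] = bins.copy()
        return totals

    summed = sum_histograms(range(2000, 2016))   # e.g. 16 selected orbits
    print({name: bins.sum() for name, bins in summed.items()})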

The first crash happened 2.5 hours into the test on Tuesday morning. The applications were rebalanced (i.e. the DataQualityMonitoring application was isolated) and the memory was increased to 1.5 GB. Three crashes happened within an hour during the last shift (presumably because shifters liked to see the trended data over the 16 orbits (at 15-second intervals?)).

Overall, the application was heavily used by shifters.

Digi Trending Ingestion Failure (2)

This was due to a deadlock when digi ingestion overlapped with recon ingestion of the previous processing run.
It is not really a deadlock in the sense that the Digi and Recon data are non-overlapping, but they do share the same tables.
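
A standard way to avoid this kind of contention when two ingestion jobs write to the same tables is to have every job acquire its locks in one agreed-upon order (or to serialize the two ingestions outright). The toy sketch below illustrates the ordering idea with Python threads; it is not the actual ingestion code, and the table names are made up:

    # Toy sketch: two writers (digi and recon ingestion) touch the same two "tables".
    # Acquiring the locks in a single, agreed-upon order prevents the circular wait
    # that produces a deadlock. Table names and workloads are illustrative only.
    import threading
    import time

    table_locks = {"TRENDING_SUMMARY": threading.Lock(),
                   "TRENDING_DETAIL": threading.Lock()}

    def ingest(job_name, tables_needed):
        # Sort the tables so every job locks them in the same order.
        for table in sorted(tables_needed):
            table_locks[table].acquire()
        try:
            print(job_name, "writing", sorted(tables_needed))
            time.sleep(0.1)            # pretend to do the inserts
        finally:
            for table in sorted(tables_needed, reverse=True):
                table_locks[table].release()

    digi  = threading.Thread(target=ingest, args=("digi",  ["TRENDING_DETAIL", "TRENDING_SUMMARY"]))
    recon = threading.Thread(target=ingest, args=("recon", ["TRENDING_SUMMARY", "TRENDING_DETAIL"]))
    digi.start(); recon.start()
    digi.join(); recon.join()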

Trending problems

We need experts to sit down and compare the trending plots from the web to the ones produced with desktop tools by opening the ROOT files (there are only 30K of them!).
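
Part of that comparison could be scripted rather than done entirely by hand, e.g. by spot-checking a random sample of the files. A rough PyROOT sketch (the directory, file pattern and histogram name are guesses and would need to be replaced with the real ones):

    # Rough sketch: pull one trending quantity out of a random sample of ROOT files
    # so its mean can be compared against the value shown on the web plots.
    # The directory, file pattern and histogram name are hypothetical.
    import glob
    import random
    import ROOT

    files = glob.glob("/data/dqm/trending/*.root")
    for path in random.sample(files, min(20, len(files))):
        f = ROOT.TFile.Open(path)
        if not f or f.IsZombie():
            print("could not open", path)
            continue
        h = f.Get("hit_multiplicity")        # hypothetical histogram name
        if h:
            print(path, "entries =", h.GetEntries(), "mean =", h.GetMean())
        f.Close()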

Pipeline

NFS Problem

The half pipe for orbit 2 failed a few times due to NFS problems.
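
If the NFS failures are transient, wrapping the file access in a short retry loop with a backoff may be enough to keep the half pipe from failing outright. A minimal sketch (the retry count, delays and path are arbitrary placeholders):

    # Minimal sketch: retry a flaky NFS read a few times with increasing delay
    # before giving up. Retry count, delays and path are arbitrary choices.
    import time

    def read_with_retry(path, attempts=3, delay=5.0):
        for attempt in range(1, attempts + 1):
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError as err:           # stale handle, I/O error, etc.
                print("attempt %d failed for %s: %s" % (attempt, path, err))
                if attempt == attempts:
                    raise
                time.sleep(delay * attempt)  # back off a bit more each time

    data = read_with_retry("/nfs/pipeline/orbit2/input.dat")  # hypothetical path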

Safari Compatibility

We might have to spend some time making sure that the JavaScript we use is compatible with Safari.

Things we need to fix afterwards

In no particular order...

What else should we have learned?