Oracle crash

The crash was caused by an "out of shared pool space" error. The shared pool is used, among other things, to cache recent queries so that they can be reused. Non-optimal queries in the DataQualityMonitoring application appear to have been responsible; they were fixed immediately.

We learned that we have to e-mail db-admin.

  • Learn to use GRIDControl, an insanely complicated tool to monitor Oracle. We can use it to find out which queries are heaviest on the system and how to optimize them.
  • RAC (Real Application Clusters) system: for load balancing and failover.

Web Server Crashes

There were a few (5?) crashes of the web server on which the DataQualityMonitoring application is deployed, all due to the server running out of memory.

The DataQualityMonitoring application needs to be fine-tuned. So far it provides:

  • 30,031 trending quantities, stored at 15-second intervals
  • 31,525 histograms

When multiple runs are selected, histograms are added and trending quantities are fetched from the database for the resulting time period.
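The multi-run behaviour described above can be sketched roughly as follows. This is only an illustration, not the application's actual code: the function names `add_histograms` and `combined_period` are hypothetical, histograms are assumed to be plain lists of bin contents with identical binning, and the "resulting time period" is assumed to span from the earliest run start to the latest run end.

```python
def add_histograms(histograms):
    """Sum bin contents across histograms that share the same binning."""
    histograms = list(histograms)
    if not histograms:
        raise ValueError("need at least one histogram")
    n_bins = len(histograms[0])
    if any(len(h) != n_bins for h in histograms):
        raise ValueError("all histograms must have the same number of bins")
    # zip(*...) groups the i-th bin of every histogram together.
    return [sum(bins) for bins in zip(*histograms)]


def combined_period(runs):
    """Return (earliest start, latest end) for a list of (start, end) runs.

    Assumption: the trending query covers the whole span, gaps included.
    """
    return (min(start for start, _ in runs), max(end for _, end in runs))


# Two hypothetical runs, each with a 3-bin histogram and a time range in seconds.
total = add_histograms([[1, 2, 3], [4, 5, 6]])
period = combined_period([(100, 200), (150, 400)])
```

Under these assumptions the number of trending samples fetched grows with the length of the combined period, which is consistent with the memory pressure described in this section.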

The first crash happened 2.5 hours into the test on Tuesday morning. Applications were rebalanced (i.e. the DataQualityMonitoring application was isolated) and the memory was increased to 1.5 GB. Three crashes happened within an hour during the last shift (presumably because shifters liked to view the trended data over the 16 orbits, at 15-second intervals?).

  • Clustering: Tomcat supports clustering.
  • Aggregate old data to bigger time intervals?
  • Prevent users from getting more data points than pixels on the screen? (6 hours)
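The last two ideas (aggregating into bigger time intervals and capping the number of points at the screen width) amount to the same operation: averaging samples into time buckets. A minimal sketch follows, assuming trending data arrives as time-sorted `(timestamp, value)` pairs; the function name `downsample` and the 800-pixel plot width are hypothetical, and plain averaging is just one possible aggregation (min/max envelopes would preserve spikes better).

```python
from statistics import mean


def downsample(points, max_points):
    """Average time-sorted (timestamp, value) pairs into at most max_points buckets."""
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    if len(points) <= max_points:
        return list(points)  # already few enough points

    start, end = points[0][0], points[-1][0]
    if end == start:
        # All samples share one timestamp: collapse to a single averaged point.
        return [(start, mean(v for _, v in points))]

    width = (end - start) / max_points  # bucket width in timestamp units
    buckets = {}
    for t, v in points:
        # Clamp so the final sample falls into the last bucket.
        index = min(int((t - start) / width), max_points - 1)
        buckets.setdefault(index, []).append((t, v))

    # One averaged point per non-empty bucket, in time order.
    return [
        (mean(t for t, _ in bucket), mean(v for _, v in bucket))
        for _, bucket in sorted(buckets.items())
    ]


# Example: a 6-hour window at 15-second intervals is 1,440 samples.
samples = [(i * 15, 1.0) for i in range(6 * 3600 // 15)]
reduced = downsample(samples, 800)  # 800 is a hypothetical plot width in pixels
```

Doing this reduction in the database query or on the server, rather than in the browser, is what would actually relieve memory pressure on the web server.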

Digi Trending Ingestion Failure (2)

This was due to a deadlock that occurred when digi ingestion overlapped with recon ingestion from the previous processing run. It is not really a deadlock, in the sense that Digi and Recon data do not overlap, but they share the same tables.

  • indexing might fix the problem
  • catching exceptions and waiting some time will fix it for sure
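The catch-and-wait approach can be sketched as a retry loop. This is a hedged illustration only: `TransientIngestError` is a hypothetical stand-in for whatever exception the database driver raises on a deadlock, `retry_with_backoff` is an invented name, and the doubling delay with random jitter is one common choice rather than something decided in these notes.

```python
import random
import time


class TransientIngestError(Exception):
    """Hypothetical stand-in for the driver's deadlock/lock-conflict error."""


def retry_with_backoff(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The wait doubles on each failure and includes random jitter so that two
    competing ingestions are unlikely to retry at the same moment.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientIngestError:
            if attempt == attempts:
                raise  # give up and surface the error after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Only the transient error is caught, so genuine ingestion bugs still fail immediately instead of being retried.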

Pipeline

Things we need to fix afterwards

  • Stop using SLACDEV database
    • Rationalize use of dev/prod etc.; decide whether we need other configuration options
  • Remove duplication between data processing page and other apps
  • Improve data catalog interface, especially for real data
    • Get Warren to create groups rather than folders
    • Look at GUI for data catalog
  • Get Data Server tied into L1Proc
  • Get portal working, at least for items like data processing page, GRB summary, etc.
    • Generate RSS feeds from LogWatcher, Ops Log, etc. to display on portal page
  • Look into Tomcat clustering
  • Ability to monitor all Tomcat servers/applications from one page (and maybe restart them)
  • Figure out why automatic generation of Tomcat configuration on glast-win01,02 did not work
  • Integration between monitoring tools and Ops Log
    • Ability to move plots to ops log and comment on them
    • Ability to find all recent comments on a plot
  • Make Ops Log use same login system as everything else
  • Cross App trending