The GLAST processing pipeline is used to control, monitor, and keep a history of GLAST jobs run on the SLAC LSF farm (and is beginning to be used to control other farms as well, such as the IN2P3 farm in Lyon, France). A large MC run was started on December 14. For the first four days the pipeline software ran well and kept LSF running about 1100 jobs simultaneously (each job runs for approximately one hour, so this requires submitting and managing about 1100 job submissions per hour). Starting at 2am on December 18 the throughput suddenly dropped to fewer than 500 jobs and has stayed at that level ever since (the green line in the plot below shows how many jobs are running).

The pipeline works by receiving e-mails from the batch jobs: one when a job starts and another when it ends. Analysis of the pipeline log files shows that the drop-off in performance is caused by a backlog of mail messages, which builds up because the pipeline software cannot process incoming e-mail fast enough. The time to process an e-mail message is dominated by Oracle response time, in particular a call to a Java stored procedure which decides what needs to be done as a result of the e-mail (for example, submit more jobs). Note that the green line above represents what the pipeline thinks is running, so the mail backlog actually makes the pipeline believe more jobs are running than really are, and the real throughput hit is therefore worse than it appears in the plot above.

Looking with Oracle Grid Control we can see that at 2am the stored procedure starts consuming radically more CPU and doing radically more I/O.

Time history of resource usage of the stored procedure

Snapshot of query performance before 2am.

Snapshot of query performance after 2am.

Note that the number of "direct writes" before 2am is 0, and afterwards rises to 65,000. Similarly the number of "buffer gets" before 2am is 25,000,000 and afterwards shoots up to 423,000,000. We do not currently know how to interpret these numbers, or what we can do to diagnose the problem.
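
As a first diagnostic step, it may help to ask Oracle which individual SQL statements issued on behalf of the stored procedure account for the buffer gets and direct writes. The following is only a sketch: the schema name in the filter is an assumption and would need to be replaced with the account the pipeline actually uses.

SELECT sql_id,
       executions,
       buffer_gets,
       direct_writes,
       disk_reads,
       elapsed_time / 1e6 AS elapsed_seconds,
       sql_text
  FROM v$sql
 WHERE parsing_schema_name = 'PIPELINE'   -- assumption: the pipeline's Oracle account
 ORDER BY buffer_gets DESC;

Comparing the top statements from this view before and after 2am (or in the corresponding Grid Control / AWR snapshots, if they are retained) should at least tell us whether one particular statement inside the stored procedure is responsible for the jump.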

Questions

What changed at 2am on December 18?

  • Some change in the size or nature of the data we are processing?
  • Some hardware problem, perhaps related to the power problems in SCCS?
  • ?

What exactly is a "buffer get" and a "direct write" and what steps can we take to diagnose the problem?

Background notes

Tasks running during the time span of plot 1 above.

Why does accessing a page like:

http://glast-tomcat03.slac.stanford.edu:8080/Pipeline-II/streams.jsp?status=RUNNING&task=1562625

take so long?
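
One way to find out would be to locate the database session serving the page and turn on SQL trace for it while reloading the page. This is only a sketch: the filter used to pick out the Tomcat connection in V$SESSION is a guess, and the SID/SERIAL# values are placeholders.

SELECT sid, serial#, username, program, module, sql_id
  FROM v$session
 WHERE program LIKE '%java%';   -- guess: adjust to however the Tomcat connections appear

BEGIN
  DBMS_MONITOR.SESSION_TRACE_ENABLE(
      session_id => 1234,       -- placeholder: SID found above
      serial_num => 5678,       -- placeholder: SERIAL# found above
      waits      => TRUE,
      binds      => TRUE);
END;
/

The resulting trace file (in the database's user dump directory) can then be formatted with tkprof to see where the time is spent.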

Why does a query like:

select processinstance from processinstance where workingdir='/nfs/farm/g/glast/u44/MC-tasks/backgnd-GR-v13r5p6/output/backgnd-GR-v13r5p6/1.0/runMonteCarlo/086xxx/360'

take 8 seconds even though we have an index on workingdir?
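
It would be worth confirming that the optimizer is actually using the index, and that the statistics on the table are not stale. A sketch (not verified against our schema):

EXPLAIN PLAN FOR
  SELECT processinstance
    FROM processinstance
   WHERE workingdir = '/nfs/farm/g/glast/u44/MC-tasks/backgnd-GR-v13r5p6/output/backgnd-GR-v13r5p6/1.0/runMonteCarlo/086xxx/360';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Also check how recently the table was analyzed:
SELECT table_name, num_rows, last_analyzed
  FROM user_tables
 WHERE table_name = 'PROCESSINSTANCE';

If the plan shows a full table scan, or LAST_ANALYZED is far in the past while the table has grown substantially since the run started, stale statistics would be a plausible explanation for the 8-second query.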
