...
The pipeline works by receiving e-mails from the batch jobs, one when a job starts and another when it ends. Analysis of the pipeline log files shows that the drop-off in performance is caused by a backlog of mail messages, which builds up because the pipeline software cannot process incoming e-mail fast enough. The time to process an e-mail message is dominated by Oracle response time, in particular a call to a Java stored procedure which decides what needs to be done as a result of the e-mail (for example, submit more jobs). Note that the green line above represents what the pipeline thinks is running, so the mail backlog actually makes the pipeline believe more jobs are running than really are, and so the real throughput hit is worse than it appears in the diagram above.
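As a rough cross-check that the time really goes into the stored procedure rather than the mail handling itself, one could time a single call directly from SQL*Plus. This is only a sketch: the procedure name and argument below are hypothetical placeholders, since the real signature lives in the pipeline schema.

set timing on

-- PIPELINE.DECIDE_NEXT_ACTION and its argument are hypothetical stand-ins
-- for the Java stored procedure that handles one incoming job e-mail.
call PIPELINE.DECIDE_NEXT_ACTION('<id taken from a backlogged e-mail>');

set timing off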
Looking with Oracle Grid Control we can see that at 2am the stored procedure seems to start taking radically more CPU and using radically more I/O.
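Independent of Grid Control, one way to see which statements are accumulating that cost is to query the shared pool directly; V$SQL keeps cumulative CPU time, buffer gets and direct writes per cursor. A minimal sketch (run as a privileged user; it only shows totals since the cursors were loaded, so the 2am window itself would need AWR or Statspack snapshots):

-- top 10 statements in the shared pool by cumulative CPU time
select *
  from (select sql_id,
               executions,
               round(cpu_time / 1e6)     as cpu_sec,
               round(elapsed_time / 1e6) as elapsed_sec,
               buffer_gets,
               direct_writes,
               substr(sql_text, 1, 60)   as sql_text
          from v$sql
         order by cpu_time desc)
 where rownum <= 10;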
...
- Some change in the size or nature of the data we are processing?
- Some hardware problem, perhaps related to the power problems in SCCS?
- ?
What exactly are a "buffer get" and a "direct write", and what steps can we take to diagnose the problem?
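One concrete diagnostic step, assuming we can reproduce the processing of a single e-mail by hand, would be to snapshot the session statistics that correspond to those counters before and after one invocation (as far as I understand it, 'consistent gets' plus 'db block gets' make up the buffer gets, and 'physical writes direct' covers the direct writes):

-- run in the same session before and after the suspect call and compare
select n.name, s.value
  from v$mystat   s
  join v$statname n on n.statistic# = s.statistic#
 where n.name in ('consistent gets',
                  'db block gets',
                  'physical writes direct',
                  'physical writes direct (lob)');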
Background notes
Tasks running during the time span of plot 1 above.
Why does accessing a page like:
http://glast-tomcat03.slac.stanford.edu:8080/Pipeline-II/streams.jsp?status=RUNNING&task=1562625
take so long?
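One way to attack this (a sketch, assuming the Tomcat connections show up as ordinary database sessions) is to load the page and, while it hangs, look at what SQL the corresponding session is running:

-- active sessions and the SQL they are currently executing;
-- the Oracle thin JDBC driver typically reports program 'JDBC Thin Client'
select s.sid, s.serial#, s.program, s.module, s.status,
       q.sql_id, substr(q.sql_text, 1, 80) as sql_text
  from v$session s
  left join v$sql q
    on q.sql_id = s.sql_id
   and q.child_number = s.sql_child_number
 where s.type = 'USER'
   and s.status = 'ACTIVE';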
Why does a query like:
select processinstance from processinstance where workingdir='/nfs/farm/g/glast/u44/MC-tasks/backgnd-GR-v13r5p6/output/backgnd-GR-v13r5p6/1.0/runMonteCarlo/086xxx/360'
take 8 seconds even though we have an index on workingdir?
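A first step (sketch below, to be run against the pipeline schema) would be to check whether the optimizer actually uses the index on workingdir for this predicate or falls back to a full table scan:

-- capture and display the execution plan for the slow lookup
explain plan for
select processinstance from processinstance where workingdir='/nfs/farm/g/glast/u44/MC-tasks/backgnd-GR-v13r5p6/output/backgnd-GR-v13r5p6/1.0/runMonteCarlo/086xxx/360';

select * from table(dbms_xplan.display);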