Stress Testing Problem Reports

http://www-glast.stanford.edu/protected/mail/opsprob/

Friday April 18

Area	Problem	Comments	Resolution
Monitoring	Problem loading alarm files: javax.servlet.jsp.el.ELException: An error occurred while evaluating function "dmapp:getAlarms"	?	?
Crawler	Slow crawl? No data files for stress test runs.	Crawler had been reporting all files as missing all day. All files were recrawled and then appeared to be OK.	?
Monitoring	Monitoring data was being sent to DEV	Application was poked and then seemed to work	?
L1	Recon EOR timed out in 80418003.257755648.15606093 with the error "/afs/slac/g/glast/isoc/flightOps/rhel3_gcc32/ISOC_PROD/bin/isoc: line 33: /afs/slac/g/glast/isoc/flightOps/volumes/vol5/isoc_rpm/rhel3_gcc32_install_20080227/etc/rpmenv.sh: Connection timed out". Are we overwhelming isoc/flightOps AFS?	?	?
Monitoring	DQM stuck	?	?
Pipeline	Mail backlog	Processing of e-mail from L1Proc is slow. This seems to be caused by contention for locking the top level stream (a problem not seen in the much simpler MC task we used for testing)	PII-306@JIRA PII-305@JIRA PII-318@JIRA

Saturday April 19

Note: At 16:15 Dan installed a new version of the Pipeline II stored procedures on DEV.

Area	Problem	Comments	Resolution
Oracle	GLASTTREND space full		The space was expanded
LSF	Only 100-200 jobs running, when 500+ in queue.	LSF reported 467 jobs in "RSV" status. Neal reports that this is a problem that they have seen before and are investigating with Platform. He requests we contact him if we see it again, but it has not reoccurred since 14:20 on Saturday	?
Xrootd	Xrootd slow	Wilko has postulated that the problem may have been that the scratch disk on the batch machine was too busy. He will ask Yemi to add monitoring of the batch scratch disks to ganglia	GXR-28@JIRA
Pipeline	Some DEV jobs failing in a strange way	A race condition was discovered where the mail message from the batch job could be received before the stream had been transitioned to the "QUEUED" state.	Workaround installed in DEV; PII-319@JIRA
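
The race condition in the last row above can be illustrated with a minimal sketch. This is not the workaround installed in DEV, which is not described here. It shows only one common approach, assumed for illustration: buffer any batch-job message that arrives before the stream is marked QUEUED, then replay it after the transition. The class `StreamTracker`, its methods, and the message names are all hypothetical.

```python
class StreamTracker:
    """Hypothetical tracker that defers batch-job mail arriving too early."""

    def __init__(self):
        self.state = {}     # stream id -> last applied state
        self.pending = {}   # stream id -> messages received before QUEUED
        self.applied = []   # (stream id, message) in the order applied

    def mark_queued(self, stream_id):
        # Record the QUEUED transition, then replay any early messages in order.
        self.state[stream_id] = "QUEUED"
        for message in self.pending.pop(stream_id, []):
            self._apply(stream_id, message)

    def on_mail(self, stream_id, message):
        # Return True if the message was applied now, False if it was deferred.
        if stream_id not in self.state:
            self.pending.setdefault(stream_id, []).append(message)
            return False
        self._apply(stream_id, message)
        return True

    def _apply(self, stream_id, message):
        # In this sketch a message simply names the stream's new state.
        self.state[stream_id] = message
        self.applied.append((stream_id, message))


tracker = StreamTracker()
tracker.on_mail("stream-1", "RUNNING")   # arrives early, so it is deferred
tracker.mark_queued("stream-1")          # transition happens, message is replayed
tracker.on_mail("stream-1", "SUCCESS")   # applied immediately
```

Deferring rather than rejecting the early message means that no batch-job report is lost, at the cost of holding a small amount of per-stream state until the transition completes.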

Sunday April 20

Area	Problem	Comments	Resolution
Oracle	GLASTTREND space full again	?	Ian added 32 GB of space and changed the critical threshold to 90%
Pipeline	2 streams on DEV are waiting, even though all their PIs are finished	Dan is investigating; this is probably a result of the patch he put into DEV on Saturday	?

Monday April 21

Note: At 17:31 new stored procedures were installed into the PROD pipeline.

Area	Problem	Comments	Resolution
Monitoring	When trying to go to the DQM page (from Glast ground) it just hangs.	The application itself was running when accessed from glast-tomcat07, but probe was hung. tomcat07 was restarted. We will add more logging to understand what is happening	GG-22@JIRA

Tuesday April 22

Area	Problem	Comments	Resolution
Tomcat	Web applications were all down	tomcat01 had run out of permgen space	GG-23@JIRA

10:00am (Daniel Flath):

The new stored procedures (see the April 21 note) came with new queries. Until Oracle has run the new queries long enough to gather execution-plan statistics, it cannot recommend a more efficient execution plan. The Grid Control top SQL activity report identified two of the queries that are most heavily executed during dependency evaluation. Oracle is recommending execution-plan improvements for both, and the revised plan estimates indicate that we may benefit significantly from implementing the changes.

I have implemented the recommended execution plans, and Warren will submit another run to L1 to see whether the changes made a difference. We will repeat this process if there are further recommendations from Oracle.
