http://www-glast.stanford.edu/protected/mail/opsprob/

Area | Problem | Comments | Resolution |
---|---|---|---|
Monitoring | Problem loading alarm files : javax.servlet.jsp.el.ELException: An error | ? | ? |
Crawler | Slow crawl? No data files for stress test runs. | Crawler had been reporting all files as missing all day. All files were recrawled and then appeared to be OK. | ? |
Monitoring | Monitoring data was being sent to DEV | Application was poked and then seemed to work | ? |
L1 | Recon EOR timed out in 80418003.257755648.15606093 with | ? | ? |
Monitoring | DQM stuck | ? | ? |
Pipeline | Mail backlog | Processing of e-mail from L1Proc is slow. This seems to be caused by contention for locking the top level stream (a problem not seen in the much simpler MC task we used for testing) | |

Note: At 16:15 Dan installed a new version of the Pipeline II stored procedures on DEV.

Area | Problem | Comments | Resolution |
---|---|---|---|
Oracle | GLASTTREND space full | | The space was expanded |
LSF | Only 100-200 jobs running, when 500+ in queue. | LSF reported 467 jobs in "RSV" status. Neal reports that this is a problem that they have seen before and are investigating with Platform. He requests we contact him if we see it again, but it has not recurred since 14:20 on Saturday. | ? |
Xrootd | Xrootd slow | Wilko has postulated that the problem may have been that the scratch disk on the batch machine was too busy. He will ask Yemi to add monitoring of the batch scratch disks to ganglia. | |
Pipeline | Some DEV jobs failing in strange way | A race condition was discovered where the mail message from the batch job could be received before the stream had been transitioned to "QUEUED" state. | Workaround installed in DEV (PII-319@JIRA) |

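
The race described in the Pipeline entry above can be illustrated with a small sketch. This is not the workaround recorded in PII-319, whose details are not given here; it is only one common way to tolerate a message that arrives before the state transition it depends on. The class `StreamTracker`, its methods, and the stream id `"s1"` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class StreamTracker:
    """Toy model (hypothetical): hold mail that arrives before its stream is QUEUED."""

    def __init__(self):
        self.state = {}                    # stream id -> state name
        self.deferred = defaultdict(list)  # stream id -> messages that arrived early
        self.processed = []                # (stream id, message) in handling order

    def on_mail(self, stream_id, message):
        # The race: the batch job's mail can arrive before the QUEUED transition.
        # Instead of failing, park the message until the transition happens.
        if self.state.get(stream_id) != "QUEUED":
            self.deferred[stream_id].append(message)
            return False
        self.processed.append((stream_id, message))
        return True

    def mark_queued(self, stream_id):
        self.state[stream_id] = "QUEUED"
        # Replay anything that arrived early, preserving arrival order.
        for message in self.deferred.pop(stream_id, []):
            self.processed.append((stream_id, message))


tracker = StreamTracker()
early = tracker.on_mail("s1", "job started")   # arrives too soon, gets deferred
tracker.mark_queued("s1")                      # transition replays deferred mail
late = tracker.on_mail("s1", "job finished")   # handled immediately
print(early, late, tracker.processed)
```

The sketch only models the single "QUEUED" state named in the log; a real pipeline would have further states and would need the check to accept any state at or beyond QUEUED.
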

Area | Problem | Comments | Resolution |
---|---|---|---|
Oracle | GLASTTREND space full again | ? | Ian added 32 GB of space and changed the critical threshold to 90% |
Pipeline | 2 streams on DEV are waiting, even though all their PIs are finished | Dan is investigating; this is probably a result of the patch he put into DEV on Saturday | ? |

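
As a rough illustration of how a 90% critical threshold interacts with added space, the sketch below checks usage before and after a 32 GB expansion. The log does not state the actual size of GLASTTREND, so the 95 GB used and 100 GB total figures are hypothetical, as are the function names.

```python
CRITICAL_PCT = 90.0  # critical threshold mentioned in the log entry


def percent_used(used_gb, total_gb):
    """Percentage of the tablespace currently in use."""
    return 100.0 * used_gb / total_gb


def is_critical(used_gb, total_gb, threshold=CRITICAL_PCT):
    """True when usage is at or above the critical threshold."""
    return percent_used(used_gb, total_gb) >= threshold


# Hypothetical sizes: the log does not give the real tablespace size.
used_gb, total_gb = 95.0, 100.0
before = is_critical(used_gb, total_gb)         # 95% used
after = is_critical(used_gb, total_gb + 32.0)   # same data, 32 GB more space
print(before, after)
```
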
Note: At 17:31 new stored procedures were installed into the PROD pipeline.

Area | Problem | Comments | Resolution |
---|---|---|---|
Monitoring | When trying to go to the DQM page (from Glast ground) it just hangs. | The application itself was running when accessed from glast-tomcat07, but probe was hung. tomcat07 was restarted. We will add more logging to understand what is happening | |


Area | Problem | Comments | Resolution |
---|---|---|---|
Tomcat | Web applications were all down | tomcat01 had run out of permgen space | |

10:00am (Daniel Flath):
The new stored procedures (see the April 21 note) came with new queries. Until Oracle has run the new queries long enough to gather execution-plan statistics, it cannot recommend a more efficient execution plan. Two of the queries most heavily executed during dependency evaluation were identified by the [Grid Control] top SQL activity report, and Oracle is now recommending execution-plan improvements for both. The revised plan estimates indicate that we may benefit significantly from implementing the changes.
I have implemented the explain plans, shown here, and Warren will submit another run to L1 to see whether the changes made a difference. We will then repeat this process if Oracle makes further recommendations.
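
The before-and-after comparison of a query's plan can be sketched in a self-contained way. The example below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in, not Oracle or Grid Control, and the change it applies (adding an index) is only an assumed example of a plan improvement. The table `process_instance`, its columns, and the index name are hypothetical and are not taken from the pipeline schema.

```python
import sqlite3

# In-memory stand-in database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE process_instance ("
    "id INTEGER PRIMARY KEY, stream_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO process_instance (stream_id, status) VALUES (?, ?)",
    [(i % 50, "DONE") for i in range(1000)],
)

QUERY = "SELECT id, status FROM process_instance WHERE stream_id = ?"


def plan(connection, sql, params):
    """Return the plan descriptions SQLite reports for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


before = plan(conn, QUERY, (7,))  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_pi_stream ON process_instance (stream_id)")
after = plan(conn, QUERY, (7,))   # with the index: expect an index search

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text varies between SQLite versions, so the comparison is best made on whether the plan scans the table or searches through the index.
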