http://www-glast.stanford.edu/protected/mail/opsprob/

Friday April 18

Area

Problem

Comments

Resolution

Monitoring

Problem loading alarm files: javax.servlet.jsp.el.ELException: An error occurred while evaluating function "dmapp:getAlarms"

?

?

Crawler

Slow crawl? No data files for stress test runs.

Crawler had been reporting all files as missing all day. All files were recrawled and then appeared to be OK.

?

Monitoring

Monitoring data was being sent to DEV

Application was poked and then seemed to work

?

L1

Recon EOR timed out in 80418003.257755648.15606093 with:
/afs/slac/g/glast/isoc/flightOps/rhel3_gcc32/ISOC_PROD/bin/isoc: line 33:
/afs/slac/g/glast/isoc/flightOps/volumes/vol5/isoc_rpm/rhel3_gcc32_install_20080227/etc/rpmenv.sh:
Connection timed out
Are we overwhelming isoc/flightOps AFS?

?

?

Monitoring

DQM stuck

?

?

Pipeline

Mail backlog

Processing of e-mail from L1Proc is slow. This seems to be caused by contention for the lock on the top-level stream (a problem not seen in the much simpler MC task we used for testing).

PII-306@JIRA PII-305@JIRA PII-318@JIRA

Saturday April 19

Note: At 16:15 Dan installed a new version of the Pipeline II stored procedures on DEV.

Area

Problem

Comments

Resolution

Oracle

GLASTTREND space full

 

The space was expanded

LSF

Only 100-200 jobs running, when 500+ in queue.

LSF reported 467 jobs in "RSV" status. Neal reports that this is a problem they have seen before and are investigating with Platform. He asks that we contact him if we see it again, but it has not recurred since 14:20 on Saturday.

?

Xrootd

Xrootd slow

Wilko has postulated that the scratch disk on the batch machine may have been too busy. He will ask Yemi to add monitoring of the batch scratch disks to ganglia.

GXR-28@JIRA

Pipeline

Some DEV jobs failing in a strange way

A race condition was discovered in which the mail message from the batch job could be received before the stream had transitioned to the "QUEUED" state.

Workaround installed in DEV. PII-319@JIRA
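
The race condition described in the Pipeline entry above can be illustrated with a minimal sketch. This is not the workaround that was installed in DEV, whose details are not recorded here. It only shows one common way to tolerate out-of-order arrival: hold any mail that arrives before the stream is marked "QUEUED" and replay it once the transition happens. The class name, method names, and message strings are all hypothetical.

```python
from collections import defaultdict


class StreamMailBuffer:
    """Hypothetical helper: defers batch-job mail until its stream is QUEUED."""

    def __init__(self):
        self.queued = set()                 # stream ids that have reached QUEUED
        self.deferred = defaultdict(list)   # stream id -> mail held back
        self.processed = []                 # (stream id, message) in handling order

    def on_mail(self, stream_id, message):
        """Handle mail now if the stream is QUEUED, otherwise hold it."""
        if stream_id in self.queued:
            self._handle(stream_id, message)
        else:
            self.deferred[stream_id].append(message)

    def mark_queued(self, stream_id):
        """Record the QUEUED transition and replay held mail in arrival order."""
        self.queued.add(stream_id)
        for message in self.deferred.pop(stream_id, []):
            self._handle(stream_id, message)

    def _handle(self, stream_id, message):
        # Stand-in for the real state update (assumption: not shown in the log).
        self.processed.append((stream_id, message))


buffer = StreamMailBuffer()
buffer.on_mail("stream-1", "job started")    # arrives too early, so it is held
buffer.mark_queued("stream-1")               # transition happens, mail is replayed
buffer.on_mail("stream-1", "job finished")   # handled immediately
```

Holding early messages keeps the handling order the same as the arrival order, at the cost of keeping a small in-memory buffer per stream.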

Sunday April 20

Area

Problem

Comments

Resolution

Oracle

GLASTTREND space full again

?

Ian added 32 GB of space and changed the critical threshold to 90%

Pipeline

2 streams on DEV are waiting, even though all their PIs are finished

Dan is investigating; this is probably a result of the patch he put into DEV on Saturday.

?
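
As a rough illustration of the GLASTTREND entry above, the sketch below shows how a percentage-based critical threshold behaves when space is added. Only the 90% threshold and the 32 GB increment come from the log. The function name, the "greater than or equal" comparison, and the used and total sizes are assumptions made for the example, not the actual monitoring configuration.

```python
def tablespace_status(used_gb, total_gb, critical_pct=90.0):
    """Return (status, percent_used) for a tablespace.

    Assumption: the alarm fires when usage is at or above critical_pct.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    percent_used = 100.0 * used_gb / total_gb
    status = "CRITICAL" if percent_used >= critical_pct else "OK"
    return status, percent_used


# Hypothetical sizes: 60 GB used in a 64 GB tablespace.
before = tablespace_status(60, 64)
# Same usage after adding 32 GB, as in the resolution above.
after = tablespace_status(60, 64 + 32)
print(before, after)
```

With these illustrative numbers, the same amount of data moves from above the threshold to well below it once the extra space is added.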

Monday April 21

Note: At 17:31 new stored procedures were installed into the PROD pipeline.

Area

Problem

Comments

Resolution

Monitoring

When trying to go to the DQM page (from Glast ground) it just hangs.

The application itself was running when accessed from glast-tomcat07, but probe was hung. tomcat07 was restarted. We will add more logging to understand what is happening.

GG-22@JIRA

Tuesday April 22

Area

Problem

Comments

Resolution

Tomcat

Web applications were all down

tomcat01 had run out of PermGen space

GG-23@JIRA


10:00am (Daniel Flath):

The new stored procedures (see April 21 note) came with new queries. Until Oracle has run the new queries long enough to gather execution-plan statistics, it cannot recommend a more efficient execution plan. Two of the queries that are most heavily executed during dependency evaluation were identified by the [Grid Control] top SQL activity report. Oracle is recommending execution-plan improvements for both. The revised plan estimates indicate that we may benefit significantly from implementing the changes.

I have implemented the explain plans, shown here. Warren will submit another run to L1 to see whether the changes made a difference, and we will iterate over this process again if there are further recommendations from Oracle.

