Infrastructure Shift Log

This page is used to track issues arising during Infrastructure Shifts.

Please check the list of known problems.

July 1 (Canada Day)

19:50 - Richard: Data Processing page went unresponsive. Nagios complained about tomcat11. I restarted it manually, but no joy. Paged Tony.

The data processing app is failing with:

javax.servlet.jsp.el.ELException: An error occurred while evaluating function "dp:getSummaryTable"
org.apache.commons.el.Logger.logError(Logger.java:484)
org.apache.commons.el.Logger.logError(Logger.java:545)
org.apache.commons.el.FunctionInvocation.evaluate(FunctionInvocation.java:174)
org.apache.commons.el.ExpressionEvaluatorImpl.evaluate(ExpressionEvaluatorImpl.java:263)
org.apache.commons.el.ExpressionEvaluatorImpl.evaluate(ExpressionEvaluatorImpl.java:190)
org.apache.jasper.runtime.PageContextImpl.proprietaryEvaluate(PageContextImpl.java:924)
org.apache.jsp.tag.web.queryies.downlinks_tag._jspx_meth_c_005fset_005f0(downlinks_tag.java:98)
org.apache.jsp.tag.web.queryies.downlinks_tag.doTag(downlinks_tag.java:72)
org.apache.jsp.summary_jsp._jspx_meth_query_005fdownlinks_005f0(summary_jsp.java:623)
org.apache.jsp.summary_jsp._jspService(summary_jsp.java:154)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:331)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.freehep.webutil.filter.NotAvailableFilter.doFilter(NotAvailableFilter.java:122)
org.glast.base.application.web.filter.ApplicationFilter.doFilter(ApplicationFilter.java:132)
org.glast.base.web.refresh.filter.RefreshFilter.doFilter(RefreshFilter.java:121)
org.glast.base.web.multipart.filter.MultipartFilter.doFilter(MultipartFilter.java:37)
com.opensymphony.module.sitemesh.filter.PageFilter.parsePage(PageFilter.java:118)
com.opensymphony.module.sitemesh.filter.PageFilter.doFilter(PageFilter.java:52)
org.glast.base.web.preferences.filter.PreferencesFilter.doFilter(PreferencesFilter.java:47)
org.glast.base.web.datasource.DataSourceFilter.doFilter(DataSourceFilter.java:118)
org.glast.base.web.login.filter.LoginFilter.doFilter(LoginFilter.java:127)

The failure also seems to leak database connections: the app eventually runs out of DB connections and the tomcat monitor starts complaining.
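One common reason for a leak of this kind is that a connection is only released on the success path, so every exception strands one connection. Whether dp:getSummaryTable actually does this has not been confirmed. A minimal sketch of the safer shape, written in Python with sqlite3 standing in for the real Java/Oracle stack (TrackingConnection, fetch_rows and connect are hypothetical names used only for illustration):

```python
import sqlite3


class TrackingConnection(sqlite3.Connection):
    """sqlite3 connection that remembers whether close() was called."""

    was_closed = False

    def close(self):
        self.was_closed = True
        super().close()


def fetch_rows(connect, sql):
    """Run one query and always release the connection, even on error."""
    conn = connect()
    try:
        return conn.execute(sql).fetchall()
    finally:
        # Runs on success and on failure, so a failing query cannot leak.
        conn.close()


opened = []


def connect():
    # Stand-in for borrowing a connection from the pool (assumption).
    conn = sqlite3.connect(":memory:", factory=TrackingConnection)
    opened.append(conn)
    return conn


failed = False
try:
    # The query fails, as the summary query does in the log above.
    fetch_rows(connect, "select * from no_such_table")
except sqlite3.OperationalError:
    failed = True
```

In the Java web app the equivalent would be closing the Connection (and its statements) in a finally block.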

In the tomcat logs I also see:

java.sql.SQLException: ORA-01427: single-row subquery returns more than one row

Yikes, I have extracted the failing SQL and it is:

giant.sql

select downlinkid, time, filename, nfiles, percentDone, percentFailed, runid, status, nevts, l0key, moot_alias as runintent, runstatus, runquality,
GLAST_DP_TEST.GLAST_UTIL.GetTimeFromEpochMS(evtutc0) as time0,
GLAST_DP_TEST.GLAST_UTIL.GetTimeFromEpochMS(evtutc1) as time1,
GLAST_DP_TEST.PII.GetStreamProgress('HalfPipe',downlinkid) as halfPipeProgress,
( select GLAST_DP_TEST.PII.getStreamProgress(stream) from GLAST_DP_TEST.stream
join GLAST_DP_TEST.task using (task) where parentstream=GLAST_DP_TEST.PII.GetMostRecentStreamPK('L1Proc',downlinkid)
and taskname='doRun' and streamId=runid ) as l1ProcProgress,
( select stream from GLAST_DP_TEST.stream
join GLAST_DP_TEST.task using (task) where parentstream=GLAST_DP_TEST.PII.GetMostRecentStreamPK('L1Proc',downlinkid)
and taskname='doRun' and streamId=runid ) as l1ProcStreamId,
(select count(1) || ':' || sum(case when lvl_fk=1 then 1 else 0 end) || ':' || sum(case when lvl_fk=2 then 1 else 0 end) || ':' || sum(case when lvl_fk=3 then 1 else 0 end) || ':' || sum(case when lvl_fk=4 then 1 else 0 end)
from isoc_flight.netlog_evnt where tgt = filename and tevnt >= ( timestamp'1970-01-01 00:00:00' + numtodsinterval( time, 'SECOND' ) ) ) as fastCopyLogs,
( select count(1) || ':' || sum(case when var.varname = 'L1_Alarm_undefined' then to_number(var.value) else 0 end) || ':' || sum(case when var.varname = 'L1_Alarm_error' then to_number(var.value) else 0 end) || ':' || sum(case when var.varname = 'L1_Alarm_warning' then to_number(var.value) else 0 end) || ':' || sum(case when var.varname = 'L1_Alarm_clean' then to_number(var.value) else 0 end)
from GLAST_DP_TEST.PROCESSINSTANCEVAR var
join GLAST_DP_TEST.processinstance processinstance on ( processinstance.processinstance = var.processinstance )
join GLAST_DP_TEST.stream stream on (stream.stream = processinstance.stream)
where stream.streamid = runid and stream.parentstream=GLAST_DP_TEST.PII.GetMostRecentStreamPK('L1Proc',downlinkid) and var.varname like 'L1_Alarm%' ) as l1ProcLogs,
( select sum(case when datasetdatatype = 'FASTMONHIST' then 1 else 0 end) || ':' || sum(case when datasetdatatype = 'RECONHIST' then 1 else 0 end) || ':' ||
sum(case when datasetdatatype = 'DIGIHIST' then 1 else 0 end) || ':' || sum(case when datasetdatatype = 'CALHIST' then 1 else 0 end)
from GLAST_DP_TEST.dataset join GLAST_DP_TEST.datasetmetanumber downlink on ( downlink.dataset = dataset.dataset and downlink.metaname = 'nDownlink' )
where runmin = runid and downlink.metavalue = downlinkid ) as datasets,
GLAST_DP_TEST.PII.GetStreamProgress('GRB_blind_search',downlinkid) as grbSearchProgress
from (select time, filename, percentDone, percentFailed, nfiles, l0key, downlinkid from isoc_flight.fcopy_incoming incoming, (
select max(incoming.incoming_pk) l0key, count(icdfile.filename) nfiles, max(incoming.treceive) time, max(downlink_id) downlinkid,
sum(case when jobstate_fk = 7 then 1 else 0 end)/count(icdfile.filename)*100 percentDone,
sum(case when jobstate_fk = 9 then 1 else 0 end)/count(icdfile.filename)*100 percentFailed
from isoc_flight.fcopy_incoming incoming join isoc_flight.fcopy_icdfile icdfile on ( icdfile.incoming_fk=incoming.incoming_pk )
left outer join isoc_flight.fcopy_l0dispatch l0dispatch on (l0dispatch.incoming_fk = incoming.incoming_pk )
where treceive > ? and treceive < ? and icdfile.filename like '%VC08%' group by icdfile.incoming_fk)
where incoming.incoming_pk = l0key and downlinkid is not null ) left outer join
( select id rundownlinkid, startedat runid, status, nevts, evtutc0, evtutc1, moot_alias from isoc_flight.glastops_DOWNLINK downlink
join isoc_flight.glastops_DOWNLINK_ACQSUMMARY bridge on downlink.ID=bridge.DOWNLINK_ID
join isoc_flight.glastops_ACQSUMMARY acqsum on ( acqsum.scid = bridge.scid and acqsum.startedat = bridge.startedat)
) rundownlink on (downlinkid = rundownlink.rundownlinkid )
left outer join ( select r.runid as runqualityid, r.runstatus as runstatus, r.runquality as runquality from GLAST_DP_TEST.Run r ) runq on (runq.runqualityid = rundownlink.runid ) order by time desc
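
ORA-01427 means that a subquery used as a single value returned more than one row. In the query above, the only scalar subqueries without an aggregate are l1ProcProgress and l1ProcStreamId; the others use count/sum and so always return exactly one row. A likely (but unverified) trigger is therefore two doRun streams sharing the same parentstream and streamId. The sketch below shows the kind of duplicate check that would confirm this. It uses Python with sqlite3 and toy tables reduced to the columns the subqueries use; the table contents and the find_duplicate_streams helper are made up for illustration.

```python
import sqlite3


def find_duplicate_streams(conn):
    """Return (parentstream, streamid, n) for keys matched by several doRun streams."""
    sql = """
        select s.parentstream, s.streamid, count(*) as n
        from stream s
        join task t on t.task = s.task
        where t.taskname = 'doRun'
        group by s.parentstream, s.streamid
        having count(*) > 1
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table task (task integer primary key, taskname text);
    create table stream (
        stream integer primary key,
        task integer,
        parentstream integer,
        streamid integer
    );
    insert into task values (1, 'doRun');
    -- Two streams with the same (parentstream, streamid): the failing case.
    insert into stream values (10, 1, 100, 5001);
    insert into stream values (11, 1, 100, 5001);
    -- A well-behaved stream with a unique key.
    insert into stream values (12, 1, 101, 5002);
    """
)
duplicates = find_duplicate_streams(conn)
```

The same group by / having query, pointed at GLAST_DP_TEST.stream and GLAST_DP_TEST.task, would list the offending keys on the real database.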

June 29

3:38 pm Restarted glast-tomcat07. Data Quality Monitoring was unresponsive.

RunQuality Exception

Trying to change the run quality for run 236375554 we get:

org.apache.jasper.JasperException: javax.servlet.jsp.JspException:
                            insert into runMetaString (runid,metaname,metavalue)
                            values (?,?,?)



                        : ORA-00001: unique constraint (GLAST_DP_TEST.UNQ_RUNMETASTRING) violated

        org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:355)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        com.opensymphony.module.sitemesh.filter.PageFilter.parsePage(PageFilter.java:118)
        com.opensymphony.module.sitemesh.filter.PageFilter.doFilter(PageFilter.java:52)
        org.glast.base.web.datasource.DataSourceFilter.doFilter(DataSourceFilter.java:91)
        org.glast.base.web.login.filter.LoginFilter.doFilter(LoginFilter.java:127)
        org.glast.base.web.checkcookies.filter.CheckCookiesFilter.doFilter(CheckCookiesFilter.java:41)
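
ORA-00001 on this insert suggests that the run already has a row for the same key, so a second quality change collides with the first. Assuming UNQ_RUNMETASTRING covers (runid, metaname), which the log does not confirm, updating the existing row and inserting only when none exists avoids the violation. A minimal sketch in Python with sqlite3 standing in for Oracle; set_run_meta, the 'runQuality' name and the values are hypothetical.

```python
import sqlite3


def set_run_meta(conn, runid, metaname, metavalue):
    """Update the existing row if there is one, otherwise insert a new row."""
    cur = conn.execute(
        "update runMetaString set metavalue = ? where runid = ? and metaname = ?",
        (metavalue, runid, metaname),
    )
    if cur.rowcount == 0:
        # No row matched the key, so this is the first value for it.
        conn.execute(
            "insert into runMetaString (runid, metaname, metavalue) values (?, ?, ?)",
            (runid, metaname, metavalue),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table runMetaString ("
    " runid integer, metaname text, metavalue text,"
    " unique (runid, metaname))"  # assumed shape of UNQ_RUNMETASTRING
)
set_run_meta(conn, 1, "runQuality", "Good")
set_run_meta(conn, 1, "runQuality", "Bad")  # second change no longer collides
rows = conn.execute(
    "select runid, metaname, metavalue from runMetaString"
).fetchall()
```

On Oracle a single MERGE statement could do the same in one step.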

June 27

11:55am Restarted glast-tomcat07. Data Quality Monitoring was unresponsive.

12:25am: OpsLog and Monitoring/Trending web-apps interfering with each other

June 26

Outstanding Issues:

David Decotigny requests we get calibration trending working again.

10:18pm Web servers are working again:

Mail from Antonio Ceseracciu

The root problem was a software crash on rtr-slb1.
I just came in after Mike called me on the phone and power cycled the machine. It has come up fine and all services should now be restored.

Hogaboom, Michael

Update on this...
As Sean says, the servers are up. swh-servcore1 shows both ports connected vlan 1854, full 1000. They are on 7/5 and 7/6.

I called Antonio and he saw that a small router had a hiccup/crashed??
He is going to SLAC now to reboot the router.

10:06pm The problem is believed to be a "small network switch between our servers and the main switch". They are going to try power cycling it.

glast-win01 and glast-win02 became unreachable at 9:20pm. The SCCS on call technical contact has been paged (926 HELP). We just received the following e-mail:

Mail from Sean Sink

I just spoke with Mike, the servers are physically up but there seems to be a network problem. Mike is working on contacting the networking team to investigate on their end.

Had to restart the PROD data crawler once (because Nagios and http://glastlnx20.slac.stanford.edu:5080 were complaining). It looks like the problem was caused by MC writing to the DEV version of xrootd on glastlnx22.
Problems with Run Quality monitoring reported yesterday are now fixed.

June 25

From OpsLog

Watching plots on the web is right now very slow...

**Comment by David Paneque on Thursday, June 26, 2008 5:20:39 AM UTC
It is so from both, the shifter computers and our laptops.
**Comment by David Paneque on Thursday, June 26, 2008 5:21:33 AM UTC
now it is fast again...
**Comment by Tony Johnson on Thursday, June 26, 2008 5:55:18 AM UTC
Trending plots, data quality plot, all plots? One possibility is that xrootd load slows down plotting (some plots are read using xrootd). I noticed there were some ASP and MC jobs running in the pipeline around this time which may have been slowing things down.
**Comment by Tony Johnson on Thursday, June 26, 2008 6:07:59 AM UTC
Indeed in the DataQualityMonitoring log file around this time I see lots of messages about waiting for response from xrootd.

Outstanding Issues:

ELG-18@jira OpsLog session times out immediately after login.
[] Some plots in data monitoring are inaccessible (the message says the plot is not available), even while the same plots are accessible from another browser at the same time. Maybe this is related to the control room workstations running Firefox 1.5 (seems unlikely)? See discussion in OpsProbList.
[] Jim reported that several of his scripts were running and were killed when the pipeline server was restarted. We need to understand why his data catalog queries are still taking so long (>10 minutes) to run.
[GRQ-1] Run quality status is not being updated even though change entries are made in the history table. I looked at the code but could not see anything obviously wrong; maybe I am too sleepy.
[LONE-72] Attach intent as meta-data to files
[LONE-71] Digi merging loses IObfStatus - results in Digi files being marked as ContentError in Data Catalog

18:24 PDT A new version (1.2.4) of the pipeline has been installed. See https://jira.slac.stanford.edu/browse/SSC-74

TelemetryTrending application crashes

The telemetry trending application had to be restarted a couple of times in the last 24 hours. The crashes are caused by large memory usage when tabulating the data (IOT-87@jira).

I am working on a fix.

In the meantime, monitor the memory usage of tomcat12 from the Server Monitoring application.
When memory gets close to 90%, try clicking on "Force GC" (if you don't see this link, you need to be added to the ServerMonitoringAdmin group). If garbage collection does not reduce the memory usage, a crash might be imminent.

Max

4am UTC

Tony: Waiting for data.... Registered my pager - sending to number@amsmsg.net seems to work (alias@MyAirMail.com does not)

Two outstanding (new) issues: PFE-172@jira IFO-24@jira
