This page is used to track issues arising during Infrastructure Shifts.

...


* [How to do infrastructure shifts ...]
* [Weekly Shift Schedule|http://glast-ground.slac.stanford.edu/ShiftSchedule/weeklyShiftSchedule.jsp]
* [Contact List|http://glast-ground.slac.stanford.edu/GroupManager/protected/contactList.jsp]
* [How To Fix|SASHOW2FIX:How to Fix - Home]

Please check the list of [known problems|Known Problems]

h2. July 10

5:30 am  TelemetryTrending problem fetching data from XML-RPC
{noformat}
http://glastlnx24:5441: org.apache.xmlrpc.XmlRpcException: Failed to create input stream:
  Connection reset
or
  Connection refused

{noformat}
This is a problem with the XML-RPC Python server and should be brought to the attention of the FO shifter.
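When this error appears, the first thing to establish is whether the XML-RPC server is answering at all. A minimal sketch of such a probe (the `glastlnx24:5441` URL is the one from the error above; `system.listMethods` is a standard introspection call that many, but not all, XML-RPC servers expose):

```python
import xmlrpc.client

def probe_xmlrpc(url):
    """Return True if the XML-RPC server at `url` answers at all,
    False if the connection is reset/refused or the host is unreachable."""
    try:
        xmlrpc.client.ServerProxy(url).system.listMethods()
        return True
    except xmlrpc.client.Fault:
        return True   # server answered, just without introspection support
    except (OSError, xmlrpc.client.ProtocolError):
        return False  # connection reset/refused, DNS failure, HTTP error

if __name__ == "__main__":
    # glastlnx24:5441 is the server named in the error message above
    print(probe_xmlrpc("http://glastlnx24:5441"))
```

If the probe fails, the python server itself needs attention rather than the TelemetryTrending web app.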

...



19:20 (Richard, for Tony)

...


h2. {anchor:July 3}July 3

01:00 Restarted tomcat07 due to [Data Quality Monitoring Unresponsive|Known Problems#Data Quality Monitoring Unresponsive].

01:00 PM  New DQM ingestion script put into production to avoid Oracle slowdowns. If there are any problems, please contact Max.

h2. {anchor:July 2}July 2

02:55 Restarted tomcat07 due to [Data Quality Monitoring Unresponsive|Known Problems#Data Quality Monitoring Unresponsive].

{panel}
Batch jobs were taking a long time, apparently being slow, but had in fact failed with no log files produced. This was tracked down to DNS failures on the balis, which have been reset (reported by Neal Adama at 18:15). Unix-admin ticket \[SLAC #120230\].
{panel}


h2. July 5

7:10 pm I restarted tomcat12 since the monitoring programs were complaining and ServerMonitoring showed it missing - Tony

h2. {anchor:July 4}July 4

6:00 pm  Old DQM ingestion script put back into production. The new script worked fine for some 24 hours, and then we started having "idle" sessions locking out all the following ones; some 60 of them were waiting. Killing the first one did not solve the problem, as the next one went into the "idle" state. We decided to kill all the waiting sessions and put the old script back into production. The failed ingest scripts are being rolled back.
{panel}
All the sessions have been killed off. Is it the same script that ran successfully yesterday? The database was waiting on a SQL*Net message from client, which usually means a process has gone idle. The two processes both went idle after issuing
{noformat}
insert into DQMTRENDDATAID (dataid, loweredge, upperedge, runversionid) values(:1,:2,:3,:4)
{noformat}
There was no further action being taken by either session, such as reads, execute counts, etc. So either the process was idle or it didn't have enough resources to even attempt what was to be executed next. I think for now the old script is probably best to run. It would be nice if serialization wasn't done via locking. It would also be good if I could adjust a couple of database parameters, which requires a short shutdown.
{panel}
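For the record, the bookkeeping used to spot such blockers can be sketched as follows. This is a hypothetical helper, not the actual ingestion code: it assumes session records shaped like rows from Oracle's v$session view (`sid`, `event`, `blocking_session`), and flags sessions sitting idle on "SQL*Net message from client" while other sessions wait behind them:

```python
def find_idle_blockers(sessions):
    """Given session records (dicts with 'sid', 'event' and
    'blocking_session' keys, as one might fetch from Oracle's
    v$session view), return the sorted SIDs of sessions that are
    idle on 'SQL*Net message from client' while other sessions
    are blocked behind them."""
    # SIDs that at least one other session reports as its blocker
    blockers = {s["blocking_session"] for s in sessions
                if s.get("blocking_session") is not None}
    return sorted(s["sid"] for s in sessions
                  if s["sid"] in blockers
                  and s["event"] == "SQL*Net message from client")

# Example: session 7 has gone idle while session 12 waits on it
sessions = [
    {"sid": 7, "event": "SQL*Net message from client",
     "blocking_session": None},
    {"sid": 12, "event": "enq: TX - row lock contention",
     "blocking_session": 7},
]
print(find_idle_blockers(sessions))  # [7]
```

A session that shows up here is the one to kill first; as noted above, though, the next waiter may simply take its place.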

...


h2. {anchor:July 1}July 1 (Canada Day)

19:50 - Data Processing page went unresponsive for 2.5 hours. See [GDP-26@JIRA] and [SSC-84@JIRA]

h2. {anchor:June 29}June 29

3:38 pm Restarted glast-tomcat07. [Data Quality Monitoring Unresponsive|Known Problems#Data Quality Monitoring Unresponsive]

h3. RunQuality Exception

Cannot set the run quality flag due to [GRQ-4@JIRA]

h2. {anchor:June 27}June 27

11:55am Restarted glast-tomcat07. [Data Quality Monitoring Unresponsive|Known Problems#Data Quality Monitoring Unresponsive]

12:25am: [OpsLog and Monitoring/Trending web-apps interfering with each other|Known Problems#OpsLog, DataQualityMonitoring, TelemetryTrending losing sessions]

h2. {anchor:June 26}June 26

Outstanding Issues:

David Decotigny requests we get calibration trending working again.

10:18pm Web servers are working again:
{panel:title=Mail from Antonio Ceseracciu}
The root problem was a software crash on rtr-slb1.
I just came in after Mike called me on the phone and power cycled the machine. It has come up fine and all services should now be restored.
{panel}
{panel:title=Hogaboom, Michael}
Update on this...
As Sean says, the servers are up. swh-servcore1 shows both ports connected vlan 1854, full 1000. They are on 7/5 and 7/6.

I called Antonio and he saw that a small router had a hiccup/crashed??
He is going to SLAC now to reboot the router.
{panel}
10:06pm The problem is believed to be a "small network switch between our servers and the main switch". They are going to try power cycling it.

glast-win01 and glast-win02 became unreachable at 9:20pm. The SCCS on call technical contact has been paged (926 HELP). We just received the following e-mail:
{panel:title=Mail from Sean Sink}
I just spoke with Mike, the servers are physically up but there seems to be a network problem.  Mike is working on contacting the networking team to investigate on their end.
{panel}
* Had to [restart PROD data crawler|SASHOW2FIX:HTF Data Catalog Crawler] one time (because Nagios and [http://glastlnx20.slac.stanford.edu:5080] were complaining). It looks like the problem was caused by MC writing to the DEV version of xrootd on glastlnx22.

...


* Problems with Run Quality monitoring reported yesterday now fixed.

...



h2. June 25

...



From OpsLog

...


{panel}
Watching plots on the web is right now very slow...


*Comment by David Paneque on Thursday, June 26, 2008 5:20:39 AM UTC*


It is slow from both the shifter computers and our laptops.


*Comment by David Paneque on Thursday, June 26, 2008 5:21:33 AM UTC*


now it is fast again...


*Comment by Tony Johnson on Thursday, June 26, 2008 5:55:18 AM UTC*


Trending plots, data quality plot, all plots? One possibility is that xrootd load slows down plotting (some plots are read using xrootd). I noticed there were some ASP and MC jobs running in the pipeline around this time which may have been slowing things down.


*Comment by Tony Johnson on Thursday, June 26, 2008 6:07:59 AM UTC*


Indeed, in the DataQualityMonitoring log file around this time I see lots of messages about waiting for a response from xrootd.

{panel}

h3. Outstanding Issues:

[ELG-18@jira] OpsLog session times out immediately after login.
\[\] Some plots in data monitoring are inaccessible (message says plot not available), even while the same plots are accessible from another browser at the same time. Maybe this is also related to the workstations in the control room running firefox 1.5 (seems unlikely)? See discussion in OpsProbList.
\[\] Jim reported that several of his scripts were running and were killed when the pipeline server was restarted. We need to understand why his data catalog queries are still taking so long (>10 minutes) to run.
[GRQ-1] Run quality status is not being updated even though change entries are made in the history table. I looked at the code but could not see anything obviously wrong; maybe I am too sleepy.
[LONE-72] Attach intent as meta-data to files
[LONE-71] Digi merging loses IObfStatus - results in Digi files being marked as ContentError in Data Catalog

...



*18:24 PDT* A new version (1.2.4) of the pipeline has been installed. See [https://jira.slac.stanford.edu/browse/SSC-74]

...


...


h3. TelemetryTrending application crashes

The telemetry trending application had to be restarted a couple of times in the last 24 hours. This has been caused by large memory usage when tabulating the data [IOT-87@jira].

I am working on a fix.

In the meantime monitor the memory usage of tomcat12 from the [Server Monitoring|http://glast-ground.slac.stanford.edu/ServerMonitoring] application.
When memory gets close to 90% try clicking on "Force GC" (If you don't see this link you need to be added to the ServerMonitoringAdmin group). If garbage collection does not reduce the memory usage a crash might be imminent.

Max
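The 90% rule above is simple enough to automate. A trivial sketch of the check a monitoring script might apply (the threshold is the one suggested above; nothing here talks to the real Server Monitoring application):

```python
def should_force_gc(used_bytes, max_bytes, threshold=0.90):
    """Return True once heap usage crosses the threshold (90% by
    default), i.e. the point at which "Force GC" is worth trying."""
    return used_bytes / max_bytes >= threshold

# e.g. 1844 MB used of a 2048 MB heap is just over 90%
print(should_force_gc(1844, 2048))  # True
```

If usage stays above the threshold even after a forced GC, expect a crash and restart tomcat12 preemptively.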

h3. 4am UTC

Tony: Waiting for data.... Registered my pager - sending to number@amsmsg.net seems to work (alias@MyAirMail.com does not)
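Paging via that email gateway is just ordinary mail. A minimal sketch, assuming the working route noted above (the `build_page` helper and the subject line are illustrative, and the pager number is a placeholder):

```python
from email.message import EmailMessage

def build_page(number, text):
    """Build the mail that pages via the amsmsg.net email gateway
    (the route that worked above; alias@MyAirMail.com did not).
    Actually sending it is a separate smtplib.SMTP(...).send_message()
    call, omitted here."""
    msg = EmailMessage()
    msg["To"] = "%s@amsmsg.net" % number     # pager number is the local part
    msg["Subject"] = "GLAST shift page"
    msg.set_content(text)
    return msg

page = build_page("5551234567", "tomcat12 memory at 92%, GC not helping")
print(page["To"])  # 5551234567@amsmsg.net
```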

Two outstanding (new) issues: [PFE-172@jira] [IFO-24@jira]