
The page will be used to track issues arising during Infrastructure Shifts.

Please check the list of known problems.

July 20

(TJ) A new application has been installed on glast-tomcat09 to make the CountdownClock available at: http://glast-ground.slac.stanford.edu/CountdownClock/

(RD) Problems with LSF around 5am. One machine (fell0147) ran out of memory and became a black hole for jobs, killing them instantly. SCCS was paged and the problem was resolved by around 09:30.

July 19

(CH) glastlnx04 seemed to show up on the nagios critical list and remained in the red more frequently than normal. Reported this to Tony and unix-admin.
(TJ) This appears to be a problem with the nagios monitoring timing out, rather than a real problem with glastlnx04 or with the pipeline job control.

July 18

(TJ) The DataCatalog has been moved from tomcat09 to tomcat08, to isolate it from the other critical application on tomcat09 and to see if it is responsible for using up jdbc connections. A copy has been left running on tomcat09 to keep nagios happy until Emmanuel can update the nagios configuration.

July 17

21:30 (CH) No major problems to report.

July 16

3:30 - 6:30 am Slow/Unresponsive applications from tomcat09. JDBC connection pool was 100% busy. Increased maxActive to 20 (it was 8). The server was restarted.

July 15

20:45 (CH) There was a scheduled outage of the LSF master server causing some runs to fail.
After the server came back up, some jobs were tagged as suspended (SSUSP) and it wasn't clear whether to wait or to kill them and resubmit. The boer* batch machines did not respond when pinged. This was reported to unix-admin.

July 14

09:00 (KH) Same problem as at 00:30. Tomcat09 restarted.

00:30 (TJ) I restarted tomcat09 since all of the pipeline-II JDBC connections were in use. The problem persisted after the restart, although the applications were still responding; after about 15 minutes the problem seemed to go away by itself.

July 13

01:00 am Restarted tomcat07 due to Data Quality Monitoring Unresponsive.

19:30 (TG) Another instance of DQM web page failing to come up on DS terminal. Problem resolved after two restarts of Firefox. GDM-122@jira.

July 12

glastlnx12 (OpsLog and Pipeline-II PROD) crashed for unknown reasons, and was restarted by Chuck Boehiem.

DQM trending app was failing for Anders with GDM-122@jira.
Clearing the session (by restarting the browser, or in Anders' case clearing the glast-ground cookies) fixed the problem.

July 11

With reference to the TelemetryTrending problem, nagios had been complaining about tomcat12 for some time. It monitors the tomcat servers using probe's "quickcheck" feature, which was showing that all of the JDBC connections were used up. This could just be a side effect, since if the application hangs while it has a DB connection open, that will soon use up all of the connections.
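
For illustration, a minimal nagios-plugin-style check for an exhausted JDBC pool could look like the sketch below (Python). The status URL and the "jdbc busy/max" output format are invented placeholders, not the real quickcheck interface; only the standard nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL) are real.

#!/usr/bin/env python
# Sketch only: a nagios-style probe of a Tomcat status page reporting JDBC pool usage.
# STATUS_URL and the "jdbc <busy>/<max>" output format are hypothetical placeholders.
import re
import sys
import urllib.request  # the 2008-era equivalent is urllib2

STATUS_URL = "http://glast-tomcat12.slac.stanford.edu/probe/quickcheck"  # placeholder
OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes

def main():
    try:
        body = urllib.request.urlopen(STATUS_URL, timeout=15).read().decode()
    except Exception as exc:
        # This is the "monitoring timing out" case noted under July 19.
        print("CRITICAL - quickcheck unreachable: %s" % exc)
        return CRITICAL
    match = re.search(r"jdbc\s+(\d+)/(\d+)", body)
    if match is None:
        print("WARNING - could not parse quickcheck output")
        return WARNING
    busy, limit = int(match.group(1)), int(match.group(2))
    if busy >= limit:
        print("CRITICAL - all %d JDBC connections in use" % limit)
        return CRITICAL
    print("OK - %d of %d JDBC connections in use" % (busy, limit))
    return OK

if __name__ == "__main__":
    sys.exit(main())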

8:45 am I restarted the dev datacat crawler (using the button on the datacat web admin page)
8:20 am same as below. TelemetryTrending application unresponsive.
5:30 am TelemetryTrending application unresponsive. Restarting the server fixed the problem. No evidence was found in the log files.

July 10

5:30 am TelemetryTrending problem fetching data from XML-RPC:

http://glastlnx24:5441: org.apache.xmlrpc.XmlRpcException: Failed to create input stream:
  Connection reset
or
  Connection refused


This is a problem with the XML-RPC python server. This problem should be brought to the attention of the FO shifter.
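
As a quick sanity check, the shifter can poke the endpoint from the error message above to see whether the python XML-RPC server is even accepting connections. This is only a sketch: it uses Python 3's xmlrpc.client (xmlrpclib in that era's Python 2) and assumes the server answers the standard system.listMethods introspection call; any reply at all, even a fault, means the server process is up, while "Connection refused" means nothing is listening on the port.

import socket
import xmlrpc.client  # Python 2 equivalent: xmlrpclib

SERVER_URL = "http://glastlnx24:5441"  # endpoint taken from the error message above

def check_xmlrpc(url):
    """Report whether the XML-RPC server is reachable and responding."""
    proxy = xmlrpc.client.ServerProxy(url)
    try:
        # system.listMethods is standard XML-RPC introspection; the GLAST server
        # may not implement it, but any response proves the process is alive.
        proxy.system.listMethods()
        return "server responding"
    except xmlrpc.client.Fault:
        return "server up (introspection not supported)"
    except ConnectionRefusedError as exc:
        return "nothing listening on the port (server down?): %s" % exc
    except ConnectionResetError as exc:
        return "server accepted then dropped the connection: %s" % exc
    except (socket.error, xmlrpc.client.ProtocolError) as exc:
        return "network or HTTP problem: %s" % exc

if __name__ == "__main__":
    print(check_xmlrpc(SERVER_URL))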

19:20 (Richard, for Tony) Batch jobs were taking a long time, apparently being slow, but had in fact failed with no log files produced. This was tracked down to DNS failures on the balis. It has been reset (reported by Neal Adama at 18:15). Unix-admin ticket [SLAC #120230].

July 5

7:10 pm I restarted tomcat12 since the monitoring programs were complaining and ServerMonitoring showed it missing - Tony

July 4

6:00 pm Old DQM ingestion script put back into production. The new script worked fine for some 24 hours and then we started having "idle" sessions locking out all the following ones. There were some 60 of them waiting. Killing the first one did not solve the problem, as the next one went into the "idle" state. We decided to kill all the waiting sessions and put the old script back in production. The failed ingest scripts are being rolled back.

All the sessions have been killed off. Is it the same script that ran successfully yesterday? The database was waiting on "SQL*Net message from client", which usually means a process has gone idle. The two processes both went idle after issuing

insert into DQMTRENDDATAID (dataid, loweredge, upperedge, runversionid) values (:1,:2,:3,:4)

There was no further action being taken by either session, such as reads, execute counts, etc. So either the process was idle or it didn't have enough resource to even attempt what was to be executed next. I think for now the old script is probably best to run. It would be nice if serialization wasn't done via locking. It would also be good if I could adjust a couple of database parameters, which requires a short shutdown.
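
For reference, the idle-blocker situation described above can be spotted by querying Oracle's v$session view. The sketch below is illustrative only: the credentials/DSN are placeholders, it assumes the cx_Oracle module and SELECT access to the v$ views (the BLOCKING_SESSION column needs Oracle 10g or later), and killing a session requires DBA privileges.

# Sketch only: list sessions that other sessions are waiting behind, and what the
# blockers themselves are waiting on.  Credentials/DSN are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("monitor/secret@glastdb")  # placeholder user/password@dsn
cur = conn.cursor()

cur.execute("""
    SELECT sid, serial#, username, status, event, seconds_in_wait
      FROM v$session
     WHERE sid IN (SELECT blocking_session
                     FROM v$session
                    WHERE blocking_session IS NOT NULL)
""")

for sid, serial, user, status, event, wait in cur:
    # A blocker idling on 'SQL*Net message from client' is the case described above:
    # the client has gone quiet while the session still holds its locks.
    print("blocker sid=%s serial#=%s user=%s status=%s event='%s' wait=%ss"
          % (sid, serial, user, status, event, wait))
    # A DBA can then clear it with:  ALTER SYSTEM KILL SESSION '<sid>,<serial#>';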

July 3

01:00 Restarted tomcat07 due to Data Quality Monitoring Unresponsive.

01:00 PM New DQM ingestion script put into production to avoid ORACLE slowdowns. If any problems, please contact Max.

July 2

02:55 Restarted tomcat07 due to Data Quality Monitoring Unresponsive.

July 1 (Canada Day)

19:50 - Data Processing page went unresponsive for 2.5 hours. See GDP-26@JIRA and SSC-84@JIRA

June 29

3:38 pm Restarted glast-tomcat07. Data Quality Monitoring Unresponsive

RunQuality Exception

Cannot set the run quality flag due to GRQ-4@JIRA

June 27

11:55am Restarted glast-tomcat07. Data Quality Monitoring Unresponsive

12:25am: OpsLog and Monitoring/Trending web-apps interfering with each other

June 26

Outstanding Issues:

David Decotigny requests we get calibration trending working again.

10:18pm Web servers are working again:

Mail from Antonio Ceseracciu:

The root problem was a software crash on rtr-slb1.
I just came in after Mike called me on the phone and power cycled the machine. It has come up fine and all services should now be restored.

From Hogaboom, Michael:

Update on this...
As Sean says, the servers are up. swh-servcore1 shows both ports connected vlan 1854, full 1000. They are on 7/5 and 7/6.

I called Antonio and he saw that a small router had a hiccup/crashed??
He is going to SLAC now to reboot the router.

10:06pm The problem is believed to be a "small network switch between our servers and the main switch". They are going to try power cycling it.

glast-win01 and glast-win02 became unreachable at 9:20pm. The SCCS on call technical contact has been paged (926 HELP). We just received the following e-mail:

Mail from Sean Sink:

I just spoke with Mike, the servers are physically up but there seems to be a network problem. Mike is working on contacting the networking team to investigate on their end.

  • Had to restart PROD data crawler one time (because Nagios and [...] were complaining). Looks like the problem was caused by MC writing to the DEV version of xrootd on glastlnx22.
  • Problems with Run Quality monitoring reported yesterday now fixed.
June 25

From OpsLog:

Watching plots on the web is right now very slow...

Comment by David Paneque on Thursday, June 26, 2008 5:20:39 AM UTC
It is so from both, the shifter computers and our laptops.

Comment by David Paneque on Thursday, June 26, 2008 5:21:33 AM UTC
now it is fast again...

Comment by Tony Johnson on Thursday, June 26, 2008 5:55:18 AM UTC
Trending plots, data quality plot, all plots? One possibility is that xrootd load slows down plotting (some plots are read using xrootd). I noticed there were some ASP and MC jobs running in the pipeline around this time which may have been slowing things down.

Comment by Tony Johnson on Thursday, June 26, 2008 6:07:59 AM UTC
Indeed in the DataQualityMonitoring log file around this time I see lots of messages about waiting for response from xrootd.

Outstanding Issues:

ELG-18@jira OpsLog session times out immediately after login.
[ ] Some plots in data monitoring are inaccessible (message says plot not available), even while the same plots are accessible from another browser at the same time. Maybe this is also related to the workstations in the control room running firefox 1.5 (seems unlikely)? See discussion in OpsProbList.
[ ] Jim reported that several of his scripts were running and were killed when the pipeline server was restarted. We need to understand why his data catalog queries are still taking so long (>10 minutes) to run.
GRQ-1 Run quality status is not being updated even though change entries are made in the history table. I looked at the code but could not see anything obvious wrong; maybe I am too sleepy.
LONE-72 Attach intent as meta-data to files.
LONE-71 Digi merging loses IObfStatus - results in Digi files being marked as ContentError in Data Catalog.

18:24 PDT A new version (1.2.4) of the pipeline has been installed. See https://jira.slac.stanford.edu/browse/SSC-74

TelemetryTrending application crashes

The telemetry trending application had to be restarted a couple of times in the last 24 hours. This was caused by large memory usage when tabulating the data (IOT-87@jira).

I am working on a fix.

In the meantime, monitor the memory usage of tomcat12 from the Server Monitoring application.
When memory gets close to 90%, try clicking on "Force GC" (if you don't see this link, you need to be added to the ServerMonitoringAdmin group). If garbage collection does not reduce the memory usage, a crash might be imminent.

Max

4am UTC

Tony: Waiting for data.... Registered my pager - sending to number@amsmsg.net seems to work (alias@MyAirMail.com does not)

Two outstanding (new) issues: PFE-172@jira and IFO-24@jira