
The page will be used to track issues arising during Infrastructure Shifts.

Schedule: http://glast-ground.slac.stanford.edu/ShiftSchedule/weeklyShiftSchedule.jsp
Contact List: http://glast-ground.slac.stanford.edu/GroupManager/protected/contactList.jsp
How To Fix: SASHOW2FIX:How to Fix - Home

Please check the list of known problems.

July 4

6:00 pm Old DQM ingestion script put back into production. The new script worked fine for some 24 hours, and then we started having "idle" sessions locking out all the following ones. There were some 60 of them waiting. Killing the first one did not solve the problem, as the next one went into the "idle" state. We decided to kill all the waiting sessions and put the old script back into production. The failed ingest scripts are being rolled back.
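
For reference, a sketch of how such a pile-up could be inspected and cleared on the Oracle side, assuming DBA access to V$SESSION; the account, instance, and actual session numbers involved are not recorded here:

-- List blocked sessions together with the session blocking them (Oracle 10g+).
-- A blocker whose EVENT is "SQL*Net message from client" and whose
-- LAST_CALL_ET keeps growing is the kind of "idle" session described above.
SELECT w.sid             AS waiting_sid,
       w.seconds_in_wait AS waiting_for_s,
       b.sid             AS blocker_sid,
       b.serial#         AS blocker_serial,
       b.status          AS blocker_status,
       b.event           AS blocker_event,
       b.last_call_et    AS blocker_idle_s
FROM   v$session w
       JOIN v$session b ON b.sid = w.blocking_session
WHERE  w.blocking_session IS NOT NULL
ORDER  BY b.last_call_et DESC;

-- Once the blocker's SID and SERIAL# are known it can be killed
-- (substitute real values; this disconnects that client):
-- ALTER SYSTEM KILL SESSION '123,45678' IMMEDIATE;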

Mail from Ian:

All the sessions have been killed off. Is it the same script that ran successfully yesterday? The database was waiting on a SQL*Net message from the client, which usually means a process has gone idle. The two processes both went idle after issuing:

insert into DQMTRENDDATAID (dataid, loweredge, upperedge, runversionid)


values(:1,:2,:3,:4)

There was no further action being taken by either session, such as reads, execute counts, etc. So either the process was idle or it didn't have enough resource to even attempt what was to be executed next. I think for now the old script is probably best to run. It would be nice if serialization wasn't done via locking. It would also be good if I could adjust a couple of database parameters, which requires a short shutdown.
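
A sketch of what avoiding serialization via locking could look like, assuming the ingest script currently takes an application lock just to assign the next DATAID; the sequence name and the reduced bind list below are illustrative, not the actual DQM schema or ingest code:

-- Hypothetical alternative: let an Oracle sequence hand out DATAID values,
-- so concurrent ingest sessions never queue behind one another for an id.
CREATE SEQUENCE dqmtrenddataid_seq START WITH 1 INCREMENT BY 1 CACHE 100;

-- The ingest insert then needs no application lock around id assignment:
INSERT INTO DQMTRENDDATAID (dataid, loweredge, upperedge, runversionid)
VALUES (dqmtrenddataid_seq.NEXTVAL, :1, :2, :3);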


July 3

01:00 Restarted tomcat07 due to Data Quality Monitoring Unresponsive.

01:00 PM New DQM ingestion script put into production to avoid ORACLE slowdowns. If any problems, please contact Max.

July 2

02:55 Restarted tomcat07 due to Data Quality Monitoring Unresponsive.

July 1 (Canada Day)

19:50 - Data Processing page went unresponsive for 2.5 hours. See GDP-26@JIRA and SSC-84@JIRA

June 29

3:38 pm Restarted glast-tomcat07. Data Quality Monitoring Unresponsive

RunQuality Exception

Cannot set the run quality flag due to GRQ-4@JIRA

June 27

11:55am Restarted glast-tomcat07. Data Quality Monitoring Unresponsive

12:25am: OpsLog and Monitoring/Trending web-apps interfering with each other

June 26

Outstanding Issues:

David Decotigny requests we get calibration trending working again.

10:18pm Web servers are working again:

Mail from Antonio Ceseracciu:

The root problem was a software crash on rtr-slb1.
I just came in after Mike called me on the phone and power cycled the machine. It has come up fine and all services should now be restored.

Hogaboom, Michael:

Update on this...
As Sean says, the servers are up. swh-servcore1 shows both ports connected vlan 1854, full 1000. They are on 7/5 and 7/6.

I called Antonio and he saw that a small router had a hiccup/crashed??
He is going to SLAC now to reboot the router.

10:06pm The problem is believed to be a "small network switch between our servers and the main switch". They are going to try power cycling it.

glast-win01 and glast-win02 became unreachable at 9:20pm. The SCCS on-call technical contact has been paged (926 HELP). We just received the following e-mail:

Mail from Sean Sink:

I just spoke with Mike, the servers are physically up but there seems to be a network problem. Mike is working on contacting the networking team to investigate on their end.

  • ... were complaining). Looks like the problem was caused by MC writing to the DEV version of xrootd on glastlnx22.
  • Problems with Run Quality monitoring reported yesterday are now fixed.

June 25

From OpsLog:

Watching plots on the web is right now very slow...

Comment by David Paneque on Thursday, June 26, 2008 5:20:39 AM UTC:


It is so from both the shifter computers and our laptops.

Comment by David Paneque on Thursday, June 26, 2008 5:21:33 AM UTC:


Now it is fast again...

Comment by Tony Johnson on Thursday, June 26, 2008 5:55:18 AM UTC:


Trending plots, data quality plot, all plots? One possibility is that xrootd load slows down plotting (some plots are read using xrootd). I noticed there were some ASP and MC jobs running in the pipeline around this time which may have been slowing things down.

Comment by Tony Johnson on Thursday, June 26, 2008 6:07:59 AM UTC:


Indeed, in the DataQualityMonitoring log file around this time I see lots of messages about waiting for a response from xrootd.


Outstanding Issues:

  • ELG-18@jira: OpsLog session times out immediately after login.
  • Some plots in data monitoring are inaccessible (message says plot not available), even while the same plots are accessible from another browser at the same time. Maybe this is also related to the workstations in the control room running Firefox 1.5 (seems unlikely)? See discussion in OpsProbList.
  • Jim reported that several of his scripts were running and were killed when the pipeline server was restarted. We need to understand why his data catalog queries are still taking so long (>10 minutes) to run.
  • GRQ-1: Run quality status is not being updated even though change entries are made in the history table. I looked at the code but could not see anything obvious wrong; maybe I am too sleepy. (A diagnostic sketch follows this list.)
  • LONE-72: Attach intent as meta-data to files.
  • LONE-71: Digi merging loses IObfStatus, which results in Digi files being marked as ContentError in the Data Catalog.
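
For GRQ-1, a diagnostic sketch that compares the stored flag with the latest history entry; the table and column names (RUNQUALITY, RUNQUALITYHISTORY, RUNID, QUALITYFLAG, CHANGETIME) are placeholders, since the real schema is not reproduced in this log:

-- Hypothetical query: runs whose stored flag disagrees with the most
-- recent entry in the history table.
SELECT r.runid,
       r.qualityflag AS current_flag,
       h.qualityflag AS latest_history_flag,
       h.changetime  AS last_change
FROM   runquality r
       JOIN runqualityhistory h ON h.runid = r.runid
WHERE  h.changetime = (SELECT MAX(h2.changetime)
                       FROM   runqualityhistory h2
                       WHERE  h2.runid = r.runid)
  AND  h.qualityflag <> r.qualityflag;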

18:24 PDT A new version (1.2.4) of the pipeline has been installed. See https://jira.slac.stanford.edu/browse/SSC-74

TelemetryTrending application crashes

The telemetry trending application had to be restarted a couple of times in the last 24 hours. This was caused by large memory usage when tabulating the data (IOT-87@jira).

I am working on a fix.

In the meantime, monitor the memory usage of tomcat12 from the Server Monitoring application.
When memory gets close to 90%, try clicking on "Force GC" (if you don't see this link, you need to be added to the ServerMonitoringAdmin group). If garbage collection does not reduce the memory usage, a crash may be imminent.

Max

4am UTC

Tony: Waiting for data.... Registered my pager - sending to number@amsmsg.net seems to work (alias@MyAirMail.com does not)

Two outstanding (new) issues: PFE-172@jira, IFO-24@jira.