Page History

...

Task

...

of

...

the

...

Data

...

Processing

...

on-call

...

expert

...

Monitor
...
L1Proc
...
and
...
halfPipe:
...
Every
...
time
...
a
...
red
...
cross
...
appears
...
on
...
the
...
Data
...
Processing
...
page
...
,
...
next
...
to
...
the
...
L1Proc
...
or
...
halfPipe
...
processing
...
status
...
bar,
...
try
...
to
...
resolve
...
the
...
failure.
...
We
...
are
...
not
...
on-call
...
for
...
ASP/GRB
...
search
...
(Jim
...
Chiang
...
(jchiang
...
{at
...
}slac)
...
should
...
be
...
emailed
...
(not
...
paged)
...
for
...
these
...
failures)
...
and
...
we
...
are
...
definitely
...
NOT
...
on-call
...
for
...
infrastructure
...
problems
...
(can't
...
see
...
monitoring
...
plots,
...
etc.).
...
If
...
you
...
get
...
paged
...
for
...
something
...
that
...
is
...
not
...
under
...
your
...
responsibility,
...
don't
...
try
...
to
...
fix
...
it:
...
forward
...
the
...
message
...
to
...
the
...
appropriate
...
people
...
and
...
report
...
everything
...
in
...
the
...
Ops
...
Log.
Familiarize yourself with understanding the Pipeline-II page by reading through the Pipeline-II User's Guide. It is a good starting point for understanding the general organization of the pipeline and the tools needed to track down problems.
It may be good to look at the task chart to see the interdependencies of tasks ("tasks" defined in Pipeline-II User's Guide). "Success" dependency means that a process needs to successfully complete in order for the dependent process to continue, while "All Done" means that even failed processes will result in the dependent process continuing.
Watch the Usage Plots and look for L1Proc/HalfPipe related tasks (doChunk, doCrumb, etc). Default rule of thumb of time is to use last 2 hours, because more than that will not give enough fidelity in the plot. If you see a series of points that make a flat line for an extended period of time, it may indicate problems with the pipeline.

Data Organization

There are four main categories of data organization. At the top, there is the "Delivery", which is the data that is sent down from GLAST. Completely unrelated are the "Runs", which are time-segments determined by GLAST. A delivery can consist of a part of a run, many runs, or pieces of runs - there is no particular order that is guaranteed within a delivery with regards to a run. Runs (or parts of the run contained in a delivery) are broken into "Chunks", which are always contiguous blocks of data. Chunks are further broken down into "Crumbs", which are also contiguous.

When looking at files or directories, run numbers are typically prefixed by an "r", chunk numbers with an "e", and crumb numbers with a "b".

Different types of failures

There are 3 main type of failures, and should be handled differently.

Transient failures

afs hiccup (can't read from/write to a resource located on an afs disk)
nfs hiccup (can't read from/write to a resource located on an nfs disk)
LSF hiccup (an lsf batch machine crashes or is rebooted)
lost Data Base connection : MOOT, Calibration database log, or more rarely on Oracle
recon segfaults for no apparent reason (nothing in logfile)
Pipeline deadlock

How to recognize transient failures: they usually affect only one single job (disk or DB failures), or a small number of jobs all on the same machine (LFS failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file ends suddenly, without the usual LSF postlog, the batch host probably crashed. There will probably be several jobs failing the same way at the same time.

What to do in case of transient failures: rollback the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below). If recon segfaults for no apparent reason, email Heather and Anders before the rollback, including a link to the log file, which will tell them where the core file is. For pipeline deadlocks, email Dan and include a link to the process instance.

Transient failures are rare lately. For the last couple of months, most failed processes are automatically retried once. This usually fixes transient issues, so usually when there's a failure it indicates an actual problem.

Bad merges: If a process that's merging crumb-level files into chunks or chunks into runs can't find all of its input files, it won't fail. See the "dontCleanUp" section below. Processes downstream of such a merge may fail because they are trying to use different types of input files (e.g., digi and recon) and the events don't match up because some are missing from one file and not the other. Then you need to roll back the merge even though it "succeeded" the first time.

Infrastructure failures

single batch host is killing several jobs ('rogue' LSF host)
group of batch hosts crashed or went offline
afs crashed on one of the host machines
scratch disk is full on one or more of the host machines - see what to do here

Wiki Markup
A staging disk is full (these are accessed from /afs/slac.stanford.edu/g/glast/ground/PipelineStaging\[1-7\])

...

/nfs/farm/g/glast/u52

...

or

...

/nfs/farm/g/glast/u15

...

is

...

full

...

How

...

to

...

recognize

...

infrastructure

...

failures:

...

they

...

usually

...

affect

...

a

...

large

...

number

...

of

...

jobs,

...

either

...

on

...

the

...

same

...

LSF

...

host

...

or

...

on

...

different

...

LSF

...

hosts.

...

What

...

to

...

do

...

in

...

case

...

of

...

infrastructure

...

failures:

...

these

...

failures

...

involve

...

a

...

large

...

number

...

of

...

people

...

to

...

be

...

taken

...

care

...

of

...

(the

...

infrastructure

...

expert

...

on-call

...

and

...

often

...

also

...

the

...

SCCS),

...

so

...

for

...

the

...

time

...

being

...

still

...

page

...

Warren

...

and/or

...

Maria

...

Elena

...

(see

...

L1

...

shift

...

schedule)

...

if

...

you

...

think

...

that

...

one

...

of

...

those

...

failures

...

might

...

be

...

happening

...

during

...

the

...

night

...

(if

...

in

...

doubt,

...

page

...

anyways).

...

Permanent

...

failures

...

G4
...
propagator
...
error.
...
If
...
recon
...
dies
...
complaining
...
about
...
g4propagator,
...
we
...
can't
...
fix
...
it.
...
If
...
this
...
happens,
...
email
...
Heather
...
(heather625
...
{at
...
}gmail)
...
and
...
Anders
...
(borgland
...
{at
...
}slac)
...
(and
...
possibly
...
Warren
...
(focke
...
{at
...
}slac)
...
and
...
Maria
...
Elena
...
(monzani
...
{at
...
}slac)).
...
Include
...
a
...
link
...
to
...
the
...
log
...
file,
...
which
...
will
...
tell
...
them
...
where
...
the
...
core
...
file
...
is.
...
Overlapping chunks.
...
If
...
findChunks
...
complains
...
about
...
overlapping
...
chunks,
...
tell
...
Bryson;
...
if
...
trending
...
merges
...
complain,
...
tell
...
Bryson
...
and
...
datamonlist@glast2.Stanford.EDU.
...
Rollback
...
won't
...
help.
...
Deliveries
...
arriving
...
too
...
closely
...
will
...
mangle
...
the
...
display
...
.
...
The
...
underlying
...
processes
...
are
...
fine,
...
but
...
email
...
obsproblist
...
{at
...
}glast2.stanford.edu,
...
Jim
...
(jchiang
...
{at
...
}slac)
...
or
...
Bryson
...
(blee
...
{at
...
}slac)
...
to
...
fix
...
the
...
display.
...
How
...
to
...
recognize
...
permanent
...
failures:
...
besides
...
those
...
2
...
cases,
...
everything
...
that
...
doesn't
...
get
...
fixed
...
after
...
a
...
rollback
...
is
...
by
...
definition
...
a
...
permanent
...
failure.
...
What
...
to
...
do
...
in
...
case
...
of
...
permanent
...
failures:
...
contact
...
the
...
appropriate
...
people
...
above,
...
if
...
you
...
are
...
sure
...
you
...
know
...
what
...
happened.
...
Otherwise,
...
page
...
Warren
...
and/or
...
Maria
...
Elena
...
(see
...
L1
...
shift
...
schedule).
...
If
...
there
...
is
...
another
...
part
...
of
...
the
...
run
...
waiting,
...
the
...
run
...
lock
...
(see
...
below)
...
will
...
have
...
to
...
be
...
removed
...
by
...
hand;
...
page
...
unless
...
you're
...
really
...
sure
...
of
...
what
...
you're
...
doing.
...

Other

...

failures

...

This
...
is
...
a
...
comprised
...
list
...
of
...
failures
...
that
...
don't
...
really
...
fit
...
into
...
the
...
other
...
major
...
three
...
ones
...
Too
...
few
...
events
...
in
...
the
...
run,
...
or
...
gaps
...
-
...
can
...
lead
...
to
...
too
...
few
...
events
...
in
...
the
...
magic7
...
file
...
FT2
...
failure
...
.
...
Try
...
to
...
copy
...
the
...
/afs/slac/g/glast/ground/PipelineStaging6/halfPipe/XXX/magic7_XXX.txt
...
(where
...
XXX
...
is
...
the
...
run
...
number)
...
to
...
/afs/slac/g/glast/ground/PipelineStaging6/YYY/magic7_YYY.txt
...
(where
...
YYY
...
is
...
the
...
run
...
where
...
fakeFT2
...
failed
...
...
...
remember
...
to
...
change
...
XXX
...
to
...
YYY
...
on
...
the
...
magic7
...
file
...
name).
...
Then
...
rollback
...
fakeFT2.
...
If
...
this
...
fails,
...
email
...
Andrea
...
(tramacer
...
{at
...
}slac)
...
for
...
additional
...
information
...
regarding
...
problems
...
with
...
FT2.
...

Everything to know about the Rollback

A rollback is essentially a restart of the stream or substream. It will re-run a particular process and all processes that depend on its output.

How to rollback

You can roll back from the pipeline front end. The entire stream can be rolled back by clicking "Rollback Stream" at the top, or individual streams in the main stream can be rolled back by selecting the pink boxes under "Stream Processes" and clicking "Rollback Selected".

But if multiple processes have failed (common), it's usually better to use the command line.

Wiki Markup
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline \-m PROD rollbackStream \--minimum 'L1Proc\[80819007\]'

...

This

...

will

...

roll

...

back

...

all

...

of

...

the

...

failed,

...

terminated,

...

or

...

cancelled

...

processes

...

in

...

delivery

...

80819007.

...

If

...

you

...

don't

...

say

...

--minimum,

...

it

...

will

...

roll

...

back

...

the

...

whole

...

delivery.

...

That's

...

usually

...

not

...

what

...

you

...

want.

...

Also

...

note

...

that

...

it

...

will

...

not

...

rollback

...

processes

...

that

...

have

...

succeeded,

...

but

...

with

...

incomplete

...

information

...

(ie

...

-

...

problems

...

arising

...

from

...

afs/nfs

...

hiccups).

...

Such

...

processes

...

may

...

need

...

to

...

be

...

rolled

...

back

...

via

...

the

...

front

...

end.

...

After

...

a

...

rollback,

...

the

...

red

...

x

...

on

...

the

...

data

...

processing

...

page

...

will

...

be

...

gone,

...

but

...

the

...

L1

...

status

...

will

...

still

...

say

...

Failed.

...

This

...

tends

...

to

...

confuse

...

the

...

duty

...

scientists.

...

You

...

might

...

want

...

to

...

use

...

the

...

setL1Status

...

task

...

(see

...

bellow)

...

to

...

make

...

it

...

say

...

Running.

...

This

...

is

...

really

...

optional,

...

it

...

won't

...

affect

...

the

...

processing

...

in

...

any

...

way.

...

But

...

there

...

will

...

be

...

fewer

...

pagers

...

beeping.

...

Removing

...

"dontCleanUp"

...

is

...

not

...

necessary

...

to

...

process

...

the

...

data.

...

The

...

file

...

just

...

stops

...

temporary

...

files

...

from

...

getting

...

deleted

...

when

...

we're

...

done

...

with

...

them.

...

Rolling

...

back

...

substreams

...

From

...

the

...

front

...

end

...

,

...

find

...

the

...

"Substreams"

...

area

...

and

...

click

...

the

...

pink

...

boxes

...

for

...

substreams

...

that

...

you

...

want

...

to

...

roll

...

back.

...

Then

...

click

...

"Rollback

...

Selected

...

SubStreams".

...

Wiki Markup
From the command line it's a bit more tricky: /afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline \-m PROD rollbackStream \--minimum 'L1Proc\[90117001\]/doRun\[253889937\]'

...

Remember

...

to

...

escape

...

the

...

square

...

brackets

...

if

...

you

...

are

...

in

...

tcsh.

...

When

...

to

...

rollback

...

Wait
...
for
...
the
...
"setStatus"
...
stream
...
to
...
have
...
run.
...
Rollback
...
won't
...
work
...
unless
...
everything
...
downstream
...
of
...
the
...
failed
...
process
...
is
...
in
...
a
...
final
...
state.
...
It's
...
generally
...
not
...
harmful
...
to
...
try
...
too
...
soon,
...
you
...
just
...
get
...
an
...
unhelpful
...
error
...
message.
...
Most
...
things
...
at
...
run
...
level
...
can
...
be
...
rolled
...
back
...
right
...
away.
...
If
...
a
...
recon
...
job
...
fails,
...
you'll
...
have
...
to
...
wait
...
at
...
least
...
an
...
hour.
...
Maybe
...
half
...
a
...
day.
...
Notice
...
that
...
the
...
GRB
...
search
...
is
...
executed
...
per
...
delivery
...
and
...
depends
...
on
...
all
...
the
...
FT1
...
and
...
FT2
...
files
...
in
...
each
...
run
...
to
...
be
...
registered
...
(therefore,
...
it
...
depends
...
on
...
almost
...
"everything"
...
that
...
belongs
...
to
...
that
...
delivery).
...
For
...
this
...
reason,
...
you
...
might
...
need
...
to
...
wait
...
for
...
the
...
entire
...
delivery
...
to
...
be
...
completed
...
before
...
being
...
able
...
to
...
roll
...
back
...
any
...
failed
...
recon
...
jobs.
...
And
...
because
...
of
...
the
...
run
...
lock
...
(see
...
below),
...
some
...
of
...
the
...
(parts
...
of)
...
runs
...
in
...
the
...
delivery
...
might
...
have
...
to
...
wait
...
for
...
other
...
deliveries
...
to
...
finish,
...
which
...
might
...
have
...
their
...
own
...
failures...
...
It's
...
possible,
...
but
...
rare,
...
to
...
get
...
deadlocks,
...
where
...
nothing
...
can
...
proceed
...
until
...
a
...
lock
...
is
...
removed
...
by
...
hand.
...
Best
...
to
...
ask
...
for
...
help
...
then.
...
In
...
general,
...
experience
...
will
...
tell
...
you
...
when
...
you
...
can
...
roll
...
back
...
what.
...
So,
...
in
...
doubt,
...
you
...
can
...
try
...
anyways
...
(if
...
it's
...
too
...
soon,
...
nothing
...
will
...
happen
...
and
...
you
...
will
...
get
...
an
...
error)
...
!
...
Often
...
you
...
can
...
roll
...
things
...
back
...
sooner
...
if
...
you
...
cancel
...
some
...
processes.
...
If
...
there
...
is
...
a
...
delivery
...
with
...
some
...
runs
...
that
...
are
...
ready
...
to
...
roll
...
back
...
and
...
others
...
that
...
aren't,
...
you
...
can
...
do
...
the
...
rollback
...
if
...
you
...
cancel
...
kludgeAsp.
...
"/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline
...
-m
...
PROD
...
cancelProcessInstance
...
13388768"
...
The
...
number
...
(or
...
numbers,
...
you
...
can
...
use
...
a
...
space-separated
...
list
...
to
...
do
...
more
...
than
...
one
...
at
...
a
...
time)
...
is
...
the
...
oracle
...
PK
...
for
...
the
...
process
...
instance,
...
it's
...
in
...
the
...
URL
...
for
...
the
...
process
...
instance
...
page
...
in
...
the
...
frontend.
...
This
...
takes
...
a
...
long
...
time,
...
10-30
...
minutes.
...

The dontCleanUp file

Any time one of the merges processes can't

...

find

...

all

...

of

...

its

...

input

...

files,

...

a

...

message

...

is

...

generated

...

in

...

the

...

Log

...

Watcher

...

(and

...

there

...

will

...

be

...

errors

...

in

...

the

...

log

...

of

...

the

...

failed

...

processes

...

complaining

...

about

...

"Different

...

number

...

of

...

events

...

...")

...

and

...

cleanup

...

for

...

the

...

run

...

is

...

disabled

...

by

...

a

...

file

...

called

...

dontCleanUp

...

in

...

the

...

run

...

The run lock

Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u52/L1. If findChunks fails or there are permanent failures in the run and another part of the run is waiting, it has to be removed by hand. It should never be removed unless the only failures in the run are findChunks or permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.

findChunks

This process is not automatically retried like most of the others. If it fails, you have to roll it back by hand. And remove the run lock (see above) and the throttle lock (next section) by hand. And you'll probably have to
mv /nfs/farm/g/glast/u52/L1/${runId}/

...

${runId}

...

${deliveryid}_chunkList.txt

...

/nfs/farm/g/glast/u52/L1/${runId}/${runId}

...

${deliveryid}_chunkList.txt.tmp

...

Also,

...

it

...

checks

...

the

...

chunks

...

to

...

make

...

sure

...

they

...

don't

...

overlap.

...

That

...

doesn't

...

happen

...

much

...

anymore.

...

Except

...

when

...

RePiping.

...

See

...

the

...

section

...

on

...

that.

...

The

...

Throttle

...

Yay

...

!

...

There's
...
a
...
throttle
...
that
...
limits
...
the
...
number
...
of
...
RDLs
...
that
...
can
...
be
...
in
...
process
...
(or
...
in
...
the
...
hard
...
part,
...
at
...
least)
...
at
...
once.
...
It
...
works
...
by
...
making
...
files
...
with
...
names
...
like
...
/nfs/farm/g/glast/u52/L1/throttle/1.lock
...
at
...
the
...
same
...
time
...
as
...
it
...
makes
...
the
...
run
...
lock.
...
config.throttleLimit
...
is
...
normally
...
set
...
at
...
2,
...
3
...
is
...
usually
...
safe
...
but
...
not
...
always.
...
Leave
...
it
...
at
...
2
...
unless
...
we're
...
way
...
behind
...
and
...
you're
...
up
...
for
...
watching
...
it
...
closely.
...
It's
...
still
...
under
...
development
...
and
...
a
...
bit
...
fragile,
...
so
...
it's
...
probably
...
better
...
not
...
to
...
mess
...
with
...
it
...
for
...
now.
...

Too

...

Much

...

Data

...

Hopefully
...
this
...
whole
...
section
...
is
...
now
...
obsolete,
...
see
...
the
...
previous
...
one.
...
Sometimes
...
we
...
get
...
no
...
data
...
from
...
the
...
MOC
...
for
...
half
...
a
...
day
...
and
...
then
...
it
...
all
...
arrives
...
at
...
once.
...
This
...
will
...
overload
...
the
...
AFS
...
buffers
...
(see
...
next
...
section)
...
unless
...
L1
...
processing
...
is
...
throttled
...
by
...
hand
...
(we
...
are
...
working
...
on
...
implementing
...
an
...
automatic
...
throttle,
...
and
...
ditching
...
AFS
...
buffering
...
in
...
favor
...
of
...
xrootd).
...
You
...
do
...
this
...
by
...
hand-creating
...
run
...
locks
...
for
...
runs
...
that
...
haven't
...
arrived
...
yet
...
and
...
suspending
...
batch
...
jobs.
...

To

...

predict

...

future

...

runs:

...

Go
...
to
...
the
...
mission
...
timeline
...
viewer
...
(glast-ground
...
->
...
Mission
...
Planning
...
Web
...
View
...
->
...
Timeline),
...
get
...
the
...
start
...
time
...
for
...
the
...
physics
...
runs,
...
then
...
plug
...
them
...
into
...
xTime
...
. Make sure to uncheck "Apply Clock Offset Correction(s)
...
for
...
RXTE
...
and
...
Swift"
...
at
...
the
...
bottom
...
of
...
the
...
page.
...
By
...
default
...
the
...
timeline
...
doesn't
...
go
...
very
...
far
...
into
...
the
...
past,
...
you
...
may
...
need
...
to
...
change
...
that
...
by
...
clicking
...
on
...
"selections"
...
in
...
the
...
top
...
right
...
corner
...
of
...
the
...
page.
...

Making

...

run

...

locks:

...

They
...
have
...
names
...
like
...
/nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock
...

At
...
the
...
moment
...
it
...
doesn't
...
matter
...
what's
...
in
...
them , a half-sentence
...
explaining
...
why
...
you
...
made
...
the
...
lock
...
is
...
good.
...
An
...
empty
...
file,
...
or
...
a
...
rant
...
about
...
how
...
much
...
it
...
sucks
...
that
...
you
...
have
...
to
...
do
...
this,
...
works
...
also.
...
When
...
you're
...
ready
...
to
...
let
...
the
...
run
...
go,
...
just
...
remove
...
or
...
rename
...
the
...
file
...
and
...
the
...
run
...
should
...
start
...
up
...
in
...
5-10
...
minutes.
...
Actually the locks created by L1 do have meaningful content, and Bad Things will happen when it tries to remove them if it's not correct. But if you make one by hand L1 won't remove it, you have to, so it's OK to put whatever in there.

Suspending jobs

If the first part of a run is processing and you want to stop the second part from starting at a bad time, use the pipeline front end to get the LSF job ID of the findChunks process for the second part (which will be pending due to the run lock placed by the first part), log into a noric as glastraw and use bstop to suspend it. bresume it when you're ready to let it run.

If both parts of a run arrive while it's locked out, you can reduce the total amount of I/O that it does by letting the smaller part go first, since all of the data in the part that goes first has to be merged twice. Suspend findChunks for both parts, remove the run lock, then resume findChunks for the part with less data. "Less data" == "fewer chunks" unless it's highly fragmented, in that case du on the evt chunk directory (like /afs/slac.stanford.edu/g/glast/ground/PipelineStaging6/halfPipe/090512001/r0263753970)

...

may

...

give

...

a

...

better

...

idea.

...

AFS

...

buffers

...

&

...

idle

...

threads

...

When
...
the
...
AFS
...
servers
...
where
...
we
...
keep
...
temporary
...
files
...
hiccup,
...
it's
...
usually
...
because
...
they
...
ran
...
low
...
on
...
idle
...
threads.
...
It
...
is
...
possible
...
to
...
monitor
...
this
...
value
...
and
...
intervene
...
to
...
stave
...
off
...
disaster.
...
It
...
can
...
be
...
viewed
...
with
...
Ganglia
...
or
...
Nagios.
...
Nagios
...
only
...
works
...
inside
...
SLAC's
...
firewall,
...
but
...
is
...
somewhat
...
more
...
reliable.
...
Ganglia
...
works
...
from
...
anywhere,
...
and
...
shows
...
higher
...
time
...
resolution,
...
but
...
sometimes
...
it
...
stops
...
updating
...
and
...
just
...
shows
...
a
...
flat
...
line
...
with
...
some
...
old
...
value.
...

Ganglia

Ganglia threads page shows idle threads for all servers.
Warren's custom disk monitoring pages shows just the servers we're actually interested in, and shows the mapping between server and volume names.

Nagios

Using ssh as a proxy: You can use ssh as a SOCKS proxy to relay requests through a SLAC computer.

Pick a port. Should be a 4-digit number. Referred to as <port> below.
ssh -D <port> noric.slac.stanford.edu

...

Then

...

configure

...

your

...

browser

...

to

...

use

...

a

...

SOCKS

...

proxy

...

at

...

localhost:<port>

...

Now

...

you

...

should

...

be

...

able

...

to

...

access

...

SLAC-only

...

pages.

...

There's

...

2

...

places

...

to

...

get

...

the

...

threads:

...

new

...

nagios

...

(02)

...

click

...

"Hostgroup

...

Overview"

...

on

...

the

...

left

...

side,

...

then

...

the

...

host

...

you

...

want

...

in

...

the

...

"AFS

...

fileservers"

...

table.

...

This

...

shows

...

idle

...

threads

...

for

...

that

...

server

...

on

...

the

...

"AFSfs"

...

row.

...

old

...

nagios

...

(01)

...

then

...

click

...

"AFS

...

FS

...

Graphs"

...

in

...

the

...

upper

...

left.

...

This

...

shows

...

most

...

or

...

all

...

SLAC

...

AFS

...

servers.

...

Not

...

as

...

reliable

...

as

...

nagios02,

...

and

...

it's

...

going

...

away

...

eventually.

...

Warren's

...

"old"

...

1-hour

...

page

There's

...

3

...

places

...

to

...

get

...

the

...

information...

...

What

...

to

...

do

...

with

...

the

...

info

...

When
...
the
...
servers
...
are
...
idle,
...
idle
...
threads
...
should
...
be
...
122.
...
The
...
SLAC
...
IT
...
people
...
consider
...
it
...
a
...
warning
...
if
...
it
...
goes
...
below
...
110
...
and
...
an
...
error
...
at
...
100.
...
For
...
us,
...
it
...
seems
...
llike
...
things
...
usually
...
work
...
as
...
long
...
as
...
it
...
stays
...
above
...
60
...
or
...
so
...
(although
...
afs112
...
has
...
been
...
known
...
to
...
go
...
as
...
low
...
as
...
40,
...
since
...
PipelineStanging6
...
is
...
for
...
the
...
HalfPipe).
...
This
...
is
...
likely
...
to
...
occur
...
if
...
there
...
are
...
more
...
than
...
~300
...
chunk
...
jobs
...
running.
...
Usually
...
after
...
recon
...
finishes
...
and
...
the
...
chunk-level
...
jobs
...
downstream
...
of
...
recon
...
start
...
up.
...
I've
...
written
...
a
...
script
...
that
...
will
...
suspend
...
jobs
...
that
...
are
...
using
...
a
...
specified
...
server,
...
wait
...
a
...
bit,
...
and
...
then
...
resume
...
them
...
with
...
a
...
little
...
delay
...
between.
...
Suspending
...
jobs
...
will
...
make
...
the
...
server
...
recover
...
sooner,
...
but
...
some
...
of
...
the
...
jobs
...
are
...
likely
...
to
...
get
...
timeout
...
errors
...
and
...
fail
...
due
...
to
...
the
...
delay
...
of
...
being
...
suspended,
...
so
...
it
...
may
...
be
...
best
...
to
...
wait
...
until
...
idle
...
threads
...
hits
...
0
...
before
...
suspending.
...
From
...
a
...
noric,
...
as
...
glastraw:
...
setenv
...
L1ProcROOT
...
/afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/1.70
...
(any
...
L1
...
version
...
>=
...
1.70
...
should
...
work)
...
$L1ProcROOT/tools/grepJobs.py
...
PipelineStaging2
...
1200
...
60
...
first
...
arg
...
is
...
the
...
volume
...
to
...
be
...
suspended
...
second
...
arg
...
is
...
seconds
...
to
...
wait
...
between
...
stopping
...
and
...
resuming
...
third
...
arg
...
is
...
seconds
...
to
...
wait
...
between
...
resuming
...
jobs
...
If
...
you
...
want
...
to
...
do
...
afs106,
...
you
...
have
...
to
...
say
...
"PipelineStaging/";
...
if
...
you
...
say
...
"PipelineStaging"
...
it
...
will
...
suspend
...
everything.
...
The
...
association
...
between
...
disk
...
and
...
server
...
can
...
be
...
found
...
in
...
several
...
ways.
...
Here's
...
one:
...
$
...
fs
...
whereis
...
PipelineStaging
...
*
...
File
...
PipelineStaging
...
is
...
on
...
host
...
afs106.slac.stanford.edu
...
File
...
PipelineStaging1
...
is
...
on
...
host
...
afs106.slac.stanford.edu
...
File
...
PipelineStaging2
...
is
...
on
...
host
...
afs104.slac.stanford.edu
...
File
...
PipelineStaging3
...
is
...
on
...
host
...
afs107.slac.stanford.edu
...
File
...
PipelineStaging4
...
is
...
on
...
host
...
afs108.slac.stanford.edu
...
File
...
PipelineStaging5
...
is
...
on
...
host
...
afs109.slac.stanford.edu
...
File
...
PipelineStaging6
...
is
...
on
...
host
...
afs112.slac.stanford.edu
...
File
...
PipelineStaging7
...
is
...
on
...
host
...
afs113.slac.stanford.edu
...
You
...
can
...
also
...
find
...
the
...
amount
...
of
...
disk
...
usage
...
with
...
the
...
following
...
command:
...
$
...
fs
...
lq
...
PipelineStaging
...
*
...

Volume
...
Name
...
Quota
...
Used
...
%Used
...
Partition
...

G.glast.ground.PipStg
...
350000000
...
128422194
...
37%
...
26%
G.glast.ground.PipStg
...
350000000
...
128422194
...
37% 26%
G.glast.ground.PipStg2
...
350000000
...
118129732
...
34%
...
24%
G.glast.ground.PipStg3
...
350000000
...
124826187
...
36%
...
26%
G.glast.ground.PipStg4
...
350000000
...
124744630
...
36%
...
26%
G.glast.ground.PipStg5
...
350000000
...
132978520
...
38%
...
27%
G.glast.ground.PipStg6
...
350000000
...
59700367
...
17% 12%
G.glast.ground.PipStg7
...
350000000
...
107201000
...
31%
...
22%

Useful resources

List of failed runs. Report here failed runs and their status. When a run is fixed (meaning: no red cross appears anywhere anymore), you can delete it from the list (the runs listed as 'ignore error (it is fixed)' are the ones that were fixed with a fake delivery, so they still appear as failed in the Data Processing page.
AFS hiccups. We keep track here of the AFS hiccups (or at least we try to). You can add date/time, LFS job ID and the 'missing' file.
LSF hiccups. We keep track here of the LSF hiccups (or at least we try to). You can add date/time, LSF batch host(s) and the reason, if known.
List of runs incorrectly marked as InProgress. These are runs where acq status is Complete, but L1 status is InProgress.

You should definitely join the following mailing lists:

Data Handling (datalist)
Level 1 processing (l1proclist)
Data processing pipeline (pipelinelist)

And probably these:

LAT Instrument Data Monitoring (datamonlist)
Questions and problems concerning GLAST software (helpsoftlist)
LAT operations team (latopslist)
Nagios alert messages (nagios)
People who will work ISOC shifts (shiftslist)
ISOC Science Operations (solist)

Other things you may want to know

setL1Status task

Tired of being paged because L1Proc status still says Failed after a rollback?

/afs/slac/g/glast/ground/bin/pipeline

...

--mode

...

PROD

...

createStream

...

--define

...

"runNumber=240831665,l1RunStatus=Running"

...

setL1Status

...

OR:

...

/afs/slac/g/glast/ground/bin/pipeline

...

--mode

...

PROD

...

createStream

...

--define

...

"runNumber=240837713"

...

setL1Status

...

l1RunStatus

...

defaults

...

to

...

Running,

...

but

...

you

...

can

...

set

...

it

...

to

...

any

...

of

...

the

...

allowed

...

values

...

(Complete,

...

InProgress,

...

Incomplete,

...

Running,

...

Failed).

...

How

...

to

...

contact

...

SCCS

...

Send
...
an
...
email
...
to
...
unix-admin
...
(this
...
usually
...
works,
...
even
...
in
...
the
...
night
...
and
...
in
...
the
...
weekend).
...
If
...
you
...
don't
...
get
...
an
...
answer
...
and
...
the
...
issue
...
is
...
urgent,
...
call
...
xHELP
...
(650-926-4357).
...
Choose
...
4
...
to
...
page
...
the
...
on-call
...
person.
...
If
...
your
...
call/email
...
isn't
...
answered
...
and
...
the
...
issue
...
is
...
REALLY
...
urgent,
...
page
...
again
...
the
...
on-call
...
person
...
at
...
650-846-0178.
...

Log

...

Watcher

...

Messages

...

Message
...
text:
...
Can't
...
open
...
lockfile
...
/nfs/farm/g/glast/u52/L1/r0248039911/r0248039911.lock.
...
...

cancelProcessInstance

Here's

...

the

...

syntax

...

to

...

cancel

...

a

...

process

...

that

...

is

...

not

...

in

...

final

...

state,

...

and

...

all

...

its

...

dependencies:

...

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline

...

--mode

...

PROD

...

cancelProcessInstance

...

8073657

...

The

...

Process

...

Instance

...

Primary

...

Key

...

(ie

...

-

...

8073657

...

in

...

the

...

example)

...

can

...

be

...

determined

...

in

...

the

...

the

...

URL

...

for

...

the

...

process

...

instance

...

page

...

(eg:

...

the

...

PIPK

...

for

...

http://glast-ground.slac.stanford.edu/Pipeline-II/pi.jsp?pi=20253756

...

is

...

20253756).

...

NOTE:

...

Please

...

don't

...

use

...

this

...

unless

...

you

...

really

...

(REALLY

...

!

...

!

...

!)

...

know

...

what

...

you

...

are

...

doing.

...

Host

...

or

...

host

...

group

...

is

...

not

...

used

...

by

...

the

...

queue.

...

If
...
you
...
see
...
jobs
...
being
...
terminated
...
with
...
exceptions
...
in
...
the
...
message
...
viewer
...
saying
...
things
...
like
...
"yili0148+5:
...
Host
...
or
...
host
...
group
...
is
...
not
...
used
...
by
...
the
...
queue.
...
Job
...
not
...
submitted.",
...
it
...
means
...
the
...
hosts
...
available
...
to
...
glastdataq
...
have
...
changed.
...
The
...
solution
...
is
...
to
...
roll
...
back
...
the
...
affected
...
chunks
...
(the
...
doChunk
...
streams)
...
with
...
a
...
new
...
value
...
for
...
HOSTLIST.
...
When
...
you
...
roll
...
back
...
the
...
streams
...
from
...
the
...
frontend,
...
on
...
the
...
confirmation
...
page
...
you
...
are
...
presented
...
an
...
opportunity
...
to
...
set
...
or
...
redefine
...
variables.
...
To
...
figure
...
out
...
what
...
the
...
new
...
value
...
needs
...
to
...
be,
...
do
...
a
...
"bqueues
...
-l
...
glastdataq".
...
The
...
output
...
will
...
include
...
a
...
line
...
like
...
"HOSTS:
...
bbfarm/".
...
In
...
this
...
case
...
you'd
...
enter
...
HOSTLIST=bbfarm
...
in
...
the
...
box
...
on
...
the
...
confirmation
...
page.
...
bbfarm
...
is
...
actually
...
a
...
temporary
...
thing
...
for
...
the
...
cooling
...
outage,
...
when
...
things
...
get
...
switched
...
back
...
to
...
normal,
...
the
...
relevant
...
line
...
from
...
bqueues
...
will
...
probably
...
look
...
more
...
like
...
"HOSTS:
...
glastyilis+3
...
glastcobs+2
...
preemptfarm+1".
...
Then
...
the
...
thing
...
to
...
enter
...
in
...
the
...
box
...
would
...
be
...
HOSTLIST="glastyilis
...
glastcobs
...
genfarm".
...

Red

...

question

...

marks

...

in

...

display

...

Red
...
question
...
marks show up in the display for HalfPipe. Click on the FASTCopy "Logs" number. First check to see if the "ProcessSCI.end"
...
event
...
has
...
occurred.
...
If
...
not,
...
then
...
wait
...
until
...
it
...
occurs
...
and
...
see
...
if
...
the
...
red
...
question
...
mark
...
persists.
...
If
...
it
...
persists
...
and
...
the
...
log
...
shows
...
messages
...
saying
...
"pipeline
...
nonEventReporting
...
submission
...
succeeded"
...
or
...
"no
...
reassembled
...
datagrams
...
found",
...
it
...
just
...
means
...
that
...
it
...
contained
...
data
...
that
...
had
...
already
...
been
...
received.
...
Meaningful
...
deliveries
...
will
...
say
...
"found
...
XXX
...
LSEP
...
datagrams
...
in
...
....".
...
If
...
it
...
is
...
a
...
meaningful
...
delivery,
...
then
...
further
...
intervention
...
is
...
required.
...

Stuck

...

processes

...

If
...
you
...
suspect
...
a
...
process
...
is
...
stuck,
...
visit
...
the
...
process
...
page
...
from
...
the
...
run
...
page
...
(ie
...
-
...
click
...
on
...
the
...
green
...
task
...
bar
...
in
...
the
...
Data
...
Processing
...
Page,
...
then
...
click
...
on
...
the
...
particular
...
job
...
that
...
is
...
running
...
in
...
the
...
list
...
at
...
the
...
bottom
...
of
...
the
...
page).
...
Each
...
entry
...
in
...
the
...
table
...
on
...
the
...
page
...
will
...
contain
...
a
...
link
...
to
...
"Messages",
...
"Log",
...
and
...
"Files".
...
If
...
you
...
click
...
on
...
"Log",
...
it
...
should
...
show
...
you
...
the
...
log,
...
but
...
more
...
importantly,
...
it'll
...
show
...
the
...
location
...
of
...
the
...
log
...
file
...
right
...
above
...
the
...
box
...
containing
...
the
...
log
...
information
...
(note:
...
if
...
the
...
file
...
is
...
too
...
big,
...
it
...
will
...
not
...
open,
...
but
...
the
...
link
...
to
...
"Files"
...
will
...
show
...
where
...
the
...
log
...
is
...
located,
...
so
...
you
...
can
...
get
...
the
...
information
...
that
...
way
...
as
...
well).
...
If
...
you
...
visit
...
the
...
location
...
of
...
the
...
file
...
on
...
the
...
noric
...
machines,
...
you
...
can
...
determine
...
when
...
the
...
last
...
time
...
the
...
file
...
was
...
updated
...
(ls
...
-l,
...
or
...
tail
...
-f
...
to
...
see
...
if
...
it's
...
still
...
streaming).
...
If
...
it
...
is
...
not
...
updating,
...
find
...
out
...
the
...
batch
...
job
...
number
...
(JobId
...
from
...
the
...
stream
...
table),
...
log
...
into
...
noric
...
as
...
glastraw,
...
and
...
issue
...
a
...
bkill
...
<JobId>.
...
The
...
process
...
should
...
automatically
...
be
...
restarted
...
after
...
this
...
command
...
is
...
issued.
...

Repiping

When weird things are happening with the delivery that you can't

...

figure

...

out,

...

it

...

may

...

be

...

necessary

...

to

...

repipe

...

the

...

stream

...

(probably

...

a

...

good

...

idea

...

to

...

check

...

with

...

someone

...

first).

...

Log

...

in

...

as

...

glastraw

...

and

...

issue

...

the

...

following

...

command:

...

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline

...

-m

...

PROD

...

createStream

...

--stream

...

$

...

{ID

...

}

...

--define

...

downlinkID=$

...

{ID

...

},runID=$

...

{ID

...

}

...

RePipe

...

where

...

$

...

{ID

...

}

...

is

...

the

...

Run

...

ID

...

(just

...

numbers,

...

no

...

leading

...

r0).

...

This

...

will

...

create

...

a

...

Batch

...

commands

...

bjobs
...
-wu
...
glastraw
...
-
...
running
...
this
...
from
...
noric
...
will
...
list
...
all
...
the
...
jobs
...
in
...
the
...
LSF
...
batch
...
queue
...
owned
...
by
...
glastraw
...
(which
...
is
...
the
...
user
...
submitting
...
pipeline
...
streams).
...

Shift

...

sign

...

up

...

Sign
...
up
...
for
...
shifts
...
here
...
.
...
View
...
shift
...
calendar
...
here

Useful L1Proc links (these are designed to keep you from waiting for the entire L1Proc history to be retrieved ... which could be 10 min and up).

All L1Proc Deliveries that have not completed
All L1Proc Runs that have not completed
All L1Proc Runs that have failed with a timestamp (it is a good idea to update your timestamp to shorten the search).

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 120

New Version 121

Key

Task

of

the

Data

Processing

on-call

expert

Data Organization

Different types of failures

Transient failures

Infrastructure failures

Permanent

failures

Other

failures

Everything to know about the Rollback

How to rollback

Rolling

back

substreams

When

to

rollback

The dontCleanUp file

The run lock

findChunks

The

Throttle

Yay

!

Too

Much

Data

To

predict

future

runs:

Making

run

locks:

Suspending jobs

AFS

buffers

&

idle

threads

Ganglia

Nagios

What

to

do

with

the

info

Useful resources

Other things you may want to know

setL1Status task

How

to

contact

SCCS

Log

Watcher

Messages

Message ... text: ... Can't ... open ... lockfile ... /nfs/farm/g/glast/u52/L1/r0248039911/r0248039911.lock. ......

cancelProcessInstance

Host

or

host

group

is

not

used

Message
...
text:
...
Can't
...
open
...
lockfile
...
/nfs/farm/g/glast/u52/L1/r0248039911/r0248039911.lock.
...
...

bjobs
...
-wu
...
glastraw
...
-
...
running
...
this
...
from
...
noric
...
will
...
list
...
all
...
the
...
jobs
...
in
...
the
...
LSF
...
batch
...
queue
...
owned
...
by
...
glastraw
...
(which
...
is
...
the
...
user
...
submitting
...
pipeline
...
streams).
...

Sign
...
up
...
for
...
shifts
...
here
...
.
...
View
...
shift
...
calendar
...
here