Monitor L1Proc and halfPipe: Every time a ... appears on the ..., next to the L1Proc or halfPipe processing status bar, try to resolve the failure. We are not on-call for ASP/GRB search (Jim Chiang (jchiang {at} slac) should be emailed, not paged, for these failures) and we are definitely NOT on-call for infrastructure problems (can't see monitoring plots, etc.). If you get paged for something that is not under your responsibility, don't try to fix it: forward the message to the appropriate people and report everything in the ... Log.
Familiarize yourself with the Pipeline-II page by reading through the Pipeline-II User's Guide. It is a good starting point for understanding the general organization of the pipeline and the tools needed to track down problems.
It may be good to look at the task chart to see the interdependencies of tasks ("tasks" defined in Pipeline-II User's Guide). "Success" dependency means that a process needs to successfully complete in order for the dependent process to continue, while "All Done" means that even failed processes will result in the dependent process continuing.
Watch the Usage Plots and look for L1Proc/HalfPipe related tasks (doChunk, doCrumb, etc). As a rule of thumb, set the time range to the last 2 hours; a longer range does not give enough resolution in the plot. If you see a series of points that make a flat line for an extended period of time, it may indicate problems with the pipeline.
There are four main categories of data organization. At the top, there is the "Delivery", which is the data that is sent down from GLAST. Completely unrelated are the "Runs", which are time-segments determined by GLAST. A delivery can consist of a part of a run, many runs, or pieces of runs - there is no particular order that is guaranteed within a delivery with regards to a run. Runs (or parts of the run contained in a delivery) are broken into "Chunks", which are always contiguous blocks of data. Chunks are further broken down into "Crumbs", which are also contiguous.
When looking at files or directories, run numbers are typically prefixed by an "r", chunk numbers with an "e", and crumb numbers with a "b".
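The prefix convention above can be sketched as a small shell helper. This is only an illustration: the function name `classify_id` and the `unknown` fallback are assumptions, not part of the L1 tooling.

```shell
# Classify a pipeline file or directory name by its leading letter.
# Prefix meanings (r = run, e = chunk, b = crumb) come from the text above;
# the function name and the "unknown" fallback are illustrative only.
classify_id() {
    case "$1" in
        r[0-9]*) echo run ;;
        e[0-9]*) echo chunk ;;
        b[0-9]*) echo crumb ;;
        *)       echo unknown ;;
    esac
}

classify_id r0263753970   # prints "run"
```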
There are three main types of failure, and each should be handled differently.
How to recognize transient failures: they usually affect only a single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file ends suddenly, without the usual LSF postlog, the batch host probably crashed. There will probably be several jobs failing the same way at the same time.
What to do in case of transient failures: roll back the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below). If recon segfaults for no apparent reason, email Heather and Anders before the rollback, including a link to the log file, which will tell them where the core file is. For pipeline deadlocks, email Dan and include a link to the process instance.
Transient failures are rare lately. For the last couple of months, most failed processes are automatically retried once. This usually fixes transient issues, so usually when there's a failure it indicates an actual problem.
Bad merges: If a process that's merging crumb-level files into chunks or chunks into runs can't find all of its input files, it won't fail. See the "dontCleanUp" section below. Processes downstream of such a merge may fail because they are trying to use different types of input files (e.g., digi and recon) and the events don't match up because some are missing from one file and not the other. Then you need to roll back the merge even though it "succeeded" the first time.
A staging disk is full (these are accessed from /afs/slac.stanford.edu/g/glast/ground/PipelineStaging[1-7])
...
How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.
What to do in case of infrastructure failures: these failures need a large number of people to take care of them (the infrastructure expert on-call and often also the SCCS), so for the time being still page Warren and/or Maria Elena (see L1 shift schedule) if you think that one of those failures might be happening during the night (if in doubt, page anyway).
...
How to recognize permanent failures: besides those 2 cases, everything that doesn't get fixed after a rollback is by definition a permanent failure.
What to do in case of permanent failures: contact the appropriate people above, if you are sure you know what happened. Otherwise, page Warren and/or Maria Elena (see L1 shift schedule). If there is another part of the run waiting, the run lock (see below) will have to be removed by hand; page unless you're really sure of what you're doing.
...
...
...
This is a compiled list of failures that don't really fit into the three major categories.
...
A rollback is essentially a restart of the stream or substream. It will re-run a particular process and all processes that depend on its output.
You can roll back from the pipeline front end. The entire stream can be rolled back by clicking "Rollback Stream" at the top, or individual streams in the main stream can be rolled back by selecting the pink boxes under "Stream Processes" and clicking "Rollback Selected".
But if multiple processes have failed (common), it's usually better to use the command line.
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD rollbackStream --minimum 'L1Proc[80819007]'
...
This will roll back all of the failed, terminated, or cancelled processes in delivery 80819007. If you don't say --minimum, it will roll back the whole delivery. That's usually not what you want. Also note that it will not roll back processes that have succeeded but with incomplete information (i.e., problems arising from AFS/NFS hiccups). Such processes may need to be rolled back via the front end.
After a rollback, the red x on the data processing page will be gone, but the L1 status will still say Failed. This tends to confuse the duty scientists. You might want to use the setL1Status task (see below) to make it say Running. This is really optional; it won't affect the processing in any way. But there will be fewer pagers beeping.
Removing "dontCleanUp" is not necessary to process the data. The file just stops temporary files from getting deleted when we're done with them.
...
...
...
...
From the ..., find the "Substreams" area and click the pink boxes for substreams that you want to roll back. Then click "Rollback Selected SubStreams".
...
From the command line it's a bit more tricky:
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD rollbackStream --minimum 'L1Proc[90117001]/doRun[253889937]'
Remember to escape the square brackets if you are in tcsh.
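As a rough sketch, the two command-line rollback forms (whole delivery, or one doRun substream) can be produced by a single helper. The function name `rollback_cmd` is hypothetical, and it only prints the command for review rather than running it.

```shell
# Print (do not run) a rollbackStream command for a delivery, or for one
# doRun substream of that delivery. PIPELINE is the path quoted in the text;
# rollback_cmd is an illustrative name, not part of the pipeline tools.
PIPELINE=/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline

rollback_cmd() {
    delivery=$1
    run=${2:-}
    stream="L1Proc[$delivery]"
    if [ -n "$run" ]; then
        stream="$stream/doRun[$run]"
    fi
    # The single quotes in the output keep the shell from globbing [ ].
    printf "%s -m PROD rollbackStream --minimum '%s'\n" "$PIPELINE" "$stream"
}

rollback_cmd 80819007
rollback_cmd 90117001 253889937
```

Printing the command first makes it easy to check that --minimum is present before anything is rolled back.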
...
...
...
...
Wait for the "setStatus" stream to have run.
Rollback won't work unless everything downstream of the failed process is in a final state. It's generally not harmful to try too soon; you just get an unhelpful error message. Most things at run level can be rolled back right away. If a recon job fails, you'll have to wait at least an hour. Maybe half a day.
Notice that the GRB search is executed per delivery and depends on all the FT1 and FT2 files in each run being registered (therefore, it depends on almost "everything" that belongs to that delivery). For this reason, you might need to wait for the entire delivery to be completed before being able to roll back any failed recon jobs. And because of the run lock (see below), some of the (parts of) runs in the delivery might have to wait for other deliveries to finish, which might have their own failures... It's possible, but rare, to get deadlocks, where nothing can proceed until a lock is removed by hand. Best to ask for help then.
In general, experience will tell you when you can roll back what. So, if in doubt, you can try anyway (if it's too soon, nothing will happen and you will get an error)!
Often you can roll things back sooner if you cancel some processes. If there is a delivery with some runs that are ready to roll back and others that aren't, you can do the rollback if you cancel kludgeAsp.
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD cancelProcessInstance 13388768
The number (or numbers; you can use a space-separated list to do more than one at a time) is the Oracle PK for the process instance. It's in the URL for the process instance page in the frontend. This takes a long time, 10-30 minutes.
...
Any time one of the merge processes can't find all of its input files, a message is generated in the ... (and there will be errors in the log of the failed processes complaining about "Different number of events ...") and cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs will fail if the dontCleanUp file is present. If everything is OK (see instructions below), that file can be removed and the jobs rolled back.
To check that everything is OK, follow these steps:
The new way:
...
The old way:
...
Now you should have a list of merges complaining about not being able to find input files. If any of the complaining merges are downstream of a failed process, you can ignore them. This is usually the case. But if they are not downstream from a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge; it will say what files are missing. If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem, and this should be considered a "permanent failure".
Any time one of these messages is generated, cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs, and checkRun, will fail if that file is present. If everything is OK, that file can be removed and the jobs rolled back. The cleanupCrumbs jobs shouldn't fail if they're rolled back after cleanupChunks has run (they used to, but don't anymore).
Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u52/L1. If findChunks fails or there are permanent failures in the run and another part of the run is waiting, the lock has to be removed by hand. It should never be removed unless the only failures in the run are findChunks or permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.
This process is not automatically retried like most of the others. If it fails, you have to roll it back by hand. And remove the run lock (see above) and the throttle lock (next section) by hand. And you'll probably have to
mv /nfs/farm/g/glast/u52/L1/${runId}/${runId}_${deliveryid}_chunkList.txt /nfs/farm/g/glast/u52/L1/${runId}/${runId}_${deliveryid}_chunkList.txt.tmp
Also, it checks the chunks to make sure they don't overlap. That doesn't happen much anymore. Except when RePiping. See the section on that.
...
...
...
...
...
There's a throttle that limits the number of RDLs that can be in process (or in the hard part, at least) at once. It works by making files with names like /nfs/farm/g/glast/u52/L1/throttle/1.lock at the same time as it makes the run lock. config.throttleLimit is normally set at 2; 3 is usually safe but not always. Leave it at 2 unless we're way behind and you're up for watching it closely. It's still under development and a bit fragile, so it's probably better not to mess with it for now.
...
...
...
...
Hopefully this whole section is now obsolete; see the previous one.
Sometimes we get no data from the MOC for half a day and then it all arrives at once. This will overload the AFS buffers (see next section) unless L1 processing is throttled by hand (we are working on implementing an automatic throttle, and ditching AFS buffering in favor of xrootd). You do this by hand-creating run locks for runs that haven't arrived yet and suspending batch jobs.
...
...
...
...
...
Go to the ... (glast-ground -> Mission Planning Web View -> Timeline), get the start time for the physics runs, then plug them into ... . Make sure to uncheck "Apply Clock Offset Correction(s) for RXTE and Swift" at the bottom of the page. By default the timeline doesn't go very far into the past; you may need to change that by clicking on "selections" in the top right corner of the page.
...
...
...
...
They have names like /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock
At the moment it doesn't matter what's in them; a half-sentence explaining why you made the lock is good. An empty file, or a rant about how much it sucks that you have to do this, works also. When you're ready to let the run go, just remove or rename the file and the run should start up in 5-10 minutes.
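A minimal sketch of how the lock-file name is put together, assuming the pattern seen in the example path (an "r0" prefix on the numeric run ID, used for both the directory and the file). The names `L1_ROOT` and `run_lock_path` are illustrative, and the "r0" rule is inferred from the examples on this page rather than from the L1 code.

```shell
# Print the run-lock path for a numeric run ID, following the example path
# quoted in the text. The "r0" prefix is an assumption based on that example.
L1_ROOT=/nfs/farm/g/glast/u52/L1

run_lock_path() {
    run="r0$1"    # $1 is the numeric run ID, e.g. 263753970
    printf '%s/%s/%s.lock\n' "$L1_ROOT" "$run" "$run"
}

run_lock_path 263753970
```

Checking the printed path before creating or removing a lock by hand helps avoid touching the wrong run.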
...
Actually the locks created by L1 do have meaningful content, and Bad Things will happen when it tries to remove them if it's not correct. But if you make one by hand L1 won't remove it, you have to, so it's OK to put whatever in there.
If the first part of a run is processing and you want to stop the second part from starting at a bad time, use the pipeline front end to get the LSF job ID of the findChunks process for the second part (which will be pending due to the run lock placed by the first part), log into a noric as glastraw and use bstop to suspend it. bresume it when you're ready to let it run.
If both parts of a run arrive while it's locked out, you can reduce the total amount of I/O that it does by letting the smaller part go first, since all of the data in the part that goes first has to be merged twice. Suspend findChunks for both parts, remove the run lock, then resume findChunks for the part with less data. "Less data" == "fewer chunks" unless it's highly fragmented; in that case du on the evt chunk directory (like /afs/slac.stanford.edu/g/glast/ground/PipelineStaging6/halfPipe/090512001/r0263753970) may give a better idea.
...
...
...
...
...
...
When the AFS servers where we keep temporary files hiccup, it's usually because they ran low on idle threads. It is possible to monitor this value and intervene to stave off disaster. It can be viewed with Ganglia or Nagios. Nagios only works inside SLAC's firewall, but is somewhat more reliable. Ganglia works from anywhere, and shows higher time resolution, but sometimes it stops updating and just shows a flat line with some old value.
...
Now you should be able to access SLAC-only pages. There are 2 places to get the threads:
...
There are 3 places to get the information...
...
...
...
...
...
...
...
When the servers are idle, idle threads should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. For us, it seems like things usually work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStaging6 is for the HalfPipe). This is likely to occur if there are more than ~300 chunk jobs running, usually after recon finishes and the chunk-level jobs downstream of recon start up.
I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between. Suspending jobs will make the server recover sooner, but some of the jobs are likely to get timeout errors and fail due to the delay of being suspended, so it may be best to wait until idle threads hits 0 before suspending.
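The thresholds above can be summarized in a small helper. This is a sketch only: the function name and the status labels are made up for illustration, and the 60 cut-off is the approximate "or so" figure from the text, not a hard limit.

```shell
# Map an AFS idle-thread count to a rough status using the numbers quoted
# in the text: SLAC IT warns below 110 and calls it an error at 100; L1
# usually keeps working above roughly 60. Labels are illustrative only.
idle_thread_status() {
    n=$1
    if   [ "$n" -ge 110 ]; then echo ok
    elif [ "$n" -gt 100 ]; then echo it-warning
    elif [ "$n" -gt 60 ];  then echo it-error
    else                        echo l1-at-risk
    fi
}

idle_thread_status 122   # prints "ok"
```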
...
The association between disk and server can be found in several ways. Here's one:
...
You can also find the amount of disk usage with the following command:
...
You should definitely join the following mailing lists:
And probably these:
Tired of being paged because L1Proc status still says Failed after a rollback?
/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240831665,l1RunStatus=Running" setL1Status
OR:
/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240837713" setL1Status
l1RunStatus defaults to Running, but you can set it to any of the allowed values (Complete, InProgress, Incomplete, Running, Failed).
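As a hedged sketch, a wrapper can check the status value against the allowed list before building the setL1Status command. The function name `set_l1_status_cmd` is hypothetical. It always writes l1RunStatus explicitly, which should be equivalent to the shorter form given the stated default, and it only prints the command.

```shell
# Print (do not run) a setL1Status createStream command, rejecting any
# status that is not in the allowed list quoted in the text.
# set_l1_status_cmd is an illustrative name, not an existing tool.
PIPELINE=/afs/slac/g/glast/ground/bin/pipeline

set_l1_status_cmd() {
    run=$1
    l1status=${2:-Running}    # Running is the documented default
    case "$l1status" in
        Complete|InProgress|Incomplete|Running|Failed) ;;
        *) echo "unknown l1RunStatus: $l1status" >&2; return 1 ;;
    esac
    printf '%s --mode PROD createStream --define "runNumber=%s,l1RunStatus=%s" setL1Status\n' \
        "$PIPELINE" "$run" "$l1status"
}

set_l1_status_cmd 240831665 Running
```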
...
Message text: Can't open lockfile /nfs/farm/g/glast/u52/L1/r0248039911/r0248039911.lock.
...
...
Here's the syntax to cancel a process that is not in final state, and all its dependencies:
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline --mode PROD cancelProcessInstance 8073657
The Process Instance Primary Key (i.e., 8073657 in the example) can be read from the URL for the process instance page (e.g., the PIPK for http://glast-ground.slac.stanford.edu/Pipeline-II/pi.jsp?pi=20253756 is 20253756).
NOTE: Please don't use this unless you really (REALLY!!!) know what you are doing.
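Reading the primary key out of a process instance URL can be sketched with sed. The function name `pipk_from_url` is an assumption for illustration; it relies only on the `pi=` query parameter shown in the example URL.

```shell
# Extract the process instance primary key from a Pipeline-II URL by
# pulling the digits that follow "pi=". Prints nothing if there is no match.
pipk_from_url() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]pi=\([0-9][0-9]*\).*/\1/p'
}

pipk_from_url 'http://glast-ground.slac.stanford.edu/Pipeline-II/pi.jsp?pi=20253756'   # prints 20253756
```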
...
If you see jobs being terminated with exceptions in the message viewer saying things like "yili0148+5: Host or host group is not used by the queue. Job not submitted.", it means the hosts available to glastdataq have changed. The solution is to roll back the affected chunks (the doChunk streams) with a new value for HOSTLIST. When you roll back the streams from the frontend, the confirmation page gives you an opportunity to set or redefine variables. To figure out what the new value needs to be, do a "bqueues -l glastdataq". The output will include a line like "HOSTS: bbfarm/". In this case you'd enter HOSTLIST=bbfarm in the box on the confirmation page.
bbfarm is actually a temporary thing for the cooling outage. When things get switched back to normal, the relevant line from bqueues will probably look more like "HOSTS: glastyilis+3 glastcobs+2 preemptfarm+1". Then the thing to enter in the box would be HOSTLIST="glastyilis glastcobs genfarm".
...
...
...
...
...
...
...
...
marks show up in the display for HalfPipe. Click on the FASTCopy "Logs" number. First check to see if the "ProcessSCI.end" event has occurred. If not, then wait until it occurs and see if the red question mark persists. If it persists and the log shows messages saying "pipeline nonEventReporting submission succeeded" or "no reassembled datagrams found", it just means that it contained data that had already been received. Meaningful deliveries will say "found XXX LSEP datagrams in ....". If it is a meaningful delivery, then further intervention is required.
...
...
...
If you suspect a process is stuck, visit the process page from the run page (i.e., click on the green task bar in the Data Processing Page, then click on the particular job that is running in the list at the bottom of the page). Each entry in the table on the page will contain a link to "Messages", "Log", and "Files". If you click on "Log", it should show you the log, but more importantly, it'll show the location of the log file right above the box containing the log information (note: if the file is too big, it will not open, but the link to "Files" will show where the log is located, so you can get the information that way as well). If you visit the location of the file on the noric machines, you can determine when the file was last updated (ls -l, or tail -f to see if it's still streaming). If it is not updating, find out the batch job number (JobId from the stream table), log into noric as glastraw, and issue a bkill <JobId>. The process should automatically be restarted after this command is issued.
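The "is the log still updating?" check can be sketched with find, as an alternative to eyeballing ls -l. The function name `log_is_stale` and the 30-minute figure in the example are assumptions; pick a window that suits the job. The -mmin test is supported by GNU and BSD find but is not part of POSIX.

```shell
# Succeed (exit 0) if the file at $1 was last modified more than $2 minutes
# ago. find prints the path only when the -mmin test matches, so a
# non-empty result means "stale". log_is_stale is an illustrative name.
log_is_stale() {
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}

# Example with a hypothetical log path:
# log_is_stale /path/to/logFile.txt 30 && echo "no update for 30+ minutes"
```

A stale log is only a hint that the job is stuck; some steps are legitimately quiet for long stretches, so check before issuing bkill.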
...
When weird things are happening with the delivery that you can't figure out, it may be necessary to repipe the stream (probably a good idea to check with someone first). Log in as glastraw and issue the following command:
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD createStream --stream ${ID} --define downlinkID=${ID},runID=${ID} RePipe
where ${ID} is the Run ID (just numbers, no leading r0). This will create a directory in /nfs/farm/g/glast/u28/RePipe ... you will then need to manually re-enter the run into L1Proc. To do this, bkill any findChunk processes that are associated with the RunID, remove the run lock from /nfs/farm/g/glast/u52/L1/rXXX (where XXX is the run number), move all chunkList .txt files (leave the .txt.tmp ones alone) to something else (just suffixing them with ".ignore" should work), and issue the following command:
/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD createStream --stream ${ID} --define DOWNLINK_ID=${ID},DOWNLINK_RAWDIR=/nfs/farm/g/glast/u28/RePipe/${ID} L1Proc
You should be able to see if the process restarted correctly by visiting ..., which lists all the streams that have not resulted in "Success".
...
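Because the same Run ID appears several times in each RePipe command, a small helper that fills it in once may reduce typos. This is a sketch: `repipe_cmd` and `reenter_l1_cmd` are hypothetical names, and both functions only print the command so it can be reviewed before being run as glastraw.

```shell
# Print (do not run) the two createStream commands used when repiping a run.
# $1 is the numeric Run ID (no leading r0). Function names are illustrative.
PIPELINE=/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline
REPIPE_ROOT=/nfs/farm/g/glast/u28/RePipe

repipe_cmd() {
    printf '%s -m PROD createStream --stream %s --define downlinkID=%s,runID=%s RePipe\n' \
        "$PIPELINE" "$1" "$1" "$1"
}

reenter_l1_cmd() {
    printf '%s -m PROD createStream --stream %s --define DOWNLINK_ID=%s,DOWNLINK_RAWDIR=%s/%s L1Proc\n' \
        "$PIPELINE" "$1" "$1" "$REPIPE_ROOT" "$1"
}

repipe_cmd 263753970
reenter_l1_cmd 263753970
```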
...
...
bjobs -wu glastraw - running this from noric will list all the jobs in the LSF batch queue owned by glastraw (which is the user submitting pipeline streams).
...
...
...
...
Sign up for shifts ... .
View shift calendar
...