Task of the Data Processing on-call expert

Monitor L1Proc and halfPipe: every time a red cross appears on the Data Processing page, next to the L1Proc or halfPipe processing status bar, try to resolve the failure. We are not on-call for ASP/GRB search (Jim Chiang (jchiang {at} slac) should be emailed (not paged) for these failures) and we are definitely NOT on-call for infrastructure problems (can't see monitoring plots, etc.). If you get paged for something that is not under your responsibility, don't try to fix it: forward the message to the appropriate people and report everything in the Ops Log.
Familiarize yourself with the Pipeline-II page by reading through the Pipeline-II User's Guide. It is a good starting point for understanding the general organization of the pipeline and the tools needed to track down problems.

It may be good to look at the task chart to see the interdependencies of tasks ("tasks" defined in Pipeline-II User's Guide). "Success" dependency means that a process needs to successfully complete in order for the dependent process to continue, while "All Done" means that even failed processes will result in the dependent process continuing.

Watch the Usage Plots and look for L1Proc/HalfPipe related tasks (doChunk, doCrumb, etc.). As a rule of thumb, use the last 2 hours; a longer time range will not give enough fidelity in the plot. If you see a series of points that make a flat line for an extended period of time, it may indicate problems with the pipeline.

Data Organization

There are four main categories of data organization.  At the top, there is the "Delivery", which is the data that is sent down from GLAST.  Completely unrelated are the "Runs", which are time-segments determined by GLAST.  A delivery can consist of a part of a run, many runs, or pieces of runs - there is no particular order that is guaranteed within a delivery with regards to a run.  Runs (or parts of the run contained in a delivery) are broken into "Chunks", which are always contiguous blocks of data.  Chunks are further broken down into "Crumbs", which are also contiguous.

When looking at files or directories, run numbers are typically prefixed by an "r", chunk numbers with an "e", and crumb numbers with a "b".

Different types of failures

There are three main types of failures, and each should be handled differently.

Transient failures

  • afs hiccup (can't read from/write to a resource located on an afs disk)
  • nfs hiccup (can't read from/write to a resource located on an nfs disk)
  • LSF hiccup (an lsf batch machine crashes or is rebooted)
  • lost database connection: MOOT, the calibration database log, or more rarely Oracle
  • recon segfaults for no apparent reason (nothing in logfile)
  • Pipeline deadlock

How to recognize transient failures: they usually affect only one single job (disk or DB failures), or a small number of jobs all on the same machine (LSF failure). If a process complains about a missing file but the file exists, or gets a read error after opening a file, it's probably a disk server hiccup. If the log file ends suddenly, without the usual LSF postlog, the batch host probably crashed. There will probably be several jobs failing the same way at the same time.

What to do in case of transient failures: rollback the affected process(es) when possible (see below for the rollback procedure). Look for the dontCleanUp file and check the Log Watcher (see below). If recon segfaults for no apparent reason, email Heather and Anders before the rollback, including a link to the log file, which will tell them where the core file is. For pipeline deadlocks, email Dan and include a link to the process instance.

Transient failures are rare lately. For the last couple of months, most failed processes have been automatically retried once. This usually fixes transient issues, so when there's a failure it usually indicates an actual problem.

Bad merges: If a process that's merging crumb-level files into chunks or chunks into runs can't find all of its input files, it won't fail. See the "dontCleanUp" section below. Processes downstream of such a merge may fail because they are trying to use different types of input files (e.g., digi and recon) and the events don't match up because some are missing from one file and not the other. Then you need to roll back the merge even though it "succeeded" the first time.

Infrastructure failures

  • single batch host is killing several jobs ('rogue' LSF host)
  • group of batch hosts crashed or went offline
  • afs crashed on one of the host machines
  • scratch disk is full on one or more of the host machines - see what to do here
  • A staging disk is full (these are accessed from /afs/slac.stanford.edu/g/glast/ground/PipelineStaging[1-7])

  • /nfs/farm/g/glast/u52 or /nfs/farm/g/glast/u15 is full

How to recognize infrastructure failures: they usually affect a large number of jobs, either on the same LSF host or on different LSF hosts.

What to do in case of infrastructure failures: these failures involve a large number of people (the infrastructure expert on-call and often also the SCCS), so for the time being still page Warren and/or Maria Elena (see the L1 shift schedule) if you think one of these failures might be happening during the night (if in doubt, page anyway).

Permanent failures

  • G4 propagator error. If recon dies complaining about g4propagator, we can't fix it. If this happens, email Heather (heather625 {at} gmail) and Anders (borgland {at} slac) (and possibly Warren (focke {at} slac) and Maria Elena (monzani {at} slac)). Include a link to the log file, which will tell them where the core file is.
  • Overlapping chunks. If findChunks complains about overlapping chunks, tell Bryson; if trending merges complain, tell Bryson and datamonlist@glast2.Stanford.EDU. Rollback won't help.
  • Deliveries arriving too closely will mangle the display. The underlying processes are fine, but email obsproblist {at} glast2.stanford.edu, Jim (jchiang {at} slac) or Bryson (blee {at} slac) to fix the display.

How to recognize permanent failures: besides those two cases, everything that doesn't get fixed after a rollback is by definition a permanent failure.

What to do in case of permanent failures: contact the appropriate people above, if you are sure you know what happened. Otherwise, page Warren and/or Maria Elena (see the L1 shift schedule). If there is another part of the run waiting, the run lock (see below) will have to be removed by hand; page unless you're really sure of what you're doing.

Other failures

This is a list of failures that don't really fit into the other three major categories.

  • Too few events in the run, or gaps - can lead to too few events in the magic7 file. Try to copy /afs/slac/g/glast/ground/PipelineStaging6/halfPipe/XXX/magic7_XXX.txt (where XXX is the run number) to /afs/slac/g/glast/ground/PipelineStaging6/YYY/magic7_YYY.txt (where YYY is the run where fakeFT2 failed ... remember to change XXX to YYY in the magic7 file name; a sketch of the copy follows this list). Then roll back fakeFT2. If this fails, email Andrea (tramacer {at} slac) for additional information regarding problems with FT2.
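A minimal tcsh sketch of that copy, with hypothetical run numbers (XXX is the run with a good magic7 file, YYY is the run where fakeFT2 failed; substitute the real ones):

set XXX=263753970
set YYY=263753971
cp /afs/slac/g/glast/ground/PipelineStaging6/halfPipe/${XXX}/magic7_${XXX}.txt \
   /afs/slac/g/glast/ground/PipelineStaging6/${YYY}/magic7_${YYY}.txt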

Everything to know about the Rollback

A rollback is essentially a restart of the stream or substream. It will re-run a particular process and all processes that depend on its output.

How to rollback

You can roll back from the pipeline front end. The entire stream can be rolled back by clicking "Rollback Stream" at the top, or individual streams in the main stream can be rolled back by selecting the pink boxes under "Stream Processes" and clicking "Rollback Selected".

But if multiple processes have failed (common), it's usually better to use the command line.

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD rollbackStream --minimum 'L1Proc[80819007]'

This will roll back all of the failed, terminated, or cancelled processes in delivery 80819007. If you don't say --minimum, it will roll back the whole delivery. That's usually not what you want. Also note that it will not roll back processes that have succeeded but with incomplete information (ie - problems arising from afs/nfs hiccups). Such processes may need to be rolled back via the front end.

After a rollback, the red x on the data processing page will be gone, but the L1 status will still say Failed. This tends to confuse the duty scientists. You might want to use the setL1Status task (see below) to make it say Running. This is really optional; it won't affect the processing in any way. But there will be fewer pagers beeping.

Removing "dontCleanUp" is not necessary to process the data. The file just stops temporary files from getting deleted when we're done with them.

Rolling back substreams

From the front end, find the "Substreams" area and click the pink boxes for the substreams that you want to roll back. Then click "Rollback Selected SubStreams".

From the command line it's a bit more tricky:

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD rollbackStream --minimum 'L1Proc[90117001]/doRun[253889937]'

Remember to escape the square brackets if you are in tcsh.
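For example, under tcsh either keep the single quotes as above, or backslash-escape the brackets if you leave the argument unquoted:

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD rollbackStream --minimum L1Proc\[90117001\]/doRun\[253889937\]

Without quotes or backslashes, tcsh tries filename globbing on the brackets and aborts with "No match".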

When to rollback

Wait for the "setStatus" stream to have run. Rollback won't work unless everything downstream of the failed process is in a final state. It's generally not harmful to try too soon; you just get an unhelpful error message. Most things at run level can be rolled back right away.

If a recon job fails, you'll have to wait at least an hour. Maybe half a day.

Notice that the GRB search is executed per delivery and depends on all the FT1 and FT2 files in each run being registered (therefore, it depends on almost "everything" that belongs to that delivery). For this reason, you might need to wait for the entire delivery to be completed before being able to roll back any failed recon jobs. And because of the run lock (see below), some of the (parts of) runs in the delivery might have to wait for other deliveries to finish, which might have their own failures...

It's possible, but rare, to get deadlocks, where nothing can proceed until a lock is removed by hand. Best to ask for help then.

In general, experience will tell you when you can roll back what. So, if in doubt, you can try anyway (if it's too soon, nothing will happen and you will get an error)!

Often you can roll things back sooner if you cancel some processes.

If there is a delivery with some runs that are ready to roll back and others that aren't, you can do the rollback if you cancel kludgeAsp:

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD cancelProcessInstance 13388768

The number (or numbers; you can use a space-separated list to do more than one at a time) is the Oracle PK for the process instance; it's in the URL for the process instance page in the frontend. This takes a long time, 10-30 minutes.
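For example, a sketch cancelling two process instances in one call (the second PK here is hypothetical):

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline -m PROD cancelProcessInstance 13388768 13388769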

The dontCleanUp file

Any time one of the merge processes can't find all of its input files, a message is generated in the Log Watcher (and there will be errors in the log of the failed processes complaining about "Different number of events ...") and cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs will fail if the dontCleanUp file is present. If everything is OK (see instructions below), that file can be removed and the jobs rolled back. To check that everything is OK, follow these steps:

The new way:

  • Look at the dontCleanUp file. It should contain a list of all merge processes that missed files, and what files were missing. It's possible that it will get garbled by multiple jobs writing to it at once, so if it doesn't seem to make sense, you can still get the information by following the steps below.

The old way:

  • From the front end, click "Logging" (about 1/3 of the way down the left side)
  • Click "Selection" (upper right) - page should look like this
  • Click "... Selections" (lower right) - page should look like this
  • In the "Select Program" field, enter "mergeStuff.py"
  • In the "Select Target" field, enter the run number.
  • Click "Submit Selection" - page should look like this after filling Program and Target fields and submitting
  • Click "Table" (upper right, next to "Selection") - page should look like this

Now you should have a list of merges complaining about not being able to find input files. If any of the complaining merges are downstream of a failed process, you can ignore them. This is usually the case. But if they are not downstream from a failure, it means there was a disk read error while setting up the merge. Look at the logfile for that merge; it will say what files are missing. If they are really there, the merge can be rolled back and will probably work the second time. If they are not there, there's a bigger problem, and this should be considered a "permanent failure".

Any time one of these messages is generated, cleanup for the run is disabled by a file called dontCleanUp in the run directory on u52/L1. All cleanup jobs, and checkRun, will fail if that file is present. If everything is OK, that file can be removed and the jobs rolled back. The cleanupCrumbs jobs shouldn't fail if they're rolled back after cleanupChunks has run (they used to, but don't anymore).
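A minimal sketch of that check and removal (run number hypothetical):

cat /nfs/farm/g/glast/u52/L1/r0263753970/dontCleanUp
rm /nfs/farm/g/glast/u52/L1/r0263753970/dontCleanUp

Only remove the file after confirming, per the steps above, that the complaining merges are all downstream of a failure.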

The run lock

Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u52/L1. If findChunks fails or there are permanent failures in the run and another part of the run is waiting, it has to be removed by hand. It should never be removed unless the only failures in the run are findChunks or permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.
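If you do have to remove it, it's a one-liner (run number hypothetical; the lock file naming follows the example in "Making run locks" below):

rm /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock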

findChunks

This process is not automatically retried like most of the others. If it fails, you have to roll it back by hand. And remove the run lock (see above) and the throttle lock (next section) by hand. And you'll probably have to
mv /nfs/farm/g/glast/u52/L1/${runId}/${runId}_${deliveryid}_chunkList.txt /nfs/farm/g/glast/u52/L1/${runId}/${runId}_${deliveryid}_chunkList.txt.tmp

Also, it checks the chunks to make sure they don't overlap. That doesn't happen much anymore, except when RePiping. See the section on that.

The Throttle

Yay!

There's a throttle that limits the number of RDLs that can be in process (or in the hard part, at least) at once. It works by making files with names like /nfs/farm/g/glast/u52/L1/throttle/1.lock at the same time as it makes the run lock.

config.throttleLimit is normally set at 2; 3 is usually safe, but not always. Leave it at 2 unless we're way behind and you're up for watching it closely. It's still under development and a bit fragile, so it's probably better not to mess with it for now.

Too Much Data

Hopefully this whole section is now obsolete; see the previous one.

Sometimes we get no data from the MOC for half a day and then it all arrives at once. This will overload the AFS buffers (see next section) unless L1 processing is throttled by hand (we are working on implementing an automatic throttle, and ditching AFS buffering in favor of xrootd). You do this by hand-creating run locks for runs that haven't arrived yet and suspending batch jobs.

To predict future runs: go to the mission timeline viewer (glast-ground -> Mission Planning Web View -> Timeline), get the start times for the physics runs, then plug them into xTime. Make sure to uncheck "Apply Clock Offset Correction(s) for RXTE and Swift" at the bottom of the page.

By default the timeline doesn't go very far into the past; you may need to change that by clicking on "selections" in the top right corner of the page.

Making run locks: they have names like /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock. At the moment it doesn't matter what's in them (star); a half-sentence explaining why you made the lock is good. An empty file, or a rant about how much it sucks that you have to do this, works also. When you're ready to let the run go, just remove or rename the file and the run should start up in 5-10 minutes.
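A minimal sketch of hand-creating one (run number from the example above; this assumes the run directory already exists):

echo "holding r0263753970 until the backlog drains - <your name>" > /nfs/farm/g/glast/u52/L1/r0263753970/r0263753970.lock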

(star) Actually the locks created by L1 do have meaningful content, and Bad Things will happen when it tries to remove them if it's not correct. But if you make one by hand L1 won't remove it, you have to, so it's OK to put whatever in there.

Suspending jobs

If the first part of a run is processing and you want to stop the second part from starting at a bad time, use the pipeline front end to get the LSF job ID of the findChunks process for the second part (which will be pending due to the run lock placed by the first part), log into a noric as glastraw and use bstop to suspend it. bresume it when you're ready to let it run.
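A minimal sketch of that sequence (the LSF job ID is hypothetical; the grep assumes the findChunks job name shows up in the bjobs output):

bjobs -wu glastraw | grep findChunks
bstop 1234567
bresume 1234567

Run bresume once you're ready to let the second part start.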

If both parts of a run arrive while it's locked out, you can reduce the total amount of I/O that it does by letting the smaller part go first, since all of the data in the part that goes first has to be merged twice. Suspend findChunks for both parts, remove the run lock, then resume findChunks for the part with less data. "Less data" == "fewer chunks" unless it's highly fragmented; in that case, du on the evt chunk directory (like /afs/slac.stanford.edu/g/glast/ground/PipelineStaging6/halfPipe/090512001/r0263753970) may give a better idea.

AFS buffers & idle threads

When the AFS servers where we keep temporary files hiccup, it's usually because they ran low on idle threads. It is possible to monitor this value and intervene to stave off disaster.

It can be viewed with Ganglia or Nagios. Nagios only works inside SLAC's firewall, but is somewhat more reliable. Ganglia works from anywhere, and shows higher time resolution, but sometimes it stops updating and just shows a flat line with some old value.

Ganglia

Nagios

  • Using ssh as a proxy: You can use ssh as a SOCKS proxy to relay requests through a SLAC computer.
  • Pick a port. Should be a 4-digit number. Referred to as <port> below.
  • ssh -D <port> noric.slac.stanford.edu

  • Then configure your browser to use a SOCKS proxy at localhost:<port> (a quick check of the tunnel is sketched below)
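A quick way to check that the tunnel works before touching the browser (the URL is hypothetical; any SLAC-only page will do):

curl --socks5-hostname localhost:<port> http://nagios02.slac.stanford.edu/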

Now you should be able to access SLAC-only pages.

There's 2 places to get the threads:

  • Nagios: click "Hostgroup Overview" on the left side, then the host you want in the "AFS fileservers" table. This shows idle threads for that server on the "AFSfs" row.

  • Ganglia: click "AFS FS Graphs" in the upper left. This shows most or all SLAC AFS servers. Not as reliable as nagios02, and it's going away eventually.


What to do with the info

When the servers are idle, idle threads should be 122. The SLAC IT people consider it a warning if it goes below 110 and an error at 100. For us, it seems like things usually work as long as it stays above 60 or so (although afs112 has been known to go as low as 40, since PipelineStaging6 is for the HalfPipe). This is likely to occur if there are more than ~300 chunk jobs running, usually after recon finishes and the chunk-level jobs downstream of recon start up.

I've written a script that will suspend jobs that are using a specified server, wait a bit, and then resume them with a little delay between. Suspending jobs will make the server recover sooner, but some of the jobs are likely to get timeout errors and fail due to the delay of being suspended, so it may be best to wait until idle threads hits 0 before suspending.

  • From a noric, as glastraw:
    setenv L1ProcROOT /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/Level1/1.70 (any L1 version >= 1.70 should work)
    $L1ProcROOT/tools/grepJobs.py PipelineStaging2 1200 60
  • first arg is the volume to be suspended
  • second arg is seconds to wait between stopping and resuming
  • third arg is seconds to wait between resuming jobs
  • If you want to do afs106, you have to say "PipelineStaging/"; if you say "PipelineStaging" it will suspend everything (see the example below).
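For example, a sketch for afs106 per the note above (the trailing slash keeps the script from matching every PipelineStaging* volume; the timing arguments are the same as in the example):

$L1ProcROOT/tools/grepJobs.py PipelineStaging/ 1200 60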

The association between disk and server can be found in several ways. Here's one:


$ fs whereis PipelineStaging*
File PipelineStaging is on host afs106.slac.stanford.edu
File PipelineStaging1 is on host afs106.slac.stanford.edu
File PipelineStaging2 is on host afs104.slac.stanford.edu
File PipelineStaging3 is on host afs107.slac.stanford.edu
File PipelineStaging4 is on host afs108.slac.stanford.edu
File PipelineStaging5 is on host afs109.slac.stanford.edu
File PipelineStaging6 is on host afs112.slac.stanford.edu
File PipelineStaging7 is on host afs113.slac.stanford.edu

You can also find the amount of disk usage with the following command:


$ fs lq PipelineStaging*
Volume Name               Quota       Used  %Used  Partition
G.glast.ground.PipStgial  350000000  128422194    37%        26%
G.glast.ground.PipStg2    350000000  118129732    34%        24%
G.glast.ground.PipStg3    350000000  124826187    36%        26%
G.glast.ground.PipStg4    350000000  124744630    36%        26%
G.glast.ground.PipStg5    350000000  132978520    38%        27%
G.glast.ground.PipStg6    350000000   59700367    17%        12%
G.glast.ground.PipStg7    350000000  107201000    31%        22%

Useful resources

  • List of failed runs. Report here failed runs and their status. When a run is fixed (meaning: no red cross appears anywhere anymore), you can delete it from the list (the runs listed as 'ignore error (it is fixed)' are the ones that were fixed with a fake delivery, so they still appear as failed in the Data Processing page).
  • AFS hiccups. We keep track here of the AFS hiccups (or at least we try to). You can add date/time, LSF job ID and the 'missing' file.
  • LSF hiccups. We keep track here of the LSF hiccups (or at least we try to). You can add date/time, LSF batch host(s) and the reason, if known.
  • List of runs incorrectly marked as InProgress. These are runs where acq status is Complete, but L1 status is InProgress.

You should definitely join the following mailing lists:

  • Data Handling (datalist)
  • Level 1 processing (l1proclist)
  • Data processing pipeline (pipelinelist)

And probably these:

  • LAT Instrument Data Monitoring (datamonlist)
  • Questions and problems concerning GLAST software (helpsoftlist)
  • LAT operations team (latopslist)
  • Nagios alert messages (nagios)
  • People who will work ISOC shifts (shiftslist)
  • ISOC Science Operations (solist)

Other things you may want to know

setL1Status task

Tired of being paged because L1Proc status still says Failed after a rollback?

/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240831665,l1RunStatus=Running" setL1Status


OR:

/afs/slac/g/glast/ground/bin/pipeline --mode PROD createStream --define "runNumber=240837713" setL1Status

l1RunStatus defaults to Running, but you can set it to any of the allowed values (Complete, InProgress, Incomplete, Running, Failed).

How to contact SCCS

  • Send an email to unix-admin (this usually works, even at night and on weekends).
  • If you don't get an answer and the issue is urgent, call xHELP (650-926-4357). Choose 4 to page the on-call person.
  • If your call/email isn't answered and the issue is REALLY urgent, page the on-call person again at 650-846-0178.

Log Watcher Messages

Message text: Can't open lockfile /nfs/farm/g/glast/u52/L1/r0248039911/r0248039911.lock.
...

cancelProcessInstance

Here's the syntax to cancel a process that is not in a final state, and all its dependencies:

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/dev/pipeline --mode PROD cancelProcessInstance 8073657

The Process Instance Primary Key (ie - 8073657 in the example) can be determined from the URL for the process instance page (eg: the PIPK for http://glast-ground.slac.stanford.edu/Pipeline-II/pi.jsp?pi=20253756 is 20253756). NOTE: Please don't use this unless you really (REALLY!!!) know what you are doing.

Host or host group is not used by the queue

If you see jobs being terminated with exceptions in the message viewer saying things like "yili0148+5: Host or host group is not used by the queue. Job not submitted.", it means the hosts available to glastdataq have changed. The solution is to roll back the affected chunks (the doChunk streams) with a new value for HOSTLIST. When you roll back the streams from the frontend, the confirmation page presents an opportunity to set or redefine variables. To figure out what the new value needs to be, do a "bqueues -l glastdataq". The output will include a line like "HOSTS: bbfarm/". In this case you'd enter HOSTLIST=bbfarm in the box on the confirmation page.

bbfarm is actually a temporary thing for the cooling outage; when things get switched back to normal, the relevant line from bqueues will probably look more like "HOSTS: glastyilis+3 glastcobs+2 preemptfarm+1". Then the thing to enter in the box would be HOSTLIST="glastyilis glastcobs genfarm".
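A quick sketch of that check:

bqueues -l glastdataq | grep HOSTS

Use whatever host groups the HOSTS line reports, dropping the "+N" preference suffixes and any trailing "/", as in the examples above.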

Red question marks in display

Red question marks show up in the display for HalfPipe. Click on the FASTCopy "Logs" number. First check to see if the "ProcessSCI.end" event has occurred. If not, then wait until it occurs and see if the red question mark persists. If it persists and the log shows messages saying "pipeline nonEventReporting submission succeeded" or "no reassembled datagrams found", it just means that the delivery contained data that had already been received. Meaningful deliveries will say "found XXX LSEP datagrams in ....". If it is a meaningful delivery, then further intervention is required.

Stuck processes

If you suspect a process is stuck, visit the process page from the run page (ie - click on the green task bar in the Data Processing Page, then click on the particular job that is running in the list at the bottom of the page). Each entry in the table on the page will contain a link to "Messages", "Log", and "Files". If you click on "Log", it should show you the log, but more importantly, it'll show the location of the log file right above the box containing the log information (note: if the file is too big, it will not open, but the link to "Files" will show where the log is located, so you can get the information that way as well). If you visit the location of the file on the noric machines, you can determine the last time the file was updated (ls -l, or tail -f to see if it's still streaming). If it is not updating, find out the batch job number (JobId from the stream table), log into noric as glastraw, and issue a bkill <JobId>. The process should automatically be restarted after this command is issued.

Repiping

When weird things are happening with the delivery that you can't figure out, it may be necessary to repipe the stream (probably a good idea to check with someone first). Log in as glastraw and issue the following command:

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD createStream --stream ${ID} --define downlinkID=${ID},runID=${ID} RePipe

where ${ID} is the Run ID (just numbers, no leading r0).

This will create a directory in /nfs/farm/g/glast/u28/RePipe ... you will need to then manually re-enter the run into L1Proc.

To do this, bkill any findChunk processes that are associated with the RunID, remove the run lock from /nfs/farm/g/glast/u52/rXXX (where XXX is the run number), also move all chunkList .txt files (leave the .txt.tmp ones alone) to something else (just suffixing them with ".ignore" should work; see the tcsh sketch after the command below) and issue the following command:

/afs/slac.stanford.edu/u/gl/glast/pipeline-II/prod/pipeline -m PROD createStream --stream ${ID} --define DOWNLINK_ID=${ID},DOWNLINK_RAWDIR=/nfs/farm/g/glast/u28/RePipe/${ID} L1Proc
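The chunkList renaming mentioned above might look like this tcsh sketch (XXX stands for the run number, path as given in the step above):

foreach f (/nfs/farm/g/glast/u52/rXXX/*chunkList.txt)
  mv $f ${f}.ignore
end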

You should be able to see if the process restarted correctly by visiting here, which will list all the streams that have not resulted in "Success".

Batch commands

bjobs -wu glastraw - running this from noric will list all the jobs in the LSF batch queue owned by glastraw (which is the user submitting pipeline streams).

Shift sign up

Sign up for shifts here. View shift calendar here.

Useful L1Proc links (these are designed to keep you from waiting for the entire L1Proc history to be retrieved ... which could be 10 min and up).