It's overkill to roll back from the checkChunk. Maybe even harmful, but I guess the chunk data are still on xrootd. But this should be verified.

MeritHist - Different number of events in Merit chain

This has happened a few times. For instance,

https://www-glast.stanford.edu/protected/mail/opsprob/15247.html
https://www-glast.stanford.edu/protected/mail/opsprob/14337.html
https://www-glast.stanford.edu/protected/mail/opsprob/11489.html

A typical rollback probably won't work. It will likely need to be rolled back from an earlier process like scanDigi.

GlastCalibDB (mysql-node03) is down (Halting a specific Task, like L1Proc)

This mysql server hosts the GlastCalibDB database where the L1 pipeline looks up the calibration file to use. If it's down or needs to be taken down (e.g., to install an update), Brian left instructions to halt job submission for L1Proc so that other tasks may run as normal.

Stopping a single task:

You need to be a Pipeline admin to do this.
Navigate to https://glast-ground.slac.stanford.edu/Pipeline-II/exp/Fermi/admin_jmx.jsp.
Select jobsite you expect jobs to run at (e.g., "SLACDATA").
Find the "setJobsTotalLimitForTask" button.
Add your task name (e.g., "L1Proc") in the p1 parameter box next to it.
Set the value to 0 in the p2 parameter box.
Click the setJobsTotalLimitForTask button.

This should suspend job submission and leave processes in the Waiting state.

To clear it (and all other limits that were set for the jobsite):

Put "true" in the parameter box for "userClearLimits" (in the same widget) and then click the button. However, there seems to be a bug, and this may not work. According to Brian userClearLimits will clean out a file that is stored, but it won't clear out the in-memory state. The parameter userStoreLimits is what writes the file, persisting the in memory state. Without the file, it will forget limits that are set in the pipeline in the memory on restart.
If just setting it to "true" doesn't restart processing:
- Use the admin page to restart the pipeline server. According to Brian userClearLimits with "true" and a restart works because you delete the files and pick up database configs on restart.
- You can also try putting "false" in the box, and then "true".
- Alternatively, set the limit to something high/reasonable, such as 6000.

getTimes Terminates

When this happens, it will leave a lock file behind (e.g., /nfs/farm/g/glast/u41/L1/runs/697/r0697981861/r0697981861.lock) that you will need to delete before rolling the job back. Otherwise, the rollback will stay queued but not run. If you do the rollback before removing the lock file you should bkill the job, wait for the reaper to terminate it, then remove the lock file, and do the rollback again.

MOOT Key Mismatch between on-board LAT and pipeline expectation

This will cause an error in the logs like:

MootSvc FATAL Hw key 8517 in data does not match hw key belonging to Moot Config #3063

This has happened in 2023, 2017, and 2013 due to a problem with an update to the hot strip mask on the satellite, e.g., the change was reverted but the pipeline was expecting the MOOT key for the new configuration. Following what was first done in 2013, the fix is to update the databases to change the MOOT key for the failed runs back to the old value and rollback the failures. Other stuff mentioned in the original thread doesn't seem necessary.

1) Change the moot_key in glastops_acqsummary in the Oracle database.

Setup an ISOC environment terminal and then run sqlplus \@isocflight to access the database. You will need standard pipeline maintainer permissions.
For SQL commands, see: https://www-glast.stanford.edu/protected/mail/opsprob/10899.html

2) Change the config_fk (mootkey) in MOOD's Acq_summary in the MOOD/FMX MySQL database.

To access the database, you need to login to fermilnx02 (in 2023) and issue the command mysql -h glastMOOD -P8208 -u <username> -p. You will need to get a password from the secret/fpd/fermi/isoc vault. Richard may need to put in a ticket for you to get permission. In 2023, we used Steve Tether's account (tether) since the "magician" account is overpowered for what we need (it's basically the super-user account).
For SQL commands, see https://www-glast.stanford.edu/protected/mail/opsprob/10898.html. Before you have to activate the db with use mood;.

3) In some receipts (e.g. in older emails) is written that a repipe is neccessary. Michael doesn't recall repipes for the 2017 key mismatch. At least no helpful repipes. In 2023 for sure no repipes were performed. Don't repipe unless really neccessary!

4) Rollback the affected deliveries

If they are in a Failed state, you can just do a standard rollback.
If findChunks is stuck, then bkill the process and wait for the reaper to get it. Most of them should then auto rollback without intervention.
Most likely someone has created run locks manually to halt runs from being processed with wrong moot keys and successively failing. Don't forget to remove these run locks.

History

The MOOT Key Mismatch problem occurred on 2023 July 18. See the Fermi news item for 2023 July 20 at https://fermi.gsfc.nasa.gov/ssc/library/news/
- 15 LPA runs from the LAT had to have their MOOT keys manually fixed in Oracle and MySQL before they could be processed correctly through L1:
  - 711402686, 711405073, 711411046, 711416992, 711422925, 711428850, 711434773, 711440537, 711446088, 711451782, 711457476, 711463169, 711468863, 711474557, 711478257
The MOOT Key Mismatch problem occurred on 2017 November 30. See: https://www-glast.stanford.edu/protected/mail/opsprob/10893.html

Problems with Magic 7 File

For instance, this error in ft2Runs stream 240415006.734877891:

Code Block

language	bash

> terminate called after throwing an instance of 'std::runtime_error'
> what(): FATAL: the provided Magic 7 file does not cover the requested time interval. To cover the requested interval we would need to extrapolate position and attitude (forward) more than what permitted by the current configuration (see the parameter 'extrapolationLimit').

Michael fixed this by:

1) doRun.ft2Runs (case of 240415006.734877891) reads from the runs area:
stageIn for: /nfs/farm/g/glast/u41/L1/runs/734/r0734877891/r0734877891_v000_magic7L1.txt

The magic7 file in the staging area, which is also being copied into the
runs area, is complete. makeM7L1 reads from the staging area.

Thus, I just rolled back makeM7L1, which found all packets and created a
valid magic7L1 file in the runs area, to be read by ft2Runs.

2) doRun.doChunk.fakeFT2 (case of 240414007.734792595.6674757) stages from
the staging area:
stageIn for: /nfs/farm/g/glast/u28/stage/240414007/magic7_240414007.txt

This file was incomplete! I replaced it by the 240414008 magic7 file

PGWave drpMonitoring fails due to duplicate stream

PGWave reports a duplicate stream error, e.g.,

Code Block

language	bash

Task drpMonitoring Process launchDrpMonitoring Stream 745027200.0
org.srs.pipeline.server.sql.DatabaseUtilities$DuplicateStreamException: A stream ALREADY exists with specified task, parent, and id

According to Jim, if the downstream drpMonitoring task has already managed to be submitted, the failed PGWave.launchDrpMonitoring stream can be left as-is. If you want to clean it up, temporarily disable the code to launch the drpMoninitoring stream and then roll it back.

Beginning October 31, 2024, login to Confluence and Jira will change. Read more.

Space shortcuts

Child pages

Versions Compared

Old Version 21

New Version Current

Key

MeritHist - Different number of events in Merit chain

GlastCalibDB (mysql-node03) is down (Halting a specific Task, like L1Proc)

getTimes Terminates

MOOT Key Mismatch between on-board LAT and pipeline expectation

Problems with Magic 7 File

PGWave drpMonitoring fails due to duplicate stream

Beginning October 31, 2024, login to Confluence and Jira will change. Read more.

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 21

New Version Current

Key

MeritHist - Different number of events in Merit chain

GlastCalibDB (mysql-node03) is down (Halting a specific Task, like L1Proc)

getTimes Terminates

MOOT Key Mismatch between on-board LAT and pipeline expectation

Problems with Magic 7 File

PGWave drpMonitoring fails due to duplicate stream