
First Edition: 6 Dec 2019

Version 1.1 (10:50 PT 6 Dec 2019)

Note

On 8 Dec 2019 this outage was postponed until July 2020.

...

Power outage timeline

[Tentative proposal]  Not many details are currently known, but this power outage will affect substations #7 (next to bldg 50) and #8 (located on the 4th floor of bldg 50). All of bldg 50 will be without normal power. The facilities (F&O) group plans to do its maintenance during the 4-day period starting 26 Dec 2019; however, the outage will start earlier due to lack of staff during the holiday shutdown. Minimally, it is expected that all H.A. (High Availability) and experiment-critical equipment will be powered throughout the 16+ days of the holiday shutdown. This page captures what Fermi will need in order to keep a minimal data-processing effort running during the outage.

Note

The LAT collaboration's ability to perform general science analysis at SLAC will be seriously hindered by this outage, because much of the batch farm will be unavailable.

Date | Time | Equipment* | Action
A day or two prior to 20 Dec 2019 | TBA | | Test of power source switching (i.e., normal line power to generator)
Fri 20 Dec 2019 | TBA | | Switch to generator power (this could happen earlier); requires a several-hour outage
Mon 6 Jan 2020 | TBA | | Return to normal power; requires a several-hour outage

...

  •  Define needed xrootd resources (Wilko Kroeger)
  •  Confirm sufficient xrootd space to handle the 16+ day HPSS outage (Wilko Kroeger)
  •  Define needed Level 0 and half-pipe resources (Steve Tether)
  •  Define needed Level 1 resources (Tom Stephens)
  •  Update service → VM → hypervisor host mapping (Brian Van Klaveren)
  •  Suggest/move non-HA VMs to HA (Brian Van Klaveren)
  •  Define needed ASP resources (Jim Chiang)
  •  Define needed resources for critical Fermi Science pipelines (various)
    •  Gravitational wave analysis (Nicola Omodei)
    •  Flare Advocates (Gulli Johannesson, Stefano Ciprini)
    •  Burst Analysis (Dan Kocevski)

...

LISOC Operations Functions

...

Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown?
Mission Planning, LAT Configurations | FastCopy | fermilnx01 and fermilnx02 | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring; Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes |
Real Time Telemetry Monitoring | | fermilnx01 and fermilnx02 | | | spread; Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies |
Logging | | fermilnx01 and fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes |
Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes |
L0 File Ingest and Archive | FastCopy | | L0 Archive | | | yes |
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | L0 Archive | | | yes, continuously |
L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes |
L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | |
L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | | yes |
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly |


The following table of servers must remain powered up and operational for Fermi Level 1 and critical Science Pipelines to function.

Fermi has requested that all VMs be relocated (at least temporarily) to the two H.A. hypervisor machines, so some of the tasks listed below are no longer relevant.

  •  Confirm current H.A. rack occupants: spreadsheet from Christian Pama (thanks Shirley!); old (2017) spreadsheet here
  •  Confirm the VM-master for a given VM. Use the 'node' command, e.g., $ node -whereis fermilnx-v12 (obsolete)
  •  Confirm the tomcat <-> service associations. Table here.
  •  Confirm the tomcat-VM associations in this table. Use the 'node' command, e.g., $ node -whereis glast-tomcat01 (see the sketch below)
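
The lookups above can be batched. Below is a minimal sketch (Python), assuming only that the site 'node' command behaves as in the examples above, i.e., that '$ node -whereis <vm>' prints the master for the named VM; the VM names in the list are illustrative examples taken from this page:

    #!/usr/bin/env python3
    # Sketch: batch VM-master lookup using the site 'node' command shown above.
    # Assumes 'node -whereis <vm>' prints the master host for the named VM;
    # the VM names below are illustrative examples from the tables on this page.
    import subprocess

    vms = ["fermilnx-v02", "fermilnx-v12", "glast-tomcat01"]

    for vm in vms:
        result = subprocess.run(["node", "-whereis", vm],
                                capture_output=True, text=True)
        # Print whatever the command reports (output format may vary by site).
        print(vm, "->", (result.stdout or result.stderr).strip())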

 


Info

NOTE: Fermi has four VMware hypervisors, each of which contains some number of VMs running Fermi services. Two of these hypervisor machines are in the H.A. racks (fermi-vmclust03/04), while the others (fermi-vmclust01/02) are not. At this writing there are no user-level tools to discover which VMs are running on which hypervisor machines.

Category† | Server | VM/service | Function
XC | fermi-gpfs01, fermi-gpfs02, fermi-gpfs05, fermi-gpfs06, fermi-gpfs07, fermi-gpfs08 | xrootd | xrootd server and storage
XC | fermi-vmclust01/02/03/04 | fermilnx-v02 | xrootd redirector
XC | fermi-vmclust01/02/03/04 | fermilnx-v12 | xrootd redirector
XC | fermi-vmclust01/02/03/04 | fermilnx-v03 | archiver
XC | fermi-gpfs03, fermi-gpfs04 | GPFS | Fermi NFS/GPFS storage
XC | fermi-cnfs01, fermi-cnfs02 | GPFS/NFS bridge | Fermi NFS storage access
XC | fermi-oracle03 | | oracle primary
XC | fermi-oracle04 | | oracle secondary
HA | staas-gpfs50, staas-gpfs51 | | Critical ISOC NFS storage
HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry
HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry
HA | mysql05, mysql06 | mysql-node03 | calibration, etc. DB
XC | 400 cores (50 "hequ" equivalents) | | batch hosts for LISOC; queues={express,short,medium,long,glastdataq}; users={glast,lsstsim,lsstprod,glastmc,glastraw}
XC | 200 cores (25 "hequ" equivalents) | | batch hosts for Science Pipelines
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v07/tomcat01 | Commons, Group manager
HA | fermi-vmclust01/02/03/04 | fermilnx-v16/tomcat06 | rm2
XC | fermi-vmclust01/02/03/04 | fermilnx-v05/tomcat08 | dataCatalog
XC | fermi-vmclust01/02/03/04 | fermilnx-v17/tomcat09 | Pipeline-II
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v15/pipeline-mail01 | Pipeline-II email server
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v10/tomcat11 | DataProcessing
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v11/tomcat12 | TelemetryTrending
NC | (non-Fermi server) | astore-new (HPSS) | FastCopy data archive. **We have been granted a temporary quota increase of 1 TB on /nfs/farm/g/glast/u23, which has allowed this item to become "NC"**
HA | (non-Fermi server) | trscron | tokenized cron
HA | (non-Fermi server) | lnxcron | cron
HA | (non-Fermi server) | (farm manager, etc.) | LSF management
HA | yfs01/NN (non-Fermi) | | basically all of AFS
HA | (non-Fermi server) | JIRA | issue tracking (HA as of 10/20/2017)
XC | rhel6-64 public login nodes | | a small number is needed for interactive access

† Equipment categories

Category | Machine status
NC | non-critical for entire 16-day shutdown period
XC | experiment critical but not in H.A. rack; only a few, short outages acceptable
HA | high-availability (continuous operation)


Total non-HA machines to receive emergency power:

Machine Type | Total | Notes
GPFS servers | 8 |
NFS/GPFS bridge | 2 |
VMware hypervisors | 2 | Not needed if all Fermi services can be moved to the two H.A. hypervisors
Batch nodes ("hequ" equivalents) | 75 | Depending on which batch nodes are selected, some may already be on H.A. power
Oracle servers | 1 | There is a rumor that this machine may already be on H.A. power – to be confirmed
Public login nodes | N | where "N" is a small integer
TOTAL | 88+N |
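
As a quick arithmetic check of the tally above, a minimal sketch:

    # Sketch: tally of non-HA machines needing emergency power (figures from
    # the table above; N, the number of public login nodes, is left symbolic).
    machines = {"GPFS servers": 8, "NFS/GPFS bridge": 2, "VMware hypervisors": 2,
                "batch nodes (hequ equivalents)": 75, "Oracle servers": 1}
    fixed_total = sum(machines.values())  # = 88
    print(f"total = {fixed_total}+N (N = a small number of public login nodes)")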

Note that HPSS is NOT required by Fermi.


The services for L1:

oracle

...

  • Runs once per GW event reported by the global GW detectors
  • Large variability in CPU requirement due to the varying size of the GW localization region on the sky
  • Estimate 300 core-hours per day per GW event (e.g., 10 hours on 30 cores)
    → This works out to about 4 hequ hosts for about 10 hours per GW event (see the sketch below)
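
As a cross-check, the batch-host table above implies roughly 8 cores per hequ (400 cores = 50 "hequ" equivalents). A minimal sketch of the conversion, using only the figures quoted above:

    # Sketch: convert the GW-event CPU estimate above into hequ hosts.
    # Assumes ~8 cores per hequ, implied by "400 cores (50 hequ equivalents)".
    CORES_PER_HEQU = 400 / 50                                  # = 8.0

    core_hours_per_event = 300                                 # estimate above
    hours_per_event = 10                                       # 10 h on 30 cores
    cores_needed = core_hours_per_event / hours_per_event      # = 30 cores
    hequ_needed = cores_needed / CORES_PER_HEQU                # ~3.75 -> ~4 hosts

    print(f"~{cores_needed:.0f} cores for {hours_per_event} h "
          f"~ {hequ_needed:.1f} hequ hosts per GW event")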

...

  • Batch jobs submitted to follow up on a flare alert, typically once/day at most
  • Batch job runs ~30 minutes on ~100 cores?
    → About 12 hequ-class nodes are needed for this analysis

Burst Analysis

  • Batch job(s) submitted to follow up on a gamma-ray burst detection
  • Six jobs/burst, medium queue, rhel6
  • A recent 7-day week had 11 triggers, i.e., more than one burst per day
    → Six hequ batch nodes should cover this need (see the sketch below)
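
Taken together with the 25 hequ reserved for Science Pipelines in the server table above, these per-analysis estimates can be sanity-checked. A rough sketch, assuming (pessimistically) that all three analyses peak at once:

    # Sketch: sanity-check the Science Pipelines reservation (25 hequ = 200
    # cores, from the server table) against the estimates quoted above.
    estimates_hequ = {
        "GW follow-up":    4,   # ~10 h per GW event
        "Flare Advocates": 12,  # ~30 min on ~100 cores, <= once/day
        "Burst Analysis":  6,   # six jobs/burst, > 1 burst/day
    }
    total = sum(estimates_hequ.values())  # = 22 hequ
    print(f"worst-case concurrent need ~{total} hequ vs 25 hequ reserved")

Even in this worst case the reservation appears to leave a few hequ of headroom.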

...

FAVA (Fermi All-Sky Variability Analysis)

  • Runs weekly
  • Can postpone routine FAVA analysis until after the outage

...

High availability racks

For general information about the high-availability racks, Shirley provided this pointer to the latest list:

...