
First Edition: 6 Dec 2019

Version 1.1 (10:50 PT 6 Dec 2019)

Note

On 8 Dec 2019 this outage was postponed until July 2020.

...

Power outage timeline

[Tentative proposal]  Not many details are currently known, but this power outage will affect substations #7 (next to bldg 50) and #8 (located on the 4th floor of bldg 50). All of bldg 50 will be without normal power. The facilities (F&O) group plans to do its maintenance during the 4-day period starting 26 Dec 2019; however, the outage will start earlier due to lack of staff during the holiday shutdown. Minimally, it is expected that all H.A. (High Availability) and experiment-critical equipment will be powered throughout the 16+ days of the holiday shutdown. This page captures what Fermi will need in order to keep a minimal data-processing effort running during the outage.

Note

The LAT collaboration's ability to perform general science analysis at SLAC will be seriously hindered by this outage, because much of the batch farm will be unavailable.

Date | Time | Equipment* | Action
A day or two prior to 20 Dec 2019 | TBA | | Test of power source switching (i.e., normal line power to generator)
Fri 20 Dec 2019 | TBA | | Switch to generator power (this could happen earlier); requires a several-hour outage
Mon 6 Jan 2020 | TBA | | Return to normal power; requires a several-hour outage

...

  •  Define needed xrootd resources (Wilko Kroeger)
  •  Confirm sufficient xrootd space to handle the 16+ day HPSS outage (Wilko Kroeger)
  •  Define needed Level 0 and half-pipe resources (Steve Tether)
  •  Define needed Level 1 resources (Tom Stephens)
  •  Update service → VM → hypervisor host mapping (Brian Van Klaveren)
  •  Suggest/move non-HA VMs to HA (Brian Van Klaveren)
  •  Define needed ASP resources (Jim Chiang)
  •  Define needed resources for critical Fermi Science pipelines (various)
    •  Gravitational wave analysis (Nicola Omodei)
    •  Flare Advocates (Gulli Johannesson, Stefano Ciprini)
    •  Burst Analysis (Dan Kocevski)

...

LISOC Operations Functions

...

Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown?
Mission Planning, LAT Configurations | FastCopy | fermilnx01 and fermilnx02 | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring; Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes |
Real Time Telemetry Monitoring | | fermilnx01 and fermilnx02 | | | spread; Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies |
Logging | | fermilnx01 and fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes |
Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes |
L0 File Ingest and Archive | FastCopy | | L0 Archive | | | yes |
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | L0 Archive | | | yes, continuously |
L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes |
L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | |
L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | | yes |
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly |


The following table of servers must remain powered up and operational for Fermi Level 1 and critical Science Pipelines to function.

Fermi has requested that all VMs be relocated (at least temporarily) to the two H.A. hypervisor machines, so some of the tasks listed below are no longer relevant.

  •  Confirm current H.A. rack occupants: spreadsheet from Christian Pama (thanks Shirley!); old (2017) spreadsheet here
  •  Confirm the VM-master for a given VM. Use the 'node' command, e.g., $ node -whereis fermilnx-v12 (obsolete)
  •  Confirm the tomcat <-> service associations. Table here.
  •  Confirm the tomcat-VM associations in this table. Use the 'node' command, e.g., $ node -whereis glast-tomcat01 (see the sketch below)
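
The lookups above can be batched. Below is a minimal sketch (Python), assuming only that the site 'node' command behaves as in the examples above, i.e., that '$ node -whereis <vm>' prints the master for the named VM; the VM names in the list are illustrative examples taken from this page:

    #!/usr/bin/env python3
    # Sketch: batch VM-master lookup using the site 'node' command shown above.
    # Assumes 'node -whereis <vm>' prints the master host for the named VM;
    # the VM names below are illustrative examples from the tables on this page.
    import subprocess

    vms = ["fermilnx-v02", "fermilnx-v12", "glast-tomcat01"]

    for vm in vms:
        result = subprocess.run(["node", "-whereis", vm],
                                capture_output=True, text=True)
        # Print whatever the command reports (output format may vary by site).
        print(vm, "->", (result.stdout or result.stderr).strip())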

 


Info

NOTE: Fermi has four VMware hypervisors, each of which contains some number of VMs running Fermi services. Two of these hypervisor machines are in the H.A. racks (fermi-vmclust03/04), while the others (fermi-vmclust01/02) are not. At this writing there are no user-level tools to discover which VMs are running on which hypervisor machines.

Category† | Server | VM/service | Function
XC | fermi-gpfs01, fermi-gpfs02, fermi-gpfs05, fermi-gpfs06, fermi-gpfs07, fermi-gpfs08 | xrootd | xrootd server and storage
XC | fermi-vmclust01/02/03/04 | fermilnx-v02 | xrootd redirector
XC | fermi-vmclust01/02/03/04 | fermilnx-v12 | xrootd redirector
XC | fermi-vmclust01/02/03/04 | fermilnx-v03 | archiver
XC | fermi-gpfs03, fermi-gpfs04 | GPFS | Fermi NFS/GPFS storage
XC | fermi-cnfs01, fermi-cnfs02 | GPFS/NFS bridge | Fermi NFS storage access
XC | fermi-oracle03 | | oracle primary
XC | fermi-oracle04 | | oracle secondary
HA | staas-gpfs50, staas-gpfs51 | | Critical ISOC NFS storage
HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry
HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry
HA | mysql05, mysql06 | mysql-node03 | calibration, etc. DB
XC | 400 cores (50 "hequ" equivalents) | | batch hosts for LISOC; queues={express,short,medium,long,glastdataq}; users={glast,lsstsim,lsstprod,glastmc,glastraw}
XC | 200 cores (25 "hequ" equivalents) | | batch hosts for Science Pipelines
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v07/tomcat01 | Commons, Group manager
HA | fermi-vmclust01/02/03/04 | fermilnx-v16/tomcat06 | rm2
XC | fermi-vmclust01/02/03/04 | fermilnx-v05/tomcat08 | dataCatalog
XC | fermi-vmclust01/02/03/04 | fermilnx-v17/tomcat09 | Pipeline-II
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v15/pipeline-mail01 | Pipeline-II email server
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v10/tomcat11 | DataProcessing
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v11/tomcat12 | TelemetryTrending
NC | (non-Fermi server) | astore-new (HPSS) | FastCopy data archive. **We have been granted a temporary quota increase of 1 TB on /nfs/farm/g/glast/u23, which has allowed this item to become "NC"**
HA | (non-Fermi server) | trscron | tokenized cron
HA | (non-Fermi server) | lnxcron | cron
HA | (non-Fermi server) | (farm manager, etc.) | LSF management
HA | yfs01/NN (non-Fermi) | | basically all of AFS
HA | (non-Fermi server) | JIRA | issue tracking (HA as of 10/20/2017)
XC | rhel6-64 public login nodes | | a small number is needed for interactive access

† Equipment categories

Category | Machine status
NC | non-critical for entire 16-day shutdown period
XC | experiment critical but not in H.A. rack; only a few, short outages acceptable
HA | high-availability (continuous operation)


Total non-HA machines to receive emergency power:

Machine Type | Total | Notes
GPFS servers | 8 |
NFS/GPFS bridge | 2 |
VMware hypervisors | 2 | Not needed if all Fermi services can be moved to the two H.A. hypervisors
Batch nodes ("hequ" equivalents) | 75 | Depending on which batch nodes are selected, some may already be on H.A. power
Oracle servers | 1 | There is a rumor that this machine may already be on H.A. power – to be confirmed
Public login nodes | N | where "N" is a small integer
TOTAL | 88+N |
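
As a quick arithmetic check of the tally above, a minimal sketch:

    # Sketch: tally of non-HA machines needing emergency power (figures from
    # the table above; N, the number of public login nodes, is left symbolic).
    machines = {"GPFS servers": 8, "NFS/GPFS bridge": 2, "VMware hypervisors": 2,
                "batch nodes (hequ equivalents)": 75, "Oracle servers": 1}
    fixed_total = sum(machines.values())  # = 88
    print(f"total = {fixed_total}+N (N = a small number of public login nodes)")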

Note that HPSS is NOT required by Fermi.


The services for L1:

oracle

...

  • Runs once per GW event reported by the global GW detectors
  • Large variability in CPU requirement due to the varying size of the GW localization region on the sky
  • Estimate 300 core-hours per day per GW event (e.g., 10 hours on 30 cores)
    → This works out to about 4 hequ hosts for about 10 hours per GW event (see the sketch below)
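
As a cross-check, the batch-host table above implies roughly 8 cores per hequ (400 cores = 50 "hequ" equivalents). A minimal sketch of the conversion, using only the figures quoted above:

    # Sketch: convert the GW-event CPU estimate above into hequ hosts.
    # Assumes ~8 cores per hequ, implied by "400 cores (50 hequ equivalents)".
    CORES_PER_HEQU = 400 / 50                                  # = 8.0

    core_hours_per_event = 300                                 # estimate above
    hours_per_event = 10                                       # 10 h on 30 cores
    cores_needed = core_hours_per_event / hours_per_event      # = 30 cores
    hequ_needed = cores_needed / CORES_PER_HEQU                # ~3.75 -> ~4 hosts

    print(f"~{cores_needed:.0f} cores for {hours_per_event} h "
          f"~ {hequ_needed:.1f} hequ hosts per GW event")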

...

  • Batch jobs submitted to follow up on a flare alert, typically once/day at most
  • Batch job runs ~30 minutes on ~100 cores?
    → About 12 hequ-class nodes are needed for this analysis

Burst Analysis

  • Batch job(s) submitted to follow up on a gamma-ray burst detection
  • Six jobs/burst, medium queue, rhel6
  • A recent 7-day week had 11 triggers, i.e., more than one burst per day
    → Six hequ batch nodes should cover this need (see the sketch below)
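
Taken together with the 25 hequ reserved for Science Pipelines in the server table above, these per-analysis estimates can be sanity-checked. A rough sketch, assuming (pessimistically) that all three analyses peak at once:

    # Sketch: sanity-check the Science Pipelines reservation (25 hequ = 200
    # cores, from the server table) against the estimates quoted above.
    estimates_hequ = {
        "GW follow-up":    4,   # ~10 h per GW event
        "Flare Advocates": 12,  # ~30 min on ~100 cores, <= once/day
        "Burst Analysis":  6,   # six jobs/burst, > 1 burst/day
    }
    total = sum(estimates_hequ.values())  # = 22 hequ
    print(f"worst-case concurrent need ~{total} hequ vs 25 hequ reserved")

Even in this worst case the reservation appears to leave a few hequ of headroom.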

...

FAVA (Fermi All-Sky Variability Analysis)

  • Runs weekly
  • Can postpone routine FAVA analysis until after the outage

...

High availability racks

For general information about the high-availability racks, Shirley provided this pointer to the latest list:

...