This page is under construction.

Power outage timeline

[Tentative proposal]  This power outage will affect substations #7 (next to bldg 50) and #8 (located on the 4th floor of bldg 50). All of bldg 50 will be without normal power. The facilities (F&O) group plans to do its maintenance during the 4-day period starting 26 Dec 2019; however, the outage will start earlier because of reduced staffing during the holiday shutdown. At a minimum, all H.A. (High Availability) and experiment-critical equipment is expected to remain powered throughout the 16+ days of the holiday shutdown. This page captures what Fermi will need to keep a minimal data-processing effort running during the outage.

...

Date            | Time | Equipment | Action
Fri 20 Dec 2019 | TBA  |           | Switch to generator power (this could happen earlier). This will require a several-hour outage.
Mon 6 Jan 2020  |      |           | Return to normal power. This will require a several-hour outage.

Outage preparation task list

  •  Define needed xrootd resources (Wilko Kroeger)
  •  Confirm sufficient xrootd space to handle the 16+ day HPSS outage (Wilko Kroeger); see the sizing sketch after this list
  •  Define needed Level 0 and half-pipe resources (Steve Tether)
  •  Define needed Level 1 resources (Tom Stephens)
  •  Update VM to hypervisor host mapping (Brian Van Klaveren)
  •  Suggest/Move non-HA VMs to HA (Brian Van Klaveren)
  •  Define needed ASP resources (Jim Chiang)
  •  Define needed resources for critical Fermi Science pipelines (various)
    •  Gravitational wave analysis (Nicola Omodei)
    •  Flare Advocates (Gulli Johannesson)
    •  Burst Analysis (Dan Kocevski)
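A rough sizing sketch for the xrootd space check above, using the daily data-volume figures quoted in the Tether notes later on this page (16 GB/day of incoming L0 data, roughly 3x that in packages plus unpacked data); the assumption that all of it lands on xrootd for the duration of the HPSS outage is illustrative, not a confirmed plan.

    # Rough xrootd sizing sketch for the 16+ day HPSS (astore-new) outage.
    # Daily volumes are taken from the notes later on this page; the retention
    # assumption (everything kept on xrootd during the outage) is illustrative.
    outage_days = 16
    daily_l0_gb = 16                    # incoming L0 data per day
    daily_retained_gb = 3 * daily_l0_gb # L0 packages + unpacked L0 + L1 packages
    print(f"~{outage_days * daily_retained_gb} GB of xrootd space for a {outage_days}-day outage")
    # -> ~768 GB, plus whatever margin is wanted if the outage runs longer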


THE FOLLOWING DATA IS IN THE PROCESS OF BEING UPDATED FOR THE DEC 2019 OUTAGE

LISOC Operations Functions

Table of LISOC Tasks and Services

Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown?
Mission Planning, LAT Configurations | FastCopy | fermilnx01 and fermilnx02 | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring; Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes |
Real Time Telemetry Monitoring | | fermilnx01 and fermilnx02 | | | spread; Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies |
Logging | | fermilnx01 and fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes |
Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes |
L0 File Ingest and Archive | FastCopy | | | L0 Archive | | yes |
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | | L0 Archive | | yes, continuously |
L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes |
L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | |
L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | | yes |
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly |


The servers in the following table must remain powered up and operational for Fermi Level 1 processing to succeed.

  •  Confirm current H.A. rack occupants.  Spreadsheet from Christian Pama; old (2017) spreadsheet here (thanks, Shirley!).
  •  Confirm the VM-master (hypervisor host) for a given VM.  Use the 'node' command, e.g., $ node -whereis fermilnx-v12 (a bulk-check sketch follows this list).
  •  Confirm the tomcat <-> service associations.  Table here.
  •  Confirm the tomcat-VM associations in this table.  Use the 'node' command, e.g., $ node -whereis glast-tomcat01
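
A minimal sketch for the bulk check mentioned above, assuming the site 'node' command is on PATH and behaves as in the examples ($ node -whereis <vm>); the VM names in the list are illustrative, not the full inventory.

    # Minimal sketch: bulk-check VM placement with the 'node -whereis' command shown above.
    # Assumes the site 'node' command is on PATH; the VM list below is illustrative only.
    import subprocess

    vms = ["fermilnx-v03", "fermilnx-v12", "glast-tomcat01"]   # substitute the full Fermi VM list

    for vm in vms:
        result = subprocess.run(["node", "-whereis", vm], capture_output=True, text=True)
        hypervisor = result.stdout.strip() or result.stderr.strip()
        print(f"{vm} -> {hypervisor}")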

...

Category | Machine status
NC       | non-critical for entire 16-day shutdown period
XC       | experiment critical but not in H.A. rack, only a few, short outages acceptable
HA       | high-availability (continuous operation)

 


The services for L1:

oracle

  • pipeline
  • data catalog
  • group manager

...

  • 50 hosts should let us keep up (including ASP)

 

Here's what ISOC tasks need:

FASTCopy chain
--------------
staas-gpfs50/51
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore-new system (HPSS)


Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView,
         TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending

Science Pipelines

Gravitational Wave analysis (Nicola)

  • Estimate 300 core-hours per day per GW event.
    This corresponds to 4 hequ hosts running for about 10 hours per GW event (see the sketch below).
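
A quick back-of-envelope check of that estimate, as a sketch; the cores-per-hequ-host value is an assumption for illustration, not a confirmed spec.

    # Back-of-envelope check of the GW estimate above.
    core_hours_per_event = 300   # estimate from above
    hosts = 4
    cores_per_host = 8           # assumption; substitute the actual hequ core count
    wall_hours = core_hours_per_event / (hosts * cores_per_host)
    print(f"{hosts} hequ hosts x {cores_per_host} cores -> ~{wall_hours:.1f} hours per GW event")

With 8 cores per host this comes out to roughly 9-10 hours, consistent with the estimate above.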

 

Flare Advocate analysis

 

Burst Analysis

High availability racks

For general information about the High-availability racks, Shirley provided this pointer to the latest list:

...

Tether
Change "fermilnx01 or fermilnx02" to "fermilnx01 and fermilnx02". While services can all be shifted to one of those machines, frankly it's a pain.

The partition staas-cnfs50lb:/gpfs/slac/ha/fs1/g/fermi/u23 currently has 554 GB free. This is where we store:
    - Incoming FASTCopy packages (L0 data, HSK data).
    - Outgoing FASTCopy packages (L1 data, mission planning).
    - Unpacked LAT raw data (L0, HSK, etc.)

FASTCopy packages for both L0 and L1 data are archived daily to "astore-new" and are then deleted within 24 hours. "astore-new" is a POSIX-compliant filesystem interface to HPSS that replaced the old "astore" interface. This is driven by the old GLAST Disk Archiver service.  The packages are also archived to xrootd daily.  Unpacked raw data is also archived to xrootd but is retained for 60 days on u23. The unpacked raw data on xrootd is a "live" backup in the sense that it can be accessed by ISOC tools and L1 reconstruction if needed, though that option is not normally enabled.

We get something like 16 GB of L0 data daily. If archiving to astore-new is turned off, then we would have to retain the original incoming L0 FC packages, the unpacked L0 data, and the L1 FC packages. Naively assuming all of these are about the same size, that means retaining 48 GB or more per day, so u23 would fill up in 11.5 days or less. And we'd probably start experiencing problems as it approached being 100% full.

If the astore-new archiving were kept going but the xrootd archiving were suspended, then we would retain only the 16 GB of unpacked L0 data per day, which would fill up u23 in 30 days or so.
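
The fill-time arithmetic above, written out as a simple sketch (all numbers are taken from this page; the straight division is roughly consistent with the estimates quoted above):

    # u23 fill-time arithmetic from the paragraphs above.
    free_gb = 554          # current free space on u23
    daily_l0_gb = 16       # incoming L0 data per day

    # Scenario 1: astore-new archiving off -> retain incoming L0 packages,
    # unpacked L0 data, and outgoing L1 packages (~3 x 16 GB per day).
    retained_per_day = 3 * daily_l0_gb
    print(f"astore-new off: u23 fills in ~{free_gb / retained_per_day:.1f} days")   # ~11.5 days

    # Scenario 2: astore-new on, xrootd archiving suspended -> retain only
    # the unpacked L0 data (~16 GB per day).
    print(f"xrootd archiving off: u23 fills in ~{free_gb / daily_l0_gb:.0f} days")  # ~35 days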

So I would recommend changing the classification of "astore (non-Fermi server)" from NC to XC for an outage this long, and renaming "astore" to "astore-new (HPSS)". I see that the Archiver server fermilnx-v03 is already classified as XC, so that's good.

The partition staas-cnfs50lb:/gpfs/slac/ha/fs1/g/fermi/u41 is used by the halfpipe to store events extracted from LAT raw data. The events would take up 16 GB daily times some modest expansion factor. That partition needs to be kept going for normal processing. I don't know how long the event data is retained, but the partition currently has 4.4 TB free, so it shouldn't be a problem in any event.

All the rest of the page seems OK.

 

Gotchas from the Dec 2017 outage

  1. We did not specify the "medium" LSF queue in our requirements, but ended up needing it
  2. There was a delay in getting all 50 hequ hosts operating (some were/are on H.A., but others are not)
  3. The xrootd redirector had a problem and needed a restart

...