[Tentative proposal] This power outage will affect substations #7 (next to bldg 50) and #8 (located on the 4th floor of bldg 50). All of bldg 50 will be without normal power. The facilities (F&O) group plans to do its maintenance during the 4-day period starting 26 Dec 2019. However, the outage will start earlier because of the lack of staff during the holiday shutdown. At a minimum, all H.A. (High Availability) and experiment-critical equipment is expected to remain powered throughout the 16+ days of the holiday shutdown. This page captures what Fermi will need to keep a minimal data processing effort running during the outage.
Note that the ability of the LAT collaboration to perform general science analysis at SLAC will be seriously hindered by this outage because much of the batch farm will be unavailable.
Date | Time | Equipment * | Action |
---|---|---|---|
Fri 20 Dec 2019 | TBA | | Switch to generator power (this could happen earlier). This will require a several-hour outage. |
Mon 6 Jan 2020 | | | Return to normal power. This will require a several-hour outage. |
Define needed xrootd resources (Wilko Kroeger)
Confirm sufficient xrootd space to handle 16+ day HPSS outage (Wilko Kroeger)
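The space check in the second action item comes down to simple arithmetic: with HPSS unavailable, the free xrootd space must absorb all new data for the whole outage. The sketch below is only an illustration of that calculation. The class name, the safety margin, and both input numbers are placeholders (assumptions, not measured values) and should be replaced with real figures.

```python
from dataclasses import dataclass


@dataclass
class BufferEstimate:
    """Hypothetical helper: how long can xrootd buffer data without HPSS?"""

    free_tb: float          # free xrootd space at the start of the outage (assumed)
    daily_ingest_tb: float  # new data written per day (assumed)

    def days_of_headroom(self) -> float:
        """Days the buffer lasts if nothing can be migrated to HPSS."""
        if self.daily_ingest_tb <= 0:
            raise ValueError("daily_ingest_tb must be positive")
        return self.free_tb / self.daily_ingest_tb

    def covers(self, outage_days: float, margin: float = 1.25) -> bool:
        """True if the buffer lasts outage_days times a safety margin."""
        return self.days_of_headroom() >= outage_days * margin


# Placeholder numbers for illustration only; substitute measured values.
estimate = BufferEstimate(free_tb=100.0, daily_ingest_tb=4.0)
print(estimate.days_of_headroom())      # 25.0
print(estimate.covers(outage_days=16))  # True, since 25 >= 16 * 1.25 = 20
```

The 25% margin is an arbitrary choice to allow for the outage starting early or ending late; adjust it to whatever the team considers safe.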
Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown? |
---|---|---|---|---|---|---|---|
Mission Planning, LAT Configurations | FastCopy | fermilnx01 and | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes | |
Real Time Telemetry Monitoring | | fermilnx01 and fermilnx02 | | | spread; Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies | |
Logging | | fermilnx01 and fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes | |
Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes | |
L0 File Ingest and Archive | FastCopy | | | L0 Archive | | yes | |
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | | L0 Archive | | yes, continuously | |
L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes | |
L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | | |
L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | | yes | |
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly | |
Category† | server | VM/service | function |
---|---|---|---|
XC | fermi-gpfs01 fermi-gpfs02 fermi-gpfs05 fermi-gpfs06 fermi-gpfs07 fermi-gpfs08 | | xrootd server and storage |
XC | fermilnx05-vmm | fermilnx-v02 | xrootd redirector |
XC | fermilnx07-vmm | fermilnx-v12 | xrootd redirector |
XC | fermi-gpfs03 fermi-gpfs04 | GPFS | Fermi NFS/GPFS storage |
XC | fermi-cnfs01 fermi-cnfs02 | GPFS/NFS bridge | Fermi NFS storage access |
HA | staas-gpfs50 staas-gpfs51 | | Critical ISOC NFS storage |
HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry |
HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry |
XC | fermilnx05-vmm | fermilnx-v03 | archiver |
HA | fermi-oracle01 | | oracle primary |
HA | fermi-oracle02 | | oracle secondary |
HA | mysql05 mysql06 | mysql-node03 | calibration, etc. DB |
HA | hequ0001-0030 | | batch hosts for LISOC; queues={express,short,medium,long,glastdataq}; users={glast,lsstsim,lsstprod,glastmc,glastraw} |
XC | hequ0031-0050 | | batch hosts for LISOC |
XC | hequ0051-0065??? | | batch hosts for Science Pipeline |
HA | fermilnx03-vmm | fermilnx-v07/tomcat01 | Commons, Group manager |
XC | fermilnx09-vmm | fermilnx-v16/tomcat06 | rm2 |
XC | fermilnx07-vmm | fermilnx-v05/tomcat08 | dataCatalog |
XC | fermilnx09-vmm | fermilnx-v17/tomcat09 | Pipeline-II |
XC | fermilnx08-vmm | fermilnx-v15/pipeline-mail01 | Pipeline-II email server |
XC | fermilnx09-vmm | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI |
XC | fermilnx07-vmm | fermilnx-v10/tomcat11 | DataProcessing |
XC | fermilnx07-vmm | fermilnx-v11/tomcat12 | TelemetryTrending |
XC | (non-Fermi server) | astore-new (HPSS) | FastCopy data archive |
HA | (non-Fermi server) | trscron | tokenized cron |
HA | (non-Fermi server) | lnxcron | cron |
XC | (non-Fermi server) | (farm manager, etc.) | LSF management |
HA | yfs01/NN (non-Fermi) | | basically all of AFS |
HA | (non-Fermi server) | JIRA | issue tracking (HA as of 10/20/2017) |
† Equipment categories
Category | Machine status |
---|---|
NC | non-critical for entire 16-day shutdown period |
XC | experiment-critical but not in H.A. rack; only a few short outages acceptable |
HA | high-availability (continuous operation) |
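When checking the generator load or the H.A. rack contents, it can help to group the server table by category. The following is a minimal sketch: the rows are a small subset copied from the server table above, and the `INVENTORY` layout and the function name are illustrative assumptions, not an existing tool.

```python
from collections import defaultdict

# A few rows copied from the server table: (category, server, function).
# This subset is for illustration only and is not the full inventory.
INVENTORY = [
    ("HA", "fermilnx01", "LAT config, fastcopy and real-time telemetry"),
    ("HA", "fermilnx02", "LAT config, fastcopy and real-time telemetry"),
    ("HA", "fermi-oracle01", "oracle primary"),
    ("XC", "fermi-gpfs03", "Fermi NFS/GPFS storage"),
    ("XC", "fermilnx05-vmm", "xrootd redirector"),
    ("XC", "fermilnx05-vmm", "archiver"),
]


def servers_by_category(rows):
    """Group server names by category, listing each server once, in table order."""
    grouped = defaultdict(list)
    for category, server, _function in rows:
        if server not in grouped[category]:
            grouped[category].append(server)
    return dict(grouped)


grouped = servers_by_category(INVENTORY)
print(grouped["HA"])  # servers that need continuous operation
print(grouped["XC"])  # servers that tolerate only a few short outages
```

A host such as fermilnx05-vmm carries more than one VM, so it appears in several table rows but only once in the grouped output.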
oracle
mysql
tomcats
isoc servers
xroot
The following servers are needed to allow processing of new data (older data on fermi-xrd or HPSS will not be available):
nfs
LSF
FASTCopy chain
--------------
staas-gpfs50/51
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore-new system (HPSS)
Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView,
         TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending
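The two lists above can be treated as dependency checklists: a chain works only if every component in it has power. Here is a small sketch of such a check. The component names are copied from the lists; the `CHAINS` dictionary and the `missing_components` function are hypothetical, and the unnamed pipeline server host is left out because the list does not identify it.

```python
# Component names copied from the "FASTCopy chain" and "Web servers" lists.
# "Whatever the pipeline server runs on" is omitted: the host is not named.
CHAINS = {
    "FASTCopy chain": [
        "staas-gpfs50/51",
        "fermilnx01",
        "fermilnx02",
        "trscron",
        "fermilnx-v03",
        "xroot servers",
        "astore-new",
    ],
    "Web servers": [
        "tomcat01",
        "tomcat06",
        "tomcat09",
        "tomcat10",
        "tomcat11",
        "tomcat12",
    ],
}


def missing_components(chain, powered):
    """Return the components of `chain` that are not in the `powered` set."""
    return [name for name in CHAINS[chain] if name not in powered]


# Example scenario (assumed): everything in the FASTCopy chain is up except HPSS.
powered = set(CHAINS["FASTCopy chain"]) - {"astore-new"}
print(missing_components("FASTCopy chain", powered))  # ['astore-new']
```

An empty result means the chain is fully powered; anything else names the components still to be covered by generator or H.A. power.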
For general information about the high-availability racks, Shirley provided this pointer to the latest list:
"Service Now, Knowledge Base, search for 'High Availability', following link for current servers"
And here is the current statement about high-availability functionality:
Current Services in HA Racks
• CATER application
• Confluence application
• Data center management tool
• Drupal web
• Email lists
• Email transport infrastructure
• ERP application
• Exchange email
• EXO application
• Facilities monitoring
• Fermi application
• IT Ticketing system
• Network infrastructure
• Site Security infrastructure
• Unix authentication infrastructure
• Unix AFS infrastructure
• Unix mailboxes
• Unix monitoring
• VPN
• Windows authentication infrastructure
• Windows file servers and SAN
• Windows monitoring
• Windows web
Supporting documentation
Email from Steve Tether with some storage-related information: