First Edition: 6 Dec 2019
Version 1.1 (10:50 PT 6 Dec 2019)
Note |
---|
On 8 Dec 2019 this outage was postponed until July 2020 |
...
[Tentative proposal] Not many details are currently known, but this power outage will affect substations #7 (next to bldg 50) and #8 (located on the 4th floor of bldg 50). All of bldg 50 will be without normal power. The facilities (F&O) group plans to do its maintenance during the 4-day period starting 26 Dec 2019; however, the outage will start earlier due to lack of staff during the holiday shutdown. Minimally, it is expected that all H.A. (High Availability) and experiment-critical equipment will be powered throughout the 16+ days of the holiday shutdown. This page captures what Fermi will need to keep a minimal data processing effort running during the outage.
Note |
---|
Note that the LAT collaboration's ability to perform general science analysis at SLAC will be seriously hindered by this outage, because much of the batch farm will be unavailable. |
Date | Time | Equipment* | Action |
---|---|---|---|
A day or two prior to 20 Dec 2019 | TBA | | Test of power source switching (i.e., normal line power to generator) |
Fri 20 Dec 2019 | TBA | | Switch to generator power (this could happen earlier); this will require a several-hour outage |
Mon 6 Jan 2020 | TBA | | Return to normal power; this will require a several-hour outage |
Define needed xrootd resources (Wilko Kroeger)
Confirm sufficient xrootd space to handle 16+ day HPSS outage (Wilko Kroeger)
...
* Equipment categories
Category | Machine status |
---|---|
NC | non-critical for entire 16-day shutdown period |
XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
HA | high-availability (continuous operation) |
...
...
Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown? |
---|---|---|---|---|---|---|---|
Mission Planning, LAT Configurations | FastCopy | fermilnx01 and fermilnx02 | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring; Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes | |
Real Time Telemetry Monitoring | | fermilnx01 and fermilnx02 | | | spread; Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies | |
Logging | | fermilnx01 and fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes | |
Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes | |
L0 File Ingest and Archive | FastCopy | | | L0 Archive | | yes | |
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | | L0 Archive | | yes, continuously | |
L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes | |
L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | | |
L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | | yes | |
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly | |
RSP | ? | | | | | | |
Fermi has requested that all VMs be relocated (at least temporarily) to the two H.A. hypervisor machines, thus some of the tasks listed below are no longer relevant.
Info |
---|
NOTE: Fermi has four VMware hypervisors, each of which hosts some number of VMs running Fermi services. Two of these hypervisor machines are in the H.A. racks (fermi-vmclust03/04), while the others (fermi-vmclust01/02) are not. |
...
At this writing there are no user-level tools to allow one to discover which VMs are running on which hypervisor machines. |
Category† | server | VM/service | function | | | | |
---|---|---|---|---|---|---|---|
XC | fermi-gpfs01 fermi-gpfs02 fermi-gpfs05 fermi-gpfs06 fermi-gpfs07 fermi-gpfs08 | xrootd | xrootd server and storage | ||||
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v02 | xrootd redirector | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v12 | xrootd redirector | | | | |
XC | fermi-gpfs03 fermi-gpfs04 | GPFS | Fermi NFS/GPFS storage | ||||
XC | fermi-cnfs01 fermi-cnfs02 | GPFS/NFS bridge | Fermi NFS storage access | ||||
HA | staas-gpfs50 staas-gpfs51 wain031 (or equivalent) | Critical ISOC NFS storage | |||||
HA | fermilnx01 | LAT config, fastcopy and real-time telemetry | |||||
HA | fermilnx02 | LAT config, fastcopy and real-time telemetry | |||||
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v03 | archiver | | | | |
HA | fermi-oracle03 | oracle primary | | | | | |
XC | fermi-oracle04 | oracle secondary | | | | | |
HA | mysql06 | mysql-node03 | calibration, etc. DB | | | | |
XC | 400 cores (50 "hequ" equivalents) | | batch hosts for ISOC; queues={express,short,medium,long,glastdataq} users={glast,lsstsim,lsstprod,glastmc,glastraw} | | | | |
XC | 200 cores (25 "hequ" equivalents) | | batch hosts for Science Pipelines | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v07/tomcat01 | Commons, Group manager | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v16/tomcat06 | rm2 | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v05/tomcat08 | dataCatalog | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v17/tomcat09 | Pipeline-II | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v15/pipeline-mail01 | Pipeline-II email server | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v10/tomcat11 | DataProcessing | | | | |
XC/HA | fermi-vmclust01/02/03/04 | fermilnx-v11/tomcat12 | TelemetryTrending | | | | |
NC | (non-Fermi server) | astore-new (HPSS) | FastCopy data archive **We have been granted a temporary quota increase of 1 TB on /nfs/farm/g/glast/u23, which has allowed this item to become "NC"** | ||||
HA | (non-Fermi server) | trscron | tokenized cron | ||||
HA | (non-Fermi server) | lnxcron | cron | ||||
XC | (non-Fermi server) | (farm manager, etc.) | LSF management | ||||
HA | yfs01/NN (non-Fermi) | basically all of AFS | |||||
HA | yfs02 (non-Fermi) | AFS | | | | | |
HA | (non-Fermi server) | JIRA | issue tracking (HA as of 10/20/2017) | | | | |
XC | rhel6-64 | public login nodes (a small number is needed for interactive access) | |||||
XC | fermi-gpfs03/04 | GPFS | Fermi user disk | ||||
XC | fermi-cnfs01/02 | GPFS/NFS bridge | Fermi user disk |
...
† Equipment categories
Category | Machine status |
---|---|
NC | non-critical for entire 16-day shutdown period |
XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
HA | high-availability (continuous operation) |
Machine Type | Total | Notes |
---|---|---|
GPFS servers | 8 | |
NFS/GPFS bridge | 2 | |
VMware hypervisors | 2 | Not needed if all Fermi services can be moved to the two H.A. hypervisors |
batch nodes ("hequ" equivalents) | 75 | Depending on which batch nodes are selected, some may already be in H.A. power |
Oracle servers | 1 | There is a rumor that this machine may already be on H.A. power – to be confirmed |
Public login nodes | N | (where "N" is a small integer) |
TOTAL | 88+N |
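A quick sanity check of the total above (a minimal sketch; the per-type counts are taken straight from the table, and the N public login nodes are left symbolic):
Code Block |
---|
# Machine counts from the summary table above (N login nodes excluded).
counts = {
    "GPFS servers": 8,
    "NFS/GPFS bridge": 2,
    "VMware hypervisors": 2,
    "batch nodes ('hequ' equivalents)": 50 + 25,  # ISOC + Science Pipelines
    "Oracle servers": 1,
}
print(sum(counts.values()))  # 88, i.e. the "88+N" total
|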
Note that HPSS is NOT required by Fermi.
The services for L1:
oracle
mysql
tomcats
isoc servers
xroot
The following servers are needed to allow processing of new data (older data on fermi-xrd or HPSS will not be available):
nfs
LSF
Here's what ISOC tasks need:
FASTCopy chain
--------------
staas-gpfs50/51
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore-new system (HPSS)
Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView
TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending
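If it helps during the shutdown, here is a minimal reachability sketch for the machines listed above. The port choices (22 for the Linux servers, 8080 for the tomcat VMs) and the exact host list are assumptions, not confirmed values; adjust them to match the real services before relying on it.
Code Block |
---|
#!/usr/bin/env python3
"""Rough liveness check for the servers listed above.

Assumptions (not confirmed on this page): the names resolve from inside
SLAC, port 22 answers on the Linux servers, and the tomcat VMs answer
on 8080.  Edit HOSTS to taste.
"""
import socket

HOSTS = {
    "staas-gpfs50": 22,
    "staas-gpfs51": 22,
    "fermilnx01": 22,
    "fermilnx02": 22,
    "fermilnx-v03": 22,   # Archiver
    "tomcat09": 8080,     # Pipeline-II
    "tomcat10": 8080,     # FCWebView, ISOCLogging, MPWebView, ...
}

def is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in HOSTS.items():
    print(f"{host}:{port} {'up' if is_up(host, port) else 'DOWN'}")
|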
For general information about the High-availability racks, Shirley provided this pointer to the latest list:
...
Code Block |
---|
Current Services in HA Racks
• CATER application
• Confluence application
• Data center management tool
• Drupal web
• Email lists
• Email transport infrastructure
• ERP application
• Exchange email
• EXO application
• Facilities monitoring
• Fermi application
• IT Ticketing system
• Network infrastructure
• Site Security infrastructure
• Unix authentication infrastructure
• Unix AFS infrastructure
• Unix mailboxes
• Unix monitoring
• VPN
• Windows authentication infrastructure
• Windows file servers and SAN
• Windows monitoring
• Windows web |
...
Supporting documentation
Email from Steve Tether with some storage-related information:
Expand | ||
---|---|---|
| ||
Change "fermilnx01 or fermilnx02" to "fermilnx01 and fermilnx02". While services can all be shifted to one of those machines, frankly it's a pain.
The partition staas-cnfs50lb:/gpfs/slac/ha/fs1/g/fermi/u23 currently has 554 GB free. This is where we store:
- Incoming FASTCopy packages (L0 data, HSK data).
- Outgoing FASTCopy packages (L1 data, mission planning).
- Unpacked LAT raw data (L0, HSK, etc.)
FASTCopy packages for both L0 and L1 data are archived daily to "astore-new" and are then deleted within 24 hours. "astore-new" is a POSIX-compliant filesystem interface to HPSS that replaced the old "astore" interface. This is driven by the old GLAST Disk Archiver service. The packages are also archived to xrootd daily. Unpacked raw data is also archived to xrootd but is retained for 60 days on u23. The unpacked raw data on xrootd is a "live" backup in the sense that it can be accessed by ISOC tools and L1 reconstruction if needed, though that option is not normally enabled.
We get something like 16 GB of L0 data daily. If archiving to astore-new is turned off, then we would have to retain the original incoming L0 FC packages, the unpacked L0 data, and the L1 FC packages. Naively assuming that all of these are about the same size, that means retaining 48 GB or more per day, so u23 would fill up in 11.5 days or less. And we'd probably start experiencing problems as it approached being 100% full.
If the astore-new archiving were kept going but the xrootd archiving were suspended, then we would retain only the 16 GB of unpacked L0 data per day which would fill up u23 in 30 days or so.
So I would recommend changing the classification of "astore (non-Fermi server)" from NC to XC for this long of an outage. And rename "astore" to "astore-new (HPSS)". I see that the Archiver server fermilnx-v03 is already classified as XC, so that's good.
The partition staas-cnfs50lb:/gpfs/slac/ha/fs1/g/fermi/u41 is used by the halfpipe to store events extracted from LAT raw data. The events would take up 16 GB daily times some modest expansion factor. That partition needs to be kept going for normal processing. I don't know how long the event data is retained but the partition currently has 4.4 TB free so it shouldn't be a problem in any event.
All the rest of the page seems OK.
|
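A quick back-of-the-envelope check of the fill-time estimates in the email above (a minimal sketch; it only restates the 554 GB free and the 16 GB/day vs. 48 GB/day retention figures quoted there):
Code Block |
---|
# Figures quoted in Steve's email above.
free_gb = 554        # current free space on u23
l0_daily_gb = 16     # incoming L0 data per day

# Scenario 1: astore-new archiving off -> keep incoming L0 packages,
# unpacked L0 data, and L1 packages (naively assumed equal in size).
retained_per_day = 3 * l0_daily_gb    # ~48 GB/day
print(free_gb / retained_per_day)     # ~11.5 days to fill u23

# Scenario 2: astore-new archiving on, xrootd archiving suspended ->
# keep only the unpacked L0 data.
print(free_gb / l0_daily_gb)          # ~34.6 days, i.e. "30 days or so"
|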
Wilko's statement regarding space currently available in xrootd:
Expand | ||
---|---|---|
| ||
There are currently ~290 TB free in the xrootd GPFS space, which is plenty. Also, if needed we can always purge old recon files from disk. |
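Combining Wilko's figure with the ~16 GB/day from Steve's note above gives a rough scale for what the outage itself would add to xrootd (a sketch; the factor of 3 reuses the same naive packages-plus-unpacked-plus-L1 assumption):
Code Block |
---|
# Rough xrootd demand over the outage vs. the quoted free space.
free_tb = 290        # free in the xrootd GPFS pool (Wilko)
outage_days = 16     # "16+ day" holiday shutdown
daily_gb = 16 * 3    # L0 packages + unpacked L0 + L1, naively equal in size

needed_tb = outage_days * daily_gb / 1000.0
print(f"~{needed_tb:.1f} TB needed vs. {free_tb} TB free")  # ~0.8 TB vs. 290 TB
|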
Nicola's estimate of batch power needed for GW follow up pipeline:
Expand | ||
---|---|---|
| ||
I am trying to figure out the right numbers looking at the resource plots…. Not sure how to read the plots. I think they are running on 300 cores for about an hour. So my estimation was 30 cores for 10 hours… |
Stefano's comment on Flare Advocates:
Expand | ||
---|---|---|
| ||
for the FA shifts |
Sara's response to Flare advocate script question:
Expand | ||
---|---|---|
| ||
They are usually submitted to the “medium” queue, in general relatively fast to finish (~30min) and not that much demanding resources wise, roughly a hundred cores I’d say. If you need more detailed info I do not know them on top of my head, there were also some changes lately to the code (I believe). I or some of the FA coord. managing this would need to take a look into the code. E.g. you were asking about the batch node, what other info may help? |
Teddy adds: "The FA scripts tend to be run each day at most."
Dan's statement on various Pipelines (including FAVA):
Expand | ||
---|---|---|
| ||
We have a few analysis pipelines that currently use the batch system. These include the burst advocate analysis, the gravitational wave followup, and FAVA. The gravitational wave analysis typically requires thousands of jobs to be launched to analyze a large portion of the sky, so I think it’s probably hopeless to keep that up during the outage. FAVA runs on weekly timescales, so we can probably safely catch up that analysis once the batch farm comes back to full strength. The burst advocate analysis gets launched a little more than once a day. Counting up the past week, we had 11 triggers in 7 days. Each trigger launches 6 jobs and each job goes to the medium queue using rhel6.
I can take the appropriate steps to deactivate the gravitational wave followup analysis and FAVA leading up to the outage. Let me know if you think we’d be able to keep the burst advocate analysis running and I’ll take the appropriate actions.
|
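For scale, the trigger rate Dan quotes works out to a very light daily load on the medium queue (a sketch using only the numbers in his note):
Code Block |
---|
# Burst advocate load from the note above.
triggers, days = 11, 7    # triggers over the past week
jobs_per_trigger = 6      # each goes to the "medium" queue on rhel6

print(round(triggers / days * jobs_per_trigger, 1))  # ~9.4 jobs/day
|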
Brian's proposal to move all VMs to H.A.:
Expand | ||
---|---|---|
| ||
I think we can move all fermilnx VMs to HA without oversubscribing memory or disk. Can we verify this?
* I think each fermilnx VM, except for fermilnx01 and fermilnx02, has 384GB memory.
* I think we have two VMware Hypervisors in HA.
I'd suggest distributing the VMs such that:
* fermilnx01 is on one hypervisor
* fermilnx02 is on another hypervisor (I think this is currently the case)
* All other fermilnx-v* VMs are distributed between the other two hypervisors (live migration if possible)
|
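A minimal sketch of the bookkeeping Brian is proposing: given hypervisor memory capacities and per-VM sizes (all numbers below are placeholders, not the real figures quoted above), check that a proposed split of the VMs across the two H.A. hypervisors does not oversubscribe memory.
Code Block |
---|
# Placeholder capacities and VM sizes -- replace with the real figures
# from vCenter before drawing any conclusions.
HYPERVISOR_RAM_GB = {"fermi-vmclust03": 512, "fermi-vmclust04": 512}
VM_RAM_GB = {
    "fermilnx01": 64, "fermilnx02": 64,
    "fermilnx-v02": 16, "fermilnx-v03": 16, "fermilnx-v05": 16,
    "fermilnx-v10": 16, "fermilnx-v11": 16, "fermilnx-v12": 16,
    "fermilnx-v15": 16, "fermilnx-v16": 16, "fermilnx-v17": 16,
    "fermilnx-v18": 16,
}

# Proposed placement: fermilnx01 and fermilnx02 on separate hypervisors,
# the fermilnx-v* VMs alternated between the two.
placement = {"fermi-vmclust03": ["fermilnx01"], "fermi-vmclust04": ["fermilnx02"]}
for i, vm in enumerate(sorted(v for v in VM_RAM_GB if v.startswith("fermilnx-v"))):
    host = "fermi-vmclust03" if i % 2 == 0 else "fermi-vmclust04"
    placement[host].append(vm)

for host, vms in placement.items():
    used = sum(VM_RAM_GB[v] for v in vms)
    cap = HYPERVISOR_RAM_GB[host]
    print(f"{host}: {used}/{cap} GB "
          f"{'OK' if used <= cap else 'OVERSUBSCRIBED'} -> {', '.join(vms)}")
|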
Juliyana's report on VMs in H.A.:
Expand | ||
---|---|---|
| ||
Currently Fermi has these hypervisors and VMS in HA and Non-HA
HA:
Fermilnx-v06/v07/08
Fermilnx01/02/04
NON-HA:
Fermi-ci-test01
Fermilnx-v02/v03/v04/v05/v10/v11/v12/v13/v14/v15/v16/v17/v18/v19
These are the fermi hypervisors. There are four of them: two in HA and two in Non-HA.
HA: fermi-vmclust03 and fermi-vmclust04
Non-HA: fermi-vmclust01 and fermi-vmclust02
|
Excerpt from Christian Pama's spreadsheet on H.A. rack contents:
Expand | ||
---|---|---|
| ||
fermi-vmclust03
fermi-vmclust04
fermilnx03-vmm
fermi-oracle03
wain031
staas-gpfs50
staas-gpfs51
sca-oracle01
scalnx02-vmm
scalnx10-vmm01
scalnx11-vmm02
scalnx12-vmm01
scalnx13-vmm02
|
Current tomcat server VMs:
Expand | ||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||
As of 11:30 on 5 Dec 2019
Note that old 'kvm' hypervisors have been replaced by two VMware hypervisors: fermi-vmclust01/02 (non-HA) + fermi-vmclust03/04 (HA) |
Other VMs:
Expand | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
|
...
Notes:
9/29/2017 - (TG and WK) added list of xrootd servers needed to bridge the gap in December; Started table of server names