We are working out which services and servers we need to get through the power outage. None of the existing pages mapping services to servers is up to date.
Power outage timeline (from Shirley). Note that the 10am (PST) start time is solid, but the 5pm end time is notional and could change.

Date | Time | Equipment | Action |
---|---|---|---|
Fri 22 Dec 2017 | EOB? | non-critical, general use | power off |
Tue 26 Dec 2017 | before 10am | non-critical, special request | power off |
Tue 26 Dec 2017 | before 10am | exp-critical | power off |
Tue 26 Dec 2017 | 10am | Bldg 50 | power off |
Tue 26 Dec 2017 | after 10am | exp-critical | power ON |
Tue 26 Dec 2017 | after 10am | High-avail | continuously powered ON |
Sat 30 Dec 2017 | before 5pm | exp-critical | power off |
Sat 30 Dec 2017 | 5pm | Bldg 50 | power ON |
Sat 30 Dec 2017 | after 5pm | exp-critical | power ON |
Sat 30 Dec 2017 | after 5pm | non-critical, special request | power ON |
Mon 8 Jan 2018 | starting 8am | non-critical, general use | power ON |

Category | Machine status |
---|---|
NC | non-critical for entire 16-day shutdown period |
NC5 | non-critical if down only during the 5 days of the outage and, possibly, the three days preceding the outage |
XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
HA | high-availability (continuous operation) |
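
As a quick sanity check on the "16-day" and "5 days" figures, the durations can be computed from the timeline dates. This is only a sketch: "EOB" has no clock time in the timeline, so 17:00 is assumed here, and the 5pm end time is notional as noted above.

```python
from datetime import datetime

# "EOB" has no stated clock time in the timeline; 17:00 is an assumption
# made only for this illustration.
ASSUMED_EOB_HOUR = 17

general_off = datetime(2017, 12, 22, ASSUMED_EOB_HOUR)  # non-critical, general use: power off
bldg_off = datetime(2017, 12, 26, 10)                   # Bldg 50 power off (solid)
bldg_on = datetime(2017, 12, 30, 17)                    # Bldg 50 power ON (notional)
general_on = datetime(2018, 1, 8, 8)                    # non-critical, general use: power ON


def days_hours(delta):
    """Return a timedelta as (whole days, remaining whole hours)."""
    return delta.days, delta.seconds // 3600


shutdown = days_hours(general_on - general_off)  # how long NC machines stay off
outage = days_hours(bldg_on - bldg_off)          # how long Bldg 50 is without power
calendar_days = (bldg_on.date() - bldg_off.date()).days + 1  # 26..30 Dec inclusive

print("NC shutdown: %d days %d hours" % shutdown)
print("Bldg 50 outage: %d days %d hours" % outage)
print("Calendar days touched by the outage:", calendar_days)
```

Under the assumed EOB time this gives 16 days 15 hours for the full shutdown and 4 days 7 hours (spanning 5 calendar days) for the building outage.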

Category | Server | VM/service | Function |
---|---|---|---|
XC | fermi-gpfs02 | | xrootd server |
XC | fermilnx05-vmm | fermilnx-v02 | xrootd redirector |
XC | fermilnx07-vmm | fermilnx-v12 | xrootd redirector |
HA | wain031 (or equivalent) | | NFS storage |
HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry |
HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry |
XC | fermilnx05-vmm | fermilnx-v03 | archiver |
HA | fermi-oracle01 | | oracle primary |
HA | fermi-oracle02 | | oracle secondary |
HA | mysql05/06 | mysql-node03 | calibration, etc. DB |
XC | hequNNN - hequNNN+24 | | batch hosts |
HA | fermilnx03-vmm | fermilnx-v07/tomcat01 | Commons, Group manager |
XC | fermilnx09-vmm | fermilnx-v16/tomcat06 | rm2 |
XC | fermilnx07-vmm | fermilnx-v05/tomcat08 | dataCatalog |
XC | fermilnx09-vmm | fermilnx-v17/tomcat09 | Pipeline-II |
XC | fermilnx09-vmm | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI |
XC | fermilnx07-vmm | fermilnx-v10/tomcat11 | DataProcessing |
XC | fermilnx07-vmm | fermilnx-v11/tomcat12 | TelemetryTrending |
XC | (non-Fermi server) | astore | data archive |
XC | (non-Fermi server) | trscron | tokenized cron |
XC | (non-Fermi server) | lnxcron | cron |
XC | (non-Fermi server) | (farm manager, etc.) | LSF |
HA | yfs01 (non-Fermi) | | AFS |
HA | yfs02 (non-Fermi) | | AFS |

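
Since several VMs share a physical host, it may help to group the table by host and see which hosts have to stay up. The sketch below covers only the rows whose server is a `-vmm` host. The `RANK` ordering and the rule that a host inherits the strictest category among its guests are assumptions made for illustration, not something stated in the table.

```python
from collections import defaultdict

# (category, physical host, VM) for the -vmm rows of the table above.
ROWS = [
    ("XC", "fermilnx05-vmm", "fermilnx-v02"),
    ("XC", "fermilnx07-vmm", "fermilnx-v12"),
    ("XC", "fermilnx05-vmm", "fermilnx-v03"),
    ("HA", "fermilnx03-vmm", "fermilnx-v07"),
    ("XC", "fermilnx09-vmm", "fermilnx-v16"),
    ("XC", "fermilnx07-vmm", "fermilnx-v05"),
    ("XC", "fermilnx09-vmm", "fermilnx-v17"),
    ("XC", "fermilnx09-vmm", "fermilnx-v18"),
    ("XC", "fermilnx07-vmm", "fermilnx-v10"),
    ("XC", "fermilnx07-vmm", "fermilnx-v11"),
]

# Assumed ordering, least to most critical (hypothetical, for this sketch only).
RANK = {"NC": 0, "NC5": 1, "XC": 2, "HA": 3}


def hosts_by_need(rows):
    """Map each host to (strictest guest category, sorted list of its VMs)."""
    vms = defaultdict(list)
    need = {}
    for category, host, vm in rows:
        vms[host].append(vm)
        if host not in need or RANK[category] > RANK[need[host]]:
            need[host] = category
    return {host: (need[host], sorted(vms[host])) for host in vms}


result = hosts_by_need(ROWS)
for host in sorted(result):
    category, guests = result[host]
    print(host, category, ", ".join(guests))
```

With the rows as transcribed, this yields four hypervisor hosts: fermilnx03-vmm (HA, one VM), fermilnx05-vmm (XC, two VMs), fermilnx07-vmm (XC, four VMs) and fermilnx09-vmm (XC, three VMs).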
For general information about the High-availability racks, Shirley provided this pointer to the latest list:
"Service Now, Knowledge Base, search for 'High Availability', follow the link for current servers"
And here is the current statement about high-availability functionality:
The services for L1:
oracle
mysql
tomcats
isoc servers
xroot
nfs
LSF
Here's what ISOC tasks need:
FASTCopy chain
--------------
wain031
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore system
Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending
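
Because the server table and the "Web servers" list were written separately, a simple set comparison can flag names that appear in one but not the other. This is a rough sketch using only the tomcat names transcribed from this page. It says nothing about which of the two is correct.

```python
# Tomcat instances named in the server table (VM/service column).
table_tomcats = {
    "tomcat01", "tomcat06", "tomcat08", "tomcat09",
    "tomcat10", "tomcat11", "tomcat12",
}

# Tomcat instances named in the "Web servers" list.
listed_tomcats = {
    "tomcat01", "tomcat06", "tomcat09",
    "tomcat10", "tomcat11", "tomcat12",
}

table_only = sorted(table_tomcats - listed_tomcats)
list_only = sorted(listed_tomcats - table_tomcats)

print("in table, not in list:", table_only)
print("in list, not in table:", list_only)
```

With the names as transcribed here, the only difference is tomcat08 (dataCatalog), which is in the table but not in the list. Whether that omission is intentional is left for review.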
Notes:
9/29/2017 - (TG and WK) Added list of xrootd servers needed to bridge the gap in December; started table of server names.