Figuring out what we need to get through the power outage. None of the existing pages that map services to servers are up to date.
Power outage timeline (from Shirley). Note that the "10am" (PST) start time is solid, but the "5pm" end time is notional and could change.
Date | Time | Equipment | Action |
---|---|---|---|
Fri 22 Dec 2017 | EOB? | non-critical, general use | power off |
Tue 26 Dec 2017 | before 10am | non-critical, special request | power off |
Tue 26 Dec 2017 | before 10am | exp-critical | power off |
Tue 26 Dec 2017 | 10am | Bldg 50 | power off |
Tue 26 Dec 2017 | after 10am | exp-critical | power ON |
Tue 26 Dec 2017 | | High-avail | continuously powered ON |
Sat 30 Dec 2017 | before 5pm | exp-critical | power off |
Sat 30 Dec 2017 | 5pm | Bldg 50 | power ON |
Sat 30 Dec 2017 | after 5pm | exp-critical | power ON |
Sat 30 Dec 2017 | after 5pm | non-critical, special request | power ON |
Mon 8 Jan 2018 | starting 8am | non-critical, general use | power ON |
Category | Machine status |
---|---|
NC | non-critical for entire 16-day shutdown period |
NC5 | non-critical if down only during the 5 days of the outage and, possibly, the three days preceding the outage |
XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
HA | high-availability (continuous operation) |
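The "16-day", "5 days" and "three days preceding" figures in the category definitions can be cross-checked against the timeline dates. The following is a minimal sketch, assuming that a general-use machine counts as down from the day after its Friday power-off through the day before its Monday power-on; the variable and function names are illustrative only.

```python
from datetime import date, timedelta

# Dates taken from the outage timeline (the 5pm end is notional and could change).
general_off = date(2017, 12, 22)   # non-critical, general use powered off (EOB)
outage_start = date(2017, 12, 26)  # Bldg 50 power off, 10am
outage_end = date(2017, 12, 30)    # Bldg 50 power ON, 5pm
general_on = date(2018, 1, 8)      # non-critical, general use powered ON, from 8am


def days_inclusive(first, last):
    """Number of calendar days from first to last, counting both ends."""
    return (last - first).days + 1


# Assumption: full down days run from the day after power-off
# to the day before power-on.
first_down_day = general_off + timedelta(days=1)
last_down_day = general_on - timedelta(days=1)

shutdown_days = days_inclusive(first_down_day, last_down_day)
outage_days = days_inclusive(outage_start, outage_end)
preceding_days = (outage_start - first_down_day).days

print(f"full shutdown period: {shutdown_days} days")
print(f"building outage:      {outage_days} days")
print(f"days before outage:   {preceding_days} days")
```

Under that counting assumption the numbers agree with the NC and NC5 definitions; if the 5pm end time moves, only `outage_end` needs to change.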
Category | server | VM/service | function |
---|---|---|---|
XC | fermi-gpfs02 | xrootd server | |
XC | fermilnx05-vmm | fermilnx-v02 | xrootd redirector |
XC | fermilnx07-vmm | fermilnx-v12 | xrootd redirector |
HA | wain031 (or equivalent) | NFS storage | |
HA | fermilnx01 | LAT config, fastcopy and real-time telemetry | |
HA | fermilnx02 | LAT config, fastcopy and real-time telemetry | |
XC | fermilnx05-vmm | fermilnx-v03 | archiver |
HA | fermi-oracle01 | oracle primary | |
HA | fermi-oracle02 | oracle secondary | |
HA | mysql05/06 | mysql-node03 | calibration, etc. DB |
XC | hequNNN - hequNNN+24 | batch hosts | |
HA | fermilnx03-vmm | fermilnx-v07/tomcat01 | Commons, Group manager |
XC | fermilnx09-vmm | fermilnx-v16/tomcat06 | rm2 |
XC | fermilnx07-vmm | fermilnx-v05/tomcat08 | dataCatalog |
XC | fermilnx09-vmm | fermilnx-v17/tomcat09 | Pipeline-II |
XC | fermilnx09-vmm | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI |
XC | fermilnx07-vmm | fermilnx-v10/tomcat11 | DataProcessing |
XC | fermilnx07-vmm | fermilnx-v11/tomcat12 | TelemetryTrending |
XC | (non-Fermi server) | astore | data archive |
XC | (non-Fermi server) | trscron | tokenized cron |
XC | (non-Fermi server) | lnxcron | cron |
XC | (non-Fermi server) | (farm manager, etc.) | LSF |
HA | yfs01 (non-Fermi) | AFS | |
HA | yfs02 (non-Fermi) | AFS | |
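Because several VMs share one physical host, each host effectively inherits the most demanding category of anything it runs. The sketch below shows one way to derive that from the table rows; it includes only the VM-hosting servers, and the ranking of categories (NC < NC5 < XC < HA) and the helper name `host_requirements` are assumptions made for illustration.

```python
from collections import defaultdict

# (category, server, VM/service) rows copied from the server table;
# only the VM-hosting servers are included to keep the sketch short.
ROWS = [
    ("XC", "fermilnx05-vmm", "fermilnx-v02"),
    ("XC", "fermilnx07-vmm", "fermilnx-v12"),
    ("XC", "fermilnx05-vmm", "fermilnx-v03"),
    ("HA", "fermilnx03-vmm", "fermilnx-v07/tomcat01"),
    ("XC", "fermilnx09-vmm", "fermilnx-v16/tomcat06"),
    ("XC", "fermilnx07-vmm", "fermilnx-v05/tomcat08"),
    ("XC", "fermilnx09-vmm", "fermilnx-v17/tomcat09"),
    ("XC", "fermilnx09-vmm", "fermilnx-v18/tomcat10"),
    ("XC", "fermilnx07-vmm", "fermilnx-v10/tomcat11"),
    ("XC", "fermilnx07-vmm", "fermilnx-v11/tomcat12"),
]

# Assumed ordering, from least to most demanding.
RANK = {"NC": 0, "NC5": 1, "XC": 2, "HA": 3}


def host_requirements(rows):
    """Return {server: (strictest category, [VMs])} for the given rows."""
    vms = defaultdict(list)
    strictest = {}
    for category, server, vm in rows:
        vms[server].append(vm)
        current = strictest.get(server)
        if current is None or RANK[category] > RANK[current]:
            strictest[server] = category
    return {server: (strictest[server], vms[server]) for server in vms}


for server, (category, vm_list) in sorted(host_requirements(ROWS).items()):
    print(f"{server}: {category}, {len(vm_list)} VM(s): {', '.join(vm_list)}")
```

A listing like this makes it easier to see which `-vmm` hosts can be powered off together and which must follow the stricter schedule.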
For general information about the High-availability racks, Shirley provided this pointer to the latest list:
"Service Now, Knowledge Base, search for "High Availability", following link for current servers"
And here is the current statement about high-availability functionality:
Current Services in HA Racks
• CATER application
• Confluence application
• Data center management tool
• Drupal web
• Email lists
• Email transport infrastructure
• ERP application
• Exchange email
• EXO application
• Facilities monitoring
• Fermi application
• IT Ticketing system
• Network infrastructure
• Site Security infrastructure
• Unix authentication infrastructure
• Unix AFS infrastructure
• Unix mailboxes
• Unix monitoring
• VPN
• Windows authentication infrastructure
• Windows file servers and SAN
• Windows monitoring
• Windows web
The services for L1:
oracle
mysql
tomcats
isoc servers
xroot
nfs
LSF
Here's what ISOC tasks need:
FASTCopy chain
--------------
wain031
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore system
Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending
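The web servers the ISOC tasks need can be resolved to physical hosts using the server table on this page. This is a rough sketch only: the tomcat-to-VM-to-host mapping is copied from that table, while the names `TOMCATS`, `ISOC_WEB` and `hosts_needed` are hypothetical.

```python
# tomcat -> (VM, physical host), copied from the server table on this page.
TOMCATS = {
    "tomcat01": ("fermilnx-v07", "fermilnx03-vmm"),
    "tomcat06": ("fermilnx-v16", "fermilnx09-vmm"),
    "tomcat08": ("fermilnx-v05", "fermilnx07-vmm"),
    "tomcat09": ("fermilnx-v17", "fermilnx09-vmm"),
    "tomcat10": ("fermilnx-v18", "fermilnx09-vmm"),
    "tomcat11": ("fermilnx-v10", "fermilnx07-vmm"),
    "tomcat12": ("fermilnx-v11", "fermilnx07-vmm"),
}

# Web servers named in the ISOC list.
ISOC_WEB = ["tomcat01", "tomcat06", "tomcat09", "tomcat10", "tomcat11", "tomcat12"]


def hosts_needed(tomcats, mapping):
    """Group the requested tomcats by physical host.

    Returns (needed, missing): needed maps host -> ["VM/tomcat", ...];
    missing lists any tomcat that has no entry in the mapping.
    """
    needed = {}
    missing = []
    for name in tomcats:
        if name not in mapping:
            missing.append(name)
            continue
        vm, host = mapping[name]
        needed.setdefault(host, []).append(f"{vm}/{name}")
    return needed, missing


needed, missing = hosts_needed(ISOC_WEB, TOMCATS)
for host in sorted(needed):
    print(f"{host}: {', '.join(needed[host])}")
if missing:
    print("not found in server table:", ", ".join(missing))
```

Any name reported as missing would point to a gap in the server table rather than in the ISOC list.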
Notes:
9/29/2017 - (TG and WK) added list of xrootd servers needed to bridge the gap in December; started table of server names