Figuring out what we need to get through the outage. None of the pages mapping services to servers are up to date.

Power outage time-line (from Shirley).  

Note that the start time of "10am" (PST) on 26 Dec is solid, but that the "5pm" on 30 Dec is notional and could change.

Date            | Time         | Equipment * | Action
----------------+--------------+-------------+----------------------------------------------
Fri 22 Dec 2017 | EOB?         | NC          | power off
Tue 26 Dec 2017 | before 10am  | NC5         | power off
                |              | XC          | power off (briefly - for power source switch)
                | ~8:30am      | Bldg 50     | power off
                | after 8:30am | XC          | power ON
                |              | HA          | continuously powered ON
Sat 30 Dec 2017 | before 5pm   | XC          | power off (briefly - for power source switch)
                | 5pm          | Bldg 50     | power ON
                | after 5pm    | XC          | power ON
                |              | NC5         | power ON
Mon 8 Jan 2018  | starting 8am | NC          | power ON


* Equipment categories

Category | Machine status
---------+---------------
NC       | non-critical for entire 16-day shutdown period
NC5      | non-critical if down only during the 5 days of the outage and, possibly, the three days preceding the outage
XC       | experiment critical but not in H.A. rack, only a few, short outages acceptable
HA       | high-availability (continuous operation)
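
As a quick check on the "16-day" and "5 days" figures, the durations can be worked out from the timeline. This is only a sketch: it uses the 10am start that the note above calls solid and the notional 5pm end, and it assumes "EOB" means 5pm, which the timeline does not actually state.

```python
from datetime import datetime

# Times from the timeline above. EOB has no clock time there;
# 17:00 is an assumption made only for this arithmetic.
NC_OFF = datetime(2017, 12, 22, 17, 0)    # Fri 22 Dec, EOB (assumed 5pm)
BLDG_OFF = datetime(2017, 12, 26, 10, 0)  # Tue 26 Dec, the "solid" 10am
BLDG_ON = datetime(2017, 12, 30, 17, 0)   # Sat 30 Dec, notional 5pm
NC_ON = datetime(2018, 1, 8, 8, 0)        # Mon 8 Jan, starting 8am

building_outage = BLDG_ON - BLDG_OFF
nc_shutdown = NC_ON - NC_OFF

# Calendar days touched by the building outage (26 through 30 Dec).
outage_calendar_days = (BLDG_ON.date() - BLDG_OFF.date()).days + 1

print("Bldg 50 outage:", building_outage)
print("Calendar days touched by outage:", outage_calendar_days)
print("NC shutdown:", nc_shutdown)
```

If the 5pm end on 30 Dec moves, only BLDG_ON needs to change.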

LISOC Operations Functions

Table of LISOC Tasks and Services

Function/Service                     | Sub-Functions | Needed Servers           | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown?
-------------------------------------+---------------+--------------------------+------------------+---------------------+-------------+-------------------------+---------------------------
Mission Planning, LAT Configurations | FastCopy      | fermilnx01 or fermilnx02 | TCDB             | AFS                 | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w | yes |
Real Time Telemetry Monitoring       |               | fermilnx01 or fermilnx02 |                  |                     | spread | during anomalies |
Logging                              |               | fermilnx01 or fermilnx02 | TCDB             |                     | Fermi LAT Portal: Log Watcher | yes |
Trending                             |               |                          | TCDB             |                     | Fermi LAT Portal: Telemetry Trending | yes |
L0 File Ingest and Archive           | FastCopy      |                          | L0 Archive       |                     |  | yes |
Data Gap Checking and Reporting      | FastCopy      | fermilnx01 or fermilnx02 | L0 Archive       |                     |  | yes, continuously |
L1 processing                        | pipeline      | SLAC Farm                | Data Catalog     |                     | Fermi LAT Portal: Pipeline, Data Processing | yes |
L1 Data Quality Monitoring           |               |                          |                  |                     | Fermi LAT Portal, Telemetry Trending |  |
L1 delivery                          | FastCopy      | fermilnx01 or fermilnx02 | Data Catalog     |                     |  | yes |
L2 processing (ASP) and Delivery     | FastCopy      | fermilnx01 or fermilnx02 | Data Catalog     |                     | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly |
RSP                                  |               |                          |                  |                     |  | ? |


The servers in the following table must remain powered up and operational for Fermi Level 1 to succeed.

  • Confirm current H.A. rack occupants.  See spreadsheet here (thanks Shirley!) https://portal.slac.stanford.edu/info/ITHelp/KB%20Assets/HA-Servers.xlsx
  • Confirm the VM-master for a given VM.  Use the 'node' command, e.g., $ node -whereis fermilnx-v12
  • Confirm the tomcat <-> service associations.  Table here.
  • Confirm the tomcat-VM associations in this table. Use the 'node' command, e.g., $ node -whereis glast-tomcat01
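
The two 'node' checks above can be run over a list of names in one pass. The sketch below is only an illustration: it assumes the 'node' command is on the PATH of the host where it runs, it makes no assumption about the output format (it just prints whatever comes back), and the helper names are made up for this page.

```python
import subprocess

def whereis_command(name):
    """Build the argument list for: node -whereis <name>."""
    return ["node", "-whereis", name]

def whereis(name, run=subprocess.run):
    """Run the command and return its raw stdout, unparsed."""
    result = run(whereis_command(name), capture_output=True, text=True, check=False)
    return result.stdout.strip()

def report(names, run=subprocess.run):
    """Print one line per VM or tomcat name with the raw answer."""
    for name in names:
        print(f"{name}: {whereis(name, run=run)}")
```

On a host that has the command, something like report(["fermilnx-v12", "glast-tomcat01"]) would print one line per name. The 'run' argument exists only so the helper can be exercised without the real command.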

 

Category | server                  | VM/service                   | function
---------+-------------------------+------------------------------+---------
XC       | fermi-gpfs02            |                              | xrootd server
XC       | fermilnx05-vmm          | fermilnx-v02                 | xrootd redirector
XC       | fermilnx07-vmm          | fermilnx-v12                 | xrootd redirector
HA       | wain031 (or equivalent) |                              | NFS storage
HA       | fermilnx01              |                              | LAT config, fastcopy and real-time telemetry
HA       | fermilnx02              |                              | LAT config, fastcopy and real-time telemetry
XC       | fermilnx05-vmm          | fermilnx-v03                 | archiver
HA       | fermi-oracle01          |                              | oracle primary
HA       | fermi-oracle02          |                              | oracle secondary
HA       | mysql05/06              | mysql-node03                 | calibration, etc. DB
HA       | hequ0001-0030           |                              | batch hosts. queues={express,short,long,glastdataq} users={glast,lsstsim,lsstprod,glastmc,glastraw}
XC       | hequ0031-0050           |                              | batch hosts
HA       | fermilnx03-vmm          | fermilnx-v07/tomcat01        | Commons, Group manager
XC       | fermilnx09-vmm          | fermilnx-v16/tomcat06        | rm2
XC       | fermilnx07-vmm          | fermilnx-v05/tomcat08        | dataCatalog
XC       | fermilnx09-vmm          | fermilnx-v17/tomcat09        | Pipeline-II
XC       | fermilnx08-vmm          | fermilnx-v15/pipeline-mail01 | Pipeline-II email server
XC       | fermilnx09-vmm          | fermilnx-v18/tomcat10        | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
XC       | fermilnx07-vmm          | fermilnx-v10/tomcat11        | DataProcessing
XC       | fermilnx07-vmm          | fermilnx-v11/tomcat12        | TelemetryTrending
NC5      | (non-Fermi server)      | astore                       | data archive
HA       | (non-Fermi server)      | trscron                      | tokenized cron
HA       | (non-Fermi server)      | lnxcron                      | cron
XC       | (non-Fermi server)      | (farm manager, etc.)         | LSF management
HA       | yfs01 (non-Fermi)       |                              | AFS
HA       | yfs02 (non-Fermi)       |                              | AFS
HA       | (non-Fermi server)      | JIRA                         | issue tracking (HA as of 10/20/2017)
XC       | fermi-gpfs03/04         | GPFS                         | Fermi user disk
XC       | fermi-cnfs01/02         | GPFS/NFS bridge              | Fermi user disk
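
Several VMs in the table share a VM-master, so trouble with one master would presumably affect several services at once. The sketch below groups the VM rows by master to make that visible. The rows are copied from the table as it stands (and still need the confirmation listed above); the function names are illustrative only.

```python
from collections import defaultdict

# (category, VM master, VM/service) rows copied from the table above;
# only rows that name a "-vmm" host are included.
ROWS = [
    ("XC", "fermilnx05-vmm", "fermilnx-v02"),
    ("XC", "fermilnx07-vmm", "fermilnx-v12"),
    ("XC", "fermilnx05-vmm", "fermilnx-v03"),
    ("HA", "fermilnx03-vmm", "fermilnx-v07/tomcat01"),
    ("XC", "fermilnx09-vmm", "fermilnx-v16/tomcat06"),
    ("XC", "fermilnx07-vmm", "fermilnx-v05/tomcat08"),
    ("XC", "fermilnx09-vmm", "fermilnx-v17/tomcat09"),
    ("XC", "fermilnx08-vmm", "fermilnx-v15/pipeline-mail01"),
    ("XC", "fermilnx09-vmm", "fermilnx-v18/tomcat10"),
    ("XC", "fermilnx07-vmm", "fermilnx-v10/tomcat11"),
    ("XC", "fermilnx07-vmm", "fermilnx-v11/tomcat12"),
]

def vms_by_master(rows):
    """Map each VM master to the list of VMs the table places on it."""
    grouped = defaultdict(list)
    for _category, master, vm in rows:
        grouped[master].append(vm)
    return dict(grouped)

def mixed_category_masters(rows):
    """Masters whose VMs are not all in the same category."""
    categories = defaultdict(set)
    for category, master, _vm in rows:
        categories[master].add(category)
    return sorted(m for m, cats in categories.items() if len(cats) > 1)

for master, vms in sorted(vms_by_master(ROWS).items()):
    print(f"{master} ({len(vms)}): {', '.join(vms)}")
print("masters with mixed categories:", mixed_category_masters(ROWS))
```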

 

High availability racks

For general information about the High-availability racks, Shirley provided this pointer to the latest list:

"Service Now, Knowledge Base,  search for "High Availability" , following link for current servers"

And here is the current statement about high-availability functionality:

Current Services in HA Racks
  • CATER application
  • Confluence application
  • Data center management tool
  • Drupal web
  • Email lists
  • Email transport infrastructure
  • ERP application
  • Exchange email
  • EXO application
  • Facilities monitoring
  • Fermi application
  • IT Ticketing system
  • Network infrastructure
  • Site Security infrastructure
  • Unix authentication infrastructure
  • Unix AFS infrastructure
  • Unix mailboxes
  • Unix monitoring
  • VPN
  • Windows authentication infrastructure
  • Windows file servers and SAN
  • Windows monitoring
  • Windows web

 

 

 

The services for L1:

oracle

  • pipeline
  • data catalog
  • group manager

mysql

  • calibrations

tomcats

  • pipeline
  • data catalog
  • data processing

isoc servers

xroot

  • fermi-gpfs02 (xrootd server)
  • fermilnx-v02 (redirector)
  • fermilnx-v12 (redirector)


nfs

  • Pretty much everything that's currently on wain031

LSF

  • ~25 hosts should let us keep up
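
To compare the "~25 hosts" estimate with the batch host ranges given in the server table (hequ0001-0030 as HA, hequ0031-0050 as XC), the ranges can be expanded and counted. The helper name is hypothetical; the ranges and the estimate are taken from this page.

```python
import re

def expand_range(spec):
    """Expand a spec like 'hequ0001-0030' into individual host names."""
    match = re.fullmatch(r"([a-z]+)(\d+)-(\d+)", spec)
    if match is None:
        raise ValueError(f"unrecognized host range: {spec!r}")
    prefix, first, last = match.groups()
    width = len(first)  # keep the zero padding used in the spec
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]

ha_hosts = expand_range("hequ0001-0030")
xc_hosts = expand_range("hequ0031-0050")
NEEDED = 25  # the "~25 hosts" estimate above

print(len(ha_hosts), "HA batch hosts;", len(xc_hosts), "XC batch hosts")
print("HA hosts alone cover the estimate:", len(ha_hosts) >= NEEDED)
```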

 

Here's what ISOC tasks need:

FASTCopy chain
--------------
wain031
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore system


Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending
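
A quick cross-check of the web server list against the tomcat rows in the server table earlier on this page. The sketch only compares names copied from the two places; it does not say which of the two is right.

```python
# Tomcat -> applications, copied from the server table earlier on this page.
TABLE_TOMCATS = {
    "tomcat01": "Commons, Group manager",
    "tomcat06": "rm2",
    "tomcat08": "dataCatalog",
    "tomcat09": "Pipeline-II",
    "tomcat10": "FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI",
    "tomcat11": "DataProcessing",
    "tomcat12": "TelemetryTrending",
}

# Tomcats named in the "Web servers" list above.
WEB_SERVER_LIST = {
    "tomcat01", "tomcat06", "tomcat09", "tomcat10", "tomcat11", "tomcat12",
}

only_in_table = sorted(set(TABLE_TOMCATS) - WEB_SERVER_LIST)
only_in_list = sorted(WEB_SERVER_LIST - set(TABLE_TOMCATS))

for name in only_in_table:
    print(f"in table but not in list: {name} ({TABLE_TOMCATS[name]})")
for name in only_in_list:
    print(f"in list but not in table: {name}")
```

With the names as copied, tomcat08 (dataCatalog) is in the table but not in the web server list.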

 


Notes:

9/29/2017 - (TG and WK) added list of xrootd servers needed to bridge the gap in December; started table of server names.


2 Comments

  1. Remember wain031 will be replaced by then.

  2. I like the title of this page:

    Dec 2017 Power Outrage (Fermi)

    "outrage" indeed!