Working out which services and servers we need to get through the December 2017 power outage. None of the existing pages that map services to servers are up to date.

Version 1.2 (5 Oct 2017)

Power outage timeline (from Shirley). Note that the 10am (PST) start time is firm, but the 5pm end time is notional and could change.

| Date            | Time         | Equipment                     | Action                                          |
|-----------------|--------------|-------------------------------|-------------------------------------------------|
| Fri 22 Dec 2017 | EOB?         | non-critical, general use     | power off                                       |
| Tue 26 Dec 2017 | before 10am  | non-critical, special request | power off                                       |
|                 |              | exp-critical                  | power off to switch to backup power source      |
|                 | 10am         | Bldg 50                       | power off                                       |
|                 | after 10am   | exp-critical                  | power ON with backup power                      |
|                 |              | High-avail                    | continuously powered ON                         |
| Sat 30 Dec 2017 | before 5pm   | exp-critical                  | power off to switch back to normal power source |
|                 | 5pm          | Bldg 50                       | power ON                                        |
|                 | after 5pm    | exp-critical                  | power ON                                        |
|                 |              | non-critical, special request | power ON                                        |
| Mon 8 Jan 2018  | starting 8am | non-critical, general use     | power ON                                        |
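As a quick sanity check on the durations implied by the timeline, here is a minimal Python sketch. It assumes "EOB" on 22 Dec means 5pm (the page itself marks that time with a question mark) and uses the notional 5pm end time, so the numbers shift if either changes.

```python
from datetime import datetime

# Times are taken from the timeline on this page (all PST, so naive datetimes suffice).
# ASSUMPTION: "EOB" on 22 Dec is treated as 5pm; the page marks it as "EOB?".
GENERAL_OFF = datetime(2017, 12, 22, 17, 0)
BLDG50_OFF = datetime(2017, 12, 26, 10, 0)
BLDG50_ON = datetime(2017, 12, 30, 17, 0)  # notional end time, could change
GENERAL_ON = datetime(2018, 1, 8, 8, 0)


def days_hours(start, end):
    """Return the whole days and leftover whole hours between two timestamps."""
    delta = end - start
    return delta.days, delta.seconds // 3600


OUTAGE = days_hours(BLDG50_OFF, BLDG50_ON)
SHUTDOWN = days_hours(GENERAL_OFF, GENERAL_ON)

print("Bldg 50 power outage: %d days %d hours" % OUTAGE)
print("General-use shutdown: %d days %d hours" % SHUTDOWN)
```

Under those assumptions the Bldg 50 outage lasts 4 days 7 hours (spanning five calendar days) and general-use machines are down for 16 days 15 hours.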

 

The servers in the following table must remain powered up and operational for Fermi Level 1 (L1) to succeed.

  • Confirm the current H.A. rack occupants. See the spreadsheet here (thanks Shirley!): https://portal.slac.stanford.edu/info/ITHelp/KB%20Assets/HA-Servers.xlsx
  • Confirm the VM-master for a given VM. Use the 'node' command, e.g., $ node -whereis fermilnx-v12
  • Confirm the tomcat-service associations. Table here.
  • Confirm the tomcat-VM associations in this table. Use the 'node' command, e.g., $ node -whereis glast-tomcat01
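The two 'node -whereis' checks above could be batched with a small wrapper. This is only a sketch: the names NAMES, whereis_command and lookup are made up for illustration, and because the output format of 'node' is not described on this page, the raw text is returned unparsed for a person to compare against the table.

```python
import subprocess

# Names taken from the two examples in the checklist; extend as needed.
NAMES = ["fermilnx-v12", "glast-tomcat01"]


def whereis_command(name):
    """Build the argument list for the 'node -whereis <name>' lookup."""
    return ["node", "-whereis", name]


def lookup(name, run=subprocess.run):
    """Run the lookup and return its raw standard output, stripped.

    ASSUMPTION: nothing is known here about the output format of 'node',
    so no parsing is attempted. 'run' is injectable to allow dry runs.
    """
    result = run(whereis_command(name), capture_output=True, text=True)
    return result.stdout.strip()


# Dry run: show the commands that would be issued, without executing them.
for name in NAMES:
    print("$ " + " ".join(whereis_command(name)))
```

On a host where the 'node' command is available, calling lookup(name) for each entry in NAMES gives the text to check against the server and VM columns.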

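| Category | Machine status                                                                                                |
|----------|---------------------------------------------------------------------------------------------------------------|
| NC       | non-critical for entire 16-day shutdown period                                                                |
| NC5      | non-critical if down only during the 5 days of the outage and, possibly, the three days preceding the outage |
| XC       | experiment critical but not in H.A. rack, only a few, short outages acceptable                                |
| HA       | high-availability (continuous operation)                                                                      |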
| Category | server                  | VM/service            | function                                                                         |
|----------|-------------------------|-----------------------|----------------------------------------------------------------------------------|
| XC       | fermi-gpfs02            |                       | xrootd server                                                                    |
| XC       | fermilnx05-vmm          | fermilnx-v02          | xrootd redirector                                                                |
| XC       | fermilnx07-vmm          | fermilnx-v12          | xrootd redirector                                                                |
| HA       | wain031 (or equivalent) |                       | NFS storage                                                                      |
| HA       | fermilnx01              |                       | LAT config, fastcopy and real-time telemetry                                     |
| HA       | fermilnx02              |                       | LAT config, fastcopy and real-time telemetry                                     |
| XC       | fermilnx05-vmm          | fermilnx-v03          | archiver                                                                         |
| HA       | fermi-oracle01          |                       | oracle primary                                                                   |
| HA       | fermi-oracle02          |                       | oracle secondary                                                                 |
| HA       | mysql05/06              | mysql-node03          | calibration, etc. DB                                                             |
| XC       | hequNNN - hequNNN+24    |                       | batch hosts                                                                      |
| HA       | fermilnx03-vmm          | fermilnx-v07/tomcat01 | Commons, Group manager                                                           |
| XC       | fermilnx09-vmm          | fermilnx-v16/tomcat06 | rm2                                                                              |
| XC       | fermilnx07-vmm          | fermilnx-v05/tomcat08 | dataCatalog                                                                      |
| XC       | fermilnx09-vmm          | fermilnx-v17/tomcat09 | Pipeline-II                                                                      |
| XC       | fermilnx09-vmm          | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI         |
| XC       | fermilnx07-vmm          | fermilnx-v10/tomcat11 | DataProcessing                                                                   |
| XC       | fermilnx07-vmm          | fermilnx-v11/tomcat12 | TelemetryTrending                                                                |
| XC       | (non-Fermi server)      | astore                | data archive                                                                     |
| XC       | (non-Fermi server)      | trscron               | tokenized cron                                                                   |
| XC       | (non-Fermi server)      | lnxcron               | cron                                                                             |
| XC       | (non-Fermi server)      | (farm manager, etc.)  | LSF                                                                              |
| HA       | yfs01 (non-Fermi)       |                       | AFS                                                                              |
| HA       | yfs02 (non-Fermi)       |                       | AFS                                                                              |
| ??       | ??                      | JIRA                  | issue tracking                                                                   |
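Because several VMs share a VM-master, it may help to see how many listed services depend on each physical host. The following Python sketch regroups the VM rows of the server table by VM-master; the rows are copied from this page, and the names VM_ROWS, vms_by_host and BY_HOST are illustrative only.

```python
from collections import defaultdict

# (category, VM-master, VM) rows copied from the server table on this page.
VM_ROWS = [
    ("XC", "fermilnx05-vmm", "fermilnx-v02"),
    ("XC", "fermilnx07-vmm", "fermilnx-v12"),
    ("XC", "fermilnx05-vmm", "fermilnx-v03"),
    ("HA", "fermilnx03-vmm", "fermilnx-v07"),
    ("XC", "fermilnx09-vmm", "fermilnx-v16"),
    ("XC", "fermilnx07-vmm", "fermilnx-v05"),
    ("XC", "fermilnx09-vmm", "fermilnx-v17"),
    ("XC", "fermilnx09-vmm", "fermilnx-v18"),
    ("XC", "fermilnx07-vmm", "fermilnx-v10"),
    ("XC", "fermilnx07-vmm", "fermilnx-v11"),
]


def vms_by_host(rows):
    """Group VM names by the VM-master they are listed under."""
    hosts = defaultdict(list)
    for _category, host, vm in rows:
        hosts[host].append(vm)
    return dict(hosts)


BY_HOST = vms_by_host(VM_ROWS)
for host in sorted(BY_HOST):
    vms = sorted(BY_HOST[host])
    print("%s: %d VM(s): %s" % (host, len(vms), ", ".join(vms)))
```

As listed, fermilnx07-vmm carries four of the required VMs, fermilnx09-vmm three, fermilnx05-vmm two and fermilnx03-vmm one, so these groupings are only as good as the unconfirmed table they come from.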

 

High availability racks

For general information about the High-availability racks, Shirley provided this pointer to the latest list:

"Service Now, Knowledge Base, search for "High Availability", following link for current servers"

And here is the current statement about high-availability functionality:

Current Services in HA Racks
  • CATER application
  • Confluence application
  • Data center management tool
  • Drupal web
  • Email lists
  • Email transport infrastructure
  • ERP application
  • Exchange email
  • EXO application
  • Facilities monitoring
  • Fermi application
  • IT Ticketing system
  • Network infrastructure
  • Site Security infrastructure
  • Unix authentication infrastructure
  • Unix AFS infrastructure
  • Unix mailboxes
  • Unix monitoring
  • VPN
  • Windows authentication infrastructure
  • Windows file servers and SAN
  • Windows monitoring
  • Windows web

 

 

 

The services for L1:

oracle

  • pipeline
  • data catalog
  • group manager

mysql

  • calibrations

tomcats

  • pipeline
  • data catalog
  • data processing

isoc servers
xroot

  • fermi-gpfs02 (xrootd server)
  • fermilnx-v02 (redirector)
  • fermilnx-v12 (redirector)


nfs

  • Pretty much everything that's currently on wain031

LSF

  • ~25 hosts should let us keep up

 

Here's what ISOC tasks need:

FASTCopy chain
--------------
wain031
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore system


Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView,
         TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending
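The web server list above can be cross-checked against the tomcat entries in the server table on this page. A minimal Python sketch, with both lists transcribed by hand (the variable names are illustrative only):

```python
# tomcat -> applications, copied from the server table on this page.
SERVER_TABLE = {
    "tomcat01": "Commons, Group manager",
    "tomcat06": "rm2",
    "tomcat08": "dataCatalog",
    "tomcat09": "Pipeline-II",
    "tomcat10": "FCWebView, ISOCLogging, MPWebView, "
                "TelemetryMonitor, TelemetryTableWebUI",
    "tomcat11": "DataProcessing",
    "tomcat12": "TelemetryTrending",
}

# tomcats named in the ISOC "Web servers" list.
ISOC_WEB = {"tomcat01", "tomcat06", "tomcat09", "tomcat10", "tomcat11", "tomcat12"}

# Set differences show which names appear in only one of the two lists.
MISSING_FROM_TABLE = sorted(ISOC_WEB - set(SERVER_TABLE))
TABLE_ONLY = sorted(set(SERVER_TABLE) - ISOC_WEB)

print("In ISOC list but not in server table:", MISSING_FROM_TABLE)
print("In server table but not in ISOC list:", TABLE_ONLY)
```

Every tomcat in the ISOC list is present in the server table. The only table-only entry is tomcat08 (dataCatalog), which matches the data catalog being listed under the L1 services rather than the ISOC tasks.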

 


Notes:

9/29/2017 - (TG and WK) added list of xrootd servers needed to bridge the gap in December; started table of server names.
