
Power outage time-line.  

[Tentative proposal] This power outage will affect substations #7 (next to bldg 50) and #8 (located on the 4th floor of bldg 50). All of bldg 50 will be without normal power. The facilities (F&O) group plans to do its maintenance during the 4-day period starting 26 Dec 2019. However, the outage will start earlier because of limited staffing during the holiday shutdown. At a minimum, all H.A. (High Availability) and experiment-critical equipment is expected to remain powered throughout the 16+ days of the holiday shutdown. This page captures what Fermi needs in order to keep a minimal data processing effort running during the outage.

Date | Time | Equipment * | Action
Fri 20 Dec 2019 | TBA | | Switch to generator power (this could happen earlier). This will require a several-hour outage.
Mon 6 Jan 2020 | | | Return to normal power. This will require a several-hour outage.
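
As a quick sanity check on the time-line above, the following minimal sketch computes the span between the two switchover dates and confirms their weekdays. The dates are taken from the table; the variable names are illustrative only, and the start date may move earlier as noted.

```python
from datetime import date

# Dates copied from the time-line table (the start could happen earlier).
switch_to_generator = date(2019, 12, 20)
return_to_normal = date(2020, 1, 6)

# Number of days between the two switchovers.
span_days = (return_to_normal - switch_to_generator).days

# date.weekday(): Monday is 0, Friday is 4.
starts_on_friday = switch_to_generator.weekday() == 4
ends_on_monday = return_to_normal.weekday() == 0

print(span_days, starts_on_friday, ends_on_monday)
```

The result is consistent with the "16+ days" quoted for the holiday shutdown.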


* Equipment categories

Category | Machine status
NC | non-critical for entire 16-day shutdown period
XC | experiment critical but not in H.A. rack; only a few short outages acceptable
HA | high-availability (continuous operation)


BELOW THIS LINE HAS NOT YET BEEN FULLY UPDATED FOR DEC 2019 OUTAGE


LISOC Operations Functions

Table of LISOC Tasks and Services

Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown?
Mission Planning, LAT Configurations | FastCopy | fermilnx01 or fermilnx02 | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring. Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes |
Real Time Telemetry Monitoring | | fermilnx01 or fermilnx02 | | | spread. Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies |
Logging | | fermilnx01 or fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes |
Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes |
L0 File Ingest and Archive | FastCopy | | L0 Archive | | | yes |
Data Gap Checking and Reporting | FastCopy | fermilnx01 or fermilnx02 | L0 Archive | | | yes, continuously |
L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes |
L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | |
L1 delivery | FastCopy | fermilnx01 or fermilnx02 | Data Catalog | | | yes |
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 or fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly |
RSP | | | | | | ? |


The servers listed in the table below must remain powered up and operational for Fermi Level 1 to succeed.

  • Confirm current H.A. rack occupants.  See spreadsheet here (thanks Shirley!) https://portal.slac.stanford.edu/info/ITHelp/KB%20Assets/HA-Servers.xlsx
  • Confirm the VM-master for a given VM.  Use the 'node' command, e.g., $ node -whereis fermilnx-v12
  • Confirm the tomcat <-> service associations.  Table here.
  • Confirm the tomcat-VM associations in this table. Use the 'node' command, e.g., $ node -whereis glast-tomcat01
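
The confirmation steps above can be prepared as a checklist of commands. The sketch below only builds and prints the 'node -whereis' command lines; it does not run them, since 'node' here is the site-specific tool mentioned above and its output format is not described on this page. The name lists are a subset copied from this page, and the helper name is hypothetical.

```python
import shlex

# Names copied from this page (subset); extend as the table is confirmed.
vm_names = ["fermilnx-v02", "fermilnx-v12", "fermilnx-v03"]
tomcat_names = ["glast-tomcat01"]  # name as used in the example above

def whereis_command(name):
    """Return the 'node -whereis' command line for one VM or service name."""
    # shlex.quote guards against names with shell-special characters.
    return "node -whereis " + shlex.quote(name)

commands = [whereis_command(n) for n in vm_names + tomcat_names]
for command in commands:
    print("$ " + command)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be pasted into a shell on a host where the site tool is available.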

 

Category | server | VM/service | function
XC | fermi-gpfs02 | | xrootd server
XC | fermilnx05-vmm | fermilnx-v02 | xrootd redirector
XC | fermilnx07-vmm | fermilnx-v12 | xrootd redirector
HA | wain031 (or equivalent) | | NFS storage
HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry
HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry
XC | fermilnx05-vmm | fermilnx-v03 | archiver
HA | fermi-oracle01 | | oracle primary
HA | fermi-oracle02 | | oracle secondary
HA | mysql05/06 | mysql-node03 | calibration, etc. DB
HA | hequ0001-0030 | | batch hosts. queues={express,short,long,glastdataq} users={glast,lsstsim,lsstprod,glastmc,glastraw}
XC | hequ0031-0050 | | batch hosts
HA | fermilnx03-vmm | fermilnx-v07/tomcat01 | Commons, Group manager
XC | fermilnx09-vmm | fermilnx-v16/tomcat06 | rm2
XC | fermilnx07-vmm | fermilnx-v05/tomcat08 | dataCatalog
XC | fermilnx09-vmm | fermilnx-v17/tomcat09 | Pipeline-II
XC | fermilnx08-vmm | fermilnx-v15/pipeline-mail01 | Pipeline-II email server
XC | fermilnx09-vmm | fermilnx-v18/tomcat10 | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI
XC | fermilnx07-vmm | fermilnx-v10/tomcat11 | DataProcessing
XC | fermilnx07-vmm | fermilnx-v11/tomcat12 | TelemetryTrending
NC5 | (non-Fermi server) | astore | data archive
HA | (non-Fermi server) | trscron | tokenized cron
HA | (non-Fermi server) | lnxcron | cron
XC | (non-Fermi server) | (farm manager, etc.) | LSF management
HA | yfs01 (non-Fermi) | | AFS
HA | yfs02 (non-Fermi) | | AFS
HA | (non-Fermi server) | JIRA | issue tracking (HA as of 10/20/2017)
XC | fermi-gpfs03/04 | GPFS | Fermi user disk
XC | fermi-cnfs01/02 | GPFS/NFS bridge | Fermi user disk
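
Because several VMs share one VM-master, losing a single physical host takes down more than one service. The sketch below groups the VM-hosted rows of the table above by their VM-master to make that visible. The rows are copied from the table as it stands on this page and still need to be confirmed against the current associations; the variable names are illustrative.

```python
from collections import defaultdict

# (category, VM-master, VM/service, function) rows copied from the table above.
# Only rows with a "-vmm" host are included.
rows = [
    ("XC", "fermilnx05-vmm", "fermilnx-v02", "xrootd redirector"),
    ("XC", "fermilnx07-vmm", "fermilnx-v12", "xrootd redirector"),
    ("XC", "fermilnx05-vmm", "fermilnx-v03", "archiver"),
    ("HA", "fermilnx03-vmm", "fermilnx-v07/tomcat01", "Commons, Group manager"),
    ("XC", "fermilnx09-vmm", "fermilnx-v16/tomcat06", "rm2"),
    ("XC", "fermilnx07-vmm", "fermilnx-v05/tomcat08", "dataCatalog"),
    ("XC", "fermilnx09-vmm", "fermilnx-v17/tomcat09", "Pipeline-II"),
    ("XC", "fermilnx08-vmm", "fermilnx-v15/pipeline-mail01", "Pipeline-II email server"),
    ("XC", "fermilnx09-vmm", "fermilnx-v18/tomcat10", "FCWebView and others"),
    ("XC", "fermilnx07-vmm", "fermilnx-v10/tomcat11", "DataProcessing"),
    ("XC", "fermilnx07-vmm", "fermilnx-v11/tomcat12", "TelemetryTrending"),
]

# Map each VM-master to the VMs/services it hosts.
services_by_master = defaultdict(list)
for category, master, vm, function in rows:
    services_by_master[master].append((vm, function, category))

# Print masters in name order with the number of services each one carries.
for master in sorted(services_by_master):
    hosted = services_by_master[master]
    print(master, len(hosted), [vm for vm, _, _ in hosted])
```

In this listing, fermilnx07-vmm carries the most services, so a short outage there affects the data catalog, data processing, telemetry trending and one xrootd redirector at the same time.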

 

High availability racks

For general information about the High-availability racks, Shirley provided this pointer to the latest list:

"Service Now, Knowledge Base, search for "High Availability", following link for current servers"

And here is the current statement about high-availability functionality:

Current Services in HA Racks
•CATER application
•Confluence application
•Data center management tool
•Drupal web
•Email lists
•Email transport infrastructure
•ERP application
•Exchange email
•EXO application
•Facilities monitoring
•Fermi application
•IT Ticketing system
•Network infrastructure
•Site Security infrastructure
•Unix authentication infrastructure
•Unix AFS infrastructure
•Unix mailboxes
•Unix monitoring
•VPN 
•Windows authentication infrastructure
•Windows file servers and SAN
•Windows monitoring
•Windows web

 

 

 

The services for L1:

oracle

  • pipeline
  • data catalog
  • group manager

mysql

  • calibrations

tomcats

  • pipeline
  • data catalog
  • data processing

isoc servers
xroot

     The following servers are needed to allow processing of new data (older data on fermi-xrd or HPSS will not be available):

  • fermi-gpfs01/02, fermi-gpfs05/06 and fermi-gpfs07/08
    These are the servers that make up the fermi xrootd gpfs space.  Each pair serves a part of the total gpfs space.
    • fermi-gpfs05 runs the xrootd server for the gpfs space
  • fermilnx-v02 (redirector)
  • fermilnx-v12 (redirector)


nfs

  • Pretty much everything that's currently on wain031

LSF

  • ~25 hosts should let us keep up
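
The batch-host estimate can be compared against the host ranges in the server table. The sketch below expands the "hequ0001-0030" (HA) and "hequ0031-0050" (XC) ranges and checks the HA count against the ~25-host estimate. The four-digit zero-padded naming is inferred from the range notation on this page, and the helper name is hypothetical.

```python
def expand_range(prefix, first, last, width=4):
    """Expand a range such as hequ0001-0030 into individual host names."""
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]

ha_hosts = expand_range("hequ", 1, 30)    # HA batch hosts from the table
xc_hosts = expand_range("hequ", 31, 50)   # XC batch hosts from the table

needed = 25  # the "~25 hosts" estimate above
ha_alone_suffices = len(ha_hosts) >= needed

print(len(ha_hosts), len(xc_hosts), ha_alone_suffices)
```

Under these assumptions the HA batch hosts alone cover the estimate, with the XC hosts as additional capacity when they are available.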

 

Here's what ISOC tasks need:

FASTCopy chain
--------------
wain031
fermilnx01
fermilnx02
trscron
fermilnx-v03 (Archiver)
Whatever the pipeline server runs on.
xroot servers
astore system


Web servers
-----------
tomcat01 Commons
tomcat06 rm2
tomcat09 Pipeline-II
tomcat10 FCWebView, ISOCLogging, MPWebView
TelemetryMonitor, TelemetryTableWebUI
tomcat11 DataProcessing
tomcat12 TelemetryTrending

 


Notes:

9/29/2017 - (TG and WK) added list of xrootd servers needed to bridge the gap in December; Started table of server names
