History

This page is a partial rewrite of the following pages, intended to make it easier to understand which hosts are running and where they run.

December 2019 Central Computing Outage (Fermi)

Server and applications migration from glastlnx to fermilnx boxes

Notes on VMware Cluster

The VMware cluster, referred to as fermi-vmclust, consists of four hypervisors. Two of the machines are on HA, and two are not. A VM may migrate between these two hypervisors on demand: the machines have 128 GB of memory, and two hypervisors are enough to run all the VMs without oversubscribing memory.

The VMware cluster is managed by the OCIO Platforms team, who can migrate VMs on demand.

fermi-vmclust01 and fermi-vmclust02 are not HA.

fermi-vmclust03 and fermi-vmclust04 are HA.

Note: OCIO should be contacted about setting up occasional VM snapshots, so that a VM may always be brought up even if a hypervisor dies.
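
The memory claim above (two hypervisors are enough to run all the VMs without oversubscription) comes down to simple arithmetic. The following is a minimal sketch of that check, assuming the 128 GB figure is per hypervisor. The per-VM memory sizes are hypothetical placeholders, not actual allocations, and the function name is illustrative only.

```python
HYPERVISOR_MEMORY_GB = 128  # figure stated on this page, assumed to be per hypervisor


def fits_without_oversubscription(vm_memory_gb, surviving_hypervisors):
    """Return True if the total VM memory fits in the surviving hypervisors.

    Compares totals only; hypervisor overhead and per-host placement
    are ignored in this sketch.
    """
    total_needed = sum(vm_memory_gb.values())
    total_available = surviving_hypervisors * HYPERVISOR_MEMORY_GB
    return total_needed <= total_available


# Hypothetical allocations in GB; real values would come from the OCIO Platforms team.
example_vms = {"fermilnx-v02": 8, "fermilnx-v03": 16, "fermilnx-v05": 32}

print(fits_without_oversubscription(example_vms, surviving_hypervisors=2))
```

Rerunning the check with the real allocations and `surviving_hypervisors=2` would confirm whether the statement still holds as VMs are added.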

Hosts and Services

| Host/Service | Category† | Physical Server(s) | OS | Function |
| --- | --- | --- | --- | --- |
| xrootd | XC | fermi-gpfs01, fermi-gpfs02, fermi-gpfs05, fermi-gpfs06, fermi-gpfs07, fermi-gpfs08 | | xrootd server and storage |
| GPFS | XC | fermi-gpfs03, fermi-gpfs04 | | Fermi NFS/GPFS storage |
| GPFS/NFS bridge | XC | fermi-cnfs01, fermi-cnfs02 | | Fermi NFS storage access |
| NFS (ISOC) | HA | staas-gpfs50, staas-gpfs51 | | Critical ISOC NFS storage |
| Oracle | HA | fermi-oracle03 | | Oracle (primary) |
| Oracle | XC | fermi-oracle04 | | Oracle (failover) |
| mysql-node03 | HA | mysql05, mysql06 | | calibration, etc. DB |
| fermilnx01 | HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry |
| fermilnx02 | HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry |
| fermilnx-v02 | XC/HA | fermi-vmclust | | xrootd redirector |
| fermilnx-v03 | XC/HA | fermi-vmclust | | archiver |
| fermilnx-v04 | | | | DataCatalog Crawler (Prod) |
| fermilnx-v05/tomcat08 | XC/HA | fermi-vmclust | | DataCatalog Web |
| fermilnx-v06 | | fermi-vmclust | | Xroot proxy server |
| fermilnx-v07/tomcat01 | XC/HA | fermi-vmclust | | Commons, Group manager |
| fermilnx-v08/tomcat02, glast-jobcontrol01 | | fermi-vmclust | | LSF Job Control Daemons (notably glast, glastraw). Note: No fermilnx-v09 |
| fermilnx-v10/tomcat11 | XC/HA | fermi-vmclust | | DataProcessing |
| fermilnx-v11/tomcat12 | XC/HA | fermi-vmclust | | TelemetryTrending |
| fermilnx-v12 | XC/HA | fermi-vmclust | | xrootd redirector |
| fermilnx-v13/tomcat05 | | | | Pipeline-II (Prod) |
| fermilnx-v14/centaurusa | | fermi-vmclust | | This machine requires user login. It is used as a Fermi CVS server and as a Subversion server for a variety of user groups. The svn functionality should move elsewhere. |
| fermilnx-v15/pipeline-mail01 | XC/HA | fermi-vmclust | | Pipeline-II email server (james) |
| fermilnx-v16/tomcat06 | XC/HA | fermi-vmclust | | rm2 |
| fermilnx-v17/tomcat09 | XC/HA | fermi-vmclust | | Pipeline-II (Web) |
| fermilnx-v18/tomcat10 | XC/HA | fermi-vmclust | | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI |
| fermilnx-v19/tomcat04 | | fermi-vmclust | | elog |
| fermilnx7-v01 | | fermi-vmclust | centos7 | Docker installed - must be added to proper group. |
| fermilnx7-v02 | | fermi-vmclust | centos7 | Docker installed - must be added to proper group. |
| fermilnx7-v03 | | fermi-vmclust | centos7 | Docker installed - must be added to proper group. |
| fermi-ci-test01 | | fermi-vmclust | centos7 | Docker is installed. Runs a Jenkins worker. Jenkins can dispatch GlastRelease jobs to this node. Note: This host should likely be renamed and/or removed. If so, the Jenkins worker should run on a fermilnx7 host. |
| sca-resty01 | | fermi-vmclust | centos7 | Shared nginx server - sca-resty01. See also: Server Locations and Functions; SCA NGINX Configuration; https://github.com/slaclab/sca-resty (requires authorization) |
| sca-nginx01/02 | | fermilnx12/13-vmm | | Shared nginx server. This passes ALL traffic through to sca-resty01. See also: SCA NGINX Configuration |
| astore-new (HPSS) | NC | (non-Fermi server) | | FastCopy data archive. **We have been granted a temporary quota increase of 1 TB on /nfs/farm/g/glast/u23, which has allowed this item to become "NC"** |
| trscron | HA | (non-Fermi server) | | tokenized cron |
| lnxcron | HA | (non-Fermi server) | | cron |
| (farm manager, etc.) | XC | (non-Fermi server) | | LSF management |
| AFS | HA | yfs01/NN (non-Fermi) | | basically all of AFS |
| JIRA | HA | (non-Fermi server) | | issue tracking (HA as of 10/20/2017) |



† Equipment categories

| Category | Machine status |
| --- | --- |
| NC | non-critical for entire 16-day shutdown period |
| XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
| HA | high-availability (continuous operation) |
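
The category codes above can be treated as a small lookup, which makes it straightforward to list the hosts that fall under a given category, including entries marked with two codes such as XC/HA. The following is a minimal sketch. The `Category` enum, `parse_categories`, and `hosts_in` names are illustrative assumptions, and `HOSTS` holds only a handful of entries copied from the hosts table rather than the full inventory.

```python
from enum import Enum


class Category(Enum):
    NC = "non-critical for entire 16-day shutdown period"
    XC = "experiment critical but not in H.A. rack, only a few, short outages acceptable"
    HA = "high-availability (continuous operation)"


def parse_categories(cell):
    """Split a table cell such as 'XC/HA' into a list of Category members."""
    return [Category[code] for code in cell.split("/") if code]


# Illustrative subset of the hosts table: host/service -> category cell.
HOSTS = {
    "fermilnx01": "HA",
    "fermilnx-v02": "XC/HA",
    "GPFS": "XC",
    "astore-new (HPSS)": "NC",
}


def hosts_in(category):
    """Return the hosts whose category cell includes the given category."""
    return sorted(h for h, cell in HOSTS.items() if category in parse_categories(cell))


print(hosts_in(Category.HA))
```

An empty category cell parses to an empty list, so hosts without an assigned category simply never match.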

Table of LISOC Tasks and Services

| Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mission Planning, LAT Configurations | FastCopy | fermilnx01 and fermilnx02 | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring; Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes | |
| Real Time Telemetry Monitoring | | fermilnx01 and fermilnx02 | | | spread; Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies | |
| Logging | | fermilnx01 and fermilnx02 | TCDB | | Fermi LAT Portal: Log Watcher | yes | |
| Trending | | | TCDB | | Fermi LAT Portal: Telemetry Trending | yes | |
| L0 File Ingest and Archive | FastCopy | | L0 Archive | | | yes | |
| Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | L0 Archive | | | yes, continuously | |
| L1 processing | pipeline | SLAC Farm | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | yes | |
| L1 Data Quality Monitoring | | | | | Fermi LAT Portal, Telemetry Trending | | |
| L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | | yes | |
| L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly | |
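
A table like the one above is most often read in reverse: given a server or database that is going down, which tasks are affected? The following is a minimal sketch of that lookup. `SERVICE_NEEDS` and `services_affected_by` are hypothetical names, and the mapping contains only three rows taken from the table, with servers and databases merged into one set per service for simplicity.

```python
# Illustrative subset of the task table: service -> resources it needs
# (servers and databases merged into one set per service).
SERVICE_NEEDS = {
    "Logging": {"fermilnx01", "fermilnx02", "TCDB"},
    "Trending": {"TCDB"},
    "L1 processing": {"SLAC Farm", "Data Catalog"},
}


def services_affected_by(resource):
    """Return the services that list the given resource as a need."""
    return sorted(s for s, needs in SERVICE_NEEDS.items() if resource in needs)


print(services_affected_by("TCDB"))
```

Extending the mapping to every row would give a quick way to assess the impact of an outage on any single resource.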