Notes on VMWare Cluster

The VMWare cluster, referred to as fermi-vmclust, consists of four hypervisors. Two of the machines are on HA, and two are not. A VM may migrate between these two hypervisors on demand; the machines have 128 GB of memory, and two hypervisors are enough to run all the VMs without oversubscribing memory.
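
The capacity claim above amounts to a simple sum, which can be sketched as follows. This is an illustration only: the per-VM memory sizes are hypothetical placeholders (this page does not list them), the 128 GB figure is assumed to apply to each hypervisor, and the check ignores hypervisor overhead and per-host packing.

```python
# Sketch: does the total VM memory fit on a given number of hypervisors
# without oversubscription? Aggregate check only; per-host packing and
# hypervisor overhead are ignored.

HYPERVISOR_MEMORY_GB = 128  # assumption: 128 GB per hypervisor


def fits_without_oversubscription(vm_memory_gb, hypervisors,
                                  per_host_gb=HYPERVISOR_MEMORY_GB):
    """Return True if the summed VM memory fits in the combined host memory."""
    return sum(vm_memory_gb) <= hypervisors * per_host_gb


# Hypothetical VM sizes in GB (not taken from the real cluster).
example_vms = [16, 16, 8, 8, 8, 8, 4, 4, 32, 32]

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "hypervisor(s):", fits_without_oversubscription(example_vms, n))
```

With real per-VM allocations substituted for `example_vms`, the same check shows how many hypervisors can be lost before memory would be oversubscribed.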

The VMWare cluster is managed by the OCIO Platforms team, who can migrate VMs on demand.

Note: OCIO should be contacted about setting up occasional VM snapshots, so that a VM may always be brought up even if a hypervisor dies.

Table of LISOC Tasks and Services

Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown?
Mission Planning, LAT Configurations | FastCopy

fermilnx01 and fermilnx02

TCDB

AFS

Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring

Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies)

yes
Real Time Telemetry Monitoring
fermilnx01 and fermilnx02

spread

Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor

during anomalies
Logging
fermilnx01 and fermilnx02 | TCDB
Fermi LAT Portal: Log Watcher | yes
Trending

TCDB
Fermi LAT Portal: Telemetry Trending | yes
L0 File Ingest and Archive | FastCopy
L0 Archive

yes
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | L0 Archive

yes, continuously
L1 processing | pipeline | SLAC Farm | Data Catalog
Fermi LAT Portal: Pipeline, Data Processing | yes
L1 Data Quality Monitoring



Fermi LAT Portal, Telemetry Trending

L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog

yes
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog
Fermi LAT Portal: Pipeline, Data Processing | daily, weekly



| VM or Service | Category† | Server | Function |
|---|---|---|---|
| xrootd | XC | fermi-gpfs01, fermi-gpfs02, fermi-gpfs05, fermi-gpfs06, fermi-gpfs07, fermi-gpfs08 | xrootd server and storage |
| fermilnx-v12 | XC/HA | fermi-vmclust | xrootd redirector |
| GPFS | XC | fermi-gpfs03, fermi-gpfs04 | Fermi NFS/GPFS storage |
| GPFS/NFS bridge | XC | fermi-cnfs01, fermi-cnfs02 | Fermi NFS storage access |
| | HA | staas-gpfs50, staas-gpfs51 | Critical ISOC NFS storage |
| | HA | fermilnx01 | LAT config, fastcopy and real-time telemetry |
| | HA | fermilnx02 | LAT config, fastcopy and real-time telemetry |
| fermilnx-v03 | XC/HA | fermi-vmclust | archiver |
| | HA | fermi-oracle03 | oracle primary |
| | XC | fermi-oracle04 | oracle secondary |
| mysql-node03 | HA | mysql05, mysql06 | calibration, etc. DB |
| | XC | 400 cores (50 "hequ" equivalents) | batch hosts for LISOC; queues={express,short,medium,long,glastdataq}; users={glast,lsstsim,lsstprod,glastmc,glastraw} |
| | XC | 200 cores (25 "hequ" equivalents) | batch hosts for Science Pipelines |
| fermilnx-v02 | XC/HA | fermi-vmclust[1] | xrootd redirector |
| fermilnx-v07/tomcat01 | XC/HA | fermi-vmclust | Commons, Group manager |
| fermilnx-v16/tomcat06 | XC/HA | fermi-vmclust | rm2 |
| fermilnx-v05/tomcat08 | XC/HA | fermi-vmclust | dataCatalog |
| fermilnx-v17/tomcat09 | XC/HA | fermi-vmclust | Pipeline-II |
| fermilnx-v15/pipeline-mail01 | XC/HA | fermi-vmclust | Pipeline-II email server |
| fermilnx-v18/tomcat10 | XC/HA | fermi-vmclust | FCWebView, ISOCLogging, MPWebView, TelemetryMonitor, TelemetryTableWebUI |
| fermilnx-v10/tomcat11 | XC/HA | fermi-vmclust | DataProcessing |
| fermilnx-v11/tomcat12 | XC/HA | fermi-vmclust | TelemetryTrending |
| astore-new (HPSS) | NC | (non-Fermi server) | FastCopy data archive. **We have been granted a temporary quota increase of 1 TB on /nfs/farm/g/glast/u23, which has allowed this item to become "NC"** |
| trscron | HA | (non-Fermi server) | tokenized cron |
| lnxcron | HA | (non-Fermi server) | cron |
| (farm manager, etc.) | XC | (non-Fermi server) | LSF management |
| | HA | yfs01/NN (non-Fermi) | basically all of AFS |
| JIRA | HA | (non-Fermi server) | issue tracking (HA as of 10/20/2017) |
| | XC | rhel6-64 | public login nodes (a small number is needed for interactive access) |
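
The two batch-host rows above give core counts alongside "hequ" equivalents (400 cores for 50, 200 cores for 25), which implies 8 cores per "hequ" equivalent. A minimal sketch of that conversion follows; the ratio is inferred from those two rows rather than stated on this page, and the function name is hypothetical.

```python
# Sketch: convert a core count to "hequ" equivalents.
# The 8-cores-per-"hequ" ratio is inferred from the table rows
# (400 cores = 50 equivalents, 200 cores = 25 equivalents).

CORES_PER_HEQU = 400 // 50  # inferred ratio: 8


def cores_to_hequ(cores):
    """Return the number of "hequ" equivalents for a given core count."""
    return cores / CORES_PER_HEQU


if __name__ == "__main__":
    print("LISOC batch hosts:", cores_to_hequ(400))
    print("Science Pipelines batch hosts:", cores_to_hequ(200))
```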

[1] fermi-vmclust is the VMWare cluster.


† Equipment categories

| Category | Machine status |
|---|---|
| NC | non-critical for entire 16-day shutdown period |
| XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
| HA | high-availability (continuous operation) |
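
For scripting against the server table, the category codes can be captured in a small lookup. The sketch below is illustrative: the `must_stay_up` rule (anything tagged XC or HA counts as needing to stay up) is an assumed reading of the definitions above, not a policy stated on this page, and the function names are hypothetical.

```python
from enum import Enum


class Category(Enum):
    """Equipment categories, with descriptions copied from the table above."""
    NC = "non-critical for entire 16-day shutdown period"
    XC = "experiment critical but not in H.A. rack, only a few, short outages acceptable"
    HA = "high-availability (continuous operation)"


def parse_categories(cell):
    """Split a table cell such as 'XC/HA' into a list of Category members."""
    return [Category[name.strip()] for name in cell.split("/")]


def must_stay_up(cell):
    """Assumed rule: any category other than NC means the item must stay up."""
    return any(cat is not Category.NC for cat in parse_categories(cell))


if __name__ == "__main__":
    for cell in ("NC", "XC", "HA", "XC/HA"):
        print(cell, "->", must_stay_up(cell))
```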