Notes on VMWare Cluster
The VMWare cluster, referred to as fermi-vmclust, consists of 4 hypervisors. Two of the machines are on HA, and two are not. A VM may migrate between these two hypervisors on demand; the machines have 128 GB of memory, and two hypervisors are enough to run all the VMs without oversubscribing memory.
The VMWare cluster is managed by the OCIO Platforms team, and they can migrate VMs on demand.
fermi-vmclust01 and fermi-vmclust02 are not HA.
fermi-vmclust03 and fermi-vmclust04 are HA.
Note: OCIO should be contacted about setting up occasional VM snapshots, so that a VM may always be brought up even if a hypervisor dies.
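The memory claim above (two hypervisors are enough to hold every VM without oversubscription) can be checked with a small placement sketch. This is only an illustration: it assumes the 128 GB figure is per hypervisor, ignores hypervisor overhead, and uses a first-fit-decreasing heuristic, so a `None` result means the heuristic found no placement, not that none exists. The function name `place_vms` and the VM sizes in the usage note are hypothetical and are not taken from the real cluster.

```python
def place_vms(vm_memory_gb, hypervisors=2, capacity_gb=128):
    """Try to place VMs on hypervisors without oversubscribing memory.

    vm_memory_gb: dict mapping VM name -> configured memory in GB (hypothetical input).
    Returns a dict of VM name -> hypervisor index, or None if the
    first-fit-decreasing heuristic cannot place every VM.
    """
    free = [capacity_gb] * hypervisors  # remaining memory on each hypervisor
    placement = {}
    # Place the largest VMs first; this usually packs better than arbitrary order.
    for name, mem in sorted(vm_memory_gb.items(), key=lambda kv: kv[1], reverse=True):
        for index, available in enumerate(free):
            if mem <= available:
                free[index] -= mem
                placement[name] = index
                break
        else:
            return None  # this VM did not fit on any hypervisor
    return placement
```

For example, `place_vms({"vm-a": 64, "vm-b": 64, "vm-c": 64, "vm-d": 64})` fills both hypervisors exactly, while three 100 GB VMs cannot be placed on two hosts and the function returns `None`.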
Hosts and Services
Host/Service | Category† | Physical Server(s) | OS | Function |
---|---|---|---|---|
xrootd | XC | fermi-gpfs01 fermi-gpfs02 fermi-gpfs05 fermi-gpfs06 fermi-gpfs07 fermi-gpfs08 | | xrootd server and storage |
GPFS | XC | fermi-gpfs03 fermi-gpfs04 | | Fermi NFS/GPFS storage |
GPFS/NFS bridge | XC | fermi-cnfs01 fermi-cnfs02 | | Fermi NFS storage access |
NFS (ISOC) | HA | staas-gpfs50 staas-gpfs51 | | Critical ISOC NFS storage |
Oracle | HA | fermi-oracle03 | | Oracle (primary) |
Oracle | XC | fermi-oracle04 | | Oracle (failover) |
mysql-node03 | HA | mysql05 mysql06 | | calibration, etc. DB |
fermilnx01 | HA | fermilnx01 | | LAT config, fastcopy and real-time telemetry |
fermilnx02 | HA | fermilnx02 | | LAT config, fastcopy and real-time telemetry |
fermilnx-v02 | XC/HA | fermi-vmclust | | xrootd redirector |
fermilnx-v03 | XC/HA | fermi-vmclust | | archiver |
fermilnx-v04 | | | | DataCatalog Crawler (Prod) |
fermilnx-v05/tomcat08 | XC/HA | fermi-vmclust | | DataCatalog Web |
fermilnx-v06 | | fermi-vmclust | | Xroot proxy server |
fermilnx-v07/tomcat01 | XC/HA | fermi-vmclust | | Commons, Group manager |
fermilnx-v08/tomcat02 glast-jobcontrol01 | | fermi-vmclust | | LSF Job Control Daemons (Notably glast, glastraw) |
Note: No fermilnx-v09 | | | | |
fermilnx-v10/tomcat11 | XC/HA | fermi-vmclust | | DataProcessing |
fermilnx-v11/tomcat12 | XC/HA | fermi-vmclust | | TelemetryTrending |
fermilnx-v12 | XC/HA | fermi-vmclust | | xrootd redirector |
fermilnx-v13/tomcat05 | | | | Pipeline-II (Prod) |
fermilnx-v14/centaurusa | | fermi-vmclust | | This machine requires user login. This machine is used as a Fermi CVS server, and a subversion server for a variety of user groups. svn functionality should move elsewhere |
fermilnx-v15/pipeline-mail01 | XC/HA | fermi-vmclust | | Pipeline-II email server (james) |
fermilnx-v16/tomcat06 | XC/HA | fermi-vmclust | | rm2 |
fermilnx-v17/tomcat09 | XC/HA | fermi-vmclust | | Pipeline-II (Web) |
fermilnx-v18/tomcat10 | XC/HA | fermi-vmclust | | FCWebView, ISOCLogging, MPWebView TelemetryMonitor, TelemetryTableWebUI |
fermilnx-v19/tomcat04 | | fermi-vmclust | | elog |
fermilnx7-v01 | | fermi-vmclust | centos7 | Docker installed - must be added to proper group. |
fermilnx7-v02 | | fermi-vmclust | centos7 | Docker installed - must be added to proper group. |
fermilnx7-v03 | | fermi-vmclust | centos7 | Docker installed - must be added to proper group. |
fermi-ci-test01 | | fermi-vmclust | centos7 | Docker is installed. Runs a Jenkins worker. Jenkins can dispatch GlastRelease jobs to this node. Note: This host should likely be renamed and/or removed. If so, the Jenkins should run on a fermilnx7 host. |
astore-new (HPSS) | NC | (non-Fermi server) | | FastCopy data archive **We have been granted a temporary quota increase of 1 TB on /nfs/farm/g/glast/u23, which has allowed this item to become "NC"** |
trscron | HA | (non-Fermi server) | | tokenized cron |
lnxcron | HA | (non-Fermi server) | | cron |
(farm manager, etc.) | XC | (non-Fermi server) | | LSF management |
AFS | HA | yfs01/NN (non-Fermi) | | basically all of AFS |
JIRA | HA | (non-Fermi server) | | issue tracking (HA as of 10/20/2017) |
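The fermilnx7 entries in the table above say that Docker is installed but that membership in the proper group is required. The sketch below shows one way to check that on a host. It assumes the note refers to a user account needing membership in Docker's default `docker` group; the table does not name the group, so treat that as an assumption. The function name `user_in_group` and the injectable lookup parameters are hypothetical, added only so the check can be exercised without touching the real account database.

```python
import grp
import pwd


def user_in_group(user, group="docker", getgrnam=grp.getgrnam, getpwnam=pwd.getpwnam):
    """Return True if `user` is a member of `group` (supplementary or primary).

    `group` defaults to "docker", which is an assumption about what the
    "proper group" in the host table means.
    """
    try:
        group_entry = getgrnam(group)
    except KeyError:
        return False  # the group does not exist on this host
    if user in group_entry.gr_mem:
        return True  # listed as a supplementary member
    try:
        # Also accept the case where the group is the user's primary group.
        return getpwnam(user).pw_gid == group_entry.gr_gid
    except KeyError:
        return False  # unknown user
```

On a fermilnx7 host this could be called as `user_in_group("someuser")`. If it returns `False`, an administrator would typically add the account with `usermod -aG docker someuser`, after which the user needs to log in again for the new group to take effect.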
† Equipment categories
Category | Machine status |
---|---|
NC | non-critical for entire 16-day shutdown period |
XC | experiment critical but not in H.A. rack, only a few, short outages acceptable |
HA | high-availability (continuous operation) |
Table of LISOC Tasks and Services
Function/Service | Sub-Functions | Needed Servers | Needed Databases | Needed File Systems | Other Needs | Needed During Shutdown? | Available During Shutdown? |
---|---|---|---|---|---|---|---|
Mission Planning, LAT Configurations | FastCopy | fermilnx01 and | TCDB | AFS | Fermi LAT Portal: Timeline Webview; Confluence, JIRA, Mission Planning s/w, FastCopy Monitoring Sharepoint (reference for PROCs and Narrative Procedures for commanding in case of anomalies) | yes | |
Real Time Telemetry Monitoring | fermilnx01 and fermilnx02 | spread Fermi LAT Portal: Real Time Telemetry, Telemetry Monitor | during anomalies | ||||
Logging | fermilnx01 and fermilnx02 | TCDB | Fermi LAT Portal: Log Watcher | yes | |||
Trending | TCDB | Fermi LAT Portal: Telemetry Trending | yes | ||||
L0 File Ingest and Archive | FastCopy | L0 Archive | yes | ||||
Data Gap Checking and Reporting | FastCopy | fermilnx01 and fermilnx02 | L0 Archive | yes, continuously | |||
L1 processing | pipeline | SLAC Farm | Data Catalog | Fermi LAT Portal: Pipeline, Data Processing | yes | ||
L1 Data Quality Monitoring | Fermi LAT Portal, Telemetry Trending | ||||||
L1 delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | yes | |||
L2 processing (ASP) and Delivery | FastCopy | fermilnx01 and fermilnx02 | Data Catalog | Fermi LAT Portal: Pipeline, Data Processing | daily, weekly |