All times are PST (Pacific Standard Time).  Most recent entry first.

Nodes

Services

Start Time     

Expected End Time

Actual End Time

Reason

Impact and Comments

============= Ongoing ============
Restore from tape
filemanager ws
Feb 2024
After the switch to the S3DF ANAFS, the restore service needs to be updated.

File restore and status for files in S3DF are not working.








============ Upcoming ============







Nodes

Services

Start Time     

Expected End Time

Actual End Time

Reason

Impact and Comments

============ Completed ============

S3DF data storage
sdfdata
Feb 9th, morning
unknown

Feb 10th, 15:00

The object store for the S3DF data storage (/sdf/data/lcls/ds) has connection problems, so data on it cannot be accessed and file access will likely fail with "Remote I/O error".


next update 3pm PST Feb. 10  – see https://s3df.slac.stanford.edu/

Please refrain from writing new data to /sdf/data/lcls/ as that might fill up the flash storage.

The object store itself is up and the data are fully protected; the issue is that the fast flash layer cannot interact with the object store.
If the outage persists, /sdf/data/lcls/ will be unmounted.

The file systems are available now.
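
The "Remote I/O error" quoted above corresponds to the Linux errno EREMOTEIO. The following is a minimal sketch, assuming Python on a Linux client, of how a script could tell this failure apart from other read errors. The function name classify_read and its opener parameter are hypothetical and are not part of any S3DF tooling.

```python
import errno

# EREMOTEIO is defined on Linux; fall back to its Linux value (121) on
# platforms where Python's errno module does not expose it.
EREMOTEIO = getattr(errno, "EREMOTEIO", 121)


def classify_read(path, opener=open):
    """Read one byte from path and report how the attempt went.

    Returns "ok", "remote-io-error", or "other-error:<ERRNO NAME>".
    The opener argument (hypothetical) exists only so the check can be
    exercised without a failing file system.
    """
    try:
        with opener(path, "rb") as handle:
            handle.read(1)
    except OSError as exc:
        if exc.errno == EREMOTEIO:
            return "remote-io-error"
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return "other-error:" + name
    return "ok"
```

A job could call classify_read on one of its input files before starting and postpone its work when the result is "remote-io-error".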

PCDSN Network
All Services
Nov 27th, noon
Dec 11th
Scheduled NEH power outage. 11/28-12/8

The PCDSN routers are located in NEH; thus, all PCDSN services will be impacted: Storage (NFS), Login (Bastion Hosts), NoMachine, Interactive, DRP, Build, and Diskless (IOC and DAQ).

Service in 208 will go down only when power is down. Controls and DAQ servers in affected areas will be turned off starting at noon the day before the power outage.

When power is restored, systems will be restored in this order:

  • central systems (208) to get UED operational (Dec 6, late afternoon)
  • NEH: LFE/KFE, TMO, RIX, Laser Hall, XPP (Dec 6-7)
  • FEH: XRT, XCS, MFX, CXI, MEC (Dec 8)

EPICS will remain offline until the above services are restored. Recovery can take up to 8 hours after these services are restored. PLC systems configured for automatic recovery will resume as soon as power is restored.

S3DF will not be impacted during this outage.

S3DF
Batch Computing and Interactive Nodes
Oct 16th, 8:00 PM
Oct 17th, 12:01 AM
Oct 17th, 11:00 PM
Weka upgrades

Upgrading Weka to the latest release to fix ACL issues. Although no impact is expected, this is being added for informational purposes.

Home directories on the psana/daq systems.
All, except S3DF.
Sep 7th, 2023, 12:30pm

Sep 7th, 2023 5pm
Weka problems

IT writes: If you experience issues, first try to log out and log back in again. If that doesn't work, reboot the host.

DRP
June 25th 8:00 PM
July 5th 2:00 PM
June 16th, 2023 7:00 PM
Finalize the storage client upgrade.

DRP {cmp,eb,mon} nodes in SRCF are not available

psana
ana lustre filesystems, data movers, interactive machines
July 6, 9:20am
July 6, 4:00pm
Unplanned power outage


ALL
ALL
June 25th at 8:00 PM
July 4th
SRCF-II Electrical System

NFS, DHCP, Home directories, Archiving, Logging, Grafana, Routing (inter-subnet traffic), EPICS + Gateways, DAQ + analysis systems, Virtual machines (eg. lfe-console), ssh, NTP

PLCs and networked controllers will continue to function insofar as no inter-subnet communication is required. PLC work and maintenance will require local connection. Intra-subnet traffic will still be possible. IOC hosts may remain online, but performance will likely degrade until they crash, and they will not recover until the outage is over. Reboot of any system may fail due to lack of IP assignment (DHCP is down). Archiving and logging will be dark.


This outage is required to integrate the new SRCF-II building with the existing data center.

FFB
FFB Storage
May 1st, 2023 6:00 AM
May 2nd, 2023 6:00 AM
May 5th ??
Preparation for S3DF functionality integration.

Upgrading system and storage version. 

LCLS2 DAQ unavailable, LCLS1 DAQ only able to record to event-nodes during outage.

The FFB was upgraded, but some clients still need work, which will be done by July 4th (SRCF-II outage).

pslogin, psana, NoMachine
Login
June 16, 2023, 2:00pm
June 16th, 2023 7:00 PM

Unplanned power outage in SRCF rack F1

Kafka
Kafka, elog
April 28th, 2023 14:00
17:00 same day
16:00
Switching data-management from PCDS Kafka to S3DF Kafka

The elog and some services might not be available for a short time.

Let pcds-datamgt-l know if something is still amiss.

ana02
ana02
April 4th, 2023

Decommissioning offline filesystem ana02

All data on ana02 were copied, cleaned or archived.
ana02 is not accessible anymore.

All
All
March 21, 2023 3pm
Unknown
March 22, 2023 4pm
SLAC site-wide power outage

ARP
Mar 9, 2023 9AM
Mar 12, 2023 9PM
Mar 9, 2023 9PM
S3DF support
We're adding S3DF support for the ARP and, as a result, will be making major changes to the Automatic Run Processing infrastructure.

psana
ANA03
Feb 23, 2023 11:30AM
Unknown
Feb 23, 2023 1:00PM
Unhappy OSTs.
From IT: There was an issue with the zpools in ANA03.

After coordinating the restarts, all seem operational.

All
All
Jan 18, 2023 7AM
Unknown
Jan 18, 2023 4PM
Planned power outage in B950
pslogin and psexport nodes are reachable again as of ~4pm (nodes located in SRCF, but not those in NEH).
ConfigDB
DAQ Configuration for LCLS2
Oct 18 3PM
Oct 18 5PM
Oct 18 4PM
Enhancements for restore configuration

psdb, psdm
eLog, data-mover
Sep 1st, 9am
Sep 2nd, 9am
Sep 1st, 4pm
Updating OS, Kafka, MongoDB
MongoDB, Kafka, and pswww were updated. All services are up and running, but some checks are still being performed.

psossana0101
ANA01
Aug 29th, 17:10
Aug 30, 10am
Disk issues
Three OSTs of ana01 are not accessible and access to ana01 will hang.

Many
NFS
July 24, 7:00
July 25, 14:00
Some Weka servers had problems.
Weka NFS filesystem showing multiple problems on many nodes

ANA FS
Lustre
June 24th, 8:00
June 24, 8:00
June 24, 15:00
Relocate OSSs

ANA FS
Interactive, Batch, ANA FS, Export
May 25, 9:00AM
May 25, 10:00AM
May 25, 11:22AM
Upgrade firmware to fix bug.
Part of the core upgrade, but encountered issues during the upgrade.

Core Services Network
All
May 25, 9:00AM
May 25, 10:00AM
May 25, 10:12AM
Upgrade firmware to fix bug.
No server reboots will be required.

All Diskless Systems
All Diskless Systems
April 21, 12:10PM
TBD
April 21, 1:10PM
Diskless system was rebooted, did not come back as expected
Critical impact

WEKA Cluster
NFS
April 7th, 8:00 AM
April 7th, 9:00 AM
April 7th, 9:30 AM
Patch Software
Minimal impact, no reboot required. Expect latency during the process.
Network Reorganization #3
ANA FS, Export Services, Interactive Pool and Batch cluster.
TBD

Network Upgrade

psana and ffb batch nodes
ana15 filesystem
Feb. 15 2022, 06:30
Feb. 15 2022, 11:10
OSS reboot was necessary
stopped again at 9:40
oss and mds got rebooted

psana and ffb batch nodes
ana01/02/03/15/16 filesystems
Feb. 9 2022, 0:00
Feb 9, 2022, 16:20
Some of the oss or mds have problems with their IP addresses.
ioc-und-rec01 was misconfigured and created a conflicting IP address with ana02/03

MFX nodes
Network




psmetric04
grafana (psdm)
Jan 12, 10:40
Jan 12, 10:45
Upgrade 6.7.1 -> 8.3.3.
PSDH-161

Network Reorganization #2
Name Services, Login Services, NoMachine, Web Services
Jan 6, 2022 9:00 AM
12:00 PM
1:30PM
Network Upgrade

Network Reorganization #1
ALL
Jan 5, 2022 8:00 AM
12:00 PM
Establish LCLS Routers

Weka
ALL
Jan 5, 2022 9:00 AM
10:00 AM
New Features and bug

XCS Nodes
Network
Nov 16, 9.30am

Switch upgrade

psdb0x
eLog, file manager, all web services
Dec 8th, 9AM
Dec 8th, 12PM
Dec 8th, 10AM
Upgrading Mongo to 4.4.10.

drp-neh-xxx
NEH DRP


Upgrading to the latest Lustre version
No DRP in the NEH alcove

All NEH and FEH nodes
IOCs and DAQ nodes, alcove DRP, and control room workstations
Aug 19, 6:00AM
Sep 18, 5:00PM
Electrical work in experimental halls
Systems in the experimental halls will be on-line whenever power is available. Note: psana is not in the experimental halls and won't be affected.
psanaoss215
ana11, ana12
Dec 2

oss psanaoss215 of ana11 is down
decommissioned

psanaoss231
ana13
Aug 5th, 5am

oss problem after power outage
decommissioned

all nodes
nfs file-system
Aug 31, 10am
Aug 31, 20:00
Some weka filesystems are not available.
File systems (e.g. /cds/sw, /cds/groupcfel/..) show I/O errors. Sometimes it might work.

xtc file restore
Nov 6th
Nov 8th
Nov 7th
Data restore from tape is not working
HPSS might have just been very slow mounting the tapes.

pshub01
JupyterHub
Oct 21, 9am
Oct 21, 9am
Oct 21, 9am
Upgrade hub version
Upgrade hub from 1.4.2 -> 2.0.0, fix matplotlib/ipympl issue

psjhub01
JupyterHub
Sep 16, 9am
Sep 16, 9:15am
Sep 16, 9:05am
Upgrade jupyterhub code

questionnaire
LCLS and UED questionnaires
Sep 14, 6PM
Sep 14, 10PM
Upgrade the questionnaire backend to new database
psexport
All data mover services
Aug 16, 1:00PM
Aug 19, 3:00PM

Aug 19, 15:30

All services have been started except the NEH data mover.

Diskless nodes
IOC and DAQ nodes
Aug 16, 9:00AM
Aug 20, 1:00PM
Diskless server will be migrated to the weka cluster
Cancelled

CXI Nodes
Network
Aug 18, 2:00PM
Aug 23, 5:00PM
Aug 27, 5:00PM
Network upgrade
No control room workstations or IOCs working in CXI during this time.

MEC Nodes
Network
Aug 18, 2:00PM
Aug 23, 5:00PM
Aug 26, 2:00PM
Network upgrade
No control room workstations or IOCs working in MEC during this time.

psnfsluste
ARP, Summary results
Aug 15th, 5:00PM
Aug 25, 1am
Various file system upgrades
ana-filesystems have been mounted on the psdmNN nodes

psdbdev01
LCLS2 calibration service
Aug 26th, 5PM
Aug 26th, 9PM
Aug 26th, 11:30PM
Move the databases to psdb0x machines
May affect analysis for TMO/RIX experiments.

drp-srcf-xxx

Fast feedback
Aug 15th, 5:00PM
Aug 25th, 5:00PM
Aug 25, 11:30PM
Reorganizing DRP racks and connectivity, upgrading Weka FFB Cluster
No fast feedback analysis during this time.

psana, JupyterHub
Interactive pool, batch nodes
Aug 16, 9:00AM
Aug 19, 2:00PM
Upgrading file systems, batch and interactive nodes
Some data analysis capabilities. Working on getting the remaining interactive and batch nodes on-line.

pswww
Web Services
Aug 16, 6:00PM
Aug 17, 12:00PM
Aug 17, 4:15PM
Service failed after upgrade.
Permission issues from the Weka upgrade

psnx, pslogin, psdev
All login services
Aug 16, 1:00PM
Aug 16, 5:00PM

Aug 16, 6:30PM

Upgrading host to latest packages and services
No ability to ssh into the system during this time.
psweka
NFS
Aug 16th, 9:00AM

Aug 16th, 1:00PM

Aug 16th, 5:00PM

Upgrading and re-configuring Weka Cluster

Users' home directories won't be available during this time, so nothing will work. This will require rebooting all LCLS servers to use the new NFS stack.

psanagpu, lustre
interactive, jupyter, anafs
Aug 5th, 5am
Aug 5th, 9:20am
Power issue in SRCF
Most systems have been restored. Only ana13 needs some attention.
Weka Cluster
NFS (home directories, central storage, all related systems)
5/22/2021
Unknown
5/24/2021 ~10pm
Under investigation

The LCLS IT team is working with the vendor to diagnose root cause and future actions.

The software version was rolled back, and the system is performing more stably.

psdb0x, psdm0x, pswww0x
logbooks, movers, questionnaires, most of the data management infrastructure
May 18, 9:00AM
May 18, 12:00PM
May 18, 9:55AM
Upgrading all the NFS mounts to the new version of NFS.

psweka
All
April 28th, 8:00PM
April 28th, 9:00PM
April 28th, 9:45PM
Deploying NFS-Ganesha

psanaoss121
ana02
Apr 23, 18:15
Apr 26, 11:50
oss crashed

psweka
All
April 26th 8am
April 26th 12pm
April 26th, 11am
Weka upgrade
Nothing will work during this time

psdb0x, psdm0x
logbooks, movers, questionnaires, most of the data management infrastructure
Apr 21, 3:00PM
Apr 21, 4:00PM
Apr 21, 5:00 PM
Moving all machines to 10Gbps networks.
psdm02 had a bad NIC port; we had to reconfigure it to use the alternate.

psanaoss12n
ana02
Apr 15, 4:30pm
Apr 15, 7pm
Apr 15, 5:40pm
Replacing broken fans

psexport
globus, data transfers
March 24, 9:00am
March 24, 10:00am
reboot psexports to remove ana11/12

psanaoss121
ana02
Apr 6th, 17:00
Apr 7th, afternoon
Apr 7th, 11am
disk related hardware issues

HPSS
tape archive, restore
March 23, 6:00am
March 23, 6pm
March 23, 14:20
HPSS upgrade

drp-srcf
FFB for TMO+XPP/XCS
March 11th, 10am
March 11th, 7pm
March 12th, 1am
Disable SMT on DRP SRCF nodes
psdb(psdb4)
Questionnaire/File Restore/File Manager/LCLS 1 DAQ
data mover
Mar 3 6:00PM
Mar 3 10:00PM
Moving databases to new cluster
The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines.

All
ldap/dhcp/dns
Feb 24th at 10am
Feb 24th at 1pm
Feb 24th at 3pm
psrelay migration
Netconfig and reboots will fail during this time and name resolution lookups may be delayed for a few seconds.
psdb(psdb4)

Questionnaire/File Restore/File Manager/LCLS 1 DAQ
data mover

Feb 11 9:00AM
Feb 12 2:30AM
Feb 12, 8:00PM
Moving databases to new cluster

The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines.

Reverted back to psdb4 owing to routing issues.

Instrument Network
all NEH/FEH computing
2/4/2021 10:30am
2/4/2021 11:20am
Power supply failure
Replaced. Redundant supply added.
pswww/pswebkdc
elog/file restore
Jan 20, 2021 ~4:00PM

Jan 21, 2021 ~1 am

Issues with VMWare hypervisor

psdm0x

psdb0x

eLog/File restore etc
Dec 21
Dec 24
Dec 22
Upgrade of backends to new releases of software

ana03
ana03
Dec 17, 2:49AM
Dec 17, 10:15am
Lustre issue
psossana0303 was stuck and needed a hard reboot
Ana file systems
ana13
Nov 18
Nov 20
Nov 23, 11am
One OST is not accessible for write

ost0xe can not be written to due to mds/ost issues. It has been set to ro but writes to ana13 are slowed down. Reboot required.

Fixed by rebooting MDS, (OSSs were also rebooted)

pshub01
JupyterHub
Oct 29 1:00PM
Oct 29 5:44PM
Disk failure.
There is a disk failure in the filesystem that stores the JupyterHub sessions. We are trying to recover this and restart the node.
We could not recover the sessions; please log out and restart your session.

<<All>>
Network File System (NFS)
10/19/20, 4pm
10/22/20, 9am

 

Upgrade to new Solid State Drive-based, distributed clusters

Remaining /reg/[d,g,neh]/* NFS volumes

(see Detailed list)

ANA file systems
ana02/03/11/13/15
10/13/20, 6.30am
10/15/20, 17:40
(Unscheduled/failure)

21:20  UED has been moved to weka-nfs:  /cds/data/ued/ana,

The ana-filesystems are accessible now.
The FFB->anafs migration has started but it will take time to clear the backlog

<<All>>
CDS Router
10/5/20, 9am
10/5/20, 12pm
10/5/20, 1pm
Upgrade to new routers, 100Gb optics
During 4-day PAMM. Actual disruption should be on the order of minutes.

<<All>>
Network File System (NFS)
10/5/20, 12pm
10/8/20, 12pm
10/8/20, 12pm
Upgrade to new Solid State Drive-based, distributed clusters

4-day PAMM. Replace aging legacy systems. Service disruptions may endure for several days as hard-links in applications are identified and repaired in real-time.

Only /reg/neh/opr/ and /reg/g/pcds/ mounts were migrated.  Future outage(s) will address remaining volumes.

ANA
ana04
April 16th
May 22nd
ana04 is down
Hardware problem with one ana04 ost

ANA file system
ana14
Dec, 2019
May 27th
ana14 is down
Hardware problem with one ana14 oss

nfs server
home directories
Sep 29th
Sep 29th, 11:30
nfs server issues
All servers are back up. Most if not all issues have been fixed.

psnfs02
home directories on home5
Sep 3, 16:40
Sep 3, 19:11
psnfs02 crashed

ANA FS; GPUs, Export, Batch nodes
ANA02, ANA03, ANA11, ANA15, psana, psanafarm, psexport
July 7th, 5:00PM
July 9th, 8:00AM
July 8th, 6:00PM
Circuit breaker replacement

Circuit breaker has been replaced and sensitivity decreased.

In order to replace the unit, the entire distribution panel (UDB-C) must be powered off.

ANA
ana13
June 8th
June 10th
June 10th, 10:45
psanaoss232 is down

Gateway, Build nodes, ML node
pscag1 - pscag4, psbuild-rhel5, psbuild-rhel6, psbuild-rhel7, psjerry
June 8th, 7:30AM
June 8th, 9:00AM
June 8th, 6:00PM
Power maintenance

Power maintenance completed. The gateway and build servers are now accessible.
VMware
pswww, pslogin, psdev
May 18th, 08:00
May 18th
May 19th, 02:00

pslogin, psdev and pswww are not accessible

Access using the LCLS NX servers will continue to work

VMware
pswww, pslogin, psdev
May 16th, 10:00
May 16th, 17:00
May 17th, 18:30
VM hypervisor upgrade
Access using the LCLS NX servers will continue to work

psnfs02
home directories
May 12th
May 13th
May 13th, 10:14
server is down
Some users' home directories are not accessible

psexport
Globus, gridftp
May 11th
May 12
May 13th, 19:30
psexport01 is down

psexport works again, gridftp (globus) is available.

pslogin, pswww
login, web services
May 12th
May 12th
May 12th, 11:20
virtual machine cluster
Most VMs should be accessible now.

pswww
Data Management Portal
Feb 29
Mar 3 5PM
Mar 5th
Migrating databases to LCLS2 infrastructure
We'll be migrating all the databases to LCLS2 data management systems. The current data management portal should be available as read only.
ANA
Analysis Infrastructure
4 December 19 4:00PM PDT
Friday, December 13th

superseded by other outages

Troubleshooting short-circuit

While bringing up some of the file systems, we encountered electrical issues. The plan is to keep things stable until Monday to make sure the equipment which is currently powered on is working properly.

Status:

Up: psana, psanaq, psexport, ana02, ana03, ana04, ana12, ana15

Down until further notice: ana14

Update: 2020-02-25
    ana11 is up and batch nodes psana12xx

ANA file systems
Analysis infrastructure
Nov 14th, 2019
Nov 27th
Dec 4th
Move from building 50 to building 54 (SRCF)

Update:

ANA14, ANA12, ANA04, and ANA02 up by Friday

ANA11 up by Monday

Completed tasks:

ana03, ana12, ana13, and ana15 are up

Most of the psana batch queue is now available.

The psana interactive pool can now be accessed, but several nodes are still down (e.g. psanagpu115, psanagpu116, etc.).

The psexport nodes are operational.

pslogin, kerberos
pslogin, passwordless access

November 8, 2019 11:45AM

TBD
November 11th
2 Hypervisors are unresponsive



ANA file systems
April 11, 12:30pm
Unknown

April 11, 7:30PM


SLAC wide power glitch

7:30PM: ANA14 is now online. All systems are up and functional.

6:00PM: All ANA lustre filesystems have been recovered except ANA14 which is suffering major hardware issues. We don't know yet how long it will take to recover it.

12:30PM: All systems are up by now except the ANA file systems because the Lustre MDS was damaged. We don't know yet how long it will take to recover it.

All
All nodes
January 4th
January 11th

January 25th (User Services)


Server room relocation

Systems that are up:

  • psnxserv
  • psana (interactive)
  • psexport
  • psdev
  • pslogin
  • pswww(elog)
  • pshub(JupyterHub)
  • psana(batch)

psana, psexport
Aug 13, 7pm
Aug 14
Aug 14
Configuration error
Wrong MTU setting in building 50 prevents psana and psexport from mounting NFS. Will fix this morning.
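
An MTU mismatch like the one above can be checked for on Linux hosts by reading each interface's MTU from sysfs. This is a minimal sketch, assuming Python on Linux. The function names, the interface names and the expected value in the usage note are hypothetical, since the log does not say which interface or MTU value was involved.

```python
from pathlib import Path


def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Return the MTU of a Linux network interface, read from sysfs."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())


def find_mtu_mismatches(interfaces, expected_mtu, sysfs_root="/sys/class/net"):
    """Return {interface: actual_mtu} for every interface whose MTU differs."""
    mismatches = {}
    for interface in interfaces:
        actual = read_mtu(interface, sysfs_root)
        if actual != expected_mtu:
            mismatches[interface] = actual
    return mismatches
```

For example, find_mtu_mismatches(["eth0"], 9000) would return an empty dict when eth0 (a hypothetical interface name) is already set to the hypothetical expected value of 9000.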
All
All nodes
July 26, 5:00PM
July 30, 5:00PM
August 2nd
Power Outage

Systems are now online and fully functional.

We encountered several systems with bad hardware and corrupted files.

Unnecessary systems are now off-line.

psnfs03 and psnfs04
/reg/g and /reg/common
Apr 24, 6:30am
Apr 24, 9am
Apr 24, 8:30
firmware update

psnfsopr
Operators home
Apr 24, 6:30am
Apr 24, 9am
Apr 24, 7:30
Move and firmware update

All
Networking will be down, so all machines and services will be unavailable.

Jan 17, 2:45 PM

Jan 17, 5:00 PM

Jan 17, 4:45 PM

Central Router firmware upgrades and replace NFS SAS module.

The new SAS NFS module worked, but psnfs03 and psnfs04 took longer than expected to boot.

The pslogin, psdev and psnxserv nodes will be unavailable.

User home directories and some /reg/* NFS shares will be unavailable

Dec 26, 6:00 AM

Dec 26, 8:00 PM
Dec 26, 9:30pm
Firmware upgrades
Problems found with one of the NFS servers (psnfs03), may need to take another outage to fix
All ANA filesystems, interactive (GPU & Phi) nodes, psexport nodes and LSF batch nodes
Science data and associated servers
Dec 26, 6:00 AM
Dec 31, 5:00 PM

Dec 31, 8:00 AM

Electrical Work

HPSS Storage System will be unavailable
Data backup and recovery
Dec 25, 6:00 AM
Jan 8, 5:00 PM
Jan 1
Electrical Work

All ana filesystems, interactive nodes, psexport nodes and most batch nodes
Science data and associated servers
Aug 22, 1PM

Aug 23, 8:00 PM

Aug 23, 10:00PM
Cooling and Electrical Work
Outage recovery was late because we had several hardware problems which required intervention.

LCLS Computing
All LCLS computing services
July 25, 4:00 AM (PDT)

July 25, 6:00 PM (PDT)


Electrical Work
Outage Canceled.

All ana filesystems, interactive nodes, psexport nodes and most batch nodes
Science data and associated servers
June 7, 5:30 AM (PDT)
June 7, 6:00 PM (PDT)
June 7, 6:53 PM (PDT)
Electrical Work
Outage recovery was an hour late because server room electrical work extended beyond anticipated outage time and a Lustre-system RAID card failed.

ana02
Science data
Wed, April 19, 11 AM
Wed, April 19, 4 PM
Wed, April 19, 4 PM
Update OS and Lustre version

psnxserv03

psnxopr


NoMachine upgrade
Please use psnxserv01 and 02 while 03 is upgraded

All
NFS
Fri Feb 24, 2017 9am
Fri Feb 24, 2017 9pm
Fri Feb 24, 2017 8:08pm
NFS upgrade
During this outage it won't be possible to access any user or operator home directories.

ana12
Science data
Aug 16, 9am
Aug 16, 12pm
Aug 16, 12pm
Hardware failure
Access to six OSTs is very slow, presumably because of a failing RAID card. We will shut down one of five ana12 OSSs to replace the card and reconfigure as needed. No data is expected to be lost, but some data will be unavailable during the outage.

ana04
Science data
May 5th, 10am
May 10th, 10am
May 30th
Hardware failure

One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but it's extremely slow and sometimes it hangs. The OST has been set read only. Moving the data to other OSTs.

End: file system couldn't be recovered, but we were able to move all the data somewhere else, wipe and rebuild the file system.

Batch nodes
Batch jobs
Oct 5, 2015 11am
Oct 5, 2015 1pm
Move to RHEL7
Interactive nodes will also be moved to rhel7. Users logged into rhel5 interactive nodes will be able to continue their session, but batch submission will fail once batch nodes are converted to rhel7.

HPSS
Restoring files from tape

Sep 22, 2015 7am

Sep 24, 2015 5pm

Sep 24, 2015 12pm
Upgrade of HPSS to version 7.4
HPSS will become read-only on Sep 21st at 5pm.

psnehprioq/psfehprioq
All nodes
April 1, 2015 9am
April 1, 2015 8pm
March 31, 2015
Maintenance on nodes to allow addition of 640 cores to computing system
Ended early due to technical issues.

psnehq/psfehq
All nodes
March 31, 2015 11am
April 1, 2015 8pm
March 31, 2015
Maintenance on nodes to allow addition of 640 cores to computing system
Ended early due to technical issues.

All NEH/FEH computing
All LCLS computing services
Dec 31, 00:00hrs
Dec 31, 18:00hrs
Switching of generator power to building power at Building 950 where servers are housed.

All NEH/FEH computing
Nov. 7, 2014 ~12pm
Unknown
7:30pm
Unscheduled power outage
Power has been restored. We are running file system checks for our NFS servers before we can bring everything else back up. We are hoping to get things back up by 4.30pm.
LCLS Offline FilesystemAll LCLS Offline filesystems

Sep 3, 2014

~7.30pm

N/A

Sep 4, 2014
12:45pm

Network outage causing offline filesystems to be inaccessible.
We have identified the root cause and the issue should be fixed now.
LCLS Computing
All LCLS computing services
Friday Aug 8th, Midnight (00:00 hrs)

Monday Aug 18th, 2014

Noon

Aug 18th, 2014 12pm

To prepare for the power shut down on Aug 11.
To perform hardware and software maintenance before the power outage.

Not all machines will be shut down at once as we start performing system updates. Please do not depend on any service past midnight Aug 10th.
psexport, psana104, psana105
Export nodes, HPSS, scratch and calib backups
Thursday August 7th, Noon

Monday Aug 18th, 2014

COB



These machines need to be moved to Bld 50.
LCLS Online Computing
All Online Computing Nodes, DSS, FFB

Tuesday Aug 5th, 2014
5pm

Wednesday Aug 20th, 2014

COB


CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move.


LCLS Offline Filesystem

All LCLS Offline filesystems

ana01, ana02, ana03, ana04, ana11, ana12, ana14

Tuesday Aug 5th, 2014
5pm

Monday Aug 18th, 2014

COB


Offline Analysis Hardware will be moved from B950 and B999 to B50.
The equipment needs to be disconnected and carefully moved before the power outage on Aug 11, and reconnected at B50.
LSF
LSF Job Submission and Management
Wednesday, Mar 19, 2014
N/A
Wednesday, 6:45pm
Unplanned software outage

No job can be submitted and managed at this time. We were informed that software vendor has been contacted, and SLAC Computing Division is working on the issue.

Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF.

All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall)

All services at LCLS

Friday, Aug 9th, 2013
13:00hrs

Tuesday, Aug 13th, 2013
13:00hrs


Planned power outage at LCLS buildings


psana11*,psana12*
psana13*, psana14*

LSF Compute Nodes

Thursday, May 30th, 2013 16:30hrs

Friday, May 31st, 2013 12:00hrs


Unplanned power outage at SLAC

Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics.

psanafeh


Thursday, May 30th, 2013 16:30hrs

Friday, May 31st, 2013 12:00hrs

Friday, May 31st, 2013 10:37am

Unplanned power outage at SLAC


psexport01


Thursday, May 30th, 2013 16:30hrs

Friday May 31st, 2013 12:00hrs

Friday, May 31st, 2013 11.00am

Unplanned power outage at SLAC


ana01, ana02

/reg/d/ana01, /reg/d/ana02 filesystems

Thursday, May 30th, 2013 16:30hrs

Friday May 31st, 2013 14:00hrs

Friday, May 31st, 6pm.

Unplanned power outage at SLAC


pssrv100 (psnfs)

NFS mountpoint for PCDS diskless nodes

Tuesday, Mar 26th, 2013
12:30pm

Tuesday, Mar 26th, 2013
5pm

Tuesday, Mar 26th, 2013
4:45pm

 

 

pssrv100 (psnfs)

NFS mountpoint for PCDS diskless nodes

Monday, Jan 7th, 2013 (1030 hrs)

Monday, Jan 7th, 2013
(1600 hrs)

Wednesday, Jan 9th, 2013
(1140 hrs)

RAID controller malfunctioned upon power restoral after planned power outage in B950 203A

pssrv101 (old data) was used to bring up the FEE nodes for part of the outage.  pssrv100 was restored to operation after a new RAID controller was delivered and installed.

ana01

/reg/d/ana01 filesystem

Tuesday, Dec 18th 2012

unknown

Partial (98%) restoral Monday Dec 24th (0800 hrs)

Controller failed causing corrupted parity data

Parity errors  fixed and new controller installed.  2 OSTs (LUNs) needed fsck'ing.  One took a few hours, the other took 10 days.

psanaoss21*

/reg/d/ana12 filesystem

Monday, Oct 8th, 2012 (1700 hrs)

Monday, Oct 8th, 2012 (1900 hrs)

Monday, Oct 8th, 2012 (1900 hrs)

Hardware upgrades


psanaoss2**

/reg/d/ana11 and /reg/d/ana12 filesystem

Thursday, Sep 27, 2012 (1700 hrs)

Friday, Sep 28, 2012 (0100 hrs)

Friday, Sep 28, 2012 (0400 hrs)

Hardware upgrades


Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE.

All computing services at LCLS.

Wednesday August 15, 2012
1:15 PM (1300 hrs)

August 17, 2012 1:00 PM
(Except psanafeh, ana11 and ana12 file systems, which will be down till Aug 21, 2012.)


SLAC sitewide power outage on August 16.
Electrical work at LCLS.
Server maintenance.

Expect logging in to any machines to be unavailable between 8/15 and 8/17 even if some of the servers are powered up before the expected end time. There will be maintenance performed on various servers during these 2 days.

All machines in XPP hutch and control room will be inaccessible.

XPP

Monday April 9, 2012 11:15AM

Monday April 9, 2012 11:45AM

Monday April 9, 2012 11:30AM

Electrical Work at XPP Hutch

Completed


ana01/ana02 file systems

Wed Mar 28th, 2012 9am

Wed Mar 28th, 2012 1pm

Wed Mar 28th, 2012 4pm

Upgrade to IB

Completed

psananeh
lclsq
ana01
ana02

NEH storage and processing


Wed Dec 21, 2011 6am

Tue Dec 27, 2011 4pm

Mon Dec 26, 2011 1pm


Chilled water outage

Completed. Chilled water was restored on Friday.

psana batch nodes

All Science data is currently unavailable. Psananeh psanafeh is up for Matlab use, but no access to data on Lustre file system.

Saturday Oct 1, 2011 6am



Lustre file system remains down after the unplanned power outage on Saturday.

The system administrators are working to bring them back.


All LCLS computing services

Monday
Nov 14, 2011 7am



Electrical work at NEH server room and FEH.

pslogin is up. NFS server, LDAP, DNS, pswww are up.
The daq nodes will not come up until after 4PM.
Lustre will not come up until after about 4.30PM.
Batch nodes (psana11* psana12*) and psana01* will not be up until Lustre is up.

psana,
NEH Online Nodes,
psimport,
psexport,
pslogin,
psdev.
psanasrv100,
psanasrv101,
psanasrv102

All Science data, All user home directories, all DAQ cache nodes. All online services.

Wed
Sep 28,
2011
10am

Wed
Sep 28,
2011
6pm

Wed
Sep 28,
2011
6pm

Upgrade of Lustre hardware.
Installation of taylor on several offline systems. Update of kernel on Online nodes.


psana

Science data access

Tue
Sep 20,
2011
11:15am


Tue
Sep 20,
2011
6:15pm

NEH power outage

B950 and several other buildings experienced a short power glitch, but the Lustre file servers did not survive the interruption and are still being brought up.

psana

Science data access

Thu
Jun 2,
2011
1pm

Thu
Jun 2,
2011
5pm


Lustre failover testing.


NEH online nodes
ana02
psexport, psimport

NEH DAQ, outside ssh access

Thu
May 25,
2011
noon

Thu
May 25,
2011
7pm


Server room upgrade, ana02 memory upgrade

Completed

psana

Science data access

Thu
May 12,
2011
1pm

Thu
May 12,
2011
6pm

Thu
May 12,
2011
6.30pm

Lustre maintenance

Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now.

psana

Science data access

Thu
May 5,
2011
1pm

Thu
May 5,
2011
5pm

Thu
May 5,
2011
5pm

Lustre maintenance

Completed

All

All

Fri
Apr 29,
2011
6.30pm

Sun
May 1,
2011
11pm

Sun
May 1,
2011
9pm

NEH power outage

Completed

psana

Science data access

Thu
Apr 28,
2011
2pm

Thu
Apr 28,
2011
6pm

Thu
Apr 28,
2011
3pm

Lustre maintenance
pssrv100 NFS volume reconstruction.

Completed
Lustre maintenance postponed.
Raid reconstruction pssrv100 will take 2-3 days. The new volume size is not released by the controller, so we will have to perform the file system resize on another day.

psana

Science data access

Fri
Apr 1,
2011
6pm

Mon
Apr 4,
2011
10am


NEH cooling outage

Completed

psana

Science data access

Thu
Mar 31,
11am

Thu
Mar 31,
5pm


Enabling HA for Lustre system

Completed

All

All

Sat
Mar 26,
2011
7am

Sat
Mar 26,
2011
7pm

Mon
Mar 28,
2011
1pm

NEH power cut

Completed

psana

Science data access

Thu
Mar 24,
2011
11am

Thu
Mar 24,
2011
5pm


Lustre testing

Completed

All

All

Wed
Mar 23,
2011
10am

Wed
Mar 23,
2011
3pm


NEH power cut

This power cut was NOT planned

All

All

Sat
Mar 19,
2011
7am

Sat
Mar 19,
2011
7pm

Mon
Mar 21,
2011
10am

NEH power cut

Completed
