
Nodes | Services | Start Time | Expected End Time | Actual End Time | Reason | Comments

All | All nodes | January 4th | January 11th | January 25th (User Services) | Server room relocation | Systems that are up: psnxserv, psana (interactive), psexport, psdev, pslogin, pswww (elog), pshub (JupyterHub), psana (batch). Systems that will be up later this week: Internal Services.
psana, psexport | | Aug 13, 7pm | Aug 14 | Aug 14 | Configuration error | Wrong MTU setting in building 50 prevents psana and psexport from mounting NFS. Will fix this morning.
All | All nodes | July 26, 5:00PM | July 30, 5:00PM | August 2nd | Power Outage | Systems are now on-line and fully functional. We encountered several systems with bad hardware and corrupted files. Unnecessary systems are now off-line.

psnfs03 and psnfs04 | /reg/g and /reg/common | Apr 24, 6:30am | Apr 24, 9am | Apr 24, 8:30 | firmware update |
psnfsopr | Operators home | Apr 24, 6:30am | Apr 24, 9am | Apr 24, 7:30 | Move and firmware update |
All | Networking will be down, so all machines and services will be unavailable. | Jan 17, 2:45 PM | Jan 17, 5:00 PM | Jan 17, 4:45 PM | Central router firmware upgrades and replacement of the NFS SAS module. | The new NFS SAS module worked, but psnfs03 and psnfs04 took longer than expected to boot.

The pslogin, psdev and psnxserv nodes will be unavailable. | User home directories and some /reg/* NFS shares will be unavailable. | Dec 26, 6:00 AM | Dec 26, 8:00 PM | Dec 26, 9:30pm | Firmware upgrades | Problems found with one of the NFS servers (psnfs03); may need to take another outage to fix.
All ANA filesystems, interactive (GPU & Phi) nodes, psexport nodes and LSF batch nodes | Science data and associated servers | Dec 26, 6:00 AM | Dec 31, 5:00 PM | Dec 31, 8:00 AM | Electrical Work |
HPSS Storage System will be unavailable | Data backup and recovery | Dec 25, 6:00 AM | Jan 8, 5:00 PM | Jan 1 | Electrical Work |
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | Aug 22, 1PM | Aug 23, 8:00 PM | Aug 23, 10:00PM | Cooling and Electrical Work | Outage recovery was late because we had several hardware problems which required intervention.
LCLS Computing | All LCLS computing services | July 25, 4:00 AM (PDT) | July 25, 6:00 PM (PDT) | | Electrical Work | Outage Canceled.
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | June 7, 5:30 AM (PDT) | June 7, 6:00 PM (PDT) | June 7, 6:53 PM (PDT) | Electrical Work | Outage recovery was an hour late because server room electrical work extended beyond the anticipated outage time and a Lustre-system RAID card failed.
ana02 | Science data | Wed, April 19, 11 AM | Wed, April 19, 4 PM | Wed, April 19, 4 PM | Update OS and Lustre version |

psnxserv03, psnxopr | | | | | NoMachine upgrade | Please use psnxserv01 and 02 while 03 is upgraded.
All | NFS | Fri Feb 24, 2017 9am | Fri Feb 24, 2017 9pm | Fri Feb 24, 2017 8:08pm | NFS upgrade | During this outage it will not be possible to access any user or operator home directories.
ana12 | Science data | Aug 16, 9am | Aug 16, 12pm | Aug 16, 12pm | Hardware failure | Access to six OSTs is very slow, presumably because of a failing RAID card. We will shut down one of the five ana12 OSSs to replace the card and reconfigure as needed. No data is expected to be lost, but some data will be unavailable during the outage.
ana04 | Science data | May 5th, 10am | May 10th, 10am | May 30th | Hardware failure | One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but the rebuild is extremely slow and sometimes hangs. The OST has been set read-only. Moving the data to other OSTs. End: the file system couldn't be recovered, but we were able to move all the data somewhere else, then wipe and rebuild the file system.

Batch nodes | Batch jobs | Oct 5, 2015 11am | Oct 5, 2015 1pm | | Move to RHEL7 | Interactive nodes will also be moved to RHEL7. Users logged into RHEL5 interactive nodes will be able to continue their sessions, but batch submission will fail once batch nodes are converted to RHEL7.
HPSS | Restoring files from tape | Sep 22, 2015 7am | Sep 24, 2015 5pm | Sep 24, 2015 12pm | Upgrade of HPSS to version 7.4 | HPSS will become read-only on Sep 21st at 5pm.
psnehprioq/psfehprioq | All nodes | April 1, 2015 9am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues.
psnehq/psfehq | All nodes | March 31, 2015 11am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues.
All NEH/FEH computing | All LCLS computing services | Dec 31, 00:00hrs | Dec 31, 18:00hrs | | Switching of generator power to building power at Building 950 where servers are housed. |
All NEH/FEH computing | | Nov. 7, 2014 ~12pm | Unknown | 7:30pm | Unscheduled power outage | Power has been restored. We are running file system checks on our NFS servers before we can bring everything else back up. We are hoping to get things back up by 4:30pm.
LCLS Offline Filesystem | All LCLS Offline filesystems | Sep 3, 2014 ~7:30pm | N/A | Sep 4, 2014 12:45pm | Network outage causing offline filesystems to be inaccessible. | We have identified the root cause and the issue should be fixed now.
LCLS Computing | All LCLS computing services | Friday Aug 8th, Midnight (00:00 hrs) | Monday Aug 18th, 2014 Noon | Aug 18th, 2014 12pm | To prepare for the power shutdown on Aug 11. To perform hardware and software maintenance before the power outage. | Not all machines will be shut down at once as we start performing system updates. Please do not depend on any service past midnight Aug 10th.
psexport, psana104, psana105 | Export nodes, HPSS, scratch and calib backups | Thursday August 7th, Noon | Monday Aug 18th, 2014 COB | | These machines need to be moved to Bld 50. |
LCLS Online Computing | All Online Computing Nodes, DSS, FFB | Tuesday Aug 5th, 2014 5pm | Wednesday Aug 20th, 2014 COB | | CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move. |
LCLS Offline Filesystem | All LCLS Offline filesystems: ana01, ana02, ana03, ana04, ana11, ana12, ana14 | Tuesday Aug 5th, 2014 5pm | Monday Aug 18th, 2014 COB | | Offline Analysis Hardware will be moved from B950 and B999 to B50. | The equipment needs to be disconnected and carefully moved before the power outage on Aug 11, and reconnected at B50.
LSF | LSF Job Submission and Management | Wednesday, Mar 19, 2014 | N/A | Wednesday, 6:45pm | Unplanned software outage | No job can be submitted or managed at this time. We were informed that the software vendor has been contacted, and SLAC Computing Division is working on the issue. Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF.

All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall) | All services at LCLS | Friday, Aug 9th, 2013 13:00hrs | Tuesday, Aug 13th, 2013 13:00hrs | | Planned power outage at LCLS buildings |

psana11*, psana12*, psana13*, psana14* | LSF Compute Nodes | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | | Unplanned power outage at SLAC | Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics.

psanafeh | | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Friday, May 31st, 2013 10:37am | Unplanned power outage at SLAC |

psexport01 | | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Friday, May 31st, 2013 11:00am | Unplanned power outage at SLAC |

ana01, ana02 | /reg/d/ana01, /reg/d/ana02 filesystems | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 14:00hrs | Friday, May 31st, 6pm | Unplanned power outage at SLAC |

pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Tuesday, Mar 26th, 2013 12:30pm | Tuesday, Mar 26th, 2013 5pm | Tuesday, Mar 26th, 2013 4:45pm | |

pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Monday, Jan 7th, 2013 (1030 hrs) | Monday, Jan 7th, 2013 (1600 hrs) | Wednesday, Jan 9th, 2013 (1140 hrs) | RAID controller malfunctioned upon power restoral after planned power outage in B950 203A | pssrv101 (old data) was used to bring up the FEE nodes for part of the outage. pssrv100 was restored to operation after a new RAID controller was delivered and installed.

ana01 | /reg/d/ana01 filesystem | Tuesday, Dec 18th 2012 | unknown | Partial (98%) restoral Monday Dec 24th (0800 hrs) | Controller failed causing corrupted parity data | Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days.

psanaoss21* | /reg/d/ana12 filesystem | Monday, Oct 8th, 2012 (1700 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Hardware upgrades |

psanaoss2** | /reg/d/ana11 and /reg/d/ana12 filesystems | Thursday, Sep 27, 2012 (1700 hrs) | Friday, Sep 28, 2012 (0100 hrs) | Friday, Sep 28, 2012 (0400 hrs) | Hardware upgrades |

Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE. | All computing services at LCLS. | Wednesday August 15, 2012 1:15 PM (1300 hrs) | August 17, 2012 1:00 PM (except psanafeh and the ana11 and ana12 file systems, which will be down until Aug 21, 2012) | | SLAC sitewide power outage on August 16. Electrical work at LCLS. Server maintenance. | Expect logging in to any machine to be unavailable between 8/15 and 8/17, even if some of the servers are powered up before the expected end time. Maintenance will be performed on various servers during these 2 days.

All machines in XPP hutch and control room will be inaccessible. | XPP | Monday April 9, 2012 11:15AM | Monday April 9, 2012 11:45AM | Monday April 9, 2012 11:30AM | Electrical Work at XPP Hutch | Completed

 

| ana01/ana02 file systems | Wed Mar 28th, 2012 9am | Wed Mar 28th, 2012 1pm | Wed Mar 28th, 2012 4pm | Upgrade to IB | Completed

psananeh, lclsq, ana01, ana02 | NEH storage and processing | Wed Dec 21, 2011 6am | Tue Dec 27, 2011 4pm | Mon Dec 26, 2011 1pm | Chilled water outage | Completed. Chilled water was restored on Friday.

psana batch nodes | All Science data is currently unavailable. psananeh and psanafeh are up for Matlab use, but with no access to data on the Lustre file system. | Saturday Oct 1, 2011 6am | | | Lustre file system remains down after the unplanned power outage on Saturday. | The system administrators are working to bring it back.

 

| All LCLS computing services | Monday Nov 14, 2011 7am | | | Electrical work at NEH server room and FEH. | pslogin is up. NFS server, LDAP, DNS, pswww are up. The daq nodes will not come up until after 4PM. Lustre will not come up until after about 4:30PM. Batch nodes (psana11* psana12*) and psana01* will not be up until Lustre is up.

psana, NEH Online Nodes, psimport, psexport, pslogin, psdev, psanasrv100, psanasrv101, psanasrv102 | All Science data, all user home directories, all DAQ cache nodes. All online services. | Wed Sep 28, 2011 10am | Wed Sep 28, 2011 6pm | Wed Sep 28, 2011 6pm | Upgrade of Lustre hardware. Installation of taylor on several offline systems. Update of kernel on Online nodes. |

psana | Science data access | Tue Sep 20, 2011 11:15am | | Tue Sep 20, 2011 6:15pm | NEH power outage | B950 and several other buildings experienced a short power glitch, but the Lustre file servers did not survive the interruption and are still being brought up.

psana | Science data access | Thu Jun 2, 2011 1pm | Thu Jun 2, 2011 5pm | | Lustre failover testing. |

NEH online nodes, ana02, psexport, psimport | NEH DAQ, outside ssh access | Thu May 25, 2011 noon | Thu May 25, 2011 7pm | | Server room upgrade, ana02 memory upgrade | Completed

psana | Science data access | Thu May 12, 2011 1pm | Thu May 12, 2011 6pm | Thu May 12, 2011 6:30pm | Lustre maintenance | Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now.

psana | Science data access | Thu May 5, 2011 1pm | Thu May 5, 2011 5pm | Thu May 5, 2011 5pm | Lustre maintenance | Completed

All | All | Fri Apr 29, 2011 6:30pm | Sun May 1, 2011 11pm | Sun May 1, 2011 9pm | NEH power outage | Completed

psana | Science data access | Thu Apr 28, 2011 2pm | Thu Apr 28, 2011 6pm | Thu Apr 28, 2011 3pm | Lustre maintenance. pssrv100 NFS volume reconstruction. | Completed. Lustre maintenance postponed. RAID reconstruction on pssrv100 will take 2-3 days. The new volume size is not released by the controller, so we will have to perform the file system resize on another day.

psana | Science data access | Fri Apr 1, 2011 6pm | Mon Apr 4, 2011 10am | | NEH cooling outage | Completed

psana | Science data access | Thu Mar 31, 11am | Thu Mar 31, 5pm | | Enabling HA for Lustre system | Completed

All | All | Sat Mar 26, 2011 7am | Sat Mar 26, 2011 7pm | Mon Mar 28, 2011 1pm | NEH power cut | Completed

psana | Science data access | Thu Mar 24, 2011 11am | Thu Mar 24, 2011 5pm | | Lustre testing | Completed

All | All | Wed Mar 23, 2011 10am | Wed Mar 23, 2011 3pm | | NEH power cut | This power cut was NOT planned

All | All | Sat Mar 19, 2011 7am | Sat Mar 19, 2011 7pm | Mon Mar 21, 2011 10am | NEH power cut | Completed