
All times are Pacific time (PST/PDT). Red entries are active. Most recent entry first.

...

Nodes

...

Services

...

Start Time

...

Expected End Time

...

Actual End Time

...

Reason

...

Comments

...

psnxserv03

...

One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but the rebuild is extremely slow and sometimes hangs. The OST has been set read-only. Moving the data to other OSTs.

End: the file system couldn't be recovered, but we were able to move all the data elsewhere, then wipe and rebuild the file system.

...

Sep 22, 2015
7am

...

Sep 24, 2015
5pm

...

Sep 3, 2014

~7.30pm

...

Sep 4, 2014
12:45pm

...

Monday Aug 18th, 2014

Noon

...

To prepare for the power shut down on Aug 11.
To perform hardware and software maintenance before the power outage.

...

Monday Aug 18th, 2014

COB

 

...

Tuesday Aug 5th, 2014
5pm

...

Wednesday Aug 20th, 2014

COB

...

CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move.

...

All LCLS Offline filesystems

ana01, ana02, ana03, ana04, ana11, ana12, ana14

...

Tuesday Aug 5th, 2014
5pm

...

Monday Aug 18th, 2014

COB

...

No jobs can be submitted or managed at this time. We were informed that the software vendor has been contacted, and the SLAC Computing Division is working on the issue.

Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF.

...

All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall)

...

All services at LCLS

...

Friday, Aug 9th, 2013
13:00hrs

...

Tuesday, Aug 13th, 2013
13:00hrs

...

 

...

Planned power outage at LCLS buildings

...

 

...

psana11*,psana12*
psana13*, psana14*

...

LSF Compute Nodes

...

Thursday, May 30th, 2013 16:30hrs

...

Friday, May 31st, 2013 12:00hrs

...

 

...

Unplanned power outage at SLAC

...

Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics.

...

psanafeh

...

 

...

Thursday, May 30th, 2013 16:30hrs

...

Friday, May 31st, 2013 12:00hrs

...

Friday, May 31st, 2013 10:37am

...

Unplanned power outage at SLAC

...

 

...

psexport01

...

 

...

Thursday, May 30th, 2013 16:30hrs

...

Friday May 31st, 2013 12:00hrs

...

Friday, May 31st, 2013 11.00am

...

Unplanned power outage at SLAC

...

 

...

ana01, ana02

...

/reg/d/ana01, /reg/d/ana02 filesystems

...

Thursday, May 30th, 2013 16:30hrs

...

Friday May 31st, 2013 14:00hrs

...

Friday, May 31st, 6pm.

...

Unplanned power outage at SLAC

...

 

...

pssrv100 (psnfs)

...

NFS mountpoint for PCDS diskless nodes

...

Tuesday, Mar 26th, 2013
12:30pm

...

Tuesday, Mar 26th, 2013
5pm

...

Tuesday, Mar 26th, 2013
4:45pm

...

 

...

 

...

pssrv100 (psnfs)

...

NFS mountpoint for PCDS diskless nodes

...

Monday, Jan 7th, 2013 (1030 hrs)

...

Monday, Jan 7th, 2013
(1600 hrs)

...

Wednesday, Jan 9th, 2013
(1140 hrs)

...

RAID controller malfunctioned upon power restoral after planned power outage in B950 203A

...

pssrv101 (old data) was used to bring up the FEE nodes for part of the outage.  pssrv100 was restored to operation after a new RAID controller was delivered and installed.

...

ana01

...

/reg/d/ana01 filesystem

...

Tuesday, Dec 18th 2012

...

unknown

...

Partial (98%) restoral Monday Dec 24th (0800 hrs)

...

Controller failed causing corrupted parity data

...

Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days.

...

psanaoss21*

...

/reg/d/ana12 filesystem

...

Monday, Oct 8th, 2012 (1700 hrs)

...

Monday, Oct 8th, 2012 (1900 hrs)

...

Monday, Oct 8th, 2012 (1900 hrs)

...

Hardware upgrades

...

 

...

psanaoss2**

...

/reg/d/ana11 and /reg/d/ana12 filesystem

...

Thursday, Sep 27, 2012 (1700 hrs)

...

Friday, Sep 28, 2012 (0100 hrs)

...

Friday, Sep 28, 2012 (0400 hrs)

...

Hardware upgrades

...

 

...

Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE.

...

All computing services at LCLS.

...

Wednesday August 15, 2012
1:15 PM (1300 hrs)

...

August 17, 2012 1:00 PM
(Except psanafeh, ana11 and ana12 file systems, which will be down until Aug 21, 2012.)

...

 

...

SLAC sitewide power outage on August 16.
Electrical work at LCLS.
Server maintenance.

...

Expect logins to all machines to be unavailable between 8/15 and 8/17, even if some of the servers are powered up before the expected end time. There will be maintenance performed on various servers during these 2 days.

...

All machines in XPP hutch and control room will be inaccessible.

...

XPP

...

Monday April 9, 2012 11:15AM

...

Monday April 9, 2012 11:45AM

...

Monday April 9, 2012 11:30AM

...

Electrical Work at XPP Hutch

...

Completed

...

 

...

ana01/ana02 file systems

...

Wed Mar 28th, 2012 9am

...

Wed Mar 28th, 2012 1pm

...

Wed Mar 28th, 2012 4pm

...

Upgrade to IB

...

Completed

...

psananeh
lclsq
ana01
ana02

...

NEH storage and processing

...

Tue Dec 27, 2011 4pm

...

Mon Dec 26, 2011 1pm

...

Completed. Chilled water was restored on Friday.

Info

Add planned outages or maintenance activities using this project: https://jira.slac.stanford.edu/projects/CDSO/

Excerpt

Planned or ongoing

Jira
server: SLAC National Accelerator Laboratory
columnIds: priority,summary,customfield_11121,customfield_11122,description,assignee
columns: priority,summary,Target start,Target end,description,assignee
maximumIssues: 20
jqlQuery: resolution = Unresolved and project = "LCLS CDS Outages"
serverId: 1b8dc293-975d-3f2d-b988-18fd9aec1546

Completed

Jira
server: SLAC National Accelerator Laboratory
columnIds: summary,description,resolutiondate
columns: summary,description,resolutiondate
maximumIssues: 20
jqlQuery: status=done and project = "LCLS CDS Outages"
serverId: 1b8dc293-975d-3f2d-b988-18fd9aec1546
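
The two Jira listings above are driven by JQL queries. As a rough sketch, the same queries could be turned into search URLs for use outside the wiki. This assumes the Jira server at jira.slac.stanford.edu exposes the standard Jira REST search endpoint (/rest/api/2/search), which this page does not confirm. The helper name build_search_url is hypothetical. No request is sent; the code only builds the URLs.

```python
from urllib.parse import urlencode

# Host taken from the project link on this page.
JIRA_BASE = "https://jira.slac.stanford.edu"
# Assumption: the server exposes the standard Jira REST search endpoint.
SEARCH_PATH = "/rest/api/2/search"


def build_search_url(jql, fields, max_results=20):
    """Return a search URL for the given JQL query (hypothetical helper).

    Only builds the URL string; it does not contact the server.
    """
    params = {
        "jql": jql,
        "fields": ",".join(fields),
        "maxResults": max_results,
    }
    return JIRA_BASE + SEARCH_PATH + "?" + urlencode(params)


# "Planned or ongoing" listing: unresolved issues in the outage project.
planned_url = build_search_url(
    'resolution = Unresolved and project = "LCLS CDS Outages"',
    ["priority", "summary", "customfield_11121", "customfield_11122",
     "description", "assignee"],
)

# "Completed" listing: issues whose status is done.
completed_url = build_search_url(
    'status=done and project = "LCLS CDS Outages"',
    ["summary", "description", "resolutiondate"],
)

print(planned_url)
print(completed_url)
```

The field lists and the limit of 20 mirror the columnIds and maximumIssues values of the two macros. Fetching the URLs would additionally require whatever authentication the server expects.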

...

psana batch nodes

...

All science data is currently unavailable. psananeh and psanafeh are up for Matlab use, but there is no access to data on the Lustre file system.

...

Saturday Oct 1, 2011 6am

...

 

...

 

...

Lustre file system remains down after the unplanned power outage on Saturday.

...

The system administrators are working to bring them back.

...

 

...

All LCLS computing services

...

Monday
Nov 14, 2011 7am

...

 

...

 

...

Electrical work at NEH server room and FEH.

...

pslogin is up. NFS server, LDAP, DNS, pswww are up.
The daq nodes will not come up until after 4PM.
Lustre will not come up until after about 4.30PM.
Batch nodes (psana11* psana12*) and psana01* will not be up until Lustre is up.

...

psana,
NEH Online Nodes,
psimport,
psexport,
pslogin,
psdev,
psanasrv100,
psanasrv101,
psanasrv102

...

All Science data, All user home directories, all DAQ cache nodes. All online services.

...

Wed
Sep 28,
2011
10am

...

Wed
Sep 28,
2011
6pm

...

Wed
Sep 28,
2011
6pm

...

Upgrade of Lustre hardware.
Installation of taylor on several offline systems. Update of kernel on Online nodes.

...

 

...

psana

...

Science data access

...

Tue
Sep 20,
2011
11:15am

...

 

...

Tue
Sep 20,
2011
6:15pm

...

NEH power outage

...

B950 and several other buildings experienced a short power glitch, but the Lustre file servers did not survive the interruption and are still being brought up.

...

psana

...

Science data access

...

Thu
Jun 2,
2011
1pm

...

Thu
Jun 2,
2011
5pm

...

 

...

Lustre failover testing.

...

 

...

NEH online nodes
ana02
psexport, psimport

...

NEH DAQ, outside ssh access

...

Thu
May 25,
2011
noon

...

Thu
May 25,
2011
7pm

...

 

...

Server room upgrade, ana02 memory upgrade

...

Completed

...

psana

...

Science data access

...

Thu
May 12,
2011
1pm

...

Thu
May 12,
2011
6pm

...

Thu
May 12,
2011
6.30pm

...

Lustre maintenance

...

Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now.

...

psana

...

Science data access

...

Thu
May 5,
2011
1pm

...

Thu
May 5,
2011
5pm

...

Thu
May 5,
2011
5pm

...

Lustre maintenance

...

Completed

...

All

...

All

...

Fri
Apr 29,
2011
6.30pm

...

Sun
May 1,
2011
11pm

...

Sun
May 1,
2011
9pm

...

NEH power outage

...

Completed

...

psana

...

Science data access

...

Thu
Apr 28,
2011
2pm

...

Thu
Apr 28,
2011
6pm

...

Thu
Apr 28,
2011
3pm

...

Lustre maintenance
pssrv100 NFS volume reconstruction.

...

Completed
Lustre maintenance postponed.
RAID reconstruction on pssrv100 will take 2-3 days. The new volume size is not released by the controller, so we will have to perform the file system resize on another day.

...

psana

...

Science data access

...

Fri
Apr 1,
2011
6pm

...

Mon
Apr 4,
2011
10am

...

 

...

NEH cooling outage

...

Completed

...

psana

...

Science data access

...

Thu
Mar 31,
11am

...

Thu
Mar 31,
5pm

...

 

...

Enabling HA for Lustre system

...

Completed

...

All

...

All

...

Sat
Mar 26,
2011
7am

...

Sat
Mar 26,
2011
7pm

...

Mon
Mar 28,
2011
1pm

...

NEH power cut

...

Completed

...

psana

...

Science data access

...

Thu
Mar 24,
2011
11am

...

Thu
Mar 24,
2011
5pm

...

 

...

Lustre testing

...

Completed

...

All

...

All

...

Wed
Mar 23,
2011
10am

...

Wed
Mar 23,
2011
3pm

...

 

...

NEH power cut

...

This power cut was NOT planned

...

All

...

All

...

Sat
Mar 19,
2011
7am

...

Sat
Mar 19,
2011
7pm

...

Mon
Mar 21,
2011
10am

...

NEH power cut

...