All times are PST (Pacific Standard Time).  Most recent entry first.

...

Nodes

...

Services

...

Start Time     

...

Expected End Time

...

Actual End Time

...

Reason

...

Impact and Comments

...

Please refrain from writing new data to /sdf/data/lcls/ as that might fill up the flash storage.

The object store itself is up and the data are fully protected; the issue is that the fast flash layer cannot interact with the object store.
If the outage persists, /sdf/data/lcls/ will be unmounted.

...

File restore and file status in S3DF are not working.

...

The PCDSN routers are located in NEH, so all PCDSN services will be impacted: Storage (NFS), Login (Bastion Hosts), NoMachine, Interactive, DRP, Build, and Diskless (IOC and DAQ).

Service in 208 will go down only when power is down. Controls and DAQ servers in affected areas will be turned off starting at noon the day before the power outage.

When power is restored, systems will be restored in this order:

  • central systems (208) to get UED operational (Dec 6, late afternoon)
  • NEH: LFE/KFE, TMO, RIX, Laser Hall, XPP (Dec 6-7)
  • FEH: XRT, XCS, MFX, CXI, MEC (Dec 8)

EPICS will remain offline until the above services are restored. Recovery can take up to 8 hours after these services are restored. PLC systems configured for automatic recovery will resume as soon as power is restored.

S3DF will not be impacted during this outage.

...

Nodes

...

Services

...

Start Time     

...

Expected End Time

...

Actual End Time

...

Reason

...

Impact and Comments

...

Working with security to open up psexport01.

...

Upgrading Weka to the latest release to fix ACL issues. Although no impact is expected, this is being added for informational purposes.

...

IT writes: If you experience issues, first try to log out and log back in again. If that doesn't work, reboot the host.

...

DRP {cmp,eb,mon} nodes in SRCF are not available

...

NFS, DHCP, Home directories, Archiving, Logging, Grafana, Routing (inter-subnet traffic), EPICS + Gateways, DAQ + analysis systems, Virtual machines (e.g. lfe-console), ssh, NTP

PLCs and networked controllers will continue to function insofar as no inter-subnet communication is required. PLC work and maintenance will require a local connection. Intra-subnet traffic will still be possible. IOC hosts may remain online, but performance will likely degrade until they crash, and they will not recover until the outage is over. A reboot of any system may fail due to lack of IP assignment (DHCP is down). Archiving and logging will be dark.

This outage is required to integrate the new SRCF-II building with the existing data center.

Info

Add planned outages or maintenance activities using this project: https://jira.slac.stanford.edu/projects/CDSO/

Excerpt

Planned or ongoing

Jira
server: SLAC National Accelerator Laboratory
columnIds: priority,summary,customfield_11121,customfield_11122,description,assignee
columns: priority,summary,Target start,Target end,description,assignee
maximumIssues: 20
jqlQuery: resolution = Unresolved and project = "LCLS CDS Outages"
serverId: 1b8dc293-975d-3f2d-b988-18fd9aec1546

Completed

Jira
server: SLAC National Accelerator Laboratory
columnIds: summary,description,resolutiondate
columns: summary,description,resolutiondate
maximumIssues: 20
jqlQuery: status=done and project = "LCLS CDS Outages"
serverId: 1b8dc293-975d-3f2d-b988-18fd9aec1546
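
The two Jira filters above (planned or ongoing, and completed) could also be queried outside this page. The following is a minimal sketch that only builds the search URLs; it assumes the Jira server at jira.slac.stanford.edu exposes the standard `/rest/api/2/search` REST endpoint and that the caller is authenticated, neither of which is confirmed here. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Host taken from the project link in the Info box; REST availability is an assumption.
JIRA_BASE = "https://jira.slac.stanford.edu"

# JQL strings copied from the two Jira macros on this page.
PLANNED_JQL = 'resolution = Unresolved and project = "LCLS CDS Outages"'
COMPLETED_JQL = 'status=done and project = "LCLS CDS Outages"'


def build_search_url(jql, fields, max_results=20, base=JIRA_BASE):
    """Return a Jira REST search URL for the given JQL (hypothetical helper).

    No request is sent; the URL is only assembled and percent-encoded.
    """
    query = urlencode({
        "jql": jql,
        "fields": ",".join(fields),
        "maxResults": max_results,  # mirrors maximumIssues in the macros
    })
    return f"{base}/rest/api/2/search?{query}"


planned_url = build_search_url(
    PLANNED_JQL,
    ["priority", "summary", "customfield_11121", "customfield_11122",
     "description", "assignee"],
)
completed_url = build_search_url(
    COMPLETED_JQL,
    ["summary", "description", "resolutiondate"],
)

print(planned_url)
print(completed_url)
```

The resulting URLs can be opened in a browser session that is already logged in to Jira, or passed to an HTTP client of your choice.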

...

Upgrading system and storage version. 

LCLS2 DAQ unavailable, LCLS1 DAQ only able to record to event-nodes during outage.

The FFB was upgraded, but some clients still need work, which will be done by July 4th (SRCF-II outage).

...

Unplanned power outage in SRCF rack F1

...

The elog and some services might not be available for a short time.

Let pcds-datamgt-l know if something is still amiss.

...

All data on ana02 were copied, cleaned or archived.
ana02 is not accessible anymore.

...

After coordinating the restarts, all seem operational.

...

Aug 19, 15:30

...

drp-srcf-xxx

...

Aug 16, 6:30PM

...

Aug 16th, 1:00PM

...

Aug 16th, 5:00PM

...

Users' home directories won't be available during this time, so nothing will work. All LCLS servers will need to be rebooted to use the new NFS stack.

...

The LCLS IT team is working with the vendor to diagnose root cause and future actions.

The software version was rolled back, and the system is performing more stably.

...

Questionnaire/File Restore/File Manager/LCLS 1 DAQ
data mover

...

The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines.

Reverted back to psdb4 owing to routing issues.

...

Jan 21, 2021
~1 am

...

psdm0x

psdb0x

...

ost0xe cannot be written to due to MDS/OST issues. It has been set to read-only, but writes to ana13 are slowed down. Reboot required.

Fixed by rebooting the MDS (the OSSs were also rebooted).

...

 

...

Remaining /reg/[d,g,neh]/* NFS volumes

(see Detailed list)

...

21:20  UED has been moved to weka-nfs:  /cds/data/ued/ana,

The ana-filesystems are accessible now.
The FFB->anafs migration has started, but it will take time to clear the backlog.

...

4-day PAMM. Replace aging legacy systems. Service disruptions may last several days as hard-links in applications are identified and repaired in real time.

Only /reg/neh/opr/ and /reg/g/pcds/ mounts were migrated.  Future outage(s) will address remaining volumes.

...

Circuit breaker has been replaced and sensitivity decreased.

In order to replace the unit, the entire distribution panel (UDB-C) must be powered off.

...

June 8th

7:30AM

...

June 8th

9:00AM

...

June 8th,

6:00PM

Power maintenance

...

May 19th,

02:00

...

pslogin, psdev and psww are not accessible

Access using the LCLS NX servers will continue to work

...

psexport works again, gridftp (globus) is available.

...

superseded by other outages

...

While bringing up some of the file systems, we encountered electrical issues. The plan is to keep things stable until Monday to make sure the equipment which is currently powered on is working properly.

Status:

Up: psana, psanaq, psexport, ana02, ana03, ana04, ana12, ana15

Down until further notice: ana14

Update: 2020-02-25
    ana11 and batch nodes psana12xx are up

...

Update:

ANA14, ANA12, ANA04, and ANA02 up by Friday

ANA11 up by Monday

Completed tasks:

ana03, ana12, ana13, and ana15 are up

Most of the psana batch queue is now available.

The psana interactive pool can now be accessed, but several nodes are still down (e.g. psanagpu115, psanagpu116, etc.).

The psexport nodes are operational.

...

November 8, 2019

11:45AM

...

April 11,

7:30PM

...

7:30PM: ANA14 is now online. All systems are up and functional.

6:00PM: All ANA lustre filesystems have been recovered except ANA14 which is suffering major hardware issues. We don't know yet how long it will take to recover it.

12:30PM: All systems are up by now except the ANA file systems because the Lustre MDS was damaged. We don't know yet how long it will take to recover it.

...

January 25th (User Services)

...

Systems that are up:

  • psnxserv
  • psana (interactive)
  • psexport
  • psdev
  • pslogin
  • pswww (elog)
  • pshub (JupyterHub)
  • psana (batch)

...

Systems are now online and fully functional.

We encountered several systems with bad hardware and corrupted files.

Unnecessary systems are now off-line.

...

Jan 17,

2:45 PM

...

Jan 17,

5:00 PM

...

Jan 17,

4:45 PM

...

Central router firmware upgrades and NFS SAS module replacement.

...

The pslogin, psdev and psnxserv nodes will be unavailable.

...

Dec 26, 6:00 AM

...

Dec 31,

8:00 AM

...

Aug 23, 8:00 PM

...

July 25, 6:00 PM (PDT)

...

psnxserv03

...

One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but the rebuild is extremely slow and sometimes hangs. The OST has been set read-only, and the data are being moved to other OSTs.

End: file system couldn't be recovered, but we were able to move all the data somewhere else, wipe and rebuild the file system.

...

Sep 22, 2015
7am

...

Sep 24, 2015
5pm

...

Sep 3, 2014

~7.30pm

...

Sep 4, 2014
12:45pm

...

Monday Aug 18th, 2014

Noon

...

To prepare for the power shut down on Aug 11.
To perform hardware and software maintenance before the power outage.

...

Monday Aug 18th, 2014

COB

...

Tuesday Aug 5th, 2014
5pm

...

Wednesday Aug 20th, 2014

COB

...

CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move.

...

All LCLS Offline filesystems

ana01, ana02, ana03, ana04, ana11, ana12, ana14

...

Tuesday Aug 5th, 2014
5pm

...

Monday Aug 18th, 2014

COB

...

No jobs can be submitted or managed at this time. We were informed that the software vendor has been contacted, and SLAC Computing Division is working on the issue.

Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF.

...

All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall)

...

All services at LCLS

...

Friday, Aug 9th, 2013
13:00hrs

...

Tuesday, Aug 13th, 2013
13:00hrs

...

Planned power outage at LCLS buildings

...

psana11*,psana12*
psana13*, psana14*

...

LSF Compute Nodes

...

Thursday, May 30th, 2013 16:30hrs

...

Friday, May 31st, 2013 12:00hrs

...

Unplanned power outage at SLAC

...

Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics.

...

psanafeh

...

Thursday, May 30th, 2013 16:30hrs

...

Friday, May 31st, 2013 12:00hrs

...

Friday, May 31st, 2013 10:37am

...

Unplanned power outage at SLAC

...

psexport01

...

Thursday, May 30th, 2013 16:30hrs

...

Friday May 31st, 2013 12:00hrs

...

Friday, May 31st, 2013 11.00am

...

Unplanned power outage at SLAC

...

ana01, ana02

...

/reg/d/ana01, /reg/d/ana02 filesystems

...

Thursday, May 30th, 2013 16:30hrs

...

Friday May 31st, 2013 14:00hrs

...

Friday, May 31st, 6pm.

...

Unplanned power outage at SLAC

...

pssrv100 (psnfs)

...

NFS mountpoint for PCDS diskless nodes

...

Tuesday, Mar 26th, 2013
12:30pm

...

Tuesday, Mar 26th, 2013
5pm

...

Tuesday, Mar 26th, 2013
4:45pm

...

 

...

 

...

pssrv100 (psnfs)

...

NFS mountpoint for PCDS diskless nodes

...

Monday, Jan 7th, 2013 (1030 hrs)

...

Monday, Jan 7th, 2013
(1600 hrs)

...

Wednesday, Jan 9th, 2013
(1140 hrs)

...

RAID controller malfunctioned upon power restoral after planned power outage in B950 203A

...

pssrv101 (old data) was used to bring up the FEE nodes for part of the outage.  pssrv100 was restored to operation after a new RAID controller was delivered and installed.

...

ana01

...

/reg/d/ana01 filesystem

...

Tuesday, Dec 18th 2012

...

unknown

...

Partial (98%) restoral Monday Dec 24th (0800 hrs)

...

Controller failed causing corrupted parity data

...

Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days.

...

psanaoss21*

...

/reg/d/ana12 filesystem

...

Monday, Oct 8th, 2012 (1700 hrs)

...

Monday, Oct 8th, 2012 (1900 hrs)

...

Monday, Oct 8th, 2012 (1900 hrs)

...

Hardware upgrades

...

psanaoss2**

...

/reg/d/ana11 and /reg/d/ana12 filesystem

...

Thursday, Sep 27, 2012 (1700 hrs)

...

Friday, Sep 28, 2012 (0100 hrs)

...

Friday, Sep 28, 2012 (0400 hrs)

...

Hardware upgrades

...

Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE.

...

All computing services at LCLS.

...

Wednesday August 15, 2012
1:15 PM (1300 hrs)

...

August 17, 2012 1:00 PM
(Except psanafeh, ana11 and ana12 file systems, which will be down until Aug 21, 2012.)

...

SLAC sitewide power outage on August 16.
Electrical work at LCLS.
Server maintenance.

...

Expect logins to all machines to be unavailable between 8/15 and 8/17, even if some of the servers are powered up before the expected end time. Maintenance will be performed on various servers during these 2 days.

...

All machines in XPP hutch and control room will be inaccessible.

...

XPP

...

Monday April 9, 2012 11:15AM

...

Monday April 9, 2012 11:45AM

...

Monday April 9, 2012 11:30AM

...

Electrical Work at XPP Hutch

...

Completed

...

ana01/ana02 file systems

...

Wed Mar 28th, 2012 9am

...

Wed Mar 28th, 2012 1pm

...

Wed Mar 28th, 2012 4pm

...

Upgrade to IB

...

Completed

...

psananeh
lclsq
ana01
ana02

...

NEH storage and processing

...

Tue Dec 27, 2011 4pm

...

Mon Dec 26, 2011 1pm

...

Completed. Chilled water was restored on Friday.

...

psana batch nodes

...

All Science data is currently unavailable. psananeh/psanafeh are up for MATLAB use, but there is no access to data on the Lustre file system.

...

Saturday Oct 1, 2011 6am

...

Lustre file system remains down after the unplanned power outage on Saturday.

...

The system administrators are working to bring them back.

...

All LCLS computing services

...

Monday
Nov 14, 2011 7am

...

Electrical work at NEH server room and FEH.

...

pslogin is up. NFS server, LDAP, DNS, pswww are up.
The daq nodes will not come up until after 4PM.
Lustre will not come up until after about 4.30PM.
Batch nodes (psana11* psana12*) and psana01* will not be up until Lustre is up.

...

psana,
NEH Online Nodes,
psimport,
psexport,
pslogin,
psdev,
psanasrv100,
psanasrv101,
psanasrv102

...

All Science data, All user home directories, all DAQ cache nodes. All online services.

...

Wed
Sep 28,
2011
10am

...

Wed
Sep 28,
2011
6pm

...

Wed
Sep 28,
2011
6pm

...

Upgrade of Lustre hardware.
Installation of taylor on several offline systems. Update of kernel on Online nodes.

...

psana

...

Science data access

...

Tue
Sep 20,
2011
11:15am

...

Tue
Sep 20,
2011
6:15pm

...

NEH power outage

...

B950 and several other buildings experienced a short power glitch. The Lustre file servers did not survive the interruption and are still being brought up.

...

psana

...

Science data access

...

Thu
Jun 2,
2011
1pm

...

Thu
Jun 2,
2011
5pm

...

Lustre failover testing.

...

NEH online nodes
ana02
psexport, psimport

...

NEH DAQ, outside ssh access

...

Thu
May 25,
2011
noon

...

Thu
May 25,
2011
7pm

...

Server room upgrade, ana02 memory upgrade

...

Completed

...

psana

...

Science data access

...

Thu
May 12,
2011
1pm

...

Thu
May 12,
2011
6pm

...

Thu
May 12,
2011
6.30pm

...

Lustre maintenance

...

Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now.

...

psana

...

Science data access

...

Thu
May 5,
2011
1pm

...

Thu
May 5,
2011
5pm

...

Thu
May 5,
2011
5pm

...

Lustre maintenance

...

Completed

...

All

...

All

...

Fri
Apr 29,
2011
6.30pm

...

Sun
May 1,
2011
11pm

...

Sun
May 1,
2011
9pm

...

NEH power outage

...

Completed

...

psana

...

Science data access

...

Thu
Apr 28,
2011
2pm

...

Thu
Apr 28,
2011
6pm

...

Thu
Apr 28,
2011
3pm

...

Lustre maintenance
pssrv100 NFS volume reconstruction.

...

Completed
Lustre maintenance postponed.
RAID reconstruction on pssrv100 will take 2-3 days. The new volume size is not released by the controller, so we will have to perform the file system resize on another day.

...

psana

...

Science data access

...

Fri
Apr 1,
2011
6pm

...

Mon
Apr 4,
2011
10am

...

NEH cooling outage

...

Completed

...

psana

...

Science data access

...

Thu
Mar 31,
11am

...

Thu
Mar 31,
5pm

...

Enabling HA for Lustre system

...

Completed

...

All

...

All

...

Sat
Mar 26,
2011
7am

...

Sat
Mar 26,
2011
7pm

...

Mon
Mar 28,
2011
1pm

...

NEH power cut

...

Completed

...

psana

...

Science data access

...

Thu
Mar 24,
2011
11am

...

Thu
Mar 24,
2011
5pm

...

Lustre testing

...

Completed

...

All

...

All

...

Wed
Mar 23,
2011
10am

...

Wed
Mar 23,
2011
3pm

...

NEH power cut

...

This power cut was NOT planned

...

All

...

All

...

Sat
Mar 19,
2011
7am

...

Sat
Mar 19,
2011
7pm

...

Mon
Mar 21,
2011
10am

...

NEH power cut

...