...
Nodes | Services | Start Time | Expected End Time | Actual End Time | Reason | Comments |
---|---|---|---|---|---|---|
All | Networking will be down, so all machines and services will be unavailable. | Jan 17, 4:00 PM | Jan 17, 5:00 PM | Central Router firmware upgrades | ||
The pslogin, psdev and psnxserv nodes will be unavailable. | User home directories and some /reg/* NFS shares will be unavailable | Dec 26, 6:00 AM | Dec 26, 8:00 PM | Dec 26, 9:30pm | Firmware upgrades | Problems were found with one of the NFS servers (psnfs03); we may need to take another outage to fix them. |
All ANA filesystems, interactive (GPU & Phi) nodes, psexport nodes and LSF batch nodes | Science data and associated servers | Dec 26, 6:00 AM | Dec 31, 5:00 PM | Dec 31, 8:00 AM | Electrical Work | |
HPSS Storage System will be unavailable | Data backup and recovery | Dec 25, 6:00 AM | Jan 8, 5:00 PM | Jan 1 | Electrical Work | |
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | Aug 22, 1PM | Aug 23, 8:00 PM | Aug 23, 10:00PM | Cooling and Electrical Work | Outage recovery was late because we had several hardware problems which required intervention. |
| Electrical Work | Outage Canceled. | ||||
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | June 7, 5:30 AM (PDT) | June 7, 6:00 PM (PDT) | June 7, 6:53 PM (PDT) | Electrical Work | Outage recovery was an hour late because server room electrical work extended beyond anticipated outage time and a Lustre-system RAID card failed. |
ana02 | Science data | Wed, April 19, 11 AM | Wed, April 19, 4 PM | Wed, April 19, 4 PM | Update OS and Lustre version | |
psnxserv03 | psnxopr | NoMachine upgrade | Please use psnxserv01 and 02 while 03 is upgraded | |||
All | NFS | Fri Feb 24, 2017 9am | Fri Feb 24, 2017 9pm | Fri Feb 24, 2017 8:08pm | NFS upgrade | During this outage it won't be possible to access any user or operator home directories. |
ana12 | Science data | Aug 16, 9am | Aug 16, 12pm | Aug 16, 12pm | Hardware failure | Access to six OSTs is very slow, presumably because of a failing RAID card. We will shut down one of the five ana12 OSSs to replace the card and reconfigure as needed. No data is expected to be lost, but some data will be unavailable during the outage. |
ana04 | Science data | May 5th, 10am | May 10th, 10am | May 30th | Hardware failure | One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but the rebuild is extremely slow and sometimes hangs. The OST has been set read-only and the data is being moved to other OSTs. End: the file system couldn't be recovered, but we were able to move all the data elsewhere, then wipe and rebuild the file system. |
Batch nodes | Batch jobs | Oct 5, 2015 11am | Oct 5, 2015 1pm | Move to RHEL7 | Interactive nodes will also be moved to RHEL7. Users logged into RHEL5 interactive nodes will be able to continue their sessions, but batch submission will fail once the batch nodes are converted to RHEL7. | |
HPSS | Restoring files from tape | Sep 22, 2015 | Sep 24, 2015 | Sep 24, 2015 12pm | Upgrade of HPSS to version 7.4 | HPSS will become read-only on Sep 21st at 5pm. |
psnehprioq/psfehprioq | All nodes | April 1, 2015 9am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues. |
psnehq/psfehq | All nodes | March 31, 2015 11am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues. |
All NEH/FEH computing | All LCLS computing services | Dec 31, 00:00hrs | Dec 31, 18:00hrs | Switching of generator power to building power at Building 950 where servers are housed. | ||
All NEH/FEH computing | Nov. 7, 2014 ~12pm | Unknown | 7:30pm | Unscheduled power outage | Power has been restored. We are running file system checks on our NFS servers before we can bring everything else back up. We are hoping to get things back up by 4:30pm. | |
LCLS Offline Filesystem | All LCLS Offline filesystems | Sep 3, 2014 ~7:30pm | N/A | Sep 4, 2014 | Network outage causing offline filesystems to be inaccessible. | We have identified the root cause, and the issue should be fixed now. |
LCLS Computing | All LCLS computing services | Friday Aug 8th, Midnight (00:00 hrs) | Monday Aug 18th, 2014 Noon | Aug 18th, 2014 12pm | To prepare for the power shutdown on Aug 11. | Not all machines will be shut down at once as we start performing system updates. Please do not depend on any service past midnight Aug 10th. |
psexport, psana104, psana105 | Export nodes, HPSS, scratch and calib backups | Thursday August 7th, Noon | Monday Aug 18th, 2014 COB | | These machines need to be moved to Bld 50. | |
LCLS Online Computing | All Online Computing Nodes, DSS, FFB | Tuesday Aug 5th, 2014 | Wednesday Aug 20th, 2014 COB | CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move. | ||
LCLS Offline Filesystem | All LCLS Offline filesystems: ana01, ana02, ana03, ana04, ana11, ana12, ana14 | Tuesday Aug 5th, 2014 | Monday Aug 18th, 2014 COB | Offline Analysis Hardware will be moved from B950 and B999 to B50. | The equipment needs to be disconnected and carefully moved before the power outage on Aug 11, then reconnected at B50. | |
LSF | LSF Job Submission and Management | Wednesday, Mar 19, 2014 | N/A | Wednesday, 6:45pm | Unplanned software outage | No jobs can be submitted or managed at this time. We were informed that the software vendor has been contacted, and SLAC Computing Division is working on the issue. Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF. |
All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall) | All services at LCLS | Friday, Aug 9th, 2013 | Tuesday, Aug 13th, 2013 | | Planned power outage at LCLS buildings | |
psana11*, psana12* | LSF Compute Nodes | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | | Unplanned power outage at SLAC | Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics. |
psanafeh | | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Friday, May 31st, 2013 10:37am | Unplanned power outage at SLAC | |
psexport01 | | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Friday, May 31st, 2013 11:00am | Unplanned power outage at SLAC | |
ana01, ana02 | /reg/d/ana01, /reg/d/ana02 filesystems | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 14:00hrs | Friday, May 31st, 6pm | Unplanned power outage at SLAC | |
pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Tuesday, Mar 26th, 2013 | Tuesday, Mar 26th, 2013 | Tuesday, Mar 26th, 2013 | | |
pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Monday, Jan 7th, 2013 (1030 hrs) | Monday, Jan 7th, 2013 | Wednesday, Jan 9th, 2013 | RAID controller malfunctioned upon power restoral after planned power outage in B950 203A | pssrv101 (old data) was used to bring up the FEE nodes for part of the outage. pssrv100 was restored to operation after a new RAID controller was delivered and installed. |
ana01 | /reg/d/ana01 filesystem | Tuesday, Dec 18th 2012 | unknown | Partial (98%) restoral Monday Dec 24th (0800 hrs) | Controller failed causing corrupted parity data | Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days. |
psanaoss21* | /reg/d/ana12 filesystem | Monday, Oct 8th, 2012 (1700 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Hardware upgrades | |
psanaoss2** | /reg/d/ana11 and /reg/d/ana12 filesystems | Thursday, Sep 27, 2012 (1700 hrs) | Friday, Sep 28, 2012 (0100 hrs) | Friday, Sep 28, 2012 (0400 hrs) | Hardware upgrades | |
Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE. | All computing services at LCLS. | Wednesday August 15, 2012 | August 17, 2012 1:00 PM | | SLAC sitewide power outage on August 16. | Expect logging in to any machines to be unavailable between 8/15 and 8/17 even if some of the servers are powered up before the expected end time. There will be maintenance performed on various servers during these 2 days. |
All machines in XPP hutch and control room will be inaccessible. | XPP | Monday April 9, 2012 11:15AM | Monday April 9, 2012 11:45AM | Monday April 9, 2012 11:30AM | Electrical Work at XPP Hutch | Completed |
| ana01/ana02 file systems | Wed Mar 28th, 2012 9am | Wed Mar 28th, 2012 1pm | Wed Mar 28th, 2012 4pm | Upgrade to IB | Completed |
psananeh | NEH storage and processing | | Tue Dec 27, 2011 4pm | Mon Dec 26, 2011 1pm | | Completed. Chilled water was restored on Friday. |
psana batch nodes | All science data is currently unavailable. psananeh and psanafeh are up for Matlab use, but there is no access to data on the Lustre file system. | Saturday Oct 1, 2011 6am | | | Lustre file system remains down after the unplanned power outage on Saturday. | The system administrators are working to bring it back. |
| All LCLS computing services | Monday | | | Electrical work at NEH server room and FEH. | pslogin is up. NFS server, LDAP, DNS, pswww are up. |
psana, | All Science data, All user home directories, all DAQ cache nodes. All online services. | Wed | Wed | Wed | Upgrade of Lustre hardware. | |
psana | Science data access | Tue | | Tue | NEH power outage | B950 and several other buildings experienced a short power glitch, but the Lustre file servers did not survive the interruption and are still being brought up. |
psana | Science data access | Thu | Thu | | Lustre failover testing. | |
NEH online nodes | NEH DAQ, outside ssh access | Thu | Thu | | Server room upgrade, ana02 memory upgrade | Completed |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now. |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed |
All | All | Fri | Sun | Sun | NEH power outage | Completed |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed |
psana | Science data access | Fri | Mon | | NEH cooling outage | Completed |
psana | Science data access | Thu | Thu | | Enabling HA for Lustre system | Completed |
All | All | Sat | Sat | Mon | NEH power cut | Completed |
psana | Science data access | Thu | Thu | | Lustre testing | Completed |
All | All | Wed | Wed | | NEH power cut | This power cut was NOT planned |
All | All | Sat | Sat | Mon | NEH power cut | Completed |