All times are PT. Red entries are active. Most recent entry first.
Nodes | Services | Start Time | Expected End Time | Actual End Time | Reason | Comments |
---|---|---|---|---|---|---|
============ Ongoing ============ | ||||||
psanaoss231 | ana13 | Aug 5th, 5am | oss problem after power outage | |||
psanaoss215 | ana11 | Dec 2 | oss psanaoss215 of ana11 is down | |||
drp-srcf-xxx | Fast feedback | Aug 15th, 5:00PM | Aug 18th, 3:00PM | Reorganizing DRP racks and connectivity, upgrading Weka FFB Cluster | No fast feedback analysis during this time. | |
Diskless nodes | IOC and DAQ nodes | Aug 16, 9:00AM | Aug 17, 1:00PM | Diskless server will be migrated to the weka cluster | All diskless clients will be rebooted to use the new network interface. | |
psana | Interactive pool, batch nodes | Aug 16, 9:00AM | Aug 18, 5:00PM | Upgrading file systems, batch and interactive nodes | No data analysis capabilities during this time. | |
psexport | All data mover services | Aug 16, 1:00PM | Aug 18, 5:00PM | Waiting for psana and DRP | No ability to move science data during this time. | |
============ Upcoming ============ | ||||||
MEC and CXI nodes | Network | Aug 18, 2:00PM | Aug 18, 5:00PM | Network upgrade | No control room workstations or IOCs working in MEC/CXI during this time. | |
All NEH and FEH nodes | IOCs and DAQ nodes, alcove DRP, and control room workstations | Aug 19, 6:00AM | Sep 18, 5:00PM | Electrical work in experimental halls | Systems in the experimental halls will be on-line whenever power is available. Note: psana is not in the experimental halls and won't be affected. | |
drp-neh-xxx | NEH DRP | Aug 19, 2:00PM | Aug 19, 5:00PM | Upgrading to the latest Lustre version | No DRP in the NEH alcove | |
============ Completed ============ | ||||||
pswww | Web Services | Aug 16, 6:00PM | Aug 17, 12:00PM | Aug 17, 4:15PM | Service failed after upgrade. | Permission issues from the Weka upgrade |
psnx, pslogin, psdev | All login services | Aug 16, 1:00PM | Aug 16, 5:00PM | Aug 16, 6:30PM | Upgrading host to latest packages and services | No ability to ssh into the system during this time. |
psweka | NFS | Aug 16th, 9:00AM | Aug 16th, 1:00PM | Aug 16th, 5:00PM | Upgrading and re-configuring Weka Cluster | User home directories won't be available during this time, so nothing will work. This will require rebooting all LCLS servers to use the new NFS stack. |
psanagpu, lustre | interactive, jupyter, anafs | Aug 5th, 5am | Aug 5th, 9:20am | power issue SRCF | Most systems have been restored. Only ana13 needs some attention. | |
Weka Cluster | NFS (home directories, central storage, all related systems) | 5/22/2021 | Unknown | 5/24/2021 ~10pm | Under investigation | The LCLS IT team is working with the vendor to diagnose root cause and future actions. The software version was rolled back, and the system is performing more stably. |
psdb0x, psdm0x, pswww0x | logbooks, movers, questionnaires, most of the data management infrastructure | May 18, 9:00AM | May 18, 12:00PM | May 18, 9:55AM | Upgrading all the NFS mounts to the new version of NFS. | |
psweka | All | April 28th, 8:00PM | April 28th, 9:00PM | April 28th, 9:45PM | Deploying NFS-Ganesha | |
psanaoss121 | ana02 | Apr 23, 18:15 | Apr 26, 11:50 | oss crashed | ||
psweka | All | April 26th 8am | April 26th 12pm | April 26th, 11am | Weka upgrade | Nothing will work during this time |
psdb0x, psdm0x | logbooks, movers, questionnaires, most of the data management infrastructure | Apr 21, 3:00PM | Apr 21, 4:00PM | Apr 21, 5:00PM | Moving all machines to 10Gbps networks. | psdm02 had a bad NIC port; we had to reconfigure it to use the alternate. |
psanaoss12n | ana02 | Apr 15, 4:30pm | Apr 15, 7pm | Apr 15, 5:40pm | Replacing broken fans | |
psexport | globus, data transfers | March 24, 9:00am | March 24, 10:00am | reboot psexports to remove ana11/12 | ||
psanaoss121 | ana02 | Apr 6th, 17:00 | Apr 7th, afternoon | Apr 7th, 11am | disk related hardware issues | |
HPSS | tape archive, restore | March 23, 6:00am | March 23, 6pm | March 23, 14:20 | HPSS upgrade | |
drp-srcf | FFB for TMO+XPP/XCS | March 11th, 10am | March 11th, 7pm | March 12th, 1am | Disable SMT on DRP SRCF nodes | |
psdb(psdb4) | Questionnaire/File Restore/File Manager/LCLS 1 DAQ data mover | Mar 3 6:00PM | Mar 3 10:00PM | Moving databases to new cluster | The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines. | |
All | ldap/dhcp/dns | Feb 24th at 10am | Feb 24th at 1pm | Feb 24th at 3pm | psrelay migration | Netconfig and reboots will fail during this time and name resolution lookups may be delayed for a few seconds. |
psdb(psdb4) | Questionnaire/File Restore/File Manager/LCLS 1 DAQ | Feb 11 9:00AM | Feb 12 2:30AM | Feb 12, 8:00PM | Moving databases to new cluster | The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines. Reverted back to psdb4 owing to routing issues. |
Instrument Network | all NEH/FEH computing | 2/4/2021 10:30am | 2/4/2021 11:20am | Power supply failure | Replaced. Redundant supply added. | |
pswww/pswebkdc | elog/file restore | Jan 20, 2021 ~4:00PM | Jan 21, 2021 | Issues with VMWare hypervisor | ||
psdm0x psdb0x | eLog/File restore etc | Dec 21 | Dec 24 | Dec 22 | Upgrade of backends to new releases of software | |
ana03 | ana03 | Dec 17, 2:49AM | Dec 17, 10:15am | Lustre issue | psossana0303 was stuck and needed a reboot (hard reboot) | |
Ana file systems | ana13 | Nov 18 | Nov 20 | Nov 23, 11am | one ost is not accessible for write | ost0xe cannot be written to due to mds/ost issues. It has been set to ro but writes to ana13 are slowed down. Reboot required. Fixed by rebooting the MDS (the OSSs were also rebooted). |
pshub01 | JupyterHub | Oct 29 1:00PM | Oct 29 5:44PM | Disk failure. | There is a disk failure in the filesystem that stores the JupyterHub sessions. We are trying to recover this and restart the node. We could not recover the sessions; please log out and restart your session. | |
<<All>> | Network File System (NFS) | 10/19/20, 4pm | 10/22/20, 9am | | Upgrade to new Solid State Drive-based, distributed clusters | Remaining /reg/[d,g,neh]/* NFS volumes (see Detailed list) |
ANA file systems | ana02/03/11/13/15 | 10/13/20, 6:30am | 10/15/20, 17:40 | (Unscheduled/failure) | 21:20: UED has been moved to weka-nfs (/cds/data/ued/ana). The ana file systems are accessible now. | |
<<All>> | CDS Router | 10/5/20, 9am | 10/5/20, 12pm | 10/5/20, 1pm | Upgrade to new routers, 100Gb optics | During 4-day PAMM. Actual disruption should be order of minutes. |
<<All>> | Network File System (NFS) | 10/5/20, 12pm | 10/8/20, 12pm | 10/8/20, 12pm | Upgrade to new Solid State Drive-based, distributed clusters | 4-day PAMM. Replace aging legacy systems. Service disruptions may endure for several days as hard-links in applications are identified and repaired in real-time. Only /reg/neh/opr/ and /reg/g/pcds/ mounts were migrated. Future outage(s) will address remaining volumes. |
ANA | ana04 | April 16th | May 22nd | ana04 is down | Hardware problem with one ana04 ost | |
ANA file system | ana14 | Dec, 2019 | May 27th | ana14 is down | Hardware problem with one ana14 oss | |
nfs server | home directories | Sep 29th | Sep 29th, 11:30 | nfs server issues | All servers are back up. Most if not all issues have been fixed. | |
psnfs02 | home directories on home5 | Sep 3, 16:40 | Sep 3, 19:11 | psnfs02 crashed | ||
ANA FS; GPUs, Export, Batch nodes | ANA02, ANA03, ANA11,ANA15, psana, psanafarm, psexport | July 7th, 5:00PM | July 9th, 8:00AM | July 8th, 6:00PM | Circuit breaker replacement | Circuit breaker has been replaced and sensitivity decreased. In order to replace the unit, the entire distribution panel (UDB-C) must be powered off. |
ANA | ana13 | June 8th | June 10th | June 10th, 10:45 | psanaoss232 is down | |
Gateway, Build nodes, ML node | pscag1 - pscag4, psbuild-rhel5, psbuild-rhel6, psbuild-rhel7, psjerry | June 8th 7:30AM | June 8th 9:00AM | June 8th, 6:00PM | Power maintenance | Power maintenance completed. The gateway and build servers are now accessible. |
VMware | pswww, pslogin, psdev | May 18th, 08:00 | May 18th | May 19th, 02:00 | pslogin, psdev and pswww are not accessible. Access using the LCLS NX servers will continue to work. | |
VMware | pswww, pslogin, psdev | May 16th, 10:00 | May 16th, 17:00 | May 17th, 18:30 | VM hypervisor upgrade | Access using the LCLS NX servers will continue to work. |
psnfs02 | home directories | May 12th | May 13th | May 13th, 10:14 | server is down | Some users home directories are not accessible |
psexport | Globus, gridftp | May 11th | May 12 | May 13th, 19:30 | psexport01 is down | psexport works again, gridftp (globus) is available. |
pslogin, pswww | login, web services | May 12th | May 12th | May 12th, 11:20 | virtual machine cluster | Most VMs should be accessible now. |
pswww | Data Management Portal | Feb 29 | Mar 3 5PM | Mar 5th | Migrating databases to LCLS2 infrastructure | We'll be migrating all the databases to LCLS2 data management systems. The current data management portal should be available as read only. |
ANA | Analysis Infrastructure | 4 December 19 4:00PM PDT | Friday, December 13th | superseded | Troubleshooting short-circuit | While bringing up some of the file systems, we encountered electrical issues. The plan is to keep things stable until Monday to make sure the equipment which is currently powered on is working properly. Status: Up: psana, psanaq, psexport, ana02, ana03, ana04, ana12, ana15 Down until further notice: ana14 Update: 2020-02-25 |
ANA file systems | Analysis infrastructure | Nov 14th, 2019 | Nov 27th | Dec 4th | Move from building 50 to building 54 (SRCF) | Update: ANA14, ANA12, ANA04, and ANA02 up by Friday; ANA11 up by Monday. Completed tasks: ana03, ana12, ana13, and ana15 are up. Most of the psana batch queue is now available. The psana interactive pool can now be accessed, but several nodes are still down (e.g. psanagpu115, psanagpu116). The psexport nodes are operational. |
pslogin, kerberos | pslogin, passwordless access | November 8, 2019 11:45AM | TBD | November 11th | 2 Hypervisors are unresponsive | |
ANA file systems | April 11, 12:30pm | Unknown | April 11, 7:30PM | SLAC wide power glitch | 7:30PM: ANA14 is now online. All systems are up and functional. 6:00PM: All ANA lustre filesystems have been recovered except ANA14 which is suffering major hardware issues. We don't know yet how long it will take to recover it. 12:30PM: All systems are up by now except the ANA file systems because the Lustre MDS was damaged. We don't know yet how long it will take to recover it. | |
All | All nodes | January 4th | January 11th | January 25th (User Services) | Server room relocation | Systems that are up: |
psana, psexport | Aug 13, 7pm | Aug 14 | Aug 14 | Configuration error | Wrong MTU setting in building 50 causes psana and psexport to not be able to mount NFS. Will fix this morning. | |
All | All nodes | July 26, 5:00PM | July 30, 5:00PM | August 2nd | Power Outage | Systems are now on-line fully functional. We encountered several systems with bad hardware and corrupted files. Unnecessary systems are now off-line. |
psnfs03 and psnfs04 | /reg/g and /reg/common | Apr 24, 6:30am | Apr 24, 9am | Apr 24, 8:30 | firmware update | |
psnfsopr | Operators home | Apr 24, 6:30am | Apr 24, 9am | Apr 24, 7:30 | Move and firmware update | |
All | Networking will be down, so all machines and services will be unavailable. | Jan 17, 2:45 PM | Jan 17, 5:00 PM | Jan 17, 4:45 PM | Central Router firmware upgrades and replace NFS SAS module. | The new SAS NFS module worked, but psnfs03 and psnfs04 took longer than expected to boot. |
The pslogin, psdev and psnxserv nodes will be unavailable. | User home directories and some /reg/* NFS shares will be unavailable | Dec 26, 6:00 AM | Dec 26, 8:00 PM | Dec 26, 9:30pm | Firmware upgrades | Problems found with one of the NFS servers (psnfs03), may need to take another outage to fix |
All ANA filesystems, interactive (GPU & Phi) nodes, psexport nodes and LSF batch nodes | Science data and associated servers | Dec 26, 6:00 AM | Dec 31, 5:00 PM | Dec 31, 8:00 AM | Electrical Work | |
HPSS Storage System will be unavailable | Data backup and recovery | Dec 25, 6:00 AM | Jan 8, 5:00 PM | Jan 1 | Electrical Work | |
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | Aug 22, 1PM | Aug 23, 8:00 PM | Aug 23, 10:00PM | Cooling and Electrical Work | Outage recovery was late because we had several hardware problems which required intervention. |
| Electrical Work | Outage Canceled. | ||||
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | June 7, 5:30 AM (PDT) | June 7, 6:00 PM (PDT) | June 7, 6:53 PM (PDT) | Electrical Work | Outage recovery was an hour late because server room electrical work extended beyond anticipated outage time and a Lustre-system RAID card failed. |
ana02 | Science data | Wed, April 19, 11 AM | Wed, April 19, 4 PM | Wed, April 19, 4 PM | Update OS and Lustre version | |
psnxserv03 | psnxopr | No Machine upgrade | Please use psnxserv01 and 02 while 03 is upgraded | |||
All | NFS | Fri Feb 24, 2017 9am | Fri Feb 24, 2017 9pm | Fri Feb 24, 2017 8:08pm | NFS upgrade | During this outage it won't be possible to access any user or operator home directories. |
ana12 | Science data | Aug 16, 9am | Aug 16, 12pm | Aug 16, 12pm | Hardware failure | Access to six OSTs is very slow, presumably because of a failing RAID card. We will shutdown one of five ana12 OSSs to replace the card and reconfigure as needed. No data is expected to be lost, but some data will be unavailable during the outage. |
ana04 | Science data | May 5th, 10am | May 10th, 10am | May 30th | Hardware failure | One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but it's extremely slow and sometimes it hangs. The OST has been set read only. Moving the data to other OSTs. End: file system couldn't be recovered, but we were able to move all the data somewhere else, wipe and rebuild the file system. |
Batch nodes | Batch jobs | Oct 5, 2015 11am | Oct 5, 2015 1pm | Move to RHEL7 | Interactive nodes will also be moved to rhel7. Users logged into rhel5 interactive nodes will be able to continue their session, but batch submission will fail once batch nodes are converted to rhel7. | |
HPSS | Restoring files from tape | Sep 22, 2015 | Sep 24, 2015 | Sep 24, 2015 12pm | Upgrade of HPSS to version 7.4 | HPSS will become read-only on Sep 21st at 5pm. |
psnehprioq/psfehprioq | All nodes | April 1, 2015 9am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues. |
psnehq/psfehq | All nodes | March 31, 2015 11am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues. |
All NEH/FEH computing | All LCLS computing services | Dec 31, 00:00hrs | Dec 31, 18:00hrs | Switching of generator power to building power at Building 950 where servers are housed. | ||
All NEH/FEH computing | Nov. 7, 2014 ~12pm | Unknown | 7:30pm | Unscheduled power outage | Power has been restored. We are running file system checks for our NFS servers before we can bring everything else back up. We are hoping to get things back up by 4.30pm. | |
LCLS Offline Filesystem | All LCLS Offline filesystems | Sep 3, 2014 ~7.30pm | N/A | Sep 4, 2014 | Network outage causing offline filesystems to be inaccessible. | We have identified the root cause and the issue should be fixed now. |
LCLS Computing | All LCLS computing services | Friday Aug 8th, Midnight (00:00 hrs) | Monday Aug 18th, 2014 Noon | Aug 18th, 2014 12pm | To prepare for the power shut down on Aug 11. | Not all machines will be shut down at once as we start performing system updates. Please do not depend on any service past midnight Aug 10th. |
psexport, psana104, psana105 | Export nodes, HPSS, scratch and calib backups | Thursday August 7th, Noon | Monday Aug 18th, 2014 COB | These machines need to be moved to Bld 50. | ||
LCLS Online Computing | All Online Computing Nodes, DSS, FFB | Tuesday Aug 5th, 2014 | Wednesday Aug 20th, 2014 COB | CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move. | ||
LCLS Offline Filesystem | All LCLS Offline filesystems ana01, ana02, ana03. ana04. ana11, ana12, ana14 | Tuesday Aug 5th, 2014 | Monday Aug 18th, 2014 COB | Offline Analysis Hardware will be moved from B950 and B999 to B50. | The equipment needs to be disconnected, and carefully moved before the power outage on Aug 11, and reconnected at B50. | |
LSF | LSF Job Submission and Management | Wednesday, Mar 19, 2014 | N/A | Wednesday, 6:45pm | Unplanned software outage | No job can be submitted and managed at this time. We were informed that software vendor has been contacted, and SLAC Computing Division is working on the issue. Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF. |
All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall) | All services at LCLS | Friday, Aug 9th, 2013 | Tuesday, Aug 13th, 2013 | Planned power outage at LCLS buildings | ||
psana11*, psana12* | LSF Compute Nodes | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Unplanned power outage at SLAC | Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics. | |
psanafeh | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Friday, May 31st, 2013 10:37am | Unplanned power outage at SLAC | ||
psexport01 | Thursday, May 30th, 2013 16:30hrs | Friday May 31st, 2013 12:00hrs | Friday, May 31st, 2013 11.00am | Unplanned power outage at SLAC | ||
ana01, ana02 | /reg/d/ana01, /reg/d/ana02 filesystems | Thursday, May 30th, 2013 16:30hrs | Friday May 31st, 2013 14:00hrs | Friday, May 31st, 6pm. | Unplanned power outage at SLAC | |
pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Tuesday, Mar 26th, 2013 | Tuesday, Mar 26th, 2013 | Tuesday, Mar 26th, 2013 | | |
pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Monday, Jan 7th, 2013 (1030 hrs) | Monday, Jan 7th, 2013 | Wednesday, Jan 9th, 2013 | RAID controller malfunctioned upon power restoral after planned power outage in B950 203A | pssrv101 (old data) was used to bring up the FEE nodes for part of the outage. pssrv100 was restored to operation after a new RAID controller was delivered and installed. |
ana01 | /reg/d/ana01 filesystem | Tuesday, Dec 18th 2012 | unknown | Partial (98%) restoral Monday Dec 24th (0800 hrs) | Controller failed causing corrupted parity data | Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days. |
psanaoss21* | /reg/d/ana12 filesystem | Monday, Oct 8th, 2012 (1700 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Hardware upgrades | |
psanaoss2** | /reg/d/ana11 and /reg/d/ana12 filesystem | Thursday, Sep 27, 2012 (1700 hrs) | Friday, Sep 28, 2012 (0100 hrs) | Friday, Sep 28, 2012 (0400 hrs) | Hardware upgrades | |
Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE. | All computing services at LCLS. | Wednesday August 15, 2012 | August 17, 2012 1:00 PM | SLAC sitewide power outage on August 16. | Expect logging in to any machines to be unavailable between 8/15 and 8/17 even if some of the servers are powered up before the expected end time. There will be maintenance performed on various servers during these 2 days. | |
All machines in XPP hutch and control room will be inaccessible. | XPP | Monday April 9, 2012 11:15AM | Monday April 9, 2012 11:45AM | Monday April 9, 2012 11:30AM | Electrical Work at XPP Hutch | Completed |
ana01/ana02 file systems | Wed Mar 28th, 2012 9am | Wed Mar 28th, 2012 1pm | Wed Mar 28th, 2012 4pm | Upgrade to IB | Completed | |
psananeh | NEH storage and processing | | Tue Dec 27, 2011 4pm | Mon Dec 26, 2011 1pm | | Completed. Chilled water was restored on Friday. |
psana batch nodes | All Science data is currently unavailable. psananeh and psanafeh are up for Matlab use, but there is no access to data on the Lustre file system. | Saturday Oct 1, 2011 6am | Lustre file system remains down after the unplanned power outage on Saturday. | The system administrators are working to bring them back. | ||
All LCLS computing services | Monday | Electrical work at NEH server room and FEH. | pslogin is up. NFS server, LDAP, DNS, pswww are up. | |||
psana | All Science data, All user home directories, all DAQ cache nodes. All online services. | Wed | Wed | Wed | Upgrade of Lustre hardware. | |
psana | Science data access | Tue | Tue | NEH power outage | B950 and several other buildings experienced a short power glitch; the Lustre file servers did not survive the interruption and are still being brought up. | |
psana | Science data access | Thu | Thu | Lustre failover testing. | ||
NEH online nodes | NEH DAQ, outside ssh access | Thu | Thu | Server room upgrade, ana02 memory upgrade | Completed | |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now. |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed |
All | All | Fri | Sun | Sun | NEH power outage | Completed |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed |
psana | Science data access | Fri | Mon | NEH cooling outage | Completed | |
psana | Science data access | Thu | Thu | Enabling HA for Lustre system | Completed | |
All | All | Sat | Sat | Mon | NEH power cut | Completed |
psana | Science data access | Thu | Thu | Lustre testing | Completed | |
All | All | Wed | Wed | NEH power cut | This power cut was NOT planned | |
All | All | Sat | Sat | Mon | NEH power cut | Completed |