All times are PT (Pacific Time). Red entries are active. Most recent entry first.
Nodes | Services | Start Time | Expected End Time | Actual End Time | Reason | Impact and Comments |
---|---|---|---|---|---|---|
============ Completed ============ | ||||||
S3DF data storage | sdfdata | Feb 9th, morning | unknown | Feb 10th, 15:00 | The object store for the S3DF data storage (/sdf/data/lcls/ds) has connection problems and data on it cannot be accessed; therefore accessing files will likely fail with | Please refrain from writing new data to /sdf/data/lcls/ as that might fill up the flash storage. The object store itself is up and the data are fully protected; the issue is that the fast flash layer cannot interact with the object store. The file systems are available now. |
PCDSN Network | All Services | Nov 27th, noon | Dec 11th | Scheduled NEH power outage. 11/28-12/8 | The PCDSN routers are located in NEH, thus, all PCDSN services will be impacted: Storage (NFS), Login (Bastion Hosts), NoMachine, Interactive, DRP, Build, and Diskless (IOC and DAQ). Service in 208 will go down only when power is down. Controls and DAQ servers in affected areas will be turned off starting at noon the day before the power outage. When power is restored, systems will be restored in this order
EPICS will remain offline until the above services are restored. Recovery can take up to 8hrs after these services are restored. PLC systems configured for automatic recovery will resume as soon as power is restored. S3DF will not be impacted during this outage. | |
S3DF | Batch Computing and Interactive Nodes | Oct 16th, 8:00 PM | Oct 17th, 12:01 AM | Oct 17th, 11:00 PM | Weka upgrades | Upgrading Weka to the latest release to fix ACL issues. Although no impact is expected, this is being added for informational purposes. |
Home directories on the psana/daq systems. | All, except S3DF. | Sep 7th, 2023, 12:30pm | Sep 7th, 2023 5pm | Weka problems | IT writes: If you experience issues, first try to log out and log back in again. If that doesn't work, then reboot the host. | |
DRP | June 25th 8:00 PM | July 5th 2:00 PM | June 16th, 2023 7:00 PM | Finalize the storage client upgrade. | DRP {cmp,eb,mon} nodes in SRCF are not available | |
psana | ana lustre filesystems, data movers, interactive machines | July 6, 9:20am | July 6, 4:00pm | Unplanned power outage | ||
ALL | ALL | June 25th at 8:00 PM | July 4th | SRCF-II Electrical System | NFS, DHCP, Home directories, Archiving, Logging, Grafana, Routing (inter-subnet traffic), EPICS + Gateways, DAQ + analysis systems, Virtual machines (e.g. lfe-console), ssh, NTP. PLCs and networked controllers will continue to function insofar as no inter-subnet communication is required. PLC work and maintenance will require local connection. Intra-subnet traffic will still be possible. IOC hosts may remain online but performance will likely degrade until they crash and will not recover until the outage is over. Reboot of any system may fail due to lack of IP assignment (DHCP is down). Archiving and logging will be dark. This outage is required to integrate the new SRCF-II building with the existing data center. | |
FFB | FFB Storage | May 1st, 2023 6:00 AM | May 2nd, 2023 6:00 AM | May 5th ?? | Preparation for S3DF functionality integration. | Upgrading system and storage version. LCLS2 DAQ unavailable, LCLS1 DAQ only able to record to event-nodes during outage. The FFB was upgraded, but some clients still need some work, which will be done by July 4th (SRCF-II outage). |
pslogin, psana, NoMachine | Login | June 16, 2023, 2:00pm | June 16th, 2023 7:00 PM | Unplanned power outage in SRCF rack F1 | ||
Kafka | Kafka, elog | April 28th, 2023 14:00 | 17:00 same day | 16:00 | Switching data-management from PCDS Kafka to S3DF Kafka | The elog and some services might not be available for a short time. Let pcds-datamgt-l know if something is still amiss. |
ana02 | ana02 | April 4th, 2023 | Decommissioning offline filesystem ana02 | All data on ana02 were copied, cleaned or archived. | ||
All | All | March 21, 2023 3pm | Unknown | March 22, 2023 4pm | SLAC site-wide power outage | |
ARP | Mar 9, 2023 9AM | Mar 12, 2023 9PM | Mar 9, 2023 9PM | S3DF support | We're adding S3DF support for the ARP and as a result will be making major changes to the Automatic Run Processing infrastructure. | |
psana | ANA03 | Feb 23, 2023 11:30AM | Unknown | Feb 23, 2023 1:00PM | Unhappy OSTs. | From IT: There was an issue with the zpools in ANA03. After coordinating the restarts, all seem operational. |
All | All | Jan 18, 2023 7AM | Unknown | Jan 18, 2023 4PM | Planned power outage in B950 | pslogin and psexport nodes are reachable again as of ~4pm. (nodes located in SRCF, but not those in NEH) |
ConfigDB | DAQ Configuration for LCLS2 | Oct 18 3PM | Oct 18 5PM | Oct 18 4PM | Enhancements for restore configuration | |
psdb, psdm | eLog, data-mover | Sep 1st, 9am | Sep 2nd, 9am | Sep 1st, 4pm | Updating OS, Kafka, MongoDB | MongoDB, Kafka, pswww were updated. All services are up and running, but some checks are still being performed. |
psossana0101 | ANA01 | Aug 29th, 17:10 | Aug 30, 10am | disk issues | Three OSTs of ana01 are not accessible and access to ana01 will hang | |
Many | NFS | July 24, 7:00 | July 25, 14:00 | Some Weka servers had problems. | Weka NFS filesystem showing multiple problems on many nodes | |
ANA FS | Lustre | June 24th, 8:00 | June 24, 8:00 | June 24, 15:00 | Relocate OSSs | |
ANA FS | Interactive, Batch, ANA FS, Export | May 25, 9:00AM | May 25, 10:00AM | May 25, 11:22AM | Upgrade firmware to fix bug. | Part of the core upgrade, but encountered issues during the upgrade. |
Core Services Network | All | May 25, 9:00AM | May 25, 10:00AM | May 25, 10:12AM | Upgrade firmware to fix bug. | No server reboots will be required. |
All Diskless Systems | All Diskless Systems | April 21, 12:10PM | TBD | April 21, 1:10PM | Diskless system was rebooted, did not come back as expected | Critical impact |
WEKA Cluster | NFS | April 7th, 8:00 AM | April 7th, 9:00 AM | April 7th, 9:30 AM | Patch Software | Minimal impact, no reboot required. Expect latency during the process. |
Network Reorganization #3 | ANA FS, Export Services, Interactive Pool and Batch cluster. | TBD | Network Upgrade | |||
psana and ffb batch nodes | ana15 filesystem | Feb. 15 2022, 06:30 | Feb. 15 2022, 11:10 | OSS reboot was necessary | Stopped again at 9:40; OSS and MDS were rebooted | |
psana and ffb batch nodes | ana01/02/03/15/16 filesystems | Feb. 9 2022, 0:00 | Feb 9, 2022, 16:20 | Some of the oss or mds have problems with their IP addresses. | ioc-und-rec01 was misconfigured and created a conflicting IP address with ana02/03 | |
MFX nodes | Network | |||||
psmetric04 | grafana (psdm) | Jan 12, 10:40 | Jan 12, 10:45 | Upgrade 6.7.1 -> 8.3.3. | PSDH-161 | |
Network Reorganization #2 | Name Services, Login Services, NoMachine, Web Services | Jan 6, 2022 9:00 AM | 12:00 PM | 1:30PM | Network Upgrade | |
Network Reorganization #1 | ALL | Jan 5, 2022 8:00 AM | 12:00 PM | Establish LCLS Routers | ||
Weka | ALL | Jan 5, 2022 9:00 AM | 10:00 AM | New Features and bug | ||
XCS Nodes | Network | Nov 16, 9.30am | Switch upgrade | |||
psdb0x | eLog, file manager, all web services | Dec 8th, 9AM | Dec 8th, 12PM | Dec 8th, 10AM | Upgrading Mongo to 4.4.10. | |
drp-neh-xxx | NEH DRP | Upgrading to the latest Lustre version | No DRP in the NEH alcove | |||
All NEH and FEH nodes | IOCs and DAQ nodes, alcove DRP, and control room workstations | Aug 19, 6:00AM | Sep 18, 5:00PM | Electrical work in experimental halls | Systems in the experimental halls will be on-line whenever power is available. Note: psana is not in the experimental halls and won't be affected. | |
psanaoss215 | ana11, ana12 | Dec 2 | oss psanaoss215 of ana11 is down | decommissioned | ||
psanaoss231 | ana13 | Aug 5th, 5am | oss problem after power outage | decommissioned | ||
all nodes | nfs file-system | Aug 31, 10am | Aug 31, 20:00 | Some weka filesystems are not available. | File systems (e.g. /cds/sw, /cds/groupcfel/..) show I/O errors. Sometimes it might work. | |
xtc file restore | Nov 6th | Nov 8th | Nov 7th | Data Restore from tape is not working | HPSS might have just been very slow mounting the tapes. | |
pshub01 | JupyterHub | Oct 21, 9am | Oct 21, 9am | Oct 21, 9am | Upgrade hub version | Upgrade hub from 1.4.2 -> 2.0.0; fix matplotlib/ipympl issue |
psjhub01 | JupyterHub | Sep 16, 9am | Sep 16, 9:15am | Sep 16, 9:05am | Upgrade jupyterhub code | |
questionnaire | LCLS and UED questionnaires | Sep 14, 6PM | Sep 14, 10PM | Upgrade the questionnaire backend to new database | ||
psexport | All data mover services | Aug 16, 1:00PM | Aug 19, 3:00PM | Aug 19, 15:30 | All services have been started except the NEH data mover. | |
Diskless nodes | IOC and DAQ nodes | Aug 16, 9:00AM | Aug 20, 1:00PM | Diskless server will be migrated to the weka cluster | Cancelled | |
CXI Nodes | Network | Aug 18, 2:00PM | Aug 23, 5:00PM | Aug 27, 5:00PM | Network upgrade | No control room workstations or IOCs working in CXI during this time. |
MEC Nodes | Network | Aug 18, 2:00PM | Aug 23, 5:00PM | Aug 26, 2:00PM | Network upgrade | No control room workstations or IOCs working in MEC during this time. |
psnfsluste | ARP, Summary results | Aug 15th, 5:00PM | Aug 25, 1am | Various file system upgrades | ana-filesystems have been mounted on the psdmNN nodes | |
psdbdev01 | LCLS2 calibration service | Aug 26th, 5PM | Aug 26th, 9PM | Aug 26th, 11:30PM | Move the databases to psdb0x machines | May affect analysis for TMO/RIX experiments. |
drp-srcf-xxx | Fast feedback | Aug 15th, 5:00PM | Aug 25th, 5:00PM | Aug 25, 11:30PM | Reorganizing DRP racks and connectivity, upgrading Weka FFB Cluster | No fast feedback analysis during this time. |
psana, JupyterHub | Interactive pool, batch nodes | Aug 16, 9:00AM | Aug 19, 2:00PM | Upgrading file systems, batch and interactive nodes | Some data analysis capabilities. Working on getting the remaining interactive and batch nodes on-line. | |
pswww | Web Services | Aug 16, 6:00PM | Aug 17, 12:00PM | Aug 17, 4:15PM | Service failed after upgrade. | Permission issues from the Weka upgrade |
psnx, pslogin, psdev | All login services | Aug 16, 1:00PM | Aug 16, 5:00PM | Aug 16, 6:30PM | Upgrading host to latest packages and services | No ability to ssh into the system during this time. |
psweka | NFS | Aug 16th, 9:00AM | Aug 16th, 1:00PM | Aug 16th, 5:00PM | Upgrading and re-configuring Weka Cluster | User home directories won't be available during this time, so nothing will work. This will require rebooting all LCLS servers to use the new NFS stack. |
psanagpu, lustre | interactive, jupyter, anafs | Aug 5th, 5am | Aug 5th, 9:20am | power issue SRCF | Most systems have been restored. Only ana13 needs some attention. | |
Weka Cluster | NFS (home directories, central storage, all related systems) | 5/22/2021 | Unknown | 5/24/2021 ~10pm | Under investigation | The LCLS IT team is working with the vendor to diagnose root cause and future actions. The software version was rolled back, and the system is performing more stably. |
psdb0x, psdm0x, pswww0x | logbooks, movers, questionnaires, most of the data management infrastructure | May 18, 9:00AM | May 18, 12:00PM | May 18, 9:55AM | Upgrading all the NFS mounts to the new version of NFS. | |
psweka | All | April 28th, 8:00PM | April 28th, 9:00PM | April 28th, 9:45PM | Deploying NFS-Ganesha | |
psanaoss121 | ana02 | Apr 23, 18:15 | Apr 26, 11:50 | oss crashed | ||
psweka | All | April 26th 8am | April 26th 12pm | April 26th, 11am | Weka upgrade | Nothing will work during this time |
psdb0x, psdm0x | logbooks, movers, questionnaires, most of the data management infrastructure | Apr 21, 3:00PM | Apr 21, 4:00PM | Apr 21, 5:00 PM | Moving all machines to 10Gbps networks. | psdm02 had a bad NIC port, we had to reconfigure this to use the alternate. |
psanaoss12n | ana02 | Apr 15, 4:30pm | Apr 15, 7pm | Apr 15, 5:40pm | Replacing broken fans | |
psexport | globus, data transfers | March 24, 9:00am | March 24, 10:00am | reboot psexports to remove ana11/12 | ||
psanaoss121 | ana02 | Apr 6th, 17:00 | Apr 7th, afternoon | Apr 7th, 11am | disk related hardware issues | |
HPSS | tape archive, restore | March 23, 6:00am | March 23, 6pm | March 23, 14:20 | HPSS upgrade | |
drp-srcf | FFB for TMO+XPP/XCS | March 11th, 10am | March 11th, 7pm | March 12th, 1am | Disable SMT on DRP SRCF nodes | |
psdb(psdb4) | Questionnaire/File Restore/File Manager/LCLS 1 DAQ data mover | Mar 3 6:00PM | Mar 3 10:00PM | Moving databases to new cluster | The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines. | |
All | ldap/dhcp/dns | Feb 24th at 10am | Feb 24th at 1pm | Feb 24th at 3pm | psrelay migration | Netconfig and reboots will fail during this time and name resolution lookups may be delayed for a few seconds. |
psdb(psdb4) | Questionnaire/File Restore/File Manager/LCLS 1 DAQ | Feb 11 9:00AM | Feb 12 2:30AM | Feb 12, 8:00PM | Moving databases to new cluster | The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines. Reverted back to psdb4 owing to routing issues. |
Instrument Network | all NEH/FEH computing | 2/4/2021 10:30am | 2/4/2021 11:20am | Power supply failure | Replaced. Redundant supply added. | |
pswww/pswebkdc | elog/file restore | Jan 20, 2021 ~4:00PM | Jan 21, 2021 | Issues with VMWare hypervisor | ||
psdm0x psdb0x | eLog/File restore etc | Dec 21 | Dec 24 | Dec 22 | Upgrade of backends to new releases of software | |
ana03 | ana03 | Dec 17, 2:49AM | Dec 17, 10:15am | Lustre issue | psossana0303 was stuck and needed a reboot (hard reboot) | |
Ana file systems | ana13 | Nov 18 | Nov 20 | Nov 23, 11am | one ost is not accessible for write | ost0xe cannot be written to due to mds/ost issues. It has been set to ro but writes to ana13 are slowed down. Reboot required. Fixed by rebooting MDS (OSSs were also rebooted). |
pshub01 | JupyterHub | Oct 29 1:00PM | Oct 29 5:44PM | Disk failure. | There is a disk failure in the filesystem that stores the JupyterHub sessions. We are trying to recover this and restart the node. We could not recover the sessions; please log out and restart your session. | |
<<All>> | Network File System (NFS) | 10/19/20, 4pm | 10/22/20, 9am | | Upgrade to new Solid State Drive-based, distributed clusters | Remaining /reg/[d,g,neh]/* NFS volumes (see Detailed list) |
ANA file systems | ana02/03/11/13/15 | 10/13/20, 6.30am | 10/15/20, 17:40 | (Unscheduled/failure) | 21:20: UED has been moved to weka-nfs: /cds/data/ued/ana. The ana-filesystems are accessible now. | |
<<All>> | CDS Router | 10/5/20, 9am | 10/5/20, 12pm | 10/5/20, 1pm | Upgrade to new routers, 100Gb optics | During 4-day PAMM. Actual disruption should be order of minutes. |
<<All>> | Network File System (NFS) | 10/5/20, 12pm | 10/8/20, 12pm | 10/8/20, 12pm | Upgrade to new Solid State Drive-based, distributed clusters | 4-day PAMM. Replace aging legacy systems. Service disruptions may endure for several days as hard-links in applications are identified and repaired in real-time. Only /reg/neh/opr/ and /reg/g/pcds/ mounts were migrated. Future outage(s) will address remaining volumes. |
ANA | ana04 | April 16th | May 22nd | ana04 is down | Hardware problem with one ana04 ost | |
ANA file system | ana14 | Dec 2019 | May 27th | ana14 is down | Hardware problem with one ana14 oss | |
nfs server | home directories | Sep 29th | Sep 29th, 11:30 | nfs server issues | All servers are back up. Most if not all issues have been fixed. | |
psnfs02 | home directories on home5 | Sep 3, 16:40 | Sep 3, 19:11 | psnfs02 crashed | ||
ANA FS; GPUs, Export, Batch nodes | ANA02, ANA03, ANA11,ANA15, psana, psanafarm, psexport | July 7th, 5:00PM | July 9th, 8:00AM | July 8th, 6:00PM | Circuit breaker replacement | Circuit breaker has been replaced and sensitivity decreased. In order to replace the unit, the entire distribution panel (UDB-C) must be powered off. |
ANA | ana13 | June 8th | June 10th | June 10th, 10:45 | psanaoss232 is down | |
Gateway, Build nodes, ML node | pscag1 - pscag4, psbuild-rhel5, psbuild-rhel6, psbuild-rhel7, psjerry | June 8th 7:30AM | June 8th 9:00AM | June 8th, 6:00PM | Power maintenance | Power maintenance completed. The gateway and build servers are now accessible. |
VMware | pswww, pslogin, psdev | May 18th, 08:00 | May 18th | May 19th, 02:00 | pslogin, psdev and pswww are not accessible. Access using the LCLS NX servers will continue to work. | |
VMware | pswww, pslogin, psdev | May 16th, 10:00 | May 16th, 17:00 | May 17th, 18:30 | VM hypervisor upgrade | Access using the LCLS NX servers will continue to work. |
psnfs02 | home directories | May 12th | May 13th | May 13th, 10:14 | server is down | Some users home directories are not accessible |
psexport | Globus, gridftp | May 11th | May 12 | May 13th, 19:30 | psexport01 is down | psexport works again, gridftp (globus) is available. |
pslogin, pswww | login, web services | May 12th | May 12th | May 12th, 11:20 | virtual machine cluster | Most VMs should be accessible now. |
pswww | Data Management Portal | Feb 29 | Mar 3 5PM | Mar 5th | Migrating databases to LCLS2 infrastructure | We'll be migrating all the databases to LCLS2 data management systems. The current data management portal should be available as read only. |
ANA | Analysis Infrastructure | 4 December 19 4:00PM PDT | Friday, December 13th | superseded outages | Troubleshooting short-circuit | While bringing up some of the file systems, we encountered electrical issues. The plan is to keep things stable until Monday to make sure the equipment which is currently powered on is working properly. Status: Up: psana, psanaq, psexport, ana02, ana03, ana04, ana12, ana15. Down until further notice: ana14. Update: 2020-02-25 |
ANA file systems | Analysis infrastructure | Nov 14th, 2019 | Nov 27th | Dec 4th | Move from building 50 to building 54 (SRCF) | Update: ANA14, ANA12, ANA04, and ANA02 up by Friday; ANA11 up by Monday. Completed tasks: ana03, ana12, ana13, and ana15 are up. Most of the psana batch queue is now available. The psana interactive pool can now be accessed, but several nodes are still down (e.g. psanagpu115, psanagpu116, etc.). The psexport nodes are operational. |
pslogin, kerberos | pslogin, passwordless access | November 8, 2019 11:45AM | TBD | November 11th | 2 Hypervisors are unresponsive |
ANA file systems | April 11, 12:30pm | Unknown | April 11, 7:30PM | SLAC wide power glitch | 7:30PM: ANA14 is now online. All systems are up and functional. 6:00PM: All ANA lustre filesystems have been recovered except ANA14 which is suffering major hardware issues. We don't know yet how long it will take to recover it. 12:30PM: All systems are up by now except the ANA file systems because the Lustre MDS was damaged. We don't know yet how long it will take to recover it. | |||
All | All nodes | January 4th | January 11th | January 25th (User Services) | Server room relocation | Systems that are up: |
psana, psexport | | Aug 13, 7pm | Aug 14 | Aug 14 | Configuration error | Wrong MTU setting in building 50 causes psana and psexport to not be able to mount NFS. Will fix this morning. |
All | All nodes | July 26, 5:00PM | July 30, 5:00PM | August 2nd | Power Outage | Systems are now on-line fully functional. We encountered several systems with bad hardware and corrupted files. Unnecessary systems are now off-line. |
psnfs03 and psnfs04 | /reg/g and /reg/common | Apr 24, 6:30am | Apr 24, 9am | Apr 24, 8:30 | firmware update |
psnfsopr | Operators home | Apr 24, 6:30am | Apr 24, 9am | Apr 24, 7:30 | Move and firmware update |
All | Networking will be down, so all machines and services will be unavailable. | Jan 17, 2:45 PM | Jan 17, 5:00 PM | Jan 17, 4:45 PM | Central Router firmware upgrades and replace NFS SAS module. | The new SAS NFS module worked, but psnfs03 and psnfs04 took longer than expected to boot. |
The pslogin, psdev and psnxserv nodes will be unavailable. | User home directories and some /reg/* NFS shares will be unavailable | Dec 26, 6:00 AM | Dec 26, 8:00 PM | Dec 26, 9:30pm | Firmware upgrades | Problems found with one of the NFS servers (psnfs03), may need to take another outage to fix |
All ANA filesystems, interactive (GPU & Phi) nodes, psexport nodes and LSF batch nodes | Science data and associated servers | Dec 26, 6:00 AM | Dec 31, 5:00 PM | Dec 31, 8:00 AM | Electrical Work |
HPSS Storage System will be unavailable | Data backup and recovery | Dec 25, 6:00 AM | Jan 8, 5:00 PM | Jan 1 | Electrical Work |
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | Aug 22, 1PM | Aug 23, 8:00 PM | Aug 23, 10:00PM | Cooling and Electrical Work | Outage recovery was late because we had several hardware problems which required intervention. |
|
Electrical Work | Outage Canceled. | |||||
All ana filesystems, interactive nodes, psexport nodes and most batch nodes | Science data and associated servers | June 7, 5:30 AM (PDT) | June 7, 6:00 PM (PDT) | June 7, 6:53 PM (PDT) | Electrical Work | Outage recovery was an hour late because server room electrical work extended beyond anticipated outage time and a Lustre-system RAID card failed. |
ana02 | Science data | Wed, April 19, 11 AM | Wed, April 19, 4 PM | Wed, April 19, 4 PM | Update OS and Lustre version |
psnxserv03 | psnxopr | | | | NoMachine upgrade | Please use psnxserv01 and 02 while 03 is upgraded |
All | NFS | Fri Feb 24, 2017 9am | Fri Feb 24, 2017 9pm | Fri Feb 24, 2017 8:08pm | NFS upgrade | During this outage it won't be possible to access any user or operator home directories. |
ana12 | Science data | Aug 16, 9am | Aug 16, 12pm | Aug 16, 12pm | Hardware failure | Access to six OSTs is very slow, presumably because of a failing RAID card. We will shutdown one of five ana12 OSSs to replace the card and reconfigure as needed. No data is expected to be lost, but some data will be unavailable during the outage. |
ana04 | Science data | May 5th, 10am | May 10th, 10am | May 30th | Hardware failure | One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but it's extremely slow and sometimes it hangs. The OST has been set read-only. Moving the data to other OSTs. End: file system couldn't be recovered, but we were able to move all the data somewhere else, wipe and rebuild the file system. |
Batch nodes | Batch jobs | Oct 5, 2015 11am | Oct 5, 2015 1pm | | Move to RHEL7 | Interactive nodes will also be moved to rhel7. Users logged into rhel5 interactive nodes will be able to continue their session, but batch submission will fail once batch nodes are converted to rhel7. |
HPSS | Restoring files from tape | Sep 22, 2015 | Sep 24, 2015 | Sep 24, 2015 12pm | Upgrade of HPSS to version 7.4 | HPSS will become read-only on Sep 21st at 5pm. |
psnehprioq/psfehprioq | All nodes | April 1, 2015 9am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues. |
psnehq/psfehq | All nodes | March 31, 2015 11am | April 1, 2015 8pm | March 31, 2015 | Maintenance on nodes to allow addition of 640 cores to computing system | Ended early due to technical issues. |
All NEH/FEH computing | All LCLS computing services | Dec 31, 00:00hrs | Dec 31, 18:00hrs | | Switching of generator power to building power at Building 950 where servers are housed. | |
All NEH/FEH computing | | Nov. 7, 2014 ~12pm | Unknown | 7:30pm | Unscheduled power outage | Power has been restored. We are running file system checks for our NFS servers before we can bring everything else back up. We are hoping to get things back up by 4.30pm. |
LCLS Offline Filesystem | All LCLS Offline filesystems | Sep 3, 2014 ~7.30pm | N/A | Sep 4, 2014 | Network outage causing offline filesystems to be inaccessible. | We have identified the root cause and the issue should be fixed now. |
LCLS Computing | All LCLS computing services | Friday Aug 8th, Midnight (00:00 hrs) | Monday Aug 18th, 2014 Noon | Aug 18th, 2014 12pm | To prepare for the power shut down on Aug 11. | Not all machines will be shut down at once as we start performing system updates. Please do not depend on any service past midnight Aug 10th. |
psexport, psana104, psana105 | Export nodes, HPSS, scratch and calib backups | Thursday August 7th, Noon | Monday Aug 18th, 2014 COB | | These machines need to be moved to Bld 50. | |
LCLS Online Computing | All Online Computing Nodes, DSS, FFB | Tuesday Aug 5th, 2014 | Wednesday Aug 20th, 2014 COB | | CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move. | |
LCLS Offline Filesystem | All LCLS Offline filesystems ana01, ana02, ana03, ana04, ana11, ana12, ana14 | Tuesday Aug 5th, 2014 | Monday Aug 18th, 2014 COB | | Offline Analysis Hardware will be moved from B950 and B999 to B50. | The equipment needs to be disconnected, and carefully moved before the power outage on Aug 11, and reconnected at B50. |
LSF | LSF Job Submission and Management | Wednesday, Mar 19, 2014 | N/A | Wednesday, 6:45pm | Unplanned software outage | No job can be submitted and managed at this time. We were informed that software vendor has been contacted, and SLAC Computing Division is working on the issue. Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF. |
All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall) | All services at LCLS | Friday, Aug 9th, 2013 | Tuesday, Aug 13th, 2013 | | Planned power outage at LCLS buildings | |
psana11*,psana12* | LSF Compute Nodes | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | | Unplanned power outage at SLAC | Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics. |
psanafeh | | Thursday, May 30th, 2013 16:30hrs | Friday, May 31st, 2013 12:00hrs | Friday, May 31st, 2013 10:37am | Unplanned power outage at SLAC | |
psexport01 | | Thursday, May 30th, 2013 16:30hrs | Friday May 31st, 2013 12:00hrs | Friday, May 31st, 2013 11:00am | Unplanned power outage at SLAC | |
ana01, ana02 | /reg/d/ana01, /reg/d/ana02 filesystems | Thursday, May 30th, 2013 16:30hrs | Friday May 31st, 2013 14:00hrs | Friday, May 31st, 6pm. | Unplanned power outage at SLAC |
pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Tuesday, Mar 26th, 2013 | Tuesday, Mar 26th, 2013 | Tuesday, Mar 26th, 2013 | | |
pssrv100 (psnfs) | NFS mountpoint for PCDS diskless nodes | Monday, Jan 7th, 2013 (1030 hrs) | Monday, Jan 7th, 2013 | Wednesday, Jan 9th, 2013 | RAID controller malfunctioned upon power restoral after planned power outage in B950 203A | pssrv101 (old data) was used to bring up the FEE nodes for part of the outage. pssrv100 was restored to operation after a new RAID controller was delivered and installed. |
ana01 | /reg/d/ana01 filesystem | Tuesday, Dec 18th 2012 | unknown | Partial (98%) restoral Monday Dec 24th (0800 hrs) | Controller failed causing corrupted parity data | Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days. |
psanaoss21* | /reg/d/ana12 filesystem | Monday, Oct 8th, 2012 (1700 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Monday, Oct 8th, 2012 (1900 hrs) | Hardware upgrades |
psanaoss2** | /reg/d/ana11 and /reg/d/ana12 filesystem | Thursday, Sep 27, 2012 (1700 hrs) | Friday, Sep 28, 2012 (0100 hrs) | Friday, Sep 28, 2012 (0400 hrs) | Hardware upgrades |
Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE. | All computing services at LCLS. | Wednesday August 15, 2012 | August 17, 2012 1:00 PM | | SLAC sitewide power outage on August 16. | Expect logging in to any machines to be unavailable between 8/15 and 8/17 even if some of the servers are powered up before the expected end time. There will be maintenance performed on various servers during these 2 days. |
All machines in XPP hutch and control room will be inaccessible. | XPP | Monday April 9, 2012 11:15AM | Monday April 9, 2012 11:45AM | Monday April 9, 2012 11:30AM | Electrical Work at XPP Hutch | Completed |
ana01/ana02 file systems | Wed Mar 28th, 2012 9am | Wed Mar 28th, 2012 1pm | Wed Mar 28th, 2012 4pm | Upgrade to IB | Completed | |
psananeh | NEH storage and processing | | Tue Dec 27, 2011 4pm | Mon Dec 26, 2011 1pm | | Completed. Chilled water was restored on Friday. |
psana batch nodes | All Science data is currently unavailable. Psananeh psanafeh is up for Matlab use, but no access to data on Lustre file system. | Saturday Oct 1, 2011 6am | | | Lustre file system remains down after the unplanned power outage on Saturday. | The system administrators are working to bring them back. |
All LCLS computing services | Monday | Electrical work at NEH server room and FEH. | pslogin is up. NFS server, LDAP, DNS, pswww are up. | ||||
psana, | All Science data, All user home directories, all DAQ cache nodes. All online services. | Wed | Wed | Wed | Upgrade of Lustre hardware. |
psana | Science data access | Tue | Tue | NEH power outage | B950 and several other buildings experienced a short power glitch, but the Lustre file servers did not survive the interruption and are still being brought up. | ||
psana | Science data access | Thu | Thu | | Lustre failover testing. | |
NEH online nodes | NEH DAQ, outside ssh access | Thu | Thu | | Server room upgrade, ana02 memory upgrade | Completed |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now. |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed |
All | All | Fri | Sun | Sun | NEH power outage | Completed |
psana | Science data access | Thu | Thu | Thu | Lustre maintenance | Completed |
psana | Science data access | Fri | Mon | | NEH cooling outage | Completed |
psana | Science data access | Thu | Thu | | Enabling HA for Lustre system | Completed |
All | All | Sat | Sat | Mon | NEH power cut | Completed |
psana | Science data access | Thu | Thu | | Lustre testing | Completed |
All | All | Wed | Wed | | NEH power cut | This power cut was NOT planned |
All | All | Sat | Sat | Mon | NEH power cut | Completed |