Page History

All times are PT (Pacific Time). Red entries are active. Most recent entry first.


Nodes	Services	Expected End Time	Reason	Impact and Comments
============= Ongoing ============
Restore from tape	filemanager ws	Feb 2024	After the switch to the S3DF ANAFS the restore service needs to be updated.	File restore and status for files in S3DF is not working

============ Upcoming ============

Nodes	Services	Start Time	Expected End Time	Actual End Time	Reason	Impact and Comments
============ Completed ============
S3DF data storage	sdfdata	Feb 9th, morning	unknown	Feb 10th, 15:00	The object store for the S3DF data storage (/sdf/data/lcls/ds) has connection problems, so data on it cannot be accessed and file access will likely fail with "Remote I/O error". Next update: 3pm PST, Feb. 10; see https://s3df.slac.stanford.edu/	Please refrain from writing new data to /sdf/data/lcls/ as that might fill up the flash storage. The object store itself is up and the data are fully protected; the issue is that the fast flash layer cannot interact with the object store. If the outage persists, /sdf/data/lcls/ will be unmounted. The file systems are available now.
PCDSN Network	All Services	Nov 27th, noon	Dec 11th		Scheduled NEH power outage, 11/28-12/8	The PCDSN routers are located in NEH, so all PCDSN services will be impacted: Storage (NFS), Login (Bastion Hosts), NoMachine, Interactive, DRP, Build, and Diskless (IOC and DAQ). Service in 208 will go down only when power is down. Controls and DAQ servers in affected areas will be turned off starting at noon the day before the power outage. When power is restored, systems will be brought back in this order: (1) central systems (208) to get UED operational (Dec 6, late afternoon); (2) NEH: LFE/KFE, TMO, RIX, Laser Hall, XPP (Dec 6-7); (3) FEH: XRT, XCS, MFX, CXI, MEC (Dec 8). EPICS will remain offline until the above services are restored. Recovery can take up to 8 hrs after these services are restored. PLC systems configured for automatic recovery will resume as soon as power is restored. S3DF will not be impacted during this outage.
S3DF	Batch Computing and Interactive Nodes	Oct 16th, 8:00 PM	Oct 17th, 12:01 AM	Oct 17th, 11:00 PM	Weka upgrades	Upgrading Weka to the latest release to fix ACL issues. Although no impact is expected, this is being added for informational purposes.
Home directories on the psana/daq systems.	All, except S3DF.	Sep 7th, 2023, 12:30pm		Sep 7th, 2023 5pm	Weka problems	IT writes: If you experience issues, first try to log out and log back in again. If that doesn't work, reboot the host.
DRP		June 25th 8:00 PM	July 5th 2:00 PM	June 16th, 2023 7:00 PM	Finalize the storage client upgrade.	DRP {cmp,eb,mon} nodes in SRCF are not available
psana	ana lustre filesystems, data movers, interactive machines	July 6, 9:20am		July 6, 4:00pm	Unplanned power outage
ALL	ALL	June 25th at 8:00 PM	July 4th		SRCF-II Electrical System	Affected: NFS, DHCP, Home directories, Archiving, Logging, Grafana, Routing (inter-subnet traffic), EPICS + Gateways, DAQ + analysis systems, Virtual machines (e.g. lfe-console), ssh, NTP. PLCs and networked controllers will continue to function insofar as no inter-subnet communication is required. PLC work and maintenance will require a local connection. Intra-subnet traffic will still be possible. IOC hosts may remain online, but performance will likely degrade until they crash, and they will not recover until the outage is over. Reboot of any system may fail due to lack of IP assignment (DHCP is down). Archiving and logging will be dark. This outage is required to integrate the new SRCF-II building with the existing data center.
FFB	FFB Storage	May 1st, 2023 6:00 AM	May 2nd, 2023 6:00 AM	May 5th ??	Preparation for S3DF functionality integration.	Upgrading system and storage version. LCLS2 DAQ unavailable; LCLS1 DAQ only able to record to event-nodes during outage. The FFB was upgraded, but some clients still need work, which will be done by 4th July (SRCF-II outage).
pslogin, psana, NoMachine	Login	June 16, 2023, 2:00pm		June 16th, 2023 7:00 PM		Unplanned power outage in SRCF rack F1
Kafka	Kafka, elog	April 28th, 2023 14:00	17:00 same day	16:00	Switching data-management from PCDS Kafka to S3DF Kafka	The elog and some services might not be available for a short time. Let pcds-datamgt-l know if something is still amiss.
ana02	ana02	April 4th, 2023			Decommissioning offline filesystem ana02	All data on ana02 were copied, cleaned or archived. ana02 is not accessible anymore.
All	All	March 21, 2023 3pm	Unknown	March 22, 2023 4pm	SLAC site-wide power outage
	ARP	Mar 9, 2023 9AM	Mar 12, 2023 9PM	Mar 9, 2023 9PM	S3DF support	We're adding S3DF support for the ARP and as a result will be making major changes to the Automatic Run Processing infrastructure.
psana	ANA03	Feb 23, 2023 11:30AM	Unknown	Feb 23, 2023 1:00PM	Unhappy OSTs.	From IT: There was an issue with the zpools in ANA03. After coordinating the restarts, all seem operational.
All	All	Jan 18, 2023 7AM	Unknown	Jan 18, 2023 4PM	Planned power outage in B950	pslogin and psexport nodes are reachable again as of ~4pm. (nodes located in SRCF, but not those in NEH)
ConfigDB	DAQ Configuration for LCLS2	Oct 18 3PM	Oct 18 5PM	Oct 18 4PM		Enhancements for restore configuration
psdb, psdm	eLog, data-mover	Sep 1st, 9am	Sep 2nd, 9am	Sep 1st, 4pm	Updating OS, Kafka, MongoDB	MongoDB, Kafka, pswww were updated. All services are up and running, but some checks are still being performed.
psossana0101	ANA01	Aug 29th, 17:10	Aug 30, 10am		disk issues	Three OSTs of ana01 are not accessible and access to ana01 will hang
Many	NFS	July 24, 7:00		July 25, 14:00	Some weka servers had problems.	Weka NFS filesystem showing multiple problems on many nodes
ANA FS	Lustre	June 24th, 8:00	June 24, 8:00	June 24, 15:00	Relocate OSSs
ANA FS	Interactive, Batch, ANA FS, Export	May 25, 9:00AM	May 25, 10:00AM	May 25, 11:22AM	Upgrade firmware to fix bug.	Part of the core upgrade, but encountered issues during the upgrade.
Core Services Network	All	May 25, 9:00AM	May 25, 10:00AM	May 25, 10:12AM	Upgrade firmware to fix bug.	No server reboots will be required.
All Diskless Systems	All Diskless Systems	April 21, 12:10PM	TBD	April 21, 1:10PM	Diskless system was rebooted, did not come back as expected	Critical impact
WEKA Cluster	NFS	April 7th, 8:00 AM	April 7th, 9:00 AM	April 7th, 9:30 AM	Patch Software	Minimal impact, no reboot required. Expect latency during the process.
Network Reorganization #3	ANA FS, Export Services, Interactive Pool and Batch cluster.	TBD			Network Upgrade
psana and ffb batch nodes	ana15 filesystem	Feb. 15 2022, 06:30		Feb. 15 2022, 11:10	OSS reboot was necessary	Stopped again at 9:40; OSS and MDS were rebooted.
psana and ffb batch nodes	ana01/02/03/15/16 filesystems	Feb. 9 2022, 0:00		Feb 9, 2022, 16:20	Some of the oss or mds have problems with their IP addresses.	ioc-und-rec01 was misconfigured and created a conflicting IP address with ana02/03
MFX nodes	Network
psmetric04	grafana (psdm)	Jan 12, 10:40		Jan 12, 10:45	Upgrade 6.7.1 -> 8.3.3.	PSDH-161
Network Reorganization #2	Name Services, Login Services, NoMachine, Web Services	Jan 6, 2022 9:00 AM	12:00 PM	1:30PM	Network Upgrade
Network Reorganization #1	ALL	Jan 5, 2022 8:00 AM	12:00 PM		Establish LCLS Routers
Weka	ALL	Jan 5, 2022 9:00 AM	10:00 AM		New Features and bug
XCS Nodes	Network	Nov 16, 9.30am			Switch upgrade
psdb0x	eLog, file manager, all web services	Dec 8th, 9AM	Dec 8th, 12PM	Dec 8th, 10AM		Upgrading Mongo to 4.4.10.
drp-neh-xxx	NEH DRP				Upgrading to the latest Lustre version	No DRP in the NEH alcove
All NEH and FEH nodes	IOCs and DAQ nodes, alcove DRP, and control room workstations	Aug 19, 6:00AM	Sep 18, 5:00PM		Electrical work in experimental halls	Systems in the experimental halls will be on-line whenever power is available. Note: psana is not in the experimental halls and won't be affected.
psanaoss215	ana11, ana12	Dec 2			oss psanaoss215 of ana11 is down	decommissioned
psanaoss231	ana13	Aug 5th, 5am			oss problem after power outage	decommissioned
all nodes	nfs file-system	Aug 31, 10am		Aug 31, 20:00	Some weka filesystems are not available.	File systems (e.g. /cds/sw, /cds/groupcfel/..) show I/O errors. Access may work intermittently.
	xtc file restore	Nov 6th,	Nov 8th	Nov 7th	Data Restore from tape is not working	HPSS might have just been very slow mounting the tapes.
pshub01	JupyterHub	Oct 21, 9am	Oct 21, 9am	Oct 21, 9am	Upgrade hub version	upgrade hub from 1.4.2 -> 2.0.0., fix matplotlib/ipympl issue
psjhub01	JupyterHub	Sep 16, 9am	Sep 16, 9:15am	Sep 16, 9:05am	Upgrade jupyterhub code
questionnaire	LCLS and UED questionnaires	Sep 14, 6PM	Sep 14, 10PM		Upgrade the questionnaire backend to new database
psexport	All data mover services	Aug 16, 1:00PM	Aug 19, 3:00PM	Aug 19, 15:30		All services have been started except the NEH data mover.
Diskless nodes	IOC and DAQ nodes	Aug 16, 9:00AM	Aug 20, 1:00PM		Diskless server will be migrated to the weka cluster	Cancelled
CXI Nodes	Network	Aug 18, 2:00PM	Aug 23, 5:00PM	Aug 27, 5:00PM	Network upgrade	No control room workstations or IOCs working in CXI during this time.
MEC Nodes	Network	Aug 18, 2:00PM	Aug 23, 5:00PM	Aug 26, 2:00PM	Network upgrade	No control room workstations or IOCs working in MEC during this time.
psnfsluste	ARP, Summary results	Aug 15th, 5:00PM		Aug 25, 1am	Various file system upgrades	ana-filesystems have been mounted on the psdmNN nodes
psdbdev01	LCLS2 calibration service	Aug 26th, 5PM	Aug 26th, 9PM	Aug 26th, 11:30PM	Move the databases to psdb0x machines	May affect analysis for TMO/RIX experiments.
drp-srcf-xxx	Fast feedback	Aug 15th, 5:00PM	Aug 25th, 5:00PM	Aug 25, 11:30PM	Reorganizing DRP racks and connectivity, upgrading Weka FFB Cluster	No fast feedback analysis during this time.
psana, JupyterHub	Interactive pool, batch nodes	Aug 16, 9:00AM	Aug 19, 2:00PM		Upgrading file systems, batch and interactive nodes	Some data analysis capabilities. Working on getting the remaining interactive and batch nodes on-line.
pswww	Web Services	Aug 16, 6:00PM	Aug 17, 12:00PM	Aug 17, 4:15PM	Service failed after upgrade.	Permission issues from the Weka upgrade
psnx, pslogin, psdev,	All login services	Aug 16, 1:00PM	Aug 16, 5:00PM	Aug 16, 6:30PM	Upgrading host to latest packages and services	No ability to ssh into the system during this time.
psweka	NFS	Aug 16th, 9:00AM	Aug 16th, 1:00PM	Aug 16th, 5:00PM	Upgrading and re-configuring Weka Cluster	The users home won't be available during this time so nothing will work. This will require rebooting all LCLS servers to use the new NFS stack.
psanagpu, lustre	interactive, jupyter, anafs	Aug 5th, 5am		Aug 5th, 9:20am	power issue SRCF	Most systems have been restored. Only ana13 needs some attention.
Weka Cluster	NFS (home directories, central storage, all related systems)	5/22/2021	Unknown	5/24/2021 ~10pm	Under investigation	The LCLS IT team is working with the vendor to diagnose root cause and future actions. The software version was rolled back, and the system is performing more stably.
psdb0x, psdm0x, pswww0x	logbooks, movers, questionnaires, most of the data management infrastructure	May 18, 9:00AM	May 18, 12:00PM	May 18, 9:55AM	Upgrading all the NFS mounts to the new version of NFS.
psweka	All	April 28th, 8:00PM	April 28th, 9:00PM	April 28th, 9:45PM	Deploying NFS-Ganesha
psanaoss121	ana02	Apr 23, 18:15		Apr 26, 11:50	oss crashed
psweka	All	April 26th 8am	April 26th 12pm	April 26th, 11am	Weka upgrade	Nothing will work during this time
psdb0x, psdm0x	logbooks, movers, questionnaires, most of the data management infrastructure	Apr 21, 3:00PM	Apr 21, 4:00PM	Apr 21, 5:00 PM	Moving all machines to 10Gbps networks.	psdm02 had a bad NIC port, we had to reconfigure this to use the alternate.
psanaoss12n	ana02	Apr 15, 4:30pm	Apr 15, 7pm	Apr 15, 5:40pm	Replacing broken fans
psexport	globus, data transfers	March 24, 9:00am	March 24, 10:00am		reboot psexports to remove ana11/12
psanaoss121	ana02	Apr 6th, 17:00	Apr 7th, afternoon	Apr 7th, 11am	disk related hardware issues
HPSS	tape archive, restore	March 23, 6:00am	March 23, 6pm	March 23, 14:20	HPSS upgrade
drp-srcf	FFB for TMO+XPP/XCS	March 11th, 10am	March 11th, 7pm	March 12th, 1am	Disable SMT on DRP SRCF nodes
psdb(psdb4)	Questionnaire/File Restore/File Manager/LCLS 1 DAQ data mover	Mar 3 6:00PM	Mar 3 10:00PM		Moving databases to new cluster	The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines.
All	ldap/dhcp/dns	Feb 24th at 10am	Feb 24th at 1pm	Feb 24th at 3pm	psrelay migration	Netconfig and reboots will fail during this time and name resolution lookups may be delayed for a few seconds.
psdb(psdb4)	Questionnaire/File Restore/File Manager/LCLS 1 DAQ data mover	Feb 11 9:00AM	Feb 12 2:30AM	Feb 12, 8:00PM	Moving databases to new cluster	The questionnaire, the file manager services and the LCLS1 DAQ will be unavailable as we migrate to newer machines. Reverted back to psdb4 owing to routing issues.
Instrument Network	all NEH/FEH computing	2/4/2021 10:30am		2/4/2021 11:20am	Power supply failure	Replaced. Redundant supply added.
pswww/pswebkdc	elog/file restore	Jan 20, 2021 ~4:00PM		Jan 21, 2021 ~1 am	Issues with VMWare hypervisor
psdm0x psdb0x	eLog/File restore etc	Dec 21	Dec 24	Dec 22	Upgrade of backends to new releases of software
ana03	ana03	Dec 17, 2:49AM		DEC 17, 10:15am	Lustre issue	psossana0303 was stuck and needed a reboot (hard reboot)
Ana file systems	ana13	Nov 18	Nov 20	Nov 23, 11am	one ost is not accessible for write	ost0xe cannot be written to due to mds/ost issues. It has been set to ro, but writes to ana13 are slowed down. Reboot required. Fixed by rebooting the MDS (OSSs were also rebooted).
pshub01	JupyterHub	Oct 29 1:00PM		Oct 29 5:44PM	Disk failure.	There is a disk failure in the filesystem that stores the JupyterHub sessions. We are trying to recover this and restart the node. We could not recover the sessions; please log out and restart your session.
<<All>>	Network File System (NFS)	10/19/20, 4pm	10/22/20, 9am	21 Oct 2020	Upgrade to new Solid State Drive-based, distributed clusters	Remaining /reg/[d,g,neh]/* NFS volumes (see Detailed list)
ANA file systems	ana02/03/11/13/15	10/13/20, 6.30am		10/15/20, 17:40	(Unscheduled/failure)	21:20 UED has been moved to weka-nfs: /cds/data/ued/ana, The ana-filesystems are accessible now. The FFB->anafs migration has started but it will take time to clear the backlog
<<All>>	CDS Router	10/5/20, 9am	10/5/20, 12pm	10/5/20, 1pm	Upgrade to new routers, 100Gb optics	During 4-day PAMM. Actual disruption should be order of minutes.
<<All>>	Network File System (NFS)	10/5/20, 12pm	10/8/20, 12pm	10/8/20, 12pm	Upgrade to new Solid State Drive-based, distributed clusters	4-day PAMM. Replace aging legacy systems. Service disruptions may endure for several days as hard-links in applications are identified and repaired in real-time. Only /reg/neh/opr/ and /reg/g/pcds/ mounts were migrated. Future outage(s) will address remaining volumes.
ANA	ana04	April 16th	May 22nd		ana04 is down	Hardware problem with one ana04 ost
ANA file system	ana14	Dec, 2019	may 27th		ana14 is down	Hardware problem with one ana14 oss
nfs server	home directories	Sep 29th		Sep 29th, 11:30	nfs server issues	All servers are back up. Most if not all issues have been fixed.
psnfs02	home directories on home5	Sep 3, 16:40		Sep 3, 19:11	psnfs02 crashed
ANA FS; GPUs, Export, Batch nodes	ANA02, ANA03, ANA11,ANA15, psana, psanafarm, psexport	July 7th, 5:00PM	July 9th, 8:00AM	July 8th, 6:00PM	Circuit breaker replacement	Circuit breaker has been replaced and sensitivity decreased. In order to replace the unit, the entire distribution panel (UDB-C) must be powered off.
ANA	ana13	June 8th	June 10th	June 10th, 10:45	psanaoss232 is down
Gateway, Build nodes, ML node	pscag1 - pscag4, psbuild-rhel5, psbuild-rhel6, psbuild-rhel7, psjerry	June 8th 7:30AM	June 8th 9:00AM	June 8th, 6:00PM	Power maintenance	Power maintenance completed. The gateway and build servers are now accessible.
VMware	pswww, pslogin, psdev	May 18th, 08:00	May 18th	May 19th, 02:00		pslogin, psdev and pswww are not accessible. Access using the LCLS NX servers will continue to work.
VMware	pswww, pslogin, psdev	May 16th, 10:00	May 16th, 17:00	May 17th, 18:30	VM hypervisor upgrade	Access using the LCLS NX servers will continue to work.
psnfs02	home directories	May 12th	May 13th	May 13th, 10:14	server is down	Some users home directories are not accessible
psexport	Globus, gridftp	May 11th	May 12	May 13th, 19:30	psexport01 is down	psexport works again, gridftp (globus) is available.
pslogin, pswww	login, web services	May 12th	May 12th,	May 12th, 11:20	virtual machine cluster	Most VM should be accessible now.

Comments

psexport	psexport	May 11th	May 12		psexport01 is down and needs on-site access.
ANA	ana04	April 16th	open		oss of ana04 is down	ana04 is not accessible and file opens will hang
ANA file system	ana14	Dec, 2019	open		ana14 servers are down	Hardware problems with ana14's oss


pswww	Data Management Portal	Feb 29	Mar 3 5PM	Mar 5th	Migrating databases to LCLS2 infrastructure	We'll be migrating all the databases to LCLS2 data management systems. The current data management portal should be available as read only.
ANA	Analysis Infrastructure	4 December 19 4:00PM PDT	Friday, December 13th	superseded by other outages	Troubleshooting short-circuit	While bringing up some of the file systems, we encountered electrical issues. The plan is to keep things stable until Monday to make sure the equipment which is currently powered on is working properly. Status: Up: psana, psanaq, psexport, ana02, ana03, ana04, ana12, ana15. Down until further notice: ana14. Update 2020-02-25: ana11 is up and batch nodes psana12xx.

ANA file systems	Analysis infrastructure	Nov 14th, 2019	Nov 27th	Dec 4th	Move from building 50 to building 54 (SRCF)	Update: ANA14, ANA12, ANA04, and ANA02 up by Friday. ANA11 up by Monday. Completed tasks: ana03, ana12, ana13, and ana15 are up. Most of the psana batch queue is now available. The psana interactive pool can now be accessed, but several nodes are still down (e.g. psanagpu115, psanagpu116). The psexport nodes are operational.

pslogin, kerberos	pslogin, passwordless access	November 8, 2019 11:45AM	TBD	November 11th	2 Hypervisors are unresponsive

ANA file systems		April 11, 12:30pm	Unknown	April 11, 7:30PM	SLAC wide power glitch	7:30PM: ANA14 is now online. All systems are up and functional. 6:00PM: All ANA lustre filesystems have been recovered except ANA14 which is suffering major hardware issues. We don't know yet how long it will take to recover it. 12:30PM: All systems are up by now except the ANA file systems because the Lustre MDS was damaged. We don't know yet how long it will take to recover it.

All	All nodes	January 4th	January 11th	January 25th (User Services)	Server room relocation	Systems that are up: psnxserv, psana (interactive), psexport, psdev, pslogin, pswww (elog), pshub (JupyterHub), psana (batch)

psana, psexport		Aug 13, 7pm	Aug 14		Configuration error	Wrong MTU setting in building 50 causes psana and psexport to not be able to mount NFS. Will fix this morning.

All	All nodes	July 26, 5:00PM	July 30, 5:00PM	August 2nd	Power Outage	Systems are now on-line and fully functional. We encountered several systems with bad hardware and corrupted files. Unnecessary systems are now off-line.

psnfs03 and psnfs04	/reg/g and /reg/common	Apr 24, 6:30am	Apr 24, 9am	Apr 24, 8:30	firmware update


psnfsopr	Operators home	Apr 24, 6:30am	Apr 24, 9am	Apr 24, 7:30	Move and firmware update


All	Networking will be down, so all machines and services will be unavailable.	Jan 17, 2:45 PM	Jan 17, 5:00 PM	Jan 17, 4:45 PM	Central Router firmware upgrades and replace NFS SAS module.	The new SAS NFS module worked, but psnfs03 and psnfs04 took longer than expected to boot.
The pslogin, psdev and psnxserv nodes will be unavailable.	User home directories and some /reg/* NFS shares will be unavailable	Dec 26, 6:00 AM	Dec 26, 8:00 PM	Dec 26, 9:30pm	Firmware upgrades	Problems found with one of the NFS servers (psnfs03), may need to take another outage to fix
All ANA filesystems, interactive (GPU & Phi) nodes, psexport nodes and LSF batch nodes	Science data and associated servers	Dec 26, 6:00 AM	Dec 31, 5:00 PM	Dec 31, 8:00 AM	Electrical Work


HPSS Storage System will be unavailable	Data backup and recovery	Dec 25, 6:00 AM	Jan 8, 5:00 PM	Jan 1	Electrical Work


All ana filesystems, interactive nodes, psexport nodes and most batch nodes	Science data and associated servers	Aug 22, 1PM	Aug 23, 8:00 PM	Aug 23, 10:00PM	Cooling and Electrical Work	Outage recovery was late because we had several hardware problems which required intervention.
~~LCLS Computing~~	~~All LCLS computing services~~	~~July 25, 4:00 AM (PDT)~~	~~July 25, 6:00 PM (PDT)~~		Electrical Work	Outage Canceled.
All ana filesystems, interactive nodes, psexport nodes and most batch nodes	Science data and associated servers	June 7, 5:30 AM (PDT)	June 7, 6:00 PM (PDT)	June 7, 6:53 PM (PDT)	Electrical Work	Outage recovery was an hour late because server room electrical work extended beyond anticipated outage time and a Lustre-system RAID card failed.
ana02	Science data	Wed, April 19, 11 AM	Wed, April 19, 4 PM	Wed, April 19, 4 PM	Update OS and Lustre version


psnxserv03	psnxopr				NoMachine upgrade	Please use psnxserv01 and 02 while 03 is upgraded
All	NFS	Fri Feb 24, 2017 9am	Fri Feb 24, 2017 9pm	Fri Feb 24, 2017 8:08pm	NFS upgrade	During this outage it won't be possible to access any user or operator home directories.
ana12	Science data	Aug 16, 9am	Aug 16, 12pm	Aug 16, 12pm	Hardware failure	Access to six OSTs is very slow, presumably because of a failing RAID card. We will shutdown one of five ana12 OSSs to replace the card and reconfigure as needed. No data is expected to be lost, but some data will be unavailable during the outage.
ana04	Science data	May 5th, 10am	May 10th, 10am	May 30th	Hardware failure	One of the ana04 OSTs doesn't detect a drive. There is enough redundancy to rebuild the array, but it's extremely slow and sometimes it hangs. The OST has been set read-only. Moving the data to other OSTs. End: the file system couldn't be recovered, but we were able to move all the data somewhere else, then wipe and rebuild the file system.
Batch nodes	Batch jobs	Oct 5, 2015 11am	Oct 5, 2015 1pm		Move to RHEL7	Interactive nodes will also be moved to rhel7. Users logged into rhel5 interactive nodes will be able to continue their session, but batch submission will fail once batch nodes are converted to rhel7.
HPSS	Restoring files from tape	Sep 22, 2015 7am	Sep 24, 2015 5pm	Sep 24, 2015 12pm	Upgrade of HPSS to version 7.4	HPSS will become read-only on Sep 21st at 5pm.
psnehprioq/psfehprioq	All nodes	April 1, 2015 9am	April 1, 2015 8pm	March 31, 2015	Maintenance on nodes to allow addition of 640 cores to computing system	Ended early due to technical issues.
psnehq/psfehq	All nodes	March 31, 2015 11am	April 1, 2015 8pm	March 31, 2015	Maintenance on nodes to allow addition of 640 cores to computing system	Ended early due to technical issues.
All NEH/FEH computing	All LCLS computing services	Dec 31, 00:00hrs	Dec 31, 18:00hrs		Switching of generator power to building power at Building 950 where servers are housed.


All NEH/FEH computing		Nov. 7, 2014 ~12pm	Unknown	7:30pm	Unscheduled power outage	Power has been restored. We are running file system checks for our NFS servers before we can bring everything else back up. We are hoping to get things back up by 4:30pm.
LCLS Offline Filesystem	All LCLS Offline filesystems	Sep 3, 2014 ~7.30pm	N/A	Sep 4, 2014 12:45pm	Network outage causing offline filesystems to be inaccessible.	We have identified the root cause and the issue should be fixed now.
LCLS Computing	All LCLS computing services	Friday Aug 8th, Midnight (00:00 hrs)	Monday Aug 18th, 2014 Noon	Aug 18th, 2014 12pm	To prepare for the power shut down on Aug 11. To perform hardware and software maintenance before the power outage.	Not all machines will be shut down at once as we start performing system updates. Please do not depend on any service past midnight Aug 10th.
psexport, psana104, psana105	Export nodes, HPSS, scratch and calib backups	Thursday August 7th, Noon	Monday Aug 18th, 2014 COB		These machines need to be moved to Bld 50.

LCLS Online Computing	All Online Computing Nodes, DSS, FFB	Tuesday Aug 5th, 2014 5pm	Wednesday Aug 20th, 2014 COB		CXI DSS nodes and ffb nodes need to be recabled and reconfigured. The IB switch is moved to B50 as part of the offline storage move.

LCLS Offline Filesystem	All LCLS Offline filesystems: ana01, ana02, ana03, ana04, ana11, ana12, ana14	Tuesday Aug 5th, 2014 5pm	Monday Aug 18th, 2014 COB		Offline Analysis Hardware will be moved from B950 and B999 to B50.	The equipment needs to be disconnected and carefully moved before the power outage on Aug 11, then reconnected at B50.

LSF	LSF Job Submission and Management	Wednesday, Mar 19, 2014	N/A	Wednesday, 6:45pm	Unplanned software outage	No job can be submitted and managed at this time. We were informed that the software vendor has been contacted, and SLAC Computing Division is working on the issue. Post-mortem from SLAC comp-out: The problem resulted from a bug in one of the new LSF 9.1.2 daemons. IBM is researching a fix and in the meantime we are still running 9.1.2, but that one daemon has been reverted to 9.1.1. We believe that jobs submitted prior to the problem should have continued to run and should continue to be tracked by LSF.

All computing services at LCLS (NEH, FEH, XRT, FEE, Undulator Hall)	All services at LCLS	Friday, Aug 9th, 2013 13:00hrs	Tuesday, Aug 13th, 2013 13:00hrs		Planned power outage at LCLS buildings


psana11, psana12, psana13, psana14	LSF Compute Nodes	Thursday, May 30th, 2013 16:30hrs	Friday, May 31st, 2013 12:00hrs		Unplanned power outage at SLAC	Most of the nodes have been brought up. A handful have memory-related problems and have been disabled in LSF awaiting diagnostics.
psanafeh		Thursday, May 30th, 2013 16:30hrs	Friday, May 31st, 2013 12:00hrs	Friday, May 31st, 2013 10:37am	Unplanned power outage at SLAC

psexport01		Thursday, May 30th, 2013 16:30hrs	Friday May 31st, 2013 12:00hrs	Friday, May 31st, 2013 11:00am	Unplanned power outage at SLAC


ana01, ana02	/reg/d/ana01, /reg/d/ana02 filesystems	Thursday, May 30th, 2013 16:30hrs	Friday May 31st, 2013 14:00hrs	Friday, May 31st, 6pm.	Unplanned power outage at SLAC


pssrv100 (psnfs)	NFS mountpoint for PCDS diskless nodes	Tuesday, Mar 26th, 2013 12:30pm	Tuesday, Mar 26th, 2013 5pm	Tuesday, Mar 26th, 2013 4:45pm
pssrv100 (psnfs)	NFS mountpoint for PCDS diskless nodes	Monday, Jan 7th, 2013 (1030 hrs)	Monday, Jan 7th, 2013 (1600 hrs)	Wednesday, Jan 9th, 2013 (1140 hrs)	RAID controller malfunctioned upon power restoral after planned power outage in B950 203A	pssrv101 (old data) was used to bring up the FEE nodes for part of the outage. pssrv100 was restored to operation after a new RAID controller was delivered and installed.
ana01	/reg/d/ana01 filesystem	Tuesday, Dec 18th 2012	unknown	Partial (98%) restoral Monday Dec 24th (0800 hrs)	Controller failed causing corrupted parity data	Parity errors fixed and new controller installed. 2 OSTs (LUNs) needed fsck'ing. One took a few hours, the other took 10 days.
psanaoss21*	/reg/d/ana12 filesystem	Monday, Oct 8th, 2012 (1700 hrs)	Monday, Oct 8th, 2012 (1900 hrs)	Monday, Oct 8th, 2012 (1900 hrs)	Hardware upgrades


psanaoss2**	/reg/d/ana11 and /reg/d/ana12 filesystem	Thursday, Sep 27, 2012 (1700 hrs)	Friday, Sep 28, 2012 (0100 hrs)	Friday, Sep 28, 2012 (0400 hrs)	Hardware upgrades


Sitewide outage. All Linux Servers at NEH, FEH, XRT, FEE.	All computing services at LCLS.	Wednesday August 15, 2012 1:15 PM (1300 hrs)	August 17, 2012 1:00 PM (except psanafeh, ana11 and ana12 file systems, which will be down till Aug 21, 2012)		SLAC sitewide power outage on August 16. Electrical work at LCLS. Server maintenance.	Expect logging in to any machine to be unavailable between 8/15 and 8/17, even if some of the servers are powered up before the expected end time. There will be maintenance performed on various servers during these 2 days.
All machines in XPP hutch and control room will be inaccessible.	XPP	Monday April 9, 2012 11:15AM	Monday April 9, 2012 11:45AM	Monday April 9, 2012 11:30AM	Electrical Work at XPP Hutch	Completed


	ana01/ana02 file systems	Wed Mar 28th, 2012 9am	Wed Mar 28th, 2012 1pm	Wed Mar 28th, 2012 4pm	Upgrade to IB	Completed
psananeh lclsq ana01 ana02	NEH storage and processing	Wed Dec 21, 2011 6am	Tue Dec 27, 2011 4pm	Mon Dec 26, 2011 1pm	Chilled water outage	Completed. Chilled water was restored on Friday.
psana batch nodes	All Science data is currently unavailable. Psananeh psanafeh is up for Matlab use, but no access to data on Lustre file system.	Saturday Oct 1, 2011 6am			Lustre file system remains down after the unplanned power outage on Saturday.	The system administrators are working to bring it back.


	All LCLS computing services	Monday Nov 14, 2011 7am			Electrical work at NEH server room and FEH.	pslogin is up. NFS server, LDAP, DNS, pswww are up. The daq nodes will not come up until after 4PM. Lustre will not come up until after about 4:30PM. Batch nodes (psana11* psana12) and psana01 will not be up until Lustre is up.
psana, NEH Online Nodes, psimport, psexport, pslogin, psdev. psanasrv100, psanasrv101, psanasrv102	All Science data, All user home directories, all DAQ cache nodes. All online services.	Wed Sep 28, 2011 10am	Wed Sep 28, 2011 6pm	Wed Sep 28, 2011 6pm	Upgrade of Lustre hardware. Installation of taylor on several offline systems. Update of kernel on Online nodes.


psana	Science data access	Tue Sep 20, 2011 11:15am		Tue Sep 20, 2011 6:15pm	NEH power outage	B950 and several other buildings experienced a short power glitch, but the lustre file servers did not survive the interruption and are still being brought up.
psana	Science data access	Thu Jun 2, 2011 1pm	Thu Jun 2, 2011 5pm		Lustre failover testing.


NEH online nodes ana02 psexport, psimport	NEH DAQ, outside ssh access	Thu May 25, 2011 noon	Thu May 25, 2011 7pm		Server room upgrade, ana02 memory upgrade	Completed
psana	Science data access	Thu May 12, 2011 1pm	Thu May 12, 2011 6pm	Thu May 12, 2011 6.30pm	Lustre maintenance	Completed. Upgraded memory on psanaoss101-104, and replaced 10Gbit cards with 1 port SMCs. 717W power supplies are in place on psanaoss103-104 now.
psana	Science data access	Thu May 5, 2011 1pm	Thu May 5, 2011 5pm	Thu May 5, 2011 5pm	Lustre maintenance	Completed
All	All	Fri Apr 29, 2011 6.30pm	Sun May 1, 2011 11pm	Sun May 1, 2011 9pm	NEH power outage	Completed
psana	Science data access	Thu Apr 28, 2011 2pm	Thu Apr 28, 2011 6pm	Thu Apr 28, 2011 3pm	Lustre maintenance; pssrv100 NFS volume reconstruction.	Completed. Lustre maintenance postponed. RAID reconstruction of pssrv100 will take 2-3 days. The new volume size is not released by the controller, so we will have to perform the file system resize on another day.
psana	Science data access	Fri Apr 1, 2011 6pm	Mon Apr 4, 2011 10am		NEH cooling outage	Completed
psana	Science data access	Thu Mar 31, 11am	Thu Mar 31, 5pm		Enabling HA for Lustre system	Completed
All	All	Sat Mar 26, 2011 7am	Sat Mar 26, 2011 7pm	Mon Mar 28, 2011 1pm	NEH power cut	Completed
psana	Science data access	Thu Mar 24, 2011 11am	Thu Mar 24, 2011 5pm		Lustre testing	Completed
All	All	Wed Mar 23, 2011 10am	Wed Mar 23, 2011 3pm		NEH power cut	This power cut was NOT planned
All	All	Sat Mar 19, 2011 7am	Sat Mar 19, 2011 7pm	Mon Mar 21, 2011 10am	NEH power cut	Completed

