Purpose of this page
- we are trying to collect information about upcoming Fermi computing outages (disks, Oracle, network) to improve planning
- when planning an outage, please send an email to datalist and write the description here (including requested duration and preferred timeframe)
- we will try to combine outages as much as possible, in order to maximize uptime for time-critical services (FASTCopy, pipeline, etc.)
- once the plan is finalized, don't forget to send a message to glast-outage and the collaboration (if applicable)
Upcoming outage requests
- Outage of mysql-node03 to move to HA rack.
Feb 03, 2014 - Oracle and OS patching (GHOST vulnerability patches)
- Outage of FASTCopy starting at 9:00am, reboot of FASTCopy machines
- Oracle OS reboot and patching starting at 10:00am
- Reboot Fermi linux xrootd servers and fermilnx machines
Feb 11, 2014 - Oracle and OS patching; also retirement of various glastlnx machines
- 10am - duration is likely several hours
- This outage affects all NFS servers (wains), including user disk as well as xroot servers.
- Expect interruptions in all Fermi services as they are moved from old glastlnx -> new fermilnx machines
Dec 11, 2013 - Oracle server battery replacement
- 10am - glast-oracle03 to have battery replaced in storage array. Expected outage duration: 30m
Dec 4, 2013 - OS Patching and re-IP'ing
- 10am - all Fermi wain-class servers will be rebooted for OS patching.
- 10am - glast-oracle03/04 will be rebooted for OS patching.
- At the same time, 16 wains will have new IP addresses assigned in anticipation of retiring old network switches and reconfiguring the network in January 2014.
- Three wains will be physically relocated to consolidate rack space
Oct 2, 2013 - ISOC logging gateways to be shut down
Until now, the gateway daemons that allow one to make entries in the ISOC event log (the one displayed by the LogWatcher web app) have been run on glastlnx06 and glastlnx11. These daemons are contacted by software outside of the usual ISOC distribution.
glastlnx06 and glastlnx11 will shortly be decommissioned. fermilnx01 and fermilnx02 are now running the gateway daemons, and I've prepared versions of the logging software that use them:
- GPLtools - Version GPLtools-02-00-02 in /afs/slac/g/glast/ground/PipelineConfig/GPLtools. This provides a Python version of the logging software. The only changed file is python/PipelineNetloggerConfig.py.
- org-glast-isoc-common - version 1.3 in the Fermi Maven repository. This provides the Java and Jython versions of the logging software.
I intend to shut down the logging gateways on glastlnx06 and 11 on Wednesday, Oct 2.
Sep 9-10, 2013 - ISOC realtime displays
- There will be occasional interruptions in service as the ISOC realtime support daemons are moved from glastlnx06,11 to fermilnx01 and 02.
Aug 13, 2013 - Quarterly Oracle security patching
- 10:00-12:00 GLASTP (glast-oracle03, 04)
- 11:00-12:00 reboot of the Wains
May 2013 - Quarterly Oracle security patching
- Wed May 01 10:00-11:00: GLASTDEV, GLASTSTG (glast-oracle02)
- Wed May 08 10:00-14:00: GLASTP (glast-oracle03, 04)
- Wed May 08 11:00-14:00: reboot of the wains (xrootd & NFS servers) for OS patches
Feb 13, 2013 - Oracle password change
- There should be no actual interruption of service unless something goes wrong.
Jan-Feb 2013 - Quarterly Oracle security patching
- Thu Jan 31 10:00-11:00: GLASTDEV, GLASTSTG (glast-oracle02)
- Mon Feb 4 10:00-14:00: GLASTP (glast-oracle03, 04)
Nov 18, 2012 - Oracle firmware upgrade
- 10-11 AM: Oracle firmware upgrade on glast-oracle03
Oct 25, 2012 - NFS Server upgrade
- Duration: approx. 2+ hours (to be coordinated with the HA rack movements; see below)
- NFS server change from old sulky machine(s) to wain031, affecting ISOC Ops: /u23, /u28, /u41, /u42. The selected method of making this cut-over will determine the nature and length of the outage. More details and discussion on the datalist email list. Refs:
- move glast-oracle03,04 and glast-win01,2 to the HA rack
- Oracle quarterly security patching
Sep 4, 2012
- 10:00 AM - 12:30 PM: Oracle patching.
- 10:00 - 10:30 AM: replacing a fan on sulky33.
Aug 16, 2012 - Site-wide power outage
- From John: everything except the servers on the generator will go down. Building 50 is supposed to be the first (or one of the first) buildings brought back up. Power goes off at 5:30 am on 8/16. We could have power restored by 6:30 am. Bring-up would begin after that, with most services back in 2-4 hours. NOTE, however, that we tentatively plan to start taking machines down at 17:30 the night before (Aug 15). So we are talking about a ~16 hour outage, if things go well.
- Update: some of the Power Distribution Units (PDUs) are old and need inspection. Apparently this needs to be done after the outage. Each inspection is estimated to take ~45 minutes, and Boris (et al.) propose inspecting four: PDU 40, 41, 42, 44. Each of these powers multiple servers (mostly file servers), and it turns out FGST has servers on all four. We've agreed to wait until the inspection is over before restarting the pipeline. A few extra hours don't make any difference for this.
- We have asked FOT and FSSC to buffer the regular FastCopy data deliveries to the ISOC starting at 4pm on Aug 15, to have time to clear our buffers. We will inform them when we are back online and ready to accept data.
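The ~16 hour estimate above can be checked with a short calculation. This is only a sketch based on the times quoted in the notes; the variable names are illustrative, and it assumes the four PDU inspections run back to back, which the notes do not state.

```python
from datetime import datetime, timedelta

# Times taken from the notes above (local time, 2012).
shutdown_start = datetime(2012, 8, 15, 17, 30)  # machines start going down
power_restored = datetime(2012, 8, 16, 6, 30)   # earliest expected power restoration

# "most services back in 2-4 hours" after power returns
bringup_min = timedelta(hours=2)
bringup_max = timedelta(hours=4)


def hours(delta):
    """Convert a timedelta to a number of hours."""
    return delta.total_seconds() / 3600


best_case = hours(power_restored + bringup_min - shutdown_start)
worst_case = hours(power_restored + bringup_max - shutdown_start)

# Assumption: the four PDU inspections (~45 minutes each) are done sequentially.
pdu_inspection_total = 4 * timedelta(minutes=45)

print(f"Outage without inspections: {best_case:.0f}-{worst_case:.0f} hours")
print(f"PDU inspections if sequential: {hours(pdu_inspection_total):.0f} hours")
```

The 15-17 hour range brackets the ~16 hour figure quoted above; the inspection time would add to the pipeline restart delay only to the extent that it does not overlap with the general bring-up.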
August 13, 2012
- 09:30-11:00 AM: Pause again while glastlnx06 is switched back to its old power supply.
August 10, 2012
- 10:00 AM - 12:00 PM: Pause in production so that glastlnx06 can be moved to a new power supply during prep work for the August 16 outage.
July 11, 2012
- 11:00am - 1:00pm: Replacing a bad fan on sulky34. Since that server holds the LAT raw data, FASTCopy ingestion will be stopped about an hour beforehand to let the pipeline drain.
Also, the remaining databases will be migrated off of glastlnx01/02 onto mysql-node01.
July 9, 2012
- 9:30 - 11:00am: An internal disk on glast-oracle03, the host of the GLASTP database, is in danger of failing. This outage is to allow for its replacement.
June 12, 2012
- 10am - 11:30am: migrating calib* and mood* databases from glastlnx01/02 to mysql-node03
May 10, 2012
- [10am-12:30pm] Oracle quarterly update. This will affect pipeline, data catalog, flight operations and any other databases on the main Fermi Oracle server.
- [10am-12:30pm] xroot server reboot for OS upgrade. This will affect all 36 of the wain (Solaris) xroot servers.
- [10am-12:30pm] Fermi USER DISK (wain006) reboot for OS upgrade.
- [9am-3pm] xroot file server move. This will affect only two xroot servers: wain070 and wain071.
- [9am-3pm] NFS file server move. This will affect the following servers, which will be unplugged and physically moved to new rack space in building 50: sulky33, sulky34, sulky35, sulky36