Purpose of this page

  • we are trying to collect information about upcoming Fermi computing outages (disks, Oracle, network) to improve planning
  • when planning an outage, please send an email to datalist and write the description here (including requested duration and preferred timeframe)
  • we will try to combine outages as much as possible, in order to maximize uptime for time-critical services (FASTCopy, pipeline, etc.)
  • once the plan is finalized, don't forget to send a message to glast-outage and the collaboration (if applicable)

Upcoming outage requests

Aug 21, 2012

  • 10:00 AM-12:30 PM: Oracle patching.

Aug 16, 2012 - site-wide power outage.

  • From John: everything except the servers on the generator will go down. Building 50 is supposed to be the first (or one of the first) buildings brought back up. Power goes off at 5:30 am on 8/16 and could be restored by 6:30 am. Bring-up would begin after that, with most services back in 2-4 hours. NOTE, however, that we tentatively plan to start taking machines down at 17:30 the night before (Aug 15). So we are talking about a ~16-hour outage, if things go well.
  • Update: some of the Power Distribution Units (PDUs) are old and need inspection. Apparently this needs to be done after the outage. Each inspection is estimated to take ~45 minutes, and Boris (et al.) propose inspecting four: PDU 40, 41, 42, 44. Each of these powers multiple servers (mostly file servers), and it turns out FGST has servers on all four. We've agreed to wait until the inspection is over before restarting the pipeline. A few extra hours don't make any difference for this.
  • We have asked FOT and FSSC to buffer the regular FASTCopy data deliveries to the ISOC starting at 4pm on Aug 15, so that we have time to clear our buffers. We will inform them when we are back online and ready to accept data.
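
The timing in the notes above can be checked with a short calculation. This is only a sketch: the times are the ones quoted above, while the assumption that the four PDU inspections run one after another is ours and is not stated in the plan.

```python
from datetime import datetime, timedelta

# Times quoted in the notes above (local time).
shutdown_start = datetime(2012, 8, 15, 17, 30)  # machines start going down
power_restored = datetime(2012, 8, 16, 6, 30)   # earliest expected restoration

# "Most services back in 2-4 hours" after power returns.
bringup_min = timedelta(hours=2)
bringup_max = timedelta(hours=4)

# ASSUMPTION: the four PDU inspections are done sequentially, ~45 min each.
pdus = ["PDU 40", "PDU 41", "PDU 42", "PDU 44"]
inspection_total = len(pdus) * timedelta(minutes=45)


def hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


outage_min = power_restored + bringup_min - shutdown_start
outage_max = power_restored + bringup_max - shutdown_start

print(f"Outage length: {hours(outage_min):.0f}-{hours(outage_max):.0f} hours")
print(f"PDU inspections (if sequential): {hours(inspection_total):.0f} hours")
```

Under these assumptions the outage spans roughly 15 to 17 hours, consistent with the ~16 hour estimate, and the inspections account for about three more hours before the pipeline is restarted.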

August 13, 2012

  • 09:30-11:00 AM: Pause again while glastlnx06 is switched back to its old power supply.

August 10, 2012

  • 10:00 AM-12:00 PM: Pause in production. glastlnx06 is moved to a new power supply while prep work for the outage of August 16 is done.

July 11, 2012

  • 11:00am - 1:00pm: Replacing a bad fan on sulky34. Since that server holds the LAT raw data, FASTCopy ingestion will be stopped about an hour beforehand to let the pipeline drain.
    Also, the remaining databases will be migrated off of glastlnx01/02 onto mysql-node01.

July 9, 2012

  • 9:30 - 11:00am: An internal disk on glast-oracle03, the host of the GLASTP database, is in danger of failing. This outage is to allow for its replacement.

June 12, 2012

  • 10am - 11:30am: migrating calib* and mood* databases from glastlnx01/02 to mysql-node03

May 10, 2012

  • [10am-12:30pm] Oracle quarterly update. This will affect pipeline, data catalog, flight operations and any other databases on the main Fermi Oracle server.

  • [10am-12:30pm] xroot server reboot for OS upgrade. This will affect all 36 of the wain (Solaris) xroot servers.

  • [10am-12:30pm] Fermi USER DISK (wain006) reboot for OS upgrade.

  • [9am-3pm] xroot file server move. This will affect only two xroot servers: wain070 and wain071.

  • [9am-3pm] NFS file server move. This will affect the following servers, which will be unplugged and physically moved to new rack space
    in building 50: sulky33, sulky34, sulky35, sulky36.
