Blog from July, 2013

On July 19, 2013, at about 8:26 AM, Computing Division staff detected a loss of cooled water to Building 50. With temperatures rising in the machine room, Scientific Computing Services staff responded quickly, powering down about 700 batch servers at around 8:45 AM. Services were restored by about 11:15 AM. Shutting down the servers averted problems that might otherwise have developed with components inside the systems.

Scientific Computing Services upgraded the batch RTM (Real Time Monitoring) utility to the latest version. The upgraded RTM works with the current production version of LSF (Load Sharing Facility) and will continue to work when the LSF software is upgraded to version 9.1. RTM provides scientific computing customers with a visual representation of the state of the batch queues.

Scientific Computing Services staff completed the relocation of more than 25 machines to create more contiguous open rack space. Moving the file, database, and infrastructure servers involved coordination with Networking, Data Center Operations, and our scientific customers. The result is more capacity in Building 50 for new systems that arrive before the new Stanford Research Computing Facility (SRCF) opens in January 2014.

Scientific Computing Services has implemented a deployment pipeline for efficient hardware installations. The pipeline includes a standard for hardware acquisitions, IPMI for remote server management, and serial-over-LAN for console access and logging. SCS trained Computing Division technical coordinator staff to handle the initial server BIOS setup and console configuration, which streamlines the installation process and speeds up the deployment of new systems for Scientific Computing.
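
As a rough illustration of how IPMI and serial-over-LAN fit into such a pipeline, the sketch below builds the ipmitool command lines an installer might run against a server's management controller. It is not the SCS procedure: the host name, the user name, the PXE boot step, and the order of the steps are all assumptions made for the example. The commands are only printed, never executed, and the `-E` flag tells ipmitool to read the password from the `IPMI_PASSWORD` environment variable.

```python
import shlex


def ipmi_command(bmc_host, user, *subcommand):
    """Build (but do not run) an ipmitool argument list for one BMC.

    Uses the IPMI v2.0 "lanplus" interface; -E makes ipmitool read the
    password from the IPMI_PASSWORD environment variable.
    """
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
            *subcommand]


def deployment_steps(bmc_host, user="admin"):
    """Return a hypothetical sequence of remote-management steps.

    The steps and their order are assumptions for illustration only.
    """
    return [
        # Check whether the chassis is currently powered on.
        ipmi_command(bmc_host, user, "chassis", "power", "status"),
        # Ask for a network (PXE) boot on the next start (assumed step).
        ipmi_command(bmc_host, user, "chassis", "bootdev", "pxe"),
        # Power the server on remotely.
        ipmi_command(bmc_host, user, "chassis", "power", "on"),
        # Attach to the serial-over-LAN console to watch the boot.
        ipmi_command(bmc_host, user, "sol", "activate"),
    ]


if __name__ == "__main__":
    # "node001-bmc.example.org" is a placeholder host name.
    for cmd in deployment_steps("node001-bmc.example.org"):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if they are later passed to `subprocess.run`.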

Scientific Computing Services worked with the PPA directorate and with Facilities to negotiate a date for decommissioning the "black box" batch systems. These machines were purchased in 2007 and are housed in specially equipped shipping containers outside of Building 50. The chiller that cools these systems failed after the May 30 power outage. Shutting these systems down on July 15 will lower our overall power consumption and save the lab $13,600 in repair costs that would otherwise have been spent on an aging outdoor computing facility.