Blog from January, 2014

Scientific Computing Services system administrators discovered that recently acquired Dell R720 servers were experiencing delays of up to 7 minutes while booting RHEL6. The R720 is the preferred hardware platform for several scientific computing services. SCS worked with Dell and Red Hat to resolve this server performance issue, which was traced to the RAID controller card. Dell provided a firmware update for the controller that eliminated the delay. This fix will enable servers to run with optimum efficiency.

Scientific Computing Services responded to an escalated request from MCC staff during the evening of Friday, January 17. Response time on a storage server had degraded due to a failed disk drive. SCS staff were able to address the immediate problem, and work is continuing to alleviate some of the long-standing problems with the arrays on this system. Maintaining adequate response time on this server is critical to MCC's mission.

The Computing Division completed the relocation of 192 "hequ" batch systems from Forsythe Hall on the Stanford campus to Building 50 at SLAC. With teamwork and planning that involved Teri Church, John Weisskopf, Ron Barrett, Yemi Adesanya and Renata Dart, the systems were moved and put into production ahead of schedule with minimal downtime for our scientific community.

Scientific Computing Services devoted significant effort to diagnosing and reporting a complex NFS data corruption bug affecting our science customers. By working with those customers and developing a method to reproduce the problem, SCS was able to determine that the bug existed only in newer versions of the Red Hat Linux kernel, which enabled Red Hat engineers to submit a fix for the NFS kernel code. The fix will eventually make its way into the mainstream kernel, benefiting SLAC and all Red Hat Linux customers.

Scientific Computing Services worked with our technical coordination team to restore services on more than 250 systems that were disrupted by the site-wide power "sag" on Monday, January 13. The power fluctuation caused a range of problems for hosts, including shutdowns, reboots, loss of network connectivity and spurious messages on service processors, which required power cycling to correct. The outage affected services for the ATLAS, Fermi and KIPAC science communities. SCS system administrators spent more than 25 hours restoring services for the lab.