Blog from February, 2014

Scientific Computing Services set up a GPFS-based NFS storage cluster for MCC and released it for production work on Tuesday 2/11/2014. This storage model can be administered with greater reliability and efficiency and is the first storage offering of its kind at SLAC.
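
As a rough sketch of how such a cluster is typically exposed to clients (the export path, client network, and options below are illustrative placeholders, not the actual MCC configuration), a GPFS filesystem mounted on the serving nodes can be exported through the standard Linux NFS server:

    # /etc/exports on a GPFS cluster node acting as an NFS server
    # (illustrative placeholders only, not the MCC configuration)
    # GPFS exports need an explicit fsid so clients get a stable filehandle
    /gpfs/mcc    192.0.2.0/24(rw,sync,no_subtree_check,fsid=745)

    # reload the export table and verify
    exportfs -ra
    exportfs -v

A clustered setup normally runs the same export on several GPFS nodes behind a failover mechanism, so clients keep their mounts if one server goes down.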

Scientific Computing Services met with representatives from Stanford, UCSC, and YFS, our support provider for AFS. AFS is a distributed file system used at many research and educational institutions. It provides a scalable, globally shared filesystem and is a core piece of underlying infrastructure for SLAC computing. The AFS presentation focused on current issues with AFS and on the features and enhancements that will be available in YFS later this year. YFS looks promising as a replacement for AFS at SLAC, and we will continue to evaluate this file system over the coming months.

Scientific Computing Services upgraded the network connections on 21 PPA virtual machines during a scheduled outage. The link aggregation feature increases total bandwidth and adds redundancy. The Fermi VM migration project is nearing completion, and several old Fermi hosts were also shut down for retirement. The power and rack space will be reclaimed for future installations.

Scientific Computing Services completed a cycle of applying security patches to the central AFS database servers, which control access to Unix home directories and group space for a number of scientific experiments. The work was completed with minimal disruption to the community and ensures that our systems comply with Cyber Safety policies at SLAC.

Scientific Computing Services worked with our AFS support vendor to implement recommendations based on the way AFS is being used at SLAC.  The AFS servers were reconfigured and tuned to improve performance for our scientific computing community.  
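
For a sense of what this kind of tuning involves (the options and values below are hypothetical examples, not the settings actually applied at SLAC), OpenAFS fileserver performance is usually adjusted through the arguments on the fs instance in BosConfig:

    # Illustrative OpenAFS fileserver tuning knobs; values are placeholders,
    # not the SLAC configuration. They appear on the fileserver command line
    # of the fs instance in /usr/afs/local/BosConfig:
    #
    #   fileserver -L -p 128 -udpsize 1048576 -rxpck 700 -cb 1500000
    #
    #   -L        use the "large" server defaults
    #   -p        number of worker threads
    #   -udpsize  UDP socket buffer size in bytes
    #   -rxpck    number of Rx packet buffers
    #   -cb       size of the callback table
    #
    # After editing BosConfig, restart the instance on the server:
    bos restart <fileserver-host> fs -localauth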

SCS bullet points for week ending 2014/02/07

Scientific Computing Services worked with the Networking team and with Cisco to resolve a networking issue that restricted the performance of virtualized services. The problem was identified as a link aggregation bug. SCS was able to update the firmware without requiring any outage and to confirm that the update fixed the issue. The link aggregation feature increases bandwidth and provides network redundancy for KVM Hypervisor hosts. Current KVM configurations include a number of Fermi compute services and systems in the public interactive Linux login pools.
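
For context, link aggregation on a Linux KVM host is commonly built with the kernel bonding driver in 802.3ad (LACP) mode and matched by a port channel on the Cisco switch side. The interface and bridge names below are assumptions for illustration, not the actual SLAC configuration:

    # /etc/sysconfig/network-scripts/ifcfg-bond0   (illustrative only)
    DEVICE=bond0
    BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast"
    # hypothetical bridge carrying the KVM guest traffic
    BRIDGE=br0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0   (first aggregated link;
    # eth1 is configured the same way)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none

With this arrangement the hypervisor keeps its network connection if either physical link or switch port fails, and guest traffic can use the combined bandwidth of both links.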

Scientific Computing Services completed expansion work on the SUNCAT Photon Science cluster.  The Dell 'blade' server architecture can provide up to 1024 cores in a single water-cooled rack.   A total of 656 Intel Sandy Bridge cores were added, increasing the compute capacity for SUNCAT researchers.

SCS bullet points for week ending 2014/01/31  

The Computing Division completed the relocation of 192 "hequ" batch systems from  Forsythe Hall on Stanford campus to Building 50 at SLAC.   With teamwork and planning that involved Teri Church, John Weisskopf, Ron Barrett, Yemi Adesanya and Renata Dart, the systems were moved and put into production ahead of schedule with minimal downtime for our scientific community.

Scientific Computing Services devoted significant effort to diagnosing and reporting a complex NFS data corruption bug affecting our science customers. By working with those customers and developing a method to reproduce the problem, SCS was able to determine that the bug existed only in newer versions of the Red Hat Linux kernel, enabling Red Hat engineers to submit a fix for the NFS kernel code. The fix will eventually make its way into the mainline kernel, benefiting SLAC and all Red Hat Linux customers.

Scientific Computing Services worked with our technical coordination team to restore services on more than 250 systems that were disrupted by the site-wide power "sag" on Monday, January 13. The power fluctuation caused a range of problems for hosts, including shutdowns, reboots, loss of network connectivity, and spurious messages on service processors that required power cycling to correct. The outage affected services for the ATLAS, Fermi and KIPAC science communities. SCS system administrators spent more than 25 hours restoring services for the lab.