Blog

Scientific Computing Services upgraded the batch environment software to the latest version, LSF 9.1.2. This version supports Linux cgroups, which enable batch jobs to be bound to the requested amount of CPU and memory, providing a more predictable and reliable batch processing environment. The batch configuration was also modified to reduce the outage time required for upgrades; batch queues no longer need to be drained for days before an upgrade.
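
For users who want to verify the enforcement on a running job, a minimal sketch along the following lines can be run from inside the job to report the memory limit applied to its cgroup. The cgroup v1 layout and the mount point under /sys/fs/cgroup are assumptions about the host configuration, not a description of the actual LSF setup.

    # Minimal sketch: report the memory limit applied to the calling
    # process's cgroup. Assumes a cgroup v1 hierarchy mounted under
    # /sys/fs/cgroup; the actual layout depends on the host and LSF setup.
    import os

    def current_memory_limit_bytes():
        # /proc/self/cgroup has one line per hierarchy, formatted as
        # "id:controllers:path", e.g. "4:memory:/lsf/job.12345".
        with open("/proc/self/cgroup") as f:
            for line in f:
                _, controllers, path = line.strip().split(":", 2)
                if "memory" in controllers.split(","):
                    limit_file = os.path.join("/sys/fs/cgroup/memory",
                                              path.lstrip("/"),
                                              "memory.limit_in_bytes")
                    with open(limit_file) as lf:
                        return int(lf.read().strip())
        return None

    if __name__ == "__main__":
        limit = current_memory_limit_bytes()
        if limit is None:
            print("no memory cgroup found for this process")
        else:
            print("memory limit: %.1f GB" % (limit / 1e9))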

Scientific Computing Services completed the upgrade of the PPA Lustre file system to the latest 2.5.1 release within a 5-day outage window. The upgrade involved an interim transfer of 100TB of user data to a temporary area, relocation of hardware to the first-floor server room to provide Infiniband access to the shared PPA "bullet" cluster, an upgrade of the Ethernet links to 10Gb for higher bandwidth, and a doubling of capacity from 170TB to 340TB. Additional RAID controllers combined with the networking upgrades have produced significant performance gains that have met or exceeded expectations. Along with the recent expansion of the bullet cluster from 3000 to 4600 cores, this parallel file system will enable more users to switch from local scratch space to shared space in the Lustre storage environment.

Scientific Computing Services completed the first phase of its High Performance Storage System (HPSS) software upgrade.  HPSS was upgraded from 7.3.3.p8 to 7.3.3.p9a, the DB2 database from 9.5.9 to 9.7.8, and RHEL from 5.8 to 5.9. These intermediate software versions provide important stepping stones for the testing and deployment of the production versions in a few weeks.  When finalized, this upgrade will deliver functional and stability enhancements for this critical HPSS service which houses several petabytes of SLAC scientific data.

Scientific Computing Services met with representatives from Dell to discuss upcoming enhancements to their disk storage line.  We will consider using the newly announced MD3460 disk array in place of the MD3260 in future storage purchases to provide higher performance for science experiments.

In response to reports generated by the SLAC Patch and Vulnerability Group (PVG), Scientific Computing Services addressed issues related to Linux and Solaris systems. This included research and detailed reporting on recent security scan results, determining the underlying software dependencies that were blocking the installation of some patches, and developing a monitoring solution on the Red Hat Network Satellite Server to detect a database problem that was preventing some systems from installing updates. This continued effort improves the security of the systems supporting the computing infrastructure and research mission at SLAC.

Scientific Computing Services completed the project to increase the PPA shared "bullet" cluster from 2864 to 4512 cores. All hosts are connected to the SLAC network via 10Gb Ethernet, and a dedicated 40Gb Infiniband network is available for parallel applications. This cluster is a major part of the public batch compute farm, available to all users with Unix accounts, and it is the only parallel cluster available for general use at SLAC.

Scientific Computing Services set up a GPFS-based NFS storage cluster for MCC and released it for production work on Tuesday 2/11/2014. This storage model can be administered and managed with greater reliability and efficiency and will serve as the first storage offering of its kind for SLAC.

Scientific Computing Services met with representatives from Stanford, UCSC, and YFS, our support provider for AFS. AFS is a distributed file system used at many research and educational institutions; it provides a scalable, globally shared filesystem and is a core piece of underlying infrastructure for SLAC computing. The AFS presentation focused on current issues with AFS, as well as features and enhancements that will be available in YFS later this year. YFS looks promising as a replacement for AFS at SLAC, and we will continue to evaluate this file system over the coming months.

Scientific Computing Services upgraded the network connections on 21 PPA virtual machines during a scheduled outage. The link aggregation feature increases total bandwidth and adds redundancy. The Fermi VM migration project is nearing completion, and several old Fermi hosts were also shut down for retirement; their power and rack space will be reclaimed for future installations.

Scientific Computing Services completed a cycle of applying security patches for the central AFS database servers, which handle access and control of Unix home directories and group space for a number of scientific experiments. The work was completed with minimal disruption to the community and ensures that our systems are in compliance with Cyber Safety policies at SLAC.

Scientific Computing Services worked with our AFS support vendor to implement recommendations based on the way AFS is being used at SLAC.  The AFS servers were reconfigured and tuned to improve performance for our scientific computing community.  

SCS bullet points for week ending 2014/02/07

Scientific Computing Services worked with the Networking team and with Cisco to resolve a networking issue that restricted the performance of virtualized services. The problem was identified as a link aggregation bug. SCS was able to update the firmware without requiring any outage and confirmed that the update fixed the issue. The link aggregation feature increases bandwidth and provides network redundancy for KVM hypervisor hosts. Current KVM configurations include a number of Fermi compute services and systems in the public interactive Linux login pools.
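
As an illustration of the kind of check involved, the kernel exposes the state of a bonded link under /proc/net/bonding/<name>; the short sketch below parses that file to list the bonding mode and the status of each member link. The interface name bond0 is a placeholder for whatever the host's bonding configuration actually defines.

    # Minimal sketch: summarize a bonded (link-aggregated) interface by
    # parsing /proc/net/bonding/<name>. "bond0" is a placeholder name.
    import sys

    def bond_summary(name="bond0"):
        mode = None
        slaves = []
        current = None
        with open("/proc/net/bonding/%s" % name) as f:
            for line in f:
                line = line.strip()
                if line.startswith("Bonding Mode:"):
                    mode = line.split(":", 1)[1].strip()
                elif line.startswith("Slave Interface:"):
                    current = line.split(":", 1)[1].strip()
                elif line.startswith("MII Status:") and current:
                    slaves.append((current, line.split(":", 1)[1].strip()))
                    current = None
        return mode, slaves

    if __name__ == "__main__":
        name = sys.argv[1] if len(sys.argv) > 1 else "bond0"
        mode, slaves = bond_summary(name)
        print("mode: %s" % mode)
        for iface, status in slaves:
            print("  %s: %s" % (iface, status))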

Scientific Computing Services completed expansion work on the SUNCAT Photon Science cluster.  The Dell 'blade' server architecture can provide up to 1024 cores in a single water-cooled rack.   A total of 656 Intel Sandy Bridge cores were added, increasing the compute capacity for SUNCAT researchers.

SCS bullet points for week ending 2014/01/31  

The Computing Division completed the relocation of 192 "hequ" batch systems from Forsythe Hall on the Stanford campus to Building 50 at SLAC. With teamwork and planning that involved Teri Church, John Weisskopf, Ron Barrett, Yemi Adesanya and Renata Dart, the systems were moved and put into production ahead of schedule with minimal downtime for our scientific community.

Scientific Computing Services devoted significant effort to diagnosing and reporting a complex NFS data corruption bug affecting our science customers. By working with those customers and developing a method to reproduce the problem, SCS was able to determine that the bug existed only in newer versions of the Red Hat Linux kernel, enabling Red Hat engineers to submit a fix for the NFS kernel code. The fix will eventually make its way into the mainstream kernel, benefiting SLAC and all Red Hat Linux customers.
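
The reproducer itself was specific to the affected workload, but the general shape of such a check is straightforward: write a known pattern to a file on the NFS mount, read it back, and compare checksums, ideally reading from a second client or after the client cache has been dropped. The sketch below illustrates the idea; the mount point and file sizes are placeholders, not the actual test case.

    # Minimal sketch of a write/read-back consistency check of the kind used
    # to chase data-corruption problems on a network file system. The mount
    # point and sizes are placeholders; in practice the read-back step was
    # arranged so that it did not come from the local client cache.
    import hashlib
    import os

    MOUNT = "/nfs/testarea"   # placeholder path on the NFS mount under test
    BLOCK = 1 << 20           # 1 MiB blocks
    BLOCKS = 256              # 256 MiB test file

    def write_and_checksum(path):
        h = hashlib.sha1()
        with open(path, "wb") as f:
            for _ in range(BLOCKS):
                block = os.urandom(BLOCK)
                h.update(block)
                f.write(block)
        return h.hexdigest()

    def read_and_checksum(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()

    if __name__ == "__main__":
        path = os.path.join(MOUNT, "nfs_check.dat")
        written = write_and_checksum(path)
        read_back = read_and_checksum(path)
        print("match" if written == read_back
              else "MISMATCH: wrote %s, read %s" % (written, read_back))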

Scientific Computing Services worked with our technical coordination team to restore services on more than 250 systems that were disrupted by the site-wide power "sag" on Monday, January 13. The power fluctuation caused a range of problems for hosts, including shutdowns, reboots, loss of network connectivity, and spurious messages on service processors that required power cycling to correct. The outage affected services for the ATLAS, Fermi and KIPAC science communities. SCS system administrators spent more than 25 hours restoring services for the lab.

Scientific Computing Services system administrators discovered that recently acquired Dell R720 servers were experiencing delays of up to 7 minutes while booting RHEL6.  The R720 is the preferred hardware platform for several scientific computing services. SCS worked with Dell and Red Hat to resolve this server performance issue which was traced to the RAID controller card. Dell provided a firmware update for the controller which eliminated the delay.  This fix will enable servers to run with optimum efficiency.

Scientific Computing Services responded to an escalated request from MCC staff during the evening of Friday, January 17. Response time on a storage server had degraded due to a failed disk drive. SCS staff addressed the immediate problem, and work is continuing to alleviate some of the long-standing problems with arrays on this system. Maintaining adequate response time on this server is critical to MCC's mission.

Scientific Computing Services continues to make improvements to the batch service. The production LSF configuration now supports the NVIDIA GPUs in the SUNCAT Photon Science cluster. The job scheduler will now track GPU and CPU usage independently, increasing the overall utilization of the cluster.
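
For users of the cluster, requesting a GPU alongside CPU slots is done at submission time. The sketch below builds a bsub command from Python; the queue name and the "ngpus" resource string are assumptions standing in for whatever names the site's LSF GPU configuration actually defines.

    # Minimal sketch: submit a GPU job to LSF from Python via bsub. The
    # queue name and the "ngpus" resource are placeholders for the names
    # defined in the site's LSF GPU configuration.
    import subprocess

    def submit_gpu_job(command, gpus=1, cpus=1, queue="suncat-gpu"):
        bsub = ["bsub",
                "-q", queue,                        # hypothetical queue name
                "-n", str(cpus),                    # CPU slots
                "-R", "rusage[ngpus=%d]" % gpus,    # site-defined GPU resource
                command]
        return subprocess.check_output(bsub)

    if __name__ == "__main__":
        print(submit_gpu_job("./gpu_benchmark", gpus=1, cpus=4))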

Scientific Computing Services installed a recompiled OpenMPI package on the PPA "bullet" and "orange" clusters. This upgrade provides tighter integration between LSF and the OpenMPI runtime environment and makes it easier to run parallel compute jobs with LSF.
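
A quick way to exercise the integration is a trivial parallel smoke test. The sketch below uses mpi4py, which is an assumption about the installed Python stack; each rank reports where it landed, which is enough to confirm that the job received the slots it requested.

    # Minimal MPI smoke test, assuming mpi4py is installed alongside the
    # recompiled OpenMPI. Each rank prints its rank, the communicator size,
    # and the host it is running on.
    from mpi4py import MPI
    import socket

    comm = MPI.COMM_WORLD
    print("rank %d of %d on %s" % (comm.Get_rank(), comm.Get_size(),
                                   socket.gethostname()))

Submitted as a parallel LSF job (for example, bsub -n 64 mpirun python mpi_hello.py), the tighter integration means mpirun can pick up its host list from the LSF allocation rather than requiring an explicit machine file.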

Scientific Computing Services held its first Town Hall meeting on Unix Services on Thursday, December 12.   The meeting was attended by approximately 30 people from around the Lab.   The agenda included an overview of SCS organization, recent activities, plans for the next 1-3 months, and a lightning talk on "Serf - A lightweight ad-hoc event system for batch job coordination".   This event was held to promote communication, cooperation and collaboration among scientific computing professionals at SLAC.

Database server "glast-astro-db2" is now online

SCS bullet points for week ending 2013/12/13

Scientific Computing Services provisioned 175TB of additional NFS storage for the Dark Energy Survey (DES) group.

Scientific Computing Services has upgraded OpenAFS from 1.6.1 to 1.6.5 on Red Hat Enterprise Linux 6, providing the SLAC community with the latest stable version of AFS and addressing some known bugs.

Scientific Computing Services is now running LSF version 9.1.1.1 in a test environment. This release provides new features which may benefit MPI users.

SCS bullet points for week ending 2013/12/06 

Scientific Computing Services manages several GPU compute clusters.  SCS recognizes that although many customers cannot purchase their own clusters outright, they still want to investigate porting applications to GPUs.   To build support for a shared GPU cluster, SCS worked with Deborah Bard, an LSST cosmologist, to organize a meeting to discuss GPU computing.  SCS staff and Deborah will write a proposal to encourage vendors to donate hardware for a shared GPU cluster at SLAC.

Scientific Computing Services provided added flexibility to the ERP disk backup storage system by using the Logical Volume Manager (LVM) on Red Hat Linux to reconfigure the storage partition. This change will enable dynamic resizing and deployment of space to satisfy future ERP requirements.
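
As an illustration of what the change makes possible, growing an LVM-backed volume and its filesystem can now be done in place. The sketch below wraps the standard lvextend command; the volume group and logical volume names are placeholders, and the command requires root privileges.

    # Minimal sketch: grow an LVM logical volume and its filesystem online.
    # The volume group and logical volume names are placeholders; the "-r"
    # flag asks lvextend to resize the filesystem together with the volume.
    # Must be run with root privileges.
    import subprocess

    def grow_volume(vg="vg_backup", lv="lv_erp", extra="500G"):
        device = "/dev/%s/%s" % (vg, lv)
        subprocess.check_call(["lvextend", "-r", "-L", "+%s" % extra, device])

    if __name__ == "__main__":
        grow_volume()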

SCS bullet points for week ending 2013/11/22

In response to MPI problems on the bullet cluster, Scientific Computing Services staff completed work with customer groups and IBM/Platform to address priority and resource allocation issues. Since the configuration was modified, 'bulletmpi' jobs typically account for more than 50% of the bullet cluster load, and users have confirmed that their jobs now obtain the resources they request. Long-term goals for the batch service may include live job migration using VMs and Linux cgroups for restricting CPU and memory usage. These changes are designed to provide optimal use of the bullet cluster for scientific computing customers at the Lab.

Scientific Computing Services worked with the Networking team to deploy link aggregation for the Unix infrastructure servers housed in the Building 50 High Availability (HA) rack.   The HA rack was installed in late 2012 to provide generator-backed power for critical infrastructure services.   To provide network redundancy, it was necessary to reconfigure servers to be able to use an alternate route in the event of a switch failure.   Over the course of many weeks, more than 70 Unix servers were reconfigured with minimal impact to the SLAC computing community.   The completion of this task provides an additional enhancement to computing service reliability for the Lab.