Blog from March, 2014

Scientific Computing Services upgraded the batch environment software to the latest version, LSF 9.1.2. This version supports Linux cgroups, which enable batch jobs to be bound to the amount of CPU and memory they request, providing a more predictable and reliable batch processing environment. The batch configuration was also modified to reduce the outage time required for upgrades: batch queues no longer need to be drained for days before an upgrade.
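For readers curious about the mechanism, the sketch below shows roughly what cgroup-based confinement looks like at the kernel level. It is only an illustration in Python, not LSF's actual implementation; the cgroup v1 mount point under /sys/fs/cgroup and the per-job directory naming are assumptions and vary by Linux distribution.

```python
# Minimal sketch of cgroup (v1) enforcement of a job's CPU and memory request.
# This is not LSF's implementation; it only illustrates the kernel interface.
# Paths assume cgroup v1 controllers mounted under /sys/fs/cgroup.
import os

CPU_ROOT = "/sys/fs/cgroup/cpu"
MEM_ROOT = "/sys/fs/cgroup/memory"

def confine_job(job_id, pid, ncores, mem_bytes):
    """Create per-job cgroups, set CPU and memory limits, and add the job's PID."""
    cpu_dir = os.path.join(CPU_ROOT, "job_%s" % job_id)  # hypothetical naming
    mem_dir = os.path.join(MEM_ROOT, "job_%s" % job_id)
    os.makedirs(cpu_dir)
    os.makedirs(mem_dir)

    # CPU bandwidth: allow ncores worth of CPU time per 100 ms period.
    with open(os.path.join(cpu_dir, "cpu.cfs_period_us"), "w") as f:
        f.write("100000")
    with open(os.path.join(cpu_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(100000 * ncores))

    # Memory ceiling for the whole job.
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_bytes))

    # Move the job's process (and, via inheritance, its children) into the cgroups.
    for d in (cpu_dir, mem_dir):
        with open(os.path.join(d, "tasks"), "w") as f:
            f.write(str(pid))
```

Once a job is confined this way, it cannot consume more CPU time or memory than it asked for, which is what makes the shared batch nodes more predictable for everyone else.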

Scientific Computing Services completed the upgrade of the PPA Lustre file system to the latest 2.5.1 release within a 5-day outage window. The upgrade involved an interim transfer of 100TB of user data to a temporary area, relocation of hardware to the first-floor server room to provide Infiniband access to the shared PPA "bullet" cluster, an improvement in Ethernet bandwidth through a switch to 10Gb network links, and a doubling of capacity from 170TB to 340TB. Additional RAID controllers combined with the networking upgrades have produced significant performance gains that have met or exceeded expectations. Along with the recent expansion of the bullet cluster from 3000 to 4600 cores, this parallel file system will enable more users to switch from local scratch space to shared space in the Lustre storage environment.

Scientific Computing Services completed the first phase of its High Performance Storage System (HPSS) software upgrade. HPSS was upgraded from 7.3.3.p8 to 7.3.3.p9a, the DB2 database from 9.5.9 to 9.7.8, and RHEL from 5.8 to 5.9. These intermediate software versions are important stepping stones for the testing and deployment of the production versions in a few weeks. When finalized, this upgrade will deliver functional and stability enhancements for the critical HPSS service, which houses several petabytes of SLAC scientific data.

Scientific Computing Services met with representatives from Dell to discuss upcoming enhancements to their disk storage line.  We will consider using the newly announced MD3460 disk array in place of the MD3260 in future storage purchases to provide higher performance for science experiments.

In response to reports generated by the SLAC Patch and Vulnerability Group (PVG), Scientific Computing Services addressed issues related to Linux and Solaris systems. This included research and detailed reporting on recent security scan results, determining the underlying software dependencies that were blocking the installation of some patches, and developing a monitoring solution on the Red Hat Network Satellite Server to detect a database problem that was preventing some systems from installing updates. This continued effort improves the security of the systems supporting the computing infrastructure and research mission at SLAC.
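As a rough illustration of the kind of monitoring check involved, the sketch below flags Satellite-registered systems that appear to have stopped checking in for updates. It is hypothetical: the server URL, the monitoring account, and the exact fields returned by the Satellite/Spacewalk XML-RPC API are assumptions, and the solution actually deployed may look quite different.

```python
# Hypothetical probe for spotting hosts that have stopped receiving updates
# from the Satellite server. Server URL, credentials, and the "last_checkin"
# field name are assumptions for illustration only.
import xmlrpc.client
from datetime import datetime, timedelta

SATELLITE_URL = "https://satellite.example.com/rpc/api"  # hypothetical host
USER, PASSWORD = "monitor", "secret"                     # hypothetical account
STALE = timedelta(days=2)

client = xmlrpc.client.ServerProxy(SATELLITE_URL, use_datetime=True)
key = client.auth.login(USER, PASSWORD)
try:
    now = datetime.utcnow()
    for system in client.system.listSystems(key):
        # Flag any registered system whose last check-in is older than the threshold.
        if now - system["last_checkin"] > STALE:
            print("WARNING: %s last checked in %s" % (system["name"], system["last_checkin"]))
finally:
    client.auth.logout(key)
```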

Scientific Computing Services completed the project to expand the PPA shared "bullet" cluster from 2864 to 4512 cores. All hosts are connected to the SLAC network via 10Gb Ethernet, and there is a dedicated 40Gb Infiniband network for parallel applications. This cluster is a major part of the public batch compute farm, available to all users with Unix accounts. It is also the only parallel cluster available for general use at SLAC.
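To give a sense of the parallel applications this hardware is intended for, below is a minimal MPI example using mpi4py (chosen purely for illustration; no claim is made about which MPI stacks are installed on the cluster). Each rank reports the host it runs on, and rank 0 prints a summary; a launcher such as mpirun would start the ranks across nodes, with the Infiniband network carrying the inter-node communication.

```python
# Minimal MPI example of the kind of multi-node parallel job the bullet
# cluster's Infiniband fabric supports. Assumes mpi4py and an MPI runtime
# are available (an assumption, not a statement about installed software).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank reports its hostname; rank 0 gathers them to show that the
# job really is spread across multiple nodes.
hostname = MPI.Get_processor_name()
names = comm.gather(hostname, root=0)

if rank == 0:
    print("Job ran on %d ranks across hosts: %s" % (size, sorted(set(names))))
```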