Blog

Scientific Computing Services replaced older hardware and installed two new, more powerful systems for scheduling batch jobs. The new production and backup masters improve overall performance in the batch environment for scientific computing.

Scientific Computing Services deployed a newly available software feature in LSF 9.1, called CPU affinity, on the batch production master. SCS staff also determined the cause of a problem that was resulting in lengthy wait times for pending jobs. As a result, queries are more responsive and dispatch times are shorter.
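For readers less familiar with the feature, the sketch below shows roughly how a job could request core binding once CPU affinity is enabled. It is an illustration only; the queue name and command are placeholders rather than actual SLAC settings.

```python
# Minimal sketch, assuming LSF 9.1 with affinity scheduling enabled on the
# execution hosts. The queue name and the command being run are hypothetical.
import subprocess

def submit_with_affinity(cores_per_task=1, queue="general", command="./my_analysis"):
    """Submit a job whose tasks are each bound to dedicated CPU cores."""
    bsub_cmd = [
        "bsub",
        "-q", queue,                                         # placeholder queue name
        "-R", "affinity[core({0})]".format(cores_per_task),  # LSF affinity resource string
        command,
    ]
    subprocess.check_call(bsub_cmd)

if __name__ == "__main__":
    submit_with_affinity(cores_per_task=2)
```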

Scientific Computing Services improved the logging and monitoring of cyber security events on centrally managed systems. This increases our security and our ability to meet DOE audit requirements.

Scientific Computing Services recently deployed a 150TB disk storage server for KIPAC. This server will provide Dark Sky Simulation data via the web.

Scientific Computing Services staff reconfigured the setup for running full incremental backups on larger file systems and for some Oracle RMAN backups.  This has improved the overall performance for TSM backups in the computing environment.

Scientific Computing Services staff is evaluating dispatch issues in the shared "bullet" cluster and is working to educate users to submit batch jobs with appropriate options.  SCS has documented a set of hints for batch job submission on the SCS web site and has disseminated them through a number of distribution lists.  Helping users understand how to submit jobs more efficiently will mitigate delays and long pending times.
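As an illustration of the kind of hints documented there, the sketch below shows a submission that tells the scheduler up front how many slots, how much memory, and how much run time a job needs. The queue name, memory figure, and job script are hypothetical examples, not the published SLAC recommendations.

```python
# Illustrative only: supplying resource hints at submission time so LSF can
# place the job quickly instead of leaving it pending. All values are examples.
import subprocess

cmd = [
    "bsub",
    "-q", "medium",                          # pick a queue that matches the expected run time
    "-n", "4",                               # request only the slots the job really needs
    "-W", "08:00",                           # realistic wall-clock limit (HH:MM)
    "-R", "rusage[mem=2000] span[hosts=1]",  # per-process memory estimate, single-host placement
    "-o", "job_%J.out",                      # capture output per job ID
    "run_step.sh",                           # hypothetical job script
]
subprocess.check_call(cmd)
```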

Scientific Computing Services purchased two new machines to handle the growth of the LSF batch computing environment. These systems are being tested using the new CPU affinity capability featured in LSF version 9. When deployed in the production environment, this upgrade will provide faster and more efficient batch service for scientific computing.

Scientific Computing Services supplied a chassis for a failing file system needed by BaBar. As a result of this intervention, the BaBar analyst was able to finish running jobs in time to meet a critical deadline. She presented the results at an international conference at the end of May and will publish a paper soon.

Scientific Computing Services staff rearranged tape backups on its existing backup server to better accommodate the heavy load from PPA and LCLS UNIX servers.  In addition, a second backup server was set up to split off some of the load.  These changes provide faster and more reliable backups for science data.

Scientific Computing Services updated OS and application software for PPA and Photon Science GPU computing. The KIPAC cluster has 3 NVIDIA "Tesla 2070" GPUs and the SIMES cluster has 24 NVIDIA "Tesla 2050" GPUs.  All systems were upgraded to RHEL 6.5 with the latest CUDA GPU driver and API.  These upgrades enable users to submit single-slot and parallel jobs to the clusters via the central batch system.

Scientific Computing Services extended the new LSF cgroups feature on RHEL 6 systems to the general queues, after running it without incident in the ATLAS queues.  Cgroups are designed to manage the resources in our batch cluster, preventing the inadvertent overuse of resources by some jobs and improving the overall throughput of LSF.
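To make the idea concrete, the sketch below shows how a process inside a batch job could read the memory ceiling its cgroup imposes. The /cgroup/memory mount point is an assumption typical of RHEL 6 and may differ from the actual layout on the batch hosts.

```python
# A small sketch of what cgroup enforcement looks like from inside a job:
# the process can read the memory limit of the cgroup it was placed in.
# The mount point below is an assumed RHEL 6 default, not a confirmed SLAC path.
CGROUP_MEMORY_ROOT = "/cgroup/memory"

def my_memory_limit_bytes():
    """Return the memory limit (in bytes) of this process's cgroup, or None."""
    with open("/proc/self/cgroup") as f:
        for line in f:
            # lines look like "<hierarchy-id>:<controllers>:<cgroup path>"
            _, controllers, path = line.strip().split(":", 2)
            if "memory" in controllers.split(","):
                with open(CGROUP_MEMORY_ROOT + path + "/memory.limit_in_bytes") as lf:
                    return int(lf.read().strip())
    return None

if __name__ == "__main__":
    print("cgroup memory limit:", my_memory_limit_bytes())
```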

As a result of continuous monitoring of services, Scientific Computing Services and Enterprise Applications staff noticed a dramatic increase in the number of external attempts to break into SLAC computers. SCS staff made an immediate change to block these off-site SSH attempts on infrastructure servers, including web servers. By responding quickly through the use of our centralized management software, SCS mitigated this cyber security threat for the Lab.

Scientific Computing Services staff investigated a number of unrelated problems for several different research groups, including FGST, SIMES, and LCLS, as well as in the shared "bullet" computing cluster. During a very busy two-week period, SCS proactively upgraded all of the FGST "xrootd" servers to the latest Linux kernel and resolved problems in the other service areas, in an effort to provide sustained service for the science mission.

Scientific Computing Services and Operations staff replaced the chassis of a failing machine for FGST with SCS equipment. The FGST system was offline for several days due to a series of hardware errors. The improvised solution enabled FGST to access 134TB of data and to continue an important reprocessing effort for the experiment.

Scientific Computing Services and Data Center support staff worked with customer groups to relocate a number of servers in Building 50.  The systems were removed from the old row 40 on the 1st floor in preparation for the build-out of two new rows to accommodate the expansion of LCLS.

Scientific Computing Services and Oracle Corp. upgraded firmware on our large tape libraries to enable new monitoring software and allow the use of advanced T10000D tape drives that store 60% more data per tape than our current drives. The ability to purchase T10000D tape drives will decrease tape storage costs to our science customers.

OCIO staff have been working with customers to migrate 142 servers from old Cisco 6500 switches that have reached end of life to newer Top-of-Rack Nexus switches.  This work, which is part of our lifecycle management process, ensures better networking reliability and redundancy and also improves network performance for our research computing environment.

In response to reported connectivity problems between a Fermi file server and batch machines, Scientific Computing Services applied an operating system patch to the affected server and to a number of other hosts. This update has improved connectivity for the Fermi production system.

Scientific Computing Services staff worked with our user community to discover and report a file corruption issue in our batch processing environment. The Red Hat vendor provided a solution which has now been deployed on approximately 1000 RHEL6 general batch farm systems, mitigating the problem.

Scientific Computing Services held a second Town Hall Meeting for Unix Users on Thursday April 10.  In addition to providing updates on our Unix, Storage and HPC services, we had discussions on planning computing outages and managing the batch environment. We are polling the community to schedule two more Town Hall meetings this year.

Computing Division staff completed the migration of 55 servers from an old Cisco 6500 network switch to newer Cisco Nexus Top-of-Rack switches.   This project involved months of coordination and communication with teams in Networking, Operations, and Scientific Computing Services, as well as with our customers.  The result is the decommissioning of a switch that had reached end-of-service-life and a reduced risk of failure in our networking infrastructure.

Scientific Computing Services implemented Network File System (NFS) monitoring on each node of the batch cluster to more quickly identify specific hosts that are not processing jobs properly.  This enables SCS to shut down problem nodes promptly and prevent batch jobs from failing.
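The sketch below gives a rough idea of such a per-node check (it is not the production monitor): each NFS mount is probed in a child process so that a hung mount cannot wedge the check itself, and any mount that fails to answer within a timeout is reported so the node can be closed.

```python
# Minimal sketch of an NFS responsiveness probe on a batch node. Mount points
# are read from /proc/mounts; the timeout value is an assumption, not the
# production setting.
import multiprocessing
import os

PROBE_TIMEOUT = 10  # seconds (assumed)

def _probe(path):
    os.statvfs(path)  # blocks indefinitely if the NFS mount is hung

def nfs_mounts():
    """Yield mount points whose filesystem type is nfs or nfs4."""
    with open("/proc/mounts") as f:
        for line in f:
            _, mountpoint, fstype = line.split()[:3]
            if fstype in ("nfs", "nfs4"):
                yield mountpoint

def unresponsive_mounts():
    bad = []
    for mountpoint in nfs_mounts():
        proc = multiprocessing.Process(target=_probe, args=(mountpoint,))
        proc.start()
        proc.join(PROBE_TIMEOUT)
        if proc.is_alive():        # probe is still stuck after the timeout
            proc.terminate()
            bad.append(mountpoint)
    return bad

if __name__ == "__main__":
    for mountpoint in unresponsive_mounts():
        print("NFS mount not responding:", mountpoint)
```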

Scientific Computing Services improved the configuration of the SLAC hosts used for File Transfer Protocol (FTP) file distribution service.  Anonymous access was modified to protect information and authenticated access was discontinued.   Both changes improve the security of the FTP service at SLAC.