Blog from June 2014

Scientific Computing Services supplied a replacement chassis for a failing file system needed by BaBar. As a result of this intervention, the BaBar analyst was able to finish running her jobs in time to meet a critical deadline. She presented the results at an international conference at the end of May and will publish a paper soon.

Scientific Computing Services staff rearranged tape backups on the existing backup server to better accommodate the heavy load from PPA and LCLS UNIX servers. In addition, a second backup server was set up to take on part of that load. These changes provide faster and more reliable backups of science data.

Scientific Computing Services updated OS and application software for PPA and Photon Science GPU computing. The KIPAC cluster has 3 NVIDIA "Tesla 2070" GPUs and the SIMES cluster has 24 NVIDIA "Tesla 2050" GPUs. All systems were upgraded to RHEL 6.5 with the latest CUDA GPU driver and API. These upgrades enable users to submit single-slot and parallel jobs to the clusters via the central batch system.
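As a rough illustration of the new capability, the Python sketch below submits a single-slot GPU job through an LSF-style bsub command. The queue name "gpu", the output file pattern, and the executable name are assumptions made for the example; the actual queue names and GPU resource syntax are defined by the local batch configuration, and a parallel job would simply request more slots with -n.

    import subprocess

    # Illustrative single-slot GPU job submission via LSF's bsub.
    # The queue name and executable are placeholders, not the real
    # SLAC configuration; a parallel job would raise the -n slot count.
    cmd = [
        "bsub",
        "-q", "gpu",             # assumed name of the GPU queue
        "-n", "1",               # single slot
        "-o", "gpu_job.%J.out",  # job output file (%J = LSF job ID)
        "./my_cuda_app",         # placeholder CUDA executable
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())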

Scientific Computing Services extended the new LSF cgroups feature on RHEL 6 systems to the general queues, after running it without incident in the ATLAS queues. Cgroups (Linux control groups) let LSF enforce per-job resource limits in our batch cluster, preventing some jobs from inadvertently overusing resources and improving the overall throughput of LSF.
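To illustrate the underlying mechanism (not LSF's actual integration), the Python sketch below places a process into a memory-limited cgroup using the cgroup v1 filesystem found on RHEL 6. The mount point and group name are assumptions, and the script requires root privileges.

    import os

    # Illustration of the kernel mechanism behind LSF's cgroup support:
    # confine the current process to a memory cgroup with a hard limit,
    # so it cannot exhaust the memory of a shared batch node.
    # Assumes a cgroup v1 memory hierarchy mounted at /cgroup/memory
    # (typical RHEL 6 layout) and root privileges.
    CGROUP_ROOT = "/cgroup/memory"                 # assumed mount point
    GROUP = os.path.join(CGROUP_ROOT, "demo_job")  # hypothetical group

    os.makedirs(GROUP, exist_ok=True)

    # Cap memory use of tasks in this group at 512 MB.
    with open(os.path.join(GROUP, "memory.limit_in_bytes"), "w") as f:
        f.write(str(512 * 1024 * 1024))

    # Move the current process into the group; allocations beyond the
    # limit now affect only this group, not the rest of the node.
    with open(os.path.join(GROUP, "tasks"), "w") as f:
        f.write(str(os.getpid()))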

Through continuous monitoring of services, Scientific Computing Services and Enterprise Applications staff noticed a dramatic increase in the number of external attempts to break into SLAC computers. SCS staff made an immediate change to block these off-site ssh attempts on infrastructure servers, including web servers. By responding quickly through our centralized management software, SCS mitigated this cybersecurity threat for the Lab.

Scientific Computing Services staff investigated a number of unrelated problems for several research groups, including FGST, SIMES, and LCLS, as well as problems in the shared "bullet" computing cluster. During a very busy two-week period, SCS proactively upgraded all of the FGST xrootd servers to the latest Linux kernel and resolved the problems in the other service areas, in an effort to provide sustained service for the science mission.

Scientific Computing Services and Operations staff replaced the chassis of a failing FGST machine with SCS equipment. The FGST system had been offline for several days due to a series of hardware errors. The improvised solution restored FGST's access to 134 TB of data and allowed an important reprocessing effort for the experiment to continue.

Scientific Computing Services and Data Center support staff worked with customer groups to relocate a number of servers in Building 50. The systems were removed from the old row 40 on the 1st floor in preparation for the build-out of two new rows needed for the expansion of LCLS.