Blog from April, 2014

In response to reported connectivity problems between a Fermi file server and batch machines, Scientific Computing Services installed an updated operating system patch on the affected server and on a number of other hosts. The update has improved connectivity for the Fermi production system.

Scientific Computing Services staff worked with our user community to identify and report a file corruption issue in our batch processing environment. The vendor, Red Hat, provided a fix, which has now been deployed on approximately 1000 RHEL6 general batch farm systems, mitigating the problem.

Scientific Computing Services held a second Town Hall Meeting for Unix Users on Thursday, April 10. In addition to providing updates on our Unix, Storage, and HPC services, we discussed planning computing outages and managing the batch environment. We are polling the community to schedule two more Town Hall meetings this year.

Computing Division staff completed the migration of 55 servers from an old Cisco 6500 network switch to newer Cisco Nexus Top-of-Rack switches. This project involved months of coordination and communication with teams in Networking, Operations, and Scientific Computing Services, as well as with our customers. As a result, a switch that had reached end-of-service-life has been decommissioned, reducing the risk of failure in our networking infrastructure.

Scientific Computing Services implemented Network File System (NFS) monitoring on each node of the batch cluster so that specific hosts that are not processing jobs properly can be identified more quickly. This enables SCS to shut down problem nodes promptly and prevent batch jobs from failing.
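
As an illustration only, a per-node NFS health check of this kind could be sketched as below. This is not the monitoring SCS deployed; the function names `probe_mount` and `unhealthy_mounts`, the timeout value, and the example paths are all assumptions made for the sketch. The idea is to probe each mount point from a child process, because a stalled NFS mount can block the calling process indefinitely.

```python
import subprocess
import sys


def probe_mount(path, timeout=5.0):
    """Return 'ok', 'error', or 'timeout' for one mount point (hypothetical helper)."""
    # Probe from a child process so a hung NFS mount cannot block the monitor itself.
    cmd = [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path]
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        returncode = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort: do not wait after kill, the child may be stuck in the kernel.
        proc.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"


def unhealthy_mounts(paths, timeout=5.0):
    """Map each mount point that did not respond cleanly to its status."""
    report = {}
    for path in paths:
        status = probe_mount(path, timeout)
        if status != "ok":
            report[path] = status
    return report


if __name__ == "__main__":
    # Example paths are placeholders, not actual SLAC mount points.
    print(unhealthy_mounts(["/nfs/example1", "/nfs/example2"]))
```

A node that reports any entry in the result could then be flagged for draining before more batch jobs are scheduled on it; how that hand-off to the batch system would work is outside the scope of this sketch.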

Scientific Computing Services improved the configuration of the SLAC hosts used for the File Transfer Protocol (FTP) file distribution service. Anonymous access was modified to protect information, and authenticated access was discontinued. Both changes improve the security of the FTP service at SLAC.