This is the home of the Scientific Computing Services Public (SCSPub) space. Read the Scientific Computing Services Mission Statement.
Metrics
Scientific Computing Services worked with Fermi, ATLAS, and BaBar to reallocate 10,000 shares from each group and provide a total of 30,000 shares to the Theory group on a temporary basis. A special queue has been set up with parameters that let Theory use the shares more intensively than the regular queues would allow. This will help the Theory group prepare for the Snowmass meeting at the end of the month.
Scientific Computing Services has been working with Networking to test link aggregation for critical servers. The testing is complete, and SCS will begin rolling out this networking configuration to other crucial servers over the next few months. The strategy provides network redundancy should one of the network connections fail, increasing the availability of critical services for the lab.
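As a rough sketch only, the check below shows one way an administrator could verify that a bonded (link-aggregated) interface still has all of its member links up; it assumes the Linux kernel bonding driver and a hypothetical interface name bond0, and is not a description of the actual SCS rollout.

```python
#!/usr/bin/env python
"""Illustrative check of a Linux bonded (link-aggregated) interface.

Assumes the kernel bonding driver exposes status under
/proc/net/bonding/<iface>; the interface name "bond0" is hypothetical.
"""
import sys


def bond_status(iface="bond0"):
    """Return (member links, member links up) for a bonded interface."""
    try:
        with open("/proc/net/bonding/%s" % iface) as f:
            text = f.read()
    except IOError:
        return None
    slaves = text.count("Slave Interface:")   # member NICs in the bond
    up = text.count("MII Status: up")         # bond + members reporting up
    # The bond itself prints an "MII Status" line before the first member
    # section; discount it so we count member links only.
    if "MII Status: up" in text.split("Slave Interface:")[0]:
        up -= 1
    return slaves, up


if __name__ == "__main__":
    status = bond_status()
    if status is None:
        print("bond0 not present or bonding driver not loaded")
        sys.exit(2)
    slaves, up = status
    print("%d of %d member links up" % (up, slaves))
    # Service continues on a single link, but redundancy is lost below full strength.
    sys.exit(0 if up == slaves else 1)
```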
Issues:
Scientific Computing Services is still dealing with fallout from the May 30 power outage. Approximately 20 batch machines are down due to hardware issues that developed as a result of the sudden power loss.
Scientific Computing Services continues to work on LCLS/PCDS storage problems following the May 30 power outage. A hardware RAID controller failed and may be responsible for corrupting one of the 1PB Lustre file systems. Repair work is underway and the file system is currently offline.
Current activities
Activity | start date | end date | status | link to further information |
---|---|---|---|---|
IPv6 | 2013-03-06 | 2013-09-30 | in process | |
LSF upgrade | 2013-03-01 | | | |
Cyber Safety planning and reviews | 2013-04-22 | 2013-08-31 | in process | https://slacspace.slac.stanford.edu/Operations/SCCS/Ops/Shared%20Documents/Forms/AllItems.aspx |
Cyber Safety sudo_all | 2013-03-15 | 2013-06-07 | sudo all completed; workgroup and user sudo 75% complete | Proposal for Tracking sudo all privs; Status of conversion to new Sudo Process |
PCDS - task list | 2013-05-16 | | in process | PCDS Task List (original request); https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=455960 (tracking ticket) |
More detailed metrics and monitoring can be found at these links: Nagios, Ganglia, PlatformRTM, SCS Infrastructure Dashboards.
Computing Division Status and Calendar: https://www-internal.slac.stanford.edu/
comp-out/
Recent and planned changes: comp-change archives (login required)
News and Announcements
Systems and Services
System/Service or Experiment | Service Owner | Service Manager/Implementer | Contact Name | Contact Email | Additional Information |
---|---|---|---|---|---|
AD | OCIO/SciComp staff | ||||
ATLAS | OCIO/SciComp staff | Wei Yang | |||
BaBar | BaBar Collaboration | OCIO/SciComp staff | Tina Cartaro | ||
EXO | EXO Experiment | OCIO/SciComp staff | Tony Johnson | Tony Waite (apw@slac.stanford.edu) if the issue is related to EXO data taking/data acquisition; Carter Hall (crhall@umd.edu) for offline software issues; Max Turri (turri@slac.stanford.edu) for web server issues; Ken Fouts (kfouts@slac.stanford.edu) for EXO operations | |
CDMS | OCIO/SciComp staff | Anders Borgland | |||
DES | OCIO/SciComp staff | Stuart Marshall | |||
Fermi/GLAST | Fermi | OCIO/SciComp staff | Richard Dubois | Tom Glanzman. fermi-request-outage: this list is intended for those with an interest in planning the timing and other details of *future* outages. fermi-outage: this list is intended for those with an interest in discussing the details/fallout for an impending, ongoing, or recently completed outage. | |
KIPAC | OCIO/SciComp staff | Stuart Marshall | |||
LCLS | OCIO/SciComp staff | Mike Dunne | Amedeo Perazzo, Randy Radmer | ||
LSST | OCIO/SciComp staff | Richard Dubois | |||
MCC | OCIO/SciComp staff | Jingchen Zhou | Ken Brobeck | ||
PPA | OCIO/SciComp staff | David MacFarlane | |||
PULSE | OCIO/SciComp staff | Todd Martinez | |||
SIMES | OCIO/SciComp staff | Tom Devereaux | Brian Moritz | ||
SSRL | OCIO/SciComp staff | Hector Prado? | Hector Prado | ||
SUNCAT | OCIO/SciComp staff | Jens Norskov | Johannes Voss | ||
Theory | |||||
Beam Physics | |||||
Klystron | |||||
LCD | |||||
SPEAR | |||||
DNS for core infrastructure | CIO | OCIO/SciComp staff | Yee Ting Li | net-admin@slac.stanford.edu | |
LDAP for Unix | CIO | OCIO/SciComp staff | Andrew May | unix-admin@slac.stanford.edu | |
Nagios | CIO | OCIO/SciComp staff | Andrew May | unix-admin@slac.stanford.edu | |
Ganglia | CIO | OCIO/SciComp staff | Andrew May | unix-admin@slac.stanford.edu | |
Authentication for Unix accounts | CIO | OCIO/SciComp staff | Andrew May | unix-admin@slac.stanford.edu | |
Webauth | CIO | OCIO/SciComp staff | Andrew May | unix-admin@slac.stanford.edu | |
AFS | CIO | OCIO/SciComp staff | Renata Dart | unix-admin@slac.stanford.edu | Andrew May, Lance Nakata |
LSF batch system | CIO | OCIO/SciComp staff | Yemi Adesanya | unix-admin@slac.stanford.edu | Renata Dart |
Tape Library | CIO | OCIO/SciComp staff | Andrew May | unix-admin@slac.stanford.edu | Lance Nakata |
Red Hat Satellite Server | CIO | OCIO/SciComp staff | Karl Amrhein | unix-admin@slac.stanford.edu |
Current activities
Activity | Purpose | start date | end date | status | link to further information |
---|---|---|---|---|---|
New hardware planning and acquisitions | customer requirements | ongoing | | | |
RHEL 5 | phaseout | 2014-08-01 | | | |
GPFS | shared, tiered storage | 2013-10-01 | open | SCS test environment; MCC, ACD, Fermi, KIPAC in production | |
LCLS Unix account password process | | 2013-07 (proposed start date) | | waiting resources | LCLS Unix account password process; https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=447547 |
Accomplishments
Clusters and High Performance Computing
2013/05/10: The new PPA bullet cluster (~2900 cores) is now available to all SLAC Unix users via the production batch system. This introduced the capability of selecting a newer release of the RedHat operating system for batch jobs. Scientific Computing Services worked with key customer groups, including Fermi, KIPAC and EXO, to minimize disruption to their production environments and ensure the cluster will support both parallel and single-core jobs.
2013/05/03: Scientific Computing Services implemented Nagios monitoring and alerting for AFS quotas. This is currently being used by Fermi and MCC, but is available for any interested group. The service alerts customers before their AFS usage reaches its quota, providing time to take corrective action and minimizing the potential for quota-related computing problems.
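As an illustration of the kind of check involved (not the actual SCS plugin), a minimal Nagios-style quota check could shell out to the OpenAFS `fs listquota` command and map the percentage used onto Nagios exit codes; the thresholds below are placeholders.

```python
#!/usr/bin/env python
"""Illustrative Nagios-style AFS quota check (not the production SCS plugin)."""
import subprocess
import sys

WARN, CRIT = 80, 90  # percent of quota used; example thresholds only


def check_afs_quota(path):
    """Return an (exit code, message) pair following the usual Nagios convention."""
    out = subprocess.check_output(["fs", "listquota", path]).decode()
    # Second line of "fs listquota" output: volume, quota, used, %used, partition.
    fields = out.splitlines()[1].split()
    pct = int(fields[3].rstrip("%<"))
    if pct >= CRIT:
        return 2, "CRITICAL: %s at %d%% of AFS quota" % (path, pct)
    if pct >= WARN:
        return 1, "WARNING: %s at %d%% of AFS quota" % (path, pct)
    return 0, "OK: %s at %d%% of AFS quota" % (path, pct)


if __name__ == "__main__":
    code, message = check_afs_quota(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(message)
    sys.exit(code)
```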
2013/05/03: Scientific Computing Services is working with IBM to give a presentation on the new features of LSF 9.1 to scientific computing customers. Along with the presentation, SCS staff will provide an overview of our use of MPI applications in our cluster environment. This interaction will improve understanding between IBM and SLAC regarding the use of LSF and clarify features that would be valuable in this software product.
2013/04/05: Scientific Computing Services has completed the initial tuning of the PPA cluster hardware for parallel computation. Test runs included 256-core and 1024-core jobs using OpenMPI on the 40Gb/sec Infiniband network. All 2900 compute cores will be made available to the general queues in addition to a high priority MPI queue. This tuning has improved the overall performance of the cluster for scientific computing and research.
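For illustration only, a minimal MPI sanity test of the sort used to exercise such a configuration could look like the sketch below; it is written with mpi4py purely as an example, and the actual SCS test jobs, queue names, and submission options are not described in this report. Under LSF with OpenMPI integration it might be launched along the lines of `bsub -n 256 mpirun python mpi_hello.py`.

```python
"""mpi_hello.py - minimal MPI sanity test (illustrative only)."""
import socket

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Every rank reports its hostname; rank 0 summarizes the job layout.
hosts = comm.gather(socket.gethostname(), root=0)

if rank == 0:
    print("%d ranks reported in across %d hosts" % (size, len(set(hosts))))
```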
Storage
2013/06/14: The Scientific Computing Services storage team contacted NERSC and Vanderbilt University to gather information about their General Parallel File System (GPFS) deployments. This allows us to learn from their experiences as we look at beginning our own deployment for SLAC scientific customers.
2013/06/07: Following an unexpected power outage on Thursday, May 30th, Scientific Computing Services restored services within 4 hours of the return of power and chilled water to Building 50. SCS also responded to the failure of a controller in the PCDS/LCLS Lustre storage system, returning it to service by Friday evening. The restoration of services enabled the Scientific Computing community to continue with their experiments and programs.
2013/05/24: LCLS users reported that they were unable to access various files stored on a 1PB Lustre filesystem. Scientific Computing Services diagnosed the problem and ran utilities to repair file system inconsistencies, restoring access to users' files.
2013/03/29: Scientific Computing Services has added 200 tapes to our tape libraries, providing more than a petabyte of tape storage for our LCLS customers.
Cyber Safety (critical to keep all Lab computing services in operation)
2013/06/07: Scientific Computing Services responded to requests from the visiting KPMG team related to Unix accounts, elevated privileges, security, system management, logging, monitoring and the process for handling changes. This provided the IG Audit review team with information and substantiation of how SLAC handles the centrally-managed systems and services.
2013/05/31: Scientific Computing Services revised documentation for UNIX tape backup policies, procedures and scheduling in response to a request for information for the IG Audit. This supports the Cyber Safety program at SLAC.
2013/05/24: Scientific Computing Services applied a mitigation for a very serious security vulnerability which affected 1,042 managed Red Hat Enterprise Linux 6 hosts. SCS applied this mitigation using central configuration management within hours of learning of the vulnerability, thereby preventing a published exploit, which was actively compromising systems on the internet, from affecting the SLAC network and impacting scientific computing resources.
2013/05/17: Scientific Computing Services provided responses, documentation and artifacts for the IG audit questions regarding Unix infrastructure for Configuration Management, Identity and Access Management, and Remote Access Management. This is in support of our Cyber Safety program at SLAC and prepares us for the audit that will occur at the end of May.
2013/04/26: Scientific Computing Services held a joint meeting with the Cyber Safety team to summarize our framework for Unix system management, continuous monitoring and reporting. This review provided fundamental information that the Cyber Safety team can use in preparation for the upcoming IG Audit scheduled for the end of May.
2013/04/12: Scientific Computing Services developed the automated tools for reviewing accounts with elevated privileges. A process was established for handling this review at regular intervals. In response to a DOE finding, 170 tickets were created to review and approve privileged accounts. This supports the Cyber Safety program at SLAC and meets the DOE deadline for this security requirement.
Infrastructure Services
2013/06/07: Scientific Computing Services worked with Datacenter Technical Coordinators to modernize the server management infrastructure in Building 50. New server installations no longer require obsolete serial communications hardware. This will reduce cost overheads and shorten the amount of time required for initial system setup and deployment.
2013/04/19: Scientific Computing Services responded quickly to the April 9 power fluctuation and temporary chilled water loss that impacted services for research computing. In addition, SCS revised the documentation and processes surrounding emergency response to such an event. This enhances our ability to provide continuity of services for the lab.
2013/03/29: Scientific Computing Services has 3560 machines under configuration management, an increase of 3.2% over the previous month. This increase is primarily in batch systems which provide additional support for scientific computing at the lab.
The SCSPub space will be used as a repository of information from Scientific Computing Services that can be shared with others at SLAC. This will include notes, agendas, working papers, proposals, and similar material the group wants to make available. In addition, for certain types of documents needing version or check-out/check-in control, there is a document library on the SCS SharePoint site.