Steps for New User
- request unix-admin to give them login access to the suncat machines
- add them to the LSF queue permissions list with ypgroup adduser -group suncat-norm -user <username> [-user ...]
- if appropriate, add them to majordomo list with approve <password> subscribe suncat-list <emailaddr>
- if appropriate, add them to .mailrc suncatcomp list
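The steps above can be collected in a small dry-run helper. This is only a sketch: the function name new_user_commands is hypothetical, it prints the lines instead of running anything, and <password> is left as a placeholder to fill in by hand. As far as I know the majordomo "approve" line is text mailed to the list server rather than a shell command, so it is printed only as a reminder.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the account-setup steps for one user.
# Usage: new_user_commands <username> [<emailaddr>]
new_user_commands() {
    user="$1"
    email="${2:-}"
    if [ -z "$user" ]; then
        echo "usage: new_user_commands <username> [<emailaddr>]" >&2
        return 1
    fi
    # LSF queue permissions (command taken from the list above).
    echo "ypgroup adduser -group suncat-norm -user $user"
    # The majordomo subscription is optional; print it only if an address is given.
    if [ -n "$email" ]; then
        echo "approve <password> subscribe suncat-list $email"
    fi
}
```

For example, new_user_commands jdoe jdoe@example.org prints the ypgroup line followed by the majordomo line; leaving out the address prints only the ypgroup line.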
Serial Port Linux Console Access
If you have a Kerberos ticket, you can connect to the serial port
console from any machine (e.g., your desktop, iris, noric, etc.).
To connect to the serial console:
$ /usr/local/bin/console suncat0006
Or use this command if your machine has a private /usr/local:
$ /afs/slac/local/bin/console suncat0006
You will get this message from the console software:
[suncat0006: Attached readwrite on conserv1]
Press enter once. You should see:
-----------------------------------------------------
Red Hat Enterprise Linux Client release 5.5 (Tikanga)
Kernel 2.6.18-194.11.1.el5 on an x86_64

suncat0006 login:
-----------------------------------------------------
That is the linux login prompt on the serial console.
All commands you give to the console software start with
the escape sequence:
Ctrl-e c (Ctrl-e, followed by a 'c')
plus one character (sometimes 2 characters).
You can press ctrl-e c ? to see the possible commands:
-----------------------------------------------------
Escape seq (currently ^Ec) + char
.     disconnect
a     attach readwrite
b     display heartbeats of consoles on this server
c     change to new console
d     down (close tty) a console
e     change escape sequence
f     force attach readwrite
g     get location infomation
h     print this message
i     display info about all consoles on this server
k     set idle timeouts for this session
l1    send break (halt host!)
m     execute a macro (? for list)
o     (re)open the tty file
p     Display and select the baud rate
qy    shutdown the server
r     replay the last 20 lines
s     attach readonly
u     show status of all consoles on this server
v     show server version info
w     show all users connected to this server
x     examine -- show detailed console info
<cr>  ignore/abort command
?     print this message
-----------------------------------------------------
To disconnect, use
Ctrl-e c .
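The choice between the two console paths given earlier in this section can be scripted. The sketch below is an assumption about how one might do it, not an existing tool: first_executable and suncat_console are hypothetical names, and the only paths used are the two quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that is an executable file.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: attach to a host's serial console with whichever
# copy of the console client is present on this machine.
# Usage: suncat_console suncat0006
suncat_console() {
    bin="$(first_executable /usr/local/bin/console /afs/slac/local/bin/console)" || {
        echo "console client not found" >&2
        return 1
    }
    "$bin" "$1"
}
```

A Kerberos ticket is still needed, and the session is still ended with Ctrl-e c . as described above.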
hpiLO Command Line Interface
NOTE: this is for the suncatlsX/suncatfsX nodes ONLY. The suncat0XXX have a BMC/IPMI interface instead (see below)
Use Esc-( to get to the hpiLO prompt.
Useful commands (I think you need to "cd system1" before these will work):
power on
power off
power reset
exit
Sometimes the console baud rate is set incorrectly. It can be changed with Ctrl-e c p. Typically choose 56000 or 9600.
BMC/IPMI Command Line Interface
NOTE: this is for the suncat0XXX nodes ONLY. The suncatlsX/suncatfsX have an hpiLO interface instead (see above)
You can also connect to the HP Baseboard Management Controller (ie, IPMI)
on the serial port by pressing
Esc-( (that is, Escape followed by Shift-9)
Then press enter once. You will get this "Login: " prompt with a capital L.
This is the BMC login prompt:
-----------------------------------------------------
Command Line Interface
Copyright 2004-2008 ServerEngines Corporation
All rights reserved.
Login:
-----------------------------------------------------
That is the login prompt for the command-line interface of the IPMI management controller.
Once you log in, you get this prompt:
-----------------------------------------------------
CLP Session Initiated
/./->
-----------------------------------------------------
To see valid commands, type 'show'.
There are two entry points: system1 is the Linux OS. map1 is the Management Service Processor.
You can do things like a hard power reset, query error logs, temperature states, etc.
ESC R ESC r ESC R    resets the system
ESC ^                powers on the system
The last time I tried this reset sequence it failed. Instead I had to run:
cd system1/console1    (probably could have done just "system1" here)
reset
It is important to return to the Linux OS login prompt when you are done
using the Baseboard Management Controller. To exit and switch the serial
port back to Linux, type 'exit', followed by Esc-Q (Escape, followed by
Shift-q). Then press enter once.
-----------------------------------------------------
/./-> exit

Command Line Interface (CLI)
Copyright 2004-2008 ServerEngines Corporation
All rights reserved.
Login:
-----------------------------------------------------
After typing: Esc-Q, followed by 'enter', you will see a login prompt with
a lowercase 'login: '. This is the linux login prompt. It will time out
after a minute or so, then it will give you the full linux login prompt:
-----------------------------------------------------
login:
Login timed out

Red Hat Enterprise Linux Client release 5.5 (Tikanga)
Kernel 2.6.18-194.11.1.el5 on an x86_64

suncat0006 login:
-----------------------------------------------------
ipmitool
The command
sudo ipmitool sel elist
lists the entries in the BMC System Event Log (SEL), which is where hardware errors show up.
Linux kernel SysRq facility
When you are connected to the linux serial port console (and NOT the BMC) you
can send the kernel SysRq commands. See this page if you are unfamiliar with
the linux kernel SysRq facility:
http://en.wikipedia.org/wiki/Magic_SysRq_key
You can send the kernel sysrq commands by first sending a break signal on
the serial port console, followed by a single character which represents a
command, eg, hard reset, show memory, etc.
You can send a break followed by a space to see the possible commands.
You send a break signal using the serial console software with
Ctrl-e c l 1
(that is, Ctrl-e, followed by a c, then l (lowercase 'L'), then a 1 (the number one)).
The console software "help" display says that 'l1' will halt the host.
That was only true for older versions of Solaris. It will not halt
a Linux host or a modern Solaris system (as we have them configured).
The console software will let you know when it has sent a break signal on the line
with this output:
The SysRq help looks like this:
SysRq : HELP : loglevel0-8 reBoot Crashdump tErm Full kIll thaw-filesystems(J) saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount shoWcpus
The capital letter in each word marks the key for that action; type it in lowercase.
For example, to show current memory statistics (showMem), send a break followed by 'm'.
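The mapping from the help line to keys can be extracted mechanically. The following is a minimal sketch with a hypothetical function name, sysrq_keys. It assumes the help text has the single-line form shown above and that keys are typed in lowercase, which is how kernels of this vintage behave as far as I know. Words without a capital letter, such as loglevel0-8, are skipped.

```shell
#!/bin/sh
# Hypothetical helper: read a SysRq help line on stdin and print
# "<key> <word>" for every action word that contains a capital letter.
sysrq_keys() {
    # Drop the "SysRq : HELP : " prefix, then handle one word per line.
    sed 's/^.*HELP : //' | tr ' ' '\n' | while read -r word; do
        # Keep only the capital letters of the word and take the first one.
        key="$(printf '%s' "$word" | tr -cd 'A-Z' | cut -c1)"
        [ -n "$key" ] || continue
        # Assumption: the key is typed in lowercase on the console.
        printf '%s %s\n' "$(printf '%s' "$key" | tr 'A-Z' 'a-z')" "$word"
    done
}
```

Piping the help line shown above into sysrq_keys prints one pair per line, for example "m showMem" and "b reBoot".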
System Monitoring/History: Ganglia and Nagios
For CPU usage monitoring:
To look for memory/CPU problems:
http://nagios.slac.stanford.edu/nagios/cgi-bin/status.cgi?hostgroup=SUNCAT%20cluster&style=detail
There is also some history in /scswork/ranger
How bsub Command Functions
bsub is a script written by Neal Adams which calls the "real" executable:
/afs/slac/package/lsf/curr/bin/bsubx
Depending on the "-a" option (for suncat this is typically "openmpi"), bsubx calls an "esub" script (in the lsf "etc" directory). This in turn points to another wrapper script in the lsf "bin" directory; for openmpi that script is "openmpirun_wrapper", and it is the one that executes the mpirun command. mpirun uses "lsgrun" on the master node to direct the "res" daemons on the slave nodes to start executables.
Batch Commands
lsload -R suncat      (show CPU loading of all suncat machines)
lshosts -R suncat     (show list of suncat machines and associated info)
bhosts -w suncatfarm  (show status of hosts, from a batch perspective)
bacct -u all -b -q "suncat-xlong suncat-long suncat-medium suncat-short" -C "2010/9/25," > bacct.out &
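The bacct example above hard-codes its start date. A small sketch of how the same command line could be generated for another date follows. The function name bacct_since_cmd and the variable SUNCAT_QUEUES are hypothetical, and the sketch assumes that the trailing comma in the -C argument means "from this date until now", as the example suggests. It prints the command instead of running it.

```shell
#!/bin/sh
# Queue list taken from the bacct example above.
SUNCAT_QUEUES="suncat-xlong suncat-long suncat-medium suncat-short"

# Hypothetical helper: print the bacct command for all users' jobs that
# finished on or after the given date (format YYYY/M/D, as in the example).
bacct_since_cmd() {
    since="$1"
    case "$since" in
        [0-9][0-9][0-9][0-9]/*/*) ;;
        *) echo "usage: bacct_since_cmd YYYY/M/D" >&2; return 1 ;;
    esac
    printf 'bacct -u all -b -q "%s" -C "%s,"\n' "$SUNCAT_QUEUES" "$since"
}
```

The printed line can be pasted into a shell with "> bacct.out &" appended, as in the original example.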
Access to BIOS
For the farm nodes, at the command line type
sudo ipmitool chassis bootdev bios
This should cause the console interface to show the BIOS screen after the next reboot. Subsequent reboots do not do this.
For the login nodes, the best I have been able to do is hit ESC-9. This gives the "rbsu" CLI to view/control BIOS settings. We have only been able to get the VT100-graphics version of the BIOS with a crash cart.
Running Disk Tests
Quick stats:
/usr/sbin/smartctl -a /dev/sda
The longer self-test mode is non-destructive, so it can be run while jobs are running.
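The notes do not give the command for the longer test. Presumably it is smartctl's extended self-test, started with "-t long", with the results read back later with "-l selftest". The sketch below rests on that assumption. The function name smart_long_test and the RUN variable are hypothetical, and by default the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helper: print the smartctl commands for an extended self-test.
# Set RUN=1 to execute them instead of printing (needs root).
# Usage: smart_long_test [device]    (device defaults to /dev/sda)
smart_long_test() {
    dev="${1:-/dev/sda}"
    # "-t long" starts the extended self-test in the background;
    # "-l selftest" shows the self-test log (rerun it once the test is done).
    for args in "-t long" "-l selftest"; do
        if [ "${RUN:-0}" = "1" ]; then
            # $args is intentionally unquoted so it splits into two words.
            /usr/sbin/smartctl $args "$dev"
        else
            echo "/usr/sbin/smartctl $args $dev"
        fi
    done
}
```

Called without arguments it prints the two commands for /dev/sda, the device used in the quick-stats example.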
Running Memory Tests
from ole: run memtest86+ (from memtest.org)
Power Controller History
ppic -d -v
need to have ipmi daemon running to talk to "carbondale" chip
(if not, ppic will try to start ipmi itself, but doesn't do it right)