Steps for New User

  • request unix-admin to give them login access to the suncat machines
  • add them to the LSF queue permissions list with ypgroup adduser -group suncat-norm -user <username> [-user ...]

  • if appropriate, add them to majordomo list with approve <password> subscribe suncat-list <emailaddr>
  • if appropriate, add them to .mailrc suncatcomp list
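
The checklist above can be collected into a small dry-run helper. This is a minimal sketch, not an existing site script: the function name new_user_commands is hypothetical, and it only prints the ypgroup command and the majordomo request line (with the list password left as the <password> placeholder) so they can be reviewed before being run or mailed.

```shell
# Print (do not run) the per-user commands from the checklist above.
# new_user_commands is a hypothetical helper name, not a site tool.
new_user_commands() {
    user=$1
    email=$2
    # LSF queue permissions list
    printf 'ypgroup adduser -group suncat-norm -user %s\n' "$user"
    # Request line for majordomo; substitute the real list password.
    printf 'approve <password> subscribe suncat-list %s\n' "$email"
}
```

For example, new_user_commands jdoe jdoe@example.org prints the two lines for a (made-up) user jdoe.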

Serial Port Linux Console Access

If you have a kerberos ticket, you can connect to the serial port
console from any machine (eg, your desktop, iris, noric, etc.).

To connect to the serial console:

$ /usr/local/bin/console suncat0006

or use this command if your machine has a private /usr/local :

$ /afs/slac/local/bin/console suncat0006
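
The choice between the two paths above can be automated. The following is only a sketch: find_console is a hypothetical helper name, and it simply prints the first candidate path that exists and is executable, trying the two locations from this page when called without arguments.

```shell
# Print the first console binary that exists and is executable.
# With no arguments, the two paths from this page are tried in order.
# find_console is a hypothetical helper name.
find_console() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/local/bin/console /afs/slac/local/bin/console
    fi
    for candidate in "$@"; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # no usable console binary found
}
```

Usage would then be: "$(find_console)" suncat0006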

You will get this message from the console software:

[suncat0006: Attached readwrite on conserv1]

Press enter once. You should see:

-----------------------------------------------------
Red Hat Enterprise Linux Client release 5.5 (Tikanga)
Kernel 2.6.18-194.11.1.el5 on an x86_64

suncat0006 login:
-----------------------------------------------------

That is the linux login prompt on the serial console.

All commands you give to the console software start with
the escape sequence:

Ctrl-e c (Ctrl-e, followed by a 'c')

plus one character (sometimes 2 characters).
You can press ctrl-e c ? to see the possible commands:

-----------------------------------------------------
Escape seq (currently ^Ec) + char
.    disconnect
a    attach readwrite
b    display heartbeats of consoles on this server
c    change to new console
d    down (close tty) a console
e    change escape sequence
f    force attach readwrite
g    get location infomation
h    print this message
i    display info about all consoles on this server
k    set idle timeouts for this session
l1   send break (halt host!)
m    execute a macro (? for list)
o    (re)open the tty file
p    Display and select the baud rate
qy   shutdown the server
r    replay the last 20 lines
s    attach readonly
u    show status of all consoles on this server
v    show server version info
w    show all users connected to this server
x    examine -- show detailed console info
<cr> ignore/abort command
?    print this message
-----------------------------------------------------

To disconnect, use

Ctrl-e c .

hpiLO Command Line Interface

NOTE: this is for the suncatlsX/suncatfsX nodes ONLY. The suncat0XXX have a BMC/IPMI interface instead (see below)

Use Esc-( to get to the hpiLO prompt.

Useful commands (I think you need to "cd system1" before these will work):

power on
power off
power reset
exit

Sometimes the console baud rate is set incorrectly. It can be changed with Ctrl-e c p; typical choices are 57600 or 9600.

BMC/IPMI Command Line Interface

NOTE: this is for the suncat0XXX nodes ONLY. The suncatlsX/suncatfsX have an hpiLO interface instead (see above)

You can also connect to the HP Baseboard Management Controller (ie, IPMI)
on the serial port by pressing

Esc-( (that is, escape followed by shift-9)

Then press enter once. You will get a "Login: " prompt with a capital L.
This is the BMC login prompt:

-----------------------------------------------------
Command Line Interface
Copyright 2004-2008 ServerEngines Corporation
All rights reserved.

Login:
-----------------------------------------------------

That is the login to the command-line-interface to the IPMI management controller.

Once you log in, you get this prompt:

-----------------------------------------------------
CLP Session Initiated
/./->
-----------------------------------------------------

To see valid commands, type 'show'.

There are two entry points: system1 is the Linux OS. map1 is the Management Service Processor.

You can do things like a hard power reset, query error logs, temperature states, etc.


cd system1
stop/start/reset system1 (controls power)

It is important to return to the linux OS login prompt when you are done
using the Baseboard Management Controller. To leave the BMC and switch
the serial port back to Linux, type 'exit', followed by Esc-Q
(escape, followed by Shift-q). Then press enter once.

-----------------------------------------------------
/./-> exit

Command Line Interface (CLI)
Copyright 2004-2008 ServerEngines Corporation
All rights reserved.

Login:
-----------------------------------------------------

After typing: Esc-Q, followed by 'enter', you will see a login prompt with
a lowercase 'login: '. This is the linux login prompt. It will time out
after a minute or so, then it will give you the full linux login prompt:

-----------------------------------------------------
login: Login timed out
Red Hat Enterprise Linux Client release 5.5 (Tikanga)
Kernel 2.6.18-194.11.1.el5 on an x86_64

suncat0006 login:
-----------------------------------------------------

ipmitool

The command

sudo ipmitool sel elist

shows errors.
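
When the event log is long, it can help to count entries per sensor. The sketch below assumes the usual '|'-separated layout of the elist output, with the sensor name in the fourth field; that layout is an assumption and should be checked against the real output on these nodes. sel_summary is a hypothetical helper name.

```shell
# Count SEL entries per sensor from "ipmitool sel elist" output on stdin.
# Assumes '|'-separated fields with the sensor name in field 4.
# sel_summary is a hypothetical helper name.
sel_summary() {
    awk -F'|' 'NF >= 4 {
        sensor = $4
        gsub(/^ +| +$/, "", sensor)   # trim padding around the field
        count[sensor]++
    }
    END { for (s in count) printf "%d %s\n", count[s], s }' | sort -rn
}
```

Usage: sudo ipmitool sel elist | sel_summary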

Linux kernel SysRq facility

When you are connected to the linux serial port console (and NOT the BMC) you
can send the kernel SysRq commands. See this page if you are unfamiliar with
the linux kernel SysRq facility:

http://en.wikipedia.org/wiki/Magic_SysRq_key

You can send the kernel sysrq commands by first sending a break signal on
the serial port console, followed by a single character which represents a
command, eg, hard reset, show memory, etc.
You can send a break followed by a space to see the possible commands.

You send a break signal using the serial console software with
Ctrl-e c l 1

(that is, Ctrl-e, followed by a 'c', then an 'l' (lowercase 'L'), then a '1' (the number one)).

The console software "help" display says that 'l1' will halt the host.
That was only true for older versions of Solaris. It will not halt
a linux host, or modern Solaris systems (as we have them configured).

The console software will let you know when it has sent a break signal on the line
with this output:

halt - sent

The SysRq help looks like this:

SysRq : HELP : loglevel0-8 reBoot Crashdump tErm Full kIll thaw-filesystems(J) saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount shoWcpus

The capital letter in each word shows the character that you use for that action.
For example, to show current memory statistics (showMem), you use 'M'.
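
The capital-letter convention can be applied mechanically to the help line. This is a small illustrative sketch (sysrq_key is a hypothetical helper name): it prints the first capital letter of one word from the help output, and nothing for words that have none.

```shell
# Print the command character for one word of the SysRq help line,
# i.e. its first capital letter (showMem -> M, thaw-filesystems(J) -> J).
# Prints an empty line for words with no capital, such as loglevel0-8.
# sysrq_key is a hypothetical helper name.
sysrq_key() {
    printf '%s\n' "$1" | sed 's/[^[:upper:]]//g' | cut -c1
}
```

For example, sysrq_key reBoot prints B and sysrq_key shoWcpus prints W.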

System Monitoring/History: Ganglia and Nagios

For CPU usage monitoring:

http://ganglia02.slac.stanford.edu:8080/ganglia/batch/?m=load_one&r=week&s=descending&c=suncat&h=&sh=1&hc=4&z=small

To look for memory/CPU problems:

http://nagios.slac.stanford.edu/nagios/cgi-bin/status.cgi?hostgroup=SUNCAT%20cluster&style=detail

There is also some history in /scswork/ranger

How bsub Command Functions

bsub is a script written by Neal Adams which calls the "real" executable:

/afs/slac/package/lsf/curr/bin/bsubx

Depending on the "-a" option (for suncat this is typically "openmpi"), bsubx calls an "esub" script in the lsf "etc" directory. This in turn points to another wrapper script in the lsf "bin" directory; for openmpi that script is "openmpirun_wrapper", and it is the one that executes the mpirun command. mpirun uses "lsgrun" on the master node to direct the "res" daemons on the slave nodes to start executables.

Batch Commands

lsload -R suncat (show CPU loading of all suncat machines)
lshosts -R suncat (show list of suncat machines and associated info)
bhosts -w suncatfarm (show status of hosts, from a batch perspective)
bacct -u all -b -q "suncat-xlong suncat-long suncat-medium suncat-short" -C "2010/9/25," > bacct.out &

SL2x170z BIOS Settings

  Main
    -> Boot Settings Configuration
       -> Splash Screen               - change to 'DISABLED'
       -> Restore After AC Power Loss - change to 'OFF'
  Advanced
    -> IPMI Configuration
       -> Serial Port Configuration
          -> Serial Port Switching    - change to 'ENABLED'
    -> Remote Access Configuration
       -> BIOS Serial Console         - change to 'ENABLED'
       -> Serial Port Mode            - change to '57600'
  Boot
    -> Boot Device Priority           - move 'Network' to the TOP
  Advanced
    ->PowerConfiguration
       ->PowerEfficiencyMode = Performance

Access to BIOS

For the farm nodes, at the command line type

sudo ipmitool chassis bootdev bios

This should cause the console interface to show the BIOS screen after the next reboot. Subsequent reboots do not do this.

For the login nodes, the best I have been able to do is hit ESC-9. This gives the "rbsu" CLI to view/control BIOS settings. We have only been able to get the VT100-graphics version of the BIOS with a crash cart.

If Motherboard Changed

Update the MAC address in:

/etc/sysconfig/network-scripts/ifcfg-eth1
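
The edit to the file above can be scripted. This sketch assumes the file carries the MAC address on a line beginning with HWADDR= (the usual Red Hat convention; check the actual file first), and set_hwaddr is a hypothetical helper name. The new address can typically be read from /sys/class/net/eth1/address on the replacement board.

```shell
# Rewrite the HWADDR= line of an ifcfg file in place.
# Usage: set_hwaddr <ifcfg-file> <mac>
# Assumes the file already has a line starting with HWADDR=.
# set_hwaddr is a hypothetical helper name.
set_hwaddr() {
    file=$1
    mac=$2
    tmp=$(mktemp) || return 1
    # Write through a temp file, then copy the content back so that the
    # original file keeps its owner and permissions.
    sed "s/^HWADDR=.*/HWADDR=$mac/" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Run as root, for example: set_hwaddr /etc/sysconfig/network-scripts/ifcfg-eth1 00:11:22:33:44:55 (the address here is a made-up placeholder).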

Running Disk Tests

Quick stats:

/usr/sbin/smartctl -a /dev/sda

The longer testing mode can be done while jobs are running (non-destructive)
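
Assuming "the longer testing mode" refers to the smartctl extended self-test, the commands would look like the following. The helper only prints them; smart_long_test_commands is a hypothetical name, and /dev/sda is simply the device used in the quick-stats example above.

```shell
# Print (do not run) the smartctl commands for an extended self-test.
# smart_long_test_commands is a hypothetical helper name.
smart_long_test_commands() {
    dev=${1:-/dev/sda}
    # Start the extended (long) self-test; it runs in the background on the drive.
    printf '/usr/sbin/smartctl -t long %s\n' "$dev"
    # Later: read the self-test log to see the result.
    printf '/usr/sbin/smartctl -l selftest %s\n' "$dev"
}
```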

Running Memory Tests

from ole: run memtest86+ (from memtest.org)

Power Controller History

The ppic executable must first be copied from /nfs/slac/g/suncatfs/sw/package/PPIC_LinuxV1/ppic to a local directory such as /tmp, because executables on NFS cannot be run with sudo.

ppic -d -v

The ipmi daemon must be running for ppic to talk to the "carbondale" chip.
(If it is not, ppic will try to start ipmi itself, but does not do it correctly.)

Checking RAID Controller

Install the rpm for hpacucli, then run:
/nfs/slac/g/suncatfs/sw/package/smartshow

options
Usage: ./smartshow with flags one of -c -d -e -1 -a -l -D

-c shows controller status
-d show disk
-1 show disk compact (1 line per)
-e show enclosure
-l logical drives

The disk listing shows location, status, size, model number, and firmware revision:

1I:1:1: OK: 500 GB: HP MM0500FAMYT : HPD3
1I:1:2: OK: 500 GB: HP MM0500FAMYT : HPD3
2E:1:1: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:2: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:3: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:4: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:5: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:6: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:7: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:8: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:9: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:10: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:11: OK: 600 GB: HP EF0600FARNA : HPD2
2E:1:12: OK: 600 GB: HP EF0600FARNA : HPD2
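
With this many disks, a filter that prints only the problem entries can be handy. The sketch below assumes the ": "-separated layout of the listing above and treats any status other than "OK" as a problem; which other status strings smartshow can print is not documented here. disks_not_ok is a hypothetical helper name.

```shell
# Print the location of every disk whose status field is not "OK".
# Reads "location: status: size: model : firmware" lines on stdin.
# disks_not_ok is a hypothetical helper name.
disks_not_ok() {
    awk -F': ' 'NF >= 5 && $2 != "OK" { print $1 }'
}
```

The listing would be piped into it, and no output means every disk reported OK.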

How Karl Monitors the NFS RAID Status

There is a cronjob on suncatfs1 which checks the status of the raid cards
every two hours. The current status is updated in this file:

$ cat /afs/slac/g/scs/systems/system.info/suncatfs1/cciss_vol_status
/dev/cciss/c0d0: (Smart Array P410i) RAID 1 Volume 0 status: OK.
/dev/cciss/c2d0: (Smart Array P812) RAID 6 Volume 0 status: OK.   At least one spare drive designated.  At least one spare drive remains available.
/dev/cciss/c2d0: (Smart Array P812) Enclosure D2600 SAS AJ940A (S/N: CN894700M7      ) on Bus 2, Physical Port 2E status: OK.

If there are any changes to the status (if the output is different at all), you and I will get an email.

The internal OS disks are the RAID1. The NFS disks are the RAID 6.
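
The change detection described above can be illustrated with a short sketch. This is not the actual cronjob: status_changed is a hypothetical helper, and it only shows the general idea of comparing the fresh status output with the copy saved by the previous run.

```shell
# Compare a freshly generated status file with the previously saved one.
# Prints "changed" and returns 0 when they differ or no saved copy exists;
# prints "unchanged" and returns 1 otherwise. On a change, the saved copy
# is replaced with the new status. status_changed is a hypothetical name.
status_changed() {
    new=$1
    saved=$2
    if [ -f "$saved" ] && cmp -s "$new" "$saved"; then
        echo unchanged
        return 1
    fi
    cp "$new" "$saved"
    echo changed
    return 0
}
```

A cron wrapper could send mail whenever the function returns 0.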

Restoring Files from Backup

(this is information from Lance Nakata)

Sorry for the delay. Here are a few examples using dsmc, the TSM
CLI interface. Note that the keyboard emulation is quite bad.
The default is overwrite mode; hitting the Insert key on your
keyboard will toggle insert mode. I often must use ctrl-h
instead of backspace to delete characters. The arrow keys might
be the easiest way to move around, but beware of the overwrite
mode.

up arrow = display (and optionally run) previous command(s)
dn arrow = display (and optionally run) next command(s)
ctrl-l = move back a word

sudo dsmc incremental (start an incremental backup of /u1)
sudo dsmc (start dsmc in interactive mode. root access allows
you to restore anyone's files.)

Once you're in interactive mode (as root on suncatfs1):

tsm> query filespace (list all file space names)
tsm> query backup /u1/ (list files/dirs in /u1) (note trailing slash)
tsm> query backup /u1/cpo  (list /u1/cpo directory)
tsm> query backup /u1/cpo/* (list files/dirs in /u1/cpo)
tsm> query backup /u1/cpo/ -subdir=yes (recursively list files/dirs in /u1/cpo)
tsm> query backup /u1/cpo/ -subdir=yes -inactive (recursively list files/dirs in /u1/cpo, including inactive copies)

It's better to narrow your search to a particular subdirectory,
then recover that subdirectory or files within it. You can
also narrow the timeframe using the -pitdate and -pittime
point-in-time options. It is recommended you create a
destination directory in advance and restore the files there
rather than writing back to the original location (unless that
original location is empty).

The next command is one line. It says to do a restore of a
backup, preserve the complete file path, recurse through the
source directory, use a point-in-time date of 03/04/2011 and a
time of midnight, restore source is /u1/cpo/test/, restore
destination is /u1/cpo/test.restore/. NOTE: both subdirs have a
trailing slash.

tsm> restore -preservepath=complete -subdir=yes -pitdate=03/04/2011 -pittime=00:00 /u1/cpo/test/ /u1/cpo/test.restore/
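
Because the trailing slashes are easy to forget, the restore line can be generated rather than typed. The following is a sketch only: tsm_restore_command is a hypothetical helper that prints the command for pasting at the tsm> prompt, using the same options as the example above and adding a trailing slash to each directory if it is missing. The destination directory should still be created in advance.

```shell
# Print (do not run) a point-in-time restore command for the tsm> prompt.
# Usage: tsm_restore_command <source-dir> <dest-dir> <MM/DD/YYYY> [HH:MM]
# tsm_restore_command is a hypothetical helper name.
tsm_restore_command() {
    src=${1%/}/      # add a trailing slash if it is missing
    dst=${2%/}/
    pitdate=$3
    pittime=${4:-00:00}   # default to midnight
    printf 'restore -preservepath=complete -subdir=yes -pitdate=%s -pittime=%s %s %s\n' \
        "$pitdate" "$pittime" "$src" "$dst"
}
```

For example, tsm_restore_command /u1/cpo/test /u1/cpo/test.restore 03/04/2011 prints the same line as the example above.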

If you want to pick which files to restore from the list above,
add the -pick option to the command line.

The previous night's backup log is in /var/adm/tsm/dsmsched.log.
It's a circular log set to max at 200MB in size. It will wrap
around once it hits that limit, which is set in the
/var/adm/tsm/dsm.sys file.

Full dsmc documentation is at
/afs/slac.stanford.edu/package/tsm/doc/tsm550_clientunix.pdf

There are many more options listed there. Practicing a few
restores is probably the best thing to do. Just make sure to
restore to a temp directory somewhere.

How To Set Up Root Access

This is set up via /root/.k5login, which uses your Kerberos credentials.

$ less /usr/kerberos/man/man5/.k5login.5.gz
