Debugging Tips

troubleshooting:
- check whether the user is over quota
- regenerate ssh keys using the script mentioned in the error message
- look for multiple servers for the same user (can cause sqlite database lock issues)

From Xiang on installing his own user jupyter kernel with psana:

Thanks Chris, you pointed me in the right direction. I used option 2 posted here https://stackoverflow.com/questions/58068818/how-to-use-jupyter-notebooks-in-a-conda-environment, which is similar to your suggestion. This part was easily done.

More time was spent on getting psana to work in my environment; it needs the proper environment variables set for the kernel. In the end it worked out, and now it's great that I can work in this new kernel just as in LCLS-II py3, but with the flexibility to install user packages.
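
For reference, the user-kernel route looks roughly like this (a sketch only; the env/kernel name "mypsana" is just an example, and the psana environment variables are the ones shown in the kernel.json example further down):

conda create -n mypsana python=3 ipykernel     # or conda install ipykernel in an existing env
conda activate mypsana
python -m ipykernel install --user --name mypsana --display-name "mypsana"
# then add the psana variables (SIT_DATA, SIT_ROOT, LD_LIBRARY_PATH) to the
# "env" block of the generated kernel.json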

**********************************************************************

06/11/19

clemens works with murali and wilko on this.

- clemens recommends "classic notebook" for now, but soon "jupyter lab"
  (note: as of apr 28, 2020 "jupyterlab" doesn't work; it fails with an error
  roughly like "session name already exists")
- advantage of jupyter-lab: 
  o file browser
  o terminal
  o extensions
- would like to add another drop-down for running on a gpu node
  (not useful for gpu work yet, since a specific psanagpu node can't be selected)
- jupyterhub/notebook/lab are all in one separate conda installation:
  /reg/g/psdm/sw/jupyterhub/miniconda3/

to manage the env:

source /reg/g/psdm/sw/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate jhub

**********************************************************************

the ana-current kernel needs to be updated by hand:

/reg/g/psdm/sw/conda/jhub_config/prod-rhel7/kernels/ana-current

some jupyter stuff is here:  /reg/g/psdm/sw/jupyterhub/

https://psjh.slac.stanford.edu:8000

global log files are on pshub01 in /var/log/messages

e.g. this line shows the spawning of a new server:
Jun 21 17:15:54 pshub01 jupyterhub: [I 2022-06-21 17:15:54.541 JupyterHub sshspawner:84] hostname: psanagpu110 port: 49178 pid: 18534
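
To pull a user's recent spawn activity out of that log by hand, something like this works (assuming shell access to pshub01 and permission to read /var/log/messages, which may need sudo):

ssh pshub01 "grep sshspawner /var/log/messages | tail -n 20"
# other jupyterhub lines in the same log include the username; grep for the
# username or for the reported hostname/pid to narrow things down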


a json file has to set the environment when switching between conda
environments, e.g.

/reg/neh/home/weninc/.local/share/jupyter/kernels/python2/kernel.json:

(ana-1.0.8) psanaphi101:python2$ more kernel.json 
{
 "display_name": "Conda test kernel sfs", 
 "language": "python", 
 "argv": [
  "/reg/g/psdm/sw/conda/inst/miniconda2-prod-rhel7/envs/ana-1.0.8/bin/python", 
  "-m", 
  "ipykernel", 
  "-f", 
  "{connection_file}"
 ],
 "env": {"SIT_DATA": "/reg/g/psdm/sw/conda/inst/miniconda2-prod-rhel7/envs/ana-1
.0.6/data:/reg/g/psdm/data",
         "SIT_ROOT": "/reg/g/psdm", "LD_LIBRARY_PATH": ""}
}

Have to unset LD_LIBRARY_PATH, otherwise "import _psana" picks up old libraries.

**********************************************************************

clemens knowledge transfer aug 5, 2019

the server doesn't execute code
code gets executed in the kernel, which is separate from the server
kernels can have different python versions, which can also differ from the server's

jupyterhub starts one (single-user) server per user

https://jupyterhub.readthedocs.io/en/stable/

the spawner is the most important part: we wrote our own
upstream already had:
- a docker-image spawner
- a local spawner (doesn't meet our requirement: we wanted the servers to be
  started on different nodes)
- a batch-queue spawner

our spawner is called "ssh spawner".  it does "ssh psana" and launches
a server.
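
By hand, the effect is roughly this (illustrative only; the real sshspawner.py picks the port itself and also sets the JUPYTERHUB_* environment variables for the single-user server):

ssh psana
source /reg/g/psdm/sw/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate jhub
jupyter-labhub --ip=0.0.0.0 --port=42066    # port normally chosen by the spawner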

the hub -> control panel tab in the jupyter notebook shows all available notebook servers

clemens has a script "jhub-stats.py" that shows which node runs which server

github repo: slac-lcls/jhub2 git@github.com:slac-lcls/jhub2.git
config script:
- jhub2/jupyterhub_config.py  specifies authenticator and spawner
- has a timeout to shut down servers that have been idle for 2 days (sometimes
  takes longer than the set value); "cull_idle_server.py" is the script, copied
  from the jupyterhub repo: "jupyterhub/examples/cull-idle/"
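
The culler can also be run by hand against the hub API while debugging (a sketch; the flags come from the upstream cull-idle example of that era, so check the script's --help, and run it on pshub01 where the hub lives):

export JUPYTERHUB_API_TOKEN=$(jupyterhub token)
python cull_idle_server.py --timeout=172800 --url=http://127.0.0.1:8081/hub/api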

jupyterhub runs on the server pshub01.  it runs from the local /u1 disk to avoid
nfs, and the database is on /u1, but the env is on nfs:
/reg/g/psdm/sw/jupyterhub/miniconda3/envs/jhub/bin/jupyterhub

security: jupyterhub has to run as root, but has to become the user.  the root
jupyterhub process runs the "ssh" command as the appropriate user, so the user
needs working passwordless ssh keys; the generate-keys.sh script generates
them.  http error 511 suggests that the user has unusable ssh keys as described
above.  users can break this by having wrong permissions on their .ssh keys.
try ssh'ing as the user to see if it works, and have them run the
generate-keys.sh script from https://github.com/slac-lcls/jhub2
this page has useful information:

https://pswww.slac.stanford.edu/errors/JupyterHubCustomErrorPage.html
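
To check the ssh-key situation by hand (a sketch; exact key file names and the location of generate-keys.sh depend on the jhub2 checkout):

# as the affected user:
ssh psana hostname                       # should print a psana node with no password prompt
ls -ld ~/.ssh ~/.ssh/authorized_keys     # typical fix: chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys
# if the keys themselves are bad, regenerate them with generate-keys.sh from
# https://github.com/slac-lcls/jhub2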

history of user problems can be found in ~/.jhub.log

the ssh spawner uses asyncio in the execute function.  it forces the user's
shell to bash and does not execute the .bashrc.  sshspawner puts a pid/hostname
in the sql table (used by jhub_stats.py).  JUPYTER_PATH is where all the
kernels are.  the "start()" method starts the notebook.  every user has a
~/.jhub.log for debugging.  "poll()" is called by the hub to check whether the
user's server is alive (once every 60 sec).  if the node with the user's server
is dead (can't ssh), the hub doesn't learn about it because it can't call
poll(); when the node comes back, the hub realizes the user's server is dead.

kernels (e.g. ana-1.4.2) live in /reg/g/psdm/sw/conda/jhub_config/prod-rhel7.
the user's jupyter-notebook server needs to be restarted for kernel changes to
take effect.  each kernel has a json file which sets up the env; an "ipykernel
install" command generates the json, while ana-current is edited by hand.  a
user can install a kernel on their own with "python2 -m ipykernel install
--user", which generates the .json file in a local dir; it should then show up
in the user's list of possible kernels.  maybe we could slowly remove the old
kernels, since we are moving more to just using the ana-current kernel.
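
To see where a user kernel landed and which kernels are registered (kernel names here are just examples):

jupyter kernelspec list
# user kernels live under ~/.local/share/jupyter/kernels/<name>/kernel.json,
# central ones under /reg/g/psdm/sw/conda/jhub_config/prod-rhel7/kernels/
jupyter kernelspec remove old-kernel-name    # one way to retire a stale user kernel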

two styles: notebook and jupyterlab.  lab is more modern and has more features:
terminals, ssh, etc.  notebook will get phased out; the recommendation is to
move toward lab.

extensions are tricky (e.g. interactive matplotlib; ipywidgets, e.g. for
sliders, is the most useful; bokeh).  extensions have server and kernel parts.
the server part is javascript (needs nodejs), so things have to be installed
"twice".  jupyterlab and the notebook are in the same conda env but with
different extensions.  extensions have to be installed in the notebook server
in our conda env.  clemens thinks extensions have to be installed in the
central env; wilko thinks they can be installed locally.
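
For the jupyterlab versions in use at the time, the server-side install was roughly the following (a sketch; needs nodejs in that env, and newer jupyterlab/ipympl releases ship prebuilt extensions instead):

source /reg/g/psdm/sw/jupyterhub/miniconda3/etc/profile.d/conda.sh
conda activate jhub
jupyter labextension install @jupyter-widgets/jupyterlab-manager jupyter-matplotlib
jupyter labextension list    # both extensions should show up here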

problem areas:
- ssh keys
- extensions
- nodes with user servers dying (the hub only notices once the node is back)

be careful when updating jupyterlab/jupyterhub with conda: the database formats
can change

debugging:

- on pshub01 can see user session being created with:
  sudo journalctl -u jupyterhub -n 20 -f
- look in user's .jhub.log file in their home dir

**********************************************************************

Sept. 6, 2020

Error 'NoneType' object has no attribute

Traced back to an authentication failure because a node that was closed off to users was
being selected with "ssh psana"

This happens on this line of code:

https://github.com/slac-lcls/jhub2/blob/3494cc530a27dcc8d21a042081cf2a19cbc7d9eb/sshspawner.py#L69

If you look in the “execute” function (line 15) it does effectively “ssh psana”.   So I tried this as user jeph on pshub01:

bash-4.2$ hostname
pshub01
bash-4.2$ whoami
jeph
bash-4.2$ ssh psana
Authentication failed.

**********************************************************************

from matt henderson:

In the server environment (not the kernel env) you have to install ipywidgets
first so other packages (e.g. ipympl) can register.

check that extensions are ok with "jupyter labextension list";
should see jupyterlab-manager and jupyter-matplotlib (plotly also has one)
on the kernel side the order doesn't matter, but the packages are needed as
well: ipywidgets/ipympl

the installation process runs "jupyter lab build" automatically as the last
step; if it doesn't happen (e.g. if it runs out of memory), run it by hand
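
Running the build by hand looks like this (the --minimize flag skips the memory-hungry production minification; flag availability depends on the jupyterlab version, so check "jupyter lab build --help"):

jupyter lab build
# if the build dies from memory pressure:
jupyter lab build --minimize=False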

**********************************************************************

matt needs:

conda install redis redis-py aioredis

(currently in ps-3.0.15)

**********************************************************************

database lock issue (from wilko):

see 3 jupyter servers on psanagpu110:

psanagpu110:~$ ps -ef | grep labhub | grep anatut
anatut   1813    1 0 18:10 ?       00:00:01 /reg/g/psdm/sw/jupyterhub/miniconda3/envs/jhub/bin/python /reg/g/psdm/sw/jupyterhub/miniconda3/envs/jhub/bin/jupyter-labhub --ip=0.0.0.0 --port=42066 
anatut  18534    1 3 17:15 ?       00:05:23 /cds/sw/ds/dm/conda/envs/jhub-2.0.0/bin/python3.9 /cds/sw/ds/dm/conda/envs/jhub-2.0.0/bin/jupyter-labhub --ip=0.0.0.0 --port=49178 
anatut  19718    1 0 15:46 ?       00:01:07 /cds/sw/ds/dm/conda/envs/jhub-2.0.0/bin/python3.9 /cds/sw/ds/dm/conda/envs/jhub-2.0.0/bin/jupyter-labhub --ip=0.0.0.0 --port=36939

(can also look with "lslocks", see below)

On psanagpu110 and as anatut the following fails:
%/cds/home/w/wilko/psdm/dev/envs/tst39/bin/dm lock --lockf nbsignatures.db
    Failed to obtain shared lock nbsignatures.db
lsof shows that processes 18534 and 19718 have nbsignatures.db open.

anatut@psanagpu110% lslocks | grep nbsig 
jupyter-labhub 19718 POSIX   60K READ 0 1073741826 1073742335 /cds/home/a/anatut/.local/share/jupyter/nbsignatures.db 
jupyter-labhub 19718 POSIX   60K WRITE 0 1073741825 1073741825 /cds/home/a/anatut/.local/share/jupyter/nbsignatures.db 
jupyter-labhub 18534 POSIX   60K READ 0 1073741826 1073742335 /cds/home/a/anatut/.local/share/jupyter/nbsignatures.db


To find the active jupyter server process for a user one has to check the jupyterhub database on pshub01. I ran:
% ssh pshub01  /u1/jupyterhub/prod/jhub2/user_status.sh

(ps-4.5.15) psbuild-rhel7-01:~$ ssh pshub01  /u1/jupyterhub/prod/jhub2/user_status.sh | grep anatut
cpo@pshub01's password: 
anatut|psanagpu107|{"pid": 19197, "hostname": "psanagpu107"}|2022-06-22 03:32:56.997353|2022-06-22 03:38:09.536620
(ps-4.5.15) psbuild-rhel7-01:~$ 
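
A hedged cleanup sketch for the duplicate-server case: compare the pid/hostname recorded in the hub database (user_status.sh output) with what is actually running on the node, and kill only the servers the hub does not know about (pids below are just the ones from the listing above, as an example):

ssh psanagpu110
ps -ef | grep jupyter-labhub | grep anatut    # list the running servers
kill 1813 18534                               # only pids not recorded in the hub db
lslocks | grep nbsig                          # stale locks on nbsignatures.db should clear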

Problems Seen During UXSS 2022

Recommendation: if we use jupyter next time, have students get their own accounts (like an experiment).  With anatut there are too many "shared resources" that can bring down the system for everyone:

  • disk space
  • memory (everyone ends up on one server)
  • "stop my server"
  • opening "my server" brings up someone else's notebook
  • (minor) default notebook name "Untitled1.ipynb" can create name conflict

many people were using the "anatut" account simultaneously (others used their own SLAC accounts)

had to continually watch:

  • memory usage on anatut server node (where all processes run)
  • anatut disk usage
  • watch for server moving to another node (killed by either user clicking "stop my server" or OOM killer?) and duplicate servers
  • kept a "canary" anatut notebook running all the time

user stilton2 had a weka home-directory issue on gpu107, which prevented him from starting/stopping notebooks
database locked (two issues): this error was traced down to two jupyterhub servers running on the same node for user "anatut":

    sqlite3.OperationalError: database is locked

multiple server problem: the same three anatut jupyter-labhub processes shown in the ps output above (pids 1813, 18534, 19718).

other problems we encountered:
memory problems
quota problems

the anatut server moved from gpu111 to a different node (gpu102), but the
gpu111 server process did not die.  gpu102 then ran out of memory because of a
user script and the node crashed.  the server moved back to gpu111, where we
then had two servers, which could be the explanation for the database lock
problem we saw previously.
