Installation

on pslogin
ana-1.3.37
scs
cd ...
virtualenv venv-pymongo
source venv-pymongo/bin/activate

???
# python -m pip install pymongo 


Alternative installation:
-------------------------
# https://docs.mongodb.com/manual/tutorial/install-mongodb-on-linux/
cd lib

curl -O https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-3.6.2.tgz

tar -zxvf mongodb-linux-x86_64-3.6.2.tgz

mkdir -p mongodb
cp -R -n mongodb-linux-x86_64-3.6.2/ mongodb

export PATH=/reg/neh/home/dubrovin/LCLS/venv-pymongo/lib/mongodb/mongodb-linux-x86_64-3.6.2/bin/:$PATH
echo $PATH
The same is done by
source set_path_to_mongodb

1. Create the data directory
mkdir -p ./data/db

2. Set r/w permissions for the data directory
chmod 775 data
chmod 775 data/db

Run server

pslogin
ssh psanaphi105
cd LCLS/venv-pymongo/
source bin/activate
source set_path_to_mongodb
# assumes that ./data/db already exists
mongod --dbpath ./data/db --bind_ip_all &

!!! DO NOT CLOSE THE WINDOW: mongod runs as a background job of this shell.
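
Before starting a client, it can be useful to check that the server port accepts connections. The sketch below is only an illustration: the function name server_is_reachable is hypothetical, and it tests the TCP port only, not whether mongod is healthy. The default port 27017 is the one used in the shell example on this page.

```python
import socket


def server_is_reachable(host, port=27017, timeout_sec=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, server_is_reachable('psanaphi105') should return True while mongod is running on that host.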

Shell

The mongo shell is an interactive command-line interface to the server.

mongo --host psanaphi105 --port 27017

To exit the shell, type quit() or use the <Ctrl-C> shortcut.

> db
test

> show dbs
admin           0.000GB
calib-cxif5315  0.006GB
config          0.000GB
local           0.000GB

> use calib-cxif5315
switched to db calib-cxif5315

> show collections
cspad-0-cxids1-0
cspad-1

> db["cspad-0-cxids1-0"].find()
> db["cspad-0-cxids1-0"].find().pretty()
> help

Connection to DB in Python

Connection to the database

from pymongo import MongoClient
#client = MongoClient('localhost', 27017)
client = MongoClient('psanaphi105', 27017) #, username=uname, password=pwd)
db = client['calib-cxi12345']
col = db['camera-0-cxids1-0']

Connection time is 50-150ms depending on host and time.

Tentative model of the calibration store

Experiment-centric calibration data base

Tentative schema of the experiment-centric db

# Database for experiment
dbexp = client["calib-cxif5315"]

# Collections:
col1 = dbexp["cspad-0-cxids2-0"]
col2 = dbexp["cspad2x2-0-cxids2-0"]
col3 = dbexp["andor-0-cxids2-0"]

# Document content for dbexp
doc = {
   "_id": ObjectId("53402597d852426020000002"),
   "experiment": "cxif5315",
   "run": 123,
   "detector": "cspad-0-cxids2-0",
   "ctype": "pedestals",
   "time_sec": 1516321053,
   "time_nsec": 123456789,
   "time_stamp": "2018-01-18T16:17:33.123456789-0800",
   "version": "v00-11-22",
   "facility": "LCLS2",
   "uid": "login-name",
   "host": "psanaphi102",
   "comments": ["very good constants", "throw them in trash immediately!"],
   "data_size": 32*185*388,
   "data_shape": (32,185,388),
   "data_type": "int16",
   "data": np.array(...),
}

All meta-data information is accessible through a single-level document.
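
A minimal sketch of how such a single-level document could be filled from a numpy array. The helper name make_calib_doc and its signature are hypothetical and not part of the evaluated code. Fields that cannot be derived from the array (uid, host, facility, ...) are passed as keyword arguments.

```python
import time

import numpy as np


def make_calib_doc(nda, experiment, run, detector, ctype, **extra):
    """Build a flat metadata dictionary for one set of calibration constants."""
    t = time.time()
    t_sec = int(t)
    doc = {
        "experiment": experiment,
        "run": run,
        "detector": detector,
        "ctype": ctype,
        "time_sec": t_sec,
        "time_nsec": int((t - t_sec) * 1e9),
        "time_stamp": time.strftime("%Y-%m-%dT%H:%M:%S%z", time.localtime(t_sec)),
        "data_size": int(nda.size),
        "data_shape": list(nda.shape),  # stored as a BSON array, read back as a list
        "data_type": str(nda.dtype),
    }
    doc.update(extra)  # e.g. uid, host, version, facility, comments
    return doc
```

All values are plain Python types, so the dictionary can be passed to insert_one() as is; the array itself is stored separately, as described in the data flow sections.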

Detector-centric calibration data base

Tentative schema for detector-centric db

# References or DBRefs for detectors

dbdet = client['calib-cspad'] 
col1 = dbdet['cspad-0-cxids1-0']
col2 = dbdet['cspad-0-cxids2-0']
col3 = dbdet['cspad-0-cxidsd-0']
col4 = dbdet['cspad-0-xcsendstation-0']
col5 = dbdet['cspad-0-xppgon-0']
col6 = dbdet['cspad-0-sxrbeamline-1']
col7 = dbdet['cspad-0-mectargetchamber-0']

# Document content for dbdet
doc = {
   "_id":ObjectId("..."),
   "ref_id": ObjectId("534009e4d852427820000002"),
   ...
}

Essentially, each document in a detector collection holds a reference to the data in the experiment collections.
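
Following such a reference takes one extra query. The sketch below assumes that "ref_id" holds the "_id" of a document in an experiment collection, as in the tentative schema. The function name resolve_reference is hypothetical, and exp_collection only needs to behave like a pymongo Collection, that is, to provide find_one(filter).

```python
def resolve_reference(det_doc, exp_collection):
    """Return the experiment document referenced by det_doc, or None."""
    ref_id = det_doc.get("ref_id")
    if ref_id is None:
        # The detector document carries no reference.
        return None
    # find_one returns None when no document matches the filter.
    return exp_collection.find_one({"_id": ref_id})
```

The price of the detector-centric view is therefore two round trips per access: one to the detector collection and one to the experiment collection.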

Data flow for documents less than 16 MB

Preparation of data

Conversion of numpy array to unicode

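from time import time
import pickle
from bson.binary import Binary

# gu is the author's utility module (its import is not shown on this page);
# gu.np is numpy.
nda = gu.random_standard(shape=(32,185,388), mu=20, sigma=5, dtype=float)

t0_sec = time()

# flatten the array and convert it to a space-separated text string
arr = nda.flatten()
arr = ' '.join(['%.2f' % v for v in arr])
# pickle the string and wrap it in a BSON binary (user-defined subtype 128)
sarr = Binary(pickle.dumps(arr, protocol=2), subtype=128)

doc = {
   "experiment": "cxi12345",
   "run": 124,
   ...
   "data": sarr,
}

dt_sec = time() - t0_sec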

Preparation of cspad data in text/unicode format for inserting takes ~1sec.
Only limited-precision data can be saved this way because of the 16MB limit on document size.
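
The size argument can be checked with simple arithmetic. The sketch below assumes that every value formatted with '%.2f' takes five characters (true for values between 10.00 and 99.99) and ignores the small overhead of pickle and of the other document fields. The constant name MAX_BSON_BYTES is chosen for illustration.

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB limit on a single document

shape = (32, 185, 388)  # cspad array shape used in the test above
n_elements = 1
for dim in shape:
    n_elements *= dim

# Raw float64 buffer: 8 bytes per element.
binary_bytes = n_elements * 8

# Text form: five characters per value plus one space between neighbours.
chars_per_value = 5
text_bytes = n_elements * chars_per_value + (n_elements - 1)

print("elements:", n_elements)
print("float64 bytes:", binary_bytes, "fits:", binary_bytes < MAX_BSON_BYTES)
print("text bytes:   ", text_bytes, "fits:", text_bytes < MAX_BSON_BYTES)
```

Under these assumptions the two-decimal text form stays below the limit while the raw float64 buffer does not, and each additional printed digit adds about 2.3MB.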

Inserting data

Insert document in collection

doc_id = col.insert_one(doc).inserted_id

Insertion time is 110-180ms.

Find data

t0_sec = time()
docs = col.find({"run": 125})
dt_sec = time() - t0_sec

Time to find data is 50-60us.
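
In pymongo, find() only builds a cursor; the query is sent to the server when the cursor is first iterated or indexed. A timing that includes retrieval of the first document may therefore be more representative. The helper below is a hypothetical sketch; col is expected to behave like a pymongo Collection.

```python
from time import time


def time_find_first(col, query):
    """Return (first_document_or_None, elapsed_seconds) for a query."""
    t0_sec = time()
    cursor = col.find(query)        # builds the cursor, nothing is fetched yet
    doc = next(iter(cursor), None)  # the first document is fetched here
    return doc, time() - t0_sec
```

For example, doc, dt_sec = time_find_first(col, {"run": 125}).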

Unpack data

Unpack data from unicode to numpy array

doc = docs[0]
xcarr = pickle.loads(doc["data"]) # 30-40ms
arr = gu.np.fromstring(xcarr, dtype=float, count=-1, sep=' ') # 300ms

Time to unpack is 350ms.
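
The text round trip can be tried without a server. The sketch below repeats the packing and unpacking steps on a small random array and prints the precision lost by the '%.2f' format. The function names are hypothetical, and numpy's random generator stands in for the gu.random_standard helper used above.

```python
import pickle

import numpy as np


def pack_as_text(nda, fmt="%.2f"):
    """Flatten the array, format each value as text and pickle the string."""
    text = " ".join(fmt % v for v in nda.ravel())
    return pickle.dumps(text, protocol=2)


def unpack_from_text(blob, shape):
    """Inverse of pack_as_text; the shape has to be supplied by the caller."""
    text = pickle.loads(blob)
    return np.fromstring(text, dtype=float, sep=" ").reshape(shape)


rng = np.random.default_rng(0)
nda = rng.normal(loc=20, scale=5, size=(4, 5, 6))
restored = unpack_from_text(pack_as_text(nda), nda.shape)
print("max abs error:", np.abs(restored - nda).max())
```

With two decimals the error per element is bounded by 0.005, which is the limited precision mentioned above.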

Data flow for large documents

Timing test is done for mongod running on psanaphi105 and scripts on psanagpu106.

Initialization

Initialization of MongoDB and GridFS

import gridfs
from pymongo import MongoClient
#client = MongoClient('localhost')
client = MongoClient('psanaphi105', 27017)
db = client['calib-cxi12345']
fs = gridfs.GridFS(db)
col = db['camera-0-cxids1-0']

Time to connect is 116-150ms.

Put

Save numpy array in db

ida = fs.put(nda.tobytes())

Time to save data is 330-420ms.

Preparation of document with metadata and insert

doc = {
   "experiment": "cxi12345",
   "run": 126,
   "detector": col.name,
   "ctype": "pedestals",
   "data_size":  nda.size,
   "data_shape": nda.shape,
   "data_type":  str(nda.dtype),
   "data_id":    ida,
   ...
}
doc_id = col.insert_one(doc).inserted_id

Preparation of the meta-data document with a reference to the data takes 43-53us.

Inserting the meta-data takes 0.5-0.6ms.

Get

Find and get document

docs = col.find({"time_stamp" : "2018-01-25T09:33:10PST"})
doc = docs[0]

Metadata find and get time: 0.7ms

Get data from fs

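s = fs.get(doc['data_id']).read()
# frombuffer replaces the deprecated binary mode of fromstring
nda = gu.np.frombuffer(s, dtype=doc['data_type'])

Data extraction time: 96ms. The returned array is flattened and needs to be reshaped using the saved data_shape.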
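
A minimal sketch of the binary round trip, without the database, showing how the stored data_type and data_shape restore the original array. The function names are hypothetical; the metadata keys are those of the document above.

```python
import numpy as np


def array_to_bytes(nda):
    """Return (raw_bytes, metadata); the metadata allows exact restoration."""
    meta = {
        "data_size": int(nda.size),
        "data_shape": list(nda.shape),
        "data_type": str(nda.dtype),
    }
    return nda.tobytes(), meta


def bytes_to_array(raw, meta):
    """Rebuild the array from raw bytes using the stored dtype and shape."""
    flat = np.frombuffer(raw, dtype=meta["data_type"])
    return flat.reshape(meta["data_shape"])
```

The array returned by frombuffer is a read-only view of the byte string; call .copy() on it if the constants have to be modified.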

Summary

MongoDB structure has limitations in the number of levels and in document size.
- A server may have many DBs.
- A DB is a container for collections.
- A collection is a group of documents.
- A document is a JSON/BSON object of key:value pairs (dictionary). Each value may itself be a dictionary, and so on, but the DB structure itself supports no further levels.
  - Document size has a hardwired limit of 16MB (increased from 4 to 16MB in 2010, and the developers do not want to change it). CSPAD 2Mpix*8byte(double) = 16MB, but we may expect larger detectors like Jungfrau, Epix, Andor, etc.
  - Larger data should be saved using GridFS, which splits the data into chunks and saves the chunks in the same DB in separate collections.
  - A JSON (text) object in MongoDB is represented in Unicode (UTF-8). Data has to be converted to Unicode when saving and back when retrieving.
A schema-less DB looks interesting to a certain extent, but in order to find something in the DB there should be a schema...
GridFS works fine with document size >16MB.
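
The choice between the two data flows can be sketched as a size check. The function name storage_plan is hypothetical, and the chunk size of 255 kB is assumed to be the GridFS default; the check also ignores the space taken by the meta-data fields of the document.

```python
import math

MAX_BSON_BYTES = 16 * 1024 * 1024  # single-document limit
GRIDFS_CHUNK_BYTES = 255 * 1024    # assumed GridFS default chunk size


def storage_plan(nbytes, chunk_bytes=GRIDFS_CHUNK_BYTES):
    """Return ("document", 1) or ("gridfs", number_of_chunks)."""
    if nbytes < MAX_BSON_BYTES:
        # Small enough to be embedded in a single document.
        return "document", 1
    return "gridfs", math.ceil(nbytes / chunk_bytes)
```

For the cspad array of shape (32,185,388), int16 data gives storage_plan(32*185*388*2) in the "document" branch, while float64 data gives storage_plan(32*185*388*8) in the "gridfs" branch.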
