...

Code Block
titleShell
collapsetrue
mongo --host psanaphi105 --port 27017

To exit the shell, type quit() or use the <Ctrl-C> shortcut.

> db
test

> show dbs
admin           0.000GB
calib-cxif5315  0.006GB
config          0.000GB
local           0.000GB

> use calib-cxif5315
switched to db calib-cxif5315

> show collections
cspad-0-cxids1-0
cspad-1

> db["cspad-0-cxids1-0"].find()
> db["cspad-0-cxids1-0"].find().pretty()

# Delete database:
use calib-cxif5315 
db.dropDatabase()

# Delete collection
db.collection.drop()
# OR:
db["cspad-0-cxids1-0"].drop()

> help

# Export/backup database to an archive file or to a directory
mongodump -d <dbname> --archive=<filename>
# OR:
mongodump -d <dbname> --out /path/to/backup/dir

# Import/restore database from an archive file
mongorestore -d <dbname> --archive=<filename>

Connection to the DB in Python
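
A minimal pymongo connection sketch (host and port as in the shell example above; the database and collection names are illustrative):

Code Block
titleConnect to MongoDB from python (sketch)
collapsetrue
# Minimal sketch; host/port as in the shell example above,
# database/collection names are illustrative.
from pymongo import MongoClient

client = MongoClient('psanaphi105', 27017)
db  = client['calib-cxif5315']          # database
col = db['cspad-0-cxids1-0']            # collection
print(client.list_database_names())     # analogue of "show dbs"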

...

Code Block
titleTentative schema of the experiment-centric db
collapsetrue
# Database for experiment
dbexp = client["calibdb-cxif5315"]

# Big-data file system (GridFS) for the experiment DB
fs = big_data_file_system(dbexp)

# Collections:
col1 = dbexp["cspad-0-cxids2-0"]
col2 = dbexp["cspad2x2-0-cxids2-0"]
col3 = dbexp["andor-0-cxids2-0"]
...
# Auto-generated collections (GridFS)
'fs.files'
'fs.chunks'

# Document content for dbexp
doc = {
   "_id": ObjectId("53402597d852426020000002"), # auto-generated
   "experiment": "cxif5315",
   "run": "123",
   "detector": "cspad-0-cxids2-0",
   "ctype": "pedestals",
   "time_sec": "1516321053",
   "time_nsec": "123456789",
   "time_stamp": "2018-01-18T16:17:33.123456789-0800",
   "version": "v00-11-22",
   "facility": "LCLS2",
   "uid": "login-name",
   "host": "psanaphi102",
   "port": "12345",
   "comments": ["very good constants", "throw them in trash immediately!"],
   "data_size": 32*185*388,
   "data_shape": (32,185,388),
   "data_type": "int16",
   "data": np.array(...),                  # small data stored in-line, or
   "id_data": "5a98464a5777035bba3a4f41",  # reference to the big data in GridFS
}

All metadata is accessible through a single-level document.
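
With this single-level layout any metadata field can be used directly in a query. A minimal sketch, using the dbexp collection names from the tentative schema above (the field values are illustrative):

Code Block
titleQuery by metadata fields (sketch)
collapsetrue
# Sketch only: field names follow the document layout above,
# the values are illustrative.
doc = dbexp["cspad-0-cxids2-0"].find_one({
    "experiment": "cxif5315",
    "run"       : "123",
    "ctype"     : "pedestals",
})
if doc is not None:
    print(doc["time_stamp"], doc["version"])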

...

Code Block
titleTentative schema for detector-centric db
collapsetrue
# Model #1: DB per detector type, one collection per detector
#           (documents reference the data through DBRefs/ids)
-------------------------------------------------------------
dbdet = client['calib-cspad']

# Collections:
col1 = dbdet['cspad-0-cxids1-0']
col2 = dbdet['cspad-0-cxids2-0']
col3 = dbdet['cspad-0-cxidsd-0']
col4 = dbdet['cspad-0-xcsendstation-0']
col5 = dbdet['cspad-0-xppgon-0']
col6 = dbdet['cspad-0-sxrbeamline-1']
col7 = dbdet['cspad-0-mectargetchamber-0']

# Document content for dbdet:
# the same as for dbexp plus a reference to the data document
doc = {
   "_id": ObjectId("..."),                          # auto-generated
   "id_data": ObjectId("534009e4d852427820000002"), # reference to the data
   etc...
}

...



# Model #2: DB per detector, one collection per detector:
---------------------------------------------------------
dbdet = client('db-cspad-0-cxids1-0')
col = dbdet['cspad-0-cxids1-0']

# Collections added in case of DB copy (GridFS):
'fs.files'
'fs.chunks'


Essentially, a document in the detector collection holds a reference to the data in the experiment collections.
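
A hedged sketch of resolving such a reference: look up the document in the detector collection, then fetch the referenced data document from the experiment DB (the DB/collection names and the "id_data" field follow the tentative schemas above and are not a final API):

Code Block
titleResolve a detector-collection reference (sketch)
collapsetrue
# Sketch under the tentative schemas above; the "id_data" field and
# the DB/collection names are assumptions, not a final interface.
from pymongo import MongoClient

client = MongoClient('psanaphi105', 27017)

det_doc  = client['calib-cspad']['cspad-0-cxids2-0'].find_one({'ctype': 'pedestals'})
data_doc = client['calibdb-cxif5315']['cspad-0-cxids2-0'].find_one({'_id': det_doc['id_data']})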

 


Data flow for documents less than 16 MB

Preparation of data

Code Block
titleConversion of numpy array to unicode
collapsetrue
# "gu" is a utility module providing numpy as gu.np and random_standard
nda = gu.random_standard(shape=(32,185,388), mu=20, sigma=5, dtype=gu.np.float)

import pickle
from time import time
from bson.binary import Binary

t0_sec = time()

# convert the array to a text string with limited precision,
# then pickle it and wrap it as a BSON Binary object
arr = nda.flatten()
arr = ' '.join(['%.2f' % v for v in arr])
sarr = Binary(pickle.dumps(arr, protocol=2), subtype=128)

doc = {
   "experiment": "cxi12345",
   "run": 124,
   ...
   "data": sarr,
}

dt_sec = time() - t0_sec
  • Preparation of CSPAD data in text/unicode format for insertion takes ~1 sec.
  • Only limited-precision data can be saved because of the 16 MB document-size limit.
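
The prepared document is then inserted with the standard pymongo call (a minimal sketch; "col" is a pymongo collection object, e.g. client['calib-cxi12345']['cspad-0-cxids1-0'], the names are illustrative):

Code Block
titleInsert the prepared document (sketch)
collapsetrue
# Minimal sketch; "col" is a pymongo collection object with names
# chosen for illustration only.
doc_id = col.insert_one(doc).inserted_id
print('inserted _id:', doc_id)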

...

Code Block
titleFind data
collapsetrue
t0_sec = time()
docs = col.find({"run": 125})
dt_sec = time() - t0_sec

Time to find data: 50-60 µs.

Unpack data

Code Block
titleUnpack data from unicode to numpy array
collapsetrue
doc = docs[0]
xcarr = pickle.loads(doc["data"]) # 30-40ms
arr = gu.np.fromstring(xcarr, dtype=float, count=-1, sep=' ') # 300ms

Time to unpack is 350ms.

Data flow for large documents

Timing tests were done with mongod running on psanaphi105 and the test scripts running on psanagpu106.

Initialization

Code Block
titleInitialization MongoDB and GridFS
collapsetrue
import gridfs
from pymongo import MongoClient
#client = MongoClient('localhost')
client = MongoClient('psanaphi105', 27017)
db = client['calib-cxi12345']
fs = gridfs.GridFS(db)
col = db['camera-0-cxids1-0']

Time to connect 116-150ms.

Put

Code Block
titleSave numpy array in db
collapsetrue
ida = fs.put(nda.tobytes())

Time to save data 330-420ms.

Code Block
titlePreparation of document with metadata and insert
collapsetrue
doc = {
   "experiment": "cxi12345",
   "run": 126,
   "detector": col.name,
   "ctype": "pedestals",
   "data_size":  nda.size,
   "data_shape": nda.shape,
   "data_type":  str(nda.dtype),
   "data_id":    ida,
	...
}
doc_id = col.insert_one(doc).inserted_id

Preparation of the metadata document with the data reference takes 43-53 µs.

Inserting the metadata takes 0.5-0.6 ms.

Get

Code Block
titleFind and get document
collapsetrue
docs = col.find({"time_stamp" : "2018-01-25T09:33:10PST"})
doc = docs[0]

Metadata find and get time: 0.7ms

Code Block
titleGet data from fs
collapsetrue
s = fs.get(doc['data_id']).read()
nda = gu.np.fromstring(s)
Data extraction time: 96 ms. The returned array is flattened and needs to be reshaped.
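
A hedged sketch of restoring the shape and type from the stored metadata (assumes the document carries data_shape and data_type as in the insert example above):

Code Block
titleRestore array shape and type (sketch)
collapsetrue
# Sketch only: assumes the metadata document stores data_shape and
# data_type as in the insert example above.
import numpy as np
s = fs.get(doc['data_id']).read()
nda = np.frombuffer(s, dtype=doc['data_type']).reshape(doc['data_shape'])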

Interface from Murali

2018-08-03 e-mail from Murali:
I have installed Mongo 4.0 on psdb-dev. I was hoping to use their REST service but this seems to have been deprecated and eliminated since 3.6. 
So, I knocked a quick web service and have proxied it from pswww. This web service (https://github.com/slaclab/psdm_mongo_ws) is a suggestion only; please let me know if you need something different. 
These are examples of getting data over HTTPS from a batch node from within cori; needless to say, the URL prefix is https://pswww.slac.stanford.edu/calib_ws

Two users:

  • mongo --host=psdb-dev --port 9306 -u "dubrovin" -p "...." --authenticationDatabase "admin"
  • mongo --host=psdb-dev --port 9306 -u "calibuser" -p "...." --authenticationDatabase "admin"

Test commands (an equivalent Python sketch follows this list):

  • curl -s "https://pswww.slac.stanford.edu/calib_ws/test_db/test_coll/5b649a9df59ae00bda110168"
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/test_db/test_coll"
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/test_db/test_coll?item=planner&size.uom=cm"
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/test_db/test_coll?query_string=%7B%20%22item%22%3A%20%22planner%22%2C%20%22qty%22%3A%2075%20%7D%0A"
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/" - get string of databases
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/test_db" - get list of collections in database
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/cdb_cxic0415/cspad_detnum1234?ctype=pedestals&data_size=2296960&run=74" - find and return document for query
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/cdb_cxic0415/cspad_detnum1234/gridfs/5b6893e81ead141643fe4344" - get document with constants from GridFS using document id
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/cdb_cxic0415/cspad_detnum1234/gridfs/5b6893e81ead141643fe4344" - DEPRICATED - access to GridFS raw data through doc _id
  • curl -s "https://pswww.slac.stanford.edu/calib_ws/cdb_cxic0415/gridfs/5b6893d91ead141643fe3f6a"  - access to GridFS raw data through data _id

Implementation

Write web access

Code Block
title2019-07-27 web service to write in DB
collapsetrue
 

2019-07-27
Here's version 1; any feedback is appreciated.
Regards,
Murali

#!/usr/bin/env python

"""
Sample for posting to the calibration service using a web service and kerberos authentication.
Make sure we have a kerberos ticket.
"""

import requests
import json
from krtc import KerberosTicket
from urllib.parse import urlparse

ws_url = "https://pswww.slac.stanford.edu/ws-kerb/calib_ws/"
krbheaders = KerberosTicket("HTTP@" + urlparse(ws_url).hostname).getAuthHeaders()

# Create a new document in the collection test_coll in the database test_db.
resp = requests.post(ws_url+"test_db/test_coll/", headers=krbheaders, json={"calib_count": 1})
print(resp.text)
new_id = resp.json()["_id"]

# Update an existing document
resp = requests.put(ws_url+"test_db/test_coll/"+new_id, headers=krbheaders, json={"calib_count": 2})
print(resp.text)

# Delete an existing document
resp = requests.delete(ws_url+"test_db/test_coll/"+new_id, headers=krbheaders)
print(resp.text)

# Create a new GridFS document, we upload an image called small_img.png
files = [("files",  ('small_img.png', open('small_img.png', 'rb'), 'image/png'))]
resp = requests.post(ws_url+"test_db/gridfs/", headers=krbheaders, files=files)
print(resp.text)
new_id = resp.json()["_id"]

# Delete the GridFS document
resp = requests.delete(ws_url+"test_db/gridfs/"+new_id, headers=krbheaders)
print(resp.text)


Summary

  • The MongoDB structure has limitations on the number of levels and on document size.
    • a server may host many DBs
    • a DB is a container for collections
    • a collection is a group of documents
    • a document is a JSON/BSON object of key:value pairs (a dictionary). Each value may itself be a dictionary, etc., but deeper nesting is not part of the DB structure.
      • the document size has a hard-wired limit of 16 MB (increased from 4 to 16 MB in 2010; the developers do not want to change it). A CSPAD array of 32x185x388 pixels (~2.3 Mpix) at 8 bytes (double) is ~18 MB, already above this limit, and larger detectors (Jungfrau, Epix, Andor, etc.) are expected.
      • Larger data should be saved through GridFS, which splits the data into chunks and stores them in the same DB in separate collections (fs.files, fs.chunks).
      • A JSON (text) object in MongoDB is stored as unicode (UTF-8); data have to be converted to unicode on saving and converted back on retrieval.
  • A schema-less DB looks attractive to a certain extent, but in order to find anything in the DB there still has to be a schema...
  • GridFS works fine for data larger than the 16 MB document limit.

References