This page describes the processes and scripts that transfer image data and metadata from the TEMs.

Everything is managed within an Apache Airflow instance, whose DAGs provide both task management and monitoring.

The Airflow instance currently runs on cryoem-airflow.slac.stanford.edu. Specifically, it runs as a Docker Compose stack (mainly because I couldn't get the Airflow web server working under Docker Swarm). Local modifications to the Airflow Dockerfile enable file permissions from the TEMs to be maintained (although this is hacky due to the security issues with Docker).

EPU and SerialEM are installed and running on the TEM machines. Each TEM has the following machines:

  • temN-control: the server connected to the TEM itself
  • temN-k2: the server connected to the K2 camera, where the data will reside
  • temN-support: the machine used for remote access by FEI

Internal to each TEM is a private network interconnecting all of these machines. Across these servers, an 'X:' drive is mounted where the collected data is stored.

Operators are expected to utilise the elogbook to control the backend data management systems. This primarily functions as a means to:

  • Copy and remove (old) data from the local TEM servers
  • Organise the data onto the large disk subsystems in the data center
  • Begin pre-processing pipelines to align and CTF-estimate the images (see the sketch after this list)
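
As a loose illustration of the last point, the sketch below kicks off a pre-processing DAG run for a given experiment via the Airflow 1.x CLI. The DAG id, conf keys and experiment/sample names are made up for the example and are not the real names used by the elogbook integration.

```python
# Hypothetical example: trigger a pre-processing DAG run for one experiment
# using the Airflow 1.x CLI ("airflow trigger_dag"). The DAG id and the conf
# payload keys below are assumptions, not the production names.
import json
import subprocess


def trigger_preprocessing(dag_id, experiment, sample):
    """Trigger a DAG run, passing the experiment/sample as the run's conf."""
    conf = json.dumps({"experiment": experiment, "sample": sample})
    subprocess.run(["airflow", "trigger_dag", dag_id, "--conf", conf], check=True)


if __name__ == "__main__":
    trigger_preprocessing("tem1_daq", "20180101-cryoem", "sample01")
```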

In the data center, an Apache Airflow instance runs to manage the workflows required to support data management. It currently runs as a Docker Swarm instance on the cryoem-daq[01-5] nodes.

On these cryoem-daq nodes, the K2 server's disk is mounted via CIFS. The cryoem-daq nodes also mount the large GPFS filesystem where the data ultimately resides and where users can access it. As the Docker Swarm instances are containers, these mountpoints are bind mounted into the Airflow instances. In particular, the CIFS file shares from the TEM cameras are mounted on cryoem-airflow.slac.stanford.edu and then volume bound onto the Airflow workers. The host also mounts an NFS share for the long-term storage (GPFS).

The Airflow stack is kept under revision control on GitHub.

 

Airflow defines workflows in DAGs. These are coded in Python and provide dependency graphs between tasks. The following describes the function of each DAG:
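
As a minimal, purely illustrative example (not one of the production DAGs in the repository), the following shows how such a dependency graph is expressed; the task names, schedule and imports (Airflow 1.x paths) are assumptions.

```python
# Minimal illustrative DAG: a chain of placeholder tasks mirroring the
# copy -> align -> ctf -> particle pick flow described on this page.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path

default_args = {
    "owner": "cryoem",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_dependency_graph",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,  # triggered externally rather than on a schedule
) as dag:
    copy_from_tem = DummyOperator(task_id="copy_from_tem")
    align = DummyOperator(task_id="align")
    ctf = DummyOperator(task_id="ctf")
    particle_pick = DummyOperator(task_id="particle_pick")

    # Declare downstream dependencies: align only runs once the copy succeeds, etc.
    copy_from_tem >> align >> ctf >> particle_pick
```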

  • temN-dropdaq.py: Monitors the elogbook for the current experiment and sets up the storage and pre-processing pipelines in preparation. Reads in a {{tem?-experiment.yaml}} file to determine where to copy files from (the CIFS share) and to (NFS); it will then (currently) rsync the files and finally delete all mrc files older than 9 hours and larger than 100MB. As the copy proceeds, it triggers the appropriate new pre-processing tasks (see the sketch after this list).
  • <experiment name>_<sample id>.py: These DAGs are generated for every new experiment and contain the actual pre-processing pipeline to align, CTF-estimate and particle pick every image, triggered from the temN-daq DAG.
  • pipeline_single-particle_pre-processing.py: Default template DAG for single-particle pre-processing. This file is copied to <experiment name>_<sample id>.py when a new experiment starts.
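
The copy and clean-up behaviour described for temN-dropdaq.py above can be sketched roughly as below. The YAML keys ('source', 'destination') and the file paths are assumptions for illustration only; the real logic lives in the DAGs kept under revision control.

```python
# Rough sketch of the daq DAG's copy/clean-up step: read the experiment YAML
# to find the source (CIFS mount) and destination (GPFS/NFS), rsync the data
# across, then delete mrc files on the source that are older than 9 hours and
# larger than 100MB. Paths and YAML keys are hypothetical.
import os
import subprocess
import time

import yaml  # PyYAML

NINE_HOURS = 9 * 60 * 60
HUNDRED_MB = 100 * 1024 * 1024


def load_experiment(path="/srv/airflow/tem1-experiment.yaml"):
    """Read the experiment description (hypothetical keys 'source' and 'destination')."""
    with open(path) as f:
        return yaml.safe_load(f)


def copy_data(source, destination):
    """Mirror the CIFS share onto the long-term GPFS/NFS storage with rsync."""
    subprocess.run(["rsync", "-av", source + "/", destination + "/"], check=True)


def purge_old_mrc(source):
    """Delete mrc files on the TEM side older than 9 hours and over 100MB."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(source):
        for name in filenames:
            if not name.endswith(".mrc"):
                continue
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            if now - st.st_mtime > NINE_HOURS and st.st_size > HUNDRED_MB:
                os.remove(full)


if __name__ == "__main__":
    exp = load_experiment()
    copy_data(exp["source"], exp["destination"])
    purge_old_mrc(exp["source"])
```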