Dragonfly Software


Introduction

 

The Dragonfly software package implements the EMC (expand–maximize–compress) single-particle reconstruction algorithm using the MPI and OpenMP frameworks. The package also includes a data stream simulator that generates noisy single-particle diffraction patterns from a PDB file, as well as an experimental pattern-classification GUI for separating single-particle diffraction patterns in experimental data.

This document focuses on the experimental case.

 

Data Preparation Workflow

 

The first few steps take the experimental run data and convert it into a form that Dragonfly can use.

 

1. Run the script ‘1_reprocess_.py’. Its purpose is to take the single-hit images from different experimental runs and save them as NumPy arrays (containing ADUs and photon counts). It first saves the images in HDF5 format. It then sets the detector and mask parameters. Finally, the image stack is converted into a new photon stack and a new ADU stack, which are saved as two separate NumPy arrays.

 

2. Next the NumPy arrays for each run are saved to HDF5 along with the masks. This is done by running the script ‘2_save_to_h5.py’.

 

3. Run the script ‘3_dragonfly_prepare_new.py’. It takes the HDF5 data and downsamples the patterns by a factor of four. The new set of photon patterns is put in a stack.

 

4. The mask file is a binary file (.byt) that needs to be converted into a .dat file. This may not be necessary if a mask data file already exists.

 

5. Use the script ‘convert2emc.sh’ to take the h5 file and convert it into an EMC file for use in Dragonfly. This script essentially runs ‘h5toemc_spi2.py’. You need an extra configuration file, ‘config.ini’, to create the executable ‘etc’ to run in the script ‘run_emc.py’.

 

These steps are necessary to use Dragonfly, because they produce the image stack of photon counts that the reconstruction reads. For experimental data, this completes the data preparation needed to run reconstructions. There is also a workflow for simulating and producing data, but that option is not covered in this document.
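The factor-of-four downsampling in step 3 can be sketched with NumPy as follows. This is a minimal illustration and not the contents of ‘3_dragonfly_prepare_new.py’: the function name `downsample_stack` is hypothetical, and it assumes that photon counts inside each 4 x 4 block are summed and that edge pixels that do not fill a complete block are cropped.

```python
import numpy as np


def downsample_stack(photons, factor=4):
    """Bin each frame of a (n_frames, ny, nx) photon stack by `factor`.

    Assumptions (not taken from the Dragonfly scripts): counts in each
    factor x factor block are summed, and edge rows/columns that do not
    fill a complete block are cropped.
    """
    n_frames, ny, nx = photons.shape
    ny_crop = (ny // factor) * factor
    nx_crop = (nx // factor) * factor
    cropped = photons[:, :ny_crop, :nx_crop]
    # Split each image axis into (blocks, factor), then sum over the
    # two `factor` axes so every block collapses to one pixel.
    blocks = cropped.reshape(n_frames, ny_crop // factor, factor,
                             nx_crop // factor, factor)
    return blocks.sum(axis=(2, 4))


# Toy stack: 3 frames of 10 x 9 pixels with one photon in every pixel.
stack = np.ones((3, 10, 9), dtype=np.int32)
small = downsample_stack(stack, factor=4)
print(small.shape)  # (3, 2, 2)
print(small[0])     # each binned pixel holds 16 photons
```

Summing rather than averaging keeps the binned values as integer photon counts. Whether the real script sums, averages, or pads the edges should be checked against the script itself.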

 

 

Setting the Parameters

 

Parameters in the configuration file pertain to the experimental and simulation values.

 

- Detector Distance from Object: (detd) 586 (nm)

- X-ray wavelength: (lambda) 7.75 (A)

- Detector Size: (detsize) 257 260

- Stop radius: (stopradius) 10 (pixels)

- Pixel Size: (pix size)

- Ewald Radius: (ewald radius) 650

- Beta

- Beta Schedule
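As a rough illustration of how such values could be read from an INI-style file, the sketch below uses Python's standard `configparser` module. The section name `parameters` and the exact key spellings are assumptions copied from the labels in the list above; the actual ‘config.ini’ used by Dragonfly may name or group them differently.

```python
import configparser

# Hypothetical config text. The section name and keys are illustrative
# and mirror the labels in the list above, not a verified Dragonfly file.
CONFIG_TEXT = """
[parameters]
detd = 586
lambda = 7.75
detsize = 257 260
stopradius = 10
"""


def read_parameters(text):
    """Parse the illustrative parameters into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["parameters"]
    return {
        "detd": section.getfloat("detd"),
        "lambda": section.getfloat("lambda"),
        # The detector size is given as two whitespace-separated integers.
        "detsize": tuple(int(v) for v in section.get("detsize").split()),
        "stopradius": section.getint("stopradius"),
    }


params = read_parameters(CONFIG_TEXT)
print(params)
```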

 

 

Output Metrics

 

1. Mean RMS: the root-mean-square change per voxel in the 3D intensity model between successive iterations. It indicates convergence when it becomes small and changes little from one iteration to the next.

2. Mutual Information: a measure of the sharpness of the probability distributions over orientations; its value is expected to rise and then level off as the model reconstruction converges.

3. Average log-likelihood of the patterns given the current 3D intensity model: it quantifies how closely a reconstruction approaches a global likelihood maximum.
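The first two metrics can be sketched in a few lines of NumPy. This is a generic illustration rather than Dragonfly's own code: the function names are hypothetical, the mutual information uses the textbook definition with all patterns weighted equally, and the log-likelihood is left out because it depends on the noise model, which this page does not specify.

```python
import numpy as np


def rms_change(prev_model, new_model):
    """Root-mean-square change per voxel between two 3D models."""
    diff = (np.asarray(new_model, dtype=float)
            - np.asarray(prev_model, dtype=float))
    return float(np.sqrt(np.mean(diff ** 2)))


def mutual_information(prob):
    """Mutual information (in nats) between pattern and orientation.

    prob[d, r] is the probability of orientation r for pattern d, with
    each row summing to 1. Assumption: patterns are weighted equally.
    """
    prob = np.asarray(prob, dtype=float)
    marginal = prob.mean(axis=0)  # P(r), averaged over patterns
    # Where prob == 0 the ratio stays 1, so the term contributes 0.
    ratio = np.ones_like(prob)
    np.divide(prob, marginal, out=ratio, where=prob > 0)
    return float(np.mean(np.sum(prob * np.log(ratio), axis=1)))


# Two toy models that differ by exactly 1 in every voxel.
rms = rms_change(np.zeros((2, 2, 2)), np.ones((2, 2, 2)))

# Flat orientation distributions carry no information ...
mi_flat = mutual_information(np.full((4, 4), 0.25))
# ... while one certain, distinct orientation per pattern gives log(4).
mi_sharp = mutual_information(np.eye(4))

print(rms, mi_flat, mi_sharp)
```

In this toy case the sharp distributions give a higher mutual information than the flat ones, which is the sense in which the metric measures sharpness.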

 

Dragonfly Best Practices

 

Dragonfly runs best when MPI and OpenMP are used together: a single MPI process per physical node, with shared-memory parallelism through OpenMP within each node. By default, the program tries to run with as many threads as it can on each node. You can control how many threads each process uses either by setting the environment variable OMP_NUM_THREADS or by passing the command-line parameter "-t <num_threads>". Performance suffers severely if multiple MPI processes share a node and each tries to use all the threads.
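The thread arithmetic behind this advice can be sketched as below. Both helper functions are hypothetical and do not launch anything; they only show how a thread count that avoids oversubscription might be chosen and placed in OMP_NUM_THREADS. The 16-core node is an assumed example, not a statement about any particular cluster.

```python
import os


def threads_per_process(cores_per_node, mpi_procs_per_node=1):
    """Threads each MPI process can use without oversubscribing a node."""
    if mpi_procs_per_node < 1:
        raise ValueError("need at least one MPI process per node")
    return max(1, cores_per_node // mpi_procs_per_node)


def emc_environment(num_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


# Hypothetical 16-core node.
one_proc = threads_per_process(16)       # one MPI process per node
four_procs = threads_per_process(16, 4)  # four MPI processes per node
env = emc_environment(one_proc, base_env={})
print(one_proc, four_procs, env["OMP_NUM_THREADS"])
```

If four processes on such a node each started 16 threads, 64 threads would compete for 16 cores, which is the situation the paragraph above warns against.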

 

The best route is to submit a batch job with the bsub command to a queue on a cluster such as psana. Here is an example of a Dragonfly batch run:

 

bsub -q psfehq -n 4 -R "span[ptile=1]" -o .%J.out mpirun emc <num_iterations>

 

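In the bsub command, the “-R” option supplies a resource requirement string. The value “span[ptile=1]” sets the number of processors allocated to the job on each host, in this case a single processor, so the four processes requested with “-n 4” are placed on four different hosts. This is not the same as the lowercase “-r” option of emc, which starts the reconstruction from the previous iteration.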
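For scripted submissions, the same command line can be assembled in Python. The helper `build_bsub_command` below is a hypothetical convenience, not part of Dragonfly. It only builds and prints the command and does not call bsub. Placing "-t <num_threads>" before the iteration count is an assumption about the argument order.

```python
import shlex


def build_bsub_command(num_iterations, num_nodes=4, queue="psfehq",
                       num_threads=None, log_file=".%J.out"):
    """Assemble the batch command shown above as an argument list."""
    cmd = ["bsub", "-q", queue, "-n", str(num_nodes),
           "-R", "span[ptile=1]",  # one processor per host
           "-o", log_file, "mpirun", "emc"]
    if num_threads is not None:
        # Optional per-process thread limit (assumed argument position).
        cmd += ["-t", str(num_threads)]
    cmd.append(str(num_iterations))
    return cmd


cmd = build_bsub_command(100, num_threads=8)
# shlex.join quotes the span[...] string so it is safe to paste in a shell.
print(shlex.join(cmd))
```

Keeping the command as a list means it could later be handed to `subprocess.run` without going through shell quoting.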

Another possible cause of slowdown, specific to the psana cluster at LCLS, is that users sometimes ssh directly into nodes and run jobs there, which leaves some of the nodes partially busy.

 

 

Post-Processing and Visualization

 

1. autoplay.py: visualizes the diagnostics in real-time.

 

2. frameviewer.py: allows one to examine individual image frames.

 

 

References

 

Ayyer, K., Lan, T. Y., Elser, V., & Loh, N. D. (2016). Dragonfly: an implementation of the expand–maximize–compress algorithm for single-particle imaging. Journal of Applied Crystallography, 49(4), 1320-1335.
