Cluster Execution#

This tutorial demonstrates how to scale template matching to large datasets. When processing dozens or hundreds of tomograms, manually creating and submitting individual jobs becomes impractical. The pytme_runner automates this workflow by discovering datasets, generating cluster scripts, and managing job submission.

From version 0.3.1 onwards, the runner has two modes

  • pytme_runner matching: Run template matching

  • pytme_runner analysis: Analyze template matching results

These can be run independently or as part of a complete pipeline.

Dataset Organization#

For this tutorial, we extend the ribosome picking example to a larger dataset. Your project directory will typically look like this

project_directory/
├── tomograms/                      # Tomograms
│   ├── TS_037_10.00Apx.rec
│   ├── TS_041_10.00Apx.rec
│   └── TS_045_10.00Apx.rec
├── metadata/                       # Metadata files
│   ├── TS_037.mdoc                 # Can also be Warp/M XMLs
│   ├── TS_041.mdoc                 # or tomostar STAR files
│   └── TS_045.mdoc
├── masks/                          # Optional tomogram masks
│   ├── TS_037_mask.mrc
│   ├── TS_041_mask.mrc
│   └── TS_045_mask.mrc
└── templates/
    ├── emd_3228_resampled.mrc      # 80S ribosome template
    └── emd_3228_resampled_mask.mrc # 80S ribosome mask

The batch runner automatically extracts tomogram identifiers by removing technical suffixes like pixel size information (_10.00Apx) and matches files across directories.
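The matching step described above can be pictured with a short Python sketch. It is only an illustration, not the runner's actual implementation: the regular expression for the pixel size suffix and the helper names `tomogram_id` and `pair_by_id` are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed suffix pattern: underscore, a number, then "Apx" (e.g. "_10.00Apx").
PIXEL_SIZE_SUFFIX = re.compile(r"_\d+(?:\.\d+)?Apx$")


def tomogram_id(path):
    """Return an identifier such as 'TS_037' from a tomogram path."""
    stem = Path(path).stem  # drops the final extension (.rec, .mrc, ...)
    return PIXEL_SIZE_SUFFIX.sub("", stem)


def pair_by_id(tomograms, metadata):
    """Map each identifier to (tomogram, metadata) when both files exist."""
    # Assumes metadata files are named after the identifier, e.g. TS_037.mdoc.
    meta_by_id = {Path(m).stem: m for m in metadata}
    pairs = {}
    for tomo in tomograms:
        key = tomogram_id(tomo)
        if key in meta_by_id:
            pairs[key] = (tomo, meta_by_id[key])
    return pairs
```

Tomograms without a metadata partner are simply left out of the result, which mirrors the idea that scripts are only generated for valid pairs.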

Basic Batch Processing#

The following outlines how to perform a basic template matching and analysis run.

Matching#

The template matching workflow identifies all tomograms and metadata files using glob patterns

pytme_runner matching \
    --tomograms "project_directory/tomograms/*.rec" \
    --metadata "project_directory/metadata/*.mdoc" \
    --template templates/emd_3228_resampled.mrc \
    --template-mask templates/emd_3228_resampled_mask.mrc \
    --particle-diameter 300 \
    --output-dir ribosome_batch_001/results \
    --script-dir ribosome_batch_001/scripts \
    --dry-run

Note

The quotation marks are required so that the shell passes the glob pattern to the runner unexpanded. If your tomogram names end with .mrc, adapt the glob pattern to "project_directory/tomograms/*.mrc".

This command will

  1. Discover all .rec files in the tomograms directory

  2. Match each tomogram with its corresponding .mdoc metadata file

  3. Generate individual SLURM scripts for each valid pair

The generated scripts can be submitted manually, or automatically through the runner by omitting the --dry-run flag.
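If you prefer to submit the generated scripts yourself, a small helper along the following lines may be convenient. This is a sketch under assumptions: the `*.sh` file pattern is a guess about how the scripts are named, and `build_sbatch_commands` and `submit_all` are hypothetical helpers, not part of pytme_runner. Only the standard `sbatch` command is invoked.

```python
import subprocess
from pathlib import Path


def build_sbatch_commands(script_dir, pattern="*.sh"):
    """Return one ['sbatch', path] command per script, sorted by file name."""
    # The "*.sh" pattern is an assumption; adjust it to the generated names.
    return [["sbatch", str(p)] for p in sorted(Path(script_dir).glob(pattern))]


def submit_all(script_dir, pattern="*.sh", dry_run=True):
    """Print each command and, unless dry_run is set, run it."""
    commands = build_sbatch_commands(script_dir, pattern)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if sbatch fails
    return commands
```

Calling `submit_all("ribosome_batch_001/scripts")` only prints the commands; pass `dry_run=False` to actually submit.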

The scripts generated by the runner will generally follow this pattern

#!/bin/bash

# SLURM directives
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=256G
#SBATCH --time=08:00:00
#SBATCH --partition=gpu-el8
#SBATCH --gres=gpu:1
#SBATCH --job-name=pytme_TS_037

# Environment setup
module load pyTME

# Run template matching
match_template \
    --target project_directory/tomograms/TS_037_10.00Apx.rec \
    --template templates/emd_3228_resampled.mrc \
    --template-mask templates/emd_3228_resampled_mask.mrc \
    --output results/TS_037/TS_037_10.00Apx.pickle \
    --particle-diameter 300 \
    --tilt-angles project_directory/metadata/TS_037.mdoc \
    --amplitude-contrast 0.08 \
    --spherical-aberration 27000000.0 \
    --acceleration-voltage 300

Note

Currently only SLURM scripts are supported. Feel free to open an issue if you require a different scheduler, or add support yourself by inheriting from the ExecutionBackend defined in pytme_runner.py.

Analysis#

After template matching completes, use the analysis workflow to identify peaks and generate particle coordinates. The analysis workflow is CPU-only and much faster than template matching.

pytme_runner analysis \
    --input-files "ribosome_batch_001/results/*.pickle" \
    --num-peaks 1000 \
    --angles-clockwise \
    --output-format relion4 \
    --output-dir ribosome_batch_001/picks \
    --script-dir ribosome_batch_001/picks_scripts \
    --dry-run

Output#

Results are organized in the following manner

ribosome_batch_001/
├── results/
│   ├── TS_037_10.00Apx.pickle     # Template matching results
│   ├── TS_037_12345.out           # SLURM logs
│   ├── TS_041_10.00Apx.pickle
│   ├── TS_041_12346.out
│   ├── TS_045_10.00Apx.pickle
│   └── TS_045_12347.out
└── picks/
    ├── TS_037_10.00Apx.star       # Peak coordinates
    ├── TS_037_12345.out           # SLURM logs
    ├── TS_041_10.00Apx.star
    ├── TS_041_12346.out
    ├── TS_045_10.00Apx.star
    └── TS_045_12347.out
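With this layout, it is easy to check which tomograms have matching results but no picks yet. The sketch below assumes the file naming shown in the tree above (a .pickle result and a .star file sharing the same stem); `missing_picks` is a hypothetical helper, not a runner feature.

```python
from pathlib import Path


def missing_picks(results_dir, picks_dir):
    """Return stems of .pickle results that have no matching .star file."""
    results = {p.stem for p in Path(results_dir).glob("*.pickle")}
    picks = {p.stem for p in Path(picks_dir).glob("*.star")}
    return sorted(results - picks)


if __name__ == "__main__":
    for stem in missing_picks("ribosome_batch_001/results", "ribosome_batch_001/picks"):
        print(f"no picks yet for {stem}")
```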

Processing Subsets#

To process only specific tomograms, create a list file

# Create tomogram selection
echo "TS_037" > selected_tomos.txt
echo "TS_041" >> selected_tomos.txt

# Process only selected tomograms
pytme_runner matching \
    --tomograms "project_directory/tomograms/*.rec" \
    --metadata "project_directory/metadata/*" \
    --template templates/emd_3228_resampled.mrc \
    --tomo-list selected_tomos.txt \
    --particle-diameter 300 \
    --dry-run
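The list file can also be generated programmatically, for example to rerun only tomograms that do not have a result yet. The following sketch writes one identifier per line, as in the echo commands above. The suffix pattern, the *.rec extension, and the helper name `write_pending_list` are assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed pixel size suffix, e.g. "_10.00Apx".
PIXEL_SIZE_SUFFIX = re.compile(r"_\d+(?:\.\d+)?Apx$")


def write_pending_list(tomogram_dir, results_dir, list_path):
    """Write identifiers of tomograms without a .pickle result, one per line."""
    done = {p.stem for p in Path(results_dir).glob("*.pickle")}
    pending = sorted(
        PIXEL_SIZE_SUFFIX.sub("", tomo.stem)
        for tomo in Path(tomogram_dir).glob("*.rec")
        if tomo.stem not in done
    )
    Path(list_path).write_text("".join(f"{name}\n" for name in pending))
    return pending
```

The resulting file can then be passed to the runner via --tomo-list.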

Advanced Options#

The following outlines advanced features for production workflows, including filtering, background correction, and multi-entity analysis.

Filtering#

For production runs, you may want to include additional filters similar to those described in the ribosome picking tutorial

pytme_runner matching \
    --tomograms "project_directory/tomograms/*.rec" \
    --metadata "project_directory/metadata/*.mdoc" \
    --masks "project_directory/masks/*mask.mrc" \
    --template templates/emd_3228_resampled.mrc \
    --template-mask templates/emd_3228_resampled_mask.mrc \
    --particle-diameter 300 \
    --lowpass 40 \
    --tilt-weighting relion \
    --whiten-spectrum \
    --amplitude-contrast 0.08 \
    --spherical-aberration 2.7 \
    --voltage 300 \
    --cpus 8 \
    --memory 256 \
    --gpu-count 1 \
    --time-limit "08:00:00" \
    --output-dir ribosome_batch_001/results \
    --dry-run

Compared to the basic run above, this now includes

  • Tomogram masks to exclude problematic regions

  • Lowpass filtering to 40 Ångstrom

  • Missing wedge correction with RELION-style tilt weighting

  • Spectral whitening to enhance weak signals

  • Resource specifications appropriate for your cluster
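Note that this command passes a spherical aberration of 2.7, whereas the generated script shown earlier passes 27000000.0 to match_template. The two values agree if the runner expects millimetres and match_template expects Ångstrom. That reading is an assumption based on the numbers in this tutorial; the conversion itself is sketched below.

```python
# 1 mm = 1e-3 m and 1 Angstrom = 1e-10 m, hence 1 mm = 1e7 Angstrom.
MM_TO_ANGSTROM = 1e7


def spherical_aberration_in_angstrom(cs_mm):
    """Convert a spherical aberration given in millimetres to Angstrom."""
    return cs_mm * MM_TO_ANGSTROM


print(spherical_aberration_in_angstrom(2.7))
```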

Tip

You can switch between compute backends via --backend. By default, the runner will use cupy.

Mixed Formats#

You can mix formats by adapting the glob patterns. For instance, for metadata

pytme_runner matching \
    --tomograms "project_directory/tomograms/*.rec" \
    --metadata "project_directory/metadata/*" \
    --template templates/emd_3228_resampled.mrc \
    --particle-diameter 300 \
    --dry-run

The metadata/* pattern will match .mdoc, .xml, .star, and other supported formats, automatically pairing each tomogram with its corresponding metadata file. Note, however, that when multiple metadata files exist for a given tomogram, the runner defaults to the first one it encounters.
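Because only one metadata file is used per tomogram, it can be worth checking for ambiguous cases before launching a batch. The sketch below groups metadata files by file stem and reports identifiers with more than one candidate. It assumes that metadata files are named after the tomogram identifier; `ambiguous_metadata` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path


def ambiguous_metadata(metadata_paths):
    """Return {identifier: [paths]} for identifiers with several metadata files."""
    by_id = defaultdict(list)
    for path in metadata_paths:
        by_id[Path(path).stem].append(path)  # TS_037.mdoc -> "TS_037"
    return {key: paths for key, paths in by_id.items() if len(paths) > 1}


if __name__ == "__main__":
    files = sorted(str(p) for p in Path("project_directory/metadata").glob("*"))
    for key, paths in ambiguous_metadata(files).items():
        print(f"{key}: {', '.join(paths)}")
```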

Background Correction#

In some cases it is helpful to compute template matching scores for an uninformative template to suppress background peaks and improve detection. The background score distribution can be generated by adding --scramble-phases

pytme_runner matching \
    --tomograms "project_directory/tomograms/*.rec" \
    --metadata "project_directory/metadata/*.mdoc" \
    --template templates/emd_3228_resampled.mrc \
    --template-mask templates/emd_3228_resampled_mask.mrc \
    --particle-diameter 300 \
    --output-dir ribosome_batch_001/results_noise \
    --script-dir ribosome_batch_001/scripts_noise \
    --scramble-phases \
    --dry-run

Tip

You can also use any other template as background, e.g., a membrane-only class.

Then run analysis with background correction

pytme_runner analysis \
    --input-files "ribosome_batch_001/results/*.pickle" \
    --background-files "ribosome_batch_001/results_noise/*.pickle" \
    --num-peaks 1000 \
    --angles-clockwise \
    --output-format relion4 \
    --output-dir ribosome_batch_001/picks_norm \
    --script-dir ribosome_batch_001/picks_scripts \
    --dry-run

Multiple Entities and Backgrounds#

The analysis workflow supports combining results from multiple template matching runs. This is useful for distinguishing between particles that were matched with different templates

pytme_runner analysis \
    --input-files "ribosome_batch_001/results/*.pickle" "ribosome_batch_001/results_rnap/*.pickle" \
    --background-files "ribosome_batch_001/results_noise/*.pickle" \
    --num-peaks 1000 \
    --angles-clockwise \
    --output-format relion4 \
    --output-dir ribosome_batch_001/picks \
    --script-dir ribosome_batch_001/picks_scripts \
    --dry-run

When multiple input patterns are provided, the analysis workflow will

  • Aggregate correlation scores from all matching runs for each tomogram

  • Take the maximum score at each position across all inputs

  • Apply background correction using all provided background datasets

  • Generate a single coordinate file per tomogram with peaks and class labels corresponding to the order of input files
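The maximum-score step in this list can be illustrated with NumPy. This is a conceptual sketch only: it assumes the score maps of one tomogram are arrays of identical shape, and it does not reproduce the runner's background correction or peak calling. The label numbering (zero-based here) and the handling of ties are assumptions.

```python
import numpy as np


def combine_scores(score_maps):
    """Return the per-position maximum score and the index of the winning input."""
    stacked = np.stack(score_maps)       # shape: (n_inputs, *volume_shape)
    labels = np.argmax(stacked, axis=0)  # ties resolve to the earliest input
    best = np.max(stacked, axis=0)
    return best, labels


if __name__ == "__main__":
    ribosome = np.array([[0.1, 0.9], [0.5, 0.2]])
    rnap = np.array([[0.3, 0.4], [0.5, 0.8]])
    best, labels = combine_scores([ribosome, rnap])
    print(best)
    print(labels)
```

The label array follows the order in which the score maps are passed, analogous to class labels following the order of --input-files.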

You can also include multiple background datasets for a more tailored normalization

pytme_runner analysis \
    --input-files "ribosome_batch_001/results/*.pickle" "ribosome_batch_001/results_rnap/*.pickle" \
    --background-files "ribosome_batch_001/results_noise/*.pickle" "ribosome_batch_001/results_membrane/*.pickle" \
    --num-peaks 1000 \
    --angles-clockwise \
    --output-format relion4 \
    --output-dir ribosome_batch_001/picks \
    --script-dir ribosome_batch_001/picks_scripts \
    --dry-run

Monitoring Progress#

Use standard SLURM commands to monitor your batch jobs:

# Check all your jobs
squeue --me

# Count running/pending jobs
squeue --me | grep pytme | wc -l

# Check specific job details
scontrol show job 12345

# Monitor resource usage
sacct -j 12345 --format=JobID,JobName,MaxRSS,Elapsed
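For a per-state summary rather than a single count, the squeue output can be tallied in Python. The sketch below relies on the standard squeue options --me, --noheader, and --format with the %j (job name) and %T (state) fields. The "pytme" prefix follows the job name in the generated script above; the helper names are hypothetical.

```python
import subprocess
from collections import Counter


def count_states(squeue_output, prefix="pytme"):
    """Count job states for lines of the form '<job name> <state>'."""
    counts = Counter()
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(prefix):
            counts[parts[1]] += 1
    return counts


def query_squeue():
    """Return squeue output for the current user as '<job name> <state>' lines."""
    result = subprocess.run(
        ["squeue", "--me", "--noheader", "--format=%j %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a cluster, `count_states(query_squeue())` yields a mapping such as the number of RUNNING and PENDING jobs.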

Environment Configuration#

Different clusters require different environment setups. Configure this for your specific cluster

# Using environment modules (default)
pytme_runner matching --environment-setup "module load pyTME; export \$PYTHONPATH" ...

# Using conda environments
pytme_runner matching --environment-setup "source ~/.bashrc; conda activate pytme_env" ...

# Complex setup with GPU modules
pytme_runner matching --environment-setup "module load CUDA/11.7; conda activate pytme" ...