Cluster Execution#
This tutorial demonstrates how to scale template matching to large datasets. When processing dozens or hundreds of tomograms, manually creating and submitting individual jobs becomes impractical. The pytme_runner automates this workflow by discovering datasets, generating cluster scripts, and managing job submission.
Dataset Organization#
For this tutorial, we extend the ribosome picking example to a larger dataset. Your project directory will typically look like this
project_directory/
├── tomograms/ # Tomograms
│ ├── TS_037_10.00Apx.rec
│ ├── TS_041_10.00Apx.rec
│ └── TS_045_10.00Apx.rec
├── metadata/ # Metadata files
│ ├── TS_037.mdoc # Can also be Warp/M XMLs
│ ├── TS_041.mdoc # or tomostar STAR files
│ └── TS_045.mdoc
├── masks/ # Optional tomogram masks
│ ├── TS_037_mask.mrc
│ ├── TS_041_mask.mrc
│ └── TS_045_mask.mrc
└── templates/
├── emd_3228_resampled.mrc # 80S ribosome template
└── emd_3228_resampled_mask.mrc # 80S ribosome mask
The batch runner automatically extracts tomogram identifiers by removing technical suffixes such as pixel size information (_10.00Apx) and matches files across directories.
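The exact suffix rules are internal to the runner. As a rough sketch of the idea, assuming the identifier is the file name without its extension and without a trailing pixel-size tag of the form _<number>Apx, the extraction could be approximated as follows. The helper name tomo_id is hypothetical and not part of pytme_runner.

```shell
# Hypothetical helper: approximate a tomogram identifier from a file path.
# Assumption: the identifier is the basename without its extension and
# without a trailing pixel-size tag such as "_10.00Apx".
tomo_id() {
    name=$(basename "$1")
    name="${name%.*}"                                  # drop the extension
    printf '%s\n' "$name" | sed -E 's/_[0-9]+(\.[0-9]+)?Apx$//'
}

tomo_id project_directory/tomograms/TS_037_10.00Apx.rec
tomo_id project_directory/metadata/TS_037.mdoc
```

Both calls print the same identifier, which is what allows a tomogram and its metadata file to be paired.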
Basic Batch Processing#
The batch processing command identifies all tomograms and metadata files using glob patterns
pytme_runner \
--tomograms "project_directory/tomograms/*.rec" \
--metadata "project_directory/metadata/*.mdoc" \
--template templates/emd_3228_resampled.mrc \
--template-mask templates/emd_3228_resampled_mask.mrc \
--particle-diameter 300 \
--output-dir ribosome_batch_001/results \
--script-dir ribosome_batch_001/scripts \
--dry-run
Note
The quotation marks are required so that the runner, rather than the shell, parses the glob patterns. If your tomogram names end with .mrc, adapt the glob pattern to "project_directory/tomograms/*.mrc".
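To see why the quotes matter, the following sketch counts how many arguments a command receives with and without quoting. It uses a throwaway directory with empty placeholder files; count_args is a hypothetical helper used only for illustration.

```shell
# Create a throwaway directory with three empty placeholder tomograms
demo=$(mktemp -d)
touch "$demo/TS_037.rec" "$demo/TS_041.rec" "$demo/TS_045.rec"

# Hypothetical helper: print the number of arguments received
count_args() { echo "$#"; }

unquoted=$(count_args $demo/*.rec)    # the shell expands the glob first
quoted=$(count_args "$demo/*.rec")    # the pattern is passed through as is

echo "unquoted: $unquoted, quoted: $quoted"
rm -r "$demo"
```

Without quotes, the shell replaces the pattern with the matching file names before the command starts, so an option expecting a single pattern would receive several separate arguments.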
This command will
Discover all .rec files in the tomograms directory
Match each tomogram with its corresponding .mdoc metadata file
Generate individual SLURM scripts for each valid pair
The generated scripts can be submitted manually, or automatically through the runner by omitting the --dry-run flag.
The scripts generated by the runner will generally follow this pattern
#!/bin/bash
# SLURM directives
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=256G
#SBATCH --time=08:00:00
#SBATCH --partition=gpu-el8
#SBATCH --gres=gpu:1
#SBATCH --job-name=pytme_TS_037
# Environment setup
module load pyTME
# Run template matching
match_template \
--target project_directory/tomograms/TS_037_10.00Apx.rec \
--template templates/emd_3228_resampled.mrc \
--template-mask templates/emd_3228_resampled_mask.mrc \
--output results/TS_037/TS_037_10.00Apx.pickle \
--particle-diameter 300 \
--tilt-angles project_directory/metadata/TS_037.mdoc \
--amplitude-contrast 0.08 \
--spherical-aberration 27000000.0 \
--acceleration-voltage 300
Note
Currently only SLURM scripts are supported. Feel free to open an issue if you require a different scheduler, or create your own backend by inheriting from the ExecutionBackend class defined in pytme_runner.
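If you prefer to submit the scripts produced by a --dry-run yourself, a small loop is usually enough. The sketch below assumes that the generated scripts carry a .sh extension (check your script directory) and that sbatch is available on the cluster. The function submit_scripts and the SUBMIT_CMD variable are hypothetical names introduced for this example.

```shell
# Submit every generated script in a directory.
# Assumptions: scripts end in .sh and sbatch is on PATH.
# Set SUBMIT_CMD=echo to preview the scripts without submitting anything.
submit_scripts() {
    script_dir=$1
    for script in "$script_dir"/*.sh; do
        [ -e "$script" ] || continue      # skip if the glob matched nothing
        ${SUBMIT_CMD:-sbatch} "$script"
    done
}

# Preview what would be submitted (uncomment to use):
# SUBMIT_CMD=echo submit_scripts ribosome_batch_001/scripts
```

Previewing with SUBMIT_CMD=echo first is a cheap way to confirm that only the intended scripts are picked up.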
Advanced Processing Options#
For production runs, you may want to include additional filters similar to those described in the ribosome picking tutorial
pytme_runner \
--tomograms "project_directory/tomograms/*.rec" \
--metadata "project_directory/metadata/*.mdoc" \
--masks "project_directory/masks/*mask.mrc" \
--template templates/emd_3228_resampled.mrc \
--template-mask templates/emd_3228_resampled_mask.mrc \
--particle-diameter 300 \
--lowpass 40 \
--tilt-weighting relion \
--whiten-spectrum \
--amplitude-contrast 0.08 \
--spherical-aberration 2.7 \
--voltage 300 \
--cpus 8 \
--memory 256 \
--gpu-count 1 \
--time-limit "08:00:00" \
--output-dir results/ribosome_batch_001
Compared to the basic run above, this now includes
Tomogram masks to exclude problematic regions
Lowpass filtering to 40 Ångstrom
Missing wedge correction with RELION-style tilt weighting
Spectral whitening to enhance weak signals
Resource specifications appropriate for your cluster
Tip
You can also switch between compute backends via --backend. By default, the runner will use cupy.
Processing Subsets#
To process only specific tomograms, create a list file
# Create tomogram selection
echo "TS_037" > selected_tomos.txt
echo "TS_041" >> selected_tomos.txt
# Process only selected tomograms
pytme_runner \
--tomograms "project_directory/tomograms/*.rec" \
--metadata "project_directory/metadata/*" \
--template templates/emd_3228_resampled.mrc \
--tomo-list selected_tomos.txt \
--particle-diameter 300
Output Structure#
Results are organized in the following manner
ribosome_batch_001/results
├── TS_037_10.00Apx.pickle # Template matching results
├── TS_037_12345.out # SLURM logs
├── TS_041_10.00Apx.pickle
├── TS_041_12346.out
├── TS_045_10.00Apx.pickle
└── TS_045_12347.out
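This flat layout makes it straightforward to find tomograms that have no result yet, for instance to rerun them through the --tomo-list option described under Processing Subsets. The sketch below assumes results are named after the tomogram file with a .pickle extension, as in the listing above, and that identifiers drop a trailing _<number>Apx tag. The function list_missing is a hypothetical helper, not part of pytme_runner.

```shell
# Write identifiers of tomograms without a result pickle to a list file.
# Assumptions: results are named <tomogram basename>.pickle in the result
# directory, and identifiers drop a trailing "_<number>Apx" tag.
list_missing() {
    tomo_dir=$1; result_dir=$2; list_file=$3
    : > "$list_file"                               # start with an empty list
    for tomo in "$tomo_dir"/*.rec; do
        [ -e "$tomo" ] || continue                 # no tomograms found
        base=$(basename "$tomo" .rec)
        if [ ! -e "$result_dir/$base.pickle" ]; then
            printf '%s\n' "$base" | sed -E 's/_[0-9]+(\.[0-9]+)?Apx$//' >> "$list_file"
        fi
    done
}

# Example (uncomment to use):
# list_missing project_directory/tomograms ribosome_batch_001/results selected_tomos.txt
```

The resulting file contains one identifier per line, matching the format of the selection file created in Processing Subsets.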
Mixed Formats#
You can mix formats by adapting the glob patterns. For instance, for metadata
pytme_runner \
--tomograms "project_directory/tomograms/*.rec" \
--metadata "project_directory/metadata/*" \
--template templates/emd_3228_resampled.mrc \
--particle-diameter 300
The metadata/* pattern will match .mdoc, .xml, .star, and other supported formats, automatically pairing each tomogram with its corresponding metadata file. Note, however, that when multiple metadata files exist for a given tomogram, the runner defaults to the first one it encounters.
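Because the choice between several metadata files depends on discovery order, it can be worth checking for duplicates before launching a batch. The following sketch lists identifiers that occur more than once in a metadata directory, assuming the identifier is simply the file name without its extension. The function duplicate_metadata is a hypothetical helper.

```shell
# Print identifiers that have more than one metadata file.
# Assumption: the identifier is the file name without its extension.
duplicate_metadata() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue                    # skip non-files
        name=$(basename "$f")
        printf '%s\n' "${name%.*}"
    done | sort | uniq -d                          # keep repeated names only
}

# Example (uncomment to use):
# duplicate_metadata project_directory/metadata
```

An empty output means every tomogram has at most one metadata file, so the pairing is unambiguous.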
Monitoring Progress#
Use standard SLURM commands to monitor your batch jobs:
# Check all your jobs
squeue --me
# Count running/pending jobs
squeue --me | grep pytme | wc -l
# Check specific job details
scontrol show job 12345
# Monitor resource usage
sacct -j 12345 --format=JobID,JobName,MaxRSS,Elapsed
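For larger batches, a per-state summary is often more informative than a single count. The sketch below tallies job states read from standard input, one state per line. The function summarise_states is a hypothetical helper; the commented squeue call uses the %T format field, which prints the job state.

```shell
# Summarise job states read from standard input (one state per line).
summarise_states() {
    sort | uniq -c | awk '{ printf "%s=%s\n", $2, $1 }'
}

# On a cluster (uncomment to use):
# squeue --me --noheader --format="%T" | summarise_states
```

Each output line has the form STATE=count, so the summary can be pasted into a log or checked from another script.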
Environment Configuration#
Different clusters require different environment setups. Configure this for your specific cluster
# Using environment modules (default)
pytme_runner --environment-setup "module load pyTME; export \$PYTHONPATH" ...
# Using conda environments
pytme_runner --environment-setup "source ~/.bashrc; conda activate pytme_env" ...
# Complex setup with GPU modules
pytme_runner --environment-setup "module load CUDA/11.7; conda activate pytme" ...