REcovery of eukaryotic genomes using contrastive learning. A specialized metagenomic binning tool designed for recovering high-quality eukaryotic genomes from mixed prokaryotic-eukaryotic samples.
# Create environment and install REMAG with its external dependency
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
# Run REMAG
remag contigs.fasta -c alignments.bam
# Alternatively, run REMAG with Docker
docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam -o /data/output
# Create environment first
conda create -n remag python=3.9
conda activate remag
# Install the external dependency, then REMAG
conda install -c bioconda miniprot
pip install remag
remag contigs.fasta -c alignments.bam
Installing with conda is the easiest path because the conda package pulls in miniprot automatically:
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
remag --help
If you install from PyPI, install miniprot separately first:
conda create -n remag python=3.9
conda activate remag
conda install -c bioconda miniprot
pip install remag
# Optional: install plotting dependencies
conda install -c conda-forge matplotlib umap-learn
REMAG is built on PyTorch and uses GPU acceleration automatically when a supported backend is available. No extra REMAG flag is required.
If you want a CUDA-enabled PyTorch build, install REMAG first and then replace the CPU PyTorch package with the CUDA-enabled one that matches your system:
conda create -n remag -c bioconda -c conda-forge remag
conda activate remag
conda install -c pytorch -c nvidia pytorch pytorch-cuda=12.1
Adjust the CUDA version to match your driver and platform.
On Apple Silicon, PyTorch can use Metal (mps) automatically when available. In most cases no extra REMAG-specific setup is needed beyond installing a current PyTorch build.
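To check which backend your PyTorch build can actually reach before starting a long run, a short script like the following may help. It is a minimal sketch: the preference order (CUDA, then Metal, then CPU) and the helper names `pick_device` and `detect_device` are assumptions for illustration, not REMAG's internal logic.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then Apple Metal, then CPU.

    The preference order is an assumption made for this example.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask the installed PyTorch build which backends it can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so nothing can be accelerated
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(f"PyTorch would run on: {detect_device()}")
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build is most likely a CPU-only one.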
If you install REMAG with pip, install the PyTorch build you want first, then install REMAG:
conda create -n remag python=3.9
conda activate remag
conda install -c bioconda miniprot
# Install the desired PyTorch build first
pip install torch
# Then install REMAG
pip install remag
For NVIDIA systems, use the PyTorch install command from the official PyTorch selector so the wheel matches your CUDA runtime.
# Pull and run the latest version (output directory defaults to remag_output)
docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam
# Or specify output directory
docker run --rm -v $(pwd):/data danielzmbp/remag:latest \
/data/contigs.fasta -c /data/alignments.bam -o /data/output
# For interactive use
docker run -it --rm -v $(pwd):/data danielzmbp/remag:latest /bin/bash
# Pull and run the latest version directly
singularity run docker://danielzmbp/remag:latest \
contigs.fasta -c alignments.bam
# Build Singularity image from Docker Hub
singularity build remag_v0.3.4.sif docker://danielzmbp/remag:v0.3.4
# Or build latest version
singularity build remag_latest.sif docker://danielzmbp/remag:latest
# Run with Singularity
singularity run --bind $(pwd):/data remag_v0.3.4.sif \
/data/contigs.fasta -c /data/alignments.bam
# Or use exec for direct command execution
singularity exec --bind $(pwd):/data remag_v0.3.4.sif \
remag /data/contigs.fasta -c /data/alignments.bam -o /data/output
# For interactive shell
singularity shell --bind $(pwd):/data remag_v0.3.4.sif
# Build a local Singularity image file (optional)
singularity build remag.sif docker://danielzmbp/remag:latest
singularity run remag.sif contigs.fasta -c alignments.bam
# Install from source
conda create -n remag python=3.9
conda activate remag
git clone https://github.com/danielzmbp/remag.git
cd remag
conda install -c bioconda miniprot
pip install .
# For development, install in editable mode with the dev extras
pip install -e ".[dev]"
After installation, you can use REMAG via the command line:
# Basic usage (output defaults to remag_output in FASTA directory)
remag contigs.fasta -c alignments.bam
# With explicit output directory
remag contigs.fasta -c alignments.bam -o output_directory
# Multiple samples using repeated flags
remag contigs.fasta -c sample1.bam -c sample2.bam
# Multiple samples using shell-expanded globs
remag contigs.fasta -c samples/*.bam
# Using explicit -f flag (both styles work)
remag -f contigs.fasta -c alignments.bam
# Keep intermediate files with -k shorthand
remag contigs.fasta -c alignments.bam -k
# Only run eukaryotic filtering (skip binning)
remag contigs.fasta -c alignments.bam --filter-only
# Use single-cell mode (adjusts k-NN and clustering defaults)
remag contigs.fasta -c alignments.bam -m single-cell
# Run as a Python module
python -m remag contigs.fasta -c alignments.bam
# Quick reference (basic options)
remag -h
# Full documentation (all advanced options)
remag --help
REMAG recovers eukaryotic genomes with a multi-stage pipeline:
- Eukaryotic Filtering: By default, REMAG automatically filters for eukaryotic contigs using the integrated HyenaDNA classifier (can be disabled with --skip-bacterial-filter)
- Feature Extraction: Combines k-mer composition (4-mers) with coverage profiles across multiple samples. Large contigs are split into overlapping fragments for augmentation during training
- Contrastive Learning: Trains a Siamese neural network using the Barlow Twins self-supervised loss function. This creates embeddings where fragments from the same contig are close together
- Eukaryotic Gene Marker Annotation: Uses miniprot to annotate contigs with eukaryotic single-copy core genes, providing the quality metrics needed for clustering decisions
- Greedy Clustering: Iteratively extracts bins using a greedy Leiden approach -- at each step, tests multiple Leiden resolutions on the remaining contigs, selects the single best-quality cluster (by F1 score of completeness vs. contamination), removes it from the graph, and repeats
- Bin Rescue: Merges fragmented bins into larger bins based on embedding similarity and single-copy gene safety, and rescues unbinned contigs into matching bins
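The greedy clustering step above can be sketched in a few lines. This is an illustration under stated assumptions, not REMAG's implementation: `propose_clusters` is a hypothetical stand-in for running Leiden at several resolutions, and the quality formula (harmonic mean of completeness and of one minus the fraction of duplicated markers) is an assumed reading of "F1 score of completeness vs. contamination".

```python
from collections import Counter


def bin_quality(contigs, markers_by_contig, n_markers):
    """Score a candidate bin from its single-copy marker genes (assumed formula)."""
    counts = Counter(g for c in contigs for g in markers_by_contig.get(c, ()))
    if not counts:
        return 0.0
    completeness = len(counts) / n_markers
    duplicated = sum(1 for n in counts.values() if n > 1)
    purity = 1.0 - duplicated / len(counts)  # 1 - assumed contamination
    if completeness + purity == 0:
        return 0.0
    return 2 * completeness * purity / (completeness + purity)


def greedy_extract(contigs, propose_clusters, markers_by_contig, n_markers,
                   min_score=0.5):
    """Repeatedly take the single best candidate cluster and remove it."""
    remaining = set(contigs)
    bins = []
    while remaining:
        # Hypothetical stand-in for Leiden at multiple resolutions.
        candidates = [set(c) & remaining for c in propose_clusters(remaining)]
        candidates = [c for c in candidates if c]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda c: bin_quality(c, markers_by_contig, n_markers))
        if bin_quality(best, markers_by_contig, n_markers) < min_score:
            break  # nothing left that is worth extracting
        bins.append(best)
        remaining -= best
    return bins, remaining


if __name__ == "__main__":
    # Toy data: two genomes, each carrying one copy of markers m1 and m2.
    markers = {"a": ["m1"], "b": ["m2"], "c": ["m1"], "d": ["m2"], "e": []}
    # A coarse "resolution" merges both genomes; a finer one separates them.
    proposals = lambda remaining: [{"a", "b", "c", "d"}, {"a", "b"},
                                   {"c", "d"}, {"e"}]
    bins, unbinned = greedy_extract(markers, proposals, markers, n_markers=2)
    print(bins, unbinned)
```

In the toy data the merged cluster scores zero because both markers are duplicated, so the loop extracts the two clean bins one at a time and leaves the marker-free contig unbinned.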
Key features:
- Automatic Eukaryotic Filtering: The HyenaDNA classifier uses a pre-trained genomic foundation model to identify and retain eukaryotic sequences
- Multi-Sample Support: Can process coverage information from multiple samples (BAM/CRAM files) simultaneously
- Greedy Multi-Resolution Clustering: Iteratively extracts bins by testing multiple Leiden resolutions at each step, allowing different bins to use different resolutions for optimal quality
- Barlow Twins Loss: Uses a self-supervised contrastive learning approach that doesn't require negative pairs
- Fragment Augmentation: Large contigs are split into multiple overlapping fragments during training to improve representation learning
- Bin Rescue: Merges fragmented bins and rescues unbinned contigs into existing bins based on embedding similarity and single-copy gene safety
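To make the "no negative pairs" property of the Barlow Twins loss concrete, here is a minimal pure-Python sketch of the published objective. It is written for readability rather than speed and is not REMAG's training code; the function names and the `off_diag_weight` value are illustrative assumptions.

```python
from statistics import fmean, pstdev


def _standardize(batch):
    """Normalise each embedding dimension to zero mean and unit variance.

    Takes a list of rows (batch x dim) and returns a list of columns (dim x batch).
    """
    columns = []
    for col in zip(*batch):
        mu, sd = fmean(col), pstdev(col)
        columns.append([(v - mu) / (sd + 1e-9) for v in col])
    return columns


def barlow_twins_loss(view_a, view_b, off_diag_weight=5e-3):
    """Barlow Twins loss for two embedded views of the same batch.

    The cross-correlation matrix between the views is pushed towards the
    identity: diagonal entries towards 1 (the views agree) and off-diagonal
    entries towards 0 (dimensions are decorrelated). Only positive pairs
    are used; no negative pairs are needed.
    """
    a, b = _standardize(view_a), _standardize(view_b)
    n = len(view_a)
    loss = 0.0
    for i, col_a in enumerate(a):
        for j, col_b in enumerate(b):
            c_ij = sum(x * y for x, y in zip(col_a, col_b)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2
            else:
                loss += off_diag_weight * c_ij ** 2
    return loss


if __name__ == "__main__":
    # Embeddings of four fragments, two dimensions each.
    view = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    flipped = [[-x for x in row] for row in view]
    print(barlow_twins_loss(view, view))     # views agree: loss close to 0
    print(barlow_twins_loss(view, flipped))  # views disagree: large loss
```

In REMAG's setting the two views would be embeddings of fragments drawn from the same contig, which is what pulls those fragments together in embedding space.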
Use remag -h for a quick reference or remag --help for the full CLI documentation.
Commonly used options:
- -c, --coverage: one or more BAM, CRAM, or TSV coverage inputs
- -o, --output: output directory; defaults to remag_output next to the input FASTA
- -k, --keep-intermediate: retain embeddings, features, model weights, and other intermediate files
- --filter-only: stop after eukaryotic filtering and write filtered FASTA output
- -m, --mode: select presets such as metagenomics or single-cell
- --save-filtered-contigs: also write the contigs removed by the eukaryotic filter
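When REMAG is launched from a workflow script, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of REMAG; it only uses the flags listed in this section and passes one -c per coverage file.

```python
import shlex


def build_remag_command(fasta, coverage, output=None,
                        keep_intermediate=False, mode=None):
    """Build the argument list for a remag run (hypothetical helper)."""
    cmd = ["remag", str(fasta)]
    for path in coverage:
        cmd += ["-c", str(path)]  # one -c flag per coverage input
    if output is not None:
        cmd += ["-o", str(output)]
    if keep_intermediate:
        cmd.append("-k")
    if mode is not None:
        cmd += ["-m", mode]
    return cmd


if __name__ == "__main__":
    cmd = build_remag_command(
        "contigs.fasta",
        ["sample1.bam", "sample2.bam"],
        output="output_directory",
        mode="single-cell",
    )
    print(shlex.join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems with paths that contain spaces.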
For the complete list of neural-network, clustering, filtering, and rescue options, run:
remag --help
REMAG produces several output files:
- bins/: Directory containing FASTA files for each bin
- bins.csv: Final contig-to-bin assignments
- embeddings.csv: Contig embeddings from the neural network
- remag.log: Detailed log file
- *_eukaryotic_filtered.fasta: Filtered FASTA file with only eukaryotic contigs retained when eukaryotic filtering is enabled
- siamese_model.pt: Trained Siamese neural network model
- kmer_embeddings.csv: K-mer encoder embeddings (before fusion)
- coverage_embeddings.csv: Coverage encoder embeddings (before fusion)
- params.json: Complete run parameters for reproducibility
- features.csv: Extracted k-mer and coverage features
- fragments.pkl: Fragment information used during training
- *_hyenadna_classification.tsv: HyenaDNA eukaryotic classification results (tab-separated)
- gene_contig_mappings.json: Cached gene-to-contig mappings for faster processing
- core_gene_duplication_results.json: Core gene duplication analysis
- knn_graph_edges.csv: k-NN graph edge list used for Leiden clustering
- knn_graph_stats.json: k-NN graph construction statistics
- temp_miniprot/: Temporary directory for miniprot alignments (removed unless --keep-intermediate)
- *_non_eukaryotic.fasta: Contigs removed by the HyenaDNA filter when --save-filtered-contigs is used
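For a quick look at a finished run, the bin FASTA files can be summarized with the standard library alone. The sketch below assumes that every regular file in bins/ is an uncompressed FASTA file; `summarize_bins` is a hypothetical helper, not something REMAG ships.

```python
from pathlib import Path


def summarize_bins(bins_dir):
    """Return {file name: (number of contigs, total bases)} for each bin file."""
    summary = {}
    for fasta in sorted(Path(bins_dir).iterdir()):
        if not fasta.is_file():
            continue
        n_contigs, n_bases = 0, 0
        with fasta.open() as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    n_contigs += 1  # header line starts a new contig
                else:
                    n_bases += len(line)  # sequence line
        summary[fasta.name] = (n_contigs, n_bases)
    return summary


if __name__ == "__main__":
    bins_dir = Path("output_directory") / "bins"
    if bins_dir.is_dir():
        for name, (n_contigs, n_bases) in summarize_bins(bins_dir).items():
            print(f"{name}\t{n_contigs} contigs\t{n_bases} bp")
```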
To generate UMAP visualization plots:
# Install plotting dependencies if not already installed
pip install "remag[plotting]"
# Generate UMAP visualization from embeddings
python scripts/plot_features.py --features output_directory/embeddings.csv --clusters output_directory/bins.csv --output output_directory
This creates:
- umap_coordinates.csv: UMAP projections for visualization
- umap_plot.pdf: UMAP visualization plot with cluster assignments
- Python 3.9+
- PyTorch (≥1.11.0)
- einops (≥0.6.0) - for HyenaDNA model operations
- scikit-learn (≥1.0.0)
- leidenalg (≥0.9.0) - for graph-based clustering
- igraph (≥0.10.0) - for graph construction in Leiden clustering
- pandas (≥1.3.0)
- numpy (≥1.21.0)
- scipy (≥1.6.0)
- pysam (≥0.18.0)
- loguru (≥0.6.0)
- tqdm (≥4.62.0)
- rich-click (≥1.5.0)
- miniprot - Required for core gene analysis and quality assessment
  - Install with: conda install -c bioconda miniprot
- For visualization: matplotlib (≥3.5.0), umap-learn (≥0.5.0)
  - Install with: pip install "remag[plotting]"
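To confirm that an environment has these dependencies, a small check along the following lines may be useful. It is a sketch only: the distribution names in the example list are assumptions (for instance, PyTorch is published as `torch`), and the script reports installed versions without comparing them against the minimum versions above.

```python
import shutil
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def check_environment(distributions, executables=("miniprot",)):
    """Map each Python package to its version and each executable to its path."""
    report = {name: installed_version(name) for name in distributions}
    for exe in executables:
        report[exe] = shutil.which(exe)  # full path, or None if not on PATH
    return report


if __name__ == "__main__":
    # Distribution names below are assumptions based on the list above.
    wanted = ["torch", "einops", "scikit-learn", "leidenalg", "igraph",
              "pandas", "numpy", "scipy", "pysam", "loguru", "tqdm",
              "rich-click"]
    for name, found in check_environment(wanted).items():
        print(f"{name:14s} {found or 'MISSING'}")
```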
The package includes a pre-trained HyenaDNA classifier for eukaryotic contig filtering. HyenaDNA is a genomic foundation model based on the Hyena operator architecture:
- Repository: HazyResearch/hyena-dna
- Paper: Nguyen E, Poli M, Faizi M, et al. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. NeurIPS 2023.
MIT License - see LICENSE file for details.
If you use REMAG in your research, please cite:
@article {G{\'o}mez-P{\'e}rez2026.03.05.709928,
author = {G{\'o}mez-P{\'e}rez, Daniel and Raguideau, S{\'e}bastien and Warring, Sally and James, Robert and Hildebrand, Falk and Quince, Christopher},
title = {REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning},
elocation-id = {2026.03.05.709928},
year = {2026},
doi = {10.64898/2026.03.05.709928},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2026/03/08/2026.03.05.709928},
eprint = {https://www.biorxiv.org/content/early/2026/03/08/2026.03.05.709928.full.pdf},
journal = {bioRxiv}
}