This repository serves as the official resource collection for the survey paper "Forging Spatial Intelligence: A Survey on Multi-Modal Pre-Training for Autonomous Systems".
In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:
- **Single-Modality Pre-Training: The Bedrock of Perception.** Focuses on extracting foundational features from individual sensor streams (camera or LiDAR) via self-supervised learning techniques such as contrastive learning, masked modeling, and forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks.
- **Multi-Modality Pre-Training: Bridging the Semantic-Geometric Gap.** Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-centric methods (distilling visual semantics into geometry), camera-centric methods (injecting geometric priors into vision), and unified frameworks that jointly learn modality-agnostic representations.
- **Open-World Perception and Planning: The Frontier of Embodied Autonomy.** Represents the evolution from passive perception to active decision-making. This paradigm encompasses generative world models (e.g., video/occupancy generation), embodied Vision-Language-Action (VLA) models, and systems capable of open-world reasoning.
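A recurring primitive across these paradigms is a cross-modal contrastive (InfoNCE-style) objective that pulls paired features from two views or modalities together while pushing mismatched pairs apart. The following is a minimal NumPy sketch of that objective, not the implementation of any specific method surveyed here; all shapes and names are illustrative:

```python
import numpy as np

def info_nce(img_feats, pts_feats, temperature=0.07):
    """Symmetric InfoNCE loss between paired image and point-cloud features.

    img_feats, pts_feats: (N, D) arrays; row i of each is a matched pair
    (e.g., a pixel feature and the LiDAR point that projects onto it).
    """
    # L2-normalize so dot products become cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    pts = pts_feats / np.linalg.norm(pts_feats, axis=1, keepdims=True)

    logits = img @ pts.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))      # positives lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetrize over image->point and point->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 16))
aligned = info_nce(f, f)                        # perfectly matched pairs
random_ = info_nce(f, rng.normal(size=(8, 16))) # unrelated pairs
```

Perfectly aligned pairs drive the loss toward zero, while unrelated pairs stay near the log-of-batch-size chance level; the methods listed below differ mainly in how pairs are formed (pixels/points, temporal frames, masked patches).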
Paper Link
If you find this work helpful for your research, please kindly consider citing our paper:
```bibtex
@article{wang2025forging,
  title   = {Forging Spatial Intelligence: A Survey on Multi-Modal Pre-Training for Autonomous Systems},
  author  = {Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
  journal = {arXiv preprint arXiv:25xx.xxxxx},
  year    = {2025}
}
```

- 1. Benchmarks & Datasets
- 2. Single-Modality Pre-Training
- 3. Multi-Modality Pre-Training
- 4. Open-World Perception and Planning
- 5. Acknowledgements
| Dataset | Venue | Sensor | Task | Download |
|---|---|---|---|---|
| KITTI | CVPR'12 | 2 Cam (RGB), 2 Cam (Gray), 1 LiDAR (64-beam) | 3D Det, Stereo, Optical Flow, SLAM | |
| Argoverse | CVPR'19 | 7 Cam (RGB), 2 LiDAR (32-beam) | 3D Tracking, Forecasting, Map | |
| nuScenes | CVPR'20 | 6 Cam (RGB), 1 LiDAR (32-beam), 5 Radar | 3D Det, Seg, Occ, Map | |
| Waymo | CVPR'20 | 5 Cam (RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | |
| Lyft L5 | CoRL'20 | 7 Cam (RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | |
| ONCE | NeurIPS'21 | 7 Cam (RGB), 1 LiDAR (40-beam) | 3D Det (Self-/Semi-supervised) | |
| PandaSet | ITSC'21 | 6 Cam (RGB), 2 LiDAR | 3D Det, LiDAR Seg | |
| Dataset | Venue | Sensor | Task | Download |
|---|---|---|---|---|
| M3ED | CVPRW'23 | Cam (RGB/Gray), LiDAR, Event | 2D/3D Seg, Depth, Optical Flow | |
| CDrone | GCPR'24 | Camera (CARLA) | Monocular 3D Det | |
| VisDrone | ICCVW'19 | Aerial Camera | Detection, Tracking | |
| UAVid | ISPRS JPRS'20 | Slanted Camera | Semantic Segmentation | |
| BioDrone | IJCV'24 | Bionic Camera | Tracking | |
| Dataset | Venue | Platform | Sensors | Website |
|---|---|---|---|---|
| RailSem19 | CVPRW'19 | Railway | Camera | |
| WaterScenes | TITS'24 | USV (Water) | Camera, Radar | |
| Han et al. | NMI'24 | Legged Robot | Depth Camera | |
- **LiDAR Pre-Training:** methods utilizing point-cloud contrastive learning, masked autoencoders (MAE), or forecasting.
- **Camera Pre-Training:** self-supervised learning from image sequences for driving/robotics.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |
- **LiDAR-Centric:** enhancing LiDAR representations using vision foundation models (knowledge distillation).
- **Camera-Centric:** learning 3D geometry from camera inputs using LiDAR supervision.
- **Unified:** joint optimization of multi-modal encoders for unified representations.
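To make the LiDAR-centric distillation idea concrete: project each LiDAR point into the image, sample the frozen 2D teacher feature at that pixel, and regress the 3D student feature onto it. A minimal NumPy sketch with made-up shapes and a toy pinhole camera; none of the names or values come from a specific method listed below:

```python
import numpy as np

def project_points(points, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def distill_loss(pts_feats, img_feat_map, points, K):
    """L2 distillation: pull 3D features toward the 2D features they project onto.

    pts_feats:    (N, D) features from the LiDAR (student) encoder
    img_feat_map: (H, W, D) features from a frozen vision (teacher) model
    """
    uv = project_points(points, K)
    H, W, _ = img_feat_map.shape
    # Nearest-pixel sampling, clamped to the image bounds
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    teacher = img_feat_map[v, u]                 # (N, D) sampled teacher features
    return np.mean((pts_feats - teacher) ** 2)   # regress student onto teacher

rng = np.random.default_rng(0)
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])   # toy intrinsics
points = rng.uniform([-1, -1, 2], [1, 1, 5], size=(64, 3))  # points in front of cam
feat_map = rng.normal(size=(64, 64, 8))                     # frozen 2D teacher features
student = rng.normal(size=(64, 8))                          # 3D student features
loss = distill_loss(student, feat_map, points, K)
```

Real systems replace nearest-pixel sampling with bilinear interpolation and handle occlusion and out-of-frustum points, but the pairing-by-projection step is the common core.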
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| PonderV2 | Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm | arXiv 2023 | |
| UniPAD | A Universal Pre-training Paradigm for Autonomous Driving | CVPR 2024 | |
| UniM2AE | Multi-Modal Masked Autoencoders with Unified 3D Representation | ECCV 2024 | |
| ConDense | Consistent 2D/3D Pre-training for Dense and Sparse Features | ECCV 2024 | |
| Unified Pretrain | Learning Shared RGB-D Fields: Unified Self-supervised Pre-training | arXiv 2024 | |
| BEVWorld | A Multimodal World Simulator for Autonomous Driving via Unified BEV Latent Space | arXiv 2024 | |
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| CLIP2Scene | Towards Label-efficient 3D Scene Understanding by CLIP | CVPR 2023 | |
| OpenScene | 3D Scene Understanding with Open Vocabularies | CVPR 2023 | |
| CLIP-FO3D | Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | ICCVW 2023 | |
| POP-3D | Open-Vocabulary 3D Occupancy Prediction from Images | NeurIPS 2023 | |
| VLM2Scene | Self-Supervised Image-Text-LiDAR Learning with Foundation Models | AAAI 2024 | |
| SAL | Better Call SAL: Towards Learning to Segment Anything in Lidar | ECCV 2024 | |
| Affinity3D | Propagating Instance-Level Semantic Affinity for Zero-Shot Semantic Seg | ACM MM 2024 | |
| UOV | 3D Unsupervised Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving | arXiv 2024 | |
We thank the authors of the referenced papers for their open-source contributions.
