This repository serves as the official resource collection for the survey paper "Forging Spatial Intelligence: A Survey on Multi-Modal Pre-Training for Autonomous Systems".
In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:
- **Single-Modality Pre-Training: The Bedrock of Perception.** Focuses on extracting foundational features from individual sensor streams (camera or LiDAR) via self-supervised learning techniques such as contrastive learning, masked modeling, and forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks.
- **Multi-Modality Pre-Training: Bridging the Semantic-Geometric Gap.** Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-centric methods (distilling visual semantics into geometry), camera-centric methods (injecting geometric priors into vision), and unified frameworks that jointly learn modality-agnostic representations.
- **Open-World Perception and Planning: The Frontier of Embodied Autonomy.** Represents the evolution from passive perception to active decision-making. This paradigm encompasses generative world models (e.g., video/occupancy generation), embodied Vision-Language-Action (VLA) models, and systems capable of open-world reasoning.
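A recurring primitive across these paradigms is a cross-modal contrastive (InfoNCE-style) objective that pulls paired features from two views or modalities together while pushing mismatched pairs apart. The following is a minimal NumPy sketch of that objective, not the implementation of any specific method surveyed here; all shapes and names are illustrative:

```python
import numpy as np

def info_nce(img_feats, pts_feats, temperature=0.07):
    """Symmetric InfoNCE loss between paired image and point-cloud features.

    img_feats, pts_feats: (N, D) arrays; row i of each is a matched pair
    (e.g., a pixel feature and the LiDAR point that projects onto it).
    """
    # L2-normalize so dot products become cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    pts = pts_feats / np.linalg.norm(pts_feats, axis=1, keepdims=True)

    logits = img @ pts.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))      # positives lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetrize over image->point and point->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 16))
aligned = info_nce(f, f)                        # perfectly matched pairs
random_ = info_nce(f, rng.normal(size=(8, 16))) # unrelated pairs
```

Perfectly aligned pairs drive the loss toward zero, while unrelated pairs stay near the log-of-batch-size chance level; the methods listed below differ mainly in how pairs are formed (pixels/points, temporal frames, masked patches).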
Paper Link
If you find this work helpful for your research, please kindly consider citing our paper:
```bibtex
@article{wang2025forging,
  title   = {Forging Spatial Intelligence: A Survey on Multi-Modal Pre-Training for Autonomous Systems},
  author  = {Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
  journal = {arXiv preprint arXiv:25xx.xxxxx},
  year    = {2025}
}
```

- 1. Benchmarks & Datasets
- 2. Single-Modality Pre-Training
- 3. Multi-Modality Pre-Training
- 4. Open-World Perception and Planning
- 5. Acknowledgements
| Dataset | Venue | Sensor | Task | Download |
|---|---|---|---|---|
| KITTI | CVPR'12 | 2 Cam (RGB), 2 Cam (Gray), 1 LiDAR (64-beam) | 3D Det, Stereo, Optical Flow, SLAM | |
| Argoverse | CVPR'19 | 7 Cam (RGB), 2 LiDAR (32-beam) | 3D Tracking, Forecasting, Map | |
| nuScenes | CVPR'20 | 6 Cam (RGB), 1 LiDAR (32-beam), 5 Radar | 3D Det, Seg, Occ, Map | |
| Waymo | CVPR'20 | 5 Cam (RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | |
| Lyft L5 | CoRL'20 | 7 Cam (RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | |
| ONCE | NeurIPS'21 | 7 Cam (RGB), 1 LiDAR (40-beam) | 3D Det (Self-/Semi-supervised) | |
| PandaSet | ITSC'21 | 6 Cam (RGB), 2 LiDAR | 3D Det, LiDAR Seg | |
| Dataset | Venue | Sensor | Task | Download |
|---|---|---|---|---|
| M3ED | CVPRW'23 | Cam (RGB/Gray), LiDAR, Event | 2D/3D Seg, Depth, Optical Flow | |
| CDrone | GCPR'24 | Camera (CARLA) | Monocular 3D Det | |
| VisDrone | ICCVW'19 | Aerial Camera | Detection, Tracking | |
| UAVid | ISPRS JPRS'20 | Slanted Camera | Semantic Segmentation | |
| BioDrone | IJCV'24 | Bionic Camera | Tracking | |
| Dataset | Venue | Platform | Sensors | Website |
|---|---|---|---|---|
| RailSem19 | CVPRW'19 | Railway | Camera | |
| WaterScenes | TITS'24 | USV (Water) | Camera, Radar | |
| Han et al. | NMI'24 | Legged Robot | Depth Camera | |
- **LiDAR Pre-Training:** methods utilizing point-cloud contrastive learning, masked autoencoders (MAE), or forecasting.
- **Camera Pre-Training:** self-supervised learning from image sequences for driving/robotics.
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |
- **LiDAR-Centric:** enhancing LiDAR representations using vision foundation models (knowledge distillation).
- **Camera-Centric:** learning 3D geometry from camera inputs using LiDAR supervision.
- **Unified:** joint optimization of multi-modal encoders for unified representations.
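To make the LiDAR-centric distillation idea concrete: project each LiDAR point into the image, sample the frozen 2D teacher feature at that pixel, and regress the 3D student feature onto it. A minimal NumPy sketch with made-up shapes and a toy pinhole camera; none of the names or values come from a specific method listed below:

```python
import numpy as np

def project_points(points, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def distill_loss(pts_feats, img_feat_map, points, K):
    """L2 distillation: pull 3D features toward the 2D features they project onto.

    pts_feats:    (N, D) features from the LiDAR (student) encoder
    img_feat_map: (H, W, D) features from a frozen vision (teacher) model
    """
    uv = project_points(points, K)
    H, W, _ = img_feat_map.shape
    # Nearest-pixel sampling, clamped to the image bounds
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    teacher = img_feat_map[v, u]                 # (N, D) sampled teacher features
    return np.mean((pts_feats - teacher) ** 2)   # regress student onto teacher

rng = np.random.default_rng(0)
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])   # toy intrinsics
points = rng.uniform([-1, -1, 2], [1, 1, 5], size=(64, 3))  # points in front of cam
feat_map = rng.normal(size=(64, 64, 8))                     # frozen 2D teacher features
student = rng.normal(size=(64, 8))                          # 3D student features
loss = distill_loss(student, feat_map, points, K)
```

Real systems replace nearest-pixel sampling with bilinear interpolation and handle occlusion and out-of-frustum points, but the pairing-by-projection step is the common core.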
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| PonderV2 | Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm | arXiv 2023 | |
| UniPAD | A Universal Pre-training Paradigm for Autonomous Driving | CVPR 2024 | |
| UniM2AE | Multi-Modal Masked Autoencoders with Unified 3D Representation | ECCV 2024 | |
| ConDense | Consistent 2D/3D Pre-training for Dense and Sparse Features | ECCV 2024 | |
| Unified Pretrain | Learning Shared RGB-D Fields: Unified Self-supervised Pre-training | arXiv 2024 | |
| BEVWorld | A Multimodal World Simulator for Autonomous Driving via Unified BEV Latent Space | arXiv 2024 | |
| Model | Paper | Venue | GitHub |
|---|---|---|---|
| CLIP2Scene | Towards Label-efficient 3D Scene Understanding by CLIP | CVPR 2023 | |
| OpenScene | 3D Scene Understanding with Open Vocabularies | CVPR 2023 | |
| CLIP-FO3D | Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | ICCVW 2023 | |
| POP-3D | Open-Vocabulary 3D Occupancy Prediction from Images | NeurIPS 2023 | |
| VLM2Scene | Self-Supervised Image-Text-LiDAR Learning with Foundation Models | AAAI 2024 | |
| SAL | Better Call SAL: Towards Learning to Segment Anything in Lidar | ECCV 2024 | |
| Affinity3D | Propagating Instance-Level Semantic Affinity for Zero-Shot Semantic Seg | ACM MM 2024 | |
| UOV | 3D Unsupervised Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving | arXiv 2024 | |
We thank the authors of the referenced papers for their open-source contributions.
