Forging Spatial Intelligence

A Survey on Multi-Modal Pre-Training for Autonomous Systems


Taxonomy of Spatial Intelligence
Figure 1: Taxonomy of Multi-Modal Representation Learning for Spatial Intelligence.

This repository serves as the official resource collection for the survey paper "Forging Spatial Intelligence: A Survey on Multi-Modal Pre-Training for Autonomous Systems".

In this work, we establish a systematic taxonomy for the field, unifying terminology, scope, and evaluation benchmarks. We organize existing methodologies into three complementary paradigms based on information flow and abstraction level:

  • 📷 Single-Modality Pre-Training: The Bedrock of Perception. Focuses on extracting foundational features from individual sensor streams (Camera or LiDAR) via self-supervised learning techniques such as Contrastive Learning, Masked Modeling, and Forecasting. This paradigm establishes the fundamental representations for sensor-specific tasks.
  • 🔄 Multi-Modality Pre-Training: Bridging the Semantic-Geometric Gap. Leverages cross-modal synergy to fuse heterogeneous sensor data. This category includes LiDAR-Centric methods (distilling visual semantics into geometry), Camera-Centric methods (injecting geometric priors into vision), and Unified frameworks that jointly learn modality-agnostic representations.
  • 🌍 Open-World Perception and Planning: The Frontier of Embodied Autonomy. Marks the evolution from passive perception to active decision-making, encompassing Generative World Models (e.g., video/occupancy generation), Embodied Vision-Language-Action (VLA) models, and systems capable of Open-World reasoning.

📄 Paper Link


Citation

If you find this work helpful for your research, please consider citing our paper:

@article{wang2025forging,
    title={Forging Spatial Intelligence: A Survey on Multi-Modal Pre-Training for Autonomous Systems},
    author={Song Wang and Lingdong Kong and Xiaolu Liu and Hao Shi and Wentong Li and Jianke Zhu and Steven C. H. Hoi},
    journal={arXiv preprint arXiv:25xx.xxxxx},
    year={2025}
}

Table of Contents


1. Benchmarks & Datasets

Vehicle-Based Datasets

| Dataset | Venue | Sensors | Tasks | Download |
| --- | --- | --- | --- | --- |
| KITTI | CVPR'12 | 2 Cam (RGB), 2 Cam (Gray), 1 LiDAR (64-beam) | 3D Det, Stereo, Optical Flow, SLAM | Website |
| Argoverse | CVPR'19 | 7 Cam (RGB), 2 LiDAR (32-beam) | 3D Tracking, Forecasting, Map | Website |
| nuScenes | CVPR'20 | 6 Cam (RGB), 1 LiDAR (32-beam), 5 Radar | 3D Det, Seg, Occ, Map | Website |
| Waymo | CVPR'20 | 5 Cam (RGB), 5 LiDAR | Perception (Det, Seg, Track), Motion | Website |
| Lyft L5 | CoRL'20 | 7 Cam (RGB), 3 LiDAR, 5 Radar | 3D Det, Motion Forecasting/Planning | Website |
| ONCE | NeurIPS'21 | 7 Cam (RGB), 1 LiDAR (40-beam) | 3D Det (Self-supervised/Semi-supervised) | Website |
| PandaSet | ITSC'21 | 6 Cam (RGB), 2 LiDAR | 3D Det, LiDAR Seg | Website |

Drone-Based Datasets

| Dataset | Venue | Sensors | Tasks | Download |
| --- | --- | --- | --- | --- |
| M3ED | CVPRW'23 | Cam (RGB/Gray), LiDAR, Event | 2D/3D Seg, Depth, Optical Flow | Website |
| CDrone | GCPR'24 | Camera (CARLA) | Monocular 3D Det | Website |
| VisDrone | ICCVW'19 | Aerial Camera | Detection, Tracking | Website |
| UAVid | ISPRS JPRS'20 | Slanted Camera | Semantic Segmentation | Website |
| BioDrone | IJCV'24 | Bionic Camera | Tracking | Website |

Other Robotic Platforms

| Dataset | Venue | Platform | Sensors | Download |
| --- | --- | --- | --- | --- |
| RailSem19 | CVPRW'19 | Railway | Camera | Website |
| WaterScenes | TITS'24 | USV (Water) | Camera, Radar | Website |
| Han et al. | NMI'24 | Legged Robot | Depth Camera | Website |

2. Single-Modality Pre-Training

LiDAR-Only

Methods utilizing Point Cloud Contrastive Learning, Masked Autoencoders (MAE), or Forecasting.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| PointContrast | Unsupervised Pre-training for 3D Point Cloud Understanding | ECCV 2020 | GitHub |
| DepthContrast | Self-supervised Pretraining of 3D Features on any Point-Cloud | ICCV 2021 | GitHub |
| SegContrast | 3D Point Cloud Feature Representation Learning through Self-supervised Segment Discrimination | RA-L 2021 | GitHub |
| ProposalContrast | Unsupervised Pre-training for LiDAR-Based 3D Object Detection | ECCV 2022 | GitHub |
| Occupancy-MAE | Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders | T-IV 2023 | GitHub |
| ALSO | Automotive LiDAR Self-supervision by Occupancy Estimation | CVPR 2023 | GitHub |
| GD-MAE | Generative Decoder for MAE Pre-training on LiDAR Point Clouds | CVPR 2023 | GitHub |
| AD-PT | Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset | NeurIPS 2023 | GitHub |
| PatchContrast | Self-Supervised Pre-training for 3D Object Detection | arXiv 2023 | |
| MAELi | Masked Autoencoder for Large-Scale LiDAR Point Clouds | WACV 2024 | |
| BEV-MAE | Bird's Eye View Masked Autoencoders for Point Cloud Pre-training | AAAI 2024 | GitHub |
| UnO | Unsupervised Occupancy Fields for Perception and Forecasting | CVPR 2024 | |
| BEVContrast | Self-Supervision in BEV Space for Automotive Lidar Point Clouds | 3DV 2024 | GitHub |
| Copilot4D | Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion | ICLR 2024 | |
| T-MAE | Temporal Masked Autoencoders for Point Cloud Representation Learning | ECCV 2024 | GitHub |
| PICTURE | Point Cloud Reconstruction Is Insufficient to Learn 3D Representations | ACM MM 2024 | |
| LSV-MAE | Rethinking Masked-Autoencoder-Based 3D Point Cloud Pretraining | IV 2024 | |
| UNIT | Unsupervised Online Instance Segmentation through Time | arXiv 2024 | GitHub |
| R-MAE | Sense Less, Generate More: Pre-training LiDAR Perception with Masked Autoencoders | arXiv 2024 | GitHub |
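Many of the contrastive methods in this list (e.g., PointContrast, DepthContrast, BEVContrast) share the same core objective: an InfoNCE loss that pulls together features of the same 3D point or segment under two augmented views and pushes apart all other pairs in the batch. A minimal pure-Python sketch of that loss, with toy feature vectors standing in for real point embeddings (the function name and batching are illustrative, not from any listed paper):

```python
import math

def info_nce(anchors, positives, temperature=0.07):
    """Sketch of InfoNCE: anchors[i] and positives[i] are features of the
    same 3D point under two augmented views; every other pair in the batch
    acts as a negative."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def unit(v):
        n = math.sqrt(dot(v, v)) or 1.0
        return [x / n for x in v]

    anchors = [unit(a) for a in anchors]
    positives = [unit(p) for p in positives]

    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # max-shift for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse - logits[i]  # cross-entropy with true class = index i
    return total / len(anchors)
```

In the actual methods the features come from a 3D backbone and the positive pairs from registered point correspondences across views; the sketch only illustrates the shape of the loss.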

Camera-Only

Self-supervised learning from image sequences for driving/robotics.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| INoD | Injected Noise Discriminator for Self-Supervised Representation | RA-L 2023 | GitHub |
| TempO | Self-Supervised Representation Learning From Temporal Ordering | RA-L 2024 | |
| LetsMap | Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping | ECCV 2024 | |
| NeRF-MAE | Masked AutoEncoders for Self-Supervised 3D Representation Learning | ECCV 2024 | GitHub |
| VisionPAD | A Vision-Centric Pre-training Paradigm for Autonomous Driving | arXiv 2024 | |

3. Multi-Modality Pre-Training

LiDAR-Centric Pre-Training

Enhancing LiDAR representations by distilling knowledge from vision foundation models.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| SLidR | Image-to-Lidar Self-Supervised Distillation | CVPR 2022 | GitHub |
| ST-SLidR | Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss | CVPR 2023 | |
| I2P-MAE | Learning 3D Representations from 2D Pre-trained Models via Image-to-Point MAE | CVPR 2023 | GitHub |
| TriCC | Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast | CVPR 2023 | |
| Seal | Segment Any Point Cloud Sequences by Distilling Vision FMs | NeurIPS 2023 | GitHub |
| PRED | Pre-training via Semantic Rendering on LiDAR Point Clouds | NeurIPS 2023 | |
| ImageTo360 | 360° from a Single Camera: A Few-Shot Approach for LiDAR Segmentation | ICCVW 2023 | |
| ScaLR | Three Pillars improving Vision Foundation Model Distillation for Lidar | CVPR 2024 | |
| CSC | Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception | CVPR 2024 | GitHub |
| GPC | Pre-Training LiDAR-Based 3D Object Detectors Through Colorization | ICLR 2024 | GitHub |
| Cross-Modal SSL | Cross-Modal Self-Supervised Learning with Effective Contrastive Units | IROS 2024 | GitHub |
| SuperFlow | 4D Contrastive Superflows are Dense 3D Representation Learners | ECCV 2024 | GitHub |
| Rel | Image-to-Lidar Relational Distillation for Autonomous Driving Data | ECCV 2024 | |
| HVDistill | Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation | IJCV 2024 | GitHub |
| RadarContrast | Self-Supervised Contrastive Learning for Camera-to-Radar Knowledge Distillation | DCOSS-IoT 2024 | |
| CM3D | Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection | CoRL 2024 | GitHub |
| OLIVINE | Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models | NeurIPS 2024 | GitHub |
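Image-to-LiDAR distillation pipelines typically pair each LiDAR point with the pixel it projects to, then pull the 3D student feature toward the frozen 2D teacher feature at that pixel. A simplified sketch under a pinhole camera model; the function names, the dense `teacher_grid`, and the per-point cosine loss are illustrative assumptions, not the exact formulation of any listed paper (SLidR, for instance, works on superpixel/superpoint pools with a contrastive loss):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def distill_loss(points, point_feats, teacher_grid, fx, fy, cx, cy):
    """For each LiDAR point (camera frame), look up the frozen 2D teacher
    feature at the pixel it projects to and pull the 3D student feature
    toward it (mean 1 - cosine over valid points)."""
    total, n = 0.0, 0
    for (x, y, z), feat in zip(points, point_feats):
        if z <= 0:
            continue  # point is behind the camera
        u = int(round(fx * x / z + cx))  # pinhole projection
        v = int(round(fy * y / z + cy))
        if 0 <= v < len(teacher_grid) and 0 <= u < len(teacher_grid[0]):
            total += 1.0 - cosine(feat, teacher_grid[v][u])
            n += 1
    return total / n if n else 0.0
```

The teacher features would come from a frozen vision foundation model (e.g., a DINO- or CLIP-family backbone), while only the 3D student is updated.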

Camera-Centric Pre-Training

Learning 3D Geometry from Camera inputs using LiDAR supervision.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| DD3D | Is Pseudo-Lidar needed for Monocular 3D Object detection? | ICCV 2021 | GitHub |
| DEPT | Delving into the Pre-training Paradigm of Monocular 3D Object Detection | arXiv 2022 | |
| OccNet | Scene as Occupancy | ICCV 2023 | GitHub |
| GeoMIM | Towards Better 3D Knowledge Transfer via Masked Image Modeling | ICCV 2023 | GitHub |
| GAPretrain | Geometric-aware Pretraining for Vision-centric 3D Object Detection | arXiv 2023 | GitHub |
| UniScene | Multi-Camera Unified Pre-training via 3D Scene Reconstruction | RA-L 2024 | GitHub |
| SelfOcc | Self-Supervised Vision-Based 3D Occupancy Prediction | CVPR 2024 | GitHub |
| ViDAR | Visual Point Cloud Forecasting enables Scalable Autonomous Driving | CVPR 2024 | GitHub |
| DriveWorld | 4D Pre-trained Scene Understanding via World Models | CVPR 2024 | |
| OccFeat | Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation | CVPRW 2024 | |
| OccWorld | Learning a 3D Occupancy World Model for Autonomous Driving | ECCV 2024 | GitHub |
| MVS3D | Exploiting the Potential of Multi-Frame Stereo Depth Estimation Pre-training | IJCNN 2024 | |
| OccSora | 4D Occupancy Generation Models as World Simulators | arXiv 2024 | GitHub |
| MIM4D | Masked Modeling with Multi-View Video for Autonomous Driving | arXiv 2024 | GitHub |
| GaussianPretrain | A Simple Unified 3D Gaussian Representation for Visual Pre-training | arXiv 2024 | GitHub |
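A common ingredient in camera-centric pre-training is turning projected LiDAR returns into sparse depth targets for an image network. A toy sketch, assuming points are already transformed into the camera frame and a simple pinhole intrinsic (fx, fy, cx, cy); the helper names and the L1 objective are illustrative choices, not a specific paper's recipe:

```python
def sparse_depth_targets(points_cam, fx, fy, cx, cy, width, height):
    """Project LiDAR points (camera frame) onto the image plane and keep
    one depth value per pixel, preferring the nearest return."""
    targets = {}
    for x, y, z in points_cam:
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            if (u, v) not in targets or z < targets[(u, v)]:
                targets[(u, v)] = z  # nearest return wins on collisions
    return targets

def l1_depth_loss(pred_depth, targets):
    """L1 depth error, evaluated only at pixels that received a LiDAR hit."""
    if not targets:
        return 0.0
    return sum(abs(pred_depth[uv] - d) for uv, d in targets.items()) / len(targets)
```

Real pipelines add ego-motion compensation, occlusion filtering, and usually supervise a full depth or occupancy decoder rather than raw per-pixel values.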

Unified Pre-Training

Joint optimization of multi-modal encoders for unified representations.

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| PonderV2 | Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm | arXiv 2023 | GitHub |
| UniPAD | A Universal Pre-training Paradigm for Autonomous Driving | CVPR 2024 | GitHub |
| UniM2AE | Multi-Modal Masked Autoencoders with Unified 3D Representation | ECCV 2024 | GitHub |
| ConDense | Consistent 2D/3D Pre-training for Dense and Sparse Features | ECCV 2024 | |
| Unified Pretrain | Learning Shared RGB-D Fields: Unified Self-supervised Pre-training | arXiv 2024 | GitHub |
| BEVWorld | A Multimodal World Simulator for Autonomous Driving via Unified BEV Latent Space | arXiv 2024 | GitHub |
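Unified masked-autoencoding frameworks in this family can be caricatured as masked reconstruction over a single token sequence drawn from both modalities. A schematic sketch with pluggable `encode`/`decode` callables; the structure (joint masking, shared latent, reconstruction of the masked slots) is the point, and all names here are hypothetical stand-ins for real networks:

```python
import random

def masked_joint_step(cam_tokens, lidar_tokens, encode, decode,
                      mask_ratio=0.5, seed=0):
    """One masked-reconstruction step: mask a fraction of camera and LiDAR
    tokens jointly, encode the visible remainder into a shared latent, then
    score the decoder's reconstruction of the masked tokens (mean MSE)."""
    tokens = list(cam_tokens) + list(lidar_tokens)
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)  # deterministic random masking
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked = sorted(order[:n_mask])
    visible = sorted(order[n_mask:])

    latent = encode([tokens[i] for i in visible])  # modality-agnostic latent
    recon = decode(latent, masked)                 # one vector per masked slot
    loss = 0.0
    for pred, i in zip(recon, masked):
        tgt = tokens[i]
        loss += sum((a - b) ** 2 for a, b in zip(pred, tgt)) / len(tgt)
    return loss / n_mask
```

In a real system `encode`/`decode` would be transformer networks and the tokens would carry modality and positional embeddings; the sketch only shows the information flow.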

4. Open-World Perception and Planning

Text-Grounded Understanding

| Model | Paper | Venue | GitHub |
| --- | --- | --- | --- |
| CLIP2Scene | Towards Label-efficient 3D Scene Understanding by CLIP | CVPR 2023 | GitHub |
| OpenScene | 3D Scene Understanding with Open Vocabularies | CVPR 2023 | GitHub |
| CLIP-FO3D | Learning Free Open-world 3D Scene Representations from 2D Dense CLIP | ICCVW 2023 | |
| POP-3D | Open-Vocabulary 3D Occupancy Prediction from Images | NeurIPS 2023 | GitHub |
| VLM2Scene | Self-Supervised Image-Text-LiDAR Learning with Foundation Models | AAAI 2024 | GitHub |
| SAL | Better Call SAL: Towards Learning to Segment Anything in Lidar | ECCV 2024 | GitHub |
| Affinity3D | Propagating Instance-Level Semantic Affinity for Zero-Shot Semantic Seg | ACM MM 2024 | |
| UOV | 3D Unsupervised Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving | arXiv 2024 | GitHub |

Unified World Representation for Action


5. Acknowledgements

We thank the authors of the referenced papers for their open-source contributions.
