Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Paper Link: https://arxiv.org/pdf/2412.00811

In this paper, we propose a new dataset and algorithm for video moment retrieval, which effectively relieves the high cost of human annotations. Our experiments highlight that:

  • Compared to the fully supervised approach SimBase, our ReCorrect model achieves 81.3% and 86.7% of its performance in the zero-shot and unsupervised settings, respectively.
  • This narrow performance gap underscores the potential of our Vid-Morp dataset to address video moment retrieval (VMR)'s heavy reliance on manual annotations.

Quick Start

To run the code, use the following command, which integrates the evaluation process for the 1) zero-shot, 2) unsupervised, and 3) fully supervised settings.

python main.py --cfg ./experiment/charades/recorrect_eval_configs_on_ZeroShot+Unsup+Full.json --eval

You do not need to download anything extra to run the code; the repository is self-contained with the necessary features and checkpoints.

  1. CLIP features are available in the data/charades/feat directory.
  2. Pre-trained checkpoints are located in the ckpt/charades directory:
    • zero_shot.ckpt: zero-shot model.
    • unsup.ckpt: unsupervised model.
    • full_sup.ckpt: fully supervised model.
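For illustration, the command-line interface above can be sketched as follows. This is a hypothetical minimal sketch, not the actual main.py, which may parse additional options and config keys:

```python
# Minimal sketch of a main.py-style entry point: parse a --cfg path to a
# JSON experiment config and an --eval flag. Hypothetical; for illustration only.
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="ReCorrect evaluation entry point (sketch)")
    parser.add_argument("--cfg", required=True, help="path to a JSON experiment config")
    parser.add_argument("--eval", action="store_true", help="run evaluation only")
    return parser.parse_args(argv)


def load_config(path):
    # The real config schema is defined by the repository's JSON files.
    with open(path) as f:
        return json.load(f)


args = parse_args(["--cfg", "./experiment/charades/recorrect_eval_configs_on_ZeroShot+Unsup+Full.json", "--eval"])
print(args.eval)  # True
```

With --eval set, such an entry point would skip training and only run the evaluation loop over the three checkpoints listed above.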

Fully Supervised Setting

| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| SimBase | 77.77 | 66.48 | 44.01 | 56.15 |
| ReCorrect (Ours) | 78.55 | 68.39 | 45.78 | 57.42 |

 

Zero-Shot Setting

| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| ReCorrect | 66.54 | 51.15 | 28.54 | 45.63 |
| % of SimBase | 85.6% | 76.9% | 64.8% | 81.3% |

 

Unsupervised Setting

| Method | R@0.1 | R@0.2 | R@0.3 | mIoU |
|---|---|---|---|---|
| ReCorrect | 70.96 | 54.42 | 31.10 | 48.66 |
| % of SimBase | 91.2% | 81.9% | 70.7% | 86.7% |
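The "% of SimBase" rows can be reproduced directly from the absolute scores in the tables above, dividing each ReCorrect score by the corresponding fully supervised SimBase score:

```python
# Reproduce the "% of SimBase" rows from the absolute scores reported above.
simbase   = {"R@0.1": 77.77, "R@0.2": 66.48, "R@0.3": 44.01, "mIoU": 56.15}
zero_shot = {"R@0.1": 66.54, "R@0.2": 51.15, "R@0.3": 28.54, "mIoU": 45.63}
unsup     = {"R@0.1": 70.96, "R@0.2": 54.42, "R@0.3": 31.10, "mIoU": 48.66}


def pct_of_simbase(scores):
    # Ratio of each metric to the fully supervised SimBase score, in percent.
    return {k: round(100.0 * v / simbase[k], 1) for k, v in scores.items()}


print(pct_of_simbase(zero_shot))  # mIoU -> 81.3
print(pct_of_simbase(unsup))      # mIoU -> 86.7
```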

 

Motivation


A crucial challenge in video moment retrieval is its heavy reliance on extensive manual annotations for training. To overcome this, we introduce a large-scale dataset for Video Moment Retrieval Pretraining (Vid-Morp), collected with minimal human involvement. Vid-Morp comprises over 50K in-the-wild videos and 200K pseudo training samples. Models pretrained on Vid-Morp significantly reduce annotation costs and demonstrate strong generalizability across diverse downstream settings.

Dataset

Dataset Download

To obtain the dataset download link, please send an email to peijun001@e.ntu.edu.sg. Note that the dataset is for academic use only.

Comparison to Existing Datasets

(Figure: comparison of Vid-Morp to existing datasets)

Citation

If you use our code or dataset in your research, please cite:

@article{bao2024vid,
  title={Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild},
  author={Bao, Peijun and Kong, Chenqi and Shao, Zihao and Ng, Boon Poh and Er, Meng Hwa and Kot, Alex C},
  journal={arXiv preprint arXiv:2412.00811},
  year={2024}
}
