Minghan LI1,2,*, Shuai LI1,2,*, Xindong ZHANG2 and Lei ZHANG1,2,$\dagger$
1Hong Kong Polytechnic University, 2OPPO Research Institute
[arXiv paper] [Video demo on the project page]
We propose a novel unified video segmentation (VS) architecture, namely UniVS, by using prompts as queries. For each target of interest, UniVS averages the prompt features stored in the memory pool as its initial query, which is fed to a target-wise prompt cross-attention (ProCA) layer to integrate comprehensive prompt features. In addition, by taking the predicted masks of entities as their visual prompts, UniVS can convert different VS tasks into the task of prompt-guided target segmentation, eliminating the heuristic inter-frame matching. More video demos are available on our project page: https://sites.google.com/view/unified-video-seg-univs
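The "average, then cross-attend" idea above can be illustrated with a toy sketch. This is not the UniVS implementation: the function names, the two-dimensional features, and the single-head dot-product attention are assumptions made for illustration, and the sketch omits the learned projections, multiple heads, and normalization that a real attention layer would typically have.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, memory):
    """Single-head scaled dot-product attention of one query over memory vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    # Weighted sum of the memory vectors (keys double as values in this toy).
    return [sum(w * v[i] for w, v in zip(weights, memory)) for i in range(len(query))]

# Hypothetical memory pool holding three prompt features for one target.
memory_pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

initial_query = mean_vector(memory_pool)                  # averaged prompt features
refined_query = cross_attend(initial_query, memory_pool)  # toy stand-in for ProCA
```

The averaged vector gives every stored prompt equal weight, while the attention step re-weights the prompts by their similarity to that average.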
| Task | VIS | VSS | VPS |
|---|---|---|---|
| Output | | | |

| Task | VOS | RefVOS | PVOS |
|---|---|---|---|
| Prompt | | "A frog is holded by a person in his hand and place near the another frog" | |
| Output | | | |
Updates
- [Highlights]: To facilitate the evaluation of video segmentation tasks under the Detectron2 framework, we have implemented the evaluation metrics of six existing video segmentation tasks (VIS, VSS, VPS, VOS, PVOS, and RefVOS) as Detectron2 Evaluators. You can now evaluate VS tasks directly in our code, just as you would for COCO, without manually adapting any evaluation metrics yourself. Please refer to univs/inference and univs/evaluation for the specific code. If you encounter any issues when using our code, please open a GitHub issue. We will reply as soon as possible.
- [May-30-2024]: Added support for testing custom images on category-guided image segmentation tasks. Enjoy! :)
- [April-14-2024]: Added support for testing custom videos on text prompt-guided VS tasks. Enjoy! :)
- [April-10-2024]: Added support for testing custom videos on category-guided VS tasks. Enjoy! :)
- [April-8-2024]: Added support for extracting semantic feature maps and object tokens from custom videos. These can be used to train segmentation-guided text-to-video generation.
- [Mar-20-2024]: Models trained with EMA on stage 3 have been released! You can download them from the Model Zoo.
- [Feb-29-2024]: Models trained on stage 2 have been released! Try them on your video data!
- [Feb-28-2024]: Our paper has been accepted by CVPR 2024! We released the paper on arXiv.
Table of Contents
- Installation
- Datasets
- Unified Training and Inference
  - Unified Training for Images and Videos
  - Unified Inference for Videos
  - Detailed Steps for Inference
  - Performance on 10 Benchmarks
- Visualization Demo
  - Visualization Code
  - Visualization Demo for Custom Videos
  - Category-guided VS Tasks (VIS, VSS, VPS)
  - Text Prompt-guided VS Task (RefVOS)
- Semantic Extraction for Custom Videos
Installation
See installation instructions.
Datasets
See Datasets preparation.
Unified Training and Inference
Unified Training for Images and Videos
We provide a script, train_net.py, that trains all the configs provided in UniVS.
Download pretrained weights of Mask2Former and save them into the path pretrained/m2f_panseg/, then run the following three stages one by one:
sh tools/run/univs_r50_stage1.sh
sh tools/run/univs_r50_stage2.sh
sh tools/run/univs_r50_stage3.sh
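If you prefer to launch the three stages from a single entry point, a minimal sketch is shown here. The helper `run_stages` and its `dry_run` flag are hypothetical and not part of UniVS; only the three script paths come from this README. The sketch assumes it is run from the repository root.

```python
import subprocess

# Stage scripts named in this README, in the order they must run.
STAGE_SCRIPTS = [
    "tools/run/univs_r50_stage1.sh",
    "tools/run/univs_r50_stage2.sh",
    "tools/run/univs_r50_stage3.sh",
]

def run_stages(scripts=STAGE_SCRIPTS, dry_run=False):
    """Hypothetical helper: run each stage in order and stop at the first failure.

    Returns the list of commands that were (or, with dry_run, would be) executed.
    """
    commands = [["sh", script] for script in scripts]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a later stage never starts after a failed one.
            subprocess.run(command, check=True)
    return commands

if __name__ == "__main__":
    # Print the commands without launching any training.
    for command in run_stages(dry_run=True):
        print(" ".join(command))
```

Stopping at the first failure matters because each stage is meant to follow the previous one; switch `dry_run` to `False` only once the pretrained weights are in place.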
Unified Inference for Videos
Download the trained weights from the Model Zoo and save them into the path output/stage{1,2,3}/. We support multiple ways to evaluate UniVS on the VIS, VSS, VPS, VOS, PVOS, and RefVOS tasks:
# test all six tasks using ResNet50 backbone (one-model)
$ sh tools/test/test_r50.sh
# test pvos only using ResNet50, swin-T/B/L backbones
$ sh tools/test/individual_task/test_pvos.sh
Detailed Steps for Inference
Step 1: Download the datasets you need from their original websites. Please refer to dataset preparation for more guidance.
Step 2: Build the datasets in Detectron2 format here. The datasets used in our paper have already been built, so this step can be skipped for them. A newly added dataset must be built by yourself.
Step 3: Set the dataset name to the dataset you need in the inference .sh command. Taking the OVIS dataset of the VIS task as an example, you just need to add the argument DATASETS.TEST '("ovis_val", )' \ in the file ./tools/test/individual_task/test_vis.sh. Then run the command sh tools/test/individual_task/test_vis.sh.
Step 4: For the YouTube-VIS, OVIS, YouTube-VOS, and Ref-YouTube-VOS datasets, you need to submit the predicted results (results.json in the output dir) to the CodaLab server for performance evaluation. The official CodaLab websites are listed here for your convenience: YouTube-VIS 2021, OVIS, YouTube-VOS, Ref-YouTube-VOS. For the other datasets, the ground-truth annotations of the validation set are released, so you can get the performance directly after Step 3.
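A note on the trailing comma in the Step 3 override `'("ovis_val", )'`: assuming the value is decoded as a Python literal (as yacs-style configs generally do), the comma is what makes it a one-element tuple rather than a plain string. The helper `parse_override` is hypothetical and only illustrates that parsing rule. It is not the UniVS or Detectron2 code path.

```python
from ast import literal_eval

def parse_override(value):
    """Hypothetical helper: interpret an override string as a Python literal."""
    return literal_eval(value)

with_comma = parse_override('("ovis_val", )')    # one-element tuple
without_comma = parse_override('("ovis_val")')   # parentheses only group: a string

print(type(with_comma).__name__, with_comma)        # tuple ('ovis_val',)
print(type(without_comma).__name__, without_comma)  # str ovis_val
```

When adapting the command to another dataset, keep the tuple form, for example by changing only the name inside the quotes.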
Performance on 10 Benchmarks
UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks, covering video instance, semantic, panoptic, object, and referring segmentation tasks.
Visualization Demo
Visualization Code
Visualization is available during inference, but you need to turn it on manually.
a) For category-guided VS tasks, you can visualize results by setting self.visualize_results_enable = True here. The visualization code for VIS/VSS/VPS is here.
b) For prompt-guided VS tasks, you need to set self.visualize_results_only_enable = True here. The visualization code for VOS/PVOS/RefVOS is here.
Visualization Demo for Custom Videos
Please follow the steps below to run UniVS on custom videos. So far, only category-guided VS tasks (VIS, VSS, VPS) and language-guided VS tasks are supported. We will add visual prompt-guided VS tasks later.
Category-guided VS Tasks (VIS, VSS, VPS)
# Step 1: move your custom data into `./datasets/custom_videos/raw/`. Two input forms are supported:
# a. a video file in 'mp4', 'avi', 'mov', or 'mkv' format
# b. a subdirectory of `./datasets/custom_videos/raw/` containing all frames of one video
# For your convenience, two examples are provided in this directory, so you can run the command below directly
# Step 2: run it
$ sh tools/test_custom_videos/test_custom_videos.sh
# Step 3: check the predicted results in the path below
$ cd datasets/custom_videos/inference
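Before running the script, it can help to check what is in the raw directory. The sketch here sorts entries into the two input forms from Step 1. The helpers `classify_entry` and `scan_raw_dir` are hypothetical and not part of UniVS, and the case-insensitive extension matching is an assumption. Only the directory path and the four extensions come from this README.

```python
from pathlib import Path

# Video extensions listed in Step 1.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}

def classify_entry(name, is_dir):
    """Hypothetical helper: label one entry of the raw directory."""
    if is_dir:
        return "frames"   # form b: a subdirectory holding video frames
    if Path(name).suffix.lower() in VIDEO_EXTENSIONS:
        return "video"    # form a: a single video file
    return "skipped"      # anything else is not a listed input form

def scan_raw_dir(raw_dir="datasets/custom_videos/raw"):
    """Map each entry name in raw_dir to its label; empty dict if raw_dir is missing."""
    raw = Path(raw_dir)
    if not raw.is_dir():
        return {}
    return {entry.name: classify_entry(entry.name, entry.is_dir())
            for entry in sorted(raw.iterdir())}

if __name__ == "__main__":
    for name, kind in scan_raw_dir().items():
        print(f"{kind:8s} {name}")
```

Entries labeled `skipped` are files whose extension is not among the four formats listed in Step 1.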