🦖 OV-DINO

Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Sun Yat-Sen University · Pengcheng Lab · Meituan Inc. · HIT, Shenzhen

Highlights

  • OV-DINO is a novel unified open-vocabulary detection approach that offers superior performance and effectiveness for practical real-world applications.
  • OV-DINO introduces a Unified Data Integration (UniDI) pipeline that integrates diverse data sources for end-to-end pre-training, and a Language-Aware Selective Fusion (LASF) module that improves the model's vision-language understanding (see the data-unification sketch after this list).
  • OV-DINO shows significant performance improvements on the COCO and LVIS benchmarks, achieving relative gains of +2.5% AP on COCO and +12.7% AP on LVIS over G-DINO in zero-shot evaluation.
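
To make the UniDI idea concrete, below is a minimal sketch of how the three data sources could be mapped into one detection-centric record. The field names and helper functions are illustrative assumptions, not OV-DINO's actual data schema; the key move for image-text data is to treat the caption as a category name and the whole image as its box, which avoids pseudo-label generation.

def unify_detection(sample):
    # Detection data already matches the target (image, names, boxes) format.
    return {"image": sample["image"],
            "names": sample["category_names"],
            "boxes": sample["boxes"]}

def unify_grounding(sample):
    # Grounding data: each grounded phrase serves as a category name.
    return {"image": sample["image"],
            "names": sample["phrases"],
            "boxes": sample["boxes"]}

def unify_image_text(sample):
    # Image-text data: the caption becomes a single category whose box
    # covers the whole image, so no pseudo-labeling is required.
    h, w = sample["image"].shape[-2:]
    return {"image": sample["image"],
            "names": [sample["caption"]],
            "boxes": [[0.0, 0.0, float(w), float(h)]]}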

Abstract

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into a detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance cross-modality alignment through a language-aware query selection and fusion process. We evaluate the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, OV-DINO fine-tuned on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone.
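
As a rough PyTorch sketch of the LASF idea described above, the module below selects the image tokens most relevant to the text embedding as object embeddings, then dynamically fuses them into the learnable content queries. The attention-plus-gating fusion and all hyperparameters here are simplifying assumptions, not the module's exact design.

import torch
import torch.nn as nn

class LanguageAwareSelectiveFusion(nn.Module):
    # Simplified sketch of LASF: language-aware query selection followed by
    # a gated fusion of the selected object embedding into the content queries.
    def __init__(self, dim, num_queries=900):
        super().__init__()
        self.num_queries = num_queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, image_emb, text_emb, content_queries):
        # image_emb: (B, N, D); text_emb: (B, C, D); content_queries: (B, Q, D)
        sim = image_emb @ text_emb.transpose(1, 2)        # (B, N, C) region-text similarity
        relevance = sim.max(dim=-1).values                # best-matching class per image token
        topk = relevance.topk(self.num_queries, dim=1).indices
        object_emb = torch.gather(
            image_emb, 1,
            topk.unsqueeze(-1).expand(-1, -1, image_emb.size(-1)))
        # Fuse the selected object embedding into the queries via cross-attention,
        # modulated by a language-aware gate (an assumption of this sketch).
        fused, _ = self.cross_attn(content_queries, object_emb, object_emb)
        gate = torch.sigmoid(self.gate(content_queries))
        return content_queries + gate * fused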

Overview

Figure 1. Overall Framework of OV-DINO. The pre-training of OV-DINO draws on three primary data sources (Detection, Grounding, Image-Text). OV-DINO has three main components: a text encoder, an image encoder, and a language-aware detection decoder. First, we process the text inputs with the Unified Data Integration (UniDI) pipeline to ensure consistent embedding representations across these data sources. The unified prompted text inputs then pass through the Text Encoder to extract the text embedding, while the original image inputs pass through the Image Encoder and several Encoder Layers to produce the multi-scale refined image embedding. Next, Language-Aware Query Selection selects the image embedding most relevant to the text embedding as the object embedding. The selected object embedding and the learnable content queries pass through the Language-Aware Decoder, which fuses the content queries dynamically. Finally, OV-DINO outputs classification scores by computing the similarity between the projected query embedding and the text embedding through region-text alignment, and regresses bounding boxes via an MLP layer.
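
The walkthrough above can be condensed into the following forward-pass sketch. Every module argument is a stand-in for the corresponding component in Figure 1; none of these names reflect the actual OV-DINO codebase.

import torch.nn.functional as F

def ov_dino_forward(image, prompted_text, modules):
    # Hedged outline of Figure 1; "modules" bundles stand-in components.
    text_emb = modules.text_encoder(prompted_text)                   # (B, C, D)
    img_emb = modules.encoder_layers(modules.image_encoder(image))   # (B, N, D)
    # Language-aware query selection and decoding (see the LASF sketch above).
    queries = modules.decoder(modules.content_queries, img_emb, text_emb)  # (B, Q, D)
    # Region-text alignment: classification scores from query-text similarity.
    scores = modules.proj(queries) @ F.normalize(text_emb, dim=-1).transpose(1, 2)
    boxes = modules.bbox_mlp(queries).sigmoid()                      # (B, Q, 4) normalized boxes
    return scores, boxes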

Results

Results on LVIS

Table 1. Zero-shot Domain Transfer Evaluation on LVIS MiniVal and Val Datasets (%). APr, APc, and APf indicate the AP of rare, common, and frequent categories, respectively. Gray numbers denote that the model is trained on the LVIS dataset in supervised or few-shot settings. CC3M† denotes the pseudo-labeled CC3M used in YOLO-World. CC1M‡ denotes a filtered subset of the CC3M dataset in our setting.

Results on COCO

Table 2. Zero-shot Domain Transfer and Fine-tuning Evaluation on COCO (%). OV-DINO achieves superior performance compared to prior methods in zero-shot evaluation. When further fully fine-tuned on COCO, OV-DINO surpasses the previous state-of-the-art (SoTA) performance under the same setting. Gray numbers denote that the method is trained on the COCO dataset in supervised or few-shot settings.

Demo

We provide an online demo; click and enjoy! OV-DINO detects anything based on your provided classes. OV-SAM marries OV-DINO with SAM2, enabling detection and then segmentation of anything based on your provided classes (see the sketch below).
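
For intuition, here is a hypothetical sketch of how OV-SAM composes the two models; ov_dino and sam2 are illustrative stand-ins, not the demo's real API.

def ov_sam(image, class_names, ov_dino, sam2):
    # Detect with user-provided classes, then prompt SAM2 with the boxes.
    scores, boxes = ov_dino(image, class_names)
    masks = sam2(image, boxes)
    return boxes, masks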

Acknowledgement

This project references several excellent open-source repositories: Detectron2, detrex, GLIP, G-DINO, and YOLO-World. Thanks for their wonderful work and contributions to the community.

BibTeX


@article{wang2024ovdino,
  title={OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion},
  author={Hao Wang and Pengzhen Ren and Zequn Jie and Xiao Dong and Chengjian Feng and Yinlong Qian and Lin Ma and Dongmei Jiang and Yaowei Wang and Xiangyuan Lan and Xiaodan Liang},
  journal={arXiv preprint arXiv:2407.07844},
  year={2024}
}