X-SAM

From Segment Anything to Any Segmentation

1 Sun Yat-sen University, 2 Pengcheng Lab, 3 Meituan Inc
📧 Corresponding author.

🚀 Highlight

  • X-SAM is a novel unified segmentation MLLM that delivers superior performance across all image segmentation benchmarks.
  • X-SAM integrates SAM into MLLMs through a unified formulation adapted to all image segmentation tasks, extending SAM's capability from segment anything to any segmentation.
  • X-SAM is co-trained on multiple data sources with an effective multi-stage training strategy, achieving robust performance across all tasks (see the illustrative data-mixing sketch below).
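
To make the co-training idea concrete, here is a minimal, illustrative sketch of mixing several segmentation data sources with a weighted sampler. The function name, dataset stand-ins, and weights are hypothetical and do not reflect the released X-SAM training recipe.

# An illustrative-only sketch of co-training on multiple segmentation data sources
# with a weighted mixture sampler; names and weights are hypothetical stand-ins.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler


def build_mixed_loader(datasets, weights, batch_size=8):
    """datasets: one Dataset per source (e.g. generic / referring / interactive seg);
    weights: relative sampling weight per source, so every batch mixes tasks."""
    mixed = ConcatDataset(datasets)
    # Expand per-source weights into per-sample weights.
    per_sample = []
    for ds, w in zip(datasets, weights):
        per_sample.extend([w / len(ds)] * len(ds))
    sampler = WeightedRandomSampler(per_sample, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)


if __name__ == "__main__":
    source_a = TensorDataset(torch.zeros(100, 4))  # stand-in for a generic-seg source
    source_b = TensorDataset(torch.ones(20, 4))    # stand-in for a referring-seg source
    loader = build_mixed_loader([source_a, source_b], weights=[1.0, 1.0])
    print(next(iter(loader))[0].shape)             # torch.Size([8, 4])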

📄 Abstract

The Segment Anything Model (SAM) has emerged as a pivotal advancement in computer vision, particularly within the context of visual-prompt-driven segmentation. However, SAM is constrained by intrinsic limitations in multi-mask prediction and category-specific image segmentation tasks. Concurrently, Large Language Models (LLMs) have exhibited remarkable proficiency in comprehensive knowledge representation across a wide range of domains, yet they inherently lack the capacity for pixel-level perceptual understanding. To bridge these complementary gaps, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that seamlessly integrates SAM with LLMs, thereby augmenting SAM's functionality from segment anything to any segmentation. Specifically, we introduce a novel approach for integrating SAM with MLLMs, which facilitates more advanced dense, pixel-level perceptual comprehension within MLLMs. Furthermore, we propose a new segmentation paradigm, termed Visual GrounDed (VGD) segmentation, which empowers MLLMs with visually grounded, pixel-wise interpretive capabilities. To enable effective training of MLLMs on diverse data sources, we devise a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficacy for multimodal pixel-level visual understanding.

🔍 Overview

X-SAM Framework

Fig. 1. Overview of X-SAM. X-SAM consists of dual encoders, dual projectors, a language model, a segmentation connector, and a segmentation decoder. The dual encoders first encode the image simultaneously; their features are then projected into the same dimension as the text embeddings and fed to the language model together with the tokenized text embeddings for instruction-guided image understanding. The SAM-encoded features are bridged to the segmentation decoder through the segmentation connector. Finally, the <SEG> token output by the LLM is decoded into segmentation masks by the segmentation decoder.
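
To make the data flow in Fig. 1 easier to follow, below is a minimal PyTorch sketch of the described forward pass (dual encoders, dual projectors, language model, segmentation connector, segmentation decoder). All module choices, dimensions, and the single <SEG>-query mask decoding are simplified, hypothetical stand-ins; this is not the released X-SAM implementation.

# A minimal, illustrative sketch of the Fig. 1 forward pass (NOT the released code).
import torch
import torch.nn as nn


class XSAMSketch(nn.Module):
    def __init__(self, d_llm=512, d_img=256, d_sam=256):
        super().__init__()
        # Dual encoders: a general image encoder for the LLM stream and a
        # SAM-style encoder for dense segmentation features (patchify stand-ins).
        self.image_encoder = nn.Conv2d(3, d_img, kernel_size=16, stride=16)
        self.sam_encoder = nn.Conv2d(3, d_sam, kernel_size=16, stride=16)
        # Dual projectors map both visual streams to the LLM embedding width.
        self.image_projector = nn.Linear(d_img, d_llm)
        self.sam_projector = nn.Linear(d_sam, d_llm)
        # Stand-in language model: any causal LM that accepts input embeddings.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Segmentation connector bridges SAM features to the mask decoder.
        self.connector = nn.Conv2d(d_sam, d_sam, kernel_size=1)
        # Projects the <SEG> token hidden state into a mask query.
        self.seg_query_proj = nn.Linear(d_llm, d_sam)

    def forward(self, image, text_embeds):
        # 1) Both encoders encode the image simultaneously.
        img_feat = self.image_encoder(image).flatten(2).transpose(1, 2)  # (B, N, d_img)
        sam_feat = self.sam_encoder(image)                               # (B, d_sam, H, W)
        # 2) Project both visual streams to the LLM dimension and concatenate with text.
        vis_tokens = torch.cat(
            [self.image_projector(img_feat),
             self.sam_projector(sam_feat.flatten(2).transpose(1, 2))], dim=1)
        hidden = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        # 3) Take the hidden state of the <SEG> token (assumed last position here).
        seg_token = hidden[:, -1]                                        # (B, d_llm)
        # 4) Bridge SAM features through the connector and decode a mask by
        #    correlating the projected <SEG> query with the dense features.
        dense = self.connector(sam_feat)                                 # (B, d_sam, H, W)
        query = self.seg_query_proj(seg_token)                           # (B, d_sam)
        masks = torch.einsum("bc,bchw->bhw", query, dense)               # (B, H, W)
        return masks


if __name__ == "__main__":
    model = XSAMSketch()
    image = torch.randn(1, 3, 224, 224)
    text_embeds = torch.randn(1, 16, 512)  # stand-in for tokenized text embeddings
    print(model(image, text_embeds).shape)  # torch.Size([1, 14, 14])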

📊 Benchmark Results

Table 1. Comprehensive Performance Comparison. We compare X-SAM with other methods, including segmentation-specific models (gray) and MLLMs. A "-" indicates that the method does not support the task, while a "?" indicates that the method does not report results on the dataset. X-SAM achieves state-of-the-art performance across all image segmentation tasks with a single model. The best performance is highlighted in bold, and the second-best is underlined.

Benchmark Results

📊 More Results

Table 2. Comparison of Referring Segmentation. We compare different methods on the referring segmentation benchmarks, grouped by their LLM or MLLM type.

Benchmark Results

Table 3. Comparison of Generic Segmentation. We compare different methods on the generic segmentation benchmarks.

Generic Segmentation Results

Table 4. Comparison of OV Segmentation. We compare different methods on the OV segmentation benchmarks.

OV Segmentation Results

Table 5. Comparison of GCG Segmentation. We compare different methods on the GCG segmentation benchmark. † indicates that the method used the GranD dataset for pretraining.

Benchmark Results

Table 6. Comparison of Reasoning Segmentation. We compare X-SAM with other methods on the reasoning segmentation benchmark.

Benchmark Results

Table 7. Comparison of Interactive Segmentation. We compare X-SAM with other methods on the interactive segmentation benchmark.

Interactive Segmentation Results

Table 8. Comparison of VGD Segmentation. We compare different methods on the VGD segmentation benchmark. † indicates our evaluation results following the X-SAM setting.

Benchmark Results

Table 9. Comparison of Image-level Benchmarks. We compare X-SAM with other methods on the image-level benchmarks, including MME, MMBench, SEED-Bench, POPE, and AI2D.

Benchmark Results

🚀 Interactive Demo

Experience X-SAM in action! Try our interactive demo to see how X-SAM handles advanced segmentation tasks.

Launch Demo
X-SAM Demo Screenshot

😊 Acknowledgement

This project references several excellent open-source repositories: xtuner, VLMEvalKit, and Sa2VA. Thanks for their wonderful work and contributions to the community.

📝 Citation

@article{wang2024xsam,
  title={X-SAM: From Segment Anything to Any Segmentation},
  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2024.xxxxx},
  year={2024}
}