The Segment Anything Model (SAM) has emerged as a pivotal advancement in computer vision, particularly for visual-prompt-driven segmentation. However, SAM is constrained by intrinsic limitations in multi-mask prediction and category-specific image segmentation tasks. Concurrently, Large Language Models (LLMs) have exhibited remarkable proficiency in comprehensive knowledge representation across a wide range of domains, yet they inherently lack the capacity for pixel-level perceptual understanding. To bridge these complementary gaps, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that seamlessly integrates SAM with LLMs, extending SAM's capability from segment anything to any segmentation. Specifically, we introduce a novel approach for integrating SAM with MLLMs, which equips MLLMs with dense, pixel-level perceptual comprehension. Furthermore, we propose a new segmentation paradigm, termed Visual GrounDed (VGD) segmentation, which empowers MLLMs with visually grounded, pixel-wise interpretive capabilities. To enable effective training on diverse data sources, we devise a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficacy for multimodal pixel-level visual understanding.
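To give a rough sense of what co-training across heterogeneous segmentation datasets can look like in practice, the sketch below mixes several task-specific datasets into one loader with per-dataset sampling weights. This is a minimal illustration only; the dataset names, weights, and the `build_cotrain_loader` helper are hypothetical and are not part of the official X-SAM codebase.

```python
# Minimal sketch of multi-dataset co-training (hypothetical helper, not the
# official X-SAM implementation).
from torch.utils.data import ConcatDataset, DataLoader, Dataset, WeightedRandomSampler


def build_cotrain_loader(datasets: dict[str, Dataset],
                         weights: dict[str, float],
                         batch_size: int = 8) -> DataLoader:
    """Concatenate task-specific datasets (e.g. referring, generic, GCG,
    interactive, VGD) and sample with per-dataset weights so that no single
    task dominates a training batch."""
    names = list(datasets.keys())
    concat = ConcatDataset([datasets[n] for n in names])

    # Expand each dataset-level weight into a per-sample weight.
    per_sample_weights = []
    for n in names:
        w = weights[n] / max(len(datasets[n]), 1)
        per_sample_weights.extend([w] * len(datasets[n]))

    sampler = WeightedRandomSampler(per_sample_weights,
                                    num_samples=len(concat),
                                    replacement=True)
    return DataLoader(concat, batch_size=batch_size, sampler=sampler)
```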
Fig. 1. Overview of X-SAM. X-SAM consists of dual encoders, dual projectors, a language model, a segmentation connector, and a segmentation decoder. The dual encoders first encode the input image in parallel; the dual projectors then map the resulting features to the same dimension as the text embeddings, which are fed to the language model together with the tokenized text embeddings for instruction-guided image understanding. The SAM-encoded features are bridged to the segmentation decoder through the segmentation connector. Finally, the <SEG> tokens output by the LLM are decoded by the segmentation decoder into segmentation masks.
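To make the data flow in Fig. 1 concrete, here is a minimal PyTorch-style sketch of the forward pass: dual encoders, dual projectors, a language model, and a segmentation connector/decoder. All module and method names below are illustrative placeholders under assumed interfaces, not the actual X-SAM implementation.

```python
import torch
import torch.nn as nn


class XSAMSketch(nn.Module):
    """Illustrative forward pass mirroring Fig. 1 (placeholder modules)."""

    def __init__(self, vis_encoder, sam_encoder, vis_proj, sam_proj,
                 language_model, seg_connector, seg_decoder, seg_token_id):
        super().__init__()
        self.vis_encoder = vis_encoder      # general-purpose vision encoder
        self.sam_encoder = sam_encoder      # SAM image encoder
        self.vis_proj = vis_proj            # projector -> LLM embedding dim
        self.sam_proj = sam_proj            # projector -> LLM embedding dim
        self.llm = language_model           # causal LM exposing hidden states
        self.seg_connector = seg_connector  # bridges SAM features to the decoder
        self.seg_decoder = seg_decoder      # segmentation (mask) decoder
        self.seg_token_id = seg_token_id    # id of the <SEG> token

    def forward(self, image, text_ids, text_embeds):
        # 1) Dual encoders process the same image in parallel.
        vis_feat = self.vis_encoder(image)
        sam_feat = self.sam_encoder(image)

        # 2) Dual projectors map both features to the text-embedding dimension.
        vis_tok = self.vis_proj(vis_feat)
        sam_tok = self.sam_proj(sam_feat)

        # 3) The LLM consumes visual tokens plus tokenized text embeddings
        #    for instruction-guided image understanding.
        inputs = torch.cat([vis_tok, sam_tok, text_embeds], dim=1)
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state

        # 4) Hidden states at <SEG> token positions serve as mask queries
        #    (visual tokens precede the text tokens, hence the offset).
        offset = vis_tok.shape[1] + sam_tok.shape[1]
        seg_queries = hidden[:, offset:, :][text_ids == self.seg_token_id]

        # 5) SAM features are bridged through the connector and decoded into
        #    segmentation masks conditioned on the <SEG> queries.
        pixel_feat = self.seg_connector(sam_feat)
        masks = self.seg_decoder(pixel_feat, seg_queries)
        return masks
```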
Table 1. Comprehensive Performance Comparison. We compare X-SAM with other methods, including segmentation-specific models (Gray) and MLLMs. A "-" indicates that the method does not support the task, while a "?" indicates that the method does not report results on the dataset. X-SAM achieves state-of-the-art performance across all image segmentation tasks with a single model. The best performance is highlighted in bold, and the second-best is underlined.
Table 2. Comparison of Referring Segmentation. We compare different methods on referring segmentation benchmarks, grouped by their LLM or MLLM type.
Table 3. Comparison of Generic Segmentation. We compare different methods on the generic segmentation benchmarks.
Table 4. Comparison of OV Segmentation. We compare different methods on the OV segmentation benchmarks.
Table 5. Comparison of GCG Segmentation. We compare different methods on the GCG segmentation benchmark. † indicates that the method used the GranD dataset for pretraining.
Table 6. Comparison of Reasoning Segmentation. We compare X-SAM with other methods on the reasoning segmentation benchmark.
Table 7. Comparison of Interactive Segmentation. We compare X-SAM with other methods on the interactive segmentation benchmark.
Table 8. Comparison of VGD Segmentation. We compare different methods on the VGD segmentation benchmark. † indicates our evaluation results following the X-SAM evaluation setting.
Table 9. Comparison of Image-level Benchmarks. We compare X-SAM with other methods on the image-level benchmarks, including MME, MMBench, SEED-Bench, POPE, and AI2D.
Experience X-SAM in action! Try our interactive demo to see how X-SAM performs advanced segmentation tasks.
Launch Demo

This project builds on several excellent open-source repositories: xtuner, VLMEvalKit, and Sa2VA. We thank their authors for their wonderful work and contributions to the community.
@article{wang2024xsam,
title={X-SAM: From Segment Anything to Any Segmentation},
author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
journal={arXiv preprint arXiv:2024.xxxxx},
year={2024}
}