✨ X-SAM

From Segment Anything to Any Segmentation

1 Sun Yat-sen University, 2 Peng Cheng Laboratory, 3 Meituan Inc.

🚀 Highlights

  • X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from segment anything to any segmentation, thereby enhancing pixel-level perceptual understanding.
  • X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
  • X-SAM presents a unified training strategy that enables co-training across multiple datasets (a minimal illustrative sketch follows this list). Experimental results demonstrate that X-SAM achieves state-of-the-art performance on various image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
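The unified training strategy above amounts to casting heterogeneous segmentation annotations into a single instruction-following format and sampling across datasets during training. The snippet below is a minimal, illustrative sketch of that idea; the task names, instruction templates, sampling weights, and record fields are assumptions made for illustration, not X-SAM's released training code.

```python
import random

# Hypothetical per-task instruction templates used to cast every dataset
# into the same "instruction -> <SEG> answer" format (illustrative only).
TASK_TEMPLATES = {
    "generic":   "Segment every {category} in the image.",
    "referring": "Segment the object described by: {expression}.",
    "gcg":       "Describe the image and ground each noun phrase with a mask.",
    "vgd":       "Segment all instances indicated by the visual prompt.",
}

def to_instruction_sample(task, record):
    """Cast a raw annotation record into a unified instruction-following sample."""
    prompt = TASK_TEMPLATES[task].format(**record.get("fields", {}))
    return {
        "task": task,
        "image": record["image"],   # image path or array
        "instruction": prompt,
        "target": "<SEG>",          # the LLM learns to emit a <SEG> token;
        "masks": record["masks"],   # the mask decoder is supervised on the masks
    }

def co_training_stream(datasets, weights, steps):
    """Sample a task per step (proportional to its weight) and yield unified samples."""
    tasks = list(datasets)
    for _ in range(steps):
        task = random.choices(tasks, weights=[weights[t] for t in tasks])[0]
        record = random.choice(datasets[task])
        yield to_instruction_sample(task, record)

# Toy usage with a dummy record (real co-training would read the actual
# generic, referring, GCG, and VGD segmentation datasets).
dummy = {"image": "img.jpg", "masks": ["mask_0"],
         "fields": {"category": "car", "expression": "the red car"}}
datasets = {task: [dummy] for task in TASK_TEMPLATES}
weights = {"generic": 0.4, "referring": 0.3, "gcg": 0.2, "vgd": 0.1}
for sample in co_training_stream(datasets, weights, steps=3):
    print(sample["task"], "->", sample["instruction"])
```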

🔖 Abstract

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from segment anything to any segmentation. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.
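To make the VGD task concrete, the sketch below shows what a visually grounded query could look like at inference time: a single interactive prompt (e.g., a point on one object) is expected to yield masks for every instance of the indicated object, rather than a single mask for the prompted region as in SAM-style prompting. The `VisualPrompt`/`InstanceMask` types and the `vgd_segment` function are hypothetical placeholders, not the actual X-SAM API.

```python
from dataclasses import dataclass
from typing import List, Literal, Sequence

@dataclass
class VisualPrompt:
    """An interactive visual prompt: a point, box, scribble, or mask region."""
    kind: Literal["point", "box", "scribble", "mask"]
    data: Sequence[float]  # e.g. (x, y) for a point, (x1, y1, x2, y2) for a box

@dataclass
class InstanceMask:
    label: str
    score: float
    mask: object  # a binary mask array in a real implementation

def vgd_segment(image, prompts: List[VisualPrompt]) -> List[InstanceMask]:
    """Hypothetical VGD call. Unlike SAM-style prompting, which returns one mask
    for the prompted region, VGD is expected to return masks for *all* instances
    of the object category indicated by the visual prompt(s)."""
    # Placeholder output; a real model would run the MLLM and mask decoder here.
    return [InstanceMask(label="car", score=1.0, mask=None),
            InstanceMask(label="car", score=0.9, mask=None)]

# Example: click on one car and receive masks for every car in the scene.
masks = vgd_segment(image=None, prompts=[VisualPrompt("point", (320.0, 240.0))])
print([m.label for m in masks])
```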

🔍 Overview

X-SAM Framework

Fig. 1. The Overview of X-SAM. X-SAM comprises dual encoders, dual projectors, a language model, a segmentation connector, and a segmentation decoder. The dual encoders extract image features, which the dual projectors map to the text embedding dimension; these visual tokens are fed into the language model together with the tokenized text for instruction-guided understanding. The SAM features are routed through the segmentation connector to the segmentation decoder, which uses the LLM's `<SEG>` token to generate segmentation masks.
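Read as a data flow, the caption corresponds to the wiring sketched below. This is a PyTorch-style illustration under assumed module choices and dimensions (stand-in encoders, a toy transformer in place of the LLM, a linear mask decoder); it mirrors the described component order rather than the released implementation.

```python
import torch
import torch.nn as nn

class XSAMStyleForward(nn.Module):
    """Illustrative wiring of the components named in Fig. 1 (all sub-modules
    and dimensions are stand-ins, not the released X-SAM implementation)."""

    def __init__(self, d_img=256, d_sam=256, d_llm=1024):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, d_img, kernel_size=16, stride=16)  # stand-in image encoder
        self.sam_encoder = nn.Conv2d(3, d_sam, kernel_size=16, stride=16)    # stand-in SAM encoder
        self.image_projector = nn.Linear(d_img, d_llm)  # dual projectors into the
        self.sam_projector = nn.Linear(d_sam, d_llm)    # LLM embedding space
        self.language_model = nn.TransformerEncoder(    # toy stand-in for the LLM
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2)
        self.seg_connector = nn.Linear(d_sam, d_llm)    # feeds SAM features to the decoder
        self.seg_decoder = nn.Linear(d_llm, 64 * 64)    # stand-in mask decoder

    def forward(self, image, text_embeds, seg_token_index):
        b = image.shape[0]
        img_feat = self.image_encoder(image).flatten(2).transpose(1, 2)  # (B, N, d_img)
        sam_feat = self.sam_encoder(image).flatten(2).transpose(1, 2)    # (B, M, d_sam)
        vis_tokens = torch.cat([self.image_projector(img_feat),
                                self.sam_projector(sam_feat)], dim=1)
        hidden = self.language_model(torch.cat([vis_tokens, text_embeds], dim=1))
        # The hidden state at the <SEG> token position queries the mask decoder,
        # conditioned on SAM features routed through the segmentation connector.
        seg_query = hidden[:, vis_tokens.shape[1] + seg_token_index]     # (B, d_llm)
        cond = self.seg_connector(sam_feat).mean(dim=1)                  # (B, d_llm)
        masks = self.seg_decoder(seg_query + cond).view(b, 1, 64, 64)
        return masks

# Toy usage: one 256x256 image, 8 text tokens, <SEG> at text position 7.
model = XSAMStyleForward()
masks = model(torch.randn(1, 3, 256, 256), torch.randn(1, 8, 1024), seg_token_index=7)
print(masks.shape)  # torch.Size([1, 1, 64, 64])
```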

📊 Benchmarks

Tab. 1. Comprehensive Performance Comparison. We compare X-SAM with segmentation-specific models (gray) and MLLMs; unsupported tasks are marked as such, and "-" indicates unreported results. X-SAM achieves state-of-the-art performance across all segmentation tasks with a single model. Best results are in bold; second-best are underlined.

Benchmark Results
📊 More Benchmarks

Tab. 2. Comparison of Referring Segmentation. We evaluate (M)LLM-based methods on the referring segmentation benchmarks.

Benchmark Results

Tab. 3. Comparison of GCG Segmentation. † indicates pretraining with the GranD dataset.

Benchmark Results

Tab. 4. Comparison of VGD Segmentation. † indicates evaluation results following the X-SAM setting.

Benchmark Results

Tab. 5. Comparison of Generic Segmentation. We compare different methods on the generic segmentation benchmarks.

Generic Segmentation Results

Tab. 6. Comparison of Open-Vocabulary (OV) Segmentation. We compare different methods on the A150-OV segmentation benchmarks.

OV Segmentation Results

Tab. 7. Comparison of Reasoning Segmentation. We compare X-SAM with other methods on the reasoning segmentation benchmark.

Benchmark Results

Tab. 8. Comparison of Interactive Segmentation. We compare X-SAM with other methods on the interactive segmentation benchmark.

Interactive Segmentation Results

Tab. 9. Comparison of Image-level Benchmarks. We compare X-SAM with other methods on the image-level benchmarks, including MME, MMBench, SEED-Bench, POPE, and AI2D.

Benchmark Results

💻 Demo

Launch

📌 Citation

@article{wang2025xsam,
  title={X-SAM: From Segment Anything to Any Segmentation},
  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2508.04655},
  year={2025}
}