✨ X2SAM ✨

Any Segmentation in Images and Videos

1 Sun Yat-sen University, 2 Peng Cheng Laboratory, 3 Meituan Inc.
📧 Corresponding author

Highlights

X2SAM introduces a unified segmentation MLLM framework that extends any-segmentation capabilities from images to videos, supporting generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation in one model.
X2SAM supports both conversational instructions and visual prompts through a unified interface, and couples an LLM with a Mask Memory module to store guided vision features for temporally consistent video mask generation.
X2SAM proposes the Video Visual Grounded (V-VGD) segmentation benchmark and adopts a unified joint training strategy over heterogeneous image and video datasets, achieving strong video segmentation performance while remaining competitive on image segmentation and preserving general image/video chat ability.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface.

We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.

Overview

X2SAM Framework Architecture
Figure 1. Overview of X2SAM. The Vision Encoder extracts global visual representations, while the Mask Encoder captures fine-grained visual features. The Large Language Model generates the language response and produces the latent condition embedding, which guides the Mask Decoder in generating the segmentation mask. The Mask Memory module stores guided vision features for each video frame, and the Region Sampler extracts region-of-interest embeddings from both images and videos.
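
The data flow in Figure 1 can be illustrated with a small toy sketch. This is not the X2SAM implementation: the page does not specify how the Mask Memory fuses features or how the Mask Decoder applies the condition embedding, so the averaging rule, the fixed memory capacity, the dot-product decoding, and every name below (MaskMemory, decode_mask, segment_video) are assumptions chosen only to show why carrying per-frame features forward can stabilize masks across a video.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]
Features = List[Vector]  # one feature vector per pixel of a flattened frame


class MaskMemory:
    """Toy stand-in for a mask memory: keeps features of recent frames.

    The capacity and the averaging rule are assumptions, not the paper's design.
    """

    def __init__(self, capacity: int = 4) -> None:
        self.frames: Deque[Features] = deque(maxlen=capacity)

    def store(self, features: Features) -> None:
        self.frames.append(features)

    def fuse(self, features: Features) -> Features:
        """Average the current features with every stored frame, pixel by pixel."""
        stack = list(self.frames) + [features]
        n = len(stack)
        return [
            [sum(frame[p][c] for frame in stack) / n for c in range(len(features[p]))]
            for p in range(len(features))
        ]


def decode_mask(features: Features, condition: Vector) -> List[int]:
    """Mark a pixel as foreground when its feature aligns with the condition embedding."""
    return [int(sum(f * c for f, c in zip(pixel, condition)) > 0.0) for pixel in features]


def segment_video(frames: List[Features], condition: Vector, memory: MaskMemory) -> List[List[int]]:
    """Decode one mask per frame, feeding each fused frame back into the memory."""
    masks = []
    for features in frames:
        fused = memory.fuse(features)
        masks.append(decode_mask(fused, condition))
        memory.store(fused)
    return masks


if __name__ == "__main__":
    condition = [1.0, 0.0]  # hypothetical condition embedding from the LLM
    frame0 = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
    frame1 = [[0.8, 0.0], [-0.9, 0.0], [-0.1, 0.0]]  # third pixel flickers
    print(decode_mask(frame1, condition))  # frame decoded alone: [1, 0, 0]
    print(segment_video([frame0, frame1], condition, MaskMemory()))  # [[1, 0, 1], [1, 0, 1]]
```

In this toy setting the third pixel drops out when the second frame is decoded on its own, but stays foreground once the stored features from the first frame are fused in, which is the kind of temporal consistency the Mask Memory is described as targeting.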

Benchmarks

Table 1. Comparison of state-of-the-art segmentation methods across image and video segmentation benchmarks, ranging from non-MLLM-based to MLLM-based, and from specialists to generalists. "x" denotes unsupported. "--" indicates unreported. Best results are in bold, second-best are underlined.
Benchmark Results Overview
More Benchmark Results
Table 2. Comparison across image and video reasoning segmentation benchmarks.
Reasoning Segmentation
Table 3. Comparison on out-of-domain tasks, including image generalized referring segmentation and image and video open-vocabulary segmentation benchmarks.
Out-of-Domain Segmentation
Table 4. Comparison across image and video visual grounded segmentation benchmarks.
VGD Segmentation
Table 5. Comparison across image and video generic segmentation benchmarks.
Generic Segmentation
Table 6. Comparison across image and video referring segmentation benchmarks.
Referring Segmentation
Table 7. Comparison across image and video grounded conversation generation segmentation benchmarks. Grayed values are those reported in the original papers; * marks methods re-evaluated in this work.
GCG Segmentation
Table 8. Comparison on object-centric segmentation tasks, including image interactive segmentation (I-Int.) and video object segmentation (V-Obj.) benchmarks.
Object-Centric Segmentation
Table 9. Comparison across image chat benchmarks.
Image Chat
Table 10. Comparison across video chat benchmarks.
Video Chat

Live Demo

Launch Interactive Demo

Citation

@article{wang2026x2sam,
  title={X2SAM: Any Segmentation in Images and Videos},
  author={Wang, Hao and Qiao, Limeng and Zhang, Chi and Wan, Guanglu and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2605.00891},
  year={2026}
}

@inproceedings{wang2026xsam,
  title={X-SAM: From segment anything to any segmentation},
  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={31},
  pages={26187--26196},
  year={2026}
}