2026

MTRefSeg: An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li1,3,+ · Da Zhang2,3,+ · Tao Huo2,3 · Zhiyuan Zhao3 · Junyu Gao2,3 · Xuelong Li3,*
1University of Science and Technology of China, Hefei, China
2School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China
3Institute of Artificial Intelligence (TeleAI), China Telecom, China
+Equal contribution    *Corresponding author: Xuelong Li
This work was done during Bingyu Li's internship at TeleAI.
MTRefSeg motivation and overview

Abstract

Large Vision–Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. We introduce Multi-temporal Referring Segmentation (MTRS), a task that segments language-described temporal changes from multi-temporal images. MTRS jointly requires temporal correspondence reasoning, language grounding, and pixel-level mask prediction. To support this task, we propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark with 21K high-quality bi-image–text–mask triplets across diverse scenes, viewpoints, and domains. Benchmarking 15 VLMs reveals that direct inference performs poorly, while task-specific fine-tuning remains limited, indicating that single-temporal vision-language pretraining is insufficient for MTRS. We further propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy: it first learns general temporal-change perception from approximately 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization.

New Task

MTRS unifies temporal change understanding, natural-language grounding, and pixel-level segmentation.

New Benchmark

MTRefSeg-21K contains fine-grained bi-image–text–mask triplets across normal-scene and remote-sensing domains.

Data Engine

CRAFT-Agent constructs annotations through grid-aware perception, mask refinement, expression beautification, and human auditing.

Strong Baseline

MTRefSeg-R1 learns change-sensitive visual representations from approximately 20K vision-only bi-temporal samples and then performs fine-grained language-conditioned localization.

Task Introduction: From Static Language Segmentation to Temporal Referring Segmentation

Our motivation starts from three representative language-guided segmentation paradigms: open-vocabulary segmentation, referring segmentation, and reasoning segmentation. These tasks have pushed segmentation from closed-set category prediction toward flexible text-conditioned perception, but they are still formulated mainly on single-temporal images. In real-world dynamic scenes, however, users often care about what has changed, where the change happened, and which changed region the instruction specifies. This gap motivates our proposed Multi-temporal Referring Segmentation (MTRS) task.

Task comparison with open-vocabulary, referring, and reasoning segmentation
Relationship to existing language-guided segmentation tasks. Open-vocabulary segmentation emphasizes category-list recognition, referring segmentation emphasizes explicit instance localization, and reasoning segmentation emphasizes implicit instruction reasoning. MTRS further introduces cross-temporal comparison: the target is not only defined by language, but also by its temporal change between images.

Open-Vocabulary Segmentation

Focuses on aligning pixels with category names or vocabulary lists. It is strong for recognizing unseen categories, but the target is usually category-level and does not require distinguishing a specific changed object across time.

Referring Segmentation

Localizes an explicitly described object or region in a single image, such as an object with certain attributes or spatial relations. It emphasizes precise grounding, but the evidence mainly comes from one visual observation.

Reasoning Segmentation

Segments a target implied by a high-level or implicit instruction. It requires semantic reasoning, but still usually reasons over a static scene rather than comparing earlier and later observations.

Our task, MTRS: given temporally related images and a natural-language expression, the model must compare the images, identify the language-relevant temporal variation, and output the pixel-level mask of the referred changed region. Therefore, MTRS differs from the three tasks above by jointly requiring temporal correspondence, change understanding, language grounding, and segmentation accuracy.
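The task definition above can be pictured as a simple input/output contract. The following is a minimal sketch, not the dataset's actual loading code: the class name `MTRSSample`, its field names, and the assumption that both images and the mask share one resolution are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities
Mask = List[List[int]]   # binary mask: rows of 0/1 values


def _shape(grid):
    """Return (height, width) of a non-empty 2D list."""
    return (len(grid), len(grid[0]))


@dataclass
class MTRSSample:
    """Hypothetical container for one bi-image-text-mask triplet."""
    image_t1: Image   # earlier observation
    image_t2: Image   # later observation
    expression: str   # natural-language description of the referred change
    mask: Mask        # pixel-level mask of the referred changed region

    def validate(self) -> None:
        # Assumption: the two images and the mask are spatially aligned.
        if not (_shape(self.image_t1) == _shape(self.image_t2) == _shape(self.mask)):
            raise ValueError("images and mask must share the same height and width")
        if any(v not in (0, 1) for row in self.mask for v in row):
            raise ValueError("mask must be binary")
        if not self.expression.strip():
            raise ValueError("expression must be non-empty")


sample = MTRSSample(
    image_t1=[[0, 0], [0, 0]],
    image_t2=[[0, 255], [0, 0]],
    expression="the object that appeared in the top-right corner",
    mask=[[0, 1], [0, 0]],
)
sample.validate()  # raises ValueError if the triplet is inconsistent
```

An MTRS model maps `(image_t1, image_t2, expression)` to a predicted mask of the same shape as `mask`.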

Dataset: MTRefSeg-21K

MTRefSeg-21K provides a unified benchmark for evaluating language-guided temporal change understanding across remote sensing, aerial-view, and normal-scene imagery, with overall, NS-domain, and RS-domain evaluation settings.

Note: Although we have made our best efforts to clean MTRefSeg-21K, a small number of noisy annotations may remain because of the inherent difficulty of constructing large-scale multi-temporal image–text–mask data. Nevertheless, models trained on our dataset can tolerate these noisy ground-truth annotations. We kindly ask users to be aware of this issue.

Example of noisy ground-truth annotation in MTRefSeg-21K
Example of noisy annotation. Although MTRefSeg-21K has been carefully cleaned, a few noisy ground-truth masks may still remain. As shown in this case, the model prediction focuses on the language-relevant changed region and can effectively ignore the noisy ground-truth annotation.
9,521 bi-temporal image pairs
20,924 referring expressions
21K image-text-mask triplets
20K Stage-1 vision samples
RS+NS multi-domain evaluation
MTRefSeg-21K dataset examples
MTRefSeg-21K examples. The benchmark covers multiple views and diverse scenes, including remote-sensing/aerial views and normal scenes with appearing, disappearing, and state-change regions.
Dataset statistics and multi-domain comparison
Dataset statistics and comparison. MTRefSeg-21K contains 9,521 bi-temporal image pairs and 20,924 referring expressions. Compared with prior general-domain and remote-sensing referring segmentation datasets, it explicitly supports bi-temporal inputs, provides richer language descriptions, and covers both general and remote-sensing domains with a broad resolution range.
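As a quick sanity check on the statistics above, the average number of referring expressions per image pair follows directly from the two reported counts. The snippet below is only arithmetic on those figures. Reading the "21K" triplet count as the 20,924 expressions rounded to the nearest thousand is our assumption, not something the source states.

```python
image_pairs = 9_521    # bi-temporal image pairs (reported)
expressions = 20_924   # referring expressions (reported)

# Average number of referring expressions attached to each image pair.
avg_expressions_per_pair = expressions / image_pairs
print(f"{avg_expressions_per_pair:.2f} expressions per pair")

# Assumption: one triplet per expression, so 20,924 rounds to "21K".
print(f"{round(expressions / 1000)}K triplets")
```

On average, each bi-temporal pair carries roughly two referring expressions, so a single pair typically contains more than one describable change.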

CRAFT-Agent Data Construction

CRAFT-Agent automatically produces high-quality bi-image–text–mask triplets through a three-stage workflow: grid-aware change perception, iterative mask correction, and natural referring-expression refinement.

CRAFT-Agent pipeline
CRAFT-Agent pipeline. The agent identifies temporal changes, generates spatially grounded expressions, refines masks, rewrites mechanical descriptions into natural language, and removes uncorrected samples through human auditing.
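The source does not spell out how grid-aware change perception works internally. As a rough intuition only, one could imagine overlaying a coarse grid on the image pair and scoring each cell by how much it changed, so that changes can be named by cell position. The sketch below uses plain intensity differencing as a stand-in. The function names, the differencing rule, and the threshold are all hypothetical and are not the CRAFT-Agent implementation.

```python
def grid_change_scores(img_a, img_b, rows, cols):
    """Mean absolute intensity difference per grid cell.

    img_a and img_b are equal-size grayscale images (lists of rows).
    Assumes the image height is at least rows and the width at least cols.
    """
    h, w = len(img_a), len(img_a[0])
    scores = [[0.0] * cols for _ in range(rows)]
    for gr in range(rows):
        y0, y1 = gr * h // rows, (gr + 1) * h // rows
        for gc in range(cols):
            x0, x1 = gc * w // cols, (gc + 1) * w // cols
            total = sum(
                abs(img_a[y][x] - img_b[y][x])
                for y in range(y0, y1)
                for x in range(x0, x1)
            )
            scores[gr][gc] = total / ((y1 - y0) * (x1 - x0))
    return scores


def changed_cells(scores, threshold):
    """Return (row, col) indices of grid cells whose score exceeds the threshold."""
    return [
        (r, c)
        for r, row in enumerate(scores)
        for c, s in enumerate(row)
        if s > threshold
    ]


# Toy 4x4 pair: only the top-right 2x2 block changes.
before = [[10] * 4 for _ in range(4)]
after = [row[:] for row in before]
for y, x in [(0, 2), (0, 3), (1, 2), (1, 3)]:
    after[y][x] = 200

scores = grid_change_scores(before, after, rows=2, cols=2)
cells = changed_cells(scores, threshold=50)
print(cells)
```

Cell indices of this kind give a coarse spatial vocabulary (for example "top-right") that a later stage could turn into spatially grounded expressions.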

Method: MTRefSeg-R1

MTRefSeg-R1 is a change-aware LVLM for MTRS. It first learns general multi-temporal change perception from approximately 20K vision-only bi-temporal samples with binary change masks, then undergoes referring multi-temporal fine-tuning so that it segments only the language-specified changed region.

MTRefSeg-R1 framework
MTRefSeg-R1 framework. The model uses a shared multi-scale vision encoder, change-aware multi-scale fusion, an LVLM reasoning branch, and a mask decoder guided by the generated [SEG] token.
Adapting VLM and LVLM segmentation models to MTRS
Adapting existing models to MTRS. For VLM-based segmentation models, single-image inputs are extended to paired temporal images with temporal feature interaction. For segmentation-oriented LVLMs, temporally ordered image tokens and a change-aware prompt are used so that the generated [SEG] token conditions mask prediction on the described temporal change.
Core idea: instead of detecting all changed pixels, MTRefSeg-R1 aligns a natural-language instruction with temporal variations and outputs the mask of the referred changed region.
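The difference between detecting all changed pixels and segmenting only the referred change can be made concrete with a toy example. In the sketch below, a binary change mask contains two separate changed regions, and a hand-written rule ("the change on the right side") stands in for language grounding. That rule, the helper names, and the use of connected components are illustrative assumptions, not the model's actual mechanism.

```python
from collections import deque


def connected_regions(mask):
    """Split a binary mask (list of rows of 0/1) into 4-connected regions."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                queue = deque([(r, c)])
                seen[r][c] = True
                region = []
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions


def region_to_mask(region, h, w):
    """Rasterize a list of (row, col) pixels into a binary mask."""
    out = [[0] * w for _ in range(h)]
    for y, x in region:
        out[y][x] = 1
    return out


# Change-detection style target: every changed pixel, regardless of language.
change_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

regions = connected_regions(change_mask)

# Toy stand-in for grounding "the change on the right side":
# pick the region with the largest mean column index.
referred = max(regions, key=lambda reg: sum(x for _, x in reg) / len(reg))
referred_mask = region_to_mask(referred, h=4, w=5)
print(referred_mask)
```

The referred mask is a subset of the full change mask, which mirrors the two training stages: general change perception first, then localization of the one change the expression describes.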

Quantitative Benchmark Results

We benchmark specialist referring segmentation models and LVLM-based segmentation models under the Train→Val, NS→NS, and RS→RS settings. The results show that direct use of single-temporal LVLMs is insufficient for MTRS, while MTRefSeg-R1 provides a strong change-aware baseline after visual change pretraining and multimodal fine-tuning.

Train to validation quantitative comparison
Train→Val evaluation. MTRefSeg-R1 achieves the best overall performance among compared LVLM baselines, reaching 68.24 mIoU and 73.90 Pr@50, showing strong language-guided temporal localization ability in the standard validation setting.
NS to NS quantitative comparison
NS-domain evaluation. Under the normal-scene setting, MTRefSeg-R1 remains competitive and improves high-threshold localization quality, indicating that the model can handle fine-grained object-level temporal changes in everyday scenes.
RS to RS quantitative comparison
RS-domain evaluation. In remote-sensing imagery, MTRefSeg-R1 obtains clear gains over existing LVLM baselines, especially for dense building, road, and agricultural-structure changes that require robust spatial reasoning.
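For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute mIoU and Pr@50 from binary masks. It assumes mIoU is the mean of per-sample IoU and that Pr@50 counts samples whose IoU strictly exceeds 0.5. The source does not state the exact definitions, so conventions such as cumulative IoU, an inclusive threshold, or the handling of empty masks may differ in the official evaluation.

```python
def iou(pred, gt):
    """Intersection over union of two equal-size binary masks (lists of rows)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, gts):
    """Mean of per-sample IoU values."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def precision_at(preds, gts, threshold=0.5):
    """Fraction of samples whose IoU exceeds the threshold (Pr@50 for 0.5)."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(s > threshold for s in scores) / len(scores)


# Two toy samples: a perfect prediction and an over-segmented one.
gts = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
preds = [
    [[1, 1], [0, 0]],  # IoU = 2/2
    [[1, 1], [1, 0]],  # IoU = 1/3
]

miou = mean_iou(preds, gts)
pr50 = precision_at(preds, gts, threshold=0.5)
print(f"mIoU = {100 * miou:.2f}, Pr@50 = {100 * pr50:.2f}")
```

Both values are multiplied by 100 here to match the percentage scale used in the results above.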

Qualitative Results

MTRefSeg evaluates both normal-scene and remote-sensing cases. Compared with existing LVLM baselines, MTRefSeg-R1 better localizes fine-grained language-described temporal changes.

Qualitative comparison on normal scenes
Normal-scene results. MTRefSeg-R1 produces more accurate masks for disappeared, newly visible, and state-changed objects under language guidance.
Qualitative comparison on remote sensing scenes
Remote-sensing results. MTRefSeg-R1 better handles dense buildings, agricultural structures, roads, and large-scale remote-sensing changes.

Language-Guided Temporal Attention

To better understand MTRefSeg-R1, we visualize decoder [SEG] attention maps and intermediate query masks. The attention progressively concentrates on the region described by the referring expression, and the query masks refine the target changed object before producing the final prediction.

Visualization of language-guided temporal attention and mask decoding
Attention and mask decoding visualization. The decoder attention highlights the language-relevant temporal region, while intermediate masks gradually localize the final changed object. This supports the core design of aligning temporal visual evidence with natural-language change descriptions.

Acknowledgements

We sincerely thank all collaborators and friends who provided valuable support, feedback, and assistance during the construction of MTRefSeg-21K and the development of MTRefSeg-R1.

Data cleaning and annotation refinement. The authors would like to thank Chenggang Rong and Du Wu from Northwestern Polytechnical University and Feiyu Wang from Fudan University for their valuable assistance in data cleaning and annotation refinement.
Research discussions and support. The authors also thank Haocheng Dong from the University of Science and Technology of China for valuable discussions and support throughout this work.
Multi-temporal vision-language perspective. The authors are grateful to Liang Yao from Hohai University for constructive feedback, which inspired us to reconsider the progression of vision-language segmentation through the lens of multi-temporal visual understanding.

Cite This Work

@article{li2026mtrefseg,
  title   = {MTRefSeg: An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation},
  author  = {Li, Bingyu and Zhang, Da and Huo, Tao and Zhao, Zhiyuan and Gao, Junyu and Li, Xuelong},
  journal = {arXiv preprint arXiv:2606.00987},
  year    = {2026}
}