Academic Homepage

Researching Multimodal Intelligence for Open-World Visual Understanding

Welcome to my academic homepage. My work focuses on multimodal large language models, vision-language learning, and open-vocabulary visual understanding across diverse real-world scenarios such as remote sensing and underwater environments.

Multimodal LLMs Vision-Language Models Open-Vocabulary Segmentation Remote Sensing Vision Underwater Vision

About Me

I am a Ph.D. student at the University of Science and Technology of China (USTC), supervised by Prof. Xuelong Li.

My research focuses on applying multimodal large language models and vision-language models to visual tasks across diverse scenes. I am particularly interested in open-vocabulary segmentation, multimodal reasoning, and domain-oriented visual intelligence.

Research Topics

I work at the intersection of multimodal learning, visual understanding, and domain-specific intelligence.

Multimodal LLMs

Vision-language models, multimodal reasoning, and foundation models for general-purpose visual intelligence.

Computer Vision

Open-vocabulary segmentation, semantic understanding, instance segmentation, and video understanding.

Domain Applications

Remote sensing vision, underwater vision, and robust multimodal perception in challenging environments.

News

  • 2026.03
Four papers were accepted to CVPR 2026 (two first-author, one second-author, and one fourth-author)! 🎉
  • 2025.11
One paper was accepted to AAAI 2026 (Oral)! 🎉
  • 2025.10
    Awarded the National Scholarship for Graduate Students (研究生国家奖学金). 🎖️
  • 2025.04
StitchFusion was accepted to ACM MM 2025 (Oral).
  • 2024.09
    Started my Ph.D. journey at USTC.

Research Highlights

A full publication list is available on Google Scholar.

Multi-Visual Modality
ACM MM 2025 StitchFusion
StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation
Bingyu Li, D Zhang, Z Zhao, J Gao, X Li

We propose a framework that integrates arbitrary combinations of visual modalities to improve multimodal semantic segmentation.

Pattern Recognition 2025 U3M
U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation
Bingyu Li, D Zhang, Z Zhao, J Gao, X Li

We develop an unbiased multiscale modal fusion framework for multimodal semantic segmentation.

Vision-Language and Multimodal Large Language Models
AAAI 2026 RSKT-Seg
Exploring Efficient Open-Vocabulary Segmentation in Remote Sensing
Bingyu Li, H Dong, D Zhang, Z Zhao, J Gao, X Li

We investigate efficient open-vocabulary segmentation approaches tailored to remote sensing imagery.

CVPR 2026 MARIS
MARIS: Marine Open-Vocabulary Instance Segmentation
Bingyu Li, F Wang, D Zhang, Z Zhao, J Gao, X Li

This work introduces MARIS, a benchmark and method for open-vocabulary instance segmentation in marine environments.

CVPR 2026 Earth2Ocean
Exploring the Underwater World Segmentation without Extra Training
Bingyu Li, T Huo, D Zhang, Z Zhao, J Gao, X Li

We explore training-free segmentation methods for underwater scenes, enabling effective transfer without additional supervision.

CVPR 2026 QICA
Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting
D Zhang, Bingyu Li, F Wang, Z Zhao, J Gao

We enhance zero-shot object counting by improving both quantitative reasoning and spatial awareness.

arXiv 2025 FGAseg
FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation
Bingyu Li, D Zhang, Z Zhao, J Gao, X Li

We propose a fine-grained pixel-text alignment framework for open-vocabulary semantic segmentation.

Honors and Awards

  • 2025, National Scholarship for Graduate Students | 研究生国家奖学金

Academic Service

Reviewer — Journals

  • IEEE Transactions on Geoscience and Remote Sensing (TGRS)
  • Pattern Recognition (PR)
  • Additional journals in related areas

Reviewer — Conferences

  • CVPR
  • NeurIPS
  • ICLR
  • Other major conferences