It improves Qwen2-VL by employing window attention, native dynamic resolution, absolute time encoding, and more high-quality data.
Visual encoder. The used ViT is trained from scratch. It employs self-attention + window attention to improve efficiency. It employs MRoPE as position embedding. Images and videos are sampled at native resolutions and dynamic frame rates.
Vision-Language Merger. Group adjacent four visual patches, concat them along feature dimensions, and project them using a two-layer MLP.
Language model. Qwen2.5 LLM.
Pre-training stages. (1) Stage 1: ViT is trained to learn visual knowledge; (2) Stage 2: all model parameters are optimized to learn diverse knowledge and tasks; (3) Stage 3: all model parameters are optimized to learn long sequences by incorporating video and agent-based data.
Post-training stages. SFT and DPO are employed to optimize the language model.
Sparkling capabilities. Omni-document parsing, precise object grounding (based on real resolution), ultra-long video understanding and grounding, and enhanced agent functionality.
It improves Qwen-VL by using a Naive Dynamic Resolution mechanism with multimodal RoPE, and a unified image-video processing paradigm.
Visual encoder (675M). Use self-developed ViT. Employ Naive Dynamic Resolution with 2D-RoPE to provide a variable number of visual tokens for images or videos with different resolution and frame number. Compress visual tokens by 2x2 using MLP.
Language model (1.5B, 7.6B, 72B). Qwen series.
Unified image and video processing: (1) Sample each video at two frames per second; (2) Compress video inputs by 4x using 3D convs; (3) Each image is treated as two identical frames for consistency. (4) The limit of tokens per video is set to 16,384 by adjusting the resolution.
Three-stage training (same as Qwen-VL). (1) Pre-training on 600B tokens by optimizing ViT; (2) Multi-task pre-training on 600B + 800B tokens by optimizing all model parameters; (3) Instruction tuning on instruction-following data (ChatML format) by optimizing LLM.
Three model sizes: Qwen2-VL-2B (on-device), Qwen2-VL-7B (performance-optimized), Qwen2-VL-72B (most capable).
Capabilities: general chat, multilingual image text understanding, formula recognition, function calling, UI interaction, long document understanding, code/math reasoning, video understanding, grounding, live chat, and agent potential.
Figure 1.Qwen2-VL Structure. It discards the multimodal connector module used in Qwen-VL.Figure 2.Unified Multimodal Rotary Position Embedding (M-RoPE) for text, images, and videos.Figure 3.Dataset format example.
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Salesforce AI Research, Intel Labs, University of Washington
It improves BLIP-2 by introducing interleaved multimodal data, unified training objective, and visual resampler.
Training. (1) Stage 1: base resolution pre-training on 100B tokens with 384x384 visual resolution; (2) Stage 2: high resolution pre-training on high-quality data; (3) Stage 3: SFT on single-image instruction-following data; (4) Stage 4: SFT on multi-image interleaved data.
Figure 1. BLIP-3 improves BLIP-2 by introducing interleaved data, using unified training objective, and fine-grained training stages.Figure 2.Structure. It replaces Q-Former in BLIP-2 by a sampler (inspired by Flamingo). Only the sampler and the LLM (Phi-3) are trained.
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning
It makes the model learn to tackle 6 tasks with different task identifiers through three-stage training (maybe inspired by Qwen-VL).
Visual structure. Use ViT-G/14 from EVA-CLIP with a Q-Former (same as MiniGPT-4). Image resolution is increased from 224x224 to 448x448, and every four neighboring visual tokens are concatenated into a single token to save compute by reducing tokens.
Language structure. Language model is upgraded from Vicuna to LLaMA2-chat (7B).
Task identifiers are used by the model to identify tasks. VQA: [vqa]; captioning: [caption]; grounded captioning: [grounding]; referring expression comprehension: [refer]; referring expression generation: [identify]; object parsing and grounding: [detection].
The grounding task is introduced to improve MiniGPT (maybe inspired by Qwen-VL).
Figure 1.Training data.
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
It aligns a frozen visual encoder with a frozen LLM (Vicuna) using one projection layer.
Structure. The same pretrained vision module as BLIP-2: ViT-G/14 from EVA-CLIP with a Q-Former. Language model: Vicuna. Connector: a single projection layer.
Training. Pre-training + instruction-tuning. It only fine-tunes the projection layer.
Training data. Pre-training: LAION, Conceptual Captions, SBU. Instruction-tuning: 3500 images from Conceptual Caption with captions generated by the pre-trained model (cleaned by ChatGPT).
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
University of Wisconsin-Madison, Microsoft Research, Columbia University
Advances in Neural Information Processing Systems (NeurIPS), 2023
Vision encoder. ViT-L/14 from CLIP or ViT-G/14 from EVA-CLIP.
Language model. OPT model or FlanT5.
Querying Transformer (Q-Former). An image transformer & a text transformer, they are initialized from BERT and share the self-attention layer.
Training. (1) Stage 1 (250K steps): learn vision-langauge representations from a frozen image encoder by optimizing the three losses used in BLIP; (2) Stage 2 (80K steps): learn vision-to-language generation from a frozen LLM.
Data. Basically same as BLIP. Only the Q-Former is trained.
Figure 1.Overall structure. Visual query embeddings are projected and prepended to the input text embeddings as Q-Former output & LLM input.Figure 2.Q-Former (left) with 32 query tokens, and self-attention masking strategy (right) for different training tasks.
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
DeepMind
Advances in Neural Information Processing Systems (NeurIPS), 2022
It achieves few-shot in-context learning ability by bridging vision and language models and training on interleaved visual and textual data.
Visual encoder. Use pre-trained and frozen Normalizer-Free ResNet, and pre-train it using contrastive loss. Images and videos (sample_fps=1) are compressed to spatio-temporal grid of features.
Perceiver resampler (Q-Former). It processes a variable number of image or video tokens and produces a fixed number of visual tokens (64).
Gated xattn-dense layers. They are inserted to the pre-trained, frozen language model (Chinchilla) and are trained from scratch.
Model size. Flamingo-3B, Flamingo-9B, and Flamingo-80B.
Figure 1.Overall structure (top) and gated xattn-dense layers (bottom).
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
Salesforce Research
International Conference on Machine Learning (ICML), 2022
It enables both vision-language understanding & generation by multi-task learning with a unified framework, as well as a data bootstrapping strategy.
Figure 1.Structure. (1) Unimodal encoder is trained with an image-text contrastive (ITC) loss; (2) Image-grounded text encoder uses cross-attention layers, trained with an image-text matching (ITM) loss; (3) Image-grounded text decoder is trained with a language modeling (LM) loss.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
OpenAI
International Conference on Machine Learning (ICML), 2021
CLIP shifts computer vision research from high-quality, crowd-labeled data with pre-defined labels, e.g., ImageNet, to web-scale data with natural language supervision. CLIP generalizes well on visual benchmarks, and spurs research on multimodal foundation models. It has over 30,000 citations (as of Jul 2025).
By training on 400M internet text-image pairs through contrastive learning, it shows great generalization on visual benchmarks.
Figure 1.Training and inference pipelines.Figure 2.Pseudocode for training CLIP.Figure 3.Zero-shot CLIP outperforms few-shot probes of SoTA visual models.Figure 4.Linear probe CLIP outperforms SoTA visual models.Figure 5. CLIP is much more robust to distribution shift.
It introduces unified decoder (14B with 7B activated) with Mixture-of-Transformer-Experts (MoT) trained on trillion-scale interleaved multimodal data, achieving sota open-source performance in generation & understanding while exhibiting emergent reasoning for long-context visual tasks.
Figure 1.Main structure: one understanding expert & one generation expert, they share the attention. Initialized by Qwen2.5 LLM. Encoder: one visual generation encoder (FLUX VAE, fixed) & one visual understanding encoder (ViT initialized by SigLIP2-so400m/14) & one text tokenizer. Loss: next-token prediction loss on text tokens, and flow matching loss on image tokens. Attention: apply causal attention to text tokens, and bidirectional attention to image tokens. Infra: use PyTorch FlexAttention to achieve 2x speed-up; use KV-cache. Classifier-free guidance: randomly drop text, ViT, and clean VAE tokens by 0.1, 0.5, and 0.1.Figure 2.Data.Figure 3.Training recipe.Alignment: Align SigLIP2 ViT encoder to Qwen2.5 LLM by optimizing MLP connector through image captioning. Pre-training: training all parameters on 2.5T tokens. Continued training: increase visual resolution and sampling of interleaved data, trained on 2.6T tokens. Supervised fine-tuning: trained on 72.7B tokens from a subset of LLaVA-OV and Mammoth-VL.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu
Salesforce Research, University of Maryland, Virginia Tech, New York University, University of Washington, UC Davis
It finds it is beneficial to generate CLIP features by employing flow matching loss, and use sequential training of understanding and generation.
Structure. Use Qwen 2.5-VL-7B-Instruct and freeze it, and train a 1.4B diffusion transformer (Lumina-Next) on it.
Data. Pre-training data: 25M open-source data and 30M proprietary data, with captions generated by Qwen 2.5-VL. Instruction tuning data: 60K.
Figure 1.Structure. It unifies the visual understanding and generation by using CLIP encoder.Figure 2.Design choices on image generation in unified multimodal model.Figure 3. Performance on different design choices. CLIP + Flow Matching is a better choice.Figure 4.Joint training vs. sequential training.
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, Michael Rubinstein, Michalis Raptis, Deqing Sun, Radu Soricut
It achieves visual generation and understanding by applying diffusion loss on continuous visual tokens and cross-entropy loss on discrete text tokens.
Figure 1.Framework: joint training of visual generation and understanding tasks through next-token prediction. Tokenizer: use VAE to provide tokens for visual generation, use SigLIP to provide tokens for visual understanding, use SentencePiece to provide text tokens. Prediction head: use modality-specific prediction heads to calculate losses and sampling for each modality. Loss: image understanding loss on text answer + image generation loss on image tokens. Training details: batchsize=2048, optimizer=AdamW, lr=1e-4, steps=1M, init_ckpt=Gemma-2.Figure 2. There is trade-off between generation & understanding.Figure 3.Unified training improves generation.Figure 4.Better pre-trained LLM backbone leads to better visual generation and understanding performance.
It proposes a unified autoregressive model trained end-to-end across text, vision, and audio.
Figure 1.Visual generation capability of GPT-4o evaluated by this paper.Text rendering: spelling, alignment, formatting in document. Compositional generation and prompt following: assemble complex scene elements, styles, attributes. Geometric consistency and viewpoint realism: 3D view synthesis, camera control, depth-conditioned rendering. Comprehensive image transformation: from low-level to high-level tasks.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy
Meta, Waymo, University of Southern California
International Conference on Learning Representations (ICLR), 2025
It trains a unified model (7B) on 2T multi-modal tokens by predicting discrete text tokens and diffusing continuous image tokens.
Data. Use total 2T tokens from: (1) Llama 2 tokenizer and corpus (2T tokens), (2) 380M Shutterstock images and captions (resized to 256x256).
Structure. It applies next-token prediction on discrete text tokens and diffusion loss on continuous image tokens: L=L_LM+lambda*L_diffusion. It uses modality-specific components with unshared parameters: embedding layer for text, and VAE (U-Net or linear structure, 8x8-8c) with linear or up/down blocks for images. It applies causal mask on text tokens and bidirectional mask on image tokens.
Training details. Optimizer=AdamW, lr=3e-4, 250K steps, lambda=5, train_timesteps=1000, infer_timesteps=250, cfg=3.
Performance. In text-to-image generation task, Transfusion exceeds Chameleon at less than a third of the compute. In image-to-text generation task, Transfusion exceeds Chameleon at 21.8% of the FLOPs. In text-to-text generation task, Transfusion exceeds Chameleon at 50% of FLOPs.
Figure 1.Transfusion structure.Figure 2. Transfusion outperforms Chameleon while scaling.Figure 3. Transfusion outperforms Chameleon by using few FLOPs, both are 7B.Figure 4. Transfusion achieves competitive results compared with Llama2.Figure 5.Encoder: U-Net is better than linear (maybe due to it brings more inductive bias). Attention: bidirectional is better than causal.Figure 6.Small patch size leads to better performance by providing more visual tokens.Figure 7.Overall performance.