Visual Generative Models

A reading list curated by Junkun Yuan (yuanjk0921@outlook.com)

Contents:
[back to main contents of reading papers]

Last updated on March 30, 2025.

Summaries

Surveys & Insights

Surveys and interesting insights on visual generative models.
Date & Model Paper & Publication & Project Summary
2025-03-10
Inference Beat Pretraining
Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms (arXiv 2025) Analyze pre-training algorithm design from an inference-first perspective, and view inference scaling from a unified perspective of scaling sequence length & refinement steps.
2025-02-18
noise-unconditional model
Is Noise Conditioning Necessary for Denoising Generative Models? (arXiv 2025) Theoretical and empirical analysis on noise-unconditional denoising diffusion models without a timestep input for image generation.
2024-12-09
Flow Matching Guide and Code
Flow Matching Guide and Code (arXiv 2024) Star Comprehensive and self-contained review of the flow matching algorithm, covering its mathematical foundations, design choices, extensions, and code implementations.
2022-08-25
Unified Perspective
Understanding Diffusion Models: A Unified Perspective (arXiv 2022) Introduction to VAEs, DDPM, score-based generative models, and guidance from a unified generative perspective.
[back to top]


Foundation Models & Algorithms

Foundation models and algorithms on visual generative models.
Date & Model Paper & Publication & Project Summary
2025-03-26
Wan
Wan: Open and Advanced Large-Scale Video Generative Models (arXiv 2025) Star Alibaba Tongyi's open-sourced model (14B) for text-to-video & image-to-video generation, using 8x8x4 VAE, DiT structure, etc.
2025-03-14
Step-Video-TI2V
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model (arXiv 2025) Star StepFun's open-sourced model (30B) for image-to-video generation, trained upon Step-Video-T2V, using channel concatenation of the image condition and a timestep-combined motion condition.
2025-03-10
Seedream2.0
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model (arXiv 2025) ByteDance (Seed Vision Team)'s foundation model for image generation with native Chinese-English bilingual capability, where techniques such as scaled RoPE, SFT, and RLHF are employed.
2025-02-14
Step-Video-T2V
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (arXiv 2025) Star StepFun's open-sourced model (30B) for text-to-video generation, using DiT structure & RoPE-3D & QK-Norm & 16x16x8 VAE & two bilingual text encoders & DPO.
2024-12-03
HunyuanVideo
HunyuanVideo: A Systematic Framework For Large Video Generative Models (arXiv 2024) Star Tencent (Hunyuan Team)'s open-sourced video generation model (13B) using diffusion transformer and conducting fine-grained data curation, captioning, and training scaling.
2024-10-17
MovieGen
Movie Gen: A Cast of Media Foundation Models (arXiv 2024) A diffusion transformer-based model (30B) for 16s / 1080p / 16 fps video and synchronized audio generation.
2024-07-16
DiT-MoE
Scaling Diffusion Transformers to 16 Billion Parameters (arXiv 2024) Star A diffusion transformer (16B) with MoE that inserts experts into DiT blocks for image generation.
2022-12-19
DiT
(notes in jupyter)
Scalable Diffusion Models with Transformers (ICCV 2023) Star Replace the U-Net with a transformer for scalable image generation; the timestep and prompt are injected via an adaLN-Zero structure.
2022-10-06
Flow Matching
Flow Matching for Generative Modeling (ICLR 2023) Star A type of generative models built on continuous normalizing flows by learning a time-dependent vector field that transports data from the source distribution to the target distribution.
2022-05-29
CogVideo
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (ICLR 2023) Star An open-sourced transformer-based video generation model (9B) that auto-regressively generates frame sequences and then performs auto-regressive frame interpolation.
2021-12-20
LDM
High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022) Star Efficient high-quality image generation by applying diffusion and denoising processes in the VAE latent space.
2021-12-08
CFG
(notes in jupyter)
Classifier-Free Diffusion Guidance (NeurIPS workshop 2021) Image generation with classifier-free condition guidance by jointly training a conditional model and an unconditional model.
2020-10-06
DDIM
(notes in jupyter)
Denoising Diffusion Implicit Models (ICLR 2021) Star Accelerate sampling of diffusion models by introducing a non-Markovian, deterministic process that achieves high-quality results with fewer steps while preserving training consistency.
2020-06-19
DDPM
(notes in jupyter)
Denoising Diffusion Probabilistic Models (NeurIPS 2020) Star Denoising diffusion probabilistic models that iteratively denoise data from random noise for image generation.
[back to top]


Fine-Tuning

Post-training by fine-tuning models or some modules like LoRA or ControlNet.
Date & Model Paper & Publication & Project Summary
2024-11-27
Reliable Seed
Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds (ICLR 2025) Noises initialized from reliable seeds yield more accurate image generation (e.g., numeracy and position), and using these generated data for fine-tuning further improves performance.
2024-03-08
ELLA
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv 2024) Star ELLA: Replace CLIP with LLM to understand dense prompts; DPG-Bench: evaluate image generation on dense prompts.
[back to top]


Reinforcement Learning

Post-training by employing reinforcement learning.
Date & Model Paper & Publication & Project Summary
2025-01-23
PARM
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv 2025) Star Apply Chain-of-Thought into image generation and combine it with reinforcement learning to further improve performance.
2025-01-23
Flow-RWR
Flow-DPO
Improving Video Generation with Human Feedback (arXiv 2025) Star A human preference video dataset; Adapt diffusion-based reinforcement learning to flow-based video generation models.
2023-12-19
InstructVideo
InstructVideo: Instructing Video Diffusion Models with Human Feedback (CVPR 2024) Star Use HPS v2 to provide reward and train video generation models in an editing manner.
2023-11-21
Diffusion-DPO
(notes in jupyter)
Diffusion Model Alignment Using Direct Preference Optimization (CVPR 2024) Star Adapt Direct Preference Optimization (DPO) from large language models to diffusion models for image generation.
2022-12-19
promptist
Optimizing Prompts for Text-to-Image Generation (NeurIPS 2023) Star Use LLM to refine prompts for preference-aligned image generation by taking relevance and aesthetics as reward.
[back to top]


Inference-Time Improvement

Improve inference-time visual generation performance, inspired by progress in large language models like OpenAI o1 and DeepSeek-R1.
Date & Model Paper & Publication & Project Summary
2025-02-24
IGTR
Autoregressive Image Generation Guided by Chains of Thought (arXiv 2025) Insert reasoning prompts to improve auto-regressive image generation performance by Chain-of-Thought.
2025-01-23
PARM
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv 2025) Star Apply Chain-of-Thought into image generation and combine it with reinforcement learning to further improve performance.
2025-01-16
Scaling Analysis
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (CVPR 2025) Analysis on inference-time scaling of diffusion models for image generation from the axes of verifiers and algorithms.
2024-12-14
Z-Sampling
Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection (ICLR 2025) Star Use the guidance gap between denoising and inversion, performing them iteratively, to improve image generation quality.
[back to top]


Evaluation

Metrics and benchmarks for evaluating visual generation performance.
Date & Model Paper & Publication & Project Summary
2024-07-19
T2V-CompBench
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (arXiv 2024) Star Use 1400 prompts to evaluate video generation on compositional generation, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy.
2024-04-01
VQAScore
Evaluating Text-to-Visual Generation with Image-to-Text Generation (ECCV 2024) Star VQAScore: alignment probability of "yes" answer from a VQA model with CLIP-FlanT5 structure; GenAI-Bench: evaluation benchmark with 1600 prompts for image generation.
2024-03-08
DPG-Bench
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv 2024) Star ELLA: Replace CLIP with LLM to understand dense prompts; DPG-Bench: evaluate image generation on dense prompts.
2023-11-29
VBench
VBench: Comprehensive Benchmark Suite for Video Generative Models (CVPR 2024) Star Evaluate video generation from 16 dimensions within the perspectives of video quality and video-prompt consistency.
2023-10-17
GenEval
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (NeurIPS 2023) Star An object-focused framework for image generation evaluation by providing scores of single object, two objects, counting, colors, position, attribute binding, and overall.
2023-07-12
T2I-CompBench
(notes in jupyter)
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation (NeurIPS 2023) Star Use 6000 prompts to train and evaluate image generation on compositional generation, including attribute binding, object relationship, and complex compositions.
2023-06-15
HPS v2
(notes in jupyter)
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis (arXiv 2023) Star HPD v2: 798K binary human preference choices on 433K pairs of generated images; HPS v2: use HPD v2 to fine-tune CLIP for image generation evaluation.
2023-05-02
PickScore
(notes in jupyter)
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation (NeurIPS 2023) Star Pick-a-Pic: use a web app to collect user preferences; PickScore: train a CLIP-based model for image generation evaluation.
2023-04-12
ImageReward
(notes in jupyter)
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (NeurIPS 2023) Star Train BLIP on 137K human preference image pairs for image generation and use it to tune diffusion models by Reward Feedback Learning (ReFL).
2023-03-25
HPS
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (ICCV 2023) Star Fine-tune CLIP using 98K annotated SD-generated images from 25K prompts for image generation evaluation.
2021-04-18
CLIP Score
(notes in jupyter)
CLIPScore: A Reference-free Evaluation Metric for Image Captioning (EMNLP 2021) Star A reference-free metric mainly focusing on semantic alignment for image generation evaluation.
2019-05-04
FVD
FVD: A new Metric for Video Generation (ICLR workshop 2019) Extend FID for video generation evaluation by replacing 2D InceptionNet with pre-trained Inflated 3D convnet.
2017-06-26
FID
(notes in jupyter)
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) Calculate Fréchet distance between Gaussian distributions of InceptionNet feature maps of real-world data and synthetic data for image generation evaluation.
2016-06-10
Inception Score
(notes in jupyter)
Improved Techniques for Training GANs (NeurIPS 2016) Star Calculate the KL divergence between p(y|x) and p(y), rewarding low entropy of p(y|x) for each sample and high entropy of the marginal p(y) across classes, for image generation evaluation.
[back to top]


Papers & Reading Notes

[2025-03-26] Wan: Open and Advanced Large-Scale Video Generative Models (arXiv 2025)

Authors: Wan Team

Organizations: Alibaba Group

Summary: Alibaba Tongyi's open-sourced model (14B) for text-to-video & image-to-video generation, using 8x8x4 VAE, DiT structure, etc.

Data processing pipeline. Fundamental dimensions: text detection, aesthetic evaluation, NSFW score, watermark & logo detection, black border detection, overexposure detection, synthetic image detection, blur detection, duration & resolution.
Visual quality: clustering, scoring.
Motion quality: optimal motion, medium-quality motion, static videos, camera-driven motion, low-quality motion, shaky camera footage.
Visual text data: hundreds of millions of text-containing images obtained by rendering Chinese characters on a pure white background, plus large amounts from real-world data.
Captions: celebrities, landmarks, movie characters, object counting, OCR, camera angle and motion, fine-grained categories, relational understanding, re-caption, editing instruction caption, group image description, human-annotated image and video captions.
VAE compresses data by 8x8x4. It is trained using an L1 reconstruction loss, a KL loss, and an LPIPS perceptual loss.
Architecture. Cross-attention for text prompts; MLP for time embeddings; the text encoder is umT5; the training objective is flow matching. Training proceeds from image pre-training to image-video joint training.
Transformer blocks.
I2V framework. The image condition is injected via channel concatenation and a CLIP image encoder.
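
A minimal sketch of the channel-concat image-to-video conditioning described above, with assumed tensor shapes and stand-in tensors for the VAE and CLIP encoders (not the released Wan code):

```python
import torch

# Latent video: [batch, channels, frames, height, width]; shapes are illustrative.
B, C, T, H, W = 1, 16, 8, 32, 32
noisy_latent = torch.randn(B, C, T, H, W)

# Encode the condition image into the same latent space (stand-in tensor here),
# place it at the first frame, and zero-pad the remaining frames.
image_latent = torch.randn(B, C, 1, H, W)
cond_latent = torch.cat([image_latent, torch.zeros(B, C, T - 1, H, W)], dim=2)

# Channel-wise concatenation doubles the channels fed to the DiT's patch embedding.
dit_input = torch.cat([noisy_latent, cond_latent], dim=1)  # [B, 2C, T, H, W]

# A CLIP image embedding of the condition image is additionally injected,
# e.g., via cross-attention (stand-in tensor for the CLIP vision tokens).
clip_image_tokens = torch.randn(B, 257, 1280)
```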

[back to top]

[2025-03-14] Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model (arXiv 2025)

Authors: Step-Video Team

Organizations: StepFun

Summary: StepFun's open-sourced model (30B) for image-to-video generation, trained upon Step-Video-T2V, using channel concatenation of the image condition and a timestep-combined motion condition.

Image conditioning: channel-wise concatenation of the noise-augmented, zero-padded image condition with the input noise. Motion conditioning: a motion score is extracted using optical flow and combined with the timestep.
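
A rough sketch of combining a motion condition with the timestep embedding; the sinusoidal embedding, summation, and MLP below are illustrative assumptions rather than the report's exact design:

```python
import torch
import torch.nn as nn

def sinusoidal_embedding(x, dim=256):
    # Standard sinusoidal embedding used for timesteps (reused here for the motion score).
    half = dim // 2
    freqs = torch.exp(-torch.arange(half) * torch.log(torch.tensor(10000.0)) / half)
    angles = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

timestep = torch.tensor([500])      # diffusion timestep
motion_score = torch.tensor([8.0])  # e.g., mean optical-flow magnitude of the training clip

# Combine the two embeddings (summation is one simple choice) before feeding
# the result to the DiT's conditioning pathway.
cond_emb = sinusoidal_embedding(timestep) + sinusoidal_embedding(motion_score)
cond_emb = nn.Sequential(nn.Linear(256, 1024), nn.SiLU(), nn.Linear(1024, 1024))(cond_emb)
```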

[back to top]

[2025-03-10] Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms (arXiv 2025)

Authors: Jiaming Song, Linqi Zhou

Organizations: Luma AI

Summary: Analyze pre-training algorithm design from an inference-first perspective, and view inference scaling from a unified perspective of scaling sequence length & refinement steps.

[back to top]

[2025-03-10] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model (arXiv 2025)

Authors: ByteDance's Seed Vision Team

Organizations: ByteDance

Summary: ByteDance (Seed Vision Team)'s foundation model for image generation with native Chinese-English bilingual capability, where techniques such as scaled RoPE, SFT, and RLHF are employed.

Performance with English and Chinese prompts.
Pre-training data system.
Model structure is designed following SD3.
Data cleaning process.
Generic captions: long & short; specialized captions: style, color, composition, light, textual; surreal captions: surreal & fantastical aspects.
Overview of training and inference stages.

[back to top]

[2025-02-24] Autoregressive Image Generation Guided by Chains of Thought (arXiv 2025)

Authors: Miaomiao Cai, Guanjie Wang, Wei Li, Zhijun Tu, Hanting Chen, Shaohui Lin, Jie Hu

Organizations: University of Science and Technology of China, Huawei Noah's Ark Lab, East China Normal University

Summary: Insert reasoning prompts to improve auto-regressive image generation performance by Chain-of-Thought.

Take an image from the same class as a Class Reasoning Prompt, or indices from the codebook of the image tokenizer as a Universal Reasoning Prompt.

[back to top]

[2025-02-18] Is Noise Conditioning Necessary for Denoising Generative Models? (arXiv 2025)

Authors: Qiao Sun, Zhicheng Jiang, Hanhong Zhao, Kaiming He

Organizations: MIT

Summary: Theoretical and empirical analysis on noise-unconditional denoising diffusion models without a timestep input for image generation.

[back to top]

[2025-02-14] Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (arXiv 2025)

Authors: Step-Video Team

Organizations: StepFun

Summary: StepFun's open-sourced model (30B) for text-to-video generation, using DiT structure & RoPE-3D & QK-Norm & 16x16x8 VAE & two bilingual text encoders & DPO.

Main structure: a VAE, bilingual text encoders, DiT, and DPO.
Video-VAE compresses videos by 8x16x16 & 16-channel features.
DPO framework: use dataset prompts & synthesized prompts to generate winning & losing samples from different seeds, followed by human annotation. Adapt Diffusion-DPO here, reducing beta & increasing the learning rate to achieve faster convergence.
Use DiT with 3D full-attention, cross-attention for prompts, RoPE-3D, and QK-Norm. HunyuanCLIP and Step-LLM are used as the bilingual text encoders.
2B video-text pairs, 3.8B image-text pairs.
Video segmentation: use AdaptiveDetector in PySceneDetect to identify scene changes and use FFmpeg to split them.
Video quality assessment: sample eight frames and evaluate. Aesthetic score: LAION CLIP-based aesthetic predictor. NSFW score: LAION CLIP-based NSFW detector. Watermark detection: EfficientNet-based model. Subtitle detection: use PaddleOCR to recognize and localize texts. Saturation score: transform from RGB to HSV and extract the saturation channel. Blur score: Laplacian variance. Black border detection: FFmpeg. Video motion assessment: mean optical flow.
Video captioning: short caption, dense caption, original title.
Video concept balancing: use K-means to group into 120K clusters.
Video-text alignment: CLIP Score.
Pre-training details.
Data filtering details.
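
As a concrete example of two of the filtering scores listed above, a small sketch (illustrative, requires opencv-python; not the Step-Video pipeline code) of the Laplacian-variance blur score and the HSV saturation score for one sampled frame:

```python
import cv2
import numpy as np

# Stand-in for a frame sampled from a video clip.
frame = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)

# Blur score: variance of the Laplacian of the grayscale frame (low variance -> blurry).
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
blur_score = cv2.Laplacian(gray, cv2.CV_64F).var()

# Saturation score: mean of the saturation channel after converting to HSV.
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
saturation_score = hsv[:, :, 1].mean()

print(blur_score, saturation_score)
```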

[back to top]

[2025-01-23] Improving Video Generation with Human Feedback (arXiv 2025)

Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

Organizations: CUHK, Tsinghua University, Kuaishou Technology, Shanghai Jiao Tong University, Shanghai AI Lab

Summary: A human preference video dataset; Adapt diffusion-based reinforcement learning to flow-based video generation models.

[back to top]

[2025-01-23] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv 2025)

Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

Organizations: CUHK, Peking University, Shanghai AI Lab

Summary: Apply Chain-of-Thought into image generation and combine it with reinforcement learning to further improve performance.

ORM is coarse, PRM does not know when to make a decision, and PARM combines the two.
Self-correction makes use of bad cases.

[back to top]

[2025-01-16] Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (CVPR 2025)

Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie

Organizations: NYU, MIT, Google

Summary: Analysis on inference-time scaling of diffusion models for image generation from the axes of verifiers and algorithms.

Scaling with search is more effective than scaling denoising steps.
Random Search performs the best because it converges fastest.
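
A minimal sketch of verifier-guided random search over initial noises; `generate` and `verifier` are placeholders standing in for a full denoising run and a scoring model (e.g., an aesthetic or alignment verifier):

```python
import torch

def generate(prompt, noise):
    # Placeholder for a full denoising run conditioned on `prompt`.
    return noise

def verifier(prompt, image):
    # Placeholder for a scoring model; here just a dummy score.
    return image.mean().item()

def random_search(prompt, num_candidates=16, shape=(4, 64, 64)):
    best_image, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = torch.randn(1, *shape)   # each candidate starts from a fresh initial noise
        image = generate(prompt, noise)
        score = verifier(prompt, image)
        if score > best_score:           # keep the candidate the verifier prefers
            best_image, best_score = image, score
    return best_image

best = random_search("a red cube on a blue sphere")
```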

[back to top]

[2024-12-14] Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection (ICLR 2025)

Authors: Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, Zeke Xie

Organizations: The Hong Kong University of Science and Technology (Guangzhou), Mohamed bin Zayed University of Artificial Intelligence, Baidu Inc

Summary: Use the guidance gap between denoising and inversion, performing them iteratively, to improve image generation quality.

Iteratively perform denoising & inversion.
Capture more semantics by denoising more times.
More efficient & effective than standard sampling.
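
A rough sketch of one zigzag round; `denoise_step` and `inversion_step` are placeholders for a sampler's denoising and inversion updates, and the guidance scales are illustrative:

```python
def zigzag_round(x_t, t, denoise_step, inversion_step, rounds=2,
                 guidance_denoise=7.5, guidance_invert=1.0):
    # Alternate denoising (strong guidance) with inversion (weak guidance) so the
    # guidance gap accumulates semantic information at this timestep.
    for _ in range(rounds):
        x_prev = denoise_step(x_t, t, guidance_scale=guidance_denoise)
        x_t = inversion_step(x_prev, t, guidance_scale=guidance_invert)
    # Finish with a normal denoising step and move on to the next timestep.
    return denoise_step(x_t, t, guidance_scale=guidance_denoise)
```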

[back to top]

[2024-12-09] Flow Matching Guide and Code (arXiv 2024)

Authors: Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, Itai Gat

Organizations: FAIR at Meta, MIT CSAIL, Weizmann Institute of Science

Summary: Comprehensive and self-contained review of the flow matching algorithm, covering its mathematical foundations, design choices, extensions, and code implementations.

[back to top]

[2024-12-03] HunyuanVideo: A Systematic Framework For Large Video Generative Models (arXiv 2024)

Authors: Hunyuan Foundation Model Team

Organizations: Tencent

Summary: Tencent (Hunyuan Team)'s open-sourced video generation model (13B) using diffusion transformer and conducting fine-grained data curation, captioning, and training scaling.

[back to top]

[2024-11-27] Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds (ICLR 2025)

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

Organizations: EPFL, Stony Brook University

Summary: Noises initialized from reliable seeds yield more accurate image generation (e.g., numeracy and position), and using these generated data for fine-tuning further improves performance.

[back to top]

[2024-10-17] Movie Gen: A Cast of Media Foundation Models (arXiv 2024)

Authors: Adam Polyak et al.

Organizations: Meta

Summary: A diffusion transformer-based model (30B) for 16s / 1080p / 16 fps video and synchronized audio generation.

[back to top]

[2024-07-19] T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (arXiv 2024)

Authors: Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu

Organizations: The University of Hong Kong, The Chinese University of Hong Kong, Huawei Noah's Ark Lab

Summary: Use 1400 prompts to evaluate video generation on compositional generation, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy.

T2V-CompBench: categories (left), evaluation methods (middle), and benchmarking model performance (right).

[back to top]

[2024-07-16] Scaling Diffusion Transformers to 16 Billion Parameters (arXiv 2024)

Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang

Organizations: Kunlun Inc.

Summary: A diffusion transformer (16B) with MoE that inserts experts into DiT blocks for image generation.

DiT-MoE is built upon DiT and replaces the MLP in each Transformer block with a sparsely activated mixture of MLPs as experts.
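
A small sketch of the idea of replacing the block's MLP with a sparsely activated mixture of MLP experts; the dimensions, router, and top-k routing below are illustrative assumptions, not the DiT-MoE release:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEMLP(nn.Module):
    def __init__(self, dim=1152, hidden=4608, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: [tokens, dim]
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)  # route each token to its top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = MoEMLP()(torch.randn(256, 1152))  # tokens from a DiT block
```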

[back to top]

[2024-04-01] Evaluating Text-to-Visual Generation with Image-to-Text Generation (ECCV 2024)

Authors: Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

Organizations: Carnegie Mellon University, Meta

Summary: VQAScore: alignment probability of "yes" answer from a VQA model with CLIP-FlanT5 structure; GenAI-Bench: evaluation benchmark with 1600 prompts for image generation.

Evaluate by feeding the prompt and the generated image into a trained VQA model and computing the probability of it answering "yes".
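
A sketch of the VQAScore readout: the score is the probability assigned to a "Yes" answer for a templated question about the prompt; the logits and token id below are stand-ins rather than the CLIP-FlanT5 checkpoint's actual values:

```python
import torch

question = 'Does this figure show "a red cube on a blue sphere"? Please answer yes or no.'

answer_logits = torch.randn(32128)  # stand-in for the decoder's first answer-token logits
yes_token_id = 2163                 # stand-in for the tokenizer id of "Yes"

vqascore = torch.softmax(answer_logits, dim=-1)[yes_token_id].item()
```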

[back to top]

[2024-03-08] ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv 2024)

Authors: Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu

Organizations: Tencent

Summary: ELLA: Replace CLIP with LLM to understand dense prompts; DPG-Bench: evaluate image generation on dense prompts.

[back to top]

[2023-12-19] InstructVideo: Instructing Video Diffusion Models with Human Feedback (CVPR 2024)

Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

Organizations: Zhejiang University, Alibaba Group, Tsinghua University, Singapore University of Technology and Design, Nanyang Technological University, University of Cambridge

Summary: Use HPS v2 to provide reward and train video generation models in an editing manner.

It takes reward fine-tuning as an editing task to accelerate training. The reward model is HPS v2. TAR assigns larger weight to central frames.

[back to top]

[2023-11-21] Diffusion Model Alignment Using Direct Preference Optimization (CVPR 2024)

Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

Organizations: Salesforce AI, Stanford University

Summary: Adapt Direct Preference Optimization (DPO) from large language models to diffusion models for image generation.

[back to top]

[2023-11-29] VBench: Comprehensive Benchmark Suite for Video Generative Models (CVPR 2024)

Authors: Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

Organizations: Nanyang Technological University, Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Nanjing University

Summary: Evaluate video generation from 16 dimensions within the perspectives of video quality and video-prompt consistency.


[back to top]

[2023-10-17] GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (NeurIPS 2023)

Authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt

Organizations: University of Washington, Allen Institute for AI, LAION

Summary: An object-focused framework for image generation evaluation by providing scores of single object, two objects, counting, colors, position, attribute binding, and overall.

GenEval detects objects using a Mask2Former detector and evaluates their attributes.
Specific evaluation perspectives of GenEval.

[back to top]

[2023-07-12] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation (NeurIPS 2023)

(notes in jupyter)

Authors: Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu

Organizations: The University of Hong Kong, Huawei Noah's Ark Lab

Summary: Use 6000 prompts to train and evaluate image generation on compositional generation, including attribute binding, object relationship, and complex compositions.

Use disentangled BLIP-VQA to evaluate attribute binding, a UniDet-based metric to evaluate spatial relationships, CLIPScore to evaluate non-spatial relationships, and a 3-in-1 metric (the average score of the three metrics) to evaluate complex compositions.

[back to top]

[2023-06-15] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis (arXiv 2023)

(notes in jupyter)

Authors: Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, Hongsheng Li

Organizations: CUHK, SenseTime Research, Shanghai Jiao Tong University, Centre for Perceptual and Interactive Intelligence

Summary: HPD v2: 798K binary human preference choices on 433K pairs of generated images; HPS v2: use HPD v2 to fine-tune CLIP for image generation evaluation.

Step 1: Clean prompts from COCO captions and DiffusionDB by ChatGPT; Step 2: Generate images using 9 models; Step 3: Rank and annotate each pair of images; Step 4: Train CLIP and obtain a preference model for providing the HPS v2 score.
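
A minimal sketch of the preference-training idea behind HPS v2 (fine-tuning CLIP on binary choices): the two image-text similarities of a pair are turned into a softmax and matched against the human choice with cross-entropy. Shapes, the logit scale, and the exact loss form are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(text_emb, img_emb_a, img_emb_b, choice, logit_scale=100.0):
    # Cosine similarities between the prompt embedding and each candidate image embedding.
    sim_a = F.cosine_similarity(text_emb, img_emb_a, dim=-1)
    sim_b = F.cosine_similarity(text_emb, img_emb_b, dim=-1)
    logits = logit_scale * torch.stack([sim_a, sim_b], dim=-1)  # [batch, 2]
    return F.cross_entropy(logits, choice)                      # choice: 0 if image A preferred, 1 if B

# Toy usage with random embeddings standing in for CLIP outputs.
loss = preference_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512),
                       choice=torch.randint(0, 2, (8,)))
```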

[back to top]

[2023-05-02] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation (NeurIPS 2023)

(notes in jupyter)

Authors: Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy

Organizations: Tel Aviv University, Stability AI

Summary: Pick-a-Pic: use a web app to collect user preferences; PickScore: train a CLIP-based model for image generation evaluation.

[back to top]

[2023-04-12] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (NeurIPS 2023)

(notes in jupyter)

Authors: Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong

Organizations: Tsinghua University, Zhipu AI, Beijing University of Posts and Telecommunications

Summary: Train BLIP on 137K human preference image pairs for image generation and use it to tune diffusion models by Reward Feedback Learning (ReFL).

Step 1: use DiffusionDB prompts to generate images; Step 2: Rate and rank images; Step 3: Train ImageReward using ranking data; Step 4: tune diffusion models using ImageReward; Step 5: provide scores to the generated images.

[back to top]

[2023-03-25] Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (ICCV 2023)

Authors: Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li

Organizations: CUHK, SenseTime Research, Shanghai Jiao Tong University, Centre for Perceptual and Interactive Intelligence, Shanghai AI Lab

Summary: Fine-tune CLIP using 98K annotated SD-generated images from 25K prompts for image generation evaluation.

Train a human preference classifier: similar to CLIP training, except that the image with the highest preference is taken as the positive; append a special token to the prompts of worse images and train an SD LoRA; remove the special token during inference.

[back to top]

[2022-12-19] Scalable Diffusion Models with Transformers (ICCV 2023)

(notes in jupyter)

Authors: William Peebles, Saining Xie

Organizations: UC Berkeley, New York University

Summary: Replace the U-Net with a transformer for scalable image generation; the timestep and prompt are injected via an adaLN-Zero structure.

Injecting the timestep and prompt via the adaLN-Zero structure works better than cross-attention or in-context conditioning.
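
A minimal sketch of an adaLN-Zero DiT block; dimensions and the conditioning projection are simplified, but it shows the two key points: the conditioning embedding regresses shift/scale/gate parameters, and the gates are zero-initialized so every block starts as the identity.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim=1152, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)  # shift/scale/gate for the attention and MLP branches
        nn.init.zeros_(self.ada.weight)     # "-Zero": gates start at zero -> identity block
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, c):  # x: [B, N, dim] tokens, c: [B, dim] timestep+prompt embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1[:, None]) + shift1[:, None]
        x = x + gate1[:, None] * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2[:, None]) + shift2[:, None]
        return x + gate2[:, None] * self.mlp(h)

out = AdaLNZeroBlock()(torch.randn(2, 256, 1152), torch.randn(2, 1152))
```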

[back to top]

[2022-12-19] Optimizing Prompts for Text-to-Image Generation (NeurIPS 2023)

Authors: Yaru Hao, Zewen Chi, Li Dong, Furu Wei

Organizations: Microsoft Research

Summary: Use LLM to refine prompts for preference-aligned image generation by taking relevance and aesthetics as reward.

Step 1: fine-tune a language model (LM) to optimize prompts; Step 2: further fine-tune the LM with PPO, where aesthetics and relevance are the reward.

[back to top]

[2022-10-06] Flow Matching for Generative Modeling (ICLR 2023)

Authors: Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le

Organizations: Meta AI (FAIR), Weizmann Institute of Science

Summary: A type of generative models built on continuous normalizing flows by learning a time-dependent vector field that transports data from the source distribution to the target distribution.
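
A minimal sketch of a conditional flow matching loss with a linear interpolation path, x_t = (1 - t) x_0 + t x_1 and target velocity x_1 - x_0; conventions vary across papers, and `model` stands for any velocity-prediction network taking (x_t, t):

```python
import torch

def flow_matching_loss(model, x1):
    # x1: a batch of data samples; x0: samples from the Gaussian source distribution.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))  # broadcastable t in [0, 1)
    x_t = (1 - t) * x0 + t * x1                           # point on the linear path
    target_velocity = x1 - x0                             # velocity of that path
    pred_velocity = model(x_t, t.flatten())
    return ((pred_velocity - target_velocity) ** 2).mean()
```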

[back to top]

[2022-08-25] Understanding Diffusion Models: A Unified Perspective (arXiv 2022)

Authors: Calvin Luo

Organizations: Google Brain

Summary: Introduction to VAEs, DDPM, score-based generative models, and guidance from a unified generative perspective.

[back to top]

[2022-05-29] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (ICLR 2023)

Authors: Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

Organizations: Tsinghua University, BAAI

Summary: An open-sourced transformer-based video generation model (9B) that auto-regressively generates frame sequences and then performs auto-regressive frame interpolation.

It is built upon the pretrained CogView2. In stage 1, it generates frames sequentially. In stage 2, it recursively interpolates frames.

[back to top]

[2021-12-20] High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022)

Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

Organizations: Heidelberg University, Runway ML

Summary: Efficient high-quality image generation by applying diffusion and denoising processes in the VAE latent space.

LDM performs the diffusion process in the latent space of a VAE.
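
A tiny sketch of the latent-space pipeline at sampling time; `vae` and `latent_sampler` are placeholders for a trained autoencoder and a latent-space denoising loop:

```python
import torch

def sample_latent_diffusion(vae, latent_sampler, prompt, latent_shape=(1, 4, 64, 64)):
    z_T = torch.randn(latent_shape)    # start from noise in the latent space, not pixel space
    z_0 = latent_sampler(z_T, prompt)  # iterative denoising entirely in the latent space
    return vae.decode(z_0)             # decode the denoised latent back to pixels
```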

[back to top]

[2021-12-08] Classifier-Free Diffusion Guidance (NeurIPS workshop 2021)

(notes in jupyter)

Authors: Jonathan Ho, Tim Salimans

Organizations: Google Research, Brain team

Summary: Image generation with classifier-free condition guidance by jointly training a conditional model and an unconditional model.
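
At sampling time, the conditional and unconditional predictions are combined with a guidance weight w; one common parameterization (equivalent to the paper's form up to a shift in w) is:

```latex
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w \bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```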

[back to top]

[2021-04-18] CLIPScore: A Reference-free Evaluation Metric for Image Captioning (EMNLP 2021)

(notes in jupyter)

Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi

Organizations: Allen Institute for AI, University of Washington

Summary: A reference-free metric mainly focusing on semantic alignment for image generation evaluation.

CLIPScore avoids the shortcomings of n-gram matching, which disfavors good captions that use new words and favors captions that reuse familiar words.
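
For reference, the score is a rescaled, clipped cosine similarity between the CLIP embeddings of the image v and the candidate caption c:

```latex
\mathrm{CLIPScore}(c, v) = w \cdot \max\bigl(\cos(c, v),\, 0\bigr), \qquad w = 2.5
```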

[back to top]

[2020-10-06] Denoising Diffusion Implicit Models (ICLR 2021)

(notes in jupyter)

Authors: Jiaming Song, Chenlin Meng, Stefano Ermon

Organizations: Stanford University

Summary: Accelerate sampling of diffusion models by introducing a non-Markovian, deterministic process that achieves high-quality results with fewer steps while preserving training consistency.

Comparisons between Markovian DDPM (left) and non-Markovian DDIM (right).
Accelerate sampling by skipping time steps.
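
For reference, the DDIM update written with the DDPM cumulative-product notation \bar{\alpha}_t; setting \sigma_t = 0 gives the deterministic sampler, and the update can be applied over a sub-sequence of timesteps:

```latex
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}
  \left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right)
  + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t z,
  \qquad z \sim \mathcal{N}(0, I)
```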

[back to top]

[2020-06-19] Denoising Diffusion Probabilistic Models (NeurIPS 2020)

(notes in jupyter)

Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel

Organizations: UC Berkeley

Summary: Denoising diffusion probabilistic models that iteratively denoise data from random noise for image generation.

Forward and reverse processes of DDPM.
Training and sampling algorithms of DDPM.
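
For reference, the simplified training objective: the network predicts the noise added to x_0 at a random timestep t, with \bar{\alpha}_t the cumulative product of 1 - \beta_t:

```latex
L_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
  \Bigl[\bigl\lVert \epsilon - \epsilon_\theta\bigl(\sqrt{\bar{\alpha}_t}\, x_0
  + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\bigr) \bigr\rVert^2\Bigr]
```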

[back to top]

[2019-05-04] FVD: A new Metric for Video Generation (ICLR workshop 2019)

Authors: Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, Sylvain Gelly

Organizations: Johannes Kepler University, IDSIA, Google Brain

Summary: Extend FID for video generation evaluation by replacing 2D InceptionNet with pre-trained Inflated 3D convnet.

[back to top]

[2017-06-26] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017)

(notes in jupyter)

Authors: Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter

Organizations: Johannes Kepler University Linz

Summary: Calculate Fréchet distance between Gaussian distributions of InceptionNet feature maps of real-world data and synthetic data for image generation evaluation.
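
For reference, the metric is the Fréchet distance between Gaussians fitted to Inception features of real data (\mu_r, \Sigma_r) and generated data (\mu_g, \Sigma_g):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr)
```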

[back to top]

[2016-06-10] Improved Techniques for Training GANs (NeurIPS 2016)

(notes in jupyter)

Authors: Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen

Organizations: OpenAI

Summary: Calculate the KL divergence between p(y|x) and p(y), rewarding low entropy of p(y|x) for each sample and high entropy of the marginal p(y) across classes, for image generation evaluation.
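
For reference, the score exponentiates the expected KL divergence between the per-sample class posterior and the marginal class distribution:

```latex
\mathrm{IS} = \exp\Bigl(\mathbb{E}_{x \sim p_g}\bigl[D_{\mathrm{KL}}\bigl(p(y \mid x)\,\Vert\, p(y)\bigr)\bigr]\Bigr)
```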

[back to top]