Visual Generative Models

A list of reading papers curated by Junkun Yuan (yuanjk0921@outlook.com).

Contents:
[back to main contents of reading papers]

Last updated on April 28, 2025.

Summaries

Surveys & Insights

Surveys and interesting insights on visual generative models.
Date & Model | Paper & Publication & Project | Summary
2025-03-10
Inference Beat Pretraining
Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms (arXiv 2025) Analyze pre-training algorithm design from an inference-first perspective, and view inference scaling from a unified perspective of scaling sequence length & refinement steps.
2025-02-18
noise-unconditional model
Is Noise Conditioning Necessary for Denoising Generative Models? (arXiv 2025) Theoretical and empirical analysis of noise-unconditional denoising diffusion models without a timestep input for image generation.
2024-12-09
Flow Matching Guide and Code
Flow Matching Guide and Code (arXiv 2024) Star Comprehensive and self-contained review of the flow matching algorithm, covering its mathematical foundations, design choices, extensions, and code implementations.
2022-08-25
Unified Perspective
Understanding Diffusion Models: A Unified Perspective (arXiv 2022) Introduction to VAE, DDPM, score-based generative models, and guidance from a unified generative perspective.
[back to top]


Foundation Models & Algorithms

Foundation models and algorithms on visual generative models.
Date & Model | Paper & Publication & Project | Summary
2025-04-15
SimpleAR
SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL (arXiv 2025) Star A vanilla, open-sourced AR model (0.5B) for 1K text-to-image generation, trained by pre-training, SFT, RL (GRPO), acceleration (KV cache, vLLM serving, speculative Jacobi decoding).
2025-04-15
Seedream 3.0
Seedream 3.0 Technical Report (arXiv 2025) ByteDance Seed Vision Team's text-to-image generation model (MMDiT structure), improving Seedream 2.0 with defect-aware training, representation alignment, a larger reward model, etc.
2025-04-11
Seaweed-7B
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model (arXiv 2025) ByteDance Seaweed Team's text-to-video & image-to-video generation model (7B) with DiT structure, trained (multi-task learning) on O(100M) video clips using 665K H100 GPU hours.
2025-03-26
Wan
Wan: Open and Advanced Large-Scale Video Generative Models (arXiv 2025) Star Alibaba Tongyi Wanxiang's open-sourced model (14B) for text-to-video & image-to-video generation, using 8x8x4 VAE, DiT structure, etc.
2025-03-14
Step-Video-TI2V
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model (arXiv 2025) Star StepFun's open-sourced model (30B) for image-to-video generation, built upon Step-Video-T2V, using channel concatenation of the image condition and a timestep-combined motion condition.
2025-03-10
Seedream2.0
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model (arXiv 2025) ByteDance Seed Vision Team's foundation model for image generation with native Chinese-English bilingual capability, employing techniques such as scaled RoPE, SFT, and RLHF.
2025-02-14
Step-Video-T2V
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (arXiv 2025) Star StepFun's open-sourced model (30B) for text-to-video generation, using DiT structure & RoPE-3D & QK-Norm & 16x16x8 VAE & two bilingual text encoders & DPO.
2024-12-05
Infinity
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis (arXiv 2024) Star It improves VAR by applying bitwise modeling that makes the vocabulary effectively infinite, opening up new possibilities for discrete text-to-image generation with the next-scale prediction paradigm.
2024-12-03
HunyuanVideo
HunyuanVideo: A Systematic Framework For Large Video Generative Models (arXiv 2024) Star Tencent (Hunyuan Team)'s open-sourced video generation model (13B) using diffusion transformer and conducting fine-grained data curation, captioning, and training scaling.
2024-10-17
MovieGen
Movie Gen: A Cast of Media Foundation Models (arXiv 2024) A diffusion transformer-based model (30B) for 16s / 1080p / 16 fps video and synchronized audio generation.
2024-10-17
Fluid
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (ICLR 2025) It shows that auto-regressive models with continuous tokens beat their discrete-token counterparts, and reports interesting empirical observations during scaling.
2024-07-16
DiT-MoE
Scaling Diffusion Transformers to 16 Billion Parameters (arXiv 2024) Star A diffusion transformer (16B) with MoE that inserts experts into DiT blocks for image generation.
2024-06-10
LlamaGen
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (arXiv 2024) Star Show that applying "next-token prediction" to vanilla autoregressive models achieves good class-conditional & text-conditional image generation performance.
2024-04-03
VAR
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (NeurIPS 2024 Best Paper) Star Employ next-scale prediction to make auto-regressive models surpass diffusion transformers for image generation in image quality, inference speed, data efficiency, and scalability.
2023-07-04
SDXL
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (ICLR 2024) Star Employ a UNet backbone three times larger than SD's, use resolution & crop coordinates & bucketing information as conditioning for training, and employ a refinement model.
2022-12-19
DiT
(notes in jupyter)
Scalable Diffusion Models with Transformers (ICCV 2023) Star Replace the U-Net with a transformer for scalable image generation; the timestep and prompt are injected via the adaLN-Zero structure.
2022-10-06
Flow Matching
Flow Matching for Generative Modeling (ICLR 2023) Star A class of generative models built on continuous normalizing flows that learns a time-dependent vector field transporting data from the source distribution to the target distribution.
2022-05-29
CogVideo
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (ICLR 2023) Star An open-sourced transformer-based video generation model (9B) that auto-regressively generates frame sequences and then performs auto-regressive frame interpolation.
2021-12-20
LDM
High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022) Star Efficient high-quality image generation by applying diffusion and denoising processes in the VAE latent space.
2021-12-08
CFG
(notes in jupyter)
Classifier-Free Diffusion Guidance (NeurIPS workshop 2021) Image generation with classifier-free condition guidance by jointly training a conditional model and an unconditional model.
2020-10-06
DDIM
(notes in jupyter)
Denoising Diffusion Implicit Models (ICLR 2021) Star Accelerate sampling of diffusion models by introducing a non-Markovian, deterministic process that achieves high-quality results with fewer steps while preserving training consistency.
2020-06-19
DDPM
(notes in jupyter)
Denoising Diffusion Probabilistic Models (NeurIPS 2020) Star A denoising diffusion probabilistic model that iteratively denoises data from random noise for image generation.
2019-06-02
VQ-VAE-2
Generating Diverse High-Fidelity Images with VQ-VAE-2 (NeurIPS 2019) Introduce a hierarchical VQ-VAE and perform sampling in the compressed latent space (which is faster than sampling in pixel space), inspired by the idea of lossy compression.
2017-11-02
VQ-VAE
Neural Discrete Representation Learning (NeurIPS 2017) Propose vector quantised variational autoencoder to generate discrete codes while the prior is also learned.
[back to top]


Fine-Tuning

Post-training by fine-tuning models or some modules like LoRA or ControlNet.
Date & Model | Paper & Publication & Project | Summary
2024-11-27
Reliable Seed
Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds (ICLR 2025) Noises initialized from reliable seeds result in more accurate image generation (e.g., numeracy and position), and using these generated data for fine-tuning further improves performance.
2024-03-08
ELLA
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv 2024) Star ELLA: Replace CLIP with LLM to understand dense prompts; DPG-Bench: evaluate image generation on dense prompts.
[back to top]


Reinforcement Learning

Post-training by employing reinforcement learning.
Date & Model | Paper & Publication & Project | Summary
2025-01-23
PARM
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv 2025) Star Apply Chain-of-Thought to image generation and combine it with reinforcement learning to further improve performance.
2025-01-23
Flow-RWR
Flow-DPO
Improving Video Generation with Human Feedback (arXiv 2025) Star A human preference video dataset; Adapt diffusion-based reinforcement learning to flow-based video generation models.
2023-12-19
InstructVideo
InstructVideo: Instructing Video Diffusion Models with Human Feedback (CVPR 2024) Star Use HPS v2 to provide reward and train video generation models in an editing manner.
2023-11-21
Diffusion-DPO
(notes in jupyter)
Diffusion Model Alignment Using Direct Preference Optimization (CVPR 2024) Star Adapt Direct Preference Optimization (DPO) from large language models to diffusion models for image generation.
2022-12-19
promptist
Optimizing Prompts for Text-to-Image Generation (NeurIPS 2023) Star Use LLM to refine prompts for preference-aligned image generation by taking relevance and aesthetics as reward.
[back to top]


Acceleration

Accelerate inference of diffusion models by employing distillation, quantization, and pruning algorithms.
Date & Model | Paper & Publication & Project | Summary
2024-03-18
SD3-Turbo
LADD
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (SIGGRAPH Asia 2024) Perform distillation of diffusion models in latent space using teacher-synthetic data and optimizing only an adversarial loss with the teacher as the discriminator.
[back to top]


Inference-Time Improvement

Improve inference-time visual generation performance, inspired by progress in large language models like OpenAI o1 and DeepSeek-R1.
Date & Model | Paper & Publication & Project | Summary
2025-02-24
IGTR
Autoregressive Image Generation Guided by Chains of Thought (arXiv 2025) Insert reasoning prompts to improve auto-regressive image generation performance via Chain-of-Thought.
2025-01-23
PARM
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv 2025) Star Apply Chain-of-Thought to image generation and combine it with reinforcement learning to further improve performance.
2025-01-16
Scaling Analysis
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (CVPR 2025) Analysis of inference-time scaling of diffusion models for image generation along the axes of verifiers and algorithms.
2024-12-14
Z-Sampling
Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection (ICLR 2025) Star Use the guidance gap between denoising and inversion, and iteratively perform them to improve image generation quality.
[back to top]


Evaluation

Metrics and benchmarks for evaluating visual generation performance.
Date & Model | Paper & Publication & Project | Summary
2025-03-07
UnifiedReward
Unified Reward Model for Multimodal Understanding and Generation (arXiv 2025) Star A unified reward model for both multimodal understanding & generation evaluation by pairwise ranking & pointwise scoring.
2024-12-24
FGA-BLIP2
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation (arXiv 2024) Star Use 40K image-text pairs with fine-grained human annotations for image generation evaluation.
2024-07-19
T2V-CompBench
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (arXiv 2024) Star Use 1400 prompts to evaluate video generation on compositional generation, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy.
2024-05-23
MPS
Learning Multi-dimensional Human Preference for Text-to-Image Generation (CVPR 2024) Star Apply condition upon CLIP to learn multi-dimensional preference score: aesthetics & alignment & detail & overall, trained on a dataset with 92K preference choices of 61K images.
2024-04-01
VQAScore
Evaluating Text-to-Visual Generation with Image-to-Text Generation (ECCV 2024) Star VQAScore: alignment probability of "yes" answer from a VQA model with CLIP-FlanT5 structure; GenAI-Bench: evaluation benchmark with 1600 prompts for image generation.
2024-03-08
DPG-Bench
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv 2024) Star ELLA: Replace CLIP with LLM to understand dense prompts; DPG-Bench: evaluate image generation on dense prompts.
2023-11-29
VBench
VBench: Comprehensive Benchmark Suite for Video Generative Models (CVPR 2024) Star Evaluate video generation from 16 dimensions within the perspectives of video quality and video-prompt consistency.
2023-10-17
GenEval
GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (NeurIPS 2023) Star An object-focused framework for image generation evaluation by providing scores of single object, two objects, counting, colors, position, attribute binding, and overall.
2023-07-12
T2I-CompBench
(notes in jupyter)
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation (NeurIPS 2023) Star Use 6000 prompts to train and evaluate image generation on compositional generation, including attribute binding, object relationship, and complex compositions.
2023-06-15
HPS v2
(notes in jupyter)
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis (arXiv 2023) Star HPD v2: 798K binary human preference choices on 433K pairs of generated images; HPS v2: use HPD v2 to fine-tune CLIP for image generation evaluation.
2023-05-02
PickScore
(notes in jupyter)
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation (NeurIPS 2023) Star Pick-a-Pic: use a web app to collect user preferences; PickScore: train a CLIP-based model for image generation evaluation.
2023-04-12
ImageReward
(notes in jupyter)
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (NeurIPS 2023) Star Train BLIP on 137K human preference image pairs for image generation and use it to tune diffusion models by Reward Feedback Learning (ReFL).
2023-03-25
HPS
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (ICCV 2023) Star Fine-tune CLIP using 98K annotated SD-generated images from 25K prompts for image generation evaluation.
2021-04-18
CLIPScore
(notes in jupyter)
CLIPScore: A Reference-free Evaluation Metric for Image Captioning (EMNLP 2021) Star A reference-free metric mainly focusing on semantic alignment for image generation evaluation.
2019-05-04
FVD
FVD: A new Metric for Video Generation (ICLR workshop 2019) Extend FID for video generation evaluation by replacing 2D InceptionNet with pre-trained Inflated 3D convnet.
2017-06-26
FID
(notes in jupyter)
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017) Calculate Fréchet distance between Gaussian distributions of InceptionNet feature maps of real-world data and synthetic data for image generation evaluation.
2016-06-10
Inception Score
(notes in jupyter)
Improved Techniques for Training GANs (NeurIPS 2016) Star Calculate the KL divergence between p(y|x) and p(y), rewarding low entropy of p(y|x) for each sample and high entropy of the marginal p(y) across classes, for image generation evaluation.
[back to top]


Downstream Tasks

Models & techniques on downstream tasks, e.g., editing, inpainting, outpainting, etc.
Date & Model | Paper & Publication & Project | Summary
2025-04-24
Step1X-Edit
Step1X-Edit: A Practical Framework for General Image Editing (arXiv 2025) Star Use an MLLM to generate condition embeddings of the reference image and the user's editing instruction, trained on 20M instruction-image examples for instruction-based image editing.
[back to top]


Papers & Reading Notes

[2025-04-24] Step1X-Edit: A Practical Framework for General Image Editing (arXiv 2025)

Authors: Step1X-Image Team

Organizations: StepFun

Summary: Use an MLLM to generate condition embeddings of the reference image and the user's editing instruction, trained on 20M instruction-image examples for instruction-based image editing.

Multimodal large language model (Qwen-VL): generates embeddings of the instruction and the reference image. Connector (a token refiner): substitutes for the text embedding fed to the Diffusion Transformer.

[back to top]

[2025-04-15] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL (arXiv 2025)

Project: SimpleAR

Authors: Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

Organizations: Fudan University, ByteDance Seed

Summary: A vanilla, open-sourced AR model (0.5B) for 1K text-to-image generation, trained by pre-training, SFT, RL (GRPO), acceleration (KV cache, vLLM serving, speculative Jacobi decoding).

[back to top]

[2025-04-15] Seedream 3.0 Technical Report (arXiv 2025)

Project: Seedream 3.0

Authors: ByteDance Seed Vision Team

Organizations: ByteDance

Summary: ByteDance Seed Vision Team's text-to-image generation model (MMDiT structure), improving Seedream 2.0 with defect-aware training, representation alignment, a larger reward model, etc.

Seedream 3.0 achieves the best ELO performance.

[back to top]

[2025-04-11] Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model (arXiv 2025)

Project: Seaweed-7B

Authors: ByteDance Seaweed Team

Organizations: ByteDance

Summary: ByteDance Seaweed Team's text-to-video & image-to-video generation model (7B) with DiT structure, trained (multi-task learning) on O(100M) video clips using 665K H100 GPU hours.

Video data processing.
Temporal splitting: internal splitter.
Spatial cropping: use FFmpeg to remove black borders, and develop models for text, logo, watermark, and special-effects detection.
Quality filtering: >256 pixels, aspect ratio within [1/3, 3/1], eliminate static clips and undesirable movements, camera-shake & playback-speed detection, safety, artifact detection.
Multi-aspect data balancing & video deduplication: clustering.
Simulation data: use graphics engines to synthesize videos.
Video captioning: a short & a long caption.
System prompt: video types, camera position, camera angles, camera movement, visual styles.
VAE. Use L1 + LPIPS + adversarial loss. Using both an image discriminator and a video discriminator performs better than using either one. Downsample by 16x16x4 (48 channels) or 8x8x4 (16 channels). Smaller downsample ratios lead to faster convergence. Compressing information with the VAE outperforms patchification in the DiT and is faster.
VAE training stages for images and videos.
VAE training with mixed resolutions, durations, and frame rates converges more slowly but performs better than training on low resolution only.
Full attention enjoys training scalability.
Hybrid-stream is better than dual-stream.
Four-stage training procedure.
Use multi-task training: text-to-video, image-to-video, video-to-video. Input features and conditioning features are channel-concatenated, with a binary mask indicating the condition (see the sketch after these notes). The image-to-video ratio is 20% during pre-training and is increased to 50%-75% for fine-tuning.
SFT: use 700K good videos and 50K top videos; prompt-following ability drops a little.
RLHF: lr=1e-7, beta=100, select from 4 samples from different seeds.
Distillation: trajectory segmented consistency distillation + CFG distillation + adversarial training, distilled to 8 steps.
Image-to-video ELO performance.
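
A minimal PyTorch-style sketch of the channel-concatenation conditioning described in the multi-task training note above; the tensor shapes, helper name, and zero-filled condition are illustrative assumptions rather than the released implementation.

```python
import torch

def build_dit_input(noisy_latent: torch.Tensor,
                    cond_latent: torch.Tensor,
                    cond_mask: torch.Tensor) -> torch.Tensor:
    """Channel-concatenate the noised latent, the conditioning latent, and a
    binary mask marking which positions carry a condition (1) versus are to
    be generated (0). For image-to-video, cond_latent would hold the encoded
    first frame at t=0 and zeros elsewhere; for text-to-video it is all zeros."""
    return torch.cat([noisy_latent, cond_latent, cond_mask], dim=1)

# Toy shapes (B, C, T, H, W); real latent sizes depend on the VAE and resolution.
x = build_dit_input(torch.randn(2, 16, 8, 32, 32),
                    torch.zeros(2, 16, 8, 32, 32),
                    torch.zeros(2, 1, 8, 32, 32))
print(x.shape)  # torch.Size([2, 33, 8, 32, 32])
```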

[back to top]

[2025-03-26] Wan: Open and Advanced Large-Scale Video Generative Models (arXiv 2025)

Project: Wan

Authors: Wan

Organizations: Alibaba Group

Summary: Alibaba Tongyi Wanxiang's open-sourced model (14B) for text-to-video & image-to-video generation, using 8x8x4 VAE, DiT structure, etc.

Data processing pipeline.
Fundamental dimensions: text detection, aesthetic evaluation, NSFW score, watermark & logo detection, black border detection, overexposure detection, synthetic image detection, blur detection, duration & resolution.
Visual quality: clustering, scoring.
Motion quality: optimal motion, medium-quality motion, static videos, camera-driven motion, low-quality motion, shaky camera footage.
Visual text data: hundreds of millions of text-containing images by rendering Chinese characters on a pure white background and large amounts from real-world data.
Captions: celebrities, landmarks, movie characters, object counting, OCR, camera angle and motion, fine-grained categories, relational understanding, re-caption, editing instruction caption, group image description, human-annotated image and video captions.
VAE compresses data by 8x8x4. It is trained using L1 reconstruction loss, KL loss, and LPIPS perceptual loss.
Architecture: cross-attention for text prompts; MLP for time embeddings; the text encoder is umT5; the training objective is flow matching. Image pre-training -> image-video joint training.
Transformer blocks.
I2V framework. The image condition is injected via channel concatenation and a CLIP image encoder.

[back to top]

[2025-03-14] Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model (arXiv 2025)

Project: Step-Video-TI2V

Authors: Step-Video Team

Organizations: StepFun

Summary: StepFun's open-sourced model (30B) for image-to-video generation, built upon Step-Video-T2V, using channel concatenation of the image condition and a timestep-combined motion condition.

Image conditioning: channel-wise concatenation of noise-augmented and zero-padded image condition and input noise.
Motion conditioning: motion is extracted using optical flow and is combined with timestep.

[back to top]

[2025-03-10] Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms (arXiv 2025)

Project: Inference Beat Pretraining

Authors: Jiaming Song, Linqi Zhou

Organizations: Luma AI

Summary: Analyze pre-training algorithm design from an inference-first perspective, and view inference scaling from a unified perspective of scaling sequence length & refinement steps.

[back to top]

[2025-03-10] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model (arXiv 2025)

Project: Seedream2.0

Authors: ByteDance's Seed Vision Team

Organizations: ByteDance

Summary: ByteDance Seed Vision Team's foundation model for image generation with native Chinese-English bilingual capability, employing techniques such as scaled RoPE, SFT, and RLHF.

Performance with English and Chinese prompts.
Pre-training data system.
Model structure is designed following SD3.
Data cleaning process.
Generic captions: long & short; specialized captions: style, color, composition, light, textual; surreal captions: surreal & fantastical aspects.
Overview of training and inference stages.

[back to top]

[2025-03-07] Unified Reward Model for Multimodal Understanding and Generation (arXiv 2025)

Project: UnifiedReward

Authors: Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang

Organizations: Fudan University, Shanghai Innovation Institute, Shanghai AI Lab, Shanghai Academy of Artificial Intelligence for Science

Summary: A unified reward model for both multimodal understanding & generation evaluation by pairwise ranking & pointwise scoring.

Pipeline: (1) Unified Reward Model Training: train a unified reward model for generation & understanding tasks using pointwise scoring & pairwise ranking; (2) Preference Data Construction: data generation, pairwise ranking, pointwise filtering; (3) Generation/Understanding Model Alignment: DPO. The trained model is initialized from LLaVA-OneVision 7B (OV-7B).

[back to top]

[2025-02-24] Autoregressive Image Generation Guided by Chains of Thought (arXiv 2025)

Project: IGTR

Authors: Miaomiao Cai, Guanjie Wang, Wei Li, Zhijun Tu, Hanting Chen, Shaohui Lin, Jie Hu

Organizations: University of Science and Technology of China, Huawei Noah's Ark Lab, East China Normal University

Summary: Insert reasoning prompts to improve auto-regressive image generation performance via Chain-of-Thought.

Take an image from the same class as a Class Reasoning Prompt, or indices from the codebook of the image tokenizer as a Universal Reasoning Prompt.

[back to top]

[2025-02-18] Is Noise Conditioning Necessary for Denoising Generative Models? (arXiv 2025)

Project: noise-unconditional-model

Authors: Qiao Sun, Zhicheng Jiang, Hanhong Zhao, Kaiming He

Organizations: MIT

Summary: Theoretical and empirical analysis of noise-unconditional denoising diffusion models without a timestep input for image generation.

[back to top]

[2025-02-14] Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (arXiv 2025)

Project: Step-Video-T2V

Authors: Step-Video Team

Organizations: StepFun

Summary: StepFun's open-sourced model (30B) for text-to-video generation, using DiT structure & RoPE-3D & QK-Norm & 16x16x8 VAE & two bilingual text encoders & DPO.

Main structure: a VAE, bilingual text encoders, DiT, and DPO.
Video-VAE compresses videos by 8x16x16 & 16-channel features.
DPO framework: use prompts from data & synthesized prompts to generate winning & losing samples from different seeds, followed by human annotation. Adapt Diffusion-DPO here, reducing beta & increasing the learning rate to achieve faster convergence.
Use DiT with 3D full-attention, cross-attention for prompts, RoPE-3D, and QK-Norm. HunyuanCLIP and Step-LLM are used as the bilingual text encoders.
2B video-text pairs, 3.8B image-text pairs.
Video segmentation: use AdaptiveDetector in PySceneDetect to identify scene changes and use FFmpeg to split them.
Video quality assessment: sample eight frames and evaluate.
Aesthetic score: LAION CLIP-based aesthetic predictor.
NSFW score: LAION CLIP-based NSFW detector.
Watermark detection: EfficientNet-based model.
Subtitle detection: use PaddleOCR to recognize and localize texts.
Saturation score: transform from RGB to HSV and extract the saturation channel.
Blur score: Laplacian variance (see the sketch after these notes).
Black border detection: FFmpeg.
Video motion assessment: mean optical flow.
Video captioning: short caption, dense caption, original title.
Video concept balancing: use K-means to group into 120K clusters.
Video-text alignment: CLIP Score.
Pre-training details.
Data filtering details.
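
Two of the filtering signals listed above (blur via Laplacian variance and saturation via the HSV channel) are simple enough to sketch. This is a hedged OpenCV example, not the team's actual pipeline; the file path is hypothetical.

```python
import cv2
import numpy as np

def blur_score(frame_bgr: np.ndarray) -> float:
    """Laplacian variance of the grayscale frame; low values suggest blur."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def saturation_score(frame_bgr: np.ndarray) -> float:
    """Mean of the HSV saturation channel, normalized to [0, 1]."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    return float(hsv[..., 1].mean() / 255.0)

frame = cv2.imread("sample_frame.jpg")  # hypothetical path
if frame is not None:
    print(blur_score(frame), saturation_score(frame))
```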

[back to top]

[2025-01-23] Improving Video Generation with Human Feedback (arXiv 2025)

Project: Flow-RWR, Flow-DPO

Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang

Organizations: CUHK, Tsinghua University, Kuaishou Technology, Shanghai Jiao Tong University, Shanghai AI Lab

Summary: A human preference video dataset; Adapt diffusion-based reinforcement learning to flow-based video generation models.

[back to top]

[2025-01-23] Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv 2025)

Project: PARM

Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng

Organizations: CUHK, Peking University, Shanghai AI Lab

Summary: Apply Chain-of-Thought to image generation and combine it with reinforcement learning to further improve performance.

ORM is coarse; PRM does not know when to make a decision; PARM combines them.
Self-correction makes use of bad cases.

[back to top]

[2025-01-16] Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (CVPR 2025)

Project: Scaling Analysis

Authors: Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie

Organizations: NYU, MIT, Google

Summary: Analysis of inference-time scaling of diffusion models for image generation along the axes of verifiers and algorithms.

Scaling with search is more effective than scaling denoising steps.
Random Search performs the best because it converges fastest.

[back to top]

[2024-12-24] EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation (arXiv 2024)

Project: FGA-BLIP2

Authors: Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, Chongyi Li

Organizations: Nankai University, ByteDance Inc, NKIARI

Summary: Use 40K image-text pairs with fine-grained human annotations for image generation evaluation.

[back to top]

[2024-12-14] Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection (ICLR 2025)

Project: Z-Sampling

Authors: Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, Zeke Xie

Organizations: The Hong Kong University of Science and Technology (Guangzhou), Mohamed bin Zayed University of Artificial Intelligence, Baidu Inc

Summary: Use the guidance gap between denoising and inversion, and iteratively perform them to improve image generation quality.

Iterative denoising & inversion.
Capture more semantics by denoising more times.
More efficient & effective than standard sampling.

[back to top]

[2024-12-09] Flow Matching Guide and Code (arXiv 2024)

Project: Flow Matching Guide and Code

Authors: Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, Itai Gat

Organizations: FAIR at Meta, MIT CSAIL, Weizmann Institute of Science

Summary: Comprehensive and self-contained review of the flow matching algorithm, covering its mathematical foundations, design choices, extensions, and code implementations.

[back to top]

[2024-12-05] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis (arXiv 2024)

Project: Infinity

Authors: Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu

Organizations: ByteDance

Summary: It improves VAR by applying bitwise modeling that makes the vocabulary effectively infinite, opening up new possibilities for discrete text-to-image generation with the next-scale prediction paradigm.

Visual tokenization and quantization: instead of predicting one of 2^d indices, the infinite-vocabulary classifier predicts d bits (see the sketch after these notes).
Text prompt: T5 text embedding is projected into a token as the input of the first scale.
Performance: Infinity is fast & better.
Tokenizer performance: discrete outperforms continuous.
Infinite-Vocabulary Classifier: low memory & better.
Self-correction mitigates the train-test discrepancy.
Scaling up vocabulary: larger vocabulary & larger model size performs the best.
Scaling model size: strong correlation between validation loss and evaluation metrics.
2D RoPE outperforms APE.
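
A rough sketch of the bitwise idea referenced in the tokenization note above: binary-quantize a d-dimensional latent into d bits and predict each bit independently, instead of classifying over 2^d codes. The module names and sizes are illustrative assumptions, not Infinity's implementation.

```python
import torch
import torch.nn as nn

d = 32  # bits per token; a softmax over 2**32 codes would be infeasible

def binary_quantize(z: torch.Tensor) -> torch.Tensor:
    """Map continuous features (B, N, d) to {-1, +1} bits via their sign."""
    return torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))

class InfiniteVocabClassifier(nn.Module):
    """Predict d independent bits instead of one of 2**d code indices."""
    def __init__(self, hidden: int, num_bits: int):
        super().__init__()
        self.head = nn.Linear(hidden, num_bits)  # one logit per bit

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h)  # (B, N, d) bit logits

hidden = 256
h = torch.randn(2, 64, hidden)                        # transformer features
target_bits = binary_quantize(torch.randn(2, 64, d))  # quantized latent bits
logits = InfiniteVocabClassifier(hidden, d)(h)
loss = nn.functional.binary_cross_entropy_with_logits(logits, (target_bits + 1) / 2)
print(loss.item())
```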

[back to top]

[2024-12-03] HunyuanVideo: A Systematic Framework For Large Video Generative Models (arXiv 2024)

Project: HunyuanVideo

Authors: Hunyuan Foundation Model Team

Organizations: Tencent

Summary: Tencent (Hunyuan Team)'s open-sourced video generation model (13B) using diffusion transformer and conducting fine-grained data curation, captioning, and training scaling.

[back to top]

[2024-11-27] Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds (ICLR 2025)

Project: Reliable-Seed

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

Organizations: EPFL, Stony Brook University

Summary: Noises initialized from reliable seeds result in more accurate image generation (e.g., numeracy and position), and using these generated data for fine-tuning further improves performance.

[back to top]

[2024-10-17] Movie Gen: A Cast of Media Foundation Models (arXiv 2024)

Project: MovieGen

Authors: Adam Polyak et al.

Organizations: Meta

Summary: A diffusion transformer-based model (30B) for 16s / 1080p / 16 fps video and synchronized audio generation.

[back to top]

[2024-10-17] Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (ICLR 2025)

Project: Fluid

Authors: Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian

Organizations: Google DeepMind, MIT

Summary: It shows that auto-regressive models with continuous tokens beat their discrete-token counterparts, and reports interesting empirical observations during scaling.

Image tokenizer: images are converted to either discrete tokens (VQGAN, 16x16 tokens with an 8192 vocabulary) or continuous tokens (VAE, 32x32 with 4 channels, patchified into 256 tokens with 16 channels).
Text tokenizer: text (max length 128) is converted to discrete tokens using T5-XXL.
Model structure: transformer with cross-attention attending to text embedding.
Output heads: the text head uses softmax with cross-entropy loss; the image head is an MLP with a diffusion loss.
Validation losses consistently scale with model size.
Random-order models & continuous tokens perform the best in evaluation scores.
Random-order models & continuous tokens scale well with training compute.
Strong correlation between validation loss & evaluation metrics.

[back to top]

[2024-07-19] T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation (arXiv 2024)

Project: T2V-CompBench

Authors: Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu

Organizations: The University of Hong Kong, The Chinese University of Hong Kong, Huawei Noah's Ark Lab

Summary: Use 1400 prompts to evaluate video generation on compositional generation, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy.

T2V-CompBench: categories (left), evaluation methods (middle), and benchmarking model performance (right).

[back to top]

[2024-07-16] Scaling Diffusion Transformers to 16 Billion Parameters (arXiv 2024)

Project: DiT-MoE

Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang

Organizations: Kunlun Inc.

Summary: A diffusion transformer (16B) with MoE that inserts experts into DiT blocks for image generation.

DiT-MoE is built upon DiT and replaces the MLP of each Transformer block with a sparsely activated mixture of MLPs as experts.

[back to top]

[2024-06-10] Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (arXiv 2024)

Project: LlamaGen

Authors: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

Organizations: The University of Hong Kong, ByteDance

Summary: Show that applying "next-token prediction" to vanilla autoregressive models achieves good class-conditional & text-conditional image generation performance.

[back to top]

[2024-05-23] Learning Multi-dimensional Human Preference for Text-to-Image Generation (CVPR 2024)

Project: MPS

Authors: Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang

Organizations: Kuaishou Technology

Summary: Apply condition upon CLIP to learn multi-dimensional preference score: aesthetics & alignment & detail & overall, trained on a dataset with 92K preference choices of 61K images.

[back to top]

[2024-04-03] Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (NeurIPS 2024 Best Paper)

Project: VAR

Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang

Organizations: Peking University, Bytedance Inc

Summary: Employ next-scale prediction to make auto-regressive models surpass diffusion transformers for image generation in image quality, inference speed, data efficiency, and scalability.

Next-scale prediction: it starts from the 1x1 token map and progressively expands in resolution; at each step, the transformer predicts the next higher-resolution token map conditioned on all previous ones (see the sketch after these notes).
VAR shows good scaling behavior, and significantly outperforms DiT.
Training pipeline of tokenizer and VAR. Tokenizer: the same architecture and training data (OpenImages) as VQVAE, using a codebook of 4096 and a spatial downsample ratio of 16x. VAR: the standard transformer architecture with AdaLN; it does not use RoPE, SwiGLU MLP, or RMSNorm.
Encoding & reconstruction process of tokenizer.
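
A toy sketch of the next-scale prediction loop described above, assuming a hypothetical transformer interface that returns logits for the next token map; the real VAR decodes the multi-scale token maps with its tokenizer rather than returning them directly.

```python
import torch

def toy_transformer(prefix, next_hw, vocab=4096):
    """Stand-in for the VAR transformer: returns random logits for the next
    scale's token map, given all previously generated (coarser) tokens."""
    b = 1 if prefix is None else prefix.shape[0]
    h, w = next_hw
    return torch.randn(b, h * w, vocab)

def generate_next_scale(transformer, scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine sampling: each step predicts the next higher-resolution
    token map conditioned on all coarser ones."""
    token_maps, prefix = [], None
    for s in scales:
        logits = transformer(prefix, next_hw=(s, s))                  # (B, s*s, V)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        token_maps.append(tokens.view(-1, s, s))
        prefix = tokens if prefix is None else torch.cat([prefix, tokens], dim=1)
    return token_maps  # a real VAR decodes these with its multi-scale tokenizer

print([m.shape for m in generate_next_scale(toy_transformer)])
```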

[back to top]

[2024-04-01] Evaluating Text-to-Visual Generation with Image-to-Text Generation (ECCV 2024)

Project: VQAScore

Authors: Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

Organizations: Carnegie Mellon University, Meta

Summary: VQAScore: alignment probability of "yes" answer from a VQA model with CLIP-FlanT5 structure; GenAI-Bench: evaluation benchmark with 1600 prompts for image generation.

Evaluate by computing the probability that a trained VQA model answers "yes" when given the prompt and the generated image, as sketched below.
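
A hedged sketch of the scoring step: given the VQA model's logits at the answer position for a yes/no question about the prompt, VQAScore is the softmax probability of the "Yes" token. The model interface, question wording, and token id here are hypothetical.

```python
import torch

def vqa_score(answer_logits: torch.Tensor, yes_token_id: int) -> float:
    """VQAScore as the softmax probability of the "Yes" token, where
    `answer_logits` (vocab,) are the VQA model's logits at the answer position
    for a question like: Does this figure show "{prompt}"? Answer yes or no."""
    probs = torch.softmax(answer_logits, dim=-1)
    return float(probs[yes_token_id])

# Toy demonstration with random logits and a made-up token id.
print(vqa_score(torch.randn(32000), yes_token_id=3869))
```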

[back to top]

[2024-03-18] Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation (SIGGRAPH Asia 2024)

Project: SD3-Turbo

Authors: Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach

Organizations: Stability AI

Summary: Perform distillation of diffusion models in latent space using teacher-synthetic data and optimizing only an adversarial loss with the teacher as the discriminator.

ADD: (1) An adversarial loss for deceiving a discriminator (DINO v2); (2) A distillation loss for matching denoised output to that of a teacher.
LADD: (1) Use teacher-generated images instead of real images for distillation; (2) Use the teacher as the discriminator.
Advantages: (1) Save memory and improve efficiency; (2) Diffusion model teacher as the discriminator provides noise-level feedback; (3) Diffusion model as the discriminator handles multi-aspect ratio; (4) Diffusion model as the discriminator aligns with human perception on shape.
(1) Training with synthetic data works better than real data. (2) Training on synthetic data only needs the adversarial loss.
(1) Training using LADD performs better than LCM. (2) LCM may need larger models like SDXL-LCM, instead of SD1.5-LCM.
Student model size significantly impacts performance, while the benefits of teacher models and data quality plateau.
Use LoRA for DPO training, and apply DPO-LoRA after LADD training.

[back to top]

[2024-03-08] ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv 2024)

Project: DPG-Bench

Authors: Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu

Organizations: Tencent

Summary: ELLA: Replace CLIP with LLM to understand dense prompts; DPG-Bench: evaluate image generation on dense prompts.

[back to top]

[2023-12-19] InstructVideo: Instructing Video Diffusion Models with Human Feedback (CVPR 2024)

Project: InstructVideo

Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

Organizations: Zhejiang University, Alibaba Group, Tsinghua University, Singapore University of Technology and Design, Nanyang Technological University, University of Cambridge

Summary: Use HPS v2 to provide reward and train video generation models in an editing manner.

It takes reward fine-tuning as an editing task to accelerate training. The reward model is HPS v2. TAR (Temporally Attenuated Reward) assigns larger weights to central frames.

[back to top]

[2023-11-21] Diffusion Model Alignment Using Direct Preference Optimization (CVPR 2024)

(notes in jupyter)

Project: Diffusion-DPO

Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

Organizations: Salesforce AI, Stanford University

Summary: Adapt Direct Preference Optimization (DPO) from large language models to diffusion models for image generation.

[back to top]

[2023-11-29] VBench: Comprehensive Benchmark Suite for Video Generative Models (CVPR 2024)

Project: VBench

Authors: Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

Organizations: Nanyang Technological University, Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Nanjing University

Summary: Evaluate video generation from 16 dimensions within the perspectives of video quality and video-prompt consistency.

[back to top]

[2023-10-17] GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment (NeurIPS 2023)

Project: GenEval

Authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt

Organizations: University of Washington, Allen Institute for AI, LAION

Summary: An object-focused framework for image generation evaluation by providing scores of single object, two objects, counting, colors, position, attribute binding, and overall.

GenEval detects objects using Mask2Former detector and evaluates attributes of them.
Specific evaluation perspectives of GenEval.

[back to top]

[2023-07-12] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation (NeurIPS 2023)

(notes in jupyter)

Project: T2I-CompBench

Authors: Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu

Organizations: The University of Hong Kong, Huawei Noah's Ark Lab

Summary: Use 6000 prompts to train and evaluate image generation on compositional generation, including attribute binding, object relationship, and complex compositions.

Use disentangled BLIP-VQA to evaluate attribute binding, UniDet-based metric to evaluate spatial relationship, CLIPScore to evaluate non-spatial relationship, 3-in-1 metric (average score of the three metrics) to evaluate complex compositions.

[back to top]

[2023-07-04] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (ICLR 2024)

Project: SDXL

Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach

Organizations: Stability AI

Summary: Employ a UNet backbone three times larger than SD's, use resolution & crop coordinates & bucketing information as conditioning for training, and employ a refinement model.

Architecture:
(1) SDXL uses different transformer blocks compared to SD.
(2) Use two text encoders: OpenCLIP ViT-bigG & CLIP ViT-L.
(3) The final model has a 2.6B-parameter UNet and 817M parameters of text encoders.
(4) Embeddings of height & width, cropping top & left, and bucketing height & width are added to the timestep embeddings (see the sketch after this list).
(5) Improve the autoencoder by employing a larger batch size of 256 and EMA.
(6) Use a refinement model (applied in an SDEdit manner) for refining details.
(7) Training: Stage 1: 256 x 256, 600K steps, batch size 2048. Stage 2: 512 x 512, 200K steps. Stage 3: multi-aspect training.
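
A minimal sketch of the size/crop conditioning in item (4) above: each conditioning scalar is embedded with a sinusoidal embedding, the embeddings are concatenated, and the result is projected and added to the timestep embedding. The dimensions and the projection layer are illustrative assumptions, not SDXL's exact code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar condition: (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def micro_conditioning(timestep_emb, orig_hw, crop_tl, target_hw):
    """Embed (h_orig, w_orig, c_top, c_left, h_tgt, w_tgt), concatenate the
    embeddings, project, and add to the timestep embedding."""
    scalars = torch.stack([*orig_hw, *crop_tl, *target_hw], dim=-1)      # (B, 6)
    embs = torch.cat([sinusoidal_embedding(scalars[:, i]) for i in range(6)], dim=-1)
    proj = nn.Linear(embs.shape[-1], timestep_emb.shape[-1])  # part of the network in practice
    return timestep_emb + proj(embs)

t = torch.tensor
out = micro_conditioning(torch.randn(2, 1280),
                         orig_hw=(t([512., 768.]), t([512., 1024.])),
                         crop_tl=(t([0., 0.]), t([0., 64.])),
                         target_hw=(t([1024., 1024.]), t([1024., 1024.])))
print(out.shape)  # torch.Size([2, 1280])
```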

[back to top]

[2023-06-15] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis (arXiv 2023)

(notes in jupyter)

Project: HPS v2

Authors: Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, Hongsheng Li

Organizations: CUHK, SenseTime Research, Shanghai Jiao Tong University, Centre for Perceptual and Interactive Intelligence

Summary: HPD v2: 798K binary human preference choices on 433K pairs of generated images; HPS v2: use HPD v2 to fine-tune CLIP for image generation evaluation.

Step 1: Clean prompts from COCO captions and DiffusionDB with ChatGPT; Step 2: Generate images using 9 models; Step 3: Rank and annotate each pair of images; Step 4: Train CLIP to obtain the preference model that provides the HPS v2 score.

[back to top]

[2023-05-02] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation (NeurIPS 2023)

(notes in jupyter)

Project: PickScore

Authors: Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy

Organizations: Tel Aviv University, Stability AI

Summary: Pick-a-Pic: use a web app to collect user preferences; PickScore: train a CLIP-based model for image generation evaluation.

[back to top]

[2023-04-12] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation (NeurIPS 2023)

(notes in jupyter)

Project: ImageReward

Authors: Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong

Organizations: Tsinghua University, Zhipu AI, Beijing University of Posts and Telecommunications

Summary: Train BLIP on 137K human preference image pairs for image generation and use it to tune diffusion models by Reward Feedback Learning (ReFL).

Step 1: Use DiffusionDB prompts to generate images; Step 2: Rate and rank images; Step 3: Train ImageReward using the ranking data; Step 4: Tune diffusion models using ImageReward; Step 5: Provide scores for the generated images.

[back to top]

[2023-03-25] Human Preference Score: Better Aligning Text-to-Image Models with Human Preference (ICCV 2023)

Project: HPS

Authors: Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li

Organizations: CUHK, SenseTime Research, Shanghai Jiao Tong University, Centre for Perceptual and Interactive Intelligence, Shanghai AI Lab

Summary: Fine-tune CLIP using 98K annotated SD-generated images from 25K prompts for image generation evaluation.

Train a human preference classifier: similar to CLIP training, except that the sample with the highest preference is taken as the positive; append a special token to the prompts of worse images and train an SD LoRA; remove the special token during inference.

[back to top]

[2022-12-19] Scalable Diffusion Models with Transformers (ICCV 2023)

(notes in jupyter)

Project: DiT

Authors: William Peebles, Saining Xie

Organizations: UC Berkeley, New York University

Summary: Replace the U-Net with a transformer for scalable image generation; the timestep and prompt are injected via the adaLN-Zero structure.

Using the adaLN-Zero structure to inject the timestep and prompt works better than cross-attention or in-context conditioning.
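
A simplified sketch of adaLN-Zero: a zero-initialized linear layer maps the conditioning embedding to per-block shift/scale/gate parameters so each block starts as the identity. The real DiT block also modulates the MLP branch (six parameters rather than three); sizes here are illustrative.

```python
import torch
import torch.nn as nn

class AdaLNZeroAttentionBlock(nn.Module):
    """A zero-initialized linear layer maps the conditioning embedding to
    shift/scale/gate parameters, so the block starts as the identity."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mod = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.mod.weight)
        nn.init.zeros_(self.mod.bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        shift, scale, gate = self.mod(nn.functional.silu(c)).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * attn_out  # gate starts at 0 -> identity

x = torch.randn(2, 16, 384)  # (batch, tokens, dim)
c = torch.randn(2, 384)      # combined timestep + condition embedding
print(AdaLNZeroAttentionBlock(384)(x, c).shape)
```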

[back to top]

[2022-12-19] Optimizing Prompts for Text-to-Image Generation (NeurIPS 2023)

Project: promptist

Authors: Yaru Hao, Zewen Chi, Li Dong, Furu Wei

Organizations: Microsoft Research

Summary: Use LLM to refine prompts for preference-aligned image generation by taking relevance and aesthetics as reward.

Step 1: Fine-tune a language model (LM) to optimize prompts; Step 2: Further fine-tune the LM with PPO, where aesthetics and relevance are the reward.

[back to top]

[2022-10-06] Flow Matching for Generative Modeling (ICLR 2023)

Project: Flow Matching

Authors: Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le

Organizations: Meta AI (FAIR), Weizmann Institute of Science

Summary: A class of generative models built on continuous normalizing flows that learns a time-dependent vector field transporting data from the source distribution to the target distribution.
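
A minimal conditional flow matching loss under the simplest setting (linear path, sigma_min = 0); the paper's conditional OT path uses a small sigma_min, and the model interface below is an assumption.

```python
import torch

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear path (sigma_min = 0):
    x_t = (1 - t) * x0 + t * x1, target velocity u_t = x1 - x0,
    where x0 is noise and `model(x_t, t)` predicts the velocity field."""
    x0 = torch.randn_like(x1)                      # source (noise) sample
    t = torch.rand(x1.shape[0], device=x1.device)  # t ~ U[0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1
    target = x1 - x0
    return ((model(x_t, t) - target) ** 2).mean()

toy_model = lambda x_t, t: torch.zeros_like(x_t)   # stand-in for a real network
print(flow_matching_loss(toy_model, torch.randn(4, 3, 8, 8)).item())
```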

[back to top]

[2022-08-25] Understanding Diffusion Models: A Unified Perspective (arXiv 2022)

Project: Unified Perspective

Authors: Calvin Luo

Organizations: Google Brain

Summary: Introduction to VAE, DDPM, score-based generative models, and guidance from a unified generative perspective.

[back to top]

[2022-05-29] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (ICLR 2023)

Project: CogVideo

Authors: Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

Organizations: Tsinghua University, BAAI

Summary: An open-sourced transformer-based video generation model (9B) that auto-regressively generates frame sequences and then performs auto-regressive frame interpolation.

It is initialized from a pretrained CogView2. In stage 1, it generates frames sequentially. In stage 2, it recursively interpolates frames.

[back to top]

[2021-12-20] High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022)

Project: LDM

Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

Organizations: Heidelberg University, Runway ML

Summary: Efficient high-quality image generation by applying diffusion and denoising processes in the VAE latent space.

LDM performs the diffusion process in the latent space of VAE.

[back to top]

[2021-12-08] Classifier-Free Diffusion Guidance (NeurIPS workshop 2021)

(notes in jupyter)

Project: CFG

Authors: Jonathan Ho, Tim Salimans

Organizations: Google Research, Brain team

Summary: Image generation with classifier-free condition guidance by jointly training a conditional model and an unconditional model.
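
A one-line sketch of the guidance combination in the common parameterization eps_uncond + w * (eps_cond - eps_uncond); the paper writes the equivalent family as (1 + w) * eps_cond - w * eps_uncond, i.e., with a shifted guidance scale.

```python
import torch

def cfg_combine(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, w: float) -> torch.Tensor:
    """Extrapolate from the unconditional prediction toward the conditional one:
    w = 0 is unconditional, w = 1 is the plain conditional model, and w > 1
    strengthens the condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# During training the condition is randomly dropped (e.g., ~10% of the time)
# so a single network provides both predictions at sampling time.
print(cfg_combine(torch.zeros(1, 4), torch.ones(1, 4), w=7.5))
```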

[back to top]

[2021-04-18] CLIPScore: A Reference-free Evaluation Metric for Image Captioning (EMNLP 2021)

(notes in jupyter)

Project: CLIPScore

Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi

Organizations: Allen Institute for AI, University of Washington

Summary: A reference-free metric mainly focusing on semantic alignment for image generation evaluation.

CLIPScore is free from the shortcomings of n-gram matching, which disfavors good captions that use new words and favors captions that use familiar words.
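
A small numpy sketch of the metric: CLIPScore rescales the cosine similarity between CLIP image and text embeddings, clipped at zero (the paper uses w = 2.5). The random vectors below are placeholders for real CLIP embeddings.

```python
import numpy as np

def clip_score(image_feat: np.ndarray, text_feat: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore(c, v) = w * max(cos(c, v), 0) on CLIP embeddings."""
    c = image_feat / np.linalg.norm(image_feat)
    v = text_feat / np.linalg.norm(text_feat)
    return w * max(float(c @ v), 0.0)

# Random vectors as placeholders for real CLIP image/text embeddings.
print(clip_score(np.random.randn(512), np.random.randn(512)))
```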

[back to top]

[2020-10-06] Denoising Diffusion Implicit Models (ICLR 2021)

(notes in jupyter)

Project: DDIM

Authors: Jiaming Song, Chenlin Meng, Stefano Ermon

Organizations: Stanford University

Summary: Accelerate sampling of diffusion models by introducing a non-Markovian, deterministic process that achieves high-quality results with fewer steps while preserving training consistency.

Comparisons between Markovian DDPM (left) and non-Markovian DDIM (right).
Accelerate sampling by skipping time steps.
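
A compact sketch of one DDIM update from the model's noise prediction; eta = 0 gives the deterministic sampler, and the alpha-bar values below are toy numbers rather than a real schedule.

```python
import torch

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev, eta: float = 0.0):
    """One DDIM update from the predicted noise `eps`; eta = 0 is deterministic,
    eta = 1 recovers DDPM-like stochasticity. alpha_bar_* are cumulative
    products of (1 - beta) at the current and previous (possibly skipped) steps."""
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    sigma = eta * (((1 - alpha_bar_prev) / (1 - alpha_bar_t))
                   * (1 - alpha_bar_t / alpha_bar_prev)).sqrt()
    direction = (1 - alpha_bar_prev - sigma ** 2).sqrt() * eps
    return alpha_bar_prev.sqrt() * x0_pred + direction + sigma * torch.randn_like(x_t)

x_t = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x_t)
print(ddim_step(x_t, eps, torch.tensor(0.5), torch.tensor(0.7)).shape)
```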

[back to top]

[2020-06-19] Denoising Diffusion Probabilistic Models (NeurIPS 2020)

(notes in jupyter)

Project: DDPM

Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel

Organizations: UC Berkeley

Summary: A denoising diffusion probabilistic model that iteratively denoises data from random noise for image generation.

Forward and reverse processes of DDPM.
Training and sampling algorithms of DDPM.
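
A minimal sketch of the simplified DDPM training objective (epsilon prediction with an MSE loss); the toy model below stands in for a real denoising network.

```python
import torch

def ddpm_training_loss(model, x0: torch.Tensor, alpha_bars: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM objective: corrupt x0 at a random timestep and regress
    the added noise with an MSE loss (L_simple). `model(x_t, t)` predicts eps."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (b,), device=x0.device)
    a = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward (diffusion) process
    return ((model(x_t, t) - noise) ** 2).mean()

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1 - betas, dim=0)
toy_model = lambda x_t, t: torch.zeros_like(x_t)   # stand-in for a real U-Net
print(ddpm_training_loss(toy_model, torch.randn(4, 3, 8, 8), alpha_bars).item())
```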

[back to top]

[2019-06-02] Generating Diverse High-Fidelity Images with VQ-VAE-2 (NeurIPS 2019)

Project: VQ-VAE-2

Authors: Ali Razavi, Aaron van den Oord, Oriol Vinyals

Organizations: DeepMind

Summary: Introduce a hierarchical VQ-VAE and perform sampling in the compressed latent space (which is faster than sampling in pixel space), inspired by the idea of lossy compression.

[back to top]

[2019-05-04] FVD: A new Metric for Video Generation (ICLR workshop 2019)

Project: FVD

Authors: Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, Sylvain Gelly

Organizations: Johannes Kepler University, IDSIA, Google Brain

Summary: Extend FID for video generation evaluation by replacing 2D InceptionNet with pre-trained Inflated 3D convnet.

[back to top]

[2017-11-02] Neural Discrete Representation Learning (NeurIPS 2017)

Project: VQ-VAE

Authors: Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu

Organizations: DeepMind

Summary: Propose vector quantised variational autoencoder to generate discrete codes while the prior is also learned.

[back to top]

[2017-06-26] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (NeurIPS 2017)

(notes in jupyter)

Project: FID

Authors: Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter

Organizations: Johannes Kepler University Linz

Summary: Calculate Fréchet distance between Gaussian distributions of InceptionNet feature maps of real-world data and synthetic data for image generation evaluation.
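
A short numpy/scipy sketch of the metric from the summary above; in practice the features come from a pre-trained InceptionNet, whereas the arrays below are random placeholders.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}) over (N, D) features."""
    mu_r, mu_g = feats_real.mean(0), feats_fake.mean(0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2 * covmean))

# Random placeholders for InceptionNet features of real and generated images.
print(fid(np.random.randn(256, 64), np.random.randn(256, 64) + 0.1))
```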

[back to top]

[2016-06-10] Improved Techniques for Training GANs (NeurIPS 2016)

(notes in jupyter)

Project: Inception Score

Authors: Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen

Organizations: OpenAI

Summary: Calculate the KL divergence between p(y|x) and p(y), rewarding low entropy of p(y|x) for each sample and high entropy of the marginal p(y) across classes, for image generation evaluation.
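
A small numpy sketch of the score computed from a classifier's per-sample class probabilities; real usage feeds Inception-v3 predictions and typically averages over several splits, which this sketch omits.

```python
import numpy as np

def inception_score(p_yx: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), given per-sample class
    probabilities p_yx with shape (N, num_classes)."""
    p_y = p_yx.mean(axis=0, keepdims=True)                        # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Toy probabilities from a softmax over random logits.
logits = np.random.randn(128, 10)
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(inception_score(p))
```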

[back to top]