Junkun Yuan  

Researcher

Hunyuan Foundation Model Team at Tencent

Living and working in Shenzhen🚀, China

yuanjk0921@outlook.com


🤗 Try our video generation foundation model at HunyuanVideo

🤗 Try our image generation foundation model at HunyuanDiT

Biography

I have been a researcher on the Hunyuan Foundation Model Team at Tencent since 2024.07, working on visual generative foundation models.

My research interests include visual & multimodal foundation models and their downstream applications.

From 2023.09 to 2024.06, I was an intern on the Hunyuan Foundation Model Team at Tencent, working with Wei Liu.

From 2022.07 to 2023.08, I was an intern with the Baidu Computer Vision Group, working with Xinyu Zhang and Jingdong Wang.

I received my Ph.D. from Zhejiang University in 2024.06, supervised by Prof. Kun Kuang, Prof. Lanfen Lin, and Prof. Fei Wu.

Projects

[2024.12, HunyuanVideo]   [paper] [product] [project]
HunyuanVideo is the first open-sourced large video generation model (13B parameters). It is pre-trained on hundreds of millions of hierarchically filtered samples with structured captions, and outperforms Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models.

[2024.05, Hunyuan-DiT]   [paper] [product] [project]
Hunyuan-DiT is an open-sourced large image generation model built on a diffusion transformer architecture. It is pre-trained on billions of high-quality images with refined English and Chinese captions, enabling it to understand prompts in both languages.

[2022.02, Awesome-Domain-Generalization]   [project]
I read and organized research papers and resources on visual domain generalization up to 2023.01 (which may be outdated😅). You may refer to this repo if you are interested in the topic, and you are welcome to contribute by adding the latest papers.

[2024.12, Follow-Your-Emoji (SIGGRAPH-Asia 2024)]   [paper] [project]
Follow-Your-Emoji is a diffusion-based portrait animation method. It employs expression-aware landmarks and a facial fine-grained loss to animate a reference portrait with target landmark sequences while preserving identity, temporal consistency, and fidelity.

[2025.02, Follow-Your-Canvas (AAAI 2025)]   [paper] [project]
Follow-Your-Canvas explores higher-resolution video outpainting with extensive content generation. It achieves this by employing a spatial window strategy with positional relation learning. It excels in large-scale video outpainting, e.g., from 512 x 512 to 1152 x 2048 (9x).

[2023.12, HAP (NeurIPS 2023)]   [paper] [project]
HAP is the first to introduce masked image modeling as a pre-training framework for human-centric perception. It incorporates human structure priors into pre-training via a pose-guided mask sampling strategy and a human structure-invariant alignment loss.

Publications


(✳ Equal Contribution; ✉ Corresponding Author.)

2025

Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, Wei Liu

[2025.02, AAAI]   AAAI Conference on Artificial Intelligence

(conditional video generation) It explores higher-resolution video outpainting with extensive content generation. It achieves this by employing a spatial window strategy with positional relation learning. It excels in large-scale video outpainting, e.g., from 512 x 512 to 1152 x 2048 (9x).

[arXiv] [Project] [Code]

2024

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Hunyuan Foundation Model Team

[2024.12, Technical Report]   

(video generation foundation model) It presents HunyuanVideo as a foundation model, which outperforms Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models.

[arXiv] [Try it] [Code]

Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen

[2024.12, SIGGRAPH-Asia]   Computer Graphics and Interactive Techniques-Asia

(conditional video generation) A diffusion-based framework using expression-aware landmarks and a facial fine-grained loss for portrait animation, which animates a reference portrait with target landmark sequences while preserving identity, temporal consistency, and fidelity.

[arXiv] [PDF] [Project] [Code]

Mutual Prompt Learning for Vision Language Models

Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Jingyuan Feng, Shengsheng Wang, Jingdong Wang

[2024.09, IJCV]   International Journal of Computer Vision

(prompt learning) It presents a fine-grained text prompt that decomposes image features into finer-grained semantics, and a text-reorganized visual prompt that attends to class-related representations.

[PDF]

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models

Didi Zhu, Zexi Li, Min Zhang, Junkun Yuan, Jiashuo Liu, Kun Kuang, Chao Wu

[2024.08, KDD]   ACM SIGKDD Conference on Knowledge Discovery and Data Mining

(prompt learning) It makes the first attempt to borrow the idea of neural collapse to improve the representations of multimodal models by means of prompt learning. It encourages multimodal representations to exhibit an equiangular tight frame structure, improving the model's generalization performance.

[arXiv]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Hunyuan Foundation Model Team

[2024.05, Technical Report]   

(image generation foundation model) It presents Hunyuan-DiT as a foundation model, which is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese.

[arXiv] [Try it] [Code]

DomainDiff: Boost Out-of-Distribution Generalization with Synthetic Data

Qiaowei Miao, Junkun Yuan, Shengyu Zhang, Fei Wu, Kun Kuang

[2024.04, ICASSP]   International Conference on Acoustics, Speech, and Signal Processing

(out-of-distribution generalization) It employs diffusion models to synthesize data for improving OOD generalization performance.

[PDF]

2023

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

[2023.12, NeurIPS]    Advances in Neural Information Processing Systems

(visual self-supervised learning) It is the first to introduce masked image modeling as a pre-training framework for human-centric perception. It incorporates human structure priors into pre-training via a pose-guided mask sampling strategy and a human structure-invariant alignment loss.

[arXiv] [PDF] [Poster] [Project] [Code]

Collaborative Semantic Aggregation and Calibration for Federated Domain Generalization

Junkun Yuan, Xu Ma, Defang Chen, Fei Wu, Lanfen Lin, Kun Kuang

[2023.12, TKDE]    IEEE Transactions on Knowledge and Data Engineering

(out-of-distribution generalization) To protect data privacy, it solves the federated domain generalization task by proposing a collaborative semantic aggregation and calibration method with local semantic acquisition, data-free semantic aggregation, and cross-layer semantic calibration.

[arXiv] [PDF] [Code]

MAP: Towards Balanced Generalization of IID and OOD through Model-Agnostic Adapters

Min Zhang, Junkun Yuan, Yue He, Wenbin Li, Zhengyu Chen, Kun Kuang

[2023.10, ICCV]   International Conference on Computer Vision

(out-of-distribution generalization) It presents empirical evidence that existing OOD methods exhibit subpar performance when faced with minor distribution shifts, and proposes a bilevel optimization strategy, leveraging adapters to strike a balance between IID and OOD generalization.

[PDF] [Code]

Universal Domain Adaptation via Compressive Attention Matching

Didi Zhu, Yinchuan Li, Junkun Yuan, Zexi Li, Kun Kuang, Chao Wu

[2023.10, ICCV]   International Conference on Computer Vision

(domain adaptation) To better align common features and separate target classes for universal domain adaptation, it leverages attention maps in ViTs to perform prediction.

[arXiv] [PDF]

Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models

Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang

[2023.10, ICCV]   International Conference on Computer Vision

(prompt learning) It presents a class-aware text prompt to enrich generated prompts with label-related image information, and makes the image branch attend to class-related representations. The text and image branches mutually promote to enhance the adaptation of vision-language models.

[arXiv] [PDF]

CAE v2: Context Autoencoder with CLIP Latent Alignment

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

[2023.09, TMLR]    Transactions on Machine Learning Research

(visual self-supervised learning) It proposes CAE v2, a new CLIP-guided masked image modeling method, which (i) uses both masked and visible tokens as the distillation target, and (ii) employs a masking ratio proportional to the model size, outperforming existing state-of-the-art methods.

[arXiv] [PDF] [Code]

Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization

Yunze Tong, Junkun Yuan, Min Zhang, Didi Zhu, Keli Zhang, Fei Wu, Kun Kuang

[2023.08, KDD]   ACM SIGKDD Conference on Knowledge Discovery and Data Mining

(out-of-distribution generalization) It tackles the label heterogeneity problem, i.e., that ground-truth labels may be sub-optimal, and proposes a method to quantitatively measure and address it.

[arXiv] [PDF] [Code]

Instrumental Variable-Driven Domain Generalization with Unobserved Confounders

Junkun Yuan, Xu Ma, Ruoxuan Xiong, Mingming Gong, Xiangyu Liu, Fei Wu, Lanfen Lin, Kun Kuang

[2023.06, TKDD]    ACM Transactions on Knowledge Discovery from Data

(out-of-distribution generalization, causal learning) It provides a causal view on domain generalization that separates domain-invariant and domain-specific parts of data. It then proposes to learn the domain-invariant relationship between the input features and the labels through an instrumental variable method, which removes unobserved confounders and yields a generalizable model.

[arXiv] [Slides]

Knowledge Distillation-based Domain-invariant Representation Learning for Domain Generalization

Ziwei Niu, Junkun Yuan, Xu Ma, Yingying Xu, Jing Liu, Yen-Wei Chen, Ruofeng Tong, Lanfen Lin

[2023.04, TMM]   IEEE Transactions on Multimedia

(out-of-distribution generalization) Inspired by knowledge distillation, it performs two-stage distillation with (i) multiple students, each trained on one dataset, and (ii) one leader student.

[PDF]

2022

Domain-Specific Bias Filtering for Single Labeled Domain Generalization

Junkun Yuan, Xu Ma, Defang Chen, Kun Kuang, Fei Wu, Lanfen Lin

[2022.11, IJCV]   International Journal of Computer Vision

(out-of-distribution generalization) It proposes a task called single labeled domain generalization where only one source dataset is labeled, as well as a method named domain-specific bias filtering.

[arXiv] [PDF] [Code]

Label-Efficient Domain Generalization via Collaborative Exploration and Generalization

Junkun Yuan, Xu Ma, Defang Chen, Kun Kuang, Fei Wu, Lanfen Lin

[2022.10, MM]   International Conference on Multimedia

(out-of-distribution generalization) To enable generalization learning with limited annotation, it proposes a framework called collaborative exploration and generalization, which jointly optimizes active exploration and semi-supervised generalization for the label-efficient domain generalization task.

[arXiv] [PDF] [Code]

Domain Generalization via Contrastive Causal Learning

Qiaowei Miao, Junkun Yuan, Kun Kuang

[2022.10, Technical Report]   

(out-of-distribution generalization, causal learning) It proposes a generalizable causal model by controlling the unstable domain factor and quantifying causal effects with the front-door criterion.

[arXiv] [Code]

Attention-based Cross-Layer Domain Alignment for Unsupervised Domain Adaptation

Xu Ma, Junkun Yuan, Yen-wei Chen, Ruofeng Tong, Lanfen Lin

[2022.08, Neurocomputing]   

(domain adaptation) It proposes attention-based cross-layer domain alignment, which captures the semantic relationship between the source domain and the target domain across model layers and calibrates each level of semantic information automatically through a dynamic attention mechanism.

[arXiv] [PDF] [Code]

Learning Decomposed Representations for Treatment Effect Estimation

Anpeng Wu, Junkun Yuan, Kun Kuang, Bo Li, Pan Zhou, Jianrong Tao, Qiang Zhu, Yueting Zhuang, Fei Wu

[2022.02, TKDE]   IEEE Transactions on Knowledge and Data Engineering

(causal reasoning) It separates confounders by learning representations of confounders and non-confounders, balances confounders via sample re-weighting, and estimates treatment effects.

[arXiv] [PDF] [Code]

Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition

Junkun Yuan, Anpeng Wu, Kun Kuang, Bo Li, Runze Wu, Fei Wu, Lanfen Lin

[2022.01, TKDD]   ACM Transactions on Knowledge Discovery from Data

(causal reasoning) Instrumental variables (IVs) are a powerful tool for causal inference, but valid IVs are hard to pre-define. It proposes an automatic instrumental variable decomposition algorithm to generate effective IV representations from observed variables for IV-based counterfactual prediction.

[arXiv] [PDF] [Code]

2021 and before

Subgraph Networks with Application to Structural Feature Space Expansion

Qi Xuan, Jinhuan Wang, Minghao Zhao, Junkun Yuan, Chenbo Fu, Zhongyuan Ruan, Guanrong Chen

[2021.12, TKDE]   IEEE Transactions on Knowledge and Data Engineering

(network embedding) It introduces the concept of a subgraph network (SGN) and constructs 1st- and 2nd-order SGNs. SGNs provide structural features complementary to those extracted by advanced network-embedding methods, largely improving performance on the network classification task.

[arXiv] [PDF] [Code]

Black-box Adversarial Attacks Against Deep Learning Based Malware Binaries Detection with GAN

Junkun Yuan, Shaofang Zhou, Lanfen Lin, Feng Wang, Jia Cui

[2020.08, ECAI]   European Conference on Artificial Intelligence

(adversarial attack) To attack malware detectors, it proposes a black-box attack framework called GAPGAN that generates adversarial payloads via GANs. The generated payloads are appended to malware binaries to craft seemingly benign binaries while preserving the original attack functionality.

[PDF] [Slides]

CNN-based DGA Detection with High Coverage

Shaofang Zhou, Lanfen Lin, Junkun Yuan, Feng Wang, Zhaoting Ling, Jia Cui

[2019.07, ISI]   International Conference on Intelligence and Security Informatics

(adversarial attack) Attackers use domain generation algorithms (DGAs) to create domains for attacks. It presents a temporal convolutional network-based method that detects generated domains with high accuracy and coverage.

[PDF]


Last updated on Dec. 31, 2024.