Nanyang Technological University¹ · ByteDance² · National University of Singapore³
*Part of this work was done during an internship at ByteDance.
Despite recent progress, video generative models still struggle to generate delicate human actions (e.g., gymnastics), particularly when they must start from a user-provided reference image. In this paper, we explore the task of learning to animate images into videos that portray delicate human actions using a small number of videos (16 or fewer), which reduces the need for extensive data collection and enhances practicality for real-world applications. Learning generalizable motion patterns that transition smoothly from user-provided reference images is highly challenging in such a few-shot setting. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which improves motion generalization by training the model to reconstruct a video from the motion features and cross-frame correspondences extracted from another video with the same motion but a different appearance. This encourages the model to learn transferable motion and mitigates overfitting to the appearance of the limited training data. Additionally, FLASH extends the decoder with additional layers that propagate details from the reference image to the generated frames, improving transition smoothness. Human judges significantly favor FLASH, with 65.78% of 488 responses preferring FLASH over the baselines.
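To make the core training idea concrete, below is a minimal PyTorch sketch of reconstructing a video from the motion features of an appearance-augmented copy of itself, so that the motion representation cannot rely on appearance cues. All module names, the toy architecture, and the color/scale augmentation are hypothetical illustrations under our own assumptions; the actual FLASH model, its cross-frame correspondence alignment, and the detail enhancement decoder are not shown here.

```python
# Hypothetical sketch (not the released FLASH code) of reconstructing a target
# video from the motion features of an appearance-altered copy of the same video.
import torch
import torch.nn as nn


class ToyAnimator(nn.Module):
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        # Encodes per-frame motion cues from the appearance-augmented video.
        self.motion_encoder = nn.Conv3d(channels, hidden, kernel_size=3, padding=1)
        # Encodes the user-provided reference frame (appearance source).
        self.ref_encoder = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        # Fuses motion and reference features and predicts the output video.
        self.decoder = nn.Conv3d(2 * hidden, channels, kernel_size=3, padding=1)

    def forward(self, motion_video, ref_image):
        # motion_video: (B, C, T, H, W); ref_image: (B, C, H, W)
        motion_feat = self.motion_encoder(motion_video)
        ref_feat = self.ref_encoder(ref_image).unsqueeze(2)           # (B, hidden, 1, H, W)
        ref_feat = ref_feat.expand(-1, -1, motion_feat.shape[2], -1, -1)
        return self.decoder(torch.cat([motion_feat, ref_feat], dim=1))


def strong_appearance_augment(video):
    # Placeholder "strong augmentation": random channel scaling and shuffling
    # changes appearance while leaving the underlying motion untouched.
    b, c = video.shape[:2]
    scale = 0.5 + torch.rand(b, c, 1, 1, 1, device=video.device)
    perm = torch.randperm(c)
    return video[:, perm] * scale


def training_step(model, video, optimizer):
    # Motion features come from the augmented copy; appearance comes only from
    # the first frame of the original video (standing in for the reference image).
    augmented = strong_appearance_augment(video)
    ref_image = video[:, :, 0]
    recon = model(augmented, ref_image)
    loss = nn.functional.mse_loss(recon, video)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = ToyAnimator()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    dummy_video = torch.rand(2, 3, 8, 64, 64)  # (batch, channels, frames, H, W)
    print("loss:", training_step(model, dummy_video, opt))
```

The key design point in this sketch is that the reconstruction target and the source of motion features differ only in appearance, which pushes the motion encoder toward appearance-invariant, transferable motion representations.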
Ablation variants (Strong Augment, Motion Align, and Correspondence Align are components of the Motion Alignment Module):

| Variant | Strong Augment | Motion Align | Correspondence Align | Detail Enhancement Decoder |
|---|---|---|---|---|
| #1 | × | × | × | × |
| #2 | ✓ | × | × | × |
| #3 | ✓ | ✓ | × | × |
| #4 | ✓ | × | ✓ | × |
| #5 | ✓ | ✓ | ✓ | × |
| #6 | ✓ | ✓ | ✓ | ✓ |
@inproceedings{FLASH-2025,
  author    = {Haoxin Li and Yingchen Yu and Qilong Wu and Hanwang Zhang and Song Bai and Boyang Li},
  title     = {Learning to Animate Images from A Few Videos to Portray Delicate Human Actions},
  booktitle = {Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}