🤝 InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion

KAIST¹,   Codec Avatars Lab, Meta²,   Imperial College London³
CVPR 2024

🤝 InterHandGen generates two-hand interactions, with or without an object, via a novel cascaded reverse diffusion process. This generative prior can be incorporated into any optimization or learning method to reduce ambiguity in ill-posed setups.



Abstract

We present 🤝InterHandGen, a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction, with or without an object. Our prior can be incorporated into any optimization or learning method to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we propose to decompose joint distribution modeling into the modeling of factored unconditional and conditional single-instance distributions. In particular, we introduce a diffusion model that learns the single-hand distribution both unconditionally and conditioned on the other hand via conditioning dropout. For sampling, we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore, we establish a rigorous evaluation protocol for two-hand synthesis, on which our method significantly outperforms baseline generative models in terms of plausibility and diversity. We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images, achieving new state-of-the-art accuracy.


🤲 Two-Hand Interaction Generation

🤝InterHandGen generates plausible and diverse two-hand interactions. Please also refer to the paper for quantitative comparisons with baselines, where we establish a rigorous evaluation protocol for two-hand synthesis.
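To make the cascade concrete, below is a minimal sketch of the two-stage sampling procedure, assuming a single trained noise predictor eps_model(x_t, t, cond) with a learned null token null_cond. All names, the DDPM schedule, and the guidance scale are illustrative placeholders rather than the released code, and the anti-penetration guidance term is omitted for brevity.

import torch

@torch.no_grad()
def sample_hand(eps_model, cond, null_cond, w, betas, shape):
    # DDPM reverse diffusion with classifier-free guidance of scale w.
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        tb = torch.full((shape[0],), t, dtype=torch.long)
        # Blend conditional and unconditional noise predictions.
        eps = (1 + w) * eps_model(x, tb, cond) - w * eps_model(x, tb, null_cond)
        # Standard DDPM posterior mean for x_{t-1}.
        mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean if t == 0 else mean + betas[t].sqrt() * torch.randn_like(x)
    return x

# Cascade: sample the (side-normalized) first hand unconditionally,
# then the second hand conditioned on the first (D is the hand-parameter dim):
# x_l = sample_hand(eps_model, null_cond, null_cond, w=0.0, betas=betas, shape=(1, D))
# x_r = sample_hand(eps_model, x_l, null_cond, w=2.0, betas=betas, shape=(1, D))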



⚾️ Two-Hand-Object Interaction Generation

The learning formulation of 🤝InterHandGen extends naturally to object-conditioned two-hand interaction generation, where it is shown to model plausible and diverse bimanual hand-object interactions.
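Concretely, the factorization described in the learning formulation below carries over by conditioning every factor on the object: writing \(\mathbf{x}_{l}\) and \(\mathbf{x}_{r}\) for the two hands and \(\mathbf{o}\) for the object, \(p_{\phi}(\mathbf{x}_{l}, \mathbf{x}_{r} \mid \mathbf{o}) = p_{\phi}(\mathbf{x}_{l} \mid \mathbf{o})\, p_{\phi}(\mathbf{x}_{r} \mid \mathbf{x}_{l}, \mathbf{o})\).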



📸 In-the-Wild Two-Hand Reconstruction from RGB

🤝 InterHandGen can be incorporated as a prior into any optimization or learning method via a Score Distillation Sampling (SDS)-like loss, achieving new state-of-the-art accuracy on two-hand reconstruction from in-the-wild images.
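As a rough sketch of how such an SDS-like prior term can be wired into an optimizer, the snippet below noises the current two-hand estimate at a random timestep and uses the frozen diffusion model's noise-prediction error as a gradient on the estimate. Function and argument names are assumptions for illustration, and the paper's exact weighting and timestep schedule may differ.

import torch

def sds_like_prior_loss(eps_model, x, cond, betas):
    # Noise the current estimate x and penalize the gap between the predicted
    # and injected noise (stop-gradient through the diffusion model), so the
    # gradient pulls x toward the learned two-hand prior.
    abar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, len(betas), (x.shape[0],))
    noise = torch.randn_like(x)
    a = abar[t].unsqueeze(-1)
    x_t = a.sqrt() * x.detach() + (1.0 - a).sqrt() * noise
    with torch.no_grad():
        grad = eps_model(x_t, t, cond) - noise       # SDS-style gradient
    return (x * grad).sum()                          # d(loss)/dx == grad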


🛠 Learning Formulation and Network Architecture

Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus, we reformulate two-hand distribution modeling as modeling the factored single-hand distributions, unconditional and conditioned on the other hand, such that \(p_{\phi}(\mathbf{x}_{l}, \mathbf{x}_{r}) = p_{\phi}(\mathbf{x}_{l})\, p_{\phi}(\mathbf{x}_{r} \mid \mathbf{x}_{l})\), which reduces the dimensionality of each generation target. After normalizing the hand side, we jointly learn the resulting unconditional and conditional distributions with a single network via conditioning dropout.
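A minimal sketch of this joint training with conditioning dropout is given below, assuming a standard epsilon-prediction diffusion objective; eps_model, null_cond, and p_drop are illustrative placeholders, not the released code.

import torch

def training_step(eps_model, x, cond, null_cond, betas, p_drop=0.1):
    # Epsilon-prediction diffusion loss; with probability p_drop the
    # conditioning hand is replaced by a learned null token, so one network
    # fits both the unconditional and the conditional factor.
    abar = torch.cumprod(1.0 - betas, dim=0)
    B = x.shape[0]
    t = torch.randint(0, len(betas), (B,))
    noise = torch.randn_like(x)
    a = abar[t].unsqueeze(-1)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * noise
    drop = torch.rand(B, 1) < p_drop                 # per-sample dropout mask
    cond_in = torch.where(drop, null_cond.expand_as(cond), cond)
    return ((eps_model(x_t, t, cond_in) - noise) ** 2).mean()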

BibTeX

@inproceedings{lee2024interhandgen,
    title={InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion},
    author={Lee, Jihyun and Saito, Shunsuke and Nam, Giljoo and Sung, Minhyuk and Kim, Tae-Kyun},
    booktitle={CVPR},
    year={2024}
}