Temporal-Modal Replay for Continual
Audio-Conditioned 3D Motion Generation

Temporal-Modal Replay (TMR) advances continual learning (CL) for audio-conditioned 3D motion generation through: (a) a unified framework covering three fundamentally different generator families (flow matching, autoregressive VQ-VAE, and diffusion) across two qualitatively distinct audio conditioning types (speech and spatial binaural audio); (b) the first continual learning benchmark for audio-conditioned 3D motion generation, evaluated against twelve baselines spanning all four CL paradigms (memory-based, regularization-based, optimization-based, and architecture-based); (c) a theoretically grounded temporal feature operator whose principal eigenspace uniquely minimizes replay-induced forgetting among all equal-rank projections; (d) up to 75% reduction in catastrophic forgetting, lossless retention with 25% fewer replay samples, and 59% improvement in feature-space coverage over competing replay methods; (e) near-flat long-horizon retention across nine sequential sessions using only a 15% replay buffer, in a setting where vanilla fine-tuning catastrophically diverges; and (f) zero additional trainable parameters, zero inference overhead, and negligible replay construction cost, working across generator architectures with no auxiliary model or architectural modification.
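To give some intuition for point (c), the sketch below illustrates the classical linear-algebra fact that the claim resembles: among all orthogonal projections of a given rank, the one onto the leading eigenvectors of a covariance matrix leaves the least energy outside the subspace. This is not the TMR temporal feature operator and does not measure forgetting; the toy features, the `residual_energy` criterion, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def top_k_projection(cov, k):
    """Orthogonal projector onto the k leading eigenvectors of a symmetric matrix."""
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    basis = eigvecs[:, -k:]            # columns for the k largest eigenvalues
    return basis @ basis.T


def random_projection(dim, k, rng):
    """Orthogonal projector onto a random k-dimensional subspace."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return q @ q.T


def residual_energy(cov, proj):
    """Energy left outside the subspace: trace(C) - trace(P C P)."""
    return float(np.trace(cov) - np.trace(proj @ cov @ proj))


rng = np.random.default_rng(0)

# Hypothetical toy features: 200 frames, 8 dimensions with unequal scales.
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.3, 0.2, 0.1])
features = rng.standard_normal((200, 8)) * scales
features -= features.mean(axis=0)
cov = features.T @ features / len(features)

k = 3
best = residual_energy(cov, top_k_projection(cov, k))
others = [residual_energy(cov, random_projection(8, k, rng)) for _ in range(100)]

print(f"principal rank-{k} subspace residual: {best:.4f}")
print(f"smallest residual over 100 random rank-{k} subspaces: {min(others):.4f}")
```

In this toy setting no random rank-3 subspace leaves less residual energy than the principal one, which is the sense in which an eigenspace can be optimal among equal-rank projections.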

BEAT2 (speech-conditioned full-body 3D gestures). The goal is to generate natural, coherent, and synchronized motion of the lips, facial expressions, hands, legs, and full body. TMR retains this quality even after nine long sessions while keeping FGD low.

Reference (Session 0, before any continual learning): Speaker 2 Scott — FGD: 0.412 (↓)

Vanilla - Session 4; Speaker 2 Scott

FGD: 2.129 (↓)

CSReL - Session 4; Speaker 2 Scott

FGD: 0.743 (↓)

GCR - Session 4; Speaker 2 Scott

FGD: 0.732 (↓)

TMR - Session 4; Speaker 2 Scott

FGD: 0.613 (↓)

Vanilla - Session 9; Speaker 2 Scott

FGD: 1.900 (↓)

TMR - Session 9; Speaker 2 Scott

FGD: 0.601 (↓)
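The FGD values listed above can be compared side by side with a few lines of Python. This is a minimal sketch: the numbers are copied from the captions, while the drift measure used here (the plain difference from the Session 0 reference) is an illustrative assumption and is not necessarily the forgetting metric behind the percentages reported at the top of the page.

```python
# FGD for Speaker 2 Scott, copied from the captions above (lower is better).
REFERENCE_FGD = 0.412  # Session 0, before any continual learning

fgd = {
    ("Vanilla", 4): 2.129,
    ("CSReL", 4): 0.743,
    ("GCR", 4): 0.732,
    ("TMR", 4): 0.613,
    ("Vanilla", 9): 1.900,
    ("TMR", 9): 0.601,
}


def fgd_increase(value, reference=REFERENCE_FGD):
    """Assumed drift measure: absolute FGD increase over the reference."""
    return value - reference


def ranking(session):
    """Methods evaluated at a given session, sorted from best to worst FGD."""
    rows = [(method, value) for (method, s), value in fgd.items() if s == session]
    return sorted(rows, key=lambda row: row[1])


for session in (4, 9):
    for method, value in ranking(session):
        print(f"Session {session}  {method:<8} FGD {value:.3f}  "
              f"(+{fgd_increase(value):.3f} vs. reference)")
```

Sorting by FGD keeps the comparison within one session at a time, since only Vanilla and TMR are listed for Session 9.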

MOSPA (moving toward the sound source). The goal is to generate coherent 3D motion toward the sound source. For the Sensitive genre, TMR still clearly captures this behavior after the second session.

Reference (Session 1, when Sensitive genre was first introduced): Vanilla — FID: 19.457 (↓)

Vanilla - Session 2; Sensitive Genre

FID: 62.617 (↓)

CSReL - Session 2; Sensitive Genre

FID: 67.617 (↓)

GCR - Session 2; Sensitive Genre

FID: 40.144 (↓)

TMR - Session 2; Sensitive Genre

FID: 37.802 (↓)