Temporal-Modal Replay (TMR) advances continual learning (CL) for audio-conditioned 3D motion generation through: (a) a unified framework covering three fundamentally different generator families (flow matching, autoregressive VQ-VAE, and diffusion) across two qualitatively distinct audio conditioning types (speech and spatial binaural audio); (b) the first continual learning benchmark for audio-conditioned 3D motion generation, evaluated against twelve baselines spanning all four CL paradigms (memory-based, regularization-based, optimization-based, and architecture-based); (c) a theoretically grounded temporal feature operator whose principal eigenspace uniquely minimizes replay-induced forgetting among all equal-rank projections; (d) up to a 75% reduction in catastrophic forgetting, lossless retention with 25% fewer replay samples, and a 59% improvement in feature-space coverage over competing replay methods; (e) near-flat long-horizon retention across nine sequential sessions, where vanilla fine-tuning diverges catastrophically, using only a 15% replay buffer; and (f) zero additional trainable parameters, zero inference overhead, and negligible replay construction cost across generator architectures, with no auxiliary model or architectural modification.
[Figure: panel annotations, lower is better. FGD: 2.129, 0.743, 0.732, 0.613, 1.900, 0.601. FID: 62.617, 67.617, 40.144, 37.802.]