Temporal-Modal Replay for Continual
Audio-Conditioned 3D Motion Generation

Temporal-Modal Replay (TMR) advances continual learning (CL) for audio-conditioned 3D motion generation through: (a) a unified framework covering three fundamentally different generator families (flow matching, autoregressive VQ-VAE, and diffusion) across two qualitatively distinct audio conditioning types (speech and spatial binaural audio); (b) the first CL benchmark for audio-conditioned 3D motion generation, evaluated against twelve baselines spanning all four CL paradigms (memory-based, regularization-based, optimization-based, and architecture-based methods); (c) a theoretically grounded temporal feature operator whose principal eigenspace uniquely minimizes replay-induced forgetting among all equal-rank projections; (d) up to a 75% reduction in catastrophic forgetting, lossless retention with 25% fewer replay samples, and a 59% improvement in feature-space coverage over competing replay methods; (e) near-flat long-horizon retention across nine sequential sessions where vanilla fine-tuning catastrophically diverges, using only a 15% replay buffer; and (f) zero additional trainable parameters, zero inference overhead, and negligible replay construction cost, while working across generator architectures without any auxiliary model or architectural modification.
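The optimality claim in (c) has the same shape as a classical linear-algebra fact: among all orthogonal projections of a fixed rank, the one onto the top eigenvectors of a symmetric positive semidefinite matrix leaves the least residual energy. The sketch below illustrates only that generic fact on synthetic data. It is not TMR's operator: the empirical second-moment matrix stands in for the temporal feature operator, residual energy stands in for replay-induced forgetting, and every function name is hypothetical.

```python
import numpy as np


def principal_projection(features, rank):
    """Orthonormal basis (d x rank) of the top-`rank` eigenspace of the
    empirical second-moment matrix of `features` (n x d).

    Assumption: this matrix is only a stand-in for the paper's operator.
    """
    second_moment = features.T @ features / features.shape[0]
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, eigvecs = np.linalg.eigh(second_moment)
    return eigvecs[:, ::-1][:, :rank]


def random_projection(dim, rank, rng):
    """Orthonormal basis (dim x rank) of a random subspace, via reduced QR."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q


def residual_energy(features, basis):
    """Mean squared norm of the feature component outside span(basis).

    Assumption: used here as a proxy for forgetting, not the paper's metric.
    """
    projected = features @ basis @ basis.T
    return float(np.mean(np.sum((features - projected) ** 2, axis=1)))


rng = np.random.default_rng(0)
# Synthetic features whose per-dimension scale decays, giving a clear eigengap.
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.25, 0.1])
features = rng.standard_normal((500, scales.size)) * scales

rank = 3
top_basis = principal_projection(features, rank)
best = residual_energy(features, top_basis)
others = [
    residual_energy(features, random_projection(scales.size, rank, rng))
    for _ in range(200)
]
print(f"principal eigenspace residual: {best:.4f}")
print(f"best random subspace residual: {min(others):.4f}")
```

No random subspace of the same rank can achieve a lower residual than the principal eigenspace, and the minimizer is unique only when the eigenvalues at the cut-off are distinct, which is why the synthetic spectrum above is chosen with a clear gap.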