Temporal-Modal Replay for Continual Audio-Conditioned 3D Motion Generation

BEAT2 (speech-conditioned full-body 3D gestures). The goal is to generate natural, coherent, and synchronized lip motion, facial expressions, hand and leg movement, and full-body motion. Temporal-Modal Replay (TMR) retains this quality even after 9 long sessions while keeping the Fréchet Gesture Distance (FGD, lower is better) low.

Session 4; Speaker 2 (Scott)
Vanilla: FGD 2.129 (↓)
CSReL: FGD 0.743 (↓)
GCR: FGD 0.732 (↓)
TMR: FGD 0.613 (↓)

Session 9; Speaker 2 (Scott)
Vanilla: FGD 1.900 (↓)
TMR: FGD 0.601 (↓)

MOSPA (moving toward the sound source). The goal is to generate coherent 3D motion toward the sound source. TMR clearly captures this behavior after the second session for the Sensitive genre, with the lowest FID (lower is better) among the compared methods.

Session 2; Sensitive Genre
Vanilla: FID 62.617 (↓)
CSReL: FID 67.617 (↓)
GCR: FID 40.144 (↓)
TMR: FID 37.802 (↓)
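
For readers unfamiliar with the metrics, FGD and FID scores of this kind are usually computed as a Fréchet distance between two Gaussians fitted to feature embeddings of real and generated motion. The sketch below shows that standard formula only; it assumes the features are already extracted as arrays of shape (num_samples, feature_dim). The function name `frechet_distance` is hypothetical, and the feature extractors and evaluation protocol behind the numbers above are not specified here.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are assumed to be arrays of shape (num_samples, feature_dim).
    This is the generic formula, not the evaluation code behind the reported scores.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite covariances (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Under this definition, identical feature sets score approximately zero, and shifting every generated feature by a constant vector adds the squared length of that vector, which is why lower values indicate generated motion that is statistically closer to the real data.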