Voice2Face: Audio-driven Facial and Tongue Rig Animations with cVAEs

dc.contributor.author: Villanueva Aylagas, Monica
dc.contributor.author: Anadon Leon, Hector
dc.contributor.author: Teye, Mattias
dc.contributor.author: Tollmar, Konrad
dc.contributor.editor: Dominik L. Michels
dc.contributor.editor: Soeren Pirk
dc.date.accessioned: 2022-08-10T15:19:54Z
dc.date.available: 2022-08-10T15:19:54Z
dc.date.issued: 2022
dc.description.abstract: We present Voice2Face: a Deep Learning model that generates face and tongue animations directly from recorded speech. Our approach consists of two steps: a conditional Variational Autoencoder generates mesh animations from speech, while a separate module maps the animations to rig controller space. Our contributions include an automated method for speech style control, a method to train a model with data from multiple quality levels, and a method for animating the tongue. Unlike previous works, our model generates animations without speaker-dependent characteristics while allowing speech style control. We demonstrate through a user study that Voice2Face significantly outperforms a comparative state-of-the-art model in terms of perceived animation quality, and our quantitative evaluation suggests that Voice2Face yields more accurate lip closure in speech with bilabials through our speech style optimization. Both evaluations also show that our data quality conditioning scheme outperforms both an unconditioned model and a model trained with a smaller high-quality dataset. Finally, the user study shows a preference for animations including tongue. Results from our model can be seen at https://go.ea.com/voice2face.
dc.description.number: 8
dc.description.sectionheaders: Capture, Tracking, and Facial Animation
dc.description.seriesinformation: Computer Graphics Forum
dc.description.volume: 41
dc.identifier.doi: 10.1111/cgf.14640
dc.identifier.issn: 1467-8659
dc.identifier.pages: 255-265
dc.identifier.pages: 11 pages
dc.identifier.uri: https://doi.org/10.1111/cgf.14640
dc.identifier.uri: https://diglib.eg.org:443/handle/10.1111/cgf14640
dc.publisher: The Eurographics Association and John Wiley & Sons Ltd.
dc.subject: CCS Concepts: Computing methodologies --> Animation; Neural networks; Latent variable models; Learning latent representations; Additional Key Words and Phrases: Deep Learning, Facial animation, Tongue animation, Lip synchronization, Rig animation
dc.title: Voice2Face: Audio-driven Facial and Tongue Rig Animations with cVAEs
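The two-step pipeline described in the abstract (a conditional Variational Autoencoder that generates mesh animations from speech, followed by a separate module that maps mesh space to rig controller space) can be sketched in miniature as follows. This is an illustrative sketch only: all dimensions, layer shapes, weights, and function names here are assumptions for demonstration, not the paper's actual architecture or trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a window of audio features, a style/condition
# vector, a latent code, and a flattened face-mesh offset vector.
AUDIO_DIM, COND_DIM, LATENT_DIM, MESH_DIM = 64, 8, 16, 300

def linear(x, out_dim, key):
    """Random fixed weights stand in for trained layers in this sketch."""
    w = np.random.default_rng(key).normal(0.0, 0.1, (x.shape[-1], out_dim))
    return np.tanh(x @ w)

def cvae_generate(audio_feats, cond):
    """One cVAE decoding step: sample a latent, condition it, decode a mesh."""
    # At inference time a VAE decoder samples the prior N(0, I); during
    # training the encoder would supply (mu, log_var) instead.
    z = rng.normal(size=LATENT_DIM)
    # Conditioning by concatenation: latent + audio features + style vector.
    h = np.concatenate([z, audio_feats, cond])
    mesh_offsets = linear(linear(h, 128, key=1), MESH_DIM, key=2)
    return mesh_offsets

def mesh_to_rig(mesh_offsets):
    """Separate mapping module: mesh space -> rig controller space."""
    return linear(mesh_offsets, 32, key=3)  # 32 hypothetical rig controls

audio = rng.normal(size=AUDIO_DIM)   # stand-in for extracted speech features
style = np.zeros(COND_DIM)           # e.g. a one-hot or learned style code
mesh = cvae_generate(audio, style)
rig = mesh_to_rig(mesh)
print(mesh.shape, rig.shape)         # (300,) (32,)
```

The point of the two-stage split mirrors the abstract: the cVAE learns speech-to-geometry in mesh space, while the rig mapping keeps the output usable in an animation pipeline's controller space.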
Files (Original bundle, 4 items):
- v41i8pp255-265.pdf (5.63 MB, Adobe Portable Document Format)
- v2f_appendices.pdf (1.16 MB, Adobe Portable Document Format)
- webm.zip (24.84 MB, Zip file)
- V2F_errata .pdf (39.13 KB, Adobe Portable Document Format; Description: Errata)