Fully Controllable Data Generation For Realtime Face Capture

Date
2023-01-31
Publisher
ETH Zurich
Abstract
Data-driven realtime face capture has gained considerable momentum in the last few years thanks to deep neural networks that leverage specialized datasets to speed up the acquisition of face geometry and appearance. However, generalizing such neural solutions to generic in-the-wild face capture remains a challenge due to the lack of a high-quality in-the-wild face database with all forms of ground truth (geometry, appearance, environment maps, etc.), or of a means to generate one. In this thesis we address this data bottleneck and propose a comprehensive framework for controllable, high-quality, in-the-wild data generation that can support present and future applications in face capture. We approach this problem in four stages.

The first stage is the construction of a high-quality 3D face database of a few hundred subjects captured in a studio setting. This database serves as a strong prior on 3D face geometry and appearance for several methods discussed in this thesis. To build it, and to automate the registration of scans to a template mesh, we propose the first deep facial landmark detector capable of operating on 4K-resolution imagery while also achieving state-of-the-art performance on several in-the-wild benchmarks.

The second stage leverages the proposed 3D face database to build powerful nonlinear 3D morphable models for static geometry modelling and synthesis. We propose the first semantic deep face model, which combines the semantic interpretability of traditional 3D morphable models with the nonlinear expressivity of neural networks. We later extend this semantic deep face model with a novel transformer-based architecture, the Shape Transformer, for representing and manipulating face shapes irrespective of their mesh connectivity.

The third stage of our data generation pipeline extends the approaches for static geometry synthesis to facial deformations across time, so as to synthesize dynamic performances. We propose two parallel approaches: one based on performance retargeting and another based on a data-driven 4D (3D + time) morphable model. Our local anatomically constrained facial performance retargeting technique uses only a handful of blendshapes (20 shapes) to achieve production-quality results, and can readily be used to create novel animations for any given actor via animation transfer. Our second contribution for generating facial performances is a transformer-based 4D autoencoder that encodes a sequence of expression blend weights into a learned performance latent space; novel performances can then be generated at inference time by sampling this learned latent space.

The fourth and final stage of our data generation pipeline is the creation of photorealistic imagery to accompany the facial geometry and animations synthesized thus far. We propose a hybrid rendering approach that combines state-of-the-art ray-traced skin rendering with a pretrained 2D generative model for photorealistic and consistent inpainting of the skin renders. Our hybrid rendering technique allows for the creation of an effectively unlimited number of training samples in which the user has full control over the facial geometry, appearance, lighting and viewpoint.

The techniques presented in this thesis will serve as the foundation for creating large-scale photorealistic in-the-wild face datasets to support the next generation of realtime face capture.
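Both the morphable-model and retargeting stages build on the standard linear blendshape formulation, in which a face is expressed as a neutral mesh plus a weighted sum of expression offsets. The following NumPy sketch illustrates only that textbook baseline; the function name, array shapes, and toy data are illustrative and not taken from the thesis.

```python
import numpy as np

def blendshape_rig(neutral, deltas, weights):
    """Evaluate a linear blendshape rig.

    neutral : (V, 3) neutral-pose vertex positions.
    deltas  : (K, V, 3) per-blendshape vertex offsets
              (target shape minus neutral).
    weights : (K,) blend weights, typically in [0, 1].

    Returns the deformed (V, 3) vertex positions:
        x = neutral + sum_k weights[k] * deltas[k]
    """
    return neutral + np.tensordot(weights, deltas, axes=1)

# Toy example: a 4-vertex "mesh" with 2 blendshapes.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(4, 3))
deltas = rng.normal(scale=0.1, size=(2, 4, 3))
weights = np.array([0.7, 0.2])
deformed = blendshape_rig(neutral, deltas, weights)
print(deformed.shape)  # (4, 3)
```

A rig with around 20 such shapes, as in the retargeting stage, simply means K = 20 localized offsets whose weights are animated over time.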
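As a rough illustration of the 4D autoencoder described above, the sketch below encodes a sequence of expression blend weights into a single latent vector and decodes it back to a sequence, assuming PyTorch. The layer sizes, mean-pooling bottleneck, and non-autoregressive decoder are assumptions made for a self-contained example, not the thesis architecture.

```python
import torch
import torch.nn as nn

class PerformanceAutoencoder(nn.Module):
    """Sketch of a transformer autoencoder over blend-weight sequences.

    A performance is a (T, K) sequence of K expression blend weights
    over T frames; it is compressed to one latent vector and decoded
    back to a (T, K) sequence.
    """

    def __init__(self, n_weights=20, d_model=128, latent_dim=64,
                 n_heads=4, n_layers=3, max_len=512):
        super().__init__()
        self.in_proj = nn.Linear(n_weights, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.from_latent = nn.Linear(latent_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_weights)

    def encode(self, w):                       # w: (B, T, K)
        h = self.in_proj(w) + self.pos_emb[:, :w.shape[1]]
        return self.to_latent(self.encoder(h).mean(dim=1))

    def decode(self, z, n_frames):             # z: (B, latent_dim)
        h = self.from_latent(z).unsqueeze(1).expand(-1, n_frames, -1)
        h = self.decoder(h + self.pos_emb[:, :n_frames])
        return self.out_proj(h)                # (B, T, K)

    def forward(self, w):
        return self.decode(self.encode(w), w.shape[1])

model = PerformanceAutoencoder()
seq = torch.rand(2, 100, 20)   # batch of 2 performances, 100 frames
recon = model(seq)
print(recon.shape)             # torch.Size([2, 100, 20])
```

Sampling latent vectors z (rather than encoding real sequences) and calling decode then yields novel performances, mirroring the inference-time use described in the abstract.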
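The hybrid rendering stage can be pictured as an alpha composite of a ray-traced skin render with a generatively inpainted remainder (hair, eyes, background). Below is a minimal sketch of that compositing step only; `inpaint_fn` is a hypothetical stand-in for the pretrained 2D generative inpainter, and the toy usage fills the hole with gray purely for demonstration.

```python
import numpy as np

def hybrid_composite(skin_render, skin_mask, inpaint_fn):
    """Composite a ray-traced skin render with generative inpainting.

    skin_render : (H, W, 3) float image of the rendered face skin.
    skin_mask   : (H, W) float alpha in [0, 1]; 1 where skin was rendered.
    inpaint_fn  : callable filling the non-skin region given the partial
                  image and a hole mask; a placeholder for a pretrained
                  2D generative inpainting model.
    """
    inpainted = inpaint_fn(skin_render, 1.0 - skin_mask)
    alpha = skin_mask[..., None]
    return alpha * skin_render + (1.0 - alpha) * inpainted

# Toy usage with a trivial "inpainter" that fills the hole with gray.
H, W = 64, 64
render = np.zeros((H, W, 3)); render[16:48, 16:48] = 0.8
mask = np.zeros((H, W)); mask[16:48, 16:48] = 1.0
out = hybrid_composite(render, mask,
                       lambda img, hole: np.full_like(img, 0.5))
print(out.shape)  # (64, 64, 3)
```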