
Solving Neural Articulated Performances

Author
Nikolaos Zioulis
mocap 🔹 volcap 🔹 AI | CTO @ Moverse

Dynamic Human Performances

a digitization challenge

Humans move in distinctly varying ways depending on the context and surrounding environment. Human motion exploits and engages with the space around us to convey thoughts and emotions, perform actions, play instruments or express intentions. Body motion is the common denominator of all these activities. Our motions mix coarse limb movement, fine-grained finger manipulation, and facial expressions.

Image of a group performance by Ahmad Odeh

Even though performances vary, our flexibility and adaptability in motion allow us to fluently transition from one type of motion (e.g. sports) to another (e.g. dance). We blend and mix motion types, showing emotions while dancing or playing sports. This high-dimensional transition space makes human performance capture and digitization very challenging. Introducing objects, clothing and hair into said performances significantly expands capturing complexity. Additionally considering simultaneous and/or interacting performers, as well as (social) group dynamics, makes the challenge and complexity scale exponentially. But nowadays we are witnessing another exponential curve, a technological one.

Neural Graphics & Vision

the new continuum

Reconstruction and simulation are the two ends of a digital thread we use to make sense of our physical world. On one end, computer vision uses sensors to reconstruct our surroundings, whereas, on the other end, computer graphics relies on models to simulate them. For years, the reconstruction end has lagged behind simulation, which has been producing impressive visual effects, virtual worlds and cinematics.

A generated image of the digital computer graphics-to-vision continuum.

But recent advances in machine learning have greatly improved machine perception. It is now possible to understand our environments and reconstruct their various layers, spanning from lower-level ones like 3D geometry and appearance to higher-level ones like semantics and even affordances. Said advances also affect the other end, enabling more efficient simulation technology. This works in tandem with the reconstruction progress, creating a virtuous cycle.

It used to be that engineers and scientists worked at the intersection of graphics and vision. Neural techniques are now tying the knot, connecting these two ends. As a result, novel techniques are emerging, allowing us to traverse the graphics-to-vision continuum seamlessly, and opening up new pathways to explore.

Richer Assets

the real in hyper-realistic

High quality character performances are produced for film, visual effects and game cinematics. Technologies like animation, deformation and physically-based rendering are the driving forces behind such compelling content. Still, these need immense human effort - and also artistry - to create, and have been enabled by real-time editing tools, themselves aggregations of many technical advances.

Animated 3D Asset by Jungle Jim

Up to now this has been solely on the simulation side of the continuum. It was almost a decade ago that we started reconstructing production-grade human performances. Microsoft’s Mixed Reality Capture Studio technology [1] (now at Arcturus), 4DViews, Volucap and Volograms delivered impressive 4D capture systems.

Microsoft Mixed Reality Capture Studio

Still, recent advances exhibit new potential to extend capturing to deformations, thin structures, and even effects not possible before. The earlier works of 2021 [2] and 2022 [3] using implicit neural fields achieved human capture with body-pose or time-dependent deformations (a minimal sketch of the underlying idea follows the list below).

2021

NeuralActor (SIGGRAPH Asia 2021): NeRF, SMPL, Texture
A-NeRF (NeurIPS 2021): NeRF, Skeleton

2022

ENeRF (SIGGRAPH Asia 2022): NeRF
NeuralAM (SIGGRAPH Asia 2022): SDF, Deformation
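The recipe shared by these pose-conditioned implicit approaches can be sketched roughly as follows: sample points along camera rays in observation space, warp them into a canonical space with a deformation network conditioned on the body pose, and query a canonical radiance field there. The PyTorch snippet below is a minimal illustrative sketch under these assumptions, not the architecture of any of the papers above; the layer sizes, the residual warp and the 72-dimensional SMPL-style pose vector are placeholders.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Warps observation-space points to canonical space, conditioned on pose."""
    def __init__(self, pose_dim: int = 72, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted offset towards canonical space
        )

    def forward(self, x: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        pose = pose.expand(x.shape[0], -1)
        return x + self.net(torch.cat([x, pose], dim=-1))

class CanonicalField(nn.Module):
    """Canonical radiance field: point -> (density, rgb)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density + 3 rgb channels
        )

    def forward(self, x_canonical: torch.Tensor) -> torch.Tensor:
        out = self.net(x_canonical)
        density = torch.relu(out[..., :1])
        rgb = torch.sigmoid(out[..., 1:])
        return torch.cat([density, rgb], dim=-1)

# Usage: query 1024 observation-space ray samples under a given body pose.
deform, field = DeformationField(), CanonicalField()
points = torch.rand(1024, 3)   # ray samples in observation space
pose = torch.zeros(1, 72)      # SMPL-style axis-angle pose vector (assumption)
density_rgb = field(deform(points, pose))
```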

Following up, more recent works have started to reconstruct and simulate complex cloth [4] dynamics, as well as hair [5]:

Cloth
Hair

Another new capability emerging from the use of neural rendering and radiance field techniques is the capture of effects from participating media. These techniques also enable capturing lighting variations and reflections:

Further, these techniques can drive the capturing of subsurface scattering [6], smoke and clouds [7], or scenes through fog [8] and water [9]!

The capture of volumetric effects such as fire and lights is now possible as well, as shown by the creative Gaussian Splatting works of Stéphane Couchoud and Eric Paré respectively:

Fire
Lights

Evidently, we are reaching a critical point where the combined progress driven by AI and differentiable rendering will consolidate into human capturing technology that delivers much richer assets: assets that disentangle the various appearance, geometry, deformation and animation layers, maximizing editability and control. More importantly, this should be possible while simultaneously downscaling the capturing requirements from hundreds of cameras and controlled environments to portable and/or mobile systems deployed in-the-wild.

Live Representations

holograms, my young Padawan

Tele-presence promises to elevate real-time collaboration and communication beyond contemporary audio-visual systems. Considering its digital nature, it has the potential to surpass the “golden standard” of face-to-face communication. Up to now, systems like Microsoft’s Holoportation and Google’s Starline have spearheaded tele-presence technology.

The requirement of interaction-level latency makes said systems very complex. It is a huge pain point as it applies end-to-end, extending beyond capturing to transmission and rendering. The chosen 3D representation changes the design surface of any real-time tele-presence system: the capturing setup and upstream bandwidth; the capacity and performance of the server and client components for encoding/decoding; rendering; and/or spatial tiling/level-of-detail support. All these influence latency and up/downstream bandwidth, as well as the fidelity of the reconstruction and the resulting QoE.

While research efforts have started focusing on new forms of compression [10] and streamability [11], animatable radiance field based representations stand out as a promising solution to better manage the underlying trade-offs.

With such drivable representations, streaming lightweight animation data is enough to control a deformable representation. This alleviates the need to transmit dense visual data, paying only a one-off cost for the drivable representation itself. Compared to contemporary skinned animation, it can also support pose-controlled deformations, conveying the necessary realism. The video above shows Meta’s Codec Avatars, a technology that has been focusing on exploring the advantages of real-time avatar drivability [12, 13].
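To make the bandwidth argument concrete, here is a rough back-of-the-envelope comparison between streaming dense per-frame geometry and streaming only the animation parameters that drive a prebuilt avatar. The vertex count, texture resolution and SMPL-style parameter count below are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-the-envelope bandwidth comparison; all sizes are assumptions.
FPS = 30

# Dense route: a textured mesh per frame (~30k vertices, uncompressed).
vertices = 30_000
bytes_per_vertex = 3 * 4 + 2 * 4          # xyz position + uv, float32
texture_bytes = 1024 * 1024 * 3           # 1K RGB texture per frame
mesh_frame_bytes = vertices * bytes_per_vertex + texture_bytes
mesh_mbps = mesh_frame_bytes * FPS * 8 / 1e6

# Drivable route: SMPL-style parameters per frame (72 pose + 3 translation).
params = 72 + 3
pose_frame_bytes = params * 4             # float32
pose_mbps = pose_frame_bytes * FPS * 8 / 1e6

print(f"dense mesh stream   : ~{mesh_mbps:8.2f} Mbps")
print(f"animation parameters: ~{pose_mbps:8.4f} Mbps")
print(f"ratio               : ~{mesh_mbps / pose_mbps:,.0f}x")
```

Even with aggressive compression on the dense route, the gap of several orders of magnitude is what makes drivable representations attractive for tele-presence.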

So … SNAP!

a new way?

Animatable human digital representations consolidate human reconstruction and simulation. They can jointly serve as the technical backbone for holistic performance capture, rich 3D human assets and efficient tele-presence.

SNAP is a project that seeks to explore the synergies between human motion and appearance capture, as enabled by the neural graphics and vision continuum. The tools are numerous, ranging from explicit parametric body models to implicit fields, differentiable rendering and efficiently optimized Gaussian splats.

From an end-goal perspective, animatable humans based on neural fields and/or Gaussian splats require high quality motion capture. But once a simulation-ready digital radiance field avatar of an actor is reconstructed - one that can deform clothing and respect lighting variations as the body moves, or even move hair - it can also be used to solve for complex human performances using solely the raw image data. Taking into account that markerless motion capture and human radiance field technologies are developed for the same acquisition system, color camera(s), the interplay between these two technologies is significant and worth exploring.
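This interplay can be framed as an analysis-by-synthesis loop: given a drivable avatar and a differentiable renderer, the body pose of each frame can be recovered by minimizing a photometric loss against the captured images. The sketch below is a minimal illustration of that idea, not the SNAP pipeline; `render_avatar` is a hypothetical stand-in for rendering a drivable radiance field or Gaussian splat avatar, and the 72-dimensional SMPL-style pose vector is an assumption.

```python
import torch
import torch.nn.functional as F

def render_avatar(pose: torch.Tensor) -> torch.Tensor:
    """Hypothetical differentiable renderer: pose -> (H, W, 3) image.

    Stands in for rendering a drivable avatar; here it is a dummy
    differentiable function so the sketch runs end to end.
    """
    return torch.sigmoid(pose.sum()) * torch.ones(64, 64, 3)

def solve_pose(target_image: torch.Tensor, steps: int = 200) -> torch.Tensor:
    """Recover body pose by analysis-by-synthesis against a captured frame."""
    pose = torch.zeros(72, requires_grad=True)  # SMPL-style pose (assumption)
    optimizer = torch.optim.Adam([pose], lr=1e-2)
    for _ in range(steps):
        optimizer.zero_grad()
        rendered = render_avatar(pose)
        loss = F.mse_loss(rendered, target_image)  # photometric loss
        loss.backward()
        optimizer.step()
    return pose.detach()

# Usage: fit the pose for a single captured frame (dummy target here).
target = torch.rand(64, 64, 3)
estimated_pose = solve_pose(target)
```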

References