
WO2022096105A1 - 3d tongue reconstruction from single images - Google Patents


Info

Publication number
WO2022096105A1
Authority
WO
WIPO (PCT)
Prior art keywords
tongue
synthetic
head
latent
meshes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2020/081148
Other languages
French (fr)
Inventor
Stylianos PLOUMPIS
Stylianos MOSCHOGLOU
Vasilios TRIANTAFYLLOU
Stefanos ZAFEIRIOU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/EP2020/081148 priority Critical patent/WO2022096105A1/en
Publication of WO2022096105A1 publication Critical patent/WO2022096105A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00: Indexing scheme for image generation or computer graphics
    • G06T2210/56: Particle system, point based geometry or rendering
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20: Indexing scheme for editing of 3D models
    • G06T2219/2004: Aligning objects, relative positioning of parts
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20: Indexing scheme for editing of 3D models
    • G06T2219/2021: Shape modification

Definitions

  • the present disclosure relates to 3D reconstruction of human tongues for applications in 3D face reconstruction, animation and/or verification.
  • the present disclosure provides, to this end, a method, a computer program and a device.
  • a first aspect of the present disclosure provides a method for 3D reconstruction of a tongue, comprising: determining one of a plurality of synthetic tongue-and-head 3D meshes based on an image of the tongue; and modifying the determined synthetic tongue-and-head 3D mesh based on a synthetic tongue 3D point-cloud being indicative of the image of the tongue.
  • a combined tongue-and-head 3D model may be obtained which has a 3D tongue pose reconstructed from the single image of the tongue.
  • the determining of the synthetic tongue-and-head 3D mesh comprises encoding the image of the tongue into one of a plurality of latent tongue 3D features representing a corresponding one of a plurality of raw tongue 3D point-clouds; transforming the obtained latent tongue 3D feature into a corresponding one of a plurality of latent tongue-and-head expression shape parameters of a corresponding one of the plurality of synthetic tongue-and-head 3D meshes; and converting the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
  • the encoding of the image of the tongue into one of a plurality of latent tongue 3D features comprises using an embedding network trained to encode a plurality of images of tongues into the corresponding plurality of the latent tongue 3D features.
  • the plurality of the latent tongue 3D features may be established by autoencoding the plurality of raw tongue 3D point-clouds.
  • the plurality of images of the tongues comprises a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds.
  • the plurality of images of the tongues are rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D pointcloud.
  • the transforming the obtained latent tongue 3D feature comprises using a first regression matrix transforming the plurality of latent tongue 3D features into the plurality of corresponding latent tongue-and-head expression shape parameters of the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the plurality of latent tongue-and-head expression shape parameters are established by applying a second regression matrix to a plurality of first latent tongue-and-mouth expression shape parameters of a plurality of corresponding first tongue-and-mouth landmark vertex sets defined in the plurality of raw tongue 3D point-clouds.
  • the plurality of first latent tongue-and-mouth expression shape parameters of the plurality of corresponding first tongue-and-mouth landmark vertex sets may be established by using a first Principal Component Analysis (PCA) on the plurality of first tongue-and-mouth landmark vertex sets.
  • PCA Principal Component Analysis
  • the first PCA may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets in terms of the underlying landmarks.
  • the second regression matrix may be established by regressing a plurality of second latent tongue-and-mouth expression shape parameters of the plurality of second tongue-and-mouth landmark vertex sets to the plurality of corresponding latent tongue-and-head expression shape parameters.
  • the plurality of latent tongue-and-head expression shape parameters may be established by using a second PCA on the plurality of synthetic tongue-and-head 3D meshes.
  • the second PCA may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes.
  • the above-mentioned computationally inexpensive inference phase is prepared by associating the latent tongue 3D features of raw tongue 3D point-clouds with appropriate synthetic tongue-and-head 3D meshes in a PCA representation.
  • the converting the obtained latent tongue-and-head expression shape parameters comprises using the second PCA to convert the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
  • the first and second tongue-and-mouth landmark vertex sets are arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud or respective synthetic tongue-and-head 3D mesh.
  • the modifying the determined synthetic tongue-and- head 3D mesh comprises generating a plurality of synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features; and adapting, based on an optimization procedure, the one of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters.
  • a shape of the pre-shaped synthetic tongue-and-head 3D mesh may further be optimized in accordance with the 3D tongue pose depicted in the single input image.
  • the generating a plurality of synthetic points of the synthetic tongue 3D point-cloud comprises using a generative network of a Generative Adversarial Network (GAN) trained to establish individual synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features and respective Gaussian noise samples to be indistinguishable, by a discriminative network of the GAN, from points of one of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features.
  • GAN Generative Adversarial Network
  • individual points of a synthetic tongue 3D point-cloud may be generated in accordance with latent tongue 3D features of a raw tongue 3D point-cloud. Accordingly, raw tongue 3D point-clouds do not need to have the same number of points, which reduces dataset preprocessing, and as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
  • the points of the one of the plurality of raw tongue 3D point-clouds undergo a diversification by an isotropic multi-variate normal distribution N having a variance that declines with progressing training epoch e of the GAN.
  • the training of the GAN is improved and stabilized, by softening the binary behavior of the discriminative network especially in an early training phase.
  • the error metric comprises a Chamfer distance loss to modify a 3D position of points of the one of the plurality of synthetic tongue-and-head 3D meshes; a normal loss to modify a 3D orientation of the one of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss to constrain relative 3D positions of neighboring points of the one of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss to constrain any possible outlier points of the one of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss to inhibit penetration of a surface of an oral cavity of the one of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
  • the collision loss comprises a sum of distances of each colliding point to a plurality of spheres having a radius and being centered at mouth landmark vertices of the tongue-and-mouth 3D landmark vertex set defined in the one of the synthetic tongue-and-head 3D meshes.
  • a second aspect of the present disclosure provides a computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any of its implementations.
  • a third aspect of the present disclosure provides a device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method of the first aspect or any of its implementations.
  • FIG. 1 illustrates a flow chart of a method according to an embodiment of the invention
  • FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method of Fig. 1;
  • FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method of Fig. 1;
  • FIG. 4 illustrates an exemplary GAN as it may be used in connection with the method of Fig. 1;
  • FIG. 4a illustrates exemplary Layer and Injection building blocks that may be used in connection with the GAN according to FIG. 4.
  • FIG. 5 illustrates a diversification of points of a raw tongue 3D point-cloud used in connection with the method of Fig. 1.
  • FIG. 1 illustrates a flow chart of a method 1 according to an embodiment of the invention
  • FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method 1 of Fig. 1.
  • a tongue may refer to a muscular organ being anchored in an oral cavity of a human subject.
  • the method 1 may achieve a 3D reconstruction of a tongue by comprising: determining 101 one 207 of a plurality of synthetic tongue-and-head 3D meshes based on an image 201 of the tongue; and modifying 108 the determined synthetic tongue-and-head 3D mesh 207 based on a synthetic tongue 3D point-cloud 210 being indicative of the image 201 of the tongue.
  • the method 1 may define a processing pipeline as shown in FIG. 2 that can predict a 3D tongue mesh with fixed topology from a single image, which can be further optimized based on a generated point-cloud for more accurate results.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may comprise encoding 102 the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y representing a corresponding one 401 of a plurality of raw tongue 3D point-clouds.
  • the encoding 102 of the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y may comprise using 103 an embedding network 202.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise transforming 104 the obtained latent tongue 3D feature 203, y into a corresponding one 205 of a plurality of latent tongue-and-head expression shape parameters p_t of a corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • the transforming 104 the obtained latent tongue 3D feature 203, y may comprise using 105 a first regression matrix 204, W_t,y transforming the plurality of latent tongue 3D features 203, y into the plurality of corresponding latent tongue-and-head expression shape parameters 205, p_t of the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise converting 106 the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • the converting 106 of the obtained latent tongue-and-head expression shape parameters 205, p_t of the plurality of synthetic tongue-and-head 3D meshes may comprise using 107 a second PCA 206, U_t to convert the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • steps 101 - 107 of the method 1 are able to make use of knowledge acquired in a training phase that may be concluded before the inference phase illustrated in FIGs. 1 and 2.
  • This preparatory work in the training phase is described in the following:
  • a first dataset may be captured under controlled conditions and comprises raw 3D tongue scans in a 3D point-cloud form known as the ‘plurality of raw tongue 3D point-clouds’.
  • a second dataset may be manually created and comprises only synthetic full head data with tongue expressions known as the ‘plurality of synthetic tongue-and-head 3D meshes’.
  • Each of these meshes may be based on the mean template of the Universal Head Model (UHM), which has manually been diversified to render a wide range of tongue expressions.
  • UHM Universal Head Model
  • a set of landmark vertices 3 may be annotated around a circumference of the respective tongue and mouth areas.
  • the same landmark protocol may be utilized to annotate the plurality of raw tongue 3D point-clouds as well as the plurality of synthetic tongue-and-head 3D meshes based on the same underlying landmarks.
  • the total number of landmarks in each set of landmark vertices 3 is exemplarily 24 as can be seen in FIG. 3, and is divided into two groups 302, 301 which highlight the tongue and the mouth, respectively. This constitutes a first tongue-and-mouth landmark vertex set 3, l_r per raw tongue 3D point-cloud, and a second tongue-and-mouth landmark vertex set 3, l_t per synthetic tongue-and-head 3D mesh.
  • the set of landmark vertices 3 serves to associate the two datasets, so that a raw tongue 3D point-cloud can be linked to a synthetic tongue-and-head 3D mesh having a (closest) corresponding tongue expression.
  • a first PCA U_l may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets l_t of the plurality of corresponding synthetic tongue-and-head 3D meshes, and used to establish a plurality of second latent tongue-and-mouth expression shape parameters p_l.
  • a PCA may refer to a process of computing principal components of an n-dimensional point cloud, by fitting an n-dimensional ellipsoid to the point cloud, wherein each axis of the ellipsoid represents a principal component.
  • An i-th principal component may be a direction of a line that is orthogonal to the first i-1 vectors and minimizes the average squared distance from the points to that line.
  • the principal components may be linearly uncorrelated and constitute an orthonormal basis that best fits the point cloud.
  • a second PCA 206, U t may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes as a whole, and used to establish a plurality of latent tongue-and-head expression shape parameters 205, p t .
  • a second regression matrix W_t,l may be established by regressing the plurality of second latent tongue-and-mouth expression shape parameters p_l to the plurality of latent tongue-and-head expression shape parameters 205, p_t.
  • the second regression matrix W_t,l associates the landmark vertex sets 3, l_t of the synthetic tongue-and-head 3D meshes with the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the second regression matrix W_t,l may also associate the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds with the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the above-mentioned plurality of the latent tongue 3D features 203, y may be established by auto-encoding the plurality of raw tongue 3D point-clouds, and the embedding network 202 may be trained to encode a plurality of images 201 of tongues into the corresponding plurality of the latent tongue 3D features 203, y.
  • the embedding network 202 may be based on a ResNet-50 model pre-trained on the image database ImageNet and fine-tuned to work as an image encoder.
  • a last layer of the embedding network 202 may be modified to output a vector ŷ with the same dimensions as the ground-truth vector y. The goal of the embedding task is then to minimize an L2 loss between ŷ and y.
  • the above-mentioned plurality of images 201 of the tongues may comprise a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds, and may be rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
  • the plurality of synthetic tongue-and-head 3D meshes may be rendered with a precomputed radiance transfer technique using spherical harmonics which efficiently represent global light scattering.
  • 145 second-order spherical harmonics of more than 15 different indoor scenes may be coupled with random light positions and mesh orientations around all 3D axes, resulting in a rich plurality of images 201.
  • the modifying 108 the determined synthetic tongue-and-head 3D mesh 207 may comprise generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y, and adapting 111, based on an optimization procedure, the one 207 of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud 210 to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters 205, p_t.
  • the generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4.
  • a Generative Adversarial Network may refer to a machine learning framework in which two neural networks (generative and discriminative network agents, respectively) contest with each other in a zero-sum game where one agent's gain is another agent's loss. Given a training set of (real) samples, this framework learns to generate new (synthetic) samples with the same statistics as the training set.
  • the discriminative network is trained by presenting samples from the training set until it achieves acceptable accuracy.
  • the generative network is seeded with randomized input that is sampled from a predefined latent space, and learns to generate (synthetic) samples, i.e., to map from the latent space to a data distribution of interest.
  • the discriminative network seeks to distinguish the synthetic samples produced by the generator from the real samples, i.e., the true data distribution. Backpropagation is applied in both networks so that the generative network generates better synthetic images, while the discriminative network becomes more skilled at flagging synthetic images.
  • the generative network 209, G may randomly predict 10K synthetic points G(z, y) that describe a tongue surface in accordance with Gaussian noise samples 208, z (see the formula in the detailed description below).
  • the afore-mentioned step 110 of the method 1 makes use of knowledge acquired in the training phase.
  • This preparatory work in the training phase may comprise that the GAN 4 is trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 402 of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
  • the above-mentioned error metric may comprise a Chamfer distance loss L_CD to modify a 3D position of points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a normal loss L_norm to modify a 3D orientation of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss L_lap to constrain relative 3D positions of neighboring points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss L_edge to constrain any possible outlier points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss L_col to inhibit penetration of a surface of an oral cavity of the one 207 of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points (see the formula in the detailed description below).
  • the collision loss L_col may comprise a sum of distances of each colliding point q' to a plurality of spheres k having a radius r and being centered at mouth landmark vertices 301 of the tongue-and-mouth 3D landmark vertex set 3, l_t defined in the one 207 of the synthetic tongue-and-head 3D meshes (see the formula in the detailed description below).
  • FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method 1 of Fig. 1.
  • the depicted tongue-and-mouth landmark vertex set 3, l_t is arranged circumferentially around a tongue and mouth, respectively, of a synthetic tongue-and-head 3D mesh 207.
  • the 24 landmarks around the oral cavity of the UHM template are divided into two groups 302, 301 which highlight the tongue and the mouth, respectively.
  • the first and second tongue-and-mouth landmark vertex sets 3, l_r, l_t may be arranged circumferentially around the tongue and the mouth of the respective raw tongue 3D point-cloud 401 or the respective synthetic tongue-and-head 3D mesh 207.
  • FIG. 4 illustrates an exemplary GAN 4 as it may be used in connection with the method 1 of Fig. 1.
  • the generating 109 of the plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4, trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 402 of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
  • a conditional GAN setting may be used in which the generative network 209, G is guided by labels throughout the training process in order to learn to produce samples that belong to specific categories which are dictated by the labels.
  • the GAN 4 is preferably guided by meaningful labels which capture all the desired 3D surface information.
  • These labels i.e., the plurality of latent tongue 3D features 203, y, may be learned by auto-encoding the plurality of raw tongue 3D point-clouds.
  • a self-organizing map framework may be used for hierarchical feature extraction.
  • the discriminative network 403, D receives as inputs the label y together with either a real point-cloud point x_t (which belongs to the tongue represented by the label y) or the output G(z, y) of the generative network 209, G, and tries to discriminate the fake (i.e., generated) point from the real one.
  • this may be described as: L_D = E_{x_t}[log D(x_t, y)] + E_z[log(1 - D(G(z, y), y))].
  • D tries to maximize L_D.
  • G tries to minimize L_D.
  • one point corresponding to the surface which the label y represents may be generated at a time.
  • the raw tongue 3D point-clouds of the training set do not need to have a same number of points, so that the GAN 4 may be trained without any data preprocessing.
  • as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
  • the generative network 209, G may comprise L(ayer) and I(njection) building blocks 404, 405 being interconnected as shown in FIG. 4, for example. These L/I building blocks 404, 405 may respectively comprise a Multilayer Perceptron (MLP) 406 and Rectified Linear Unit (ReLU) 407 layers, as can be seen in FIG. 4a. Further building blocks describe a processing of the propagated signals. Processing block 408 (symbol c) stands for row-wise concatenation along the channel dimension, whereas processing block 409 (symbol o) stands for the element-wise (i.e., Hadamard) product; an illustrative sketch of these blocks is given at the end of this section.
  • the inputs of the generative network 209, G are: a label y corresponding to a particular raw tongue 3D point-cloud from which a 3D point is to be sampled, and a Gaussian noise sample z.
  • the discriminative network 403, D may be based on the building blocks already mentioned in connection with the generative network 209, G which may be interconnected as shown in FIG. 4, for example.
  • the inputs of the discriminative network 403, D are: (y, x_t), where y is a label corresponding to a raw tongue 3D point-cloud and x_t is a point of this particular raw tongue 3D point-cloud, or (y, G(z, y)), where G(z, y) is a point generated in accordance with this particular raw tongue 3D point-cloud.
  • the switch symbol between the generative network 209, G and the discriminative network 403, D indicates that this feed is performed on a random basis.
  • FIG. 5 illustrates a diversification of points x_t of a raw tongue 3D point-cloud 401 used in connection with the method 1 of Fig. 1.
  • the discriminative network 403, D shows a binary behavior in that it decides whether a point is either fake or real. This rigidity is not very helpful especially in the early steps of the training process, as the generative network 209, G struggles to learn the distribution of points of the plurality of raw tongue 3D point-clouds (i.e., all of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, the strict nature of the discriminative network 403, D may be softened, especially in the initial training steps, by diversifying the points x_t fed to it. To achieve that, instead of directly feeding a real point x_t corresponding to a label y to the discriminative network 403, D, the following is provided:
  • the points x_t of the one 401 of the plurality of raw tongue 3D point-clouds may undergo a diversification 402 (see FIG. 4) by an isotropic multi-variate normal distribution N having mean x_t and (isotropic) variance σ_e that declines with progressing training epoch e of the GAN 4, i.e., the diversified point may be drawn from N(x_t, σ_e I); an annealing sketch is given at the end of this section.
  • the generative network 209, G can better learn the distribution of points of the plurality of raw tongue 3D point-clouds as it does not get severely punished by the discriminative network 403, D when it slightly misses the actual surface. This yields better results and stabilizes the training.
  • the training may be started with a relatively small value for the variance σ_e, which is subsequently reduced until it becomes zero towards the final training epochs e.
  • a computer program (not shown) may be provided comprising executable instructions which, when executed by a processor, cause the processor to perform the above-mentioned method 1, and that a device (not shown) for 3D reconstruction of a tongue may be provided comprising a processor configured to perform the above-mentioned method 1.
  • the processor or processing circuitry of the device may comprise hardware and/or the processing circuitry may be controlled by software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • ASICs application-specific integrated circuits
  • FPGAs field-programmable gate arrays
  • DSPs digital signal processors
  • multi-purpose processors
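
Two of the mechanisms recited above lend themselves to short, hedged sketches (Python is used throughout; all layer sizes and schedules are assumptions, since the disclosure does not fix them). First, the diversification of real points with a variance that declines over the training epochs:

```python
import torch

def diversify(x_t: torch.Tensor, epoch: int, total_epochs: int,
              sigma_0: float = 0.01) -> torch.Tensor:
    """Draw a perturbed point from N(x_t, sigma_e * I), where the scale
    sigma_e decays linearly to zero. The decay law and sigma_0 are
    assumptions; the disclosure only states that the variance declines
    with progressing epoch e and reaches zero towards the final epochs."""
    sigma_e = sigma_0 * max(0.0, 1.0 - epoch / total_epochs)
    return x_t + sigma_e * torch.randn_like(x_t)
```

Second, the Layer and Injection building blocks of the generative network: each combines a Multilayer Perceptron with a ReLU, and the surrounding graph uses row-wise concatenation (block 408, symbol c) and an element-wise Hadamard product (block 409, symbol o). The wiring below is an illustrative guess, not a reproduction of FIG. 4a:

```python
import torch
import torch.nn as nn

class LayerBlock(nn.Module):
    """'L' building block 404: an MLP layer followed by a ReLU."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

class InjectionBlock(nn.Module):
    """'I' building block 405 (illustrative): mixes a side signal into the
    main path via concatenation ('c') and a Hadamard product ('o')."""
    def __init__(self, d_x: int, d_side: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_x + d_side, d_x), nn.ReLU())

    def forward(self, x: torch.Tensor, side: torch.Tensor) -> torch.Tensor:
        h = torch.cat([x, side], dim=-1)   # row-wise concatenation (408)
        return self.mlp(h) * x             # element-wise Hadamard product (409)
```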

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method (1) for 3D reconstruction of a tongue is provided. The method (1) comprises: determining (101) one (207) of a plurality of synthetic tongue-and-head 3D meshes based on an image (201) of the tongue; and modifying (108) the determined synthetic tongue-and-head 3D mesh (207) based on a synthetic tongue 3D point-cloud (210) which is in turn indicative of the image (201) of the tongue. This enables reconstructing a 3D model of a head together with the 3D tongue pose of the depicted tongue.

Description

3D TONGUE RECONSTRUCTION FROM SINGLE IMAGES
TECHNICAL FIELD
The present disclosure relates to 3D reconstruction of human tongues for applications in 3D face reconstruction, animation and/or verification. The present disclosure provides, to this end, a method, a computer program and a device.
BACKGROUND
Conventional examples of avatar creation and speech modelling fail to reproduce characteristics of an oral cavity and, in particular, of a tongue. Inter alia, this diminishes realistic representation of avatars, face verification, user experience in augmented/virtual reality applications, and realistic speech dynamics. The human tongue is a highly deformable object and thus cannot be registered under a template; all methods that operate under a common template reference are therefore unavailable. Furthermore, the human tongue is a non-watertight object, i.e., an object that has holes and thus is not closed. As a result, no implicit function approximation methods can be used.
SUMMARY
In view of the above-mentioned problems and disadvantages, it is an object to include characteristics of the oral cavity and, in particular, a pose of the tongue, in 3D face reconstruction.
This and other objectives are achieved by the embodiments as defined by the appended independent claims. Further embodiments are set forth in the dependent claims and in the following description and drawings.
A first aspect of the present disclosure provides a method for 3D reconstruction of a tongue, comprising: determining one of a plurality of synthetic tongue-and-head 3D meshes based on an image of the tongue; and modifying the determined synthetic tongue-and-head 3D mesh based on a synthetic tongue 3D point-cloud being indicative of the image of the tongue.
Thereby, a combined tongue-and-head 3D model may be obtained which has a 3D tongue pose reconstructed from the single image of the tongue.
In an implementation of the method, the determining of the synthetic tongue-and-head 3D mesh comprises encoding the image of the tongue into one of a plurality of latent tongue 3D features representing a corresponding one of a plurality of raw tongue 3D point-clouds; transforming the obtained latent tongue 3D feature into a corresponding one of a plurality of latent tongue-and-head expression shape parameters of a corresponding one of the plurality of synthetic tongue-and-head 3D meshes; and converting the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
Thereby, a roughly approximated synthetic tongue-and-head 3D mesh may be obtained which is subject to further deformation.
In an implementation of the method, the encoding of the image of the tongue into one of a plurality of latent tongue 3D features comprises using an embedding network trained to encode a plurality of images of tongues into the corresponding plurality of the latent tongue 3D features. The plurality of the latent tongue 3D features may be established by autoencoding the plurality of raw tongue 3D point-clouds.
Thereby, efficiently coded labels (latent tongue 3D features) may be obtained which may capture all the desired 3D surface information of the depicted tongues.
In an implementation of the method, the plurality of images of the tongues comprises a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds.
In an implementation of the method, the plurality of images of the tongues are rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D pointcloud.
Thereby, a high quality image dataset is obtained for training purposes, which obviates actual “in-the-wild” images.
In an implementation of the method, the transforming the obtained latent tongue 3D feature comprises using a first regression matrix transforming the plurality of latent tongue 3D features into the plurality of corresponding latent tongue-and-head expression shape parameters of the plurality of corresponding synthetic tongue-and-head 3D meshes.
Thereby, a computationally inexpensive inference phase is achieved, since the first regression matrix brings together various findings attained in a preceding training phase.
In an implementation of the method, the plurality of latent tongue-and-head expression shape parameters are established by applying a second regression matrix to a plurality of first latent tongue-and-mouth expression shape parameters of a plurality of corresponding first tongue-and-mouth landmark vertex sets defined in the plurality of raw tongue 3D point-clouds. The plurality of first latent tongue-and-mouth expression shape parameters of the plurality of corresponding first tongue-and-mouth landmark vertex sets may be established by using a first Principal Component Analysis (PCA) on the plurality of first tongue-and-mouth landmark vertex sets. The first PCA may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets in terms of the underlying landmarks. The second regression matrix may be established by regressing a plurality of second latent tongue-and-mouth expression shape parameters of the plurality of second tongue-and-mouth landmark vertex sets to the plurality of corresponding latent tongue-and-head expression shape parameters. The plurality of latent tongue-and-head expression shape parameters may be established by using a second PCA on the plurality of synthetic tongue-and-head 3D meshes. The second PCA may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes. Thereby, the above-mentioned computationally inexpensive inference phase is prepared by associating the latent tongue 3D features of raw tongue 3D point-clouds with appropriate synthetic tongue-and-head 3D meshes in a PCA representation.
In an implementation of the method, the converting the obtained latent tongue-and-head expression shape parameters comprises using the second PCA to convert the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
Thereby, the efficiently coded latent tongue-and-head expression shape parameters pt of a synthetic tongue-and-head 3D mesh in a PCA representation are projected back to a synthetic tongue-and-head 3D mesh in a Cartesian representation.
In an implementation of the method, the first and second tongue-and-mouth landmark vertex sets are arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud or respective synthetic tongue-and-head 3D mesh.
Thereby, corresponding landmark annotations in raw tongue 3D point-clouds and synthetic tongue-and-head 3D meshes may be associated.
In an implementation of the method, the modifying the determined synthetic tongue-and- head 3D mesh comprises generating a plurality of synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features; and adapting, based on an optimization procedure, the one of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters.
Thereby, a shape of the pre-shaped synthetic tongue-and-head 3D mesh may further be optimized in accordance with the 3D tongue pose depicted in the single input image.
In an implementation of the method, the generating a plurality of synthetic points of the synthetic tongue 3D point-cloud comprises using a generative network of a Generative Adversarial Network (GAN) trained to establish individual synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features and respective Gaussian noise samples to be indistinguishable, by a discriminative network of the GAN, from points of one of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features.
Thereby, individual points of a synthetic tongue 3D point-cloud may be generated in accordance with latent tongue 3D features of a raw tongue 3D point-cloud. Accordingly, raw tongue 3D point-clouds do not need to have the same number of points, which reduces dataset preprocessing, and as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
In an implementation of the method, the points of the one of the plurality of raw tongue 3D point-clouds undergo a diversification by an isotropic multi-variate normal distribution N having a variance that declines with progressing training epoch e of the GAN.
Thereby, the training of the GAN is improved and stabilized, by softening the binary behavior of the discriminative network especially in an early training phase.
In an implementation of the method, the error metric comprises a Chamfer distance loss to modify a 3D position of points of the one of the plurality of synthetic tongue-and-head 3D meshes; a normal loss to modify a 3D orientation of the one of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss to constrain relative 3D positions of neighboring points of the one of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss to constrain any possible outlier points of the one of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss to inhibit penetration of a surface of an oral cavity of the one of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
Thereby, an error between the pre-formed synthetic tongue-and-head 3D mesh and the generated synthetic tongue 3D point-cloud is minimized in accordance with multiple objectives which respectively contribute to a satisfactory result.

In an implementation of the method, the collision loss comprises a sum of distances of each colliding point to a plurality of spheres having a radius and being centered at mouth landmark vertices of the tongue-and-mouth 3D landmark vertex set defined in the one of the synthetic tongue-and-head 3D meshes.
Thereby, an immersion of the tongue of the synthetic tongue-and-head 3D mesh into the mouth region of the synthetic tongue-and-head 3D mesh is effectively avoided.
A second aspect of the present disclosure provides a computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any of its implementations.
A third aspect of the present disclosure provides a device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method of the first aspect or any of its implementations.
As such, the effects and advantages mentioned in connection with the method of the first aspect similarly apply to the computer program of the second aspect as well as to the device of the third aspect.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementations will be explained in the following description of various embodiments in relation to the enclosed drawings, in which
FIG. 1 illustrates a flow chart of a method according to an embodiment of the invention;
FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method of Fig. 1;
FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method of Fig. 1;
FIG. 4 illustrates an exemplary GAN as it may be used in connection with the method of Fig. 1;
FIG. 4a illustrates exemplary Layer and Injection building blocks that may be used in connection with the GAN according to FIG. 4; and
FIG. 5 illustrates a diversification of points of a raw tongue 3D point-cloud used in connection with the method of Fig. 1.
DETAILED DESCRIPTION OF EMBODIMENTS
The above described aspects will now be described with respect to various embodiments illustrated in the enclosed drawings.
The features of these embodiments may be combined with each other unless specified otherwise.
The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art.
FIG. 1 illustrates a flow chart of a method 1 according to an embodiment of the invention, while FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method 1 of Fig. 1.
As used herein, a tongue may refer to a muscular organ being anchored in an oral cavity of a human subject.
The method 1 may achieve a 3D reconstruction of a tongue by comprising: determining 101 one 207 of a plurality of synthetic tongue-and-head 3D meshes based on an image 201 of the tongue; and modifying 108 the determined synthetic tongue-and-head 3D mesh 207 based on a synthetic tongue 3D point-cloud 210 being indicative of the image 201 of the tongue.
In other words, the method 1 may define a processing pipeline as shown in FIG. 2 that can predict a 3D tongue mesh with fixed topology from a single image, which can be further optimized based on a generated point-cloud for more accurate results.
According to FIG. 1, the determining 101 of the synthetic tongue-and-head 3D mesh 207 may comprise encoding 102 the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y representing a corresponding one 401 of a plurality of raw tongue 3D point-clouds.
According to FIGs. 1 and 2, the encoding 102 of the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y may comprise using 103 an embedding network 202.
According to FIG. 1, the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise transforming 104 the obtained latent tongue 3D feature 203, y into a corresponding one 205 of a plurality of latent tongue-and-head expression shape parameters p_t of a corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes. According to FIGs. 1 and 2, the transforming 104 the obtained latent tongue 3D feature 203, y may comprise using 105 a first regression matrix 204, W_t,y transforming the plurality of latent tongue 3D features 203, y into the plurality of corresponding latent tongue-and-head expression shape parameters 205, p_t of the plurality of corresponding synthetic tongue-and-head 3D meshes.

According to FIG. 1, the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise converting 106 the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.

According to FIGs. 1 and 2, the converting 106 of the obtained latent tongue-and-head expression shape parameters 205, p_t of the plurality of synthetic tongue-and-head 3D meshes may comprise using 107 a second PCA 206, U_t to convert the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
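
At inference time, steps 102-107 thus reduce to a forward pass through the embedding network followed by two linear maps. The following is a minimal NumPy sketch under assumed, illustrative dimensions (the actual sizes of y, p_t and the mesh are not given at this point, and the matrices here are random stand-ins for the trained quantities):

```python
import numpy as np

FEAT_DIM = 128   # assumed size of the latent tongue 3D feature y (203)
N_PARAMS = 110   # number of expression shape parameters p_t (cf. n_t = 110 below)
N_VERTS = 5000   # illustrative vertex count of a tongue-and-head 3D mesh

rng = np.random.default_rng(0)
W_ty = rng.normal(size=(N_PARAMS, FEAT_DIM))  # first regression matrix 204 (stand-in)
U_t = np.linalg.qr(rng.normal(size=(3 * N_VERTS, N_PARAMS)))[0]  # PCA basis 206 (stand-in)
mean_mesh = np.zeros(3 * N_VERTS)             # mean synthetic tongue-and-head mesh

def determine_mesh(y: np.ndarray) -> np.ndarray:
    """Steps 104-107: latent feature y -> parameters p_t -> synthetic 3D mesh."""
    p_t = W_ty @ y                  # transforming 104/105 via the regression matrix
    mesh = mean_mesh + U_t @ p_t    # converting 106/107 via PCA back-projection
    return mesh.reshape(-1, 3)      # one (x, y, z) row per vertex

print(determine_mesh(rng.normal(size=FEAT_DIM)).shape)  # (5000, 3)
```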
The afore-mentioned steps 101 - 107 of the method 1 are able to make use of knowledge acquired in a training phase that may be concluded before the inference phase illustrated in FIGs. 1 and 2. This preparatory work in the training phase is described in the following:
At least two separate datasets form the starting point. A first dataset may be captured under controlled conditions and comprises raw 3D tongue scans in a 3D point-cloud form known as the ‘plurality of raw tongue 3D point-clouds’.
A second dataset may be manually created and comprises only synthetic full head data with tongue expressions known as the ‘plurality of synthetic tongue-and-head 3D meshes’. Each of these meshes may be based on the mean template of the Universal Head Model (UHM), which has manually been diversified to render a wide range of tongue expressions. In each one of the datasets, a set of landmark vertices 3 may be annotated around a circumference of the respective tongue and mouth areas. The same landmark protocol may be utilized to annotate the plurality of raw tongue 3D point-clouds as well as the plurality of synthetic tongue-and-head 3D meshes based on the same underlying landmarks. The total number of landmarks in each set of landmark vertices 3 is exemplarily 24 as can be seen in FIG. 3, and is divided into two groups 302, 301 which highlight the tongue and the mouth, respectively. This constitutes a first tongue-and-mouth landmark vertex set 3, l_r per raw tongue 3D point-cloud, and a second tongue-and-mouth landmark vertex set 3, l_t per synthetic tongue-and-head 3D mesh. The set of landmark vertices 3 serves to associate the two datasets, so that a raw tongue 3D point-cloud can be linked to a synthetic tongue-and-head 3D mesh having a (closest) corresponding tongue expression.
A first PCA U_l may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets l_t of the plurality of corresponding synthetic tongue-and-head 3D meshes, and used to establish a plurality of second latent tongue-and-mouth expression shape parameters p_l.
As used herein, a PCA may refer to a process of computing principal components of an n-dimensional point cloud, by fitting an n-dimensional ellipsoid to the point cloud, wherein each axis of the ellipsoid represents a principal component. An i-th principal component may be a direction of a line that is orthogonal to the first i-1 vectors and minimizes the average squared distance from the points to that line. As such, the principal components may be linearly uncorrelated and constitute an orthonormal basis that best fits the point cloud.
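
As a concrete illustration of this definition, the sketch below derives a PCA basis from a set of flattened landmark vertex sets via the singular value decomposition, projects one sample to its latent shape parameters, and reconstructs it. The data are random stand-ins; keeping 15 components mirrors the n_{l_t} = 15 mentioned further below:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 72))      # e.g. 200 sets of 24 landmark vertices * 3 coords
mean = X.mean(axis=0)

# The principal components are the right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
U_l = Vt[:15].T                     # orthonormal basis of the first 15 components

p_l = U_l.T @ (X[0] - mean)         # latent tongue-and-mouth expression shape parameters
x_rec = mean + U_l @ p_l            # back-projection to vertex coordinates
print(np.linalg.norm(X[0] - x_rec)) # residual of the 15-component approximation
```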
A second PCA 206, U_t may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes as a whole, and used to establish a plurality of latent tongue-and-head expression shape parameters 205, p_t.
A second regression matrix W_t,l may be established by regressing the plurality of second latent tongue-and-mouth expression shape parameters p_l to the plurality of latent tongue-and-head expression shape parameters 205, p_t. In other words, the second regression matrix W_t,l associates the landmark vertex sets 3, l_t of the synthetic tongue-and-head 3D meshes with the plurality of corresponding synthetic tongue-and-head 3D meshes.
However, as the landmark vertex sets 3 of the raw tongue 3D point-clouds and the synthetic tongue-and-head 3D meshes relate to the same underlying landmarks, it does not matter whether the particular set of landmark vertices 3 relates to a mesh or a point-cloud. In other words, the second regression matrix W_t,l may also associate the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds with the plurality of corresponding synthetic tongue-and-head 3D meshes. This merely requires (re)using the first PCA U_l on the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds and then (re)using the second regression matrix W_t,l to derive the associated plurality of corresponding synthetic tongue-and-head 3D meshes.
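
Both regression matrices may plausibly be obtained by ordinary least squares over corresponding training pairs; the disclosure does not state the fitting procedure, so the following is an assumption. The sketch fits W_t,l mapping latent tongue-and-mouth parameters p_l to latent tongue-and-head parameters p_t on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
P_l = rng.normal(size=(500, 15))    # one row of p_l per training mesh (stand-in)
P_t = rng.normal(size=(500, 110))   # one row of p_t per training mesh (stand-in)

# Least-squares fit: find W with P_l @ W ~ P_t, i.e. W = argmin ||P_l W - P_t||^2.
W, *_ = np.linalg.lstsq(P_l, P_t, rcond=None)
W_tl = W.T                          # shape (110, 15): maps a single p_l to p_t

p_t_hat = W_tl @ P_l[0]             # regress one landmark-set encoding
print(p_t_hat.shape)                # (110,)
```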
Also in the training phase, the above-mentioned plurality of the latent tongue 3D features 203, y may be established by auto-encoding the plurality of raw tongue 3D point-clouds, and the embedding network 202 may be trained to encode a plurality of images 201 of tongues into the corresponding plurality of the latent tongue 3D features 203, y. In particular, the embedding network 202 may be based on a ResNet-50 model pre-trained on the image database ImageNet and fine-tuned to work as an image encoder. A last layer of the embedding network 202 may be modified to output a vector ŷ with the same dimensions as the ground-truth vector y. The goal of the embedding task is then to minimize an L2 loss between ŷ and y.
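
A minimal PyTorch sketch of such an embedding network; the latent feature size of 128 is an assumption, as the dimensionality of y is not stated:

```python
import torch
import torch.nn as nn
from torchvision import models

FEAT_DIM = 128  # assumed dimensionality of the latent tongue 3D feature y

# ResNet-50 pre-trained on ImageNet, with the last layer replaced so that the
# network outputs a vector y_hat of the same dimensions as the ground truth y.
encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
encoder.fc = nn.Linear(encoder.fc.in_features, FEAT_DIM)

criterion = nn.MSELoss()  # squared L2 loss between y_hat and y

images = torch.randn(8, 3, 224, 224)  # a batch of rendered tongue images (stand-in)
y = torch.randn(8, FEAT_DIM)          # ground-truth features from the autoencoder

loss = criterion(encoder(images), y)
loss.backward()                        # fine-tune the encoder on the embedding task
```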
The above-mentioned plurality of images 201 of the tongues may comprise a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds, and may be rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
More specifically, the plurality of synthetic tongue-and-head 3D meshes may be rendered with a precomputed radiance transfer technique using spherical harmonics which efficiently represent global light scattering. Additionally, 145 second-order spherical harmonics of more than 15 different indoor scenes may be coupled with random light positions and mesh orientations around all 3D axes, resulting in a rich plurality of images 201.
This concludes the determining 101 step that establishes a preliminary synthetic tongue-and-head 3D mesh 207 from a single image 201, and initiates the modifying 108 step that improves a shape of the preliminary mesh 207 based on a synthetic tongue 3D point-cloud 210 which is also indicative of the image 201 of the tongue.
Regressing from single images the expression parameters of the synthetic tongue model yields a good estimate of the tongue position, but fine details such as the volume and the orientation of the tongue are absent. This issue appears because only a small number of the principal components (n_t = 25; n_{l_t} = 15) is utilized in order to avoid unrealistic tongue expressions when bridging the gap between the real and the synthetic data.
To this end, the regressed tongue expression is viewed as an initial shape state, and the full principal component spectrum n_t = 110 of the second PCA 206, U_t is exploited by utilizing a synthetic tongue 3D point-cloud 210 which is indicative of the image 201 of the tongue.
According to FIG. 1, the modifying 108 the determined synthetic tongue-and-head 3D mesh 207 may comprise generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y, and adapting 111, based on an optimization procedure, the one 207 of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud 210 to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters 205, p_t.
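
A sketch of this optimization with PyTorch autograd: the shape parameters p_t are the only free variable, and the gradients of a (stand-in) error metric flow back to them through the fixed PCA basis U_t. Here a Chamfer distance term stands in for the full five-term metric given below, and all sizes are illustrative:

```python
import torch

n_verts, n_params = 2000, 110              # illustrative sizes
U_t = torch.randn(3 * n_verts, n_params)   # fixed PCA basis 206 (stand-in)
mean_mesh = torch.zeros(3 * n_verts)
target_pc = torch.randn(4000, 3)           # synthetic tongue 3D point-cloud 210

p_t = torch.randn(n_params, requires_grad=True)  # initialized from step 105
opt = torch.optim.SGD([p_t], lr=1e-2)

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets."""
    d = torch.cdist(a, b)                  # pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

for step in range(100):
    opt.zero_grad()
    mesh = (mean_mesh + U_t @ p_t).view(-1, 3)   # current mesh vertices
    loss = chamfer(mesh, target_pc)              # stand-in for the full error metric
    loss.backward()                              # back-propagate to p_t (step 111)
    opt.step()
```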
According to FIGs. 1 and 2, the generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4.
As used herein, a Generative Adversarial Network (GAN) may refer to a machine learning framework in which two neural networks (generative and discriminative network agents, respectively) contest with each other in a zero-sum game where one agent's gain is another agent's loss. Given a training set of (real) samples, this framework learns to generate new (synthetic) samples with the same statistics as the training set. The discriminative network is trained by presenting samples from the training set until it achieves acceptable accuracy. The generative network is seeded with randomized input that is sampled from a predefined latent space, and learns to generate (synthetic) samples, i.e., to map from the latent space to a data distribution of interest. The discriminative network seeks to distinguish the synthetic samples produced by the generator from the real samples, i.e., the true data distribution. Backpropagation is applied in both networks so that the generative network generates better synthetic images, while the discriminative network becomes more skilled at flagging synthetic images.
In particular, the generative network 209, G may randomly predict 10K synthetic points G(z, y) that describe a tongue surface in accordance with Gaussian noise samples 208, z, which may for instance be drawn from a standard multivariate Gaussian:

z ~ N(0, I)
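Under the same assumptions (PyTorch; G is the trained generative network 209 mapping a (noise, label) pair to a single 3D point; z_dim and the tensor shapes are illustrative), predicting the 10K surface points for one latent tongue 3D feature y may then be sketched as:

import torch

def sample_tongue_points(G, y, num_points=10_000, z_dim=128):
    z = torch.randn(num_points, z_dim)               # Gaussian noise samples 208
    y_batch = y.unsqueeze(0).expand(num_points, -1)  # same label y for every point
    with torch.no_grad():
        return G(z, y_batch)                         # (num_points, 3) surface points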
The afore-mentioned step 110 of the method 1 makes use of knowledge acquired in the training phase. This preparatory work may comprise training the GAN 4 to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210, based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z, such that they are indistinguishable, by a discriminative network 402 of the GAN 4, from points xt of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
The above-mentioned error metric may comprise a Chamfer distance loss LCD to modify a 3D position of points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a normal loss Lnorm to modify a 3D orientation of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss Llap to constrain relative 3D positions of neighboring points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss Ledge to constrain any possible outlier points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss Lcol to inhibit penetration of a surface of an oral cavity of the one 207 of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points, for example combined as a weighted sum:

L = LCD + λnorm Lnorm + λlap Llap + λedge Ledge + λcol Lcol

where the weights λ balance the individual loss terms.
In particular, the collision loss Lcol may comprise a sum of distances of each colliding point qi to a plurality of spheres k having a radius r and being centered at mouth landmark vertices 301 (see FIG. 3) having coordinates (xk, yk, zk) of the tongue-and-mouth 3D landmark vertex set 3, lt defined in the one 207 of the synthetic tongue-and-head 3D meshes, for example as:

Lcol = Σi Σk max(0, r - ||qi - (xk, yk, zk)||)

where only colliding points qi lying inside a sphere k contribute a non-zero term.
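One possible reading of this collision term, sketched in PyTorch (the names and the radius value are illustrative assumptions), is:

import torch

def collision_loss(points, mouth_landmarks, r=0.01):
    # points: (N, 3) mesh vertices; mouth_landmarks: (K, 3) sphere centers 301
    d = torch.cdist(points, mouth_landmarks)     # (N, K) distances to sphere centers
    penetration = torch.clamp(r - d, min=0.0)    # non-zero only for colliding points
    return penetration.sum()                     # sum of distances to sphere surfaces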
FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method 1 of Fig. 1.
The depicted tongue-and-mouth landmark vertex set 3, lt is arranged circumferentially around a tongue and mouth, respectively, of a synthetic tongue-and-head 3D mesh 207.
The 24 landmarks around the oral cavity of the UHM template are divided into two groups 302, 301 which highlight the tongue and the mouth, respectively.
In the general case, the first and second tongue-and-mouth landmark vertex sets 3, lr, It may be arranged circumferentially around the tongue and the mouth of the respective raw tongue 3D point-cloud 401 or the respective synthetic tongue-and-head 3D mesh 207.
FIG. 4 illustrates an exemplary GAN 4 as it may be used in connection with the method 1 of FIG. 1. As already mentioned in connection with FIG. 1, the generating 109 of the plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4, trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210, based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z, to be indistinguishable, by a discriminative network 402 of the GAN 4, from points xt of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
In order to generate highly detailed 3D point-clouds, a conditional GAN setting may be used in which the generative network 209, G is guided by labels throughout the training process so as to learn to produce samples that belong to specific categories dictated by the labels. In order to generate accurate point-clouds that correspond to certain tongues, the GAN 4 is preferably guided by meaningful labels which capture all the desired 3D surface information. These labels, i.e., the plurality of latent tongue 3D features 203, y, may be learned by auto-encoding the plurality of raw tongue 3D point-clouds. A self-organizing map framework may be used for hierarchical feature extraction. An underlying assumption is that the latent space of point-clouds can be well approximated in a low-rank linear space. Based on this assumption, a PCA embedding of the hierarchical 3D features is computed, and these PCA coefficients are used as a more compact and rich feature representation of the tongue point-clouds. Since the generation is driven by those coefficients, the GAN 4 is a conditional one. Thus, the generative network 209, G produces a novel point-cloud point G(z, y) that belongs to the tongue surface represented by the label y. On the other hand, the discriminative network 403, D receives as inputs the label y and either a real point-cloud point xt (which belongs to the tongue represented by the label y) or the output G(z, y) of the generative network 209, G, and tries to discriminate the fake (i.e., generated) from the real point. Mathematically, this may be described as:

LD = Ext[log D(xt, y)] + Ez[log(1 - D(G(z, y), y))]

LG = Ez[log(1 - D(G(z, y), y))]

where D tries to maximize LD, whereas G tries to minimize LG. The training process is considered complete when D is no longer able to differentiate between the real and fake point-cloud points. Instead of generating whole point-clouds for every provided pair (z, y) of noise and label, respectively, one point corresponding to the surface which the label y represents may be generated at a time. This confers several advantages: Firstly, the raw tongue 3D point-clouds of the training set do not need to have the same number of points, so that the GAN 4 may be trained without any data preprocessing. Secondly, as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
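A minimal training-step sketch of this objective follows (PyTorch; D is assumed to end in a sigmoid so that its output is a probability, and the generator is updated with the common non-saturating variant rather than by literally minimizing LG; all names are illustrative):

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_t, y, z_dim=128):
    # x_t: (B, 3) real point-cloud points; y: (B, y_dim) matching labels
    z = torch.randn(x_t.size(0), z_dim)
    fake = G(z, y)
    ones = torch.ones(x_t.size(0), 1)
    zeros = torch.zeros(x_t.size(0), 1)
    # discriminator step: push real pairs towards 1, generated pairs towards 0
    d_loss = F.binary_cross_entropy(D(x_t, y), ones) + \
             F.binary_cross_entropy(D(fake.detach(), y), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # generator step: make generated pairs look real to the discriminator
    g_loss = F.binary_cross_entropy(D(G(z, y), y), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()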
The generative network 209, G may comprise L(ayer) and I(njection) building blocks 404, 405 being interconnected as shown in FIG. 4, for example. These L/I building blocks 404, 405 may respectively comprise a Multilayer Perceptron (MLP) 406 and Rectified Linear Unit (ReLU) 407 layers, as can be seen in FIG. 4a. Further building blocks describe a processing of the propagated signals. Processing block 408 (symbol c) stands for row-wise concatenation along the channel dimension, whereas processing block 409 (symbol o) stands for the element-wise (i.e., Hadamard) product. The inputs of the generative network 209, G are: a label y corresponding to a particular raw tongue 3D point-cloud from which a 3D point is to be sampled, and a Gaussian noise sample z.
The discriminative network 403, D may be based on the building blocks already mentioned in connection with the generative network 209, G, which may be interconnected as shown in FIG. 4, for example. The inputs of the discriminative network 403, D are: (y, xt), where y is a label corresponding to a raw tongue 3D point-cloud and xt is a point of this particular raw tongue 3D point-cloud, or (y, G(z, y)), where G(z, y) is a point generated in accordance with this particular raw tongue 3D point-cloud. The switch symbol between the generative network 209, G and the discriminative network 403, D indicates that this feed is performed on a random basis.
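A speculative sketch of such L and I building blocks follows; the exact wiring of FIG. 4, the layer widths and the injection scheme are assumptions and may differ from the depicted network:

import torch
import torch.nn as nn

class LayerBlock(nn.Module):
    # "L" block 404: a Multilayer Perceptron 406 followed by a ReLU 407
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class InjectionBlock(nn.Module):
    # "I" block 405: merges a side input by row-wise concatenation (block 408)
    # before the MLP and ReLU; a Hadamard product (block 409) may combine branches
    def __init__(self, in_dim, side_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + side_dim, out_dim), nn.ReLU())

    def forward(self, x, side):
        return self.net(torch.cat([x, side], dim=-1))   # row-wise concatenation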
FIG. 5 illustrates a diversification of points xt of a raw tongue 3D point-cloud 401 used in connection with the method 1 of Fig. 1.
The discriminative network 403, D shows a binary behavior in that it decides whether a point is fake or real. This rigidity is not very helpful, especially in the early steps of the training process, as the generative network 209, G struggles to learn the distribution of points of the plurality of raw tongue 3D point-clouds (i.e., all of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, the strict nature of the discriminative network 403, D may be softened, especially in the initial training steps, by diversifying the points xt fed to it. To achieve that, instead of directly feeding a real point xt corresponding to a label y to the discriminative network 403, D, the following is provided:
x̃t ~ N(xt, σe I)

In other words, the points xt of the one 401 of the plurality of raw tongue 3D point-clouds may undergo a diversification 402 (see FIG. 4) by an isotropic multi-variate normal distribution N having mean xt and (isotropic) variance σe that declines with progressing training epoch e of the GAN 4.
Thereby, when the training process commences, the generative network 209, G can better learn the distribution of points of the plurality of raw tongue 3D point-clouds, as it does not get severely punished by the discriminative network 403, D when it slightly misses the actual surface. This yields better results and stabilizes the training. The training may be started with a relatively small value for the variance σe, which is subsequently reduced further until it becomes zero towards the final training epochs e.
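A sketch of this diversification (PyTorch; the linear decay schedule and the initial value sigma_0 are assumptions, and sigma_e is used here directly as the noise scale):

import torch

def diversify(x_t, epoch, total_epochs, sigma_0=0.05):
    # isotropic Gaussian perturbation around the real point x_t, with a noise
    # level sigma_e declining to zero over the course of the training epochs
    sigma_e = sigma_0 * max(0.0, 1.0 - epoch / total_epochs)
    return x_t + sigma_e * torch.randn_like(x_t)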
Those skilled in the art will appreciate that besides the method 1 explained previously, a computer program (not shown) may be provided comprising executable instructions which, when executed by a processor, cause the processor to perform the above-mentioned method 1, and that a device (not shown) for 3D reconstruction of a tongue may be provided comprising a processor configured to perform the above-mentioned method 1.
The processor or processing circuitry of the device may comprise hardware and/or the processing circuitry may be controlled by software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The present disclosure describes various embodiments as examples as well as implementations. However, other variations can be understood and effected by those skilled in the art in practicing the claimed subject-matter, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A method (1) for 3D reconstruction of a tongue, comprising: determining (101) one (207) of a plurality of synthetic tongue-and-head 3D meshes based on an image (201) of the tongue; and modifying (108) the determined synthetic tongue-and-head 3D mesh (207) based on a synthetic tongue 3D point-cloud (210) being indicative of the image (201) of the tongue.
2. The method (1) of claim 1, the determining (101) of the synthetic tongue-and-head 3D mesh (207) comprising encoding (102) the image (201) of the tongue into one of a plurality of latent tongue 3D features (203, y) representing a corresponding one (401) of a plurality of raw tongue 3D point-clouds; transforming (104) the obtained latent tongue 3D feature (203, y) into a corresponding one (205) of a plurality of latent tongue-and-head expression shape parameters (pt) of a corresponding one (207) of the plurality of synthetic tongue-and-head 3D meshes; and converting (106) the obtained latent tongue-and-head expression shape parameters (205, pt) into the corresponding one (207) of the plurality of synthetic tongue-and-head 3D meshes.
3. The method (1) of claim 2, the encoding (102) of the image (201) of the tongue into one of a plurality of latent tongue 3D features (203, y) comprising using (103) an embedding network (202) trained to encode a plurality of images (201) of tongues into the corresponding plurality of the latent tongue 3D features (203, y); the plurality of the latent tongue 3D features (203, y) being established by auto-encoding the plurality of raw tongue 3D point-clouds.
4. The method (1) of claim 3, the plurality of images (201) of the tongues comprising a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D pointclouds.
5. The method (1) of claim 4, the plurality of images (201) of the tongues being rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
6. The method (1) of any of the claims 2 to 5, the transforming (104) the obtained latent tongue 3D feature (203, y) comprising using (105) a first regression matrix (204, Wt,y) transforming the plurality of latent tongue 3D features (203, y) into the plurality of corresponding latent tongue-and-head expression shape parameters (205, pt) of the plurality of corresponding synthetic tongue- and-head 3D meshes.
7. The method (1) of claim 6, the plurality of latent tongue-and-head expression shape parameters (205, pt) being established by applying a second regression matrix (Wt,i) to a plurality of first latent tongue-and-mouth expression shape parameters (pi) of a plurality of corresponding first tongue-and-mouth landmark vertex sets (lr) defined in the plurality of raw tongue 3D point-clouds; the plurality of first latent tongue-and-mouth expression shape parameters (pi) of the plurality of corresponding first tongue-and-mouth landmark vertex sets (lr) being established by using a first PCA (Ult) on the plurality of first tongue-and-mouth landmark vertex sets (lr); the first PCA (Ult) being established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets (It) defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets (lr) in terms of the underlying landmarks; the second regression matrix (Wt,i) being established by regressing a plurality of second latent tongue-and-mouth expression shape parameters (pi) of the plurality of second tongue-and-mouth landmark vertex sets (It) to the plurality of corresponding latent tongue-and-head expression shape parameters (205, pt); the plurality of latent tongue-and-head expression shape parameters (205, pt) being established by using a second PCA (206, Ut) on the plurality of synthetic tongue-and- head 3D meshes; and the second PCA (206, Ut) being established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes.
8. The method (1) of claim 7, the converting (106) the obtained latent tongue-and-head expression shape parameters (205, pt) comprising using (107) the second PCA (206, Ut) to convert the obtained latent tongue-and-head expression shape parameters (205, pt) into the corresponding one (207) of the plurality of synthetic tongue-and-head 3D meshes.
9. The method (1) of claim 7 or claim 8, the first and second tongue-and-mouth landmark vertex sets (3, lr, It) being arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud (401) or respective synthetic tongue-and-head 3D mesh (207).
10. The method (1) of any of the claims 2 to 9, the modifying (108) the determined synthetic tongue-and-head 3D mesh (207) comprising generating (109) a plurality of synthetic points (G(z, y)) of the synthetic tongue 3D point-cloud (210) based on the one of the plurality of latent tongue 3D features (203, y); and adapting (111), based on an optimization procedure, the one (207) of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud (210) to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters (205, pt).
11. The method (1) of claim 10, the generating (109) a plurality of synthetic points (G(z, y)) of the synthetic tongue 3D point-cloud (210) comprising using (110) a generative network (209) of a generative adversarial network, GAN (4), trained to establish individual synthetic points (G(z, y)) of the synthetic tongue 3D point-cloud (210) based on the one of the plurality of latent tongue 3D features (203, y) and respective Gaussian noise samples (208, z) to be indistinguishable, by a discriminative network (402) of the GAN (4), from points (x) of one (401) of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features (203, y).
12. The method (1) of claim 11, the points (x) of the one (401) of the plurality of raw tongue 3D point-clouds undergoing a diversification by an isotropic multi-variate normal distribution N having a variance (σe) that declines with progressing training epoch e of the GAN (4).
13. The method (1) of any of the claims 10 to 12, the error metric comprising a Chamfer distance loss (LCD) to modify a 3D position of points of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; a normal loss (Lnorm) to modify a 3D orientation of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss (Llap) to constrain relative 3D positions of neighboring points of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss (Ledge) to constrain any possible outlier points of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss (Lcol) to inhibit penetration of a surface of an oral cavity of the one (207) of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
14. The method (1) of claim 13, the collision loss (Lcol) comprising a sum of distances of each colliding point (qi) to a plurality of spheres (k) having a radius (r) and being centered at mouth landmark vertices (301) of the tongue-and-mouth 3D landmark vertex set (3, lt) defined in the one (207) of the synthetic tongue-and-head 3D meshes.
15. A computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method (1) of any of the claims 1 to 14.
16. A device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method (1) of any of the claims 1 to 14.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/081148 WO2022096105A1 (en) 2020-11-05 2020-11-05 3d tongue reconstruction from single images


Publications (1)

Publication Number Publication Date
WO2022096105A1 true WO2022096105A1 (en) 2022-05-12

Family

ID=73172707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/081148 Ceased WO2022096105A1 (en) 2020-11-05 2020-11-05 3d tongue reconstruction from single images

Country Status (1)

Country Link
WO (1) WO2022096105A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028961A1 (en) * 2015-08-14 2017-02-23 Thomson Licensing 3d reconstruction of a human ear from a point cloud
CN107256575A (en) * 2017-04-07 2017-10-17 天津市天中依脉科技开发有限公司 A kind of three-dimensional tongue based on binocular stereo vision is as method for reconstructing
WO2020053551A1 (en) * 2018-09-12 2020-03-19 Sony Interactive Entertainment Inc. Method and system for generating a 3d reconstruction of a human
EP3726467A1 (en) * 2019-04-18 2020-10-21 Zebra Medical Vision Ltd. Systems and methods for reconstruction of 3d anatomical images from 2d anatomical images

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HEWER ALEXANDER ET AL: "A multilinear tongue model derived from speech related MRI data of the human vocal tract", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 51, 21 February 2018 (2018-02-21), pages 68 - 92, XP085398906, ISSN: 0885-2308, DOI: 10.1016/J.CSL.2018.02.001 *
JUN YU ET AL: "A realistic and reliable 3D pronunciation visualization instruction system for computer-assisted language learning", 2016 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), IEEE, 15 December 2016 (2016-12-15), pages 786 - 789, XP033046448, DOI: 10.1109/BIBM.2016.7822623 *
YU JUN: "A Real-Time Music VR System for 3D External and Internal Articulators", 2019 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES (VR), IEEE, 23 March 2019 (2019-03-23), pages 1259 - 1260, XP033597801, DOI: 10.1109/VR.2019.8798288 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210248812A1 (en) * 2021-03-05 2021-08-12 University Of Electronic Science And Technology Of China Method for reconstructing a 3d object based on dynamic graph network
US11715258B2 (en) * 2021-03-05 2023-08-01 University Of Electronic Science And Technology Of China Method for reconstructing a 3D object based on dynamic graph network
CN117649494A (en) * 2024-01-29 2024-03-05 南京信息工程大学 Reconstruction method and system of three-dimensional tongue body based on point cloud pixel matching
CN117649494B (en) * 2024-01-29 2024-04-19 南京信息工程大学 A three-dimensional tongue reconstruction method and system based on point cloud pixel matching
CN120726059A (en) * 2025-09-03 2025-09-30 浙江中医药大学 A method and system for traditional Chinese medicine tongue image recognition based on image processing

Similar Documents

Publication Publication Date Title
US12347010B2 (en) Single image-based real-time body animation
Ye et al. Audio-driven talking face video generation with dynamic convolution kernels
Ichim et al. Dynamic 3D avatar creation from hand-held video input
CN114973349B (en) Facial image processing method and facial image processing model training method
CN115004236A (en) Photo-level realistic talking face from audio
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
Bermano et al. Facial performance enhancement using dynamic shape space analysis
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
Yu et al. A video, text, and speech-driven realistic 3-D virtual head for human–machine interface
CN114202615B (en) Facial expression reconstruction method, device, equipment and storage medium
CN118553001B (en) Texture-controllable three-dimensional fine face reconstruction method and device based on sketch input
US20230141392A1 (en) Systems and methods for human pose and shape recovery
WO2022096105A1 (en) 3d tongue reconstruction from single images
Dundar et al. Fine detailed texture learning for 3d meshes with generative models
CN117635897A (en) Three-dimensional object posture complement method, device, equipment, storage medium and product
CN114863013A (en) A method for reconstructing 3D model of target object
Gan et al. Fine-grained multi-view hand reconstruction using inverse rendering
Lifkooee et al. Real-time avatar pose transfer and motion generation using locally encoded laplacian offsets
CN114998690B (en) A text-controlled 3D face generation method based on StyleCLIP and 3DDFA
Park et al. Df-3dface: One-to-many speech synchronized 3d face animation with diffusion
Ekmen et al. From 2D to 3D real-time expression transfer for facial animation
Song et al. Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation.
Quan et al. Facial animation using CycleGAN
Gan et al. Xhand: Real-time expressive hand avatar
Wang et al. OT-Talk: Animating 3D Talking Head with Optimal Transportation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20803514

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20803514

Country of ref document: EP

Kind code of ref document: A1