
WO2022096105A1 - 3d tongue reconstruction from single images - Google Patents


Info

Publication number
WO2022096105A1
Authority
WO
WIPO (PCT)
Prior art keywords
tongue
synthetic
head
latent
meshes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2020/081148
Other languages
French (fr)
Inventor
Stylianos PLOUMPIS
Stylianos MOSCHOGLOU
Vasilios TRIANTAFYLLOU
Stefanos ZAFEIRIOU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/EP2020/081148 priority Critical patent/WO2022096105A1/en
Publication of WO2022096105A1 publication Critical patent/WO2022096105A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00: Indexing scheme for image generation or computer graphics
    • G06T2210/56: Particle system, point based geometry or rendering
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20: Indexing scheme for editing of 3D models
    • G06T2219/2004: Aligning objects, relative positioning of parts
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20: Indexing scheme for editing of 3D models
    • G06T2219/2021: Shape modification

Definitions

  • the present disclosure relates to 3D reconstruction of human tongues for applications in 3D face reconstruction, animation and/or verification.
  • the present disclosure provides, to this end, a method, a computer program and a device.
  • a first aspect of the present disclosure provides a method for 3D reconstruction of a tongue, comprising: determining one of a plurality of synthetic tongue-and-head 3D meshes based on an image of the tongue; and modifying the determined synthetic tongue-and-head 3D mesh based on a synthetic tongue 3D point-cloud being indicative of the image of the tongue.
  • a combined tongue-and-head 3D model may be obtained which has a 3D tongue pose reconstructed from the single image of the tongue.
  • the determining of the synthetic tongue-and-head 3D mesh comprises encoding the image of the tongue into one of a plurality of latent tongue 3D features representing a corresponding one of a plurality of raw tongue 3D point-clouds; transforming the obtained latent tongue 3D feature into a corresponding one of a plurality of latent tongue-and-head expression shape parameters of a corresponding one of the plurality of synthetic tongue-and-head 3D meshes; and converting the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
  • the encoding of the image of the tongue into one of a plurality of latent tongue 3D features comprises using an embedding network trained to encode a plurality of images of tongues into the corresponding plurality of the latent tongue 3D features.
  • the plurality of the latent tongue 3D features may be established by autoencoding the plurality of raw tongue 3D point-clouds.
  • the plurality of images of the tongues comprises a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds.
  • the plurality of images of the tongues are rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D pointcloud.
  • the transforming the obtained latent tongue 3D feature comprises using a first regression matrix transforming the plurality of latent tongue 3D features into the plurality of corresponding latent tongue-and-head expression shape parameters of the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the plurality of latent tongue-and-head expression shape parameters are established by applying a second regression matrix to a plurality of first latent tongue-and-mouth expression shape parameters of a plurality of corresponding first tongue-and-mouth landmark vertex sets defined in the plurality of raw tongue 3D point-clouds.
  • the plurality of first latent tongue-and-mouth expression shape parameters of the plurality of corresponding first tongue-and-mouth landmark vertex sets may be established by using a first Principal Component Analysis (PCA) on the plurality of first tongue-and-mouth landmark vertex sets.
  • PCA Principal Component Analysis
  • the first PCA may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets in terms of the underlying landmarks.
  • the second regression matrix may be established by regressing a plurality of second latent tongue-and-mouth expression shape parameters of the plurality of second tongue-and-mouth landmark vertex sets to the plurality of corresponding latent tongue-and-head expression shape parameters.
  • the plurality of latent tongue-and-head expression shape parameters may be established by using a second PCA on the plurality of synthetic tongue-and-head 3D meshes.
  • the second PCA may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes.
  • the above-mentioned computationally inexpensive inference phase is prepared by associating the latent tongue 3D features of raw tongue 3D point-clouds with appropriate synthetic tongue-and-head 3D meshes in a PCA representation.
  • the converting the obtained latent tongue-and-head expression shape parameters comprises using the second PCA to convert the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
  • the first and second tongue-and-mouth landmark vertex sets are arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud or respective synthetic tongue-and-head 3D mesh.
  • the modifying the determined synthetic tongue-and- head 3D mesh comprises generating a plurality of synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features; and adapting, based on an optimization procedure, the one of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters.
  • a shape of the pre-shaped synthetic tongue-and-head 3D mesh may further be optimized in accordance with the 3D tongue pose depicted in the single input image.
  • the generating a plurality of synthetic points of the synthetic tongue 3D point-cloud comprises using a generative network of a Generative Adversarial Network (GAN) trained to establish individual synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features and respective Gaussian noise samples to be indistinguishable, by a discriminative network of the GAN, from points of one of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features.
  • GAN Generative Adversarial Network
  • individual points of a synthetic tongue 3D point-cloud may be generated in accordance with latent tongue 3D features of a raw tongue 3D point-cloud. Accordingly, raw tongue 3D point-clouds do not need to have the same number of points, which reduces dataset preprocessing, and as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
  • the points of the one of the plurality of raw tongue 3D point-clouds undergo a diversification by an isotropic multi-variate normal distribution N having a variance that declines with progressing training epoch e of the GAN.
  • the training of the GAN is improved and stabilized, by softening the binary behavior of the discriminative network especially in an early training phase.
  • the error metric comprises a Chamfer distance loss to modify a 3D position of points of the one of the plurality of synthetic tongue-and-head 3D meshes; a normal loss to modify a 3D orientation of the one of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss to constrain relative 3D positions of neighboring points of the one of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss to constrain any possible outlier points of the one of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss to inhibit penetration of a surface of an oral cavity of the one of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
  • the collision loss comprises a sum of distances of each colliding point to a plurality of spheres having a radius and being centered at mouth landmark vertices of the tongue-and-mouth 3D landmark vertex set defined in the one of the synthetic tongue-and-head 3D meshes.
  • a second aspect of the present disclosure provides a computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any of its implementations.
  • a third aspect of the present disclosure provides a device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method of the first aspect or any of its implementations.
  • FIG. 1 illustrates a flow chart of a method according to an embodiment of the invention
  • FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method of Fig. 1;
  • FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method of Fig. 1;
  • FIG. 4 illustrates an exemplary GAN as it may be used in connection with the method of Fig. 1;
  • FIG. 4a illustrates exemplary Layer and Injection building blocks that may be used in connection with the GAN according to FIG. 4.
  • FIG. 5 illustrates a diversification of points of a raw tongue 3D point-cloud used in connection with the method of Fig. 1.
  • FIG. 1 illustrates a flow chart of a method 1 according to an embodiment of the invention
  • FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method 1 of Fig. 1.
  • a tongue may refer to a muscular organ being anchored in an oral cavity of a human subject.
  • the method 1 may achieve a 3D reconstruction of a tongue by comprising: determining 101 one 207 of a plurality of synthetic tongue-and-head 3D meshes based on an image 201 of the tongue; and modifying 108 the determined synthetic tongue-and-head 3D mesh 207 based on a synthetic tongue 3D point-cloud 210 being indicative of the image 201 of the tongue.
  • the method 1 may define a processing pipeline as shown in FIG. 2 that can predict a 3D tongue mesh with fixed topology from a single image, which can be further optimized based on a generated point-cloud for more accurate results.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may comprise encoding 102 the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y representing a corresponding one 401 of a plurality of raw tongue 3D point-clouds.
  • the encoding 102 of the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y may comprise using 103 an embedding network 202.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise transforming 104 the obtained latent tongue 3D feature 203, y into a corresponding one 205 of a plurality of latent tongue-and-head expression shape parameters p_t of a corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • the transforming 104 the obtained latent tongue 3D feature 203, y may comprise using 105 a first regression matrix 204, W_t,y transforming the plurality of latent tongue 3D features 203, y into the plurality of corresponding latent tongue-and-head expression shape parameters 205, p_t of the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise converting 106 the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • the converting 106 of the obtained latent tongue-and-head expression shape parameters 205, p_t of the plurality of synthetic tongue-and-head 3D meshes may comprise using 107 a second PCA 206, U_t to convert the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • steps 101 - 107 of the method 1 are able to make use of knowledge acquired in a training phase that may be concluded before the inference phase illustrated in FIGs. 1 and 2.
  • This preparatory work in the training phase is described in the following:
  • a first dataset may be captured under controlled conditions and comprises raw 3D tongue scans in a 3D point-cloud form known as the ‘plurality of raw tongue 3D point-clouds’.
  • a second dataset may be manually created and comprises only synthetic full head data with tongue expressions known as the ‘plurality of synthetic tongue-and-head 3D meshes’.
  • Each of these meshes may be based on the mean template of the Universal Head Model (UHM), which has manually been diversified to render a wide range of tongue expressions.
  • UHM Universal Head Model
  • a set of landmark vertices 3 may be annotated around a circumference of the respective tongue and mouth areas.
  • the same landmark protocol may be utilized to annotate the plurality of raw tongue 3D point-clouds as well as the plurality of synthetic tongue-and-head 3D meshes based on the same underlying landmarks.
  • the total number of landmarks in each set of landmark vertices 3 is exemplarily 24 as can be seen in FIG. 3, and is divided into two groups 302, 301 which highlight the tongue and the mouth, respectively. This constitutes a first tongue-and-mouth landmark vertex set 3, l_r per raw tongue 3D point-cloud, and a second tongue-and-mouth landmark vertex set 3, l_t per synthetic tongue-and-head 3D mesh.
  • the set of landmark vertices 3 serves to associate the two datasets, so that a raw tongue 3D point-cloud can be linked to a synthetic tongue-and-head 3D mesh having a (closest) corresponding tongue expression.
  • a first PCA U_l may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets l_t of the plurality of corresponding synthetic tongue-and-head 3D meshes, and used to establish a plurality of second latent tongue-and-mouth expression shape parameters p_l.
  • a PCA may refer to a process of computing principal components of an n-dimensional point cloud, by fitting an n-dimensional ellipsoid to the point cloud, wherein each axis of the ellipsoid represents a principal component.
  • An i-th principal component may be a direction of a line that is orthogonal to the first i-1 vectors and minimizes the average squared distance from the points to that line.
  • the principal components may be linearly uncorrelated and constitute an orthonormal basis that best fits the point cloud.
  • a second PCA 206, U t may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes as a whole, and used to establish a plurality of latent tongue-and-head expression shape parameters 205, p t .
  • a second regression matrix W_t,l may be established by regressing the plurality of second latent tongue-and-mouth expression shape parameters p_l to the plurality of latent tongue-and-head expression shape parameters 205, p_t.
  • the second regression matrix W_t,l associates the landmark vertex sets 3, l_t of the synthetic tongue-and-head 3D meshes with the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the second regression matrix W_t,l may also associate the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds with the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the above-mentioned plurality of the latent tongue 3D features 203, y may be established by auto-encoding the plurality of raw tongue 3D point-clouds, and the embedding network 202 may be trained to encode a plurality of images 201 of tongues into the corresponding plurality of the latent tongue 3D features 203, y.
  • the embedding network 202 may be based on a ResNet-50 model pre-trained on the image database ImageNet and fine-tuned to work as an image encoder.
  • a last layer of the embedding network 202 may be modified to output a vector ŷ with the same dimensions as the ground-truth vector y. The goal of the embedding task is then to minimize an L2 loss between ŷ and y.
  • the above-mentioned plurality of images 201 of the tongues may comprise a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds, and may be rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
  • the plurality of synthetic tongue-and-head 3D meshes may be rendered with a precomputed radiance transfer technique using spherical harmonics which efficiently represent global light scattering.
  • 145 second-order spherical harmonics of more than 15 different indoor scenes may be coupled with random light positions and mesh orientations around all 3D axes, resulting in a rich plurality of images 201.
  • the modifying 108 the determined synthetic tongue-and-head 3D mesh 207 may comprise generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y, and adapting 111, based on an optimization procedure, the one 207 of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud 210 to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters 205, p_t.
  • the generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4.
  • a Generative Adversarial Network may refer to a machine learning framework in which two neural networks (generative and discriminative network agents, respectively) contest with each other in a zero-sum game where one agent's gain is another agent's loss. Given a training set of (real) samples, this framework learns to generate new (synthetic) samples with the same statistics as the training set.
  • the discriminative network is trained by presenting samples from the training set until it achieves acceptable accuracy.
  • the generative network is seeded with randomized input that is sampled from a predefined latent space, and learns to generate (synthetic) samples, i.e., to map from the latent space to a data distribution of interest.
  • the discriminative network seeks to distinguish the synthetic samples produced by the generator from the real samples, i.e., the true data distribution. Backpropagation is applied in both networks so that the generative network generates better synthetic images, while the discriminative network becomes more skilled at flagging synthetic images.
  • the generative network 209, G may randomly predict 10K synthetic points G(z, y) that describe a tongue surface in accordance with Gaussian noise samples 208, z (see the formula in the detailed description below).
  • the afore-mentioned step 110 of the method 1 makes use of knowledge acquired in the training phase.
  • This preparatory work in the training phase may comprise that the GAN 4 is trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 402 of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
  • the above-mentioned error metric may comprise a Chamfer distance loss L_CD to modify a 3D position of points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a normal loss L_norm to modify a 3D orientation of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss L_lap to constrain relative 3D positions of neighboring points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss L_edge to constrain any possible outlier points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss L_col to inhibit penetration of a surface of an oral cavity of the one 207 of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points (see the formula in the detailed description below).
  • the collision loss L_col may comprise a sum of distances of each colliding point q' to a plurality of spheres k having a radius r and being centered at mouth landmark vertices 301 of the tongue-and-mouth 3D landmark vertex set 3, l_t defined in the one 207 of the synthetic tongue-and-head 3D meshes (see the formula in the detailed description below).
  • FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method 1 of Fig. 1.
  • the depicted tongue-and-mouth landmark vertex set 3, l_t is arranged circumferentially around a tongue and mouth, respectively, of a synthetic tongue-and-head 3D mesh 207.
  • the 24 landmarks around the oral cavity of the UHM template are divided into two groups 302, 301 which highlight the tongue and the mouth, respectively.
  • the first and second tongue-and-mouth landmark vertex sets 3, l_r, l_t may be arranged circumferentially around the tongue and the mouth of the respective raw tongue 3D point-cloud 401 or the respective synthetic tongue-and-head 3D mesh 207.
  • FIG. 4 illustrates an exemplary GAN 4 as it may be used in connection with the method 1 of Fig. 1.
  • the generating 109 of the plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4, trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 402 of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
  • a conditional GAN setting may be used in which the generative network 209, G is guided by labels throughout the training process in order to learn to produce samples that belong to specific categories which are dictated by the labels.
  • the GAN 4 is preferably guided by meaningful labels which capture all the desired 3D surface information.
  • These labels i.e., the plurality of latent tongue 3D features 203, y, may be learned by auto-encoding the plurality of raw tongue 3D point-clouds.
  • a self-organizing map framework may be used for hierarchical feature extraction.
  • the discriminative network 403, D receives as inputs the label y together with either a real point-cloud point x_t (which belongs to the tongue represented by the label y) or the output G(z, y) of the generative network 209, G, and tries to discriminate the fake (i.e., generated) point from the real one.
  • this may be described as: L_D = E_{x_t}[log D(x_t, y)] + E_z[log(1 - D(G(z, y), y))].
  • D tries to maximize L_D.
  • G tries to minimize L_D.
  • one point corresponding to the surface which the label y represents may be generated at a time.
  • the raw tongue 3D point-clouds of the training set do not need to have a same number of points, so that the GAN 4 may be trained without any data preprocessing.
  • as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
  • the generative network 209, G may comprise L(ayer) and I(njection) building blocks 404, 405 being interconnected as shown in FIG. 4, for example. These L/I building blocks 404, 405 may respectively comprise a Multilayer Perceptron (MLP) 406 and Rectified Linear Unit (ReLU) 407 layers, as can be seen in FIG. 4a. Further building blocks describe a processing of the propagated signals. Processing block 408 (symbol c) stands for row-wise concatenation along the channel dimension, whereas processing block 409 (symbol o) stands for the element-wise (i.e., Hadamard) product; an illustrative sketch of these blocks is given at the end of this section.
  • the inputs of the generative network 209, G are: a label y corresponding to a particular raw tongue 3D point-cloud from which a 3D point is to be sampled, and a Gaussian noise sample z.
  • the discriminative network 403, D may be based on the building blocks already mentioned in connection with the generative network 209, G which may be interconnected as shown in FIG. 4, for example.
  • the inputs of the discriminative network 403, D are: (y, x_t), where y is a label corresponding to a raw tongue 3D point-cloud and x_t is a point of this particular raw tongue 3D point-cloud, or (y, G(z, y)), where G(z, y) is a point generated in accordance with this particular raw tongue 3D point-cloud.
  • the switch symbol between the generative network 209, G and the discriminative network 403, D indicates that this feed is performed on a random basis.
  • FIG. 5 illustrates a diversification of points x_t of a raw tongue 3D point-cloud 401 used in connection with the method 1 of Fig. 1.
  • the discriminative network 403, D shows a binary behavior in that it decides whether a point is either fake or real. This rigidity is not very helpful especially in the early steps of the training process, as the generative network 209, G struggles to learn the distribution of points of the plurality of raw tongue 3D point-clouds (i.e., all of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, the strict nature of the discriminative network 403, D may be softened, especially in the initial training steps, by diversifying the points x_t fed to it. To achieve that, instead of directly feeding a real point x_t corresponding to a label y to the discriminative network 403, D, the following is provided:
  • the points x_t of the one 401 of the plurality of raw tongue 3D point-clouds may undergo a diversification 402 (see FIG. 4) by an isotropic multi-variate normal distribution N having mean x_t and (isotropic) variance σ_e that declines with progressing training epoch e of the GAN 4, i.e., the diversified point may be drawn from N(x_t, σ_e I); an annealing sketch is given at the end of this section.
  • the generative network 209, G can better learn the distribution of points of the plurality of raw tongue 3D point-clouds as it does not get severely punished by the discriminative network 403, D when it slightly misses the actual surface. This yields better results and stabilizes the training.
  • the training may be started with a relatively small value for the variance σ_e, which is subsequently reduced until it becomes zero towards the final training epochs e.
  • a computer program (not shown) may be provided comprising executable instructions which, when executed by a processor, cause the processor to perform the above-mentioned method 1, and that a device (not shown) for 3D reconstruction of a tongue may be provided comprising a processor configured to perform the above-mentioned method 1.
  • the processor or processing circuitry of the device may comprise hardware and/or the processing circuitry may be controlled by software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • ASICs application-specific integrated circuits
  • FPGAs field-programmable gate arrays
  • DSPs digital signal processors
  • multi-purpose processors
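
Two of the mechanisms recited above lend themselves to short, hedged sketches (Python is used throughout; all layer sizes and schedules are assumptions, since the disclosure does not fix them). First, the diversification of real points with a variance that declines over the training epochs:

```python
import torch

def diversify(x_t: torch.Tensor, epoch: int, total_epochs: int,
              sigma_0: float = 0.01) -> torch.Tensor:
    """Draw a perturbed point from N(x_t, sigma_e * I), where the scale
    sigma_e decays linearly to zero. The decay law and sigma_0 are
    assumptions; the disclosure only states that the variance declines
    with progressing epoch e and reaches zero towards the final epochs."""
    sigma_e = sigma_0 * max(0.0, 1.0 - epoch / total_epochs)
    return x_t + sigma_e * torch.randn_like(x_t)
```

Second, the Layer and Injection building blocks of the generative network: each combines a Multilayer Perceptron with a ReLU, and the surrounding graph uses row-wise concatenation (block 408, symbol c) and an element-wise Hadamard product (block 409, symbol o). The wiring below is an illustrative guess, not a reproduction of FIG. 4a:

```python
import torch
import torch.nn as nn

class LayerBlock(nn.Module):
    """'L' building block 404: an MLP layer followed by a ReLU."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

class InjectionBlock(nn.Module):
    """'I' building block 405 (illustrative): mixes a side signal into the
    main path via concatenation ('c') and a Hadamard product ('o')."""
    def __init__(self, d_x: int, d_side: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_x + d_side, d_x), nn.ReLU())

    def forward(self, x: torch.Tensor, side: torch.Tensor) -> torch.Tensor:
        h = torch.cat([x, side], dim=-1)   # row-wise concatenation (408)
        return self.mlp(h) * x             # element-wise Hadamard product (409)
```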

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method (1) for 3D reconstruction of a tongue is provided. The method (1) comprises: determining (101) one (207) of a plurality of synthetic tongue-and-head 3D meshes based on an image (201) of the tongue; and modifying (108) the determined synthetic tongue-and-head 3D mesh (207) based on a synthetic tongue 3D point-cloud (210) which is in turn indicative of the image (201) of the tongue. This enables reconstructing a 3D model of a head together with the 3D tongue pose of the depicted tongue.

Description

3D TONGUE RECONSTRUCTION FROM SINGLE IMAGES
TECHNICAL FIELD
The present disclosure relates to 3D reconstruction of human tongues for applications in 3D face reconstruction, animation and/or verification. The present disclosure provides, to this end, a method, a computer program and a device.
BACKGROUND
Conventional examples of avatar creation and speech modelling fail to reproduce characteristics of an oral cavity and, in particular, of a tongue. Inter alia, this diminishes realistic representation of avatars, face verification, user experience in augmented/virtual reality applications, and realistic speech dynamics. The human tongue is a highly deformable object and thus cannot be registered under a template; all methods that operate under a common template reference are therefore unavailable. Furthermore, the human tongue is a non-watertight object, i.e., an object that has holes and thus is not closed. As a result, no implicit function approximation methods can be used.
SUMMARY
In view of the above-mentioned problems and disadvantages, it is an object to include characteristics of the oral cavity and, in particular, a pose of the tongue, in 3D face reconstruction.
This and other objectives are achieved by the embodiments as defined by the appended independent claims. Further embodiments are set forth in the dependent claims and in the following description and drawings.
A first aspect of the present disclosure provides a method for 3D reconstruction of a tongue, comprising: determining one of a plurality of synthetic tongue-and-head 3D meshes based on an image of the tongue; and modifying the determined synthetic tongue-and-head 3D mesh based on a synthetic tongue 3D point-cloud being indicative of the image of the tongue.
Thereby, a combined tongue-and-head 3D model may be obtained which has a 3D tongue pose reconstructed from the single image of the tongue.
In an implementation of the method, the determining of the synthetic tongue-and-head 3D mesh comprises encoding the image of the tongue into one of a plurality of latent tongue 3D features representing a corresponding one of a plurality of raw tongue 3D point-clouds; transforming the obtained latent tongue 3D feature into a corresponding one of a plurality of latent tongue-and-head expression shape parameters of a corresponding one of the plurality of synthetic tongue-and-head 3D meshes; and converting the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
Thereby, a roughly approximated synthetic tongue-and-head 3D mesh may be obtained which is subject to further deformation.
In an implementation of the method, the encoding of the image of the tongue into one of a plurality of latent tongue 3D features comprises using an embedding network trained to encode a plurality of images of tongues into the corresponding plurality of the latent tongue 3D features. The plurality of the latent tongue 3D features may be established by autoencoding the plurality of raw tongue 3D point-clouds.
Thereby, efficiently coded labels (latent tongue 3D features) may be obtained which may capture all the desired 3D surface information of the depicted tongues.
In an implementation of the method, the plurality of images of the tongues comprises a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds.
In an implementation of the method, the plurality of images of the tongues are rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D pointcloud.
Thereby, a high quality image dataset is obtained for training purposes, which obviates actual “in-the-wild” images.
In an implementation of the method, the transforming the obtained latent tongue 3D feature comprises using a first regression matrix transforming the plurality of latent tongue 3D features into the plurality of corresponding latent tongue-and-head expression shape parameters of the plurality of corresponding synthetic tongue-and-head 3D meshes.
Thereby, a computationally inexpensive inference phase is achieved, since the first regression matrix brings together various findings attained in a preceding training phase.
In an implementation of the method, the plurality of latent tongue-and-head expression shape parameters are established by applying a second regression matrix to a plurality of first latent tongue-and-mouth expression shape parameters of a plurality of corresponding first tongue-and-mouth landmark vertex sets defined in the plurality of raw tongue 3D point-clouds. The plurality of first latent tongue-and-mouth expression shape parameters of the plurality of corresponding first tongue-and-mouth landmark vertex sets may be established by using a first Principal Component Analysis (PCA) on the plurality of first tongue-and-mouth landmark vertex sets. The first PCA may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets in terms of the underlying landmarks. The second regression matrix may be established by regressing a plurality of second latent tongue-and-mouth expression shape parameters of the plurality of second tongue-and-mouth landmark vertex sets to the plurality of corresponding latent tongue-and-head expression shape parameters. The plurality of latent tongue-and-head expression shape parameters may be established by using a second PCA on the plurality of synthetic tongue-and-head 3D meshes. The second PCA may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes. Thereby, the above-mentioned computationally inexpensive inference phase is prepared by associating the latent tongue 3D features of raw tongue 3D point-clouds with appropriate synthetic tongue-and-head 3D meshes in a PCA representation.
In an implementation of the method, the converting the obtained latent tongue-and-head expression shape parameters comprises using the second PCA to convert the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
Thereby, the efficiently coded latent tongue-and-head expression shape parameters pt of a synthetic tongue-and-head 3D mesh in a PCA representation are projected back to a synthetic tongue-and-head 3D mesh in a Cartesian representation.
In an implementation of the method, the first and second tongue-and-mouth landmark vertex sets are arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud or respective synthetic tongue-and-head 3D mesh.
Thereby, corresponding landmark annotations in raw tongue 3D point-clouds and synthetic tongue-and-head 3D meshes may be associated.
In an implementation of the method, the modifying the determined synthetic tongue-and- head 3D mesh comprises generating a plurality of synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features; and adapting, based on an optimization procedure, the one of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters.
Thereby, a shape of the pre-shaped synthetic tongue-and-head 3D mesh may further be optimized in accordance with the 3D tongue pose depicted in the single input image.
In an implementation of the method, the generating a plurality of synthetic points of the synthetic tongue 3D point-cloud comprises using a generative network of a Generative Adversarial Network (GAN) trained to establish individual synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features and respective Gaussian noise samples to be indistinguishable, by a discriminative network of the GAN, from points of one of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features.
Thereby, individual points of a synthetic tongue 3D point-cloud may be generated in accordance with latent tongue 3D features of a raw tongue 3D point-cloud. Accordingly, raw tongue 3D point-clouds do not need to have the same number of points, which reduces dataset preprocessing, and as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
In an implementation of the method, the points of the one of the plurality of raw tongue 3D point-clouds undergo a diversification by an isotropic multi-variate normal distribution N having a variance that declines with progressing training epoch e of the GAN.
Thereby, the training of the GAN is improved and stabilized, by softening the binary behavior of the discriminative network especially in an early training phase.
In an implementation of the method, the error metric comprises a Chamfer distance loss to modify a 3D position of points of the one of the plurality of synthetic tongue-and-head 3D meshes; a normal loss to modify a 3D orientation of the one of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss to constrain relative 3D positions of neighboring points of the one of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss to constrain any possible outlier points of the one of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss to inhibit penetration of a surface of an oral cavity of the one of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
Thereby, an error between the pre-formed synthetic tongue-and-head 3D mesh and the generated synthetic tongue 3D point-cloud is minimized in accordance with multiple objectives which respectively contribute to a satisfactory result.

In an implementation of the method, the collision loss comprises a sum of distances of each colliding point to a plurality of spheres having a radius and being centered at mouth landmark vertices of the tongue-and-mouth 3D landmark vertex set defined in the one of the synthetic tongue-and-head 3D meshes.
Thereby, an immersion of the tongue of the synthetic tongue-and-head 3D mesh into the mouth region of the synthetic tongue-and-head 3D mesh is effectively avoided.
A second aspect of the present disclosure provides a computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any of its implementations.
A third aspect of the present disclosure provides a device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method of the first aspect or any of its implementations.
As such, the effects and advantages mentioned in connection with the method of the first aspect similarly apply to the computer program of the second aspect as well as to the device of the third aspect.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementations will be explained in the following description of various embodiments in relation to the enclosed drawings, in which
FIG. 1 illustrates a flow chart of a method according to an embodiment of the invention;
FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method of Fig. 1;
FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method of Fig. 1;
FIG. 4 illustrates an exemplary GAN as it may be used in connection with the method of Fig. 1;
FIG. 4a illustrates exemplary Layer and Injection building blocks that may be used in connection with the GAN according to FIG. 4; and
FIG. 5 illustrates a diversification of points of a raw tongue 3D point-cloud used in connection with the method of Fig. 1.
DETAILED DESCRIPTION OF EMBODIMENTS
The above described aspects will now be described with respect to various embodiments illustrated in the enclosed drawings.
The features of these embodiments may be combined with each other unless specified otherwise.
The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art.
FIG. 1 illustrates a flow chart of a method 1 according to an embodiment of the invention, while FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method 1 of Fig. 1.
As used herein, a tongue may refer to a muscular organ being anchored in an oral cavity of a human subject.
The method 1 may achieve a 3D reconstruction of a tongue by comprising: determining 101 one 207 of a plurality of synthetic tongue-and-head 3D meshes based on an image 201 of the tongue; and modifying 108 the determined synthetic tongue-and-head 3D mesh 207 based on a synthetic tongue 3D point-cloud 210 being indicative of the image 201 of the tongue.
In other words, the method 1 may define a processing pipeline as shown in FIG. 2 that can predict a 3D tongue mesh with fixed topology from a single image, which can be further optimized based on a generated point-cloud for more accurate results.
According to FIG. 1, the determining 101 of the synthetic tongue-and-head 3D mesh 207 may comprise encoding 102 the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y representing a corresponding one 401 of a plurality of raw tongue 3D point-clouds.
According to FIGs. 1 and 2, the encoding 102 of the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y may comprise using 103 an embedding network 202.
According to FIG. 1, the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise transforming 104 the obtained latent tongue 3D feature 203, y into a corresponding one 205 of a plurality of latent tongue-and-head expression shape parameters p_t of a corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes. According to FIGs. 1 and 2, the transforming 104 the obtained latent tongue 3D feature 203, y may comprise using 105 a first regression matrix 204, W_t,y transforming the plurality of latent tongue 3D features 203, y into the plurality of corresponding latent tongue-and-head expression shape parameters 205, p_t of the plurality of corresponding synthetic tongue-and-head 3D meshes.

According to FIG. 1, the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise converting 106 the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.

According to FIGs. 1 and 2, the converting 106 of the obtained latent tongue-and-head expression shape parameters 205, p_t of the plurality of synthetic tongue-and-head 3D meshes may comprise using 107 a second PCA 206, U_t to convert the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
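
At inference time, steps 102-107 thus reduce to a forward pass through the embedding network followed by two linear maps. The following is a minimal NumPy sketch under assumed, illustrative dimensions (the actual sizes of y, p_t and the mesh are not given at this point, and the matrices here are random stand-ins for the trained quantities):

```python
import numpy as np

FEAT_DIM = 128   # assumed size of the latent tongue 3D feature y (203)
N_PARAMS = 110   # number of expression shape parameters p_t (cf. n_t = 110 below)
N_VERTS = 5000   # illustrative vertex count of a tongue-and-head 3D mesh

rng = np.random.default_rng(0)
W_ty = rng.normal(size=(N_PARAMS, FEAT_DIM))  # first regression matrix 204 (stand-in)
U_t = np.linalg.qr(rng.normal(size=(3 * N_VERTS, N_PARAMS)))[0]  # PCA basis 206 (stand-in)
mean_mesh = np.zeros(3 * N_VERTS)             # mean synthetic tongue-and-head mesh

def determine_mesh(y: np.ndarray) -> np.ndarray:
    """Steps 104-107: latent feature y -> parameters p_t -> synthetic 3D mesh."""
    p_t = W_ty @ y                  # transforming 104/105 via the regression matrix
    mesh = mean_mesh + U_t @ p_t    # converting 106/107 via PCA back-projection
    return mesh.reshape(-1, 3)      # one (x, y, z) row per vertex

print(determine_mesh(rng.normal(size=FEAT_DIM)).shape)  # (5000, 3)
```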
The afore-mentioned steps 101 - 107 of the method 1 are able to make use of knowledge acquired in a training phase that may be concluded before the inference phase illustrated in FIGs. 1 and 2. This preparatory work in the training phase is described in the following:
At least two separate datasets form the starting point. A first dataset may be captured under controlled conditions and comprises raw 3D tongue scans in a 3D point-cloud form known as the ‘plurality of raw tongue 3D point-clouds’.
A second dataset may be manually created and comprises only synthetic full head data with tongue expressions known as the ‘plurality of synthetic tongue-and-head 3D meshes’. Each of these meshes may be based on the mean template of the Universal Head Model (UHM), which has manually been diversified to render a wide range of tongue expressions. In each one of the datasets, a set of landmark vertices 3 may be annotated around a circumference of the respective tongue and mouth areas. The same landmark protocol may be utilized to annotate the plurality of raw tongue 3D point-clouds as well as the plurality of synthetic tongue-and-head 3D meshes based on the same underlying landmarks. The total number of landmarks in each set of landmark vertices 3 is exemplarily 24 as can be seen in FIG. 3, and is divided into two groups 302, 301 which highlight the tongue and the mouth, respectively. This constitutes a first tongue-and-mouth landmark vertex set 3, l_r per raw tongue 3D point-cloud, and a second tongue-and-mouth landmark vertex set 3, l_t per synthetic tongue-and-head 3D mesh. The set of landmark vertices 3 serves to associate the two datasets, so that a raw tongue 3D point-cloud can be linked to a synthetic tongue-and-head 3D mesh having a (closest) corresponding tongue expression.
A first PCA U_l may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets l_t of the plurality of corresponding synthetic tongue-and-head 3D meshes, and used to establish a plurality of second latent tongue-and-mouth expression shape parameters p_l.
As used herein, a PCA may refer to a process of computing principal components of an n-dimensional point cloud, by fitting an n-dimensional ellipsoid to the point cloud, wherein each axis of the ellipsoid represents a principal component. An i-th principal component may be a direction of a line that is orthogonal to the first i-1 vectors and minimizes the average squared distance from the points to that line. As such, the principal components may be linearly uncorrelated and constitute an orthonormal basis that best fits the point cloud.
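
As a concrete illustration of this definition, the sketch below derives a PCA basis from a set of flattened landmark vertex sets via the singular value decomposition, projects one sample to its latent shape parameters, and reconstructs it. The data are random stand-ins; keeping 15 components mirrors the n_{l_t} = 15 mentioned further below:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 72))      # e.g. 200 sets of 24 landmark vertices * 3 coords
mean = X.mean(axis=0)

# The principal components are the right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
U_l = Vt[:15].T                     # orthonormal basis of the first 15 components

p_l = U_l.T @ (X[0] - mean)         # latent tongue-and-mouth expression shape parameters
x_rec = mean + U_l @ p_l            # back-projection to vertex coordinates
print(np.linalg.norm(X[0] - x_rec)) # residual of the 15-component approximation
```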
A second PCA 206, U_t may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes as a whole, and used to establish a plurality of latent tongue-and-head expression shape parameters 205, p_t.
A second regression matrix W_t,l may be established by regressing the plurality of second latent tongue-and-mouth expression shape parameters p_l to the plurality of latent tongue-and-head expression shape parameters 205, p_t. In other words, the second regression matrix W_t,l associates the landmark vertex sets 3, l_t of the synthetic tongue-and-head 3D meshes with the plurality of corresponding synthetic tongue-and-head 3D meshes.
However, as the landmark vertex sets 3 of the raw tongue 3D point-clouds and the synthetic tongue-and-head 3D meshes relate to the same underlying landmarks, it does not matter whether the particular set of landmark vertices 3 relates to a mesh or a point-cloud. In other words, the second regression matrix W_t,l may also associate the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds with the plurality of corresponding synthetic tongue-and-head 3D meshes. This merely requires (re)using the first PCA U_l on the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds and then (re)using the second regression matrix W_t,l to derive the associated plurality of corresponding synthetic tongue-and-head 3D meshes.
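
Both regression matrices may plausibly be obtained by ordinary least squares over corresponding training pairs; the disclosure does not state the fitting procedure, so the following is an assumption. The sketch fits W_t,l mapping latent tongue-and-mouth parameters p_l to latent tongue-and-head parameters p_t on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
P_l = rng.normal(size=(500, 15))    # one row of p_l per training mesh (stand-in)
P_t = rng.normal(size=(500, 110))   # one row of p_t per training mesh (stand-in)

# Least-squares fit: find W with P_l @ W ~ P_t, i.e. W = argmin ||P_l W - P_t||^2.
W, *_ = np.linalg.lstsq(P_l, P_t, rcond=None)
W_tl = W.T                          # shape (110, 15): maps a single p_l to p_t

p_t_hat = W_tl @ P_l[0]             # regress one landmark-set encoding
print(p_t_hat.shape)                # (110,)
```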
Also in the training phase, the above-mentioned plurality of the latent tongue 3D features 203, y may be established by auto-encoding the plurality of raw tongue 3D point-clouds, and the embedding network 202 may be trained to encode a plurality of images 201 of tongues into the corresponding plurality of the latent tongue 3D features 203, y. In particular, the embedding network 202 may be based on a ResNet-50 model pre-trained on the image database ImageNet and fine-tuned to work as an image encoder. A last layer of the embedding network 202 may be modified to output a vector ŷ with the same dimensions as the ground-truth vector y. The goal of the embedding task is then to minimize an L2 loss between ŷ and y.
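
A minimal PyTorch sketch of such an embedding network; the latent feature size of 128 is an assumption, as the dimensionality of y is not stated:

```python
import torch
import torch.nn as nn
from torchvision import models

FEAT_DIM = 128  # assumed dimensionality of the latent tongue 3D feature y

# ResNet-50 pre-trained on ImageNet, with the last layer replaced so that the
# network outputs a vector y_hat of the same dimensions as the ground truth y.
encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
encoder.fc = nn.Linear(encoder.fc.in_features, FEAT_DIM)

criterion = nn.MSELoss()  # squared L2 loss between y_hat and y

images = torch.randn(8, 3, 224, 224)  # a batch of rendered tongue images (stand-in)
y = torch.randn(8, FEAT_DIM)          # ground-truth features from the autoencoder

loss = criterion(encoder(images), y)
loss.backward()                        # fine-tune the encoder on the embedding task
```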
The above-mentioned plurality of images 201 of the tongues may comprise a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds, and may be rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
More specifically, the plurality of synthetic tongue-and-head 3D meshes may be rendered with a precomputed radiance transfer technique using spherical harmonics which efficiently represent global light scattering. Additionally, 145 second-order spherical harmonics of more than 15 different indoor scenes may be coupled with random light positions and mesh orientations around all 3D axes, resulting in a rich plurality of images 201.
This concludes the determining 101 step that establishes a preliminary synthetic tongue-and-head 3D mesh 207 from a single image 201, and initiates the modifying 108 step that improves a shape of the preliminary mesh 207 based on a synthetic tongue 3D point-cloud 210 which is also indicative of the image 201 of the tongue.
Regressing from single images the expression parameters of the synthetic tongue model yields a good estimate of the tongue position, but fine details such as the volume and the orientation of the tongue are absent. This issue appears because only a small number of the principal components (n_t = 25; n_{l_t} = 15) is utilized in order to avoid unrealistic tongue expressions when bridging the gap between the real and the synthetic data.
To this end, the regressed tongue expression is viewed as an initial shape state, and the full principal component spectrum n_t = 110 of the second PCA 206, U_t is exploited by utilizing a synthetic tongue 3D point-cloud 210 which is indicative of the image 201 of the tongue.
According to FIG. 1, the modifying 108 the determined synthetic tongue-and-head 3D mesh 207 may comprise generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y, and adapting 111, based on an optimization procedure, the one 207 of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud 210 to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters 205, p_t.
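
A sketch of this optimization with PyTorch autograd: the shape parameters p_t are the only free variable, and the gradients of a (stand-in) error metric flow back to them through the fixed PCA basis U_t. Here a Chamfer distance term stands in for the full five-term metric given below, and all sizes are illustrative:

```python
import torch

n_verts, n_params = 2000, 110              # illustrative sizes
U_t = torch.randn(3 * n_verts, n_params)   # fixed PCA basis 206 (stand-in)
mean_mesh = torch.zeros(3 * n_verts)
target_pc = torch.randn(4000, 3)           # synthetic tongue 3D point-cloud 210

p_t = torch.randn(n_params, requires_grad=True)  # initialized from step 105
opt = torch.optim.SGD([p_t], lr=1e-2)

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets."""
    d = torch.cdist(a, b)                  # pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

for step in range(100):
    opt.zero_grad()
    mesh = (mean_mesh + U_t @ p_t).view(-1, 3)   # current mesh vertices
    loss = chamfer(mesh, target_pc)              # stand-in for the full error metric
    loss.backward()                              # back-propagate to p_t (step 111)
    opt.step()
```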
According to FIGs. 1 and 2, the generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4.
As used herein, a Generative Adversarial Network (GAN) may refer to a machine learning framework in which two neural networks (generative and discriminative network agents, respectively) contest with each other in a zero-sum game where one agent's gain is another agent's loss. Given a training set of (real) samples, this framework learns to generate new (synthetic) samples with the same statistics as the training set. The discriminative network is trained by presenting samples from the training set until it achieves acceptable accuracy. The generative network is seeded with randomized input that is sampled from a predefined latent space, and learns to generate (synthetic) samples, i.e., to map from the latent space to a data distribution of interest. The discriminative network seeks to distinguish the synthetic samples produced by the generator from the real samples, i.e., the true data distribution. Backpropagation is applied in both networks so that the generative network generates better synthetic images, while the discriminative network becomes more skilled at flagging synthetic images.
In particular, the generative network 209, G may randomly predict 10K synthetic points G(z, y) that describe a tongue surface in accordance with Gaussian noise samples 208, z, which may for instance be drawn from a standard multivariate Gaussian:

z ~ N(0, I)
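Under the same assumptions (PyTorch; G is the trained generative network 209 mapping a (noise, label) pair to a single 3D point; z_dim and the tensor shapes are illustrative), predicting the 10K surface points for one latent tongue 3D feature y may then be sketched as:

import torch

def sample_tongue_points(G, y, num_points=10_000, z_dim=128):
    z = torch.randn(num_points, z_dim)               # Gaussian noise samples 208
    y_batch = y.unsqueeze(0).expand(num_points, -1)  # same label y for every point
    with torch.no_grad():
        return G(z, y_batch)                         # (num_points, 3) surface points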
The afore-mentioned step 110 of the method 1 makes use of knowledge acquired in the training phase. This preparatory work may comprise training the GAN 4 to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210, based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z, such that they are indistinguishable, by a discriminative network 402 of the GAN 4, from points xt of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
The above-mentioned error metric may comprise a Chamfer distance loss LCD to modify a 3D position of points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a normal loss Lnorm to modify a 3D orientation of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss Llap to constrain relative 3D positions of neighboring points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss Ledge to constrain any possible outlier points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss Lcol to inhibit penetration of a surface of an oral cavity of the one 207 of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points, for example combined as a weighted sum:

L = LCD + λnorm Lnorm + λlap Llap + λedge Ledge + λcol Lcol

where the weights λ balance the individual loss terms.
In particular, the collision loss Lcol may comprise a sum of distances of each colliding point qi to a plurality of spheres k having a radius r and being centered at mouth landmark vertices 301 (see FIG. 3) having coordinates (xk, yk, zk) of the tongue-and-mouth 3D landmark vertex set 3, lt defined in the one 207 of the synthetic tongue-and-head 3D meshes, for example as:

Lcol = Σi Σk max(0, r - ||qi - (xk, yk, zk)||)

where only colliding points qi lying inside a sphere k contribute a non-zero term.
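One possible reading of this collision term, sketched in PyTorch (the names and the radius value are illustrative assumptions), is:

import torch

def collision_loss(points, mouth_landmarks, r=0.01):
    # points: (N, 3) mesh vertices; mouth_landmarks: (K, 3) sphere centers 301
    d = torch.cdist(points, mouth_landmarks)     # (N, K) distances to sphere centers
    penetration = torch.clamp(r - d, min=0.0)    # non-zero only for colliding points
    return penetration.sum()                     # sum of distances to sphere surfaces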
FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method 1 of Fig. 1.
The depicted tongue-and-mouth landmark vertex set 3, lt is arranged circumferentially around a tongue and mouth, respectively, of a synthetic tongue-and-head 3D mesh 207.
The 24 landmarks around the oral cavity of the UHM template are divided into two groups 302, 301 which highlight the tongue and the mouth, respectively.
In the general case, the first and second tongue-and-mouth landmark vertex sets 3, lr, It may be arranged circumferentially around the tongue and the mouth of the respective raw tongue 3D point-cloud 401 or the respective synthetic tongue-and-head 3D mesh 207.
FIG. 4 illustrates an exemplary GAN 4 as it may be used in connection with the method 1 of FIG. 1. As already mentioned in connection with FIG. 1, the generating 109 of the plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4, trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210, based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z, to be indistinguishable, by a discriminative network 402 of the GAN 4, from points xt of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
In order to generate highly detailed 3D point-clouds, a conditional GAN setting may be used in which the generative network 209, G is guided by labels throughout the training process so as to learn to produce samples that belong to specific categories dictated by the labels. In order to generate accurate point-clouds that correspond to certain tongues, the GAN 4 is preferably guided by meaningful labels which capture all the desired 3D surface information. These labels, i.e., the plurality of latent tongue 3D features 203, y, may be learned by auto-encoding the plurality of raw tongue 3D point-clouds. A self-organizing map framework may be used for hierarchical feature extraction. An underlying assumption is that the latent space of point-clouds can be well approximated in a low-rank linear space. Based on this assumption, a PCA embedding of the hierarchical 3D features is computed, and these PCA coefficients are used as a more compact and rich feature representation of the tongue point-clouds. Since the generation is driven by those coefficients, the GAN 4 is a conditional one. Thus, the generative network 209, G produces a novel point-cloud point G(z, y) that belongs to the tongue surface represented by the label y. On the other hand, the discriminative network 403, D receives as inputs the label y and either a real point-cloud point xt (which belongs to the tongue represented by the label y) or the output G(z, y) of the generative network 209, G, and tries to discriminate the fake (i.e., generated) from the real point. Mathematically, this may be described as:

LD = Ext[log D(xt, y)] + Ez[log(1 - D(G(z, y), y))]

LG = Ez[log(1 - D(G(z, y), y))]

where D tries to maximize LD, whereas G tries to minimize LG. The training process is considered complete when D is no longer able to differentiate between the real and fake point-cloud points. Instead of generating whole point-clouds for every provided pair (z, y) of noise and label, respectively, one point corresponding to the surface which the label y represents may be generated at a time. This confers several advantages: Firstly, the raw tongue 3D point-clouds of the training set do not need to have the same number of points, so that the GAN 4 may be trained without any data preprocessing. Secondly, as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
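A minimal training-step sketch of this objective follows (PyTorch; D is assumed to end in a sigmoid so that its output is a probability, and the generator is updated with the common non-saturating variant rather than by literally minimizing LG; all names are illustrative):

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_t, y, z_dim=128):
    # x_t: (B, 3) real point-cloud points; y: (B, y_dim) matching labels
    z = torch.randn(x_t.size(0), z_dim)
    fake = G(z, y)
    ones = torch.ones(x_t.size(0), 1)
    zeros = torch.zeros(x_t.size(0), 1)
    # discriminator step: push real pairs towards 1, generated pairs towards 0
    d_loss = F.binary_cross_entropy(D(x_t, y), ones) + \
             F.binary_cross_entropy(D(fake.detach(), y), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # generator step: make generated pairs look real to the discriminator
    g_loss = F.binary_cross_entropy(D(G(z, y), y), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()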
The generative network 209, G may comprise L(ayer) and I(njection) building blocks 404, 405 being interconnected as shown in FIG. 4, for example. These L/I building blocks 404, 405 may respectively comprise a Multilayer Perceptron (MLP) 406 and Rectified Linear Unit (ReLU) 407 layers, as can be seen in FIG. 4a. Further building blocks describe a processing of the propagated signals. Processing block 408 (symbol c) stands for row-wise concatenation along the channel dimension, whereas processing block 409 (symbol o) stands for the element-wise (i.e., Hadamard) product. The inputs of the generative network 209, G are: a label y corresponding to a particular raw tongue 3D point-cloud from which a 3D point is to be sampled, and a Gaussian noise sample z.
The discriminative network 403, D may be based on the building blocks already mentioned in connection with the generative network 209, G, which may be interconnected as shown in FIG. 4, for example. The inputs of the discriminative network 403, D are: (y, xt), where y is a label corresponding to a raw tongue 3D point-cloud and xt is a point of this particular raw tongue 3D point-cloud, or (y, G(z, y)), where G(z, y) is a point generated in accordance with this particular raw tongue 3D point-cloud. The switch symbol between the generative network 209, G and the discriminative network 403, D indicates that this feed is performed on a random basis.
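A speculative sketch of such L and I building blocks follows; the exact wiring of FIG. 4, the layer widths and the injection scheme are assumptions and may differ from the depicted network:

import torch
import torch.nn as nn

class LayerBlock(nn.Module):
    # "L" block 404: a Multilayer Perceptron 406 followed by a ReLU 407
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class InjectionBlock(nn.Module):
    # "I" block 405: merges a side input by row-wise concatenation (block 408)
    # before the MLP and ReLU; a Hadamard product (block 409) may combine branches
    def __init__(self, in_dim, side_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + side_dim, out_dim), nn.ReLU())

    def forward(self, x, side):
        return self.net(torch.cat([x, side], dim=-1))   # row-wise concatenation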
FIG. 5 illustrates a diversification of points xt of a raw tongue 3D point-cloud 401 used in connection with the method 1 of Fig. 1.
The discriminative network 403, D shows a binary behavior in that it decides whether a point is fake or real. This rigidity is not very helpful, especially in the early steps of the training process, as the generative network 209, G struggles to learn the distribution of points of the plurality of raw tongue 3D point-clouds (i.e., all of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, the strict nature of the discriminative network 403, D may be softened, especially in the initial training steps, by diversifying the points xt fed to it. To achieve that, instead of directly feeding a real point xt corresponding to a label y to the discriminative network 403, D, the following is provided:
x̃t ~ N(xt, σe I)

In other words, the points xt of the one 401 of the plurality of raw tongue 3D point-clouds may undergo a diversification 402 (see FIG. 4) by an isotropic multi-variate normal distribution N having mean xt and (isotropic) variance σe that declines with progressing training epoch e of the GAN 4.
Thereby, when the training process commences, the generative network 209, G can better learn the distribution of points of the plurality of raw tongue 3D point-clouds, as it does not get severely punished by the discriminative network 403, D when it slightly misses the actual surface. This yields better results and stabilizes the training. The training may be started with a relatively small value for the variance σe, which is subsequently reduced further until it becomes zero towards the final training epochs e.
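A sketch of this diversification (PyTorch; the linear decay schedule and the initial value sigma_0 are assumptions, and sigma_e is used here directly as the noise scale):

import torch

def diversify(x_t, epoch, total_epochs, sigma_0=0.05):
    # isotropic Gaussian perturbation around the real point x_t, with a noise
    # level sigma_e declining to zero over the course of the training epochs
    sigma_e = sigma_0 * max(0.0, 1.0 - epoch / total_epochs)
    return x_t + sigma_e * torch.randn_like(x_t)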
Those skilled in the art will appreciate that besides the method 1 explained previously, a computer program (not shown) may be provided comprising executable instructions which, when executed by a processor, cause the processor to perform the above-mentioned method 1, and that a device (not shown) for 3D reconstruction of a tongue may be provided comprising a processor configured to perform the above-mentioned method 1.
The processor or processing circuitry of the device may comprise hardware and/or the processing circuitry may be controlled by software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The present disclosure describes various embodiments as examples as well as implementations. However, other variations can be understood and effected by those skilled in the art in practicing the claimed subject-matter, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A method (1) for 3D reconstruction of a tongue, comprising: determining (101) one (207) of a plurality of synthetic tongue-and-head 3D meshes based on an image (201) of the tongue; and modifying (108) the determined synthetic tongue-and-head 3D mesh (207) based on a synthetic tongue 3D point-cloud (210) being indicative of the image (201) of the tongue.
2. The method (1) of claim 1, the determining (101) of the synthetic tongue-and-head 3D mesh (207) comprising encoding (102) the image (201) of the tongue into one of a plurality of latent tongue 3D features (203, y) representing a corresponding one (401) of a plurality of raw tongue 3D point-clouds; transforming (104) the obtained latent tongue 3D feature (203, y) into a corresponding one (205) of a plurality of latent tongue-and-head expression shape parameters (pt) of a corresponding one (207) of the plurality of synthetic tongue-and-head 3D meshes; and converting (106) the obtained latent tongue-and-head expression shape parameters (205, pt) into the corresponding one (207) of the plurality of synthetic tongue-and-head 3D meshes.
3. The method (1) of claim 2, the encoding (102) of the image (201) of the tongue into one of a plurality of latent tongue 3D features (203, y) comprising using (103) an embedding network (202) trained to encode a plurality of images (201) of tongues into the corresponding plurality of the latent tongue 3D features (203, y); the plurality of the latent tongue 3D features (203, y) being established by auto-encoding the plurality of raw tongue 3D point-clouds.
4. The method (1) of claim 3, the plurality of images (201) of the tongues comprising a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D pointclouds.
5. The method (1) of claim 4, the plurality of images (201) of the tongues being rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
6. The method (1) of any of the claims 2 to 5, the transforming (104) the obtained latent tongue 3D feature (203, y) comprising using (105) a first regression matrix (204, Wt,y) transforming the plurality of latent tongue 3D features (203, y) into the plurality of corresponding latent tongue-and-head expression shape parameters (205, pt) of the plurality of corresponding synthetic tongue- and-head 3D meshes.
7. The method (1) of claim 6, the plurality of latent tongue-and-head expression shape parameters (205, pt) being established by applying a second regression matrix (Wt,i) to a plurality of first latent tongue-and-mouth expression shape parameters (pi) of a plurality of corresponding first tongue-and-mouth landmark vertex sets (lr) defined in the plurality of raw tongue 3D point-clouds; the plurality of first latent tongue-and-mouth expression shape parameters (pi) of the plurality of corresponding first tongue-and-mouth landmark vertex sets (lr) being established by using a first PCA (Ult) on the plurality of first tongue-and-mouth landmark vertex sets (lr); the first PCA (Ult) being established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets (It) defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets (lr) in terms of the underlying landmarks; the second regression matrix (Wt,i) being established by regressing a plurality of second latent tongue-and-mouth expression shape parameters (pi) of the plurality of second tongue-and-mouth landmark vertex sets (It) to the plurality of corresponding latent tongue-and-head expression shape parameters (205, pt); the plurality of latent tongue-and-head expression shape parameters (205, pt) being established by using a second PCA (206, Ut) on the plurality of synthetic tongue-and- head 3D meshes; and the second PCA (206, Ut) being established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes.
8. The method (1) of claim 7, the converting (106) the obtained latent tongue-and-head expression shape parameters (205, pt) comprising using (107) the second PCA (206, Ut) to convert the obtained latent tongue-and-head expression shape parameters (205, pt) into the corresponding one (207) of the plurality of synthetic tongue-and-head 3D meshes.
9. The method (1) of claim 7 or claim 8, the first and second tongue-and-mouth landmark vertex sets (3, lr, It) being arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud (401) or respective synthetic tongue-and-head 3D mesh (207).
10. The method (1) of any of the claims 2 to 9, the modifying (108) the determined synthetic tongue-and-head 3D mesh (207) comprising generating (109) a plurality of synthetic points (G(z, y)) of the synthetic tongue 3D point-cloud (210) based on the one of the plurality of latent tongue 3D features (203, y); and adapting (111), based on an optimization procedure, the one (207) of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud (210) to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters (205, pt).
11. The method (1) of claim 10, the generating (109) a plurality of synthetic points (G(z, y)) of the synthetic tongue 3D point-cloud (210) comprising using (110) a generative network (209) of a generative adversarial network, GAN (4), trained to establish individual synthetic points (G(z, y)) of the synthetic tongue 3D point-cloud (210) based on the one of the plurality of latent tongue 3D features (203, y) and respective Gaussian noise samples (208, z) to be indistinguishable, by a discriminative network (402) of the GAN (4), from points (x) of one (401) of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features (203, y).
12. The method (1) of claim 11, the points (x) of the one (401) of the plurality of raw tongue 3D point-clouds undergoing a diversification by an isotropic multi-variate normal distribution N having a variance (σe) that declines with progressing training epoch e of the GAN (4).
13. The method (1) of any of the claims 10 to 12, the error metric comprising a Chamfer distance loss (LCD) to modify a 3D position of points of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; a normal loss (Lnorm) to modify a 3D orientation of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss (Llap) to constrain relative 3D positions of neighboring points of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss (Ledge) to constrain any possible outlier points of the one (207) of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss (Lcol) to inhibit penetration of a surface of an oral cavity of the one (207) of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
14. The method (1) of claim 13, the collision loss (Lcol) comprising a sum of distances of each colliding point (qi) to a plurality of spheres (k) having a radius (r) and being centered at mouth landmark vertices (301) of the tongue-and-mouth 3D landmark vertex set (3, lt) defined in the one (207) of the synthetic tongue-and-head 3D meshes.
15. A computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method (1) of any of the claims 1 to 14.
16. A device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method (1) of any of the claims 1 to 14.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/081148 WO2022096105A1 (en) 2020-11-05 2020-11-05 3d tongue reconstruction from single images


Publications (1)

Publication Number Publication Date
WO2022096105A1 true WO2022096105A1 (en) 2022-05-12

Family

ID=73172707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/081148 Ceased WO2022096105A1 (en) 2020-11-05 2020-11-05 3d tongue reconstruction from single images

Country Status (1)

Country Link
WO (1) WO2022096105A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028961A1 (en) * 2015-08-14 2017-02-23 Thomson Licensing 3d reconstruction of a human ear from a point cloud
CN107256575A (en) * 2017-04-07 2017-10-17 天津市天中依脉科技开发有限公司 A kind of three-dimensional tongue based on binocular stereo vision is as method for reconstructing
WO2020053551A1 (en) * 2018-09-12 2020-03-19 Sony Interactive Entertainment Inc. Method and system for generating a 3d reconstruction of a human
EP3726467A1 (en) * 2019-04-18 2020-10-21 Zebra Medical Vision Ltd. Systems and methods for reconstruction of 3d anatomical images from 2d anatomical images

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HEWER ALEXANDER ET AL: "A multilinear tongue model derived from speech related MRI data of the human vocal tract", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 51, 21 February 2018 (2018-02-21), pages 68 - 92, XP085398906, ISSN: 0885-2308, DOI: 10.1016/J.CSL.2018.02.001 *
JUN YU ET AL: "A realistic and reliable 3D pronunciation visualization instruction system for computer-assisted language learning", 2016 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), IEEE, 15 December 2016 (2016-12-15), pages 786 - 789, XP033046448, DOI: 10.1109/BIBM.2016.7822623 *
YU JUN: "A Real-Time Music VR System for 3D External and Internal Articulators", 2019 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES (VR), IEEE, 23 March 2019 (2019-03-23), pages 1259 - 1260, XP033597801, DOI: 10.1109/VR.2019.8798288 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210248812A1 (en) * 2021-03-05 2021-08-12 University Of Electronic Science And Technology Of China Method for reconstructing a 3d object based on dynamic graph network
US11715258B2 (en) * 2021-03-05 2023-08-01 University Of Electronic Science And Technology Of China Method for reconstructing a 3D object based on dynamic graph network
CN117649494A (en) * 2024-01-29 2024-03-05 南京信息工程大学 Reconstruction method and system of three-dimensional tongue body based on point cloud pixel matching
CN117649494B (en) * 2024-01-29 2024-04-19 南京信息工程大学 A three-dimensional tongue reconstruction method and system based on point cloud pixel matching
CN120726059A (en) * 2025-09-03 2025-09-30 浙江中医药大学 A method and system for traditional Chinese medicine tongue image recognition based on image processing

Similar Documents

Publication Publication Date Title
US12347010B2 (en) Single image-based real-time body animation
Ye et al. Audio-driven talking face video generation with dynamic convolution kernels
Ichim et al. Dynamic 3D avatar creation from hand-held video input
CN114973349B (en) Facial image processing method and facial image processing model training method
CN115004236A (en) Photo-level realistic talking face from audio
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
Bermano et al. Facial performance enhancement using dynamic shape space analysis
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
Yu et al. A video, text, and speech-driven realistic 3-D virtual head for human–machine interface
CN114202615B (en) Facial expression reconstruction method, device, equipment and storage medium
CN118553001B (en) Texture-controllable three-dimensional fine face reconstruction method and device based on sketch input
US20230141392A1 (en) Systems and methods for human pose and shape recovery
WO2022096105A1 (en) 3d tongue reconstruction from single images
Dundar et al. Fine detailed texture learning for 3d meshes with generative models
CN117635897A (en) Three-dimensional object posture complement method, device, equipment, storage medium and product
CN114863013A (en) A method for reconstructing 3D model of target object
Gan et al. Fine-grained multi-view hand reconstruction using inverse rendering
Lifkooee et al. Real-time avatar pose transfer and motion generation using locally encoded laplacian offsets
CN114998690B (en) A text-controlled 3D face generation method based on StyleCLIP and 3DDFA
Park et al. Df-3dface: One-to-many speech synchronized 3d face animation with diffusion
Ekmen et al. From 2D to 3D real-time expression transfer for facial animation
Song et al. Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation.
Quan et al. Facial animation using CycleGAN
Gan et al. Xhand: Real-time expressive hand avatar
Wang et al. OT-Talk: Animating 3D Talking Head with Optimal Transportation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20803514

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20803514

Country of ref document: EP

Kind code of ref document: A1