TWI712032B - Voice conversion method for virtual face image - Google Patents
Voice conversion method for virtual face image
- Publication number
- TWI712032B (Application TW108100400A)
- Authority
- TW
- Taiwan
- Prior art keywords
- voice
- expression
- segment
- facial
- target
- Prior art date
Landscapes
- Processing Or Creating Images (AREA)
Abstract
A method for converting voice into virtual facial images is executed by a computer system and includes the following steps: (A) using a phoneme conversion model that converts speech segments into phonemes, converting a current speech segment of a subject, continuously captured by a voice capture unit, into a target phoneme; (B) obtaining, according to the target phoneme, a target mouth-shape image corresponding to the target phoneme from stored frames of mouth-shape images that correspond to the different mouth shapes a digital character makes when uttering different phonemes, each mouth-shape image corresponding to one of the phonemes; and (C) obtaining, according to the target mouth-shape image, at least one virtual facial image related to the digital character.
Description
The present invention relates to a conversion method, and in particular to a method for converting voice into virtual facial images.

Industries such as animation, games, and film often create virtual characters. One key technology in creating a virtual character is giving it natural, fluent mouth movements synchronized with its voice. To achieve this synchronization, designers must adjust mouth-shape configurations along a timeline according to the audio content, which is very time-consuming and labor-intensive. Moreover, animation produced in advance cannot adapt to the atmosphere of a live event, so many vendors are now developing technology for real-time, voice-driven virtual character speech.

However, most existing real-time voice-driven techniques first convert the voice into text and then convert the text into the virtual character's mouth-shape images. The intermediate conversion to text takes too long to achieve a real-time effect.

Therefore, the object of the present invention is to provide a method that can convert voice into virtual facial images in real time.
Accordingly, the method of the present invention for converting voice into virtual facial images is executed by a computer system that stores multiple frames of mouth-shape images corresponding to the different mouth shapes a digital character makes when uttering different phonemes, each mouth-shape image corresponding to one of the phonemes. The computer system includes a voice capture unit for continuously capturing a current speech segment of a subject. The method includes a step (A), a step (B), and a step (C).

In step (A), the computer system converts the current speech segment into a target phoneme using a phoneme conversion model for converting speech segments into phonemes.

In step (B), the computer system obtains, according to the target phoneme, a target mouth-shape image corresponding to the target phoneme from the mouth-shape images.

In step (C), the computer system obtains at least one virtual facial image related to the digital character according to the target mouth-shape image.

The effect of the present invention is that, after the computer system captures the current speech segment, it uses the phoneme conversion model to quickly obtain the target mouth-shape image corresponding to the target phoneme, and then obtains the at least one virtual facial image in real time from that mouth-shape image, achieving fast conversion.
Before the present invention is described in detail, it should be noted that in the following description, similar elements are denoted by the same reference numerals.

Referring to FIG. 1, an embodiment of the method of the present invention is executed by a computer system 1 that includes a storage unit 11, a voice capture unit 12, and a processing unit 13 electrically connected to the storage unit 11 and the voice capture unit 12.

The storage unit 11 stores multiple frames of mouth-shape images corresponding to the different mouth shapes the digital character makes when uttering different phonemes, multiple conversations, multiple audiovisual data records, and multiple frames of expression images of the digital character's different facial expressions. Notably, in this embodiment the phonemes are, for example, OO, IY, EE, AA, WW, LL, ER, UU, FV, MM, CH, and DD, and each mouth-shape image corresponds to one of these phonemes (as shown in FIG. 2), but the invention is not limited thereto. Each conversation includes multiple conversation segments; each audiovisual record includes multiple frames of facial images of a trainer speaking and multiple audio segments respectively corresponding to those facial images; each facial expression corresponds to an expression parameter indicating that expression, and each expression image corresponds to the expression parameter of one of the facial expressions.

The voice capture unit 12 continuously captures a current speech segment of a subject, the current speech segment including at least one speech sub-segment. In this embodiment, the voice capture unit 12 is, for example, a microphone, but is not limited thereto.
This embodiment of the method includes a phoneme conversion model creation procedure 2, an expression conversion model creation procedure 3, and a voice-to-virtual-facial-image conversion procedure 4.

Referring to FIGS. 1 and 3, the phoneme conversion model creation procedure 2 includes steps 21 to 23, detailed below.

In step 21, the processing unit 13 extracts features of the phonemes from a pronunciation dictionary.

In step 22, for each conversation, the processing unit 13 generates, according to the phoneme features and the conversation, a phoneme sequence containing the phonemes that make up the conversation, each conversation segment of the conversation corresponding to one of the phonemes in the sequence.

In step 23, the processing unit 13 performs machine learning, for example with a convolutional neural network (CNN), on each conversation segment and its corresponding phoneme to build a phoneme conversion model for converting speech segments into phonemes.
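The patent trains a CNN on (conversation segment, phoneme) pairs; as a lightweight stand-in with the same interface, the sketch below uses a nearest-neighbour classifier over fixed-length feature vectors. The feature representation, the function names, and the classifier choice are all illustrative assumptions, not the patent's implementation.

```python
import math

def train_phoneme_model(segments, labels):
    """Stand-in for the CNN training of step 23.
    segments: list of fixed-length feature vectors; labels: matching phoneme strings."""
    return list(zip(segments, labels))  # the "model" is simply the labelled training set

def predict_phoneme(model, segment):
    """Return the phoneme label of the training segment closest to `segment`."""
    return min(model, key=lambda pair: math.dist(pair[0], segment))[1]

# Toy training set: two feature vectors, two phonemes.
model = train_phoneme_model([[0.0, 1.0], [1.0, 0.0]], ["AA", "MM"])
print(predict_phoneme(model, [0.1, 0.9]))  # closest to the "AA" example
```

In practice the CNN would be trained offline on the stored conversations, and only inference would run in the real-time loop; the nearest-neighbour stand-in is used here only to keep the example self-contained.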
Referring to FIGS. 1 and 4, the expression conversion model creation procedure includes steps 31 to 34, detailed below.

In step 31, for each facial image in the audiovisual data stored in the storage unit 11, the processing unit 13 obtains the eyebrow portion of the trainer's face in that image.

In step 32, for each such facial image, the processing unit 13 obtains an eyebrow feature from the eyebrow portion of that image. Notably, while steps 31 and 32 use the eyebrow portion and eyebrow feature in this embodiment, portions and features of other facial organs may be used in other implementations; the invention is not limited thereto.

In step 33, for each such facial image, the processing unit 13 performs expression recognition on the image according to its eyebrow feature to obtain an expression recognition result for the trainer in that image. Notably, in this embodiment each expression recognition result is one of happy, angry, sad, and neutral; in other implementations each result may include additional expressions, and the invention is not limited thereto.
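Steps 32 and 33 can be sketched as reducing the eyebrow landmarks to a single feature and thresholding it into an expression label. The single-feature rule, the thresholds, and the function names are assumptions made for illustration; the patent does not specify the recognition algorithm.

```python
def eyebrow_feature(landmark_ys, neutral_y=0.0):
    """Step 32 sketch: mean vertical offset of eyebrow landmarks from a neutral baseline."""
    return sum(y - neutral_y for y in landmark_ys) / len(landmark_ys)

def recognize_expression(feature):
    """Step 33 sketch: map the eyebrow feature to one of the embodiment's labels.
    (A real recognizer would also distinguish 'sad'; thresholds here are invented.)"""
    if feature > 0.5:       # raised eyebrows
        return "happy"
    if feature < -0.5:      # lowered / furrowed eyebrows
        return "angry"
    return "neutral"

print(recognize_expression(eyebrow_feature([0.8, 0.9, 1.0])))  # raised eyebrows
```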
In step 34, the processing unit 13 performs machine learning on each facial image's expression recognition result and audio segment to build an expression conversion model for converting speech sub-segments into expression parameters. Notably, in this embodiment the duration of the audio segment corresponding to each facial image equals the duration of each speech sub-segment, i.e., the time taken to play one frame of image.

Referring to FIGS. 1 and 5, the voice-to-virtual-facial-image conversion procedure 4 includes steps 41 to 45, detailed below.

In step 41, the processing unit 13 converts the current speech segment into a target phoneme using the phoneme conversion model.

In step 42, the processing unit 13 obtains, according to the target phoneme, a target mouth-shape image corresponding to the target phoneme from the mouth-shape images.
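Step 42 amounts to a direct lookup keyed by phoneme over the frames stored in the storage unit 11. The placeholder file names and the neutral-mouth fallback for an unrecognised phoneme below are assumptions, not taken from the patent.

```python
# The twelve example phonemes listed in the embodiment.
PHONEMES = ["OO", "IY", "EE", "AA", "WW", "LL", "ER", "UU", "FV", "MM", "CH", "DD"]

# One stored mouth-shape frame per phoneme (file names stand in for pixel data).
MOUTH_SHAPES = {p: f"mouth_{p}.png" for p in PHONEMES}

def target_mouth_image(target_phoneme, default="mouth_MM.png"):
    """Step 42 sketch: fetch the stored frame for the target phoneme,
    falling back to a closed-mouth frame for anything unrecognised."""
    return MOUTH_SHAPES.get(target_phoneme, default)

print(target_mouth_image("AA"))  # mouth_AA.png
```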
In step 43, for each of the at least one speech sub-segment of the current speech segment, the processing unit 13 converts the speech sub-segment into a target expression parameter using the expression conversion model.

In step 44, for each target expression parameter, the processing unit 13 obtains, from the expression images, a target expression image corresponding to that target expression parameter.

It should be particularly noted that in this embodiment the expression conversion model incorporates double exponential smoothing: speech sub-segments at different times carry different weights, with sub-segments closer to the current time weighted more heavily. In step 43, for each speech sub-segment of the current speech segment, the processing unit 13 makes a prediction based on that sub-segment and all preceding speech segments to produce the target expression parameter. Double exponential smoothing makes the target expression image obtained in step 44 more continuous with the target expression image obtained in the previous period. Since the invention is not characterized by double exponential smoothing, which is well known to those skilled in the art, its details are omitted here for brevity.
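The smoothing described above can be sketched as Holt's double exponential smoothing over the stream of expression-parameter values, so that recent sub-segments dominate the prediction. The smoothing constants `alpha` and `beta` are assumed values; the patent does not specify them.

```python
def double_exponential_smooth(values, alpha=0.5, beta=0.5):
    """Return one-step-ahead predictions for a sequence of observed
    expression-parameter values (Holt's linear method)."""
    level, trend = values[0], values[1] - values[0]
    preds = []
    for x in values[1:]:
        preds.append(level + trend)  # forecast made before observing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return preds

# A perfectly linear parameter trajectory is forecast exactly.
print(double_exponential_smooth([1.0, 2.0, 3.0, 4.0]))  # [2.0, 3.0, 4.0]
```

Because the forecast blends the current observation with an extrapolated level and trend, the sequence of target expression parameters (and hence the expression frames chosen in step 44) changes gradually instead of jumping between frames.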
It should further be noted that in this embodiment steps 41 and 42 are performed simultaneously with steps 43 and 44; in other implementations, steps 41 and 42 may be executed before or after steps 43 and 44, and the invention is not limited thereto.

In step 45, which follows steps 42 and 44, the processing unit 13 obtains at least one virtual facial image related to the digital character according to the target mouth-shape image and the at least one target expression image.

It should also be noted that when the current speech segment contains only one speech sub-segment, the processing unit 13 converts that sub-segment into one target expression parameter in step 43, obtains the corresponding target expression image in step 44, and finally, in step 45, obtains one virtual facial image from the target mouth-shape image and the target expression image. When the current speech segment contains multiple speech sub-segments, the processing unit 13 converts them into multiple target expression parameters in step 43, obtains the corresponding target expression images in step 44, and finally, in step 45, obtains multiple frames of virtual facial images from the target mouth-shape image and the target expression images, meaning those virtual facial images all share the same mouth shape.
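The branching described above collapses into one rule: pair the single target mouth-shape image with each target expression image in turn, so every output frame shares the same mouth shape. The `compose` helper below is a hypothetical stand-in for actual image compositing.

```python
def compose(mouth_img, expr_img):
    """Hypothetical stand-in for compositing a mouth-shape frame with an
    expression frame into one virtual facial image."""
    return (mouth_img, expr_img)

def virtual_face_frames(target_mouth, target_expression_images):
    """Step 45 sketch: one output frame per expression image; a single
    expression image therefore yields a single virtual facial image."""
    return [compose(target_mouth, e) for e in target_expression_images]

frames = virtual_face_frames("mouth_AA.png", ["expr_happy.png", "expr_neutral.png"])
print(frames)  # both frames carry the same mouth shape
```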
In summary, in the method of the present invention, the processing unit 13 performs machine learning on the conversations and phoneme sequences to build the phoneme conversion model, and performs machine learning on the facial images' expression recognition results and audio segments to build the expression conversion model. After the voice capture unit 12 captures the current speech segment, the processing unit 13 uses the phoneme conversion model and the expression conversion model to quickly obtain the target mouth-shape image corresponding to the target phoneme and the at least one target expression image, and then obtains the at least one virtual facial image in real time from them, achieving fast conversion. Moreover, the phoneme conversion model is not limited to any particular language and can support mouth-shape conversion for any language, so the object of the invention is indeed achieved.

The foregoing, however, are merely embodiments of the present invention and shall not limit the scope of its implementation; all simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by this patent.
1: computer system
11: storage unit
12: voice capture unit
13: processing unit
21~23: steps
31~34: steps
41~45: steps
Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram of a computer system for implementing the method of the present invention; FIG. 2 is a schematic diagram of the mouth-shape image corresponding to each phoneme in the embodiment of the method; FIG. 3 is a flowchart of the phoneme conversion model creation procedure of an embodiment of the method; FIG. 4 is a flowchart of the expression conversion model creation procedure of the embodiment; and FIG. 5 is a flowchart of the voice-to-virtual-facial-image conversion procedure of the embodiment.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW108100400A TWI712032B (en) | 2019-01-04 | 2019-01-04 | Voice conversion method for virtual face image |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202027063A TW202027063A (en) | 2020-07-16 |
| TWI712032B true TWI712032B (en) | 2020-12-01 |
Family
ID=73005058
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW108100400A TWI712032B (en) | 2019-01-04 | 2019-01-04 | Voice conversion method for virtual face image |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI712032B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6232965B1 (en) * | 1994-11-30 | 2001-05-15 | California Institute Of Technology | Method and apparatus for synthesizing realistic animations of a human speaking using a computer |
| US20140031086A1 (en) * | 2012-06-05 | 2014-01-30 | Tae Ho Yoo | System of servicing famous people's characters in smart phone and operation method thereof |
| KR20160121825A (en) * | 2015-04-13 | 2016-10-21 | 주식회사 카이스인포 | artificial intelligence base hologram deceased platform construction method |
| CN106446406A (en) * | 2016-09-23 | 2017-02-22 | 天津大学 | A simulation system and simulation method for converting Chinese sentences into human mouth shapes |
Also Published As
| Publication number | Publication date |
|---|---|
| TW202027063A (en) | 2020-07-16 |