TWI712032B - Voice conversion method for virtual face image - Google Patents
Voice conversion method for virtual face image
- Publication number
- TWI712032B (Application TW108100400A)
- Authority
- TW
- Taiwan
- Prior art keywords
- voice
- expression
- segment
- facial
- target
- Prior art date
Landscapes
- Processing Or Creating Images (AREA)
Abstract
A method for converting voice into virtual facial images is executed by a computer system and includes the following steps: (A) using a phoneme conversion model that converts speech segments into phonemes, converting a current speech segment of a subject, continuously captured by a voice capture unit, into a target phoneme; (B) obtaining, according to the target phoneme, a target mouth-shape image corresponding to the target phoneme from stored frames of mouth-shape images that correspond to the different mouth shapes a digital character makes when uttering different phonemes, each mouth-shape image corresponding to one of the phonemes; and (C) obtaining, according to the target mouth-shape image, at least one virtual facial image related to the digital character.
Description
The present invention relates to a conversion method, and in particular to a method for converting voice into virtual facial images.

Industries such as animation, games, and film often create virtual characters. One key technology in creating a virtual character is giving it natural, fluent mouth movements synchronized with its voice. To achieve this synchronization, designers must adjust mouth-shape configurations along a timeline according to the audio content, which is very time-consuming and labor-intensive. Moreover, animation produced in advance cannot adapt to the atmosphere of a live event, so many vendors are now developing technology for real-time, voice-driven virtual character speech.

However, most existing real-time voice-driven techniques first convert the voice into text and then convert the text into the virtual character's mouth-shape images. The intermediate conversion to text takes too long to achieve a real-time effect.

Therefore, the object of the present invention is to provide a method that can convert voice into virtual facial images in real time.
Accordingly, the method of the present invention for converting voice into virtual facial images is executed by a computer system that stores multiple frames of mouth-shape images corresponding to the different mouth shapes a digital character makes when uttering different phonemes, each mouth-shape image corresponding to one of the phonemes. The computer system includes a voice capture unit for continuously capturing a current speech segment of a subject. The method includes a step (A), a step (B), and a step (C).

In step (A), the computer system converts the current speech segment into a target phoneme using a phoneme conversion model for converting speech segments into phonemes.

In step (B), the computer system obtains, according to the target phoneme, a target mouth-shape image corresponding to the target phoneme from the mouth-shape images.

In step (C), the computer system obtains at least one virtual facial image related to the digital character according to the target mouth-shape image.

The effect of the present invention is that, after the computer system captures the current speech segment, it uses the phoneme conversion model to quickly obtain the target mouth-shape image corresponding to the target phoneme, and then obtains the at least one virtual facial image in real time from that mouth-shape image, achieving fast conversion.
Before the present invention is described in detail, it should be noted that in the following description, similar elements are denoted by the same reference numerals.

Referring to FIG. 1, an embodiment of the method of the present invention is executed by a computer system 1 that includes a storage unit 11, a voice capture unit 12, and a processing unit 13 electrically connected to the storage unit 11 and the voice capture unit 12.

The storage unit 11 stores multiple frames of mouth-shape images corresponding to the different mouth shapes the digital character makes when uttering different phonemes, multiple conversations, multiple audiovisual data records, and multiple frames of expression images of the digital character's different facial expressions. Notably, in this embodiment the phonemes are, for example, OO, IY, EE, AA, WW, LL, ER, UU, FV, MM, CH, and DD, and each mouth-shape image corresponds to one of these phonemes (as shown in FIG. 2), but the invention is not limited thereto. Each conversation includes multiple conversation segments; each audiovisual record includes multiple frames of facial images of a trainer speaking and multiple audio segments respectively corresponding to those facial images; each facial expression corresponds to an expression parameter indicating that expression, and each expression image corresponds to the expression parameter of one of the facial expressions.

The voice capture unit 12 continuously captures a current speech segment of a subject, the current speech segment including at least one speech sub-segment. In this embodiment, the voice capture unit 12 is, for example, a microphone, but is not limited thereto.
This embodiment of the method includes a phoneme conversion model creation procedure 2, an expression conversion model creation procedure 3, and a voice-to-virtual-facial-image conversion procedure 4.

Referring to FIGS. 1 and 3, the phoneme conversion model creation procedure 2 includes steps 21 to 23, detailed below.

In step 21, the processing unit 13 extracts features of the phonemes from a pronunciation dictionary.

In step 22, for each conversation, the processing unit 13 generates, according to the phoneme features and the conversation, a phoneme sequence containing the phonemes that make up the conversation, each conversation segment of the conversation corresponding to one of the phonemes in the sequence.

In step 23, the processing unit 13 performs machine learning, for example with a convolutional neural network (CNN), on each conversation segment and its corresponding phoneme to build a phoneme conversion model for converting speech segments into phonemes.
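The patent trains a CNN on (conversation segment, phoneme) pairs; as a lightweight stand-in with the same interface, the sketch below uses a nearest-neighbour classifier over fixed-length feature vectors. The feature representation, the function names, and the classifier choice are all illustrative assumptions, not the patent's implementation.

```python
import math

def train_phoneme_model(segments, labels):
    """Stand-in for the CNN training of step 23.
    segments: list of fixed-length feature vectors; labels: matching phoneme strings."""
    return list(zip(segments, labels))  # the "model" is simply the labelled training set

def predict_phoneme(model, segment):
    """Return the phoneme label of the training segment closest to `segment`."""
    return min(model, key=lambda pair: math.dist(pair[0], segment))[1]

# Toy training set: two feature vectors, two phonemes.
model = train_phoneme_model([[0.0, 1.0], [1.0, 0.0]], ["AA", "MM"])
print(predict_phoneme(model, [0.1, 0.9]))  # closest to the "AA" example
```

In practice the CNN would be trained offline on the stored conversations, and only inference would run in the real-time loop; the nearest-neighbour stand-in is used here only to keep the example self-contained.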
Referring to FIGS. 1 and 4, the expression conversion model creation procedure includes steps 31 to 34, detailed below.

In step 31, for each facial image in the audiovisual data stored in the storage unit 11, the processing unit 13 obtains the eyebrow portion of the trainer's face in that image.

In step 32, for each such facial image, the processing unit 13 obtains an eyebrow feature from the eyebrow portion of that image. Notably, while steps 31 and 32 use the eyebrow portion and eyebrow feature in this embodiment, portions and features of other facial organs may be used in other implementations; the invention is not limited thereto.

In step 33, for each such facial image, the processing unit 13 performs expression recognition on the image according to its eyebrow feature to obtain an expression recognition result for the trainer in that image. Notably, in this embodiment each expression recognition result is one of happy, angry, sad, and neutral; in other implementations each result may include additional expressions, and the invention is not limited thereto.
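Steps 32 and 33 can be sketched as reducing the eyebrow landmarks to a single feature and thresholding it into an expression label. The single-feature rule, the thresholds, and the function names are assumptions made for illustration; the patent does not specify the recognition algorithm.

```python
def eyebrow_feature(landmark_ys, neutral_y=0.0):
    """Step 32 sketch: mean vertical offset of eyebrow landmarks from a neutral baseline."""
    return sum(y - neutral_y for y in landmark_ys) / len(landmark_ys)

def recognize_expression(feature):
    """Step 33 sketch: map the eyebrow feature to one of the embodiment's labels.
    (A real recognizer would also distinguish 'sad'; thresholds here are invented.)"""
    if feature > 0.5:       # raised eyebrows
        return "happy"
    if feature < -0.5:      # lowered / furrowed eyebrows
        return "angry"
    return "neutral"

print(recognize_expression(eyebrow_feature([0.8, 0.9, 1.0])))  # raised eyebrows
```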
In step 34, the processing unit 13 performs machine learning on each facial image's expression recognition result and audio segment to build an expression conversion model for converting speech sub-segments into expression parameters. Notably, in this embodiment the duration of the audio segment corresponding to each facial image equals the duration of each speech sub-segment, i.e., the time taken to play one frame of image.

Referring to FIGS. 1 and 5, the voice-to-virtual-facial-image conversion procedure 4 includes steps 41 to 45, detailed below.

In step 41, the processing unit 13 converts the current speech segment into a target phoneme using the phoneme conversion model.

In step 42, the processing unit 13 obtains, according to the target phoneme, a target mouth-shape image corresponding to the target phoneme from the mouth-shape images.
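Step 42 amounts to a direct lookup keyed by phoneme over the frames stored in the storage unit 11. The placeholder file names and the neutral-mouth fallback for an unrecognised phoneme below are assumptions, not taken from the patent.

```python
# The twelve example phonemes listed in the embodiment.
PHONEMES = ["OO", "IY", "EE", "AA", "WW", "LL", "ER", "UU", "FV", "MM", "CH", "DD"]

# One stored mouth-shape frame per phoneme (file names stand in for pixel data).
MOUTH_SHAPES = {p: f"mouth_{p}.png" for p in PHONEMES}

def target_mouth_image(target_phoneme, default="mouth_MM.png"):
    """Step 42 sketch: fetch the stored frame for the target phoneme,
    falling back to a closed-mouth frame for anything unrecognised."""
    return MOUTH_SHAPES.get(target_phoneme, default)

print(target_mouth_image("AA"))  # mouth_AA.png
```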
In step 43, for each of the at least one speech sub-segment of the current speech segment, the processing unit 13 converts the speech sub-segment into a target expression parameter using the expression conversion model.

In step 44, for each target expression parameter, the processing unit 13 obtains, from the expression images, a target expression image corresponding to that target expression parameter.

It should be particularly noted that in this embodiment the expression conversion model incorporates double exponential smoothing: speech sub-segments at different times carry different weights, with sub-segments closer to the current time weighted more heavily. In step 43, for each speech sub-segment of the current speech segment, the processing unit 13 makes a prediction based on that sub-segment and all preceding speech segments to produce the target expression parameter. Double exponential smoothing makes the target expression image obtained in step 44 more continuous with the target expression image obtained in the previous period. Since the invention is not characterized by double exponential smoothing, which is well known to those skilled in the art, its details are omitted here for brevity.
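The smoothing described above can be sketched as Holt's double exponential smoothing over the stream of expression-parameter values, so that recent sub-segments dominate the prediction. The smoothing constants `alpha` and `beta` are assumed values; the patent does not specify them.

```python
def double_exponential_smooth(values, alpha=0.5, beta=0.5):
    """Return one-step-ahead predictions for a sequence of observed
    expression-parameter values (Holt's linear method)."""
    level, trend = values[0], values[1] - values[0]
    preds = []
    for x in values[1:]:
        preds.append(level + trend)  # forecast made before observing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return preds

# A perfectly linear parameter trajectory is forecast exactly.
print(double_exponential_smooth([1.0, 2.0, 3.0, 4.0]))  # [2.0, 3.0, 4.0]
```

Because the forecast blends the current observation with an extrapolated level and trend, the sequence of target expression parameters (and hence the expression frames chosen in step 44) changes gradually instead of jumping between frames.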
It should further be noted that in this embodiment steps 41 and 42 are performed simultaneously with steps 43 and 44; in other implementations, steps 41 and 42 may be executed before or after steps 43 and 44, and the invention is not limited thereto.

In step 45, which follows steps 42 and 44, the processing unit 13 obtains at least one virtual facial image related to the digital character according to the target mouth-shape image and the at least one target expression image.

It should also be noted that when the current speech segment contains only one speech sub-segment, the processing unit 13 converts that sub-segment into one target expression parameter in step 43, obtains the corresponding target expression image in step 44, and finally, in step 45, obtains one virtual facial image from the target mouth-shape image and the target expression image. When the current speech segment contains multiple speech sub-segments, the processing unit 13 converts them into multiple target expression parameters in step 43, obtains the corresponding target expression images in step 44, and finally, in step 45, obtains multiple frames of virtual facial images from the target mouth-shape image and the target expression images, meaning those virtual facial images all share the same mouth shape.
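The branching described above collapses into one rule: pair the single target mouth-shape image with each target expression image in turn, so every output frame shares the same mouth shape. The `compose` helper below is a hypothetical stand-in for actual image compositing.

```python
def compose(mouth_img, expr_img):
    """Hypothetical stand-in for compositing a mouth-shape frame with an
    expression frame into one virtual facial image."""
    return (mouth_img, expr_img)

def virtual_face_frames(target_mouth, target_expression_images):
    """Step 45 sketch: one output frame per expression image; a single
    expression image therefore yields a single virtual facial image."""
    return [compose(target_mouth, e) for e in target_expression_images]

frames = virtual_face_frames("mouth_AA.png", ["expr_happy.png", "expr_neutral.png"])
print(frames)  # both frames carry the same mouth shape
```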
In summary, in the method of the present invention, the processing unit 13 performs machine learning on the conversations and phoneme sequences to build the phoneme conversion model, and performs machine learning on the facial images' expression recognition results and audio segments to build the expression conversion model. After the voice capture unit 12 captures the current speech segment, the processing unit 13 uses the phoneme conversion model and the expression conversion model to quickly obtain the target mouth-shape image corresponding to the target phoneme and the at least one target expression image, and then obtains the at least one virtual facial image in real time from them, achieving fast conversion. Moreover, the phoneme conversion model is not limited to any particular language and can support mouth-shape conversion for any language, so the object of the invention is indeed achieved.

The foregoing, however, are merely embodiments of the present invention and shall not limit the scope of its implementation; all simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by this patent.
1: computer system
11: storage unit
12: voice capture unit
13: processing unit
21~23: steps
31~34: steps
41~45: steps
Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram of a computer system for implementing the method of the present invention; FIG. 2 is a schematic diagram of the mouth-shape image corresponding to each phoneme in the embodiment of the method; FIG. 3 is a flowchart of the phoneme conversion model creation procedure of an embodiment of the method; FIG. 4 is a flowchart of the expression conversion model creation procedure of the embodiment; and FIG. 5 is a flowchart of the voice-to-virtual-facial-image conversion procedure of the embodiment.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW108100400A TWI712032B (en) | 2019-01-04 | 2019-01-04 | Voice conversion method for virtual face image |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202027063A TW202027063A (en) | 2020-07-16 |
| TWI712032B true TWI712032B (en) | 2020-12-01 |
Family
ID=73005058
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW108100400A TWI712032B (en) | 2019-01-04 | 2019-01-04 | Voice conversion method for virtual face image |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI712032B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6232965B1 (en) * | 1994-11-30 | 2001-05-15 | California Institute Of Technology | Method and apparatus for synthesizing realistic animations of a human speaking using a computer |
| US20140031086A1 (en) * | 2012-06-05 | 2014-01-30 | Tae Ho Yoo | System of servicing famous people's characters in smart phone and operation method thereof |
| KR20160121825A (en) * | 2015-04-13 | 2016-10-21 | 주식회사 카이스인포 | artificial intelligence base hologram deceased platform construction method |
| CN106446406A (en) * | 2016-09-23 | 2017-02-22 | 天津大学 | A simulation system and simulation method for converting Chinese sentences into human mouth shapes |
Also Published As
| Publication number | Publication date |
|---|---|
| TW202027063A (en) | 2020-07-16 |