CN112951198B

CN112951198B - Singing voice synthesis

Info

Publication number: CN112951198B
Application number: CN201911156831.7A
Authority: CN
Inventors: 卢佩玲; 栾剑; 吴洁
Original assignee: Microsoft Technology Licensing LLC
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2024-08-06
Anticipated expiration: 2039-11-22
Also published as: WO2021101665A1; CN112951198A

Abstract

The present disclosure provides methods and apparatus for singing voice synthesis. First score phoneme information extracted from a score may be received, the first score phoneme information including a first phoneme and a pitch and a beat of notes corresponding to the first phoneme. A base frequency delta and a spectral parameter corresponding to the first phoneme may be generated based on the first score phoneme information. The fundamental frequency corresponding to the first phoneme may be obtained by adjusting the pitch of the note with the fundamental frequency delta. An acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.

Description

Singing voice synthesis

Background

Singing voice synthesis (SVS: singing Voice Synthesis) is a technique for generating virtual singing voice based on a score including information such as lyrics, tempo, pitch, etc. Singing voice synthesis may include predicting acoustic features based on the score, and in turn generating speech waveforms based on the acoustic features. Singing voice synthesis is intended to automatically generate singing voice simulating real human singing

Disclosure of Invention

This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Embodiments of the present disclosure provide methods and apparatus for singing voice synthesis. First score phoneme information extracted from a score may be received, which may include a first phoneme and a pitch and a beat of a note corresponding to the first phoneme. A base frequency delta and a spectral parameter corresponding to the first phoneme may be generated based on the first score phoneme information. The pitch of the notes may be adjusted using the fundamental frequency delta to obtain a fundamental frequency corresponding to the first phoneme. An acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.

It is noted that one or more aspects above include the features specifically noted and the detailed description that follows. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.

Drawings

The disclosed aspects will be described below in conjunction with the drawings, which are provided to illustrate and not limit the disclosed aspects.

Fig. 1 illustrates an existing exemplary TTS system architecture.

Fig. 2 shows an exemplary process of parsing a score according to an embodiment of the present invention.

FIG. 3 illustrates an exemplary SVS system architecture according to an embodiment of the invention.

Fig. 4 shows an exemplary process of generating a score according to an embodiment of the invention.

Fig. 5 shows an exemplary architecture of a score encoder according to an embodiment of the present invention.

Fig. 6 shows an exemplary architecture of a spectrum decoder according to an embodiment of the invention.

Fig. 7 illustrates an exemplary application scenario of singing voice synthesis according to an embodiment of the present invention.

Fig. 8 illustrates an exemplary process of singing voice synthesis based on a score according to an embodiment of the present invention.

FIG. 9 illustrates an exemplary training process for an acoustic feature predictor according to an embodiment of the present invention.

Fig. 10 shows a flowchart of an exemplary method for singing voice synthesis, according to an embodiment of the invention.

Fig. 11 shows a block diagram of an exemplary apparatus for singing voice synthesis, according to an embodiment of the invention.

Fig. 12 shows a block diagram of an exemplary apparatus for singing voice synthesis, according to an embodiment of the invention.

Detailed Description

The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is merely intended to enable one skilled in the art to better understand and thereby practice the examples of the present disclosure and is not intended to limit the scope of the present disclosure in any way.

The traditional singing voice synthesis technology adopts modes of waveform unit splicing or statistical parameter synthesis and the like to simulate the singing voice of a human being. However, the quality of the synthesized singing voice still has a large gap from that of human recordings. Recently, deep learning-based models such as Deep Neural Networks (DNNs), long Short Term Memory (LSTM), etc. have been introduced into the SVS field. In one approach, waveNet models are proposed for predicting acoustic features, where the duration, pitch, and spectrum models are trained independently. In one approach, a deep autoregressive neural network is proposed for modeling the fundamental frequency F0 and spectral features. In addition, an end-to-end SVS system for countermeasure training based on an autoregressive model is also presented. The autoregressive model has forward dependency. In one approach, a strategy is presented to post-process the predicted fundamental frequency based on note pitch, such that the fundamental frequencies are pitch-matched. However, there is still a need for a singing voice synthesis model that synthesizes singing voices with high naturalness, high processing speed, and good audio quality.

Singing voice synthesis techniques as proposed by embodiments of the present disclosure refer, at least in part, to a text-to-speech (TTS) model such as FASTSPEECH. FASTSPEECH is a transformer architecture based feed forward network for generating mel-spectra in parallel. Embodiments of the present disclosure may modify FASTSPEECH models to adapt to the SVS task of generating singing voice based on a score. Unlike plain text, a score includes information associated with lyrics and notes, where notes have corresponding beats and pitches.

In one aspect, considering that the note tempo may bring a rhythmic sensation and enhance the auditory sensation, and the human brain is more sensitive to the song tempo, in predicting the acoustic features, embodiments of the present disclosure consider the note-level duration alignment in addition to the phoneme-level duration alignment, so that the finally generated singing voice may more conform to the note tempo, thereby enhancing the rhythmic sensation and bringing a smoother auditory sensation.

In one aspect, embodiments of the present disclosure propose a residual connection between the input note pitch and the output fundamental frequency, such that the acoustic feature prediction model only needs to predict deviations or fundamental frequency differences with respect to the note pitch in the score. This not only can overcome the problem that the training data is difficult to cover all pitch ranges, avoid the need to perform data enhancement by pitch shifting the training data, but also can flexibly enhance the fundamental frequency dynamic range such as tremolo to more expressively express emotion.

In one aspect, embodiments of the present disclosure may enable flexible switching between singer timbres and/or singing styles based on settings for different singers' timbres and/or different singing styles. For example, embodiments of the present disclosure may enable synthesizing singing sounds with a designated singer's timbre, synthesizing singing sounds with a designated singing style, and so forth.

Although the following discussion of the embodiments of the present disclosure is directed to singing voice synthesis, it should be understood that the innovative concepts of the present disclosure can be applied in a similar manner to any other scenario in the field of speech synthesis.

Fig. 1 illustrates an existing exemplary TTS system architecture. The TTS system architecture may include a speech synthesizer 100 and a phoneme extraction module 110.

The phoneme extraction module 110 may be used to extract phoneme data from the text 102 one by one and provide the extracted phoneme data to the speech synthesizer 100. Text 102 may be identified from any type of data such as electronic documents, images, symbolic data, and the like. The phoneme data includes the name of the phoneme. Phonemes are the smallest phonetic units that make up a syllable. In general, a syllable can be divided into a plurality of phonemes. For example, in chinese, one chinese character is one syllable, and one chinese character may be divided into, for example, 3 phonemes. For example, the combination of the initials and finals of the syllables of the Chinese character "I" is "wo", and the Chinese character can be decomposed into three phonemes, e.g., "w", "o". "w" is the phone name of the first phone, and so on. The number of phonemes into which a syllable is divided may be referred to as the phoneme granularity. The larger the phoneme granularity, the more different combinations of phonemes that make up a syllable. Continuing with the previous example, in the case of decomposing the Chinese character "I" into three phonemes, the phoneme granularity is 3.

The speech synthesizer 100 is configured to convert the phoneme data from the phoneme extraction module 110 into a speech waveform 104 representing a virtual speech corresponding to the phoneme data. The speech synthesizer 100 includes an acoustic feature predictor 120 and a vocoder 130.

The acoustic feature predictor 120 is configured to predict acoustic feature parameters corresponding to the phoneme data based on the phoneme data. The acoustic characteristic parameters may include mel-spectrum parameters, fundamental frequencies, etc. The acoustic feature predictor 120 includes a phoneme encoder 122, a duration predictor 124, a length adjuster 126, and a spectral decoder 128. The phoneme encoder 122 is configured to encode the phoneme data from the phoneme extraction module 110 into a corresponding phoneme-side vector representation. The duration predictor 124 may predict a phoneme duration associated with the phoneme data based on the phoneme side vector representation. The phoneme duration characterizes the length in time of the spectrum corresponding to the phoneme. The phoneme duration may be in units of time frames of audio, for example. The duration predictor 124 considers phoneme duration in human real voice in prediction to provide more accurate prediction results. The length adjuster 126 expands the phoneme-side vector representation into a spectrum-side vector representation in accordance with the phoneme length predicted by the length predictor 124 to be suitable for the subsequent spectrum prediction process. The spectral side vector representation is provided to a spectral decoder 128. The spectral decoder 128 generates acoustic feature parameters corresponding to the spectral side vector representations based on the received spectral side vector representations.

The vocoder 130 may convert the acoustic feature parameters generated by the spectral decoder 128 into the speech waveform 104. The vocoder 130 may be a vocoder that generates a voice waveform based on mel-spectrum parameters, such as WaveGlow, griffin-Lim, waveNet, or the like.

Although the speech synthesizer 100 shown in fig. 1 is capable of synthesizing virtual speech based on the input text 102, the speech synthesizer 100 cannot be directly used to synthesize virtual singing based on a score because the score generally includes not only lyrics in text form but also various note information.

Fig. 2 illustrates an exemplary process 200 of parsing a score according to an embodiment of the invention. The process of parsing a score is illustrated in fig. 2 with a segment of a chinese song.

The score 210 includes the chinese lyrics "me", "and", "you" and the corresponding notes. From the score it is also known that the speed of the music is specified to be 120 beats per minute. In other words, the duration of each beat is 0.5 seconds. By parsing the score 210, each syllable in the lyrics can be divided into a plurality of phonemes, and the pitch and beat of the note corresponding to each phoneme are obtained.

The lyrics in the score 210 include 3 syllables "me", "and", "you", each of which can be exemplarily divided into 3 phonemes. For example, syllables "me" can be divided into 3 phones "w", "o" and "o". The 1 st phoneme "w" corresponds to the initial of the syllable "me", and the 2 nd and 3 rd phonemes both correspond to the final of the syllable "me". Accordingly, the phonemes included in syllables "I", "AND", "you" form a phoneme sequence [ w, o, o, h, e, e, n, i, i ].

The notes for each phoneme can be determined. For example, if a syllable corresponds to a note m, it may be determined that multiple phones for the syllable all correspond to the note m. Taking the syllable "me" as an example, in the score 210, the syllable "me" corresponds to the note 211, and thus, the 3 phonemes "w", "o", and "o" of the syllable "me" each correspond to the note 211.

Since the pitch of the note 211 is "C4", the phonemes "w", "o" and "o" may each be labeled with a pitch "C4". Alternatively, the pitch may be quantized according to a specific music standard specification. For example, pitch C4 may be quantized to number 60 in accordance with MIDI standard specifications. Thus, the phonemes "w", "o" and "o" may also each be labeled with a pitch "60".

According to the score 210, the tempo of the note 211 is "1" beat, and thus, the phonemes "w", "o", and "o" may each be labeled with a tempo "1" beat. Alternatively, beats may be quantified in time, e.g., 1 beat corresponds to 0.5 seconds. Thus, the phonemes "w", "o" and "o" may each be labeled with a beat "0.5" in seconds. Further, alternatively, the beat may be quantized in frames, for example, 1 beat corresponds to 33 frames. Thus, the phonemes "w", "o" and "o" may each be labeled with a beat "33" in frames.

In the process of parsing the score of fig. 2, values of pitch and beat are copied with reference to the number of phonemes included in syllables. For example, syllables "me" include 3 phones, and the corresponding pitches "C4", "60" and beats "1", "0.5", "33" are each duplicated in triplicate and associated with 3 phones, respectively. Herein, the score phoneme information may include various information associated with phonemes in the score, such as a phoneme name, a pitch, a beat, and the like. The phoneme name is a phoneme indicator. For example, "w" is the phone name of the first phone of syllable "me". The pitch refers to the pitch of notes corresponding to the phonemes. The tempo refers to the tempo of a note corresponding to a phoneme. It should be appreciated that the above description of the score phoneme information is for exemplary purposes only, and that in fact the score phoneme information may include any information indicative of phoneme name, pitch and beat.

The manner in which the process 200 parses the score is merely exemplary, and embodiments of the present disclosure may employ any other parsing manner to parse the score.

It should be appreciated that although a chinese song is illustrated in fig. 2, process 200 may be applicable to any other language. Different languages may have different base phonemes. Taking english as an example, 48 english international phonetic symbols may be used as the basic phonemes. For the exemplary english word "you," which is a syllable with a phonetic symbol of "ju:" the syllable can be divided into, for example, 3 phones "j", "u:" and "u:". The 1 st phoneme "j" corresponds to the consonant of the syllable, and the 2 nd and 3 rd phonemes "u:" both correspond to the vowels of the syllable. After syllables in lyrics of other languages are divided into phonemes, information such as pitch, tempo, etc. of notes may be labeled for each phoneme similar to process 200. It should be understood that any process of embodiments of the present disclosure is not limited by a particular language class.

FIG. 3 illustrates an exemplary SVS system architecture according to an embodiment of the invention. The SVS system architecture may include a singing voice synthesizer 300, a score parser 310, and the like. It should be appreciated that although the score resolver 310 is shown in fig. 3 as being separate from the singing voice synthesizer 300, the score resolver 310 may alternatively be included in the singing voice synthesizer 300 as part of the singing voice synthesizer 300.

The score parser 310 may extract score phoneme information from the score 302 phoneme by phoneme in the manner as described in fig. 2 and provide the extracted score phoneme information to the singing voice synthesizer 300.

The singing voice synthesizer 300 is used for predicting an acoustic waveform 304 of a virtual singing voice corresponding to the score phoneme information based on the score phoneme information from the score parser 310. The singing voice synthesizer 300 may include an acoustic feature predictor 320, a pitch adjuster 330, a vocoder 340, and the like.

The SVS system architecture may include a timbre encoder 350 configured to provide a timbre vector representation based on the timbre ID. Timbre is a sound attribute that depends on the overtones of human voices. The tone color ID may be an indicator, e.g., index, artist name, etc., that indicates the inherent tone color of a particular artist. A timbre ID may uniquely correspond to the timbre of a singer. In one embodiment, when a timbre ID is received, timbre encoder 350 may generate a timbre vector representation characterizing the vocal characteristics of the singer based on the audio data of the singer to which the timbre ID corresponds. In one embodiment, a timbre vector representation corresponding to each timbre ID may be pre-generated and stored in a timbre database associated with timbre encoder 350. Upon receiving the timbre ID, timbre encoder 350 may retrieve the timbre vector representation corresponding to the timbre ID in a timbre database. It should be appreciated that the timbre ID input to timbre encoder 350 may be provided by the user. For example, when a user wants to obtain a song sung with the tone of a particular singer, the user may provide the tone ID of the particular singer to the SVS system architecture.

The SVS system architecture may include a style encoder 360 configured to provide a style vector representation based on the singing style ID. The style of singing may indicate how the singer sings a song, such as the manner of pronunciation, the skill of pronunciation, etc. The singing style may be associated with the duration of the phonemes and/or the fundamental frequency corresponding to the phonemes. For example, different singers may have different pronunciation duration habits of initials or finals while singing, resulting in different pronunciation patterns. Furthermore, if, for example, the singer uses pronunciation skills such as tremolo, the fundamental frequency will also reflect the corresponding characteristics. The singing style ID may be an indicator, e.g., index, singing style name, etc., indicating a particular singing style. In some cases, the singing style may refer to the type of song, e.g., rock song, ballad, etc. In some cases, a singing style may refer to a style of singing for a particular singer. In one embodiment, upon receiving the singing style ID, the style encoder 360 may generate a style vector representation for characterizing the singing style based on audio data of the singing style to which the singing style ID corresponds. In one embodiment, a style vector representation corresponding to each singing style ID may be pre-generated and stored in a style database associated with the style encoder 360. Upon receiving the singing style ID, the style encoder 360 may retrieve a style vector representation corresponding to the singing style ID in the style database. It should be appreciated that the singing style ID input to the style encoder 360 may be provided by the user. For example, when a user wants to obtain a song that takes a particular style of singing, the user may provide the SVS system architecture with the style ID for that particular style of singing.

Although the timbre encoder 350 and the style encoder 360 are shown in fig. 3 as being located outside of the singing voice synthesizer 300, this is for clarity and illustration purposes only. It should be appreciated that the timbre encoder 350 and/or the style encoder 360 may also be embedded within the singing voice synthesizer 300 or within the acoustic feature predictor 320. Alternatively, in one embodiment, timbre encoder 350 and style encoder 360 may output a fixed timbre vector representation and a fixed style vector representation, independent of the input of timbre ID and style ID. Thus, the generated singing voice will take the timbre corresponding to the fixed timbre vector representation, and the style of singing corresponding to the fixed style vector representation. Further, optionally, either or both of tone encoder 350 and style encoder 360 may also be omitted in the SVS system architecture of fig. 3.

The acoustic feature predictor 320 is configured to predict acoustic feature parameters corresponding to the score phoneme information based on the score phoneme information, possibly a timbre vector representation from the timbre encoder 350, and possibly a style vector representation from the style encoder 360. The acoustic characteristic parameters may include spectral parameters, fundamental dispersion, and the like. The spectral parameters may be mel-generalized cepstrum (MGC) parameters, band Aperiodic (BAP) parameters, mel-spectrum parameters, etc. The acoustic feature predictor 320 may include a score encoder 322, a vector combination module 324, a duration predictor 326, a length adjuster 328, a spectrum decoder 329, and the like.

The score encoder 322 is configured to encode the score phoneme information from the score parser 310 into corresponding phoneme-side vector representations. The score encoder 322 may generate the phoneme-side vector representation in a non-autoregressive manner. In one embodiment, the score encoder 322 may include a feed-forward neural network structure based on a self-attention mechanism in the transformer and one-dimensional (1D) convolution. The phoneme-side vector representation may be a hidden state related to the score phoneme information generated by the feedforward neural network structure. The score phoneme information may be provided to the score encoder 322 on a phoneme-by-phoneme basis so that the score encoder 322 may generate the phoneme-side vector representation on a phoneme-by-phoneme basis. Taking the score shown in fig. 2 as an example, for the first phoneme "w" of syllables "me", the score encoder 322 may generate a phoneme-side vector representation corresponding to the phoneme "w" based on the phoneme information of the phoneme "w". Similarly, the score encoder 322 may in turn generate a phoneme side vector representation of the second phoneme "o" for syllables "me", a phoneme side vector representation of the third phoneme "o" for syllables "me", and so on. Assuming that the set of all phonemes is represented as a sequence of phonemes, a sequence of phoneme-side vector representations, also referred to as a sequence of hidden states, corresponding to the sequence of phonemes may be obtained by the score encoder 322. Taking the syllable "me" as an example, three phones "w", "o" of the syllable form a phone sequence [ w, o, o ], and the score encoder 322 may generate a hidden state sequence H _pho＝[h₁,h₂,h₃ corresponding to the phone sequence based on the phone information of the three phones, respectively, where H ₁ is a hidden state corresponding to the first phone "w", H ₂ is a hidden state corresponding to the second phone "o", and H ₃ is a hidden state corresponding to the third phone "o". An exemplary architecture of the score encoder 322 will be described in detail later in connection with fig. 5.

As previously described, the acoustic feature predictor 320 will ultimately predict spectral parameters corresponding to each phoneme, and a plurality of spectral parameters corresponding to a plurality of phonemes will form a spectral sequence. Since each phoneme generally corresponds to a plurality of time frames and thus to a plurality of segments of spectrum, the length of the spectrum sequence will be longer than the length of its corresponding phoneme sequence. The length adjuster 328 may upsample the phoneme sequence according to the phoneme length predicted by the length predictor 326 to match the length of the spectral sequence.

Because the singing style of a song is related to the phoneme duration, when the acoustic feature predictor 320 is able to generate acoustic feature parameters from the specified singing style ID, the vector combination module 324 may be used to combine the style vector representation output by the style encoder 360 with the phoneme-side vector representation from the score encoder 322 to obtain a phoneme-side combined vector representation. In one embodiment, the combining operation may refer to vector stitching the phoneme side representation and the style vector representation such that the dimension of the resulting phoneme side combined vector representation will be the sum of the dimensions of the phoneme side vector representation and the dimensions of the style vector representation. In one embodiment, the combining operation may refer to vector summing the phoneme lateral representation and the style vector representation. In this case, the phoneme-side vector representation, the style vector representation, and the phoneme-side combination vector representation will all have the same dimensions. The phoneme-side combination vector representation containing the singing style information may be input to a duration predictor 326.

The duration predictor 326 may predict a phoneme duration associated with a phoneme based on the phoneme-side combined vector representation for the phoneme. The set of multiple phone-side combined vector representations provided to the duration predictor 326 may be represented as a sequence of hidden states, e.g., H _pho＝[h₁,h₂,h₃. The duration predictor 326 may predict the corresponding sequence of phoneme durations d= [ D ₁,d₂,d₃ ], where D ₁、d₂、d₃ represents the predicted phoneme durations corresponding to the hidden state h ₁、h₂、h₃, respectively. For example, when the phoneme duration predicted by the time length predictor 326 for the hidden state h ₁ is 3 frames, the value of d ₁ is 3. Unlike the duration predictor 124 of fig. 1, the duration predictor 326 considers not only the phoneme duration in the real singing voice of the human but also the standard beat of notes associated with the phonemes in the score when predicting, so that the prediction result of the duration predictor 326 can help to realize a virtual singing voice with more rhythmic sensation and smooth hearing. Meanwhile, since the duration predictor 326 predicts based on the phoneme-side combination vector and the phoneme-side combination vector includes information about the singing style, it can be considered that the prediction of the phoneme duration by the duration predictor 326 is based at least on the singing style. An exemplary training process of duration predictor 326 will be described in detail below in conjunction with FIG. 9.

The length adjuster 328 is configured to adjust or spread the phoneme-side combined vector representation of a phoneme into a spectrum-side vector representation in accordance with the phoneme length predicted by the length predictor 326 for that phoneme to be suitable for subsequent spectrum prediction processing. For ease of illustration, in accordance with the above example, the set of multiple phone-side combination vector representations received by the length adjuster 328 from the score encoder 322 is represented as a hidden state sequence H _pho＝[h₁,h₂,h₃, and the set of multiple phone-side combination vector representations predicted by the duration predictor 326 based on these phone-side combination vector representations is represented as a predicted phone duration sequence d= [ D ₁,d₂,d₃ ]. The length adjuster 326 may expand the hidden state h _i of phoneme i by d _i. The spectrum side vector representation sequence H _spec obtained by the length adjuster 326 can be calculated as:

H _spec＝LR(H_pho, D, a) formula (1)

Where LR denotes the processing of the length adjuster, a is a super parameter that determines the length of the extended H _spec sequence and thus enables to control the singing voice speed. Given H _pho＝[h₁,h₂,h₃ and the corresponding predicted phoneme length sequence d= [2, 3], H _spec＝[h₁,h₁,h₂,h₂,h₃,h₃,h₃ ] in the case of a=1. In the case of the parameter a=1.3, i.e., in the fast case, the phoneme duration sequence D is updated to D _a＝1.3 = [2.6,2.6,3.9] ≡3,3,4], then H_spec＝[h₁,h₁,h₁,h₂,h₂,h₂,h₃,h₃,h₃,h₃]. in the case of the parameter a=0.5, i.e., in the slow case, the phoneme duration sequence D is updated to D _a＝0.5 = [1,1,1.5] ≡1,1,2], then H _spec＝[h₁,h₂,h₃,h₃ ].

The spectral decoder 329 receives the spectral side vector representation corresponding to one phoneme from the length adjuster 328 and generates corresponding acoustic feature parameters, e.g., spectral parameters, base frequency differences, etc., based at least on the spectral side vector representation. Optionally, the process of spectral decoder 329 generating acoustic feature parameters may also be further based on a possible timbre vector representation from timbre encoder 350. Because the timbre is associated with spectral parameters, the generation of spectral parameters may be further based on the timbre of the target singer characterized by the timbre vector representation. In another aspect, since the singing style is associated with the fundamental frequency and thus the fundamental frequency delta output by the spectral decoder 329, and the singing style information is contained in the spectral side vector representation, the generation of the fundamental frequency delta may be considered to be further based on the singing style characterized by the style vector representation. Furthermore, since the spectral side vector representation containing the singing style information is also used by the spectral decoder 329 to generate spectral parameters, the generation of spectral parameters can be considered to be further based on the singing style characterized by the style vector representation. The spectral decoder 329 may generate the acoustic feature parameters in a non-autoregressive manner. In one embodiment, the spectral decoder 329 may comprise a feed-forward neural network structure based on a self-attention mechanism and one-dimensional convolution in the transformer.

The frequency coverage of the human singing voice is far higher than that of the normal voice, for example, the frequency coverage of the human singing voice can reach 80Hz-3400Hz. Furthermore, a human singer may employ singing skills accompanied by dynamics when singing songs, which further results in frequency variations. Thus, the frequency coverage of different songs has a large variability, resulting in that training data for conventional acoustic feature predictors hardly covers all pitch ranges completely. Furthermore, the shift in pitch of notes in the singing voice relative to a standard pitch, such as running pitch, can greatly affect the auditory sensation of the singing voice. Accordingly, embodiments of the present disclosure introduce a residual between the pitch in the input score phoneme information and the fundamental frequency of the output in the singing voice synthesizer 300.

The spectral decoder 329 in the acoustic feature predictor 320 may be trained to predict the base frequency delta. The fundamental frequency delta indicates the deviation between: a standard fundamental frequency corresponding to a standard pitch of a current phoneme parsed from the score 302 by the score parser 310; and a fundamental frequency corresponding to the phoneme to be employed in the synthesized singing voice. Since the acoustic feature predictor 320 only needs to predict the fundamental frequency delta, not the fundamental frequency itself, the acoustic feature predictor 320 does not need training data to cover all pitch ranges. In one embodiment, the base frequency difference may be set to be not higher than the semitone, so that the problem of running of the synthesized singing voice can be avoided. An exemplary system architecture of spectrum decoder 329 will be described in detail later in connection with fig. 6, and an exemplary training process of spectrum decoder 329 will be described in detail in connection with fig. 9.

The pitch adjuster 330 is configured to adjust the standard pitch of the current phoneme from the score parser 310 using the base frequency delta output by the spectrum decoder 329 for the current phoneme to generate a base frequency to be employed for the current phoneme in the synthesized singing voice. In one embodiment, the pitch adjuster 330 may be an adder that adds the standard fundamental frequency corresponding to the standard pitch of the current phoneme to the fundamental frequency delta from the spectral decoder 328 to generate the fundamental frequency to be employed.

The vocoder 340 may generate the corresponding acoustic waveform 304 based on the fundamental frequency from the pitch adjuster 330 and the spectral parameters generated by the spectral decoder 328. The vocoder 340 may be any type of vocoder, such as a vocoder that generates acoustic waveforms based on mel-generalized cepstrum parameters, or the like, such as a WORLD vocoder.

It should be appreciated that although the above discussion relates to generating acoustic waveforms on a phone-by-phone basis using a singing voice synthesizer, since the SVS system architecture of fig. 3 has no dependency on the processing of different phones, the singing voice synthesizer may also be deployed to process multiple phones in parallel, so that acoustic waveforms may be generated in parallel.

It should be appreciated that while fig. 3 illustrates the provision of a style vector representation by the style encoder 360 to the acoustic feature predictor 320, alternatively, the acoustic feature predictor 320 need not receive a style vector representation when the singing voice synthesizer 300 need not employ a specified singing style to synthesize singing voice, and thus may not include the vector combination module 324. In this case, the duration predictor 326 may predict the phoneme duration directly based on the phoneme side vector representation output by the score encoder 322, and the length adjuster 328 may extend the phoneme side vector representation output by the score encoder 322 according to the phoneme duration.

Fig. 4 shows an exemplary process of generating a score according to an embodiment of the invention. In fig. 4, the score generator 410 may generate the score 420 based on various types of score data related to information about the score.

In one case, the score data may be image score data 402. The image score data 402 presents information about a score, such as a score photograph, etc., in the form of an image. The score generator 410 may include an image score identification module 412 that may identify the score 420 from the image score data 402 by any existing image identification technique. In one case, the score data may be audio music data 404. The audio music data 404 presents information about the score, e.g., audio of a piece of song, etc., in the form of audio. Score generator 410 may include an audio score identification module 414 that may identify score 420 from audio music data 404 by any existing audio parsing technique. In one instance, the score data may be symbolic score data 406. The symbolic score data 406 presents information about the score in the form of symbols following a predetermined standard or format, e.g. a score file of MIDI format, etc. Score generator 410 may include a symbolic score identification module 416 that may identify score 420 from symbolic score data 406 based on predetermined criteria or formats. In one instance, the score data may be text score data 408. The text score data 408 presents information about the score in the form of text, e.g., a score file in text format, etc. Score generator 410 may include a text score recognition module 418 that may recognize score 420 from text score data 408 by any existing text recognition technique.

In addition, the score generator 410 may also construct a complete score by combining information in different types of score data. For example, assuming that note information is identified with a high degree of confidence from the image score data 402 and lyric information is identified with a high degree of confidence from the audio music data 404, the note information and the lyric information may be combined to form a complete score.

Fig. 5 shows an exemplary architecture of a score encoder 520 according to an embodiment of the invention. The score encoder 520 may correspond to the score encoder 322 of fig. 3.

The score encoder 520 may include a phoneme embedding module 522, a beat embedding module 524, a pitch embedding module 526, a position encoder module 528, a stacked plurality of Feed Forward Transformer (FFT) modules 530-532, and the like. Although only two FFT modules are shown in fig. 5, it should be understood that this is for exemplary purposes only, and the score encoder 520 may include more or fewer FFT modules.

Referring to the description of fig. 3, the score encoder 520 receives the score phoneme information 510 from the score parser 310. The score phoneme information 510 includes a phoneme name 512, note beats 514 and Fu Yingao 516 of notes corresponding to the phoneme name 512. The phone name 512, note tempo 514 and pitch Fu Yingao 516 are input to a phone embedding module 522, a tempo embedding module 524, and a pitch embedding module 526, respectively. The phoneme embedding module 522 may perform an embedding process on the phoneme names 512 to generate a phoneme embedding vector. Beat embedding module 524 may perform an embedding process on note beat 514 to generate a beat embedding vector. The pitch embedding module 526 may perform an embedding process on the note pitches 516 to generate pitch embedding vectors. The phoneme embedding vector, the beat embedding vector, and the pitch embedding vector may have the same dimensions. The position encoder module 528 may sum the phoneme embedded vector, the beat embedded vector, and the pitch embedded vector by position encoding to obtain a summed vector. The sum vector is passed to stacked FFT modules 530, 532 to obtain the final encoded output, i.e., phoneme-side vector representation 540. In one embodiment, the FFT module may include a self-attention network and a 1D convolutional network with ReLU activation. The self-attention network may include a multi-headed attention mechanism to extract intersection location information.

Fig. 6 shows an exemplary architecture of a spectrum decoder 620 according to an embodiment of the invention. The spectrum decoder 620 may correspond to the spectrum decoder 329 of fig. 3.

The spectral decoder 620 may include a vector combining module 622, a position encoding module 624, a stacked plurality of FFT modules 626-628, a linear layer 630, and the like. Although only two FFT modules are shown in fig. 6, it should be understood that this is for exemplary purposes only and that the spectral decoder 620 may include more or fewer FFT modules.

Referring to the description of fig. 3, spectrum decoder 620 receives spectrum side vector representation 612 from length adjuster 326 and possibly timbre vector representation 614 from timbre encoder 350. Vector combination module 622 may combine spectral side vector representation 612 and tone vector representation 614 to obtain a combined vector representation. In one embodiment, the combining operation may refer to vector stitching of the spectral side vector representation 612 and the timbre vector representation 614 such that the dimension of the resulting combined vector representation will be the sum of the dimensions of the spectral side vector representation 612 and the dimension of the timbre vector representation 614. In one embodiment, the combining operation may refer to vector summing of the spectral side vector representation 612 and the color vector representation 614. In this case, the spectral side vector representation 612, the timbre vector representation 614, and the combined vector representation will all have the same dimensions. .

The position encoding module 624 position encodes the combined vector representation from the vector combining module 622 to generate a position encoded combined vector representation. The position-encoded combined vector representation is passed to stacked FFT modules 626, 628. Similar to the FFT modules 530-532 in the score encoder 520, the FFT modules 626-628 may include a self-attention network and a 1D convolution network. The linear layer 630 may linearly transform the output vector representation from the last FFT module 628 to obtain the fundamental delta 632 and the spectral parameters 634. As previously described, timbre vector representation 614 will affect at least the generation of spectral parameters 634.

It should be appreciated that although fig. 6 shows that spectral decoder 620 may obtain timbre vector representation 614, spectral decoder 620 may not receive timbre vector representation 614, depending on the actual application scenario, so that vector combination module 622 may be omitted.

Fig. 7 illustrates an exemplary application scenario of singing voice synthesis according to an embodiment of the present invention. In this exemplary application scenario, the user may wish to synthesize a particular song sung based on the style of singer a and the timbre of singer B using singing voice synthesizer 700.

The user may input a style ID indicating the style of singing for singer a. Style encoder 710 may provide a style vector representation corresponding to singer a based on the style ID. Meanwhile, the user may input a tone color ID indicating the tone color of singer B. Tone encoder 720 may provide a tone vector representation corresponding to singer B based on the tone ID. The style vector representation corresponding to singer a and the timbre vector representation corresponding to singer B are provided as parameters to singer synthesizer 700. Meanwhile, the user may input score data of a specific song C. The score generator 730 may generate a score based on the score data and provide the score to the singing voice synthesizer 700. The singing voice synthesizer 700 may correspond to the singing voice synthesizer 300 of fig. 3 and is capable of synthesizing a song C singing with the timbre of singer B and with the style of singer a.

It should be appreciated that fig. 7 illustrates only one exemplary scenario to which embodiments of the present disclosure may be applied, which may vary with specific application requirements, and that embodiments of the present disclosure may be applied to a variety of other scenarios as well.

In one application scenario, to enable a user to synthesize a song using his own singing style or timbre, the user's own corpus may be obtained in advance and used to train a style encoder and/or timbre encoder to obtain a style vector representation and/or timbre vector representation associated with the user. When the user wants to use his own singing style, the user can provide a "style ID" corresponding to himself, so that the singing voice synthesizer can obtain a style vector representation of the user and thus synthesize the singing voice in the user's singing style. When the user wants to use his own tone, the user can provide a "tone ID" corresponding to himself, so that the singing voice synthesizer can obtain a tone vector representation of the user and thus synthesize the singing voice using the user's tone.

In one application scenario, a user may desire to adapt an audio clip of a presentation song. The presentation song audio clip may be a recording of the user's own singing song, or a singing recording of any other singer. The user may wish to replace the singer's voice in the original audio clip with the designated singer's timbre, replace the singing style of the original audio clip with the designated singing style, and so on. In this case, the presentation song audio pieces provided by the user may be input as audio music data to, for example, the score generator 410 of fig. 4, and the score generator may generate the corresponding score based on the presentation song audio pieces. In addition, the user may also provide a desired timbre ID and/or singing style ID. Accordingly, the singing voice synthesizer may perform singing voice synthesis based on the generated score, the timbre vector representation corresponding to the timbre ID, and/or the style vector representation corresponding to the singing style ID.

Fig. 8 shows an exemplary process 800 for singing voice synthesis based on a score in accordance with an embodiment of the invention.

At 810, first score phoneme information associated with, for example, a first phoneme may be extracted from a score. The first score phoneme information may include a pitch and a beat of a note corresponding to the first phoneme. For example, the first score phoneme information may be extracted from the score by a score parser.

At 815, a first vector representation, such as a phoneme-side vector representation, may be generated based on the first score phoneme information. For example, a first vector representation corresponding to the first phoneme may be generated by a score encoder based on the first score phoneme information.

At 820, optionally, an indication of a style of singing may be received. The indication about the singing style may be a singing style ID or a style vector representation obtained based on the singing style ID.

At 825, a phoneme-side combined vector representation may be generated based on the first vector representation and a style vector representation corresponding to the singing style. For example, the first vector representation and the style vector representation may be added, stitched, or concatenated.

At 830, a phoneme duration for the first phoneme may be determined by a duration predictor based on the phoneme-side combined vector representation. The duration predictor may be configured to predict phoneme duration at least under the constraint of note tempo.

At 835, the phoneme-side combined vector representation may be expanded into a second vector representation, such as a spectral-side vector representation, based on the phoneme length of the first phoneme. For example, the phoneme-side combined vector representation may be expanded into a second vector representation by a length adjuster based on the phoneme length of the first phoneme.

At 840, optionally, an indication of the singer's timbre may be received. The indication of the singer's timbre may be a timbre ID or a timbre vector representation obtained based on the timbre ID.

At 845, a base frequency delta and spectral parameters corresponding to the first phone may be generated based on the second vector representation and possibly a timbre vector representation corresponding to the singer timbre. The base frequency delta and the spectral parameters may be generated, for example, by a spectral decoder.

At 850, a fundamental frequency corresponding to the first phoneme may be obtained by adjusting the pitch of the note with the fundamental frequency delta. For example, the base frequency dispersion may be superimposed with the standard base frequency of the pitch of the first phoneme by a pitch adjuster to obtain the base frequency to be employed.

At 860, an acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency obtained at 850 and the spectral parameters obtained at 845. For example, acoustic waveforms may be generated by a vocoder.

By performing the above-described process 800 on each of all phonemes identified from the score, a plurality of acoustic waveforms corresponding to the phonemes, respectively, may be obtained. The set of these acoustic waveforms forms the singing voice to be synthesized.

It should be appreciated that all of the processing in process 800 is exemplary and that process 800 may be modified in any manner depending on the particular application requirements. For example, in the event that no indication of the style of singing is received, steps 820 and 825 may be omitted, and the phoneme length is predicted directly based on the first vector representation at 830. For example, in the event that no indication is received regarding the singer's timbre, step 840 may be omitted, and the base frequency delta and spectral parameters generated based on the second vector representation at 845.

FIG. 9 illustrates an exemplary training process 900 for an acoustic feature predictor according to an embodiment of the present invention. The example training process 900 may employ a large amount of reference audio as training data. The reference audio may be singing voice audio previously collected from different sources, each singing voice audio being singed by a reference singer in a reference singing style. Based on the input reference audio, a loss value may be calculated for the entire acoustic feature predictor by means of at least one loss function predefined and the entire acoustic feature predictor is optimized according to the loss value, thereby generating constraints on a score encoder, a spectrum decoder, a duration predictor, etc. in the acoustic feature predictor. The calculation of the loss value during training is illustrated in fig. 9 by way of example of the input of an exemplary reference audio 902.

The reference audio 902 may be input to a score parser 920. The score parser 920 parses the reference score phoneme information for each reference phoneme from the reference audio 902 and feeds it to the score encoder 932. The reference score phoneme information includes a reference phoneme name, a reference note tempo and a reference note pitch of a reference note corresponding to the reference phoneme name.

The score encoder 932 generates a phoneme side vector representation based on the input reference score phoneme information. When it is desired that the acoustic feature predictor is able to generate acoustic feature parameters from the specified singing style ID, the phoneme side vector representation is input to a vector combining module 934. The vector combination module 934 may combine the style vector representation from the style encoder 950 with the phoneme-side vector representation to obtain a phoneme-side combined vector representation. The phoneme-side combination vector representation is input to a duration predictor 936. The duration predictor 936 predicts the predicted phoneme duration for the reference phoneme based on the phoneme-side combined vector representation for the reference phoneme.

The reference audio 902 may also be input to an audio processing module 910. The audio processing module 910 analyzes the reference audio 902 using a speech recognition algorithm to obtain phoneme alignment data associated with reference phonemes in the reference audio 902. The phoneme alignment data may include information about: a reference phoneme name, and a start time of the reference phoneme in the reference audio 902. Based on the start time of each reference phoneme, the reference phoneme length of the reference phoneme can be calculated. The reference phoneme duration characterizes the actual phoneme duration of the reference phoneme in the reference audio 902.

A phoneme loss function 992 may be defined, the phoneme loss function 992 being used to calculate a phoneme length loss value L _pd based on the difference of the predicted phoneme length of the current reference phoneme from the corresponding reference phoneme length. Thus, alignment of phoneme level durations can be considered when constraining the temporal length predictor 936 by training. The alignment of the phoneme level duration may refer to the alignment of the predicted phoneme duration with the reference phoneme duration over a length of time. Thus, the trained duration predictor 936 can predict phoneme durations at least under the constraint of phoneme-level durations.

In addition, note duration is also significant for SVS, in addition to phoneme duration, because melodies brought by note duration can generate stronger rhythmic feeling, thus making synthesized singing voice more realistic. Thus, a note loss function 994 is further presented in process 900. The note loss function 994 increases control over the note level duration. Assume that the phoneme granularity is 3, i.e. one syllable is divided into 3 phonemes and the syllable has a corresponding note, so that the note is also associated with the 3 phonemes. The 3 predicted phoneme durations derived by the duration predictor 936 for the 3 reference phonemes associated with one reference note may be accumulated first. The note loss function 994 may then calculate a note duration loss value L _sd based on the difference between the total duration of the 3 predicted phoneme durations and the reference note tempo of the reference note in the reference score phoneme information.

The length adjuster 938 generates a spectral side vector representation based on the predicted phoneme length and the phoneme side combined vector representation output by the vector combining module 934.

Tone encoder 950 may provide a tone vector representation associated with the input tone ID to spectrum decoder 939.

The spectral decoder 939 may generate the predicted spectral parameters and the predicted fundamental dispersion based on the spectral side-representation and the timbre vector representation. The pitch adjuster 940 may generate the predicted fundamental frequency based on the predicted fundamental frequency delta and the reference note pitch associated with the current reference phoneme from the score parser 920.

The audio processing module 910 may analyze the reference audio 902 using a speech recognition algorithm to obtain reference fundamental frequency and reference spectral parameters associated with the current reference phoneme. The reference fundamental frequency and the reference spectral parameters characterize the actual fundamental frequency and the actual spectral parameters of the current reference phoneme in the reference audio 902.

A pitch loss function 996 may be defined for calculating a pitch loss value L _f based on the difference between the predicted fundamental frequency from the pitch adjuster 940 and the reference fundamental frequency. A spectral loss function 998 may be defined for calculating a spectral loss value L _sp based on the difference between the predicted spectral parameters from the spectral decoder 938 and the reference spectral parameters.

Thus, the loss value for the entire acoustic feature predictor can be calculated as:

L ₀＝w_f*L_f+w_sp*L_sp+w_pd*L_pd+w_sd*L_sd formula

(2)

Where w denotes weights for different loss values, f denotes pitch or fundamental frequency, sp denotes spectral parameters, pd denotes phoneme duration, sd denotes note duration.

By setting the weight w _sd to a larger value, for example, larger than the weight w _pd, the prediction result of the acoustic feature predictor can be constrained to be more consistent with the rhythm of the music score in the training process, so that singing voice with stronger melody and more similar to the nature of human can be synthesized.

In one embodiment, the average loss value may be calculated for one reference audio set. Further, the acoustic feature predictor may be trained multiple times based on multiple average loss values for multiple reference audio sets.

It should be appreciated that although the process 900 involves a phoneme loss function 992, a note loss function 994, a pitch loss function 996, a spectrum loss function 998, the process 900 may employ more or fewer loss functions depending on the particular application requirements.

It should be appreciated that although the style encoder 950 is referred to in the process 900, the style encoder 950 may be omitted and the vector combination module 934 may be omitted accordingly. In addition, the phoneme encoder 960 may be omitted.

Fig. 10 shows a flowchart of an exemplary method 1000 for singing voice synthesis, according to an embodiment of the invention.

At 1010, first score phoneme information extracted from a score may be received. The first score phoneme information may include a first phoneme and a pitch and a beat of a note corresponding to the first phoneme.

At 1020, a base frequency delta and a spectral parameter corresponding to the first phone may be generated based on the first phone information.

At 1030, a fundamental frequency corresponding to the first phoneme may be obtained by adjusting a pitch of the note with the fundamental frequency delta.

At 1040, an acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.

In one embodiment, the generating the base frequency delta and the spectral parameter corresponding to the first phoneme may include: generating a first vector representation based on the first score phoneme information; determining, by a duration predictor, a phoneme duration of the first phoneme based on the first vector representation, the duration predictor configured to predict a phoneme duration at least under the constraint of a note tempo; expanding the first vector representation into a second vector representation based on a phoneme duration of the first phoneme; and generating the base frequency delta and the spectral parameter corresponding to the first phoneme based at least on the second vector representation.

In one embodiment, the training data for the duration predictor may include at least: the reference phoneme duration of each reference phoneme extracted from the reference audio and the tempo of each reference note.

In one embodiment, the training of the duration predictor employs a first loss function for calculating a difference between: a phoneme duration predicted by the duration predictor for a reference phoneme; and a reference phoneme duration of the reference phoneme.

In one embodiment, the training of the duration predictor also employs a second loss function for calculating a difference between: a sum of a plurality of phoneme durations predicted by the duration predictor for a plurality of reference phonemes corresponding to a reference note; and the tempo of the reference note.

In one embodiment, the first and second loss functions may have different weights in the training of the duration predictor.

In one embodiment, the first loss function may have a weight that is less than the weight of the second loss function.

In one embodiment, the method 1000 may further include: an indication of a style of singing is received. The determining of the phoneme duration of the first phoneme may be further based on the singing style. The generating of the base frequency delta and the spectral parameter corresponding to the first phoneme may be further based on the singing style.

In one embodiment, the method 1000 may further include: an indication of a timbre of the target singer is received. The generating of the spectral parameters corresponding to the first phoneme may be further based on a timbre of the target singer.

In one embodiment, the method 1000 may further include: receiving an indication of a style of singing of a first target singer; and receiving an indication of a timbre of the second target singer. The determining of the phone duration of the first phone may be further based on a singing style of the first target singer, the generating of the base frequency delta corresponding to the first phone may be further based on a singing style of the first target singer, and the generating of the spectral parameter corresponding to the first phone may be further based on a singing style of the first target singer and a tone of the second target singer.

In one embodiment, the fundamental frequency delta and the spectral parameter corresponding to the first phoneme may be generated by a feed-forward neural network based on a self-attention mechanism.

In one embodiment, the fundamental frequency delta and the spectral parameter corresponding to the first phoneme may be generated in a non-autoregressive manner.

In one embodiment, the score may be generated based on at least one of: image score data, audio music data, symbolic score data, and text score data.

In one embodiment, the phoneme duration may be in time frames.

It should be appreciated that the method 1000 may also include any steps/processes for singing voice synthesis in accordance with embodiments of the present disclosure described above.

Fig. 11 shows a block diagram of an exemplary apparatus 1100 for singing voice synthesis, according to an embodiment of the invention.

The apparatus 1100 may include: an acoustic feature predictor 1110 for receiving first score phoneme information extracted from a score, the first score phoneme information including a first phoneme and a pitch and a beat of a note corresponding to the first phoneme, and generating a fundamental frequency delta and a spectral parameter corresponding to the first phoneme based on the first score phoneme information; a pitch adjuster 1120 for obtaining a fundamental frequency corresponding to the first phoneme by adjusting a pitch of the notes using the fundamental frequency delta; and a vocoder 1130 for generating an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters.

In one implementation, the acoustic feature predictor 1110 may further include: a score encoder 1112 for generating a first vector representation based on the first score phoneme information; a duration predictor 1114 for determining a phoneme duration of the first phoneme based on the first vector representation, the duration predictor being configured to predict a phoneme duration at least under the constraint of a note beat; a length adjuster 1116 for expanding the first vector representation into a second vector representation based on a phoneme length of the first phoneme; and a spectral decoder 1118 for generating the base frequency delta and the spectral parameters corresponding to the first phoneme based at least on the second vector representation.

In one embodiment, the training data for the duration predictor 1114 may include at least a reference phoneme duration of each reference phoneme extracted from the reference audio and a tempo of each reference note. The duration predictor 1114 may be trained based at least on a loss function for calculating the difference between: a sum of a plurality of phoneme durations predicted by the duration predictor 1114 for a plurality of reference phonemes corresponding to a reference note; and the tempo of the reference note.

In one embodiment, the spectrum decoder 1118 may be configured to: receiving an indication of a style of singing; and generating the base frequency delta and the spectral parameter corresponding to the first phoneme based at least on the second vector representation and the singing style.

In one embodiment, the spectrum decoder 1118 may be configured to: receiving an indication of a timbre of the target singer; and generating the spectral parameters corresponding to the first phone based at least on the second vector representation and a timbre of the target singer.

In addition, the apparatus 1100 may also include any other modules that perform any of the steps/processes in the method for singing voice synthesis according to the embodiments of the present disclosure described above.

Fig. 12 shows a block diagram of an exemplary apparatus 1200 for singing voice synthesis, according to an embodiment of the invention.

The apparatus 1200 may include: at least one processor 1210; and a memory 1220 storing computer executable instructions. The computer-executable instructions, when executed, cause the at least one processor 1210 to: receiving first score phoneme information extracted from a score, the first score phoneme information comprising a first phoneme and a pitch and a beat of a note corresponding to the first phoneme; generating a base frequency delta and a spectrum parameter corresponding to the first phoneme based on the first music score phoneme information; obtaining a fundamental frequency corresponding to the first phoneme by adjusting a pitch of the note using the fundamental frequency delta; and generating an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters. Further, processor 1210 may also perform any of the steps/processes for singing voice synthesis according to embodiments of the present disclosure described above.

Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any of the operations of the method for singing voice synthesis according to the embodiments of the present disclosure described above.

It should be understood that all operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or to the order of such operations, but rather should cover all other equivalent variations under the same or similar concepts.

It should also be understood that all of the modules in the apparatus described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or combined together.

The processor has been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, digital Signal Processor (DSP), field Programmable Gate Array (FPGA), programmable Logic Device (PLD), state machine, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software that is executed by a microprocessor, microcontroller, DSP, or other suitable platform.

Software should be construed broadly to mean instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. Computer-readable media may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strips), optical disk, smart card, flash memory device, random Access Memory (RAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically Erasable PROM (EEPROM), registers, or removable disk. Although the memory is shown separate from the processor in various aspects presented in this disclosure, the memory may also be located internal to the processor (e.g., in a cache or register).

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Accordingly, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims

1. A method for singing voice synthesis, comprising:

receiving first score phoneme information extracted from a score, the first score phoneme information comprising a first phoneme and a pitch and a beat of a note corresponding to the first phoneme;

Based on the first score phoneme information, generating a base frequency delta and a spectral parameter corresponding to the first phoneme, wherein the base frequency delta is indicative of a deviation between: a standard fundamental frequency corresponding to a standard pitch of the first phoneme; and a fundamental frequency corresponding to the first phoneme to be used in the synthesized singing voice;

obtaining a fundamental frequency corresponding to the first phoneme by adjusting a pitch of the note using the fundamental frequency delta; and

An acoustic waveform corresponding to the first phoneme is generated based at least in part on the fundamental frequency and the spectral parameters.

2. The method of claim 1, wherein the generating a base frequency delta and a spectral parameter corresponding to the first phoneme comprises:

Generating a first vector representation based on the first score phoneme information;

determining, by a duration predictor, a phoneme duration of the first phoneme based on the first vector representation, the duration predictor configured to predict a phoneme duration at least under the constraint of a note tempo;

expanding the first vector representation into a second vector representation based on a phoneme duration of the first phoneme; and

The base frequency delta and the spectral parameter corresponding to the first phoneme are generated based at least on the second vector representation.

3. The method of claim 2, wherein training data for the duration predictor comprises at least: the reference phoneme duration of each reference phoneme extracted from the reference audio and the tempo of each reference note.

4. A method according to claim 3, wherein the training of the duration predictor employs a first loss function for calculating a difference between:

A phoneme duration predicted by the duration predictor for a reference phoneme; and

And the reference phoneme duration of the reference phoneme.

5. The method of claim 4, wherein training of the duration predictor further employs a second loss function for calculating a difference between:

a sum of a plurality of phoneme durations predicted by the duration predictor for a plurality of reference phonemes corresponding to a reference note; and

The tempo of the reference note.

6. The method of claim 5, wherein the first and second loss functions have different weights in training of the duration predictor.

7. The method of claim 6, wherein the first loss function has a weight that is less than a weight of the second loss function.

8. The method of claim 2, further comprising:

An indication of a style of singing is received,

Wherein the determining of the phone duration of the first phone is further based on the singing style, and the generating of the base frequency delta and the spectral parameter corresponding to the first phone is further based on the singing style.

9. The method of claim 1, further comprising:

Receiving an indication of a timbre of the target singer, an

Wherein the generating of the spectral parameters corresponding to the first phoneme is further based on a timbre of the target singer.

10. The method of claim 2, further comprising:

Receiving an indication of a style of singing of a first target singer; and

Receiving an indication of a timbre of the second target singer, an

Wherein the determining of the phone duration of the first phone is further based on a style of singing of the first target singer,

The generating of the base frequency difference corresponding to the first phoneme is further based on a singing style of the first target singer, and

The generating of the spectral parameters corresponding to the first phoneme is further based on a style of singing of the first target singer and a timbre of the second target singer.

11. The method of claim 1, wherein the base frequency delta and the spectral parameter corresponding to the first phoneme are generated by a self-attention based feedforward neural network.

12. The method of claim 1, wherein the base frequency delta and the spectral parameter corresponding to the first phoneme are generated in a non-autoregressive manner.

13. The method of claim 1, wherein the score is generated based on at least one of: image score data, audio music data, symbolic score data, and text score data.

14. The method of claim 2, wherein the phoneme duration is in time frames.

15. An apparatus for singing voice synthesis, comprising:

An acoustic feature predictor for: receiving first score phoneme information extracted from a score, the first score phoneme information comprising a first phoneme and a pitch and a beat of a note corresponding to the first phoneme; and generating a base frequency delta and a spectral parameter corresponding to the first phoneme based on the first score phoneme information, wherein the base frequency delta indicates a deviation between: a standard fundamental frequency corresponding to a standard pitch of the first phoneme; and a fundamental frequency corresponding to the first phoneme to be used in the synthesized singing voice;

a pitch adjuster for obtaining a fundamental frequency corresponding to the first phoneme by adjusting a pitch of the note using the fundamental frequency delta; and

A vocoder for generating an acoustic waveform corresponding to the first phoneme based at least in part on the fundamental frequency and the spectral parameters.

16. The apparatus of claim 15, wherein the acoustic feature predictor comprises:

A score encoder for generating a first vector representation based on the first score phoneme information;

a duration predictor for determining a phoneme duration of the first phoneme based on the first vector representation, the duration predictor being configured to predict a phoneme duration at least under the constraint of a note tempo;

A length adjuster for expanding the first vector representation into a second vector representation based on a phoneme length of the first phoneme; and

A spectral decoder for generating the base frequency delta and the spectral parameters corresponding to the first phoneme based at least on the second vector representation.

17. The apparatus of claim 16, wherein the training data for the duration predictor includes at least a reference phoneme duration of each reference phoneme extracted from the reference audio and a tempo of each reference note, and the duration predictor is trained based at least on a loss function for calculating a difference between:

The tempo of the reference note.

18. The apparatus of claim 16, wherein the spectral decoder is configured to:

Receiving an indication of a style of singing; and

The base frequency delta and the spectral parameter corresponding to the first phoneme are further generated based at least on the second vector representation and the singing style.

19. The apparatus of claim 16, wherein the spectral decoder is configured to:

receiving an indication of a timbre of the target singer; and

The spectral parameters corresponding to the first phone are generated based at least on the second vector representation and a timbre of the target singer.

20. An apparatus for singing voice synthesis, comprising:

At least one processor; and

A memory storing computer-executable instructions that, when executed, cause the at least one processor to:

receiving first score phoneme information extracted from a score, the first score phoneme information including a first phoneme and a pitch and a beat of a note corresponding to the first phoneme,

Based on the first score phoneme information, generating a base frequency delta and a spectral parameter corresponding to the first phoneme, wherein the base frequency delta is indicative of a deviation between: a standard fundamental frequency corresponding to a standard pitch of the first phoneme; and a fundamental frequency corresponding to the first phoneme to be used in the synthesized singing voice,

Obtaining a fundamental frequency corresponding to the first phoneme by adjusting a pitch of the note using the fundamental frequency delta, and