WO2023276234A1 - Information processing device, information processing method, and program - Google Patents
Information processing device, information processing method, and program
- Publication number
- WO2023276234A1 · PCT/JP2022/005001 · JP2022005001W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- feature amount
- feature
- information processing
- vocal signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
- G10H1/366—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Definitions
- the present disclosure relates to an information processing device, an information processing method, and a program.
- Voice quality refers to the attributes of human speech, produced by a speaker and perceived by a listener, over a plurality of speech units (e.g., phonemes); even utterances with the same content may be perceived differently by listeners depending on their timbre.
- Japanese Unexamined Patent Application Publication No. 2002-200001 describes a voice quality conversion technique that converts a general uttered voice into the voice quality of another speaker while preserving the contents of the utterance.
- One object of the present disclosure is to provide an information processing device, an information processing method, and a program that perform appropriate voice quality conversion processing.
- One aspect of the present disclosure is an information processing apparatus having a voice quality conversion unit that performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
- Another aspect is an information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation, and a program that causes a computer to execute the method.
- FIG. 1 is a diagram for explaining an overview of one embodiment.
- FIG. 2 is a block diagram showing a configuration example of a smartphone according to one embodiment.
- FIG. 3 is a block diagram illustrating a configuration example of a voice quality conversion unit according to one embodiment.
- FIG. 4 is a diagram for explaining an example of learning performed by the voice quality conversion unit according to one embodiment.
- FIG. 5 is a diagram referred to when describing the operation of the smartphone according to one embodiment.
- FIG. 6 is a diagram for explaining an example of processing that is performed accompanying the voice quality conversion processing that is performed in one embodiment.
- FIG. 7 is a diagram for explaining another example of processing that accompanies the voice quality conversion processing that is performed in one embodiment.
- FIG. 8 is a diagram for explaining a modification.
- FIG. 9 is a diagram for explaining a modification.
- A sound source separation process PA is performed on the mixed sound source shown in FIG. 1.
- a mixed sound source can be provided by a recording medium such as a CD (Compact Disc) or distributed via a network.
- the mixed sound source includes, for example, an artist's vocal signal (an example of the first vocal signal, hereinafter also referred to as vocal signal VSA as appropriate).
- the mixed sound source includes signals other than the vocal signal VSA (such as musical instrument sounds, hereinafter also referred to as accompaniment signals as appropriate).
- the karaoke user's singing voice is picked up by a microphone or the like.
- a user's singing voice (an example of a second vocal signal) is also appropriately referred to as a vocal signal VSB.
- Voice conversion processing PB is performed on the vocal signal VSA and the vocal signal VSB.
- a process of bringing one of the vocal signal VSA and the vocal signal VSB closer to (similar to) the other vocal signal is performed.
- voice quality conversion processing is performed to bring a karaoke user's vocal signal VSB closer to an artist's vocal signal VSA.
- an addition process PC is performed to add the vocal signal VSB subjected to the voice quality conversion process and the accompaniment signal, and the signal subjected to the addition process PC is subjected to the reproduction process PD.
- the user's singing voice that has undergone voice quality conversion processing to approximate the vocal signal of the artist is reproduced.
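- The overall flow of FIG. 1 can be sketched in code as follows. This is an illustrative sketch only, not the implementation of the present disclosure; `separate_sources` and `convert_voice_quality` are hypothetical stand-ins for the sound source separation PA and the voice quality conversion PB.

```python
import numpy as np

def separate_sources(mixed: np.ndarray):
    """Stand-in for sound source separation PA: returns (vocal VSA, accompaniment)."""
    # A real system would use a trained separation model; here the mix is simply split.
    return mixed * 0.5, mixed * 0.5

def convert_voice_quality(vsb: np.ndarray, vsa: np.ndarray) -> np.ndarray:
    """Stand-in for voice quality conversion PB: brings VSB closer to VSA."""
    return vsb  # a real system would re-synthesize VSB using VSA's speaker embedding

def karaoke_pipeline(mixed: np.ndarray, user_mic: np.ndarray) -> np.ndarray:
    vsa, accompaniment = separate_sources(mixed)          # PA: sound source separation
    converted_vsb = convert_voice_quality(user_mic, vsa)  # PB: voice quality conversion
    output = converted_vsb + accompaniment                # PC: addition
    return output                                         # PD: reproduction (playback)
```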
- FIG. 2 is a block diagram illustrating a configuration example of an information processing apparatus according to one embodiment.
- An example of the information processing apparatus according to this embodiment is a smartphone (smartphone 100).
- Using the smartphone 100, the user can easily perform karaoke in which voice quality can be changed.
- In this embodiment, karaoke, that is, singing, is described as an example; however, the present disclosure is not limited to singing and can also be applied to voice quality conversion processing for utterances such as conversation.
- The information processing apparatus according to the present disclosure is applicable not only to smartphones but also to portable electronic devices such as smartwatches, as well as to personal computers, stationary karaoke machines, and the like.
- The smartphone 100 has, for example, a control unit 101, a sound source separation unit 102, a voice quality conversion unit 103, a microphone 104, and a speaker 105.
- the control unit 101 comprehensively controls the smartphone 100 as a whole.
- The control unit 101 is configured with, for example, a CPU (Central Processing Unit), and has a ROM (Read Only Memory) in which programs are stored, a RAM (Random Access Memory) used as a work memory, and the like (illustration of these memories is omitted).
- the control unit 101 has a speaker feature estimation unit 101A as a functional block.
- The speaker feature amount estimation unit 101A estimates a feature amount corresponding to a feature that does not change over time as the song progresses, specifically, a feature amount related to the speaker (hereinafter referred to as a speaker feature amount as appropriate).
- control unit 101 has a feature quantity mixing unit 101B as a functional block.
- the feature amount mixing unit 101B for example, mixes two or more speaker feature amounts with appropriate weights.
- the sound source separation unit 102 separates the input mixed sound signal into a vocal signal and an accompaniment signal (sound source separation processing).
- a vocal signal whose sound source has been separated is supplied to the voice conversion unit 103 .
- the accompaniment signal separated from the sound source is supplied to the speaker 105 .
- the voice quality conversion unit 103 performs voice quality conversion processing so that the voice quality of the vocal signal corresponding to the user's singing voice picked up by the microphone 104 is brought closer to the vocal signal separated by the sound source separation unit 102 .
- the details of the processing performed by voice quality conversion section 103 will be described later.
- the voice quality in this embodiment includes feature amounts such as pitch and volume in addition to speaker feature amounts.
- the microphone 104 picks up, for example, singing or speaking (singing in this example) of the user of the smartphone 100 .
- a vocal signal corresponding to the collected singing is supplied to the voice conversion unit 103 .
- the accompaniment signal supplied from the sound source separation unit 102 and the vocal signal output from the voice quality conversion unit 103 are added by an addition unit (not shown).
- the summed signal is reproduced from speaker 105 .
- the smartphone 100 may have a configuration other than the configuration illustrated in FIG. 2 (for example, a display and buttons configured as a touch panel).
- FIG. 3 is a block diagram showing a configuration example of the voice quality conversion section 103.
- Voice quality conversion section 103 has encoder 103A, feature amount mixing section 103B, and decoder 103C.
- Encoder 103A uses a learning model obtained by predetermined learning to extract feature quantities from the vocal signal.
- The feature amounts extracted by the encoder 103A are, for example, feature amounts that change over time as the song progresses, and specifically include at least one of pitch information, volume information, and utterance (lyrics) information.
- the feature amount mixing unit 103B mixes the feature amounts extracted by the encoder 103A.
- the feature amount mixed by the feature amount mixing unit 103B is supplied to the decoder 103C.
- the decoder 103C generates a vocal signal based on the feature amount and speaker feature amount supplied from the feature amount mixing unit 103B.
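- As an illustration of how the encoder 103A, the speaker embedding, and the decoder 103C could be wired together, the following PyTorch sketch uses simple GRU modules. The module structure, layer sizes, and input format (mel-spectrogram-like frames) are assumptions made for illustration and are not the configuration disclosed here.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Extracts time-varying features (pitch/volume/content) from vocal frames."""
    def __init__(self, in_dim=80, feat_dim=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, feat_dim, batch_first=True)
    def forward(self, x):              # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return h                        # (batch, frames, feat_dim)

class Decoder(nn.Module):
    """Generates vocal frames from time-varying features plus a speaker embedding."""
    def __init__(self, feat_dim=64, spk_dim=32, out_dim=80):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + spk_dim, 128, batch_first=True)
        self.out = nn.Linear(128, out_dim)
    def forward(self, feats, spk_emb):
        spk = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)  # broadcast over time
        h, _ = self.rnn(torch.cat([feats, spk], dim=-1))
        return self.out(h)

# Inference-time mixing: keep the user's time-varying features,
# but condition the decoder on the target singer's speaker embedding.
enc, dec = Encoder(), Decoder()
user_frames = torch.randn(1, 200, 80)   # frames of the user's singing (VSB)
target_spk_emb = torch.randn(1, 32)     # speaker embedding estimated from VSA
converted = dec(enc(user_frames), target_spk_emb)
```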
- Next, an example of the learning performed by the voice quality conversion unit 103 will be described with reference to FIG. 4. Note that FIG. 4 omits illustration of the feature amount mixing unit 103B and the feature amount mixing unit 101B of the voice quality conversion unit 103.
- The voice quality conversion unit 103 is trained using vocal signals (which may include normal speech) of multiple singers.
- The vocal signals may be parallel data in which a plurality of singers sing the same content, or may be non-parallel data; in this example, they are treated as non-parallel data, which is more realistic but more difficult to learn from.
- the vocal signals of multiple singers are stored in a suitable database 110 .
- a predetermined vocal signal is input as input singing voice data x to the above-described speaker feature estimation section 101A and encoder 103A.
- the speaker feature amount estimation unit 101A estimates speaker feature amounts from the input singing voice data x.
- the encoder 103A extracts pitch information, volume information, and utterance content (lyrics) from the input singing voice data x as examples of feature amounts.
- These feature amounts are defined, for example, by embedding vectors represented as multidimensional vectors. The features defined by the embedding vectors are appropriately referred to as the speaker embedding, pitch embedding, volume embedding, and content embedding.
- The decoder 103C takes these feature amounts as input and performs processing to reconstruct speech. During learning, the decoder 103C is trained so that its output reconstructs the input singing voice data x; for example, it is trained so as to minimize the loss function, calculated by the loss function calculation unit 115 shown in FIG. 4, between the input singing voice data x and the output of the decoder 103C.
- If each embedding reflects only the corresponding feature and carries no information about the other features, then replacing some embeddings with those of another speaker at inference time converts only the corresponding features. For example, by replacing only the speaker embedding with that of another person, the voice quality (voice quality in the narrow sense, which does not include pitch and the like) can be converted while the pitch, volume, and utterance content are maintained.
- As methods of obtaining embedding vectors that separate features in this way, there are a method of obtaining an embedding from a feature amount that reflects only a specific feature, and a method of training an encoder that extracts only a specific feature from data (a predetermined vocal signal).
- For example, the pitch embedding can be obtained from the fundamental frequency f0 extracted by a fundamental frequency extractor, the volume embedding from the average power p, the speaker embedding from the speaker label n, and the content embedding from features obtained by automatic speech recognition (ASR).
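- The following sketch shows one way such raw feature amounts could be obtained in Python; it is illustrative only. The use of librosa, the file name vocal.wav, and the frame parameters are assumptions, and the ASR-based content features are only indicated in a comment.

```python
import numpy as np
import librosa

y, sr = librosa.load("vocal.wav", sr=16000)   # hypothetical input file

# Fundamental frequency f0 (basis for the pitch embedding)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Average power per frame (basis for the volume embedding)
rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]
avg_power = rms ** 2

# Speaker label n (basis for the speaker embedding), e.g. as a one-hot vector
num_speakers, n = 10, 3
speaker_onehot = np.eye(num_speakers)[n]

# Content features would come from an ASR front end (not shown here),
# e.g. per-frame phoneme posteriors or bottleneck features.
```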
- Learning is performed, for example, as follows. The pitch embedding, the volume embedding, and the speaker embedding can each be obtained by adversarial learning, and the content embedding, for which accurate labels are difficult to obtain, is obtained by learning from data.
- Specifically, the encoder that extracts the content embedding from the input singing voice data x can be trained by adding a loss function that uses a critic, which tries to estimate the other feature amounts from the content embedding, to the loss function for reconstructing the input. In the corresponding formula, one loss function is for training the encoder 103A and the decoder 103C, another is for the critic, and a weight parameter balances the two; the remaining symbols denote the parameters of the encoder 103A, the decoder 103C, and the critic.
- Furthermore, by vector-quantizing the output of the encoder that extracts the content embedding and thereby compressing the information, the content embedding can be induced to retain only information that is not contained in the other information given to the decoder. Here, sg() is a gradient-stop operator that prevents the gradient information of the neural network from being passed to subsequent layers, and V() is a vector quantization operation.
- Various forms are conceivable for the reconstruction loss function, depending on the types of decoder and encoder. For example, in the case of a variational autoencoder (VAE) or a vector-quantized VAE, the variational lower bound (ELBO) can be used; in the case of a generative adversarial network, the loss can be expressed as a weighted sum of the reconstruction error between the input and the output and an adversarial loss.
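- The formulas referred to above are rendered as images in the original publication and are not reproduced here. As a hedged reading of the text (not the published equations), one plausible form of the objective, with content encoder $E_c$, decoder $D$, critic $C$, gradient stop $\mathrm{sg}()$, vector quantization $V()$, and pitch, volume, and speaker embeddings $z_{f_0}$, $z_{\mathrm{pow}}$, $z_{\mathrm{spk}}$ (notation introduced here for illustration), is:

$$\mathcal{L}(\theta_E,\theta_D)=\mathcal{L}_{\mathrm{rec}}\!\left(x,\ D\!\big(V(E_c(x)),\,z_{f_0},\,z_{\mathrm{pow}},\,z_{\mathrm{spk}}\big)\right)\;-\;\lambda\,\mathcal{L}_{\mathrm{critic}}\!\left(C\big(\mathrm{sg}(V(E_c(x)))\big),\,(z_{f_0},z_{\mathrm{pow}},z_{\mathrm{spk}})\right)$$

In the generative-adversarial case, the reconstruction term itself may take a form such as

$$\mathcal{L}_{\mathrm{rec}}=\lVert x-\hat{x}\rVert^{2}+\lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}$$

where the critic is trained with the opposite sign of its term; the exact weighting and signs in the publication may differ.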
- Note that the learning described above is performed without changing the speaker information estimated by the speaker feature amount estimation unit; after this learning, the speaker information may be changed. Future information may also be used during this learning.
- The method of obtaining the speaker embedding that determines the voice quality from the speaker label has been described above. However, with this method, the singer to be converted to must be included in the training data in advance, and voice quality conversion cannot be performed for an arbitrary singer (an unknown speaker). Therefore, methods of obtaining the speaker embedding from a speech signal will be described; for example, the following two methods are conceivable.
- The first method estimates the speaker information of a predetermined speaker (for example, a speaker whose singing voice data has characteristics similar to those of the singer to be converted to) based on a vocal signal. A speaker feature amount estimation unit F( ) is trained to estimate, from the singing sound of speaker n, the speaker embedding learned using the speaker label n. F can be constructed with a neural network or the like and is trained so as to minimize the distance to the speaker embedding; an Lp norm can be used as the distance.
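- A minimal sketch of this first method, assuming PyTorch, a lookup table of already-learned speaker embeddings, and an L2 norm as the Lp distance, might look as follows; the network shape and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Speaker embeddings already learned per speaker label n (frozen lookup table).
num_speakers, spk_dim = 10, 32
speaker_table = nn.Embedding(num_speakers, spk_dim)
speaker_table.weight.requires_grad_(False)

class SpeakerEstimatorF(nn.Module):
    """F(): estimates the speaker embedding directly from singing frames."""
    def __init__(self, in_dim=80, spk_dim=32):
        super().__init__()
        self.rnn = nn.GRU(in_dim, 128, batch_first=True)
        self.head = nn.Linear(128, spk_dim)
    def forward(self, x):              # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.head(h[:, -1])     # one embedding per utterance

f_est = SpeakerEstimatorF()
opt = torch.optim.Adam(f_est.parameters(), lr=1e-4)

frames = torch.randn(8, 200, 80)                 # singing frames of speakers n
labels = torch.randint(0, num_speakers, (8,))
target = speaker_table(labels)                   # learned speaker embeddings
loss = torch.linalg.vector_norm(f_est(frames) - target, ord=2, dim=-1).mean()
opt.zero_grad(); loss.backward(); opt.step()
```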
- The second method trains a singer identification model that estimates the speaker information of the speaker based on a predetermined vocal signal.
- A speaker feature amount estimation unit G( ) that extracts the speaker embedding from a singing sound is trained prior to the training of the voice quality conversion unit 103.
- G can be trained by minimizing the following objective function L using singer-labeled singing sound data of multiple singers.
- Here, K(x, y) is the cosine distance between x and y, and the objective uses another singing voice by the same singer n and a singing voice by a different singer m (m ≠ n).
- Using G trained in this way, the speaker embedding is obtained from the singing sound and used for the training of the voice quality conversion unit 103.
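- A hedged sketch of such an objective, assuming the cosine-distance form suggested by the text (same-singer pairs pulled together, different-singer pairs pushed apart), is shown below; the exact objective L in the publication may differ, and the stand-in network G is only an example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_distance(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def g_objective(G, x_n, x_n_other, x_m):
    """x_n, x_n_other: excerpts by the same singer n; x_m: excerpt by a singer m != n."""
    return (cosine_distance(G(x_n), G(x_n_other))
            - cosine_distance(G(x_n), G(x_m))).mean()

# Tiny stand-in for G: flattens a (batch, 50, 80) excerpt into a 32-dim embedding.
G = nn.Sequential(nn.Flatten(1), nn.Linear(50 * 80, 32))
loss = g_objective(G, torch.randn(4, 50, 80), torch.randn(4, 50, 80), torch.randn(4, 50, 80))
```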
- To obtain an accurate speaker embedding, the input speech given to the speaker feature amount estimation unit G( ) should be sufficiently long, because the singer's characteristics cannot be sufficiently extracted from short speech. On the other hand, an excessively long input has the demerit that the required memory becomes enormous. Therefore, it is possible to use a recurrent neural network having a recursive structure in G( ), or to use the average of speaker embeddings obtained from a plurality of short-time segments.
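- The segment-averaging alternative mentioned above could look like the following sketch; the segment length and the embedding function are placeholders for illustration.

```python
import numpy as np

def segment_average_embedding(frames: np.ndarray, embed_fn, seg_len: int = 100):
    """Average speaker embeddings over short segments instead of one long input.
    frames: (num_frames, feat_dim); embed_fn: maps a segment to an embedding vector."""
    embs = [embed_fn(frames[s:s + seg_len])
            for s in range(0, len(frames) - seg_len + 1, seg_len)]
    return np.mean(embs, axis=0)
```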
- Voice conversion is performed by the voice conversion unit 103 learned as described above.
- the voice quality conversion process performed by the smartphone 100 will be described with reference to FIG.
- the vocal signal VSB is the karaoke user's singing voice data.
- the vocal signal VSA is singing data of a singer whose voice quality the karaoke user wants to approximate, and is a vocal signal separated from the sound source.
- Each of the vocal signal VSA and the vocal signal VSB is input to the voice quality conversion section 103 .
- the encoder 103A extracts feature quantities such as pitch and volume from the vocal signal VSA and the vocal signal VSB.
- a control signal designating a feature to be replaced is input to the feature quantity mixing unit 103B.
- For example, the feature amount mixing unit 103B replaces the pitch information extracted from the vocal signal VSB with the pitch information extracted from the vocal signal VSA, in accordance with the control signal.
- The feature amounts mixed by the feature amount mixing unit 103B are input to the decoder 103C.
- the vocal signal VSA and the vocal signal VSB are input to the speaker feature estimation unit 101A.
- the speaker feature estimation unit 101A estimates speaker information from each vocal signal.
- the estimated speaker information is supplied to the feature quantity mixing unit 101B.
- The feature amount mixing unit 101B replaces the speaker feature amount as appropriate. For example, when the speaker feature amount obtained from the vocal signal VSB is replaced with the speaker feature amount obtained from the vocal signal VSA, the voice quality (voice quality in the narrow sense) specified by the speaker feature amount is replaced with the voice quality of the singer corresponding to the vocal signal VSA.
- the speaker feature amount mixed by the feature amount mixing unit 101B is supplied to the decoder 103C.
- The decoder 103C generates singing voice data based on the feature amounts supplied from the feature amount mixing unit 103B and the speaker feature amount supplied from the feature amount mixing unit 101B.
- the generated singing voice data is reproduced from the speaker 105 .
- a singing voice in which a part of the karaoke user's voice quality is replaced with a part of the professional singer's voice quality is reproduced.
- Next, processing performed accompanying the voice quality conversion processing will be described. First, processing for realizing smooth voice quality conversion will be described. In applications such as karaoke, there is a demand to change one's own singing voice into the singing voice of the singer of the original song. Since the singing voice of singer A (the user) is changed to the voice quality of another singer (singer B, the singer of the original song) at the time of inference, this can be realized, for example, by replacing the speaker embedding of singer A with the speaker embedding of singer B.
- To make this change smooth, an interpolation function that smoothly varies from the speaker embedding of singer A to the speaker embedding of singer B is used. Here, α is a scalar variable that determines the amount of change and can also be determined by the user. Linear interpolation or spherical linear interpolation can be used as the interpolation function.
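- A sketch of such interpolation is shown below; `slerp` follows the standard spherical linear interpolation formula, and the embedding dimension of 32 is an arbitrary example.

```python
import numpy as np

def lerp(e_a, e_b, alpha):
    """Linear interpolation between two speaker embeddings."""
    return (1.0 - alpha) * e_a + alpha * e_b

def slerp(e_a, e_b, alpha, eps=1e-8):
    """Spherical linear interpolation between two speaker embeddings."""
    a = e_a / (np.linalg.norm(e_a) + eps)
    b = e_b / (np.linalg.norm(e_b) + eps)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < eps:                        # nearly identical directions
        return lerp(e_a, e_b, alpha)
    return (np.sin((1.0 - alpha) * omega) * e_a
            + np.sin(alpha * omega) * e_b) / np.sin(omega)

# alpha is the user-controllable amount of change from singer A toward singer B.
mixed_embedding = slerp(np.random.randn(32), np.random.randn(32), alpha=0.3)
```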
- Next, real-time processing will be described. Focusing on the fact that the singing in the original sound source and the singing of the user are in a parallel-data relationship, that is, they have the same utterance content (lyrics), this characteristic is used to enable high-quality conversion even with real-time processing. A specific example of processing for realizing such conversion will be described below.
- First, the encoder 103A and the decoder 103C of the voice quality conversion unit 103 are all configured as functions that do not use future information. When the encoder 103A and the decoder 103C are configured by a recurrent neural network (RNN) or a convolutional neural network (CNN), this can be realized by using a unidirectional RNN or causal convolutions that do not use future information.
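- A causal convolution that uses no future information can be realized, for example, by padding only on the left, as in the following PyTorch sketch (an illustration, not the disclosed network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only uses past and current frames (no future information)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
    def forward(self, x):                                 # x: (batch, channels, frames)
        return self.conv(F.pad(x, (self.pad, 0)))

x = torch.randn(1, 64, 200)
y = CausalConv1d(64)(x)        # y[:, :, t] depends only on x[:, :, :t+1]
```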
- Second, for karaoke voice quality conversion, we consider exploiting the parallel-data relationship at the time of inference and using only short-time inputs for estimating the speaker embedding.
- the short time is the duration of a singing voice containing one or a small number of phonemes, for example, from several hundred milliseconds to several seconds.
- In general, voice quality conversion between the same phonemes uttered by different speakers is relatively easy and can be performed with high quality. Therefore, by making the speaker embedding dependent on the phoneme, high-quality conversion is possible even with short-time information.
- Specifically, the encoder 103A and the decoder 103C are first trained with time-invariant speaker embeddings, the parameters of those models are frozen, and a speaker feature amount estimation unit that estimates time-varying speaker embeddings is then trained using those models. Accordingly, the speaker embedding used in this processing is treated as a time-varying feature amount.
- The objective function for training this speaker feature amount estimation unit can be expressed using the reconstruction loss, with the parameters of the encoder 103A and the decoder 103C fixed. The receptive field of the estimation unit is limited to the short time described above, and the estimation unit is obtained by minimizing the objective function.
- The speaker feature amount estimation unit F trained in this way is an estimator that obtains a speaker embedding dependent on the utterance content (phoneme) contained in the short-time input, and it enables high-quality conversion in real time based only on short-time information.
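- A sketch of this training setup, assuming encoder/decoder modules like those sketched earlier and a short 50-frame window as the limited receptive field, is shown below; the window length, the loss choice (MSE), and the interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step_local_estimator(encoder, decoder, local_F, frames, optimizer):
    """One training step for the short-time speaker feature amount estimator.
    encoder/decoder are pre-trained models (interfaces as in the earlier sketch);
    local_F sees only a short window of the input."""
    for p in list(encoder.parameters()) + list(decoder.parameters()):
        p.requires_grad_(False)                # parameters of 103A / 103C are fixed

    short_window = frames[:, -50:]             # receptive field limited to a short time
    spk_emb = local_F(short_window)            # phoneme-dependent speaker embedding
    recon = decoder(encoder(frames), spk_emb)  # reconstruct the input singing data x
    loss = F.mse_loss(recon, frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```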
- On the other hand, the method using the speaker feature amount estimation unit trained as described with reference to FIG. 4 may have higher temporal stability.
- Therefore, the speaker feature amount estimation unit 101A can include a speaker feature amount estimation unit that uses long-time information longer than a predetermined time (hereinafter referred to as a global feature amount estimation unit 121A as appropriate), a speaker feature amount estimation unit that uses short-time information shorter than the predetermined time (hereinafter referred to as a local (phoneme) feature amount estimation unit 121B as appropriate), and a feature amount combining unit 121C. The speaker feature amount can then be obtained using both the global feature amount estimation unit 121A and the local feature amount estimation unit 121B; the speaker feature amounts obtained from the two estimation units are combined by the feature amount combining unit 121C and used to obtain the final speaker embedding.
- A weighted linear combination, a spherical linear combination, or the like can be used for the combination, and the combination weight parameter can be obtained from the input duration, the input signal, or the like.
- For example, the final speaker embedding can be obtained as such a combination, where T is the input length after the conversion is started. The weight α can be obtained depending only on T, can be obtained from the input x using a neural network as in α(x), or can be obtained using the information of either T or x.
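- One possible way to combine the two estimates with a weight that depends only on T is sketched below; the schedule (and whether the weight favors the global or the local estimate as T grows) is an illustrative assumption, not a value from the disclosure.

```python
import numpy as np

def combine_speaker_embeddings(e_global, e_local, T, T_full=500):
    """Weighted combination of long-time (global) and short-time (local) speaker embeddings.
    T is the input length since conversion started; here alpha depends only on T and
    shifts weight toward the global estimate as more input accumulates (one possible choice)."""
    alpha = min(T / T_full, 1.0)
    return (1.0 - alpha) * e_local + alpha * e_global
```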
- the real-time processing described above is based on the premise (parallel data assumption) that the singing content included in the original song matches the singing content of the user at the time of inference.
- However, the user may sing incorrectly, so this premise does not necessarily hold.
- If the speaker embedding is obtained by the above-described method using only short-time inputs while greatly different phonemes are being sung, the conversion quality may be greatly degraded.
- Therefore, a similarity calculation unit 103D calculates the similarity between the content embeddings of the target singer and the original singer. The calculation result of the similarity calculation unit 103D is supplied to the speaker feature amount estimation unit 101A.
- In accordance with the degree of similarity, the speaker feature amount estimation unit 101A changes the combination coefficient between the global feature amount and the local feature amount (the weights for the respective feature amounts) used when estimating the speaker feature amount, as well as the mixing weights of the other feature amounts. Specifically, when the degree of similarity is low, the utterance contents differ, so the combination weight for the speaker feature amount based on short-time information is decreased to lower the dependence on it; in other words, the processing result of the global feature amount estimation unit 121A is mainly used. In addition, in the mixing of the other feature amounts, excessive conversion is suppressed by increasing the weight of the feature amounts of the original speaker, thereby suppressing significant deterioration in sound quality.
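- A sketch of such similarity-dependent weighting is shown below; the specific weighting formulas and constants are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def content_similarity(c_target, c_user):
    """Cosine similarity between the content embeddings of the target and user singing."""
    num = float(np.dot(c_target, c_user))
    den = float(np.linalg.norm(c_target) * np.linalg.norm(c_user)) + 1e-8
    return max(0.0, num / den)

def similarity_gated_weights(similarity, w_local_max=0.7, w_orig_min=0.2):
    """Adjust mixing weights according to content-embedding similarity (0..1).
    Low similarity (different lyrics being sung) -> rely less on the short-time
    speaker estimate and keep more of the original speaker's features."""
    w_local = w_local_max * similarity          # weight on the local (short-time) estimate
    w_global = 1.0 - w_local                    # weight on the global estimate
    w_original = w_orig_min + (1.0 - similarity) * (1.0 - w_orig_min)
    return w_global, w_local, w_original
```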
- the singing voice of the target speaker is a voice that has undergone sound source separation, and contains noise that accompanies this separation. Therefore, the estimation accuracy of each embedding deteriorates due to noise, and the sound quality of the converted speech tends to contain noise. In order to prevent this, a method of constructing a robust system against sound source separation noise will be described.
- Robustness to sound source separation noise can be realized by applying a constraint during the training of the encoder, the decoder, and the speaker feature amount estimation unit so that the embedding vectors extracted from the source-separated speech and from the original clean speech become the same. Specifically, if the clean speech signal is x, the accompaniment signal is b, and the sound source separator is h( ), a regularization term is added to the learning objective function.
- Here, E is an encoder or a feature amount extractor. By computing the reconstruction loss function using only clean speech, the encoder 103A can be trained so that the feature amount extraction result for the separated speech matches that for the clean speech, while the output of the decoder 103C is kept clean.
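- A plausible form of this regularization term, read from the surrounding text rather than reproduced from the publication, is:

$$\mathcal{L}_{\mathrm{reg}} = \big\lVert E\big(h(x+b)\big) - E(x) \big\rVert_2^2$$

which would be added to the training objective with a suitable weight; the exact norm and weighting used in the publication may differ.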
- The smartphone 100 may perform all of the processing described in the embodiment, or some of the processing may be performed by a device other than the smartphone 100, such as a server.
- the sound source separation processing and speaker feature amount estimation processing may be performed by a server, and the voice quality conversion processing and reproduction processing may be performed by a smart phone.
- the sound source separation processing may be performed by the server, and the voice quality conversion processing, reproduction processing, and speaker feature amount estimation processing may be performed by the smart phone. Processing results are transmitted and received via a network between the server and the smartphone.
- the present disclosure can also be implemented in any form such as a device, method, program, system, and the like.
- The present disclosure can also be implemented as a program that performs the functions described in the above embodiment; by downloading and installing the program in a device that does not have those functions, the device can perform the control described in the embodiment.
- the present disclosure can also be implemented by a server that distributes such programs.
- the items described in each embodiment and modifications can be combined as appropriate.
- the contents of the present disclosure should not be construed as being limited by the effects exemplified herein.
- the present disclosure can also adopt the following configurations.
- (1) An information processing apparatus having a voice quality conversion unit that performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
- (2) The information processing apparatus according to (1), wherein a first vocal signal is separated from the mixed sound signal by the sound source separation, a collected second vocal signal is input to the voice quality conversion unit, and the voice quality conversion unit brings one of the first vocal signal and the second vocal signal closer to the other vocal signal.
- (3) The information processing apparatus according to (2), wherein an amount of change by which one of the vocal signals is brought closer to the other vocal signal is settable.
- (4) The information processing apparatus according to (2), further comprising a speaker feature amount estimation unit that estimates a feature amount related to a speaker, wherein the voice quality conversion unit includes an encoder and a decoder.
- (5) The information processing apparatus according to (4), wherein the feature amount related to the speaker is a feature amount corresponding to a feature that does not change over time, the encoder extracts, from an input vocal signal, feature amounts corresponding to features that change over time, and the decoder generates a vocal signal based on the feature amount estimated by the speaker feature amount estimation unit and the feature amounts extracted by the encoder.
- (6) The information processing apparatus according to (5), wherein the feature amount corresponding to the feature that does not change over time is speaker information, and the feature amounts corresponding to the features that change over time include at least one of pitch information, volume information, and utterance information.
- (7) The information processing apparatus according to (6), wherein the feature amounts are defined by embedding vectors.
- (8) The information processing apparatus according to (7), wherein the encoder extracts the embedding vectors of the feature amounts corresponding to the time-varying features using a learning model obtained by learning to obtain an embedding vector from a feature amount that reflects only a specific feature or by learning to extract only a specific feature from a vocal signal.
- (9) The information processing apparatus according to any one of (6) to (8), wherein the speaker feature amount estimation unit estimates the feature amount of the speaker using a learning model obtained by learning to estimate speaker information of a predetermined speaker based on the vocal signal of that speaker.
- (10) The information processing apparatus according to any one of (6) to (8), wherein the speaker feature amount estimation unit estimates the feature amount of the speaker using a learning model obtained by learning to estimate speaker information of the speaker based on a predetermined vocal signal.
- (11) The information processing apparatus according to any one of (4) to (10), wherein the speaker feature amount estimation unit includes a first speaker feature amount estimation unit and a second speaker feature amount estimation unit, and the information processing apparatus has a feature amount combining unit that combines the feature amount related to the speaker estimated by the first speaker feature amount estimation unit and the feature amount related to the speaker estimated by the second speaker feature amount estimation unit.
- (12) The information processing apparatus according to (11), wherein the first speaker feature amount estimation unit estimates the feature amount related to the speaker based on a vocal signal of a predetermined time or longer, and the second speaker feature amount estimation unit estimates the feature amount related to the speaker based on a vocal signal shorter than the predetermined time.
- (13) The information processing apparatus according to (11), wherein a combination coefficient in the feature amount combining unit is changed according to the degree of similarity between the first vocal signal and the second vocal signal.
- (14) The information processing apparatus according to (13), wherein the combination coefficient is a weighting for each of the feature amount related to the speaker estimated by the first speaker feature amount estimation unit and the feature amount related to the speaker estimated by the second speaker feature amount estimation unit.
- (15) An information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
- (16) A program that causes a computer to execute an information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
- Reference numerals: 100 … smartphone; 102 … sound source separation unit; 101A … speaker feature amount estimation unit; 101B … speaker feature amount mixing unit; 103 … voice quality conversion unit; 103A … encoder; 103C … decoder; 103D … similarity calculation unit; 121A … global feature amount estimation unit; 121B … local feature amount estimation unit
Landscapes
- Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
Abstract
Description
An information processing apparatus including a voice quality conversion unit that performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
An information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
A program that causes a computer to execute an information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
<Background of the present disclosure>
<One embodiment>
<Modifications>
The embodiments and the like described below are preferred specific examples of the present disclosure, and the contents of the present disclosure are not limited to these embodiments and the like.
First, to facilitate understanding of the present disclosure, its background will be described. In recent years in karaoke, instead of using a pre-created MIDI (Musical Instrument Digital Interface) sound source or a recorded sound source as the accompaniment, it has become increasingly common to separate an original sound source containing vocals into a vocal signal and an accompaniment signal and to use the separated accompaniment signal.
[Overview of one embodiment]
First, an overview of one embodiment will be described with reference to FIG. 1. A sound source separation process PA is performed on the mixed sound source shown in FIG. 1. The mixed sound source can be provided on a recording medium such as a CD (Compact Disc) or by distribution via a network. The mixed sound source includes, for example, an artist's vocal signal (an example of a first vocal signal, hereinafter also referred to as a vocal signal VSA as appropriate). The mixed sound source also includes signals other than the vocal signal VSA (such as musical instrument sounds, hereinafter also referred to as accompaniment signals as appropriate).
(Example of overall configuration)
FIG. 2 is a block diagram showing a configuration example of an information processing apparatus according to one embodiment. An example of the information processing apparatus according to this embodiment is a smartphone (smartphone 100). Using the smartphone 100, a user can easily perform karaoke in which voice quality can be changed. In this embodiment, karaoke, that is, singing, is described as an example; however, the present disclosure is not limited to singing and can also be applied to voice quality conversion processing for utterances such as conversation. The information processing apparatus according to the present disclosure is applicable not only to smartphones but also to portable electronic devices such as smartwatches, as well as to personal computers, stationary karaoke machines, and the like.
FIG. 3 is a block diagram showing a configuration example of the voice quality conversion unit 103. The voice quality conversion unit 103 has an encoder 103A, a feature amount mixing unit 103B, and a decoder 103C. The encoder 103A extracts feature amounts from a vocal signal using a learning model obtained by predetermined learning. The feature amounts extracted by the encoder 103A are, for example, feature amounts that change over time as the song progresses, and specifically include at least one of pitch information, volume information, and utterance (lyrics) information.
Next, an example of the learning method performed by the voice quality conversion unit 103 will be described with reference to FIG. 4. In FIG. 4, illustration of the feature amount mixing unit 103B and the feature amount mixing unit 101B of the voice quality conversion unit 103 is omitted.
The features defined by the embedding vectors are appropriately referred to as the speaker embedding, pitch embedding, volume embedding, and content embedding.
By replacing only the speaker embedding with that of another person, the voice quality (voice quality in the narrow sense, which does not include pitch and the like) can be converted while the pitch, volume, and utterance content are maintained. As methods of obtaining embedding vectors that separate features in this way, there are a method of obtaining an embedding from a feature amount that reflects only a specific feature, and a method of training an encoder that extracts only a specific feature from data (a predetermined vocal signal).
For example, there are methods such as obtaining the pitch embedding from the fundamental frequency f0 extracted by a fundamental frequency extractor, obtaining the volume embedding from the average power p, obtaining the speaker embedding from the speaker label n, and obtaining the content embedding from features obtained by automatic speech recognition (ASR).
The pitch embedding, the volume embedding, and the speaker embedding can each be obtained by adversarial learning. The content embedding, for which accurate labels are difficult to obtain, is obtained by learning from data.
The encoder that extracts the content embedding can be trained by adding a loss function that uses a critic, which estimates the other feature amounts from the content embedding, to the loss function for reconstructing the input.
In the above formula, one term denotes the loss function for training the encoder 103A and the decoder 103C, another term denotes the loss function for the critic, and another symbol is a weight parameter; the remaining symbols are the parameters of the encoder 103A and the decoder 103C and the parameters of the critic, respectively.
By vector-quantizing the output of the encoder that extracts the content embedding from the input singing voice data x and thereby compressing the information, the content embedding can be induced to retain only information that is not contained in the other information given to the decoder. Here, sg() is a gradient-stop operator that prevents the gradient information of the neural network from being propagated to subsequent layers, and V() is a vector quantization operation.
Various forms are conceivable for the reconstruction loss function, depending on the types of decoder and encoder. For example, in the case of a variational autoencoder (VAE) or a vector-quantized VAE, the variational lower bound (ELBO) can be used; in the case of a generative adversarial network, the loss can be expressed as a weighted sum (the formula below) of the reconstruction error between the input and the output and an adversarial loss.
The method of obtaining the speaker embedding, which determines the voice quality, has been described above. However, with this method the target singer must be included in the training data in advance, and voice quality conversion cannot be performed for an arbitrary singer (an unknown speaker). Therefore, methods of obtaining the speaker embedding from a speech signal will be described; for example, the following two methods are conceivable.
In the first method, the speaker information of a predetermined speaker (for example, a speaker whose singing voice data has characteristics similar to those of the singer to be converted to) is estimated based on a vocal signal. A speaker feature amount estimation unit F( ) that estimates, from the singing sound of speaker n, the speaker embedding learned using the speaker label n is trained. F can be constructed with a neural network or the like and is trained so as to minimize the distance to that speaker embedding; an Lp norm can be used as the distance.
In the second method, a speaker feature amount estimation unit G( ) that extracts a speaker embedding from a singing sound is trained prior to the training of the voice quality conversion unit 103. G can be trained by minimizing the following objective function L using singer-labeled singing sound data of multiple singers.
Here, K(x, y) is the cosine distance between x and y, and the objective uses another singing voice by the same singer n and a singing voice by a different singer m (m ≠ n).
Using G trained in this way, the speaker embedding is obtained as follows and used for the training of the voice quality conversion unit 103.
Voice quality conversion is performed by the voice quality conversion unit 103 trained as described above. The voice quality conversion processing performed by the smartphone 100 will be described with reference to FIG. 5.
Next, processing performed accompanying the voice quality conversion processing will be described. First, processing for realizing smooth voice quality conversion will be described. In applications such as karaoke, there is a demand to change one's own singing voice into the singing voice of the singer of the original song. Because the singing voice of singer A (the user) is changed to the voice quality of another singer (the singer of the original song) at the time of inference (when the voice quality conversion processing is executed), this can be realized, for example, by replacing the speaker embedding of singer A with the speaker embedding of singer B.
To make this change smooth, an interpolation function that smoothly varies from the speaker embedding of singer A to the speaker embedding of singer B is used. α is a scalar variable that determines the amount of change and can also be determined by the user. Linear interpolation or spherical linear interpolation can be used as the interpolation function.
Next, real-time processing will be described. The encoder 103A and the decoder 103C are first trained with time-invariant speaker embeddings, the parameters of those models are frozen, and a speaker feature amount estimation unit that estimates time-varying speaker embeddings is then trained using those models. Accordingly, the speaker embedding used in this processing is treated as a time-varying feature amount.
The objective function for this training can be expressed accordingly. Note that the parameters of the encoder 103A and the decoder 103C are fixed. The receptive field of the estimation unit is limited to the short time described above, and the estimation unit is obtained by minimizing the objective function.
The final speaker embedding can be obtained as follows, where T is the input length after the conversion is started. α can be obtained depending only on T as shown below, can be obtained from the input x using a neural network as in α(x), or can be obtained using the information of either T or x.
The similarity calculation unit 103D calculates the similarity between the content embeddings of the target singer and the original singer. The calculation result of the similarity calculation unit 103D is supplied to the speaker feature amount estimation unit 101A.
The regularization term described above is added to the learning objective function.
Here, E is an encoder or a feature amount extractor. By computing the reconstruction loss function using only clean speech, the encoder 103A can be trained so that the feature amount extraction result for the separated speech matches that for the clean speech, while the output of the decoder 103C is kept clean.
Although one embodiment of the present disclosure has been described above, the present disclosure is not limited to the above-described embodiment, and various modifications are possible without departing from the spirit of the present disclosure.
(1) An information processing apparatus having a voice quality conversion unit that performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
(2) The information processing apparatus according to (1), wherein a first vocal signal is separated from the mixed sound signal by the sound source separation, a collected second vocal signal is input to the voice quality conversion unit, and the voice quality conversion unit brings one of the first vocal signal and the second vocal signal closer to the other vocal signal.
(3) The information processing apparatus according to (2), wherein an amount of change by which one of the vocal signals is brought closer to the other vocal signal is settable.
(4) The information processing apparatus according to (2), further comprising a speaker feature amount estimation unit that estimates a feature amount related to a speaker, wherein the voice quality conversion unit includes an encoder and a decoder.
(5) The information processing apparatus according to (4), wherein the feature amount related to the speaker is a feature amount corresponding to a feature that does not change over time, the encoder extracts, from an input vocal signal, feature amounts corresponding to features that change over time, and the decoder generates a vocal signal based on the feature amount estimated by the speaker feature amount estimation unit and the feature amounts extracted by the encoder.
(6) The information processing apparatus according to (5), wherein the feature amount corresponding to the feature that does not change over time is speaker information, and the feature amounts corresponding to the features that change over time include at least one of pitch information, volume information, and utterance information.
(7) The information processing apparatus according to (6), wherein the feature amounts are defined by embedding vectors.
(8) The information processing apparatus according to (7), wherein the encoder extracts the embedding vectors of the feature amounts corresponding to the time-varying features using a learning model obtained by learning to obtain an embedding vector from a feature amount that reflects only a specific feature or by learning to extract only a specific feature from a vocal signal.
(9) The information processing apparatus according to any one of (6) to (8), wherein the speaker feature amount estimation unit estimates the feature amount of the speaker using a learning model obtained by learning to estimate speaker information of a predetermined speaker based on the vocal signal of that speaker.
(10) The information processing apparatus according to any one of (6) to (8), wherein the speaker feature amount estimation unit estimates the feature amount of the speaker using a learning model obtained by learning to estimate speaker information of the speaker based on a predetermined vocal signal.
(11) The information processing apparatus according to any one of (4) to (10), wherein the speaker feature amount estimation unit includes a first speaker feature amount estimation unit and a second speaker feature amount estimation unit, and the information processing apparatus has a feature amount combining unit that combines the feature amount related to the speaker estimated by the first speaker feature amount estimation unit and the feature amount related to the speaker estimated by the second speaker feature amount estimation unit.
(12) The information processing apparatus according to (11), wherein the first speaker feature amount estimation unit estimates the feature amount related to the speaker based on a vocal signal of a predetermined time or longer, and the second speaker feature amount estimation unit estimates the feature amount related to the speaker based on a vocal signal shorter than the predetermined time.
(13) The information processing apparatus according to (11), wherein a combination coefficient in the feature amount combining unit is changed according to the degree of similarity between the first vocal signal and the second vocal signal.
(14) The information processing apparatus according to (13), wherein the combination coefficient is a weighting for each of the feature amount related to the speaker estimated by the first speaker feature amount estimation unit and the feature amount related to the speaker estimated by the second speaker feature amount estimation unit.
(15) An information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
(16) A program that causes a computer to execute an information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
102: sound source separation unit
101A: speaker feature amount estimation unit
101B: speaker feature amount mixing unit
103: voice quality conversion unit
103A: encoder
103C: decoder
103D: similarity calculation unit
121A: global feature amount estimation unit
121B: local feature amount estimation unit
Claims (16)
- An information processing apparatus having a voice quality conversion unit that performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
- The information processing apparatus according to claim 1, wherein a first vocal signal is separated from the mixed sound signal by the sound source separation, a collected second vocal signal is input to the voice quality conversion unit, and the voice quality conversion unit brings one of the first vocal signal and the second vocal signal closer to the other vocal signal.
- The information processing apparatus according to claim 2, wherein an amount of change by which one of the vocal signals is brought closer to the other vocal signal is settable.
- The information processing apparatus according to claim 2, further comprising a speaker feature amount estimation unit that estimates a feature amount related to a speaker, wherein the voice quality conversion unit includes an encoder and a decoder.
- The information processing apparatus according to claim 4, wherein the feature amount related to the speaker is a feature amount corresponding to a feature that does not change over time, the encoder extracts, from an input vocal signal, feature amounts corresponding to features that change over time, and the decoder generates a vocal signal based on the feature amount estimated by the speaker feature amount estimation unit and the feature amounts extracted by the encoder.
- The information processing apparatus according to claim 5, wherein the feature amount corresponding to the feature that does not change over time is speaker information, and the feature amounts corresponding to the features that change over time include at least one of pitch information, volume information, and utterance information.
- The information processing apparatus according to claim 6, wherein the feature amounts are defined by embedding vectors.
- The information processing apparatus according to claim 7, wherein the encoder extracts the embedding vectors of the feature amounts corresponding to the time-varying features using a learning model obtained by learning to obtain an embedding vector from a feature amount that reflects only a specific feature or by learning to extract only a specific feature from a vocal signal.
- The information processing apparatus according to claim 6, wherein the speaker feature amount estimation unit estimates the feature amount of the speaker using a learning model obtained by learning to estimate speaker information of a predetermined speaker based on the vocal signal of that speaker.
- The information processing apparatus according to claim 6, wherein the speaker feature amount estimation unit estimates the feature amount of the speaker using a learning model obtained by learning to estimate speaker information of the speaker based on a predetermined vocal signal.
- The information processing apparatus according to claim 4, wherein the speaker feature amount estimation unit includes a first speaker feature amount estimation unit and a second speaker feature amount estimation unit, and the information processing apparatus has a feature amount combining unit that combines the feature amount related to the speaker estimated by the first speaker feature amount estimation unit and the feature amount related to the speaker estimated by the second speaker feature amount estimation unit.
- The information processing apparatus according to claim 11, wherein the first speaker feature amount estimation unit estimates the feature amount related to the speaker based on a vocal signal of a predetermined time or longer, and the second speaker feature amount estimation unit estimates the feature amount related to the speaker based on a vocal signal shorter than the predetermined time.
- The information processing apparatus according to claim 11, wherein a combination coefficient in the feature amount combining unit is changed according to the degree of similarity between the first vocal signal and the second vocal signal.
- The information processing apparatus according to claim 13, wherein the combination coefficient is a weighting for each of the feature amount related to the speaker estimated by the first speaker feature amount estimation unit and the feature amount related to the speaker estimated by the second speaker feature amount estimation unit.
- An information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
- A program that causes a computer to execute an information processing method in which a voice quality conversion unit performs sound source separation of a mixed sound signal into a vocal signal and an accompaniment signal and performs voice quality conversion using the result of the sound source separation.
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202280045017.1A CN117561570A (zh) | 2021-06-29 | 2022-02-09 | 信息处理装置、信息处理方法和程序 |
| EP22832402.6A EP4365891A4 (en) | 2021-06-29 | 2022-02-09 | DEVICE AND METHOD FOR PROCESSING INFORMATION, AND ASSOCIATED PROGRAM |
| JP2023531371A JPWO2023276234A1 (ja) | 2021-06-29 | 2022-02-09 | |
| US18/571,738 US20240135945A1 (en) | 2021-06-29 | 2022-02-09 | Information processing apparatus, information processing method, and program |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021-107651 | 2021-06-29 | ||
| JP2021107651 | 2021-06-29 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023276234A1 true WO2023276234A1 (ja) | 2023-01-05 |
Family
ID=84691116
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2022/005001 Ceased WO2023276234A1 (ja) | 2021-06-29 | 2022-02-09 | 情報処理装置、情報処理方法およびプログラム |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20240135945A1 (ja) |
| EP (1) | EP4365891A4 (ja) |
| JP (1) | JPWO2023276234A1 (ja) |
| CN (1) | CN117561570A (ja) |
| WO (1) | WO2023276234A1 (ja) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024202977A1 (ja) * | 2023-03-24 | 2024-10-03 | ヤマハ株式会社 | 音変換方法およびプログラム |
| WO2024202975A1 (ja) * | 2023-03-24 | 2024-10-03 | ヤマハ株式会社 | 音変換方法およびプログラム |
| WO2025100323A1 (ja) * | 2023-11-08 | 2025-05-15 | ヤマハ株式会社 | 情報処理システム、情報処理方法及びプログラム |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001117598A (ja) * | 1999-10-21 | 2001-04-27 | Yamaha Corp | 音声変換装置及び方法 |
| JP2018005048A (ja) | 2016-07-05 | 2018-01-11 | クリムゾンテクノロジー株式会社 | 声質変換システム |
| WO2019116889A1 (ja) * | 2017-12-12 | 2019-06-20 | ソニー株式会社 | 信号処理装置および方法、学習装置および方法、並びにプログラム |
| KR20200065248A (ko) * | 2018-11-30 | 2020-06-09 | 한국과학기술원 | 음원의 가수 목소리를 사용자의 음색으로 변환하는 시스템 및 방법 |
| CN113781993A (zh) * | 2021-01-20 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | 定制音色歌声的合成方法、装置、电子设备和存储介质 |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101578659B (zh) * | 2007-05-14 | 2012-01-18 | 松下电器产业株式会社 | 音质转换装置及音质转换方法 |
| JP4296231B2 (ja) * | 2007-06-06 | 2009-07-15 | パナソニック株式会社 | 声質編集装置および声質編集方法 |
| CN101627427B (zh) * | 2007-10-01 | 2012-07-04 | 松下电器产业株式会社 | 声音强调装置及声音强调方法 |
| JP5510852B2 (ja) * | 2010-07-20 | 2014-06-04 | 独立行政法人産業技術総合研究所 | 声色変化反映歌声合成システム及び声色変化反映歌声合成方法 |
| JP5961950B2 (ja) * | 2010-09-15 | 2016-08-03 | ヤマハ株式会社 | 音声処理装置 |
| WO2013008471A1 (ja) * | 2011-07-14 | 2013-01-17 | パナソニック株式会社 | 声質変換システム、声質変換装置及びその方法、声道情報生成装置及びその方法 |
| EP2930714B1 (en) * | 2012-12-04 | 2018-09-05 | National Institute of Advanced Industrial Science and Technology | Singing voice synthesizing system and singing voice synthesizing method |
| KR20200027475A (ko) * | 2017-05-24 | 2020-03-12 | 모듈레이트, 인크 | 음성 대 음성 변환을 위한 시스템 및 방법 |
| TWI742486B (zh) * | 2019-12-16 | 2021-10-11 | 宏正自動科技股份有限公司 | 輔助歌唱系統、輔助歌唱方法及其非暫態電腦可讀取記錄媒體 |
| US11257480B2 (en) * | 2020-03-03 | 2022-02-22 | Tencent America LLC | Unsupervised singing voice conversion with pitch adversarial network |
-
2022
- 2022-02-09 EP EP22832402.6A patent/EP4365891A4/en active Pending
- 2022-02-09 US US18/571,738 patent/US20240135945A1/en active Pending
- 2022-02-09 JP JP2023531371A patent/JPWO2023276234A1/ja active Pending
- 2022-02-09 WO PCT/JP2022/005001 patent/WO2023276234A1/ja not_active Ceased
- 2022-02-09 CN CN202280045017.1A patent/CN117561570A/zh active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001117598A (ja) * | 1999-10-21 | 2001-04-27 | Yamaha Corp | 音声変換装置及び方法 |
| JP2018005048A (ja) | 2016-07-05 | 2018-01-11 | クリムゾンテクノロジー株式会社 | 声質変換システム |
| WO2019116889A1 (ja) * | 2017-12-12 | 2019-06-20 | ソニー株式会社 | 信号処理装置および方法、学習装置および方法、並びにプログラム |
| KR20200065248A (ko) * | 2018-11-30 | 2020-06-09 | 한국과학기술원 | 음원의 가수 목소리를 사용자의 음색으로 변환하는 시스템 및 방법 |
| CN113781993A (zh) * | 2021-01-20 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | 定制音色歌声的合成方法、装置、电子设备和存储介质 |
Non-Patent Citations (3)
| Title |
|---|
| DENG CHENGQI, YU CHENGZHU, LU HENG, WENG CHAO, YU DONG: "Pitchnet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network", ARXIV.ORG, 18 February 2020 (2020-02-18), pages 1 - 5, XP055855382, Retrieved from the Internet <URL:https://arxiv.org/pdf/1912.01852.pdf> [retrieved on 20211027], DOI: 10.1109/ICASSP40776.2020.9054199 * |
| See also references of EP4365891A4 |
| TOMOYA YAMADA, SHOGO SEKI, KAZUHIRO KOBAYASHI, TOMOKI TODA: "Singing Voice Processing in Songs Based on Singing Voice Separation and Statistical Singing Voice Quality Conversion", IPSJ SIG TECHNICAL REPORT, INFORMATION PROCESSING SOCIETY OF JAPAN, JP, vol. 2017, no. 30, 1 June 2017 (2017-06-01), JP , XP093020417, ISSN: 2188-8752 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024202977A1 (ja) * | 2023-03-24 | 2024-10-03 | ヤマハ株式会社 | 音変換方法およびプログラム |
| WO2024202975A1 (ja) * | 2023-03-24 | 2024-10-03 | ヤマハ株式会社 | 音変換方法およびプログラム |
| WO2025100323A1 (ja) * | 2023-11-08 | 2025-05-15 | ヤマハ株式会社 | 情報処理システム、情報処理方法及びプログラム |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117561570A (zh) | 2024-02-13 |
| JPWO2023276234A1 (ja) | 2023-01-05 |
| US20240135945A1 (en) | 2024-04-25 |
| EP4365891A4 (en) | 2024-07-17 |
| EP4365891A1 (en) | 2024-05-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Valle et al. | Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis | |
| Mor et al. | A universal music translation network | |
| CN114242033B (zh) | 语音合成方法、装置、设备、存储介质及程序产品 | |
| Wang et al. | PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network | |
| AU2023337867B2 (en) | Generating audio using auto-regressive generative neural networks | |
| CN110211556B (zh) | 音乐文件的处理方法、装置、终端及存储介质 | |
| CN111899720A (zh) | 用于生成音频的方法、装置、设备和介质 | |
| WO2023276234A1 (ja) | 情報処理装置、情報処理方法およびプログラム | |
| CN112970058A (zh) | 信息处理方法及信息处理系统 | |
| KR102272554B1 (ko) | 텍스트- 다중 음성 변환 방법 및 시스템 | |
| JP6737320B2 (ja) | 音響処理方法、音響処理システムおよびプログラム | |
| GB2582952A (en) | Audio contribution identification system and method | |
| Sarroff et al. | Blind arbitrary reverb matching | |
| Huang et al. | Musical timbre style transfer with diffusion model | |
| CN116438599A (zh) | 通过标准arm嵌入式平台上的卷积神经网络嵌入式语音指纹进行人声轨道去除 | |
| CN116013248A (zh) | 说唱音频生成方法、装置、设备和可读存储介质 | |
| CN120279868A (zh) | 音乐生成方法、音乐生成装置、电子设备和存储介质 | |
| CN113781989A (zh) | 一种音频的动画播放、节奏卡点识别方法及相关装置 | |
| CN119339691A (zh) | 音乐生成方法、装置、电子设备及存储介质 | |
| CN119993114A (zh) | 基于多模态风格嵌入的语音合成方法、装置、设备及介质 | |
| WO2021251364A1 (ja) | 音響処理方法、音響処理システムおよびプログラム | |
| CN118942469A (zh) | 音频分轨提取方法、装置、介质和计算设备 | |
| US20240105203A1 (en) | Enhanced audio file generator | |
| CN115331648A (zh) | 音频数据处理方法、装置、设备、存储介质及产品 | |
| Sarkar | Time-domain music source separation for choirs and ensembles |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22832402 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023531371 Country of ref document: JP |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 18571738 Country of ref document: US |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 202280045017.1 Country of ref document: CN |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2022832402 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2022832402 Country of ref document: EP Effective date: 20240129 |