WO2021234873A1 - Sound source separation model learning device, sound source separation device, sound source separation model learning method, and program - Google Patents
Sound source separation model learning device, sound source separation device, sound source separation model learning method, and program
- Publication number
- WO2021234873A1 (PCT/JP2020/019997)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound source
- spectrogram
- unit
- template
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Definitions
- the present invention relates to a sound source separation model learning device, a sound source separation device, a sound source separation model learning method, and a program.
- a sound source separation technology that separates the signal of each sound source from the monaural mixed sound signal of multiple sound sources.
- a technique proposed based on the idea of a class identification problem that identifies which speaker's energy is dominant at each time frequency point of the spectrogram of the observed signal.
- a technique using a machine learning method has been proposed based on the idea of such a class identification problem.
- a sound source separation technique using a machine learning method for example, a sound source separation technique using a neural network (NN) has been proposed (see Non-Patent Documents 1 and 2).
- a sound source separation technique using a neural network for example, a sound source separation technique using a deep clustering (DC) method (see Non-Patent Documents 3 and 4) has been proposed.
- DC deep clustering
- the time frequency point is a point in the space spanned by the time axis and the frequency axis (the time frequency space), that is, an element included in the time frequency space.
- Each time frequency point indicates an N-dimensional feature quantity vector for each time and frequency indicated by the position of each time frequency point in the time frequency space (N is an integer of 2 or more).
- the feature quantity vector is a set of information, among the information obtained from the analysis target, that satisfies a predetermined condition obtained through learning or the like.
- Learning the low-dimensional embedded representation means learning a mapping that transforms an N-dimensional feature vector into a feature vector with dimensions less than N.
- in the sound source separation technique using the DC method, sound source separation is performed by clustering the obtained embedded vectors using an unsupervised clustering method such as the k-means method.
- the embedded vector is a feature vector having a dimension less than N at each time frequency point. It has been experimentally shown that the sound source separation technique using the DC method is capable of highly accurate separation even for mixed voices of unknown sound sources.
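- As an illustration of the clustering step in the DC method described above, the following is a minimal sketch assuming that an embedding vector has already been computed for every time frequency point by a trained embedding network; the shapes, the library calls, and the two-source setting are assumptions of this example, not part of the present disclosure:

```python
import numpy as np
from sklearn.cluster import KMeans

F, N, EMB_DIM, NUM_SOURCES = 257, 100, 20, 2

# One embedding vector per time frequency point (F * N points in total);
# random values stand in for the output of a trained embedding network.
embeddings = np.random.rand(F * N, EMB_DIM)

# Unsupervised clustering of the embedding vectors (here the k-means method).
labels = KMeans(n_clusters=NUM_SOURCES, n_init=10).fit_predict(embeddings)

# Each cluster becomes a binary mask over the (F x N) spectrogram.
masks = [(labels == d).reshape(F, N).astype(float) for d in range(NUM_SOURCES)]
```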
- the trained model is a map obtained by learning and is a map for performing sound source separation.
- Interpreting a trained model means knowing the basis for the predicted results of the trained model. For example, in the case of the DC method, it may be difficult for the user to determine the basis for determining the embedded vector.
- if such interpretation becomes possible, the understanding of the DC method will be deepened, and further improvement of the sound source separation technology can be expected, such as improvement of generalization performance and adaptation to sound sources other than speakers.
- the sound source separation technology will be greatly improved if the user can visualize what kind of spectrogram structure is specifically used as a clue when determining the embedded vector.
- One aspect of the present invention is a sound source separation model learning device including: a learning data acquisition unit that acquires a spectrogram of a mixed signal in which a plurality of sounds are mixed, and dominant sound source information indicating whether or not a target sound source is dominant at each time frequency point of the spectrogram; a weight estimation unit that estimates the weight used for estimating a composite product using a template, the template being information representing one or more values related to the spectrogram, namely one or more values at the time frequency points belonging to one section obtained by dividing the spectrogram in the time axis direction; a dominant sound source information estimation unit that acquires the estimation result of the dominant sound source information based on the composite product; and a loss acquisition unit that acquires the difference between the estimation result and the dominant sound source information. The composite product estimated using the template and the weight indicates an estimation result regarding the spectrogram of the target sound source, and the weight estimation unit learns a machine learning model that estimates the weight so as to reduce the difference.
- the present invention makes it possible to facilitate the interpretation of a trained model that separates sound sources.
- the flowchart which shows an example of the flow of the process executed by the sound source separation model learning device 1 in the embodiment.
- the figure which shows the second result of the separation experiment in the embodiment.
- the figure which shows the third result of the separation experiment in the embodiment.
- the figure which shows the fourth result of the separation experiment in the embodiment.
- the figure which shows the fifth result of the separation experiment in the embodiment.
- the figure which shows the sixth result of the separation experiment in the embodiment.
- the figure which shows the seventh result of the separation experiment in the embodiment.
- FIG. 1 is an explanatory diagram illustrating an outline of the sound source separation system 100 of the embodiment.
- for the sake of simplicity, the sound source separation system 100 will be described below by taking an audio signal as an example of the signal handled by the processing of the sound source separation system 100.
- however, the signal to be processed by the sound source separation system 100 may be any sound signal.
- the signal to be processed by the sound source separation system 100 may be a signal of the sound of a musical instrument.
- the sound source is a monaural sound source.
- the sound source separation system 100 separates the non-mixed sound signal from the mixed sound signal to be separated.
- the mixed sound signal is a sound signal in which a plurality of non-mixed sound signals are mixed. Different non-mixed sound signals are signals with different sound sources.
- the mixed sound signal is, for example, a voice signal in which the voice emitted by the first person is mixed with the voice emitted by the second person.
- the sound source separation system 100 separates the voice signal emitted by the first person and the voice signal emitted by the second person.
- the voice signal emitted by the first person and the voice signal emitted by the second person are examples of non-mixed sound signals.
- the number of non-mixed sound signals separated by the sound source separation system 100 may be one or a plurality.
- the sound source separation system 100 includes a sound source separation model learning device 1 and a sound source separation device 2.
- the sound source separation model learning device 1 obtains a trained model (hereinafter referred to as “sound source separation model”) that estimates dominant sound source information from the mixed spectrogram by machine learning.
- the mixed spectrogram is a spectrogram of a mixed sound signal. Dominant means that the spectrogram strength (i.e., sound intensity) of a sound source is stronger than that of the other sound sources.
- the time frequency point represents one point in the spectrogram. That is, a time frequency point is a point in space where one axis represents time and one axis represents frequency. The value of the time frequency point in the spectrogram represents the sound intensity.
- the dominant sound source information is information indicating which of the plurality of sound sources included in the mixed spectrogram is dominant for each time frequency point of the mixed spectrogram. Therefore, the sound source separation model is a model that acquires the estimation result of the dominant sound source information (hereinafter referred to as "estimated dominant sound source information") from the mixed spectrogram.
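- A minimal sketch of how dominant sound source information could be derived when the per-source spectrograms are available during data preparation; the function name and array shapes are assumptions used only for illustration:

```python
import numpy as np

def dominant_source_info(source_spectrograms):
    """Return one binary map per sound source: 1 where that source has the
    largest spectrogram strength at a time frequency point, 0 elsewhere.

    source_spectrograms: array of shape (num_sources, F, N).
    """
    dominant = np.argmax(source_spectrograms, axis=0)            # (F, N)
    num_sources = source_spectrograms.shape[0]
    return np.stack([(dominant == d).astype(np.float32)
                     for d in range(num_sources)])

# Example with two hypothetical sources.
specs = np.abs(np.random.randn(2, 257, 100))
Y = dominant_source_info(specs)   # Y[d, f, n] == 1 iff source d is dominant
```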
- For the sake of simplicity of the following explanation, learning means appropriately adjusting the values of the parameters in a model of machine learning (hereinafter referred to as the "machine learning model") based on the input.
- learning so as to be A means that the values of the parameters in the machine learning model are adjusted so as to satisfy A.
- A represents a predetermined condition.
- the trained model is a machine learning model after one or a plurality of learnings have been performed, and is a machine learning model at a timing when a predetermined end condition (hereinafter referred to as “learning end condition”) is satisfied.
- the sound source separation model learning device 1 performs learning using data for obtaining a trained model (hereinafter referred to as "learning data").
- the training data specifically includes a plurality of pairs of data.
- the paired data is a pair of the learning spectrogram X and the learning dominant sound source information Y.
- the spectrogram X for learning is a mixed spectrogram used as an explanatory variable when the sound source separation model learning device 1 obtains a trained model.
- the spectrogram X for learning is information represented by the following equation (1).
- In equation (1), f (f is an integer of 0 or more and (F-1) or less; F is an integer of 1 or more) represents the position of each point of the mixed spectrogram on the frequency axis, and n (N is an integer of 1 or more) represents the position of each point of the mixed spectrogram on the time axis. Therefore, equation (1) represents a mixed spectrogram having (F × N) time frequency points. More specifically, the spectrogram X for learning is expressed by the following equation (2).
- k (k is an integer of 1 or more and K or less; K is an integer of 1 or more) in equation (2) is an identifier for identifying each time frequency point.
- the learning dominant sound source information Y is information used as an objective variable when the sound source separation model learning device 1 obtains a trained model. That is, the learning dominant sound source information Y is a correct label in the learning data.
- the learning dominant sound source information Y indicates whether or not a predetermined sound source (hereinafter referred to as the "learning sound source") is dominant for each time frequency point of the learning spectrogram X. Whether or not the learning sound source is dominant at each time frequency point is represented by, for example, a binary value of 0 or 1 for each time frequency point.
- FIG. 2 is an explanatory diagram illustrating an outline of the sound source separation model learning device 1 in the embodiment.
- the sound source separation model learning device 1 estimates the spectrogram template and the template weight, which will be described later, based on the spectrogram X for learning, and acquires the composite product of the estimated spectrogram template and the template weight.
- the sound source separation model learning device 1 acquires the estimation result of the learning dominant sound source information Y (hereinafter referred to as "estimated dominant sound source information V") based on the composite product.
- based on the difference between the acquired estimated dominant sound source information V and the learning dominant sound source information Y, the sound source separation model learning device 1 updates the spectrogram template and the machine learning model that estimates the template weight from the learning spectrogram X (hereinafter referred to as the "weight estimation model").
- the spectrogram template is information representing one or more values related to the learning spectrogram X, namely one or more values at the time frequency points belonging to one interval (hereinafter referred to as a "time interval") obtained by dividing the learning spectrogram X in the time axis direction.
- the spectrogram template is the same regardless of the interval.
- the spectrogram template is updated by learning.
- the value related to the spectrogram X for learning represented by the spectrogram template depends on the learning process by the sound source separation model learning device 1. Therefore, the value related to the spectrogram X for learning represented by the spectrogram template may be a physical quantity or a value that is not a physical quantity, such as a statistical value, and what kind of value it is is not determined in advance by the user of the sound source separation model learning device 1.
- the spectrogram template is updated by training during the learning stage (i.e., until the learning end condition is met), but does not change during the stage of separating the mixed sound signal to be separated using the trained model (that is, the sound source separation model).
- the template weight is a weight used for estimating the composite product using the spectrogram template, based on the spectrogram X for learning.
- unlike the spectrogram template, the template weight takes a value corresponding to the mixed sound signal to be separated even at the stage of separating the mixed sound signal to be separated using the trained model (that is, the sound source separation model).
- the sound source separation model is a trained model that includes the weight estimation model at the timing when the learning end condition is satisfied, and that has, as a (trained) parameter, the spectrogram template at the timing when the learning end condition is satisfied.
- the sound source separation model learning device 1 includes a sound source separation neural network 110, a loss acquisition unit 120, and a template update unit 130.
- the sound source separation neural network 110, the loss acquisition unit 120, and the template update unit 130 cooperate to perform learning to obtain a sound source separation model.
- the sound source separation neural network 110 is a neural network that obtains a sound source separation model by learning based on the loss acquired by the loss acquisition unit 120, which will be described in detail later.
- the sound source separation neural network 110 includes an input information acquisition unit 111, a configuration information estimation unit 112, and a dominant sound source information estimation unit 113.
- the input information acquisition unit 111 acquires the learning spectrogram X.
- the input information acquisition unit 111 is an input layer in the sound source separation neural network 110.
- the configuration information estimation unit 112 estimates the template weight based on the learning spectrogram X.
- the configuration information estimation unit 112 may be any as long as the template weight can be estimated based on the learning spectrogram X and the weight estimation model can be updated by learning.
- the configuration information estimation unit 112 is, for example, a convolutional neural network (CNN).
- the configuration information estimation unit 112 is, for example, the intermediate layers from the first intermediate layer to the (L-1)-th intermediate layer in the sound source separation neural network 110.
- the configuration information estimation unit 112 learns based on the loss acquired by the loss acquisition unit 120, which will be described in detail later.
- the weight estimation model is updated by learning by the configuration information estimation unit 112.
- the weight estimation model is updated to reduce the loss.
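- A minimal sketch of what such a weight estimation model could look like, assuming a small convolutional neural network written in PyTorch; the layer sizes, the Softplus output activation, and the tensor shapes are assumptions of this example, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class WeightEstimator(nn.Module):
    """Hypothetical weight estimation model: a CNN that maps a mixed
    spectrogram of shape (batch, 1, F, N) to non-negative template weights
    of shape (batch, D * J, N) for D sources and J templates per source."""

    def __init__(self, freq_bins=257, num_sources=2, num_templates=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Collapse the frequency axis, keep the time axis.
        self.proj = nn.Conv2d(16, num_sources * num_templates,
                              kernel_size=(freq_bins, 1))
        self.nonneg = nn.Softplus()    # output activation is non-negative

    def forward(self, spec):
        h = self.conv(spec)
        w = self.proj(h).squeeze(2)    # (batch, D * J, N)
        return self.nonneg(w)          # non-negative template weights
```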
- the dominant sound source information estimation unit 113 acquires the composite product of the spectrogram template and the template weight.
- the dominant sound source information estimation unit 113 acquires the estimated dominant sound source information V based on the acquired composite product.
- the dominant sound source information estimation unit 113 is, for example, the Lth intermediate layer and the output layer in the sound source separation neural network 110.
- the loss acquisition unit 120 acquires the difference between the estimated dominant sound source information V and the learning dominant sound source information Y.
- the difference between the estimated dominant sound source information V and the learning dominant sound source information Y is referred to as a loss.
- the loss is expressed by, for example, the following equation (3).
- the symbol on the left side of the equation (3) is a symbol representing the loss.
- YY^T is a binary matrix of K rows and K columns whose element in the k-th row and k'-th column is 1 when the same sound source is dominant at the time frequency point k and the time frequency point k' of the learning spectrogram X, and 0 when it is not. Note that k and k' are integers of 1 or more and K or less, and K is an integer of 2 or more.
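- The loss thus compares the estimated dominant sound source information V and the learning dominant sound source information Y through their K x K affinity matrices. Below is a hedged sketch of one such affinity-based loss, the standard deep-clustering-style objective, written as an assumption (equation (3) itself is not reproduced here); the expansion avoids materializing the K x K matrices:

```python
import torch

def affinity_loss(V, Y):
    """Affinity-based loss between V and Y, both of shape (K, D) with
    K = F * N time frequency points and D sources.

    Equals || V V^T - Y Y^T ||_F^2, computed via the expansion
    ||V^T V||^2 - 2 ||V^T Y||^2 + ||Y^T Y||^2 so that only (D x D)
    matrices are ever built.
    """
    return (torch.norm(V.T @ V) ** 2
            - 2 * torch.norm(V.T @ Y) ** 2
            + torch.norm(Y.T @ Y) ** 2)
```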
- the template update unit 130 updates the spectrogram template based on the loss. More specifically, the template update unit 130 updates the spectrogram template so as to reduce the loss.
- the template update unit 130 updates the spectrogram template, it means that the values of the parameters representing the spectrogram template in the neural network constituting the dominant sound source information estimation unit 113 are appropriately adjusted.
- when updating the spectrogram template, the template update unit 130 updates the spectrogram template so that its value is a value of 0 or more (hereinafter referred to as a "non-negative value").
- the spectrogram template (that is, the initial value of the spectrogram template) at the stage where the sound source separation neural network 110 has never been learned is a predetermined value.
- the initial value of the spectrogram template is a predetermined value using, for example, a random number.
- the spectrogram template does not have to be one, and may be multiple.
- the number of spectrogram templates may be a predetermined number preset by the user, or may be a predetermined number using a method such as cross validation.
- in the following, the description assumes that the input layer of the sound source separation neural network 110 is the input information acquisition unit 111, the intermediate layers from the first intermediate layer to the (L-1)-th intermediate layer are the configuration information estimation unit 112, and the L-th intermediate layer and the output layer are the dominant sound source information estimation unit 113.
- the template weight is estimated based on the learning spectrogram X input to the input layer.
- the output result of the (L-1)-th intermediate layer is the template weight.
- the activation function of the (L-1)-th intermediate layer outputs a non-negative value. Therefore, the template weight value is non-negative.
- the activation function that outputs a non-negative value is, for example, a softplus function or a rectified linear unit (ReLU).
- the first intermediate layer through the (L-1)-th intermediate layer may be any neural network that can estimate the template weight based on the learning spectrogram X input to the input layer.
- the composite product of the spectrogram template and the template weight is acquired.
- the process of acquiring the composite product is expressed by a mathematical formula, for example, by the following equation (5).
- H^(L) represents the output of the L-th layer, and H^(L-1) represents the output of the (L-1)-th layer.
- equation (5) is expressed in more detail by the following equation (6) for each element of H^(L).
- d represents a sound source.
- d is a value of 0 or 1, where 1 represents one of the two speakers and 0 represents the other.
- m is an integer of 1 or more and N or less, and represents the time on the time axis of the spectrogram X for learning.
- j (j is an integer of 1 or more and J or less; J is an integer of 1 or more) in equation (6) is an identifier for identifying each spectrogram template for the sound source d. Therefore, J is the total number of spectrogram templates for the sound source d.
- equation (6) shows that its left side is the sum, over the J spectrogram templates represented by the following equation (7), of the products obtained by multiplying each template shifted by m in the time axis direction by the value represented by the following equation (8).
- equation (8) represents the template weight of H^(L-1) by which the spectrogram template j of the sound source d is multiplied at the time (n-m).
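- A minimal sketch of the composite product described by equations (5) through (8): each spectrogram template is shifted along the time axis, multiplied by the corresponding template weight, and the products are summed (a convolutive synthesis). The array shapes and the function name are assumptions of this illustration:

```python
import numpy as np

def composite_product(templates, weights):
    """templates: (D, J, F, M) -- J templates per source d, each F x M
    weights:   (D, J, N)    -- non-negative weight per template and frame
    returns:   (D, F, N)    -- H[d, f, n] = sum_{j,m} T[d,j,f,m] * W[d,j,n-m]
    """
    D, J, F, M = templates.shape
    N = weights.shape[-1]
    H = np.zeros((D, F, N))
    for d in range(D):
        for j in range(J):
            for m in range(M):
                # template column at lag m, weight shifted by m frames
                H[d, :, m:] += np.outer(templates[d, j, :, m],
                                        weights[d, j, :N - m])
    return H
```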
- the composite product acquired in the L-th intermediate layer is standardized.
- the processing of the final layer is represented by, for example, the following equation (9).
- Equation (9) is expressed in more detail by the following equation (10) for each element of V.
- Equation (10) represents that the squared norm of the estimated dominant sound source information V is 1.
- the estimated dominant sound source information V may be standardized in any way, and may be standardized so that the p-th power norm of the estimated dominant sound source information V is 1 (p is an integer of 1 or more).
- the left side of the equation (10) can be interpreted as representing the Wiener mask.
- in the final layer, H^(L) may be acquired as it is as the estimated dominant sound source information V. Since the estimated dominant sound source information V represented by equation (9) is only a standardized composite product, the loss is a quantity representing the difference between the composite product and the learning dominant sound source information Y.
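- A minimal sketch of the standardization step, assuming the norm over the sources is made 1 at every time frequency point (p = 2 corresponds to the unit squared-norm case described above); the helper name and shapes are assumptions:

```python
import numpy as np

def standardize(H, p=2, eps=1e-12):
    """Standardize the composite product H of shape (D, F, N) so that the
    p-th power norm over the D sources is 1 at each time frequency point."""
    denom = np.power(np.sum(np.power(np.maximum(H, 0.0), p), axis=0), 1.0 / p)
    return H / (denom[None, :, :] + eps)

# V = standardize(composite_product(templates, weights))
```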
- the sound source separation device 2 separates the non-mixed sound signal from the input mixed sound signal by using the sound source separation model obtained by the sound source separation model learning device 1 by learning.
- the number of non-mixed sound signals separated from the mixed sound signal may be a number specified in advance by the user of the sound source separation device 2 (hereinafter referred to as the "user-specified number"), or may be a number estimated from the mixed sound signal using a technique, based on some other learning model, for estimating the number of sound sources from a sound signal. Such another learning model is, for example, the method described in Reference 1 below.
- the sound source separation system 100 will be described by taking as an example the case where the number of non-mixed sound signals separated from the mixed sound signal is a number specified in advance by the user.
- FIG. 3 is a diagram showing an example of the hardware configuration of the sound source separation model learning device 1 in the embodiment.
- the sound source separation model learning device 1 includes a control unit 10 including a processor 91 such as a CPU (Central Processing Unit) connected by a bus and a memory 92, and executes a program.
- the sound source separation model learning device 1 functions as a device including a control unit 10, an input unit 11, an interface unit 12, a storage unit 13, and an output unit 14 by executing a program. More specifically, the processor 91 reads out the program stored in the storage unit 13 and stores the read program in the memory 92. By executing the program stored in the memory 92, the processor 91 causes the sound source separation model learning device 1 to function as a device including the control unit 10, the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
- the control unit 10 controls the operation of various functional units included in the sound source separation model learning device 1.
- the control unit 10 executes, for example, a unit learning process.
- the unit learning process is a series of processes in which a loss is acquired using one learning data, and the spectrogram template and the weight estimation model are updated based on the acquired loss.
- the input unit 11 includes an input device such as a mouse, a keyboard, and a touch panel.
- the input unit 11 may be configured as an interface for connecting these input devices to its own device.
- the input unit 11 receives input of various information to its own device.
- the input unit 11 receives, for example, an input instructing the start of learning.
- the input unit 11 accepts, for example, input of learning data.
- the instruction to start learning may be, for example, input of learning data.
- the interface unit 12 includes a communication interface for connecting the own device to an external device.
- the interface unit 12 communicates with an external device via wired or wireless.
- the external device may be a storage device such as a USB (Universal Serial Bus) memory.
- USB Universal Serial Bus
- the interface unit 12 acquires the learning data output by the external device by communicating with the external device.
- the interface unit 12 includes a communication interface for connecting the own device to the sound source separation device 2.
- the interface unit 12 communicates with the sound source separation device 2 via wired or wireless.
- the interface unit 12 outputs a sound source separation model to the sound source separation device 2 by communicating with the sound source separation device 2.
- the storage unit 13 is configured by using a non-temporary computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device.
- the storage unit 13 stores various information about the sound source separation model learning device 1.
- the storage unit 13 stores, for example, a weight estimation model in advance.
- the storage unit 13 stores, for example, the initial value of the spectrogram template in advance.
- the storage unit 13 stores, for example, a spectrogram template.
- the output unit 14 outputs various information.
- the output unit 14 includes display devices such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
- the output unit 14 may be configured as an interface for connecting these display devices to its own device.
- the output unit 14 outputs, for example, the information input to the input unit 11.
- the output unit 14 may display information indicating the spectrogram template at the time when the learning end condition is satisfied, for example.
- FIG. 4 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
- the control unit 10 includes a managed unit 101 and a management unit 102.
- the managed unit 101 executes the unit learning process.
- the managed unit 101 includes a sound source separation neural network 110, a loss acquisition unit 120, a template update unit 130, and a learning data acquisition unit 140.
- the learning data acquisition unit 140 acquires the learning data input to the input unit 11 or the interface unit 12.
- the learning data acquisition unit 140 outputs the learning spectrogram X out of the acquired learning data to the sound source separation neural network 110, and outputs the learning dominant sound source information Y to the loss acquisition unit 120. More specifically, the learning data acquisition unit 140 outputs the learning spectrogram X to the input information acquisition unit 111.
- the management unit 102 controls the operation of the managed unit 101.
- the management unit 102 controls, for example, the execution of the unit learning process as the operation control of the managed unit 101.
- the management unit 102 controls, for example, the operations of the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.
- the management unit 102 reads various information from the storage unit 13 and outputs it to the managed unit 101.
- the management unit 102 acquires, for example, the information input to the input unit 11 and outputs the information to the managed unit 101.
- the management unit 102 acquires, for example, the information input to the input unit 11 and records it in the storage unit 13.
- the management unit 102 acquires, for example, the information input to the interface unit 12 and outputs it to the managed unit 101.
- the management unit 102 acquires, for example, the information input to the interface unit 12 and records it in the storage unit 13.
- the management unit 102 causes the output unit 14, for example, to output the information input to the input unit 11.
- the management unit 102 records, for example, the information used for executing the unit learning process and the information generated by executing the unit learning process in the storage unit 13.
- FIG. 5 is a diagram showing an example of the hardware configuration of the sound source separation device 2 in the embodiment.
- the sound source separation device 2 includes a control unit 20 including a processor 93 such as a CPU connected by a bus and a memory 94, and executes a program.
- the sound source separation device 2 functions as a device including a control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24 by executing a program. More specifically, the processor 93 reads out the program stored in the storage unit 23, and stores the read program in the memory 94. By executing the program stored in the memory 94 by the processor 93, the sound source separation device 2 functions as a device including a control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24.
- the control unit 20 controls the operation of various functional units included in the sound source separation device 2.
- the control unit 20 separates a user-specified number of non-mixed sound signals from the mixed sound signals to be separated by using, for example, the sound source separation model obtained by the sound source separation model learning device 1.
- the sound source separation device 2 will be described by taking as an example the case where the mixed sound signal to be separated is input in advance before being input to the sound source separation device 2.
- the input unit 21 includes an input device such as a mouse, a keyboard, and a touch panel.
- the input unit 21 may be configured as an interface for connecting these input devices to its own device.
- the input unit 21 receives input of various information to its own device.
- the input unit 21 accepts, for example, a user-specified number of inputs.
- the input unit 21 receives, for example, an input instructing the start of a process of separating the non-mixed sound signal from the mixed sound signal.
- the input unit 21 receives, for example, an input of a mixed sound signal to be separated.
- the interface unit 22 includes a communication interface for connecting the own device to an external device.
- the interface unit 22 communicates with an external device via wired or wireless.
- the external device is, for example, the output destination of the non-mixed sound signal separated from the mixed sound signal.
- the interface unit 22 outputs a non-mixed sound signal to the external device by communicating with the external device.
- the external device for outputting the non-mixed sound signal is a sound output device such as a speaker.
- the external device may be, for example, a storage device such as a USB memory that stores the sound source separation model.
- the interface unit 22 acquires the sound source separation model by communicating with the external device.
- the external device is, for example, an output source of a mixed sound signal.
- the interface unit 22 acquires the mixed sound signal from the external device by communicating with the external device.
- the interface unit 22 includes a communication interface for connecting the own device to the sound source separation model learning device 1.
- the interface unit 22 communicates with the sound source separation model learning device 1 via wired or wireless.
- the interface unit 22 acquires a sound source separation model from the sound source separation model learning device 1 by communicating with the sound source separation model learning device 1.
- the storage unit 23 is configured by using a non-temporary computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device.
- the storage unit 23 stores various information about the sound source separation device 2.
- the storage unit 23 stores, for example, the sound source separation model acquired via the interface unit 22.
- the storage unit 23 stores, for example, the user-specified number input via the input unit 21.
- the storage unit 23 stores the number of spectrogram templates.
- the output unit 24 outputs various information.
- the output unit 24 includes display devices such as a CRT display, a liquid crystal display, and an organic EL display.
- the output unit 24 may be configured as an interface for connecting these display devices to the own device.
- the output unit 24 outputs, for example, the information input to the input unit 21.
- the output unit 24 outputs, for example, the spectrogram template used when the non-mixed sound signal is separated from the mixed sound signal and the template weight corresponding to the spectrogram template.
- FIG. 6 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment.
- the control unit 20 includes a separation target acquisition unit 201, a spectrogram acquisition unit 202, a separation information acquisition unit 203, a non-mixed sound signal generation unit 204, a sound signal output control unit 205, and an interface control unit 206.
- the separation target acquisition unit 201 acquires the mixed sound signal to be separated.
- the separation target acquisition unit 201 acquires, for example, the mixed sound signal input to the input unit 21.
- the separation target acquisition unit 201 acquires, for example, the mixed sound signal input to the interface unit 22.
- the spectrogram acquisition unit 202 acquires a spectrogram of the mixed sound signal acquired by the separation target acquisition unit 201 (hereinafter referred to as “separation target spectrogram”).
- the method for acquiring the spectrogram may be any method as long as the spectrogram can be acquired from the mixed sound signal.
- the spectrogram acquisition method may be, for example, a method of applying a short-time Fourier transform to the waveform of the mixed sound signal and then acquiring an amplitude spectrogram obtained by extracting only the amplitude information.
- the acquired spectrogram is output to the separation information acquisition unit 203.
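- A hedged sketch of such a spectrogram acquisition method using a short-time Fourier transform; the librosa call and the n_fft / hop_length values are assumptions, not values given in the present disclosure:

```python
import numpy as np
import librosa

def separation_target_spectrogram(mixed_signal, n_fft=512, hop_length=128):
    """Apply a short-time Fourier transform to the waveform of the mixed
    sound signal and keep only the amplitude information."""
    stft = librosa.stft(mixed_signal, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)     # amplitude spectrogram of shape (F, N)
```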
- the separation information acquisition unit 203 acquires the estimated dominant sound source information V for each of the user-specified number of non-mixed sound signals included in the mixed sound signal to be separated by using the sound source separation model based on the separation target spectrogram.
- for the sound source separation model, spectrogram templates for all the sound sources used for learning have been obtained. Therefore, when the number specified by the user is plural, the sound source separation model can separate all the sound sources used for learning.
- the non-mixed sound signal generation unit 204 generates a non-mixed sound signal by using the mixed sound signal to be separated, the spectrogram to be separated, and the estimated dominant sound source information V acquired by the separation information acquisition unit 203. For example, the non-mixed sound signal generation unit 204 multiplies the estimated dominant sound source information V by the input amplitude spectrogram, adds the phase information based on the phase reconstruction method such as the Griffin-Lim method, and then applies the inverse short-time Fourier transform. Generates a non-mixed sound signal. In this way, the non-mixed sound signal generation unit 204 separates the non-mixed sound signal from the mixed sound signal to be separated. The separated non-mixed sound signal is output to the sound signal output control unit 205.
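- A minimal sketch of that separation step, assuming librosa's Griffin-Lim implementation (which combines the phase reconstruction and the inverse short-time Fourier transform in one call); the parameter values are illustrative assumptions:

```python
import librosa

def separate_source(mixed_amp_spec, estimated_mask, n_fft=512, hop_length=128):
    """Multiply the estimated dominant sound source information (a mask) by
    the input amplitude spectrogram, then recover a waveform by Griffin-Lim
    phase reconstruction followed by an inverse short-time Fourier transform."""
    masked_amp = estimated_mask * mixed_amp_spec            # (F, N)
    return librosa.griffinlim(masked_amp, n_iter=60,
                              n_fft=n_fft, hop_length=hop_length)
```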
- the sound signal output control unit 205 controls the operation of the interface unit 22.
- the sound signal output control unit 205 controls the operation of the interface unit 22 so that the interface unit 22 outputs a separated non-mixed sound signal.
- FIG. 7 is a flowchart showing an example of the flow of processing executed by the sound source separation model learning device 1 in the embodiment. More specifically, FIG. 7 is a flowchart showing an example of the flow of the unit learning process.
- the sound source separation model learning device 1 executes the unit learning process shown in FIG. 7 every time the learning data is input to obtain a sound source separation model.
- Learning data is input to the input unit 11 or the interface unit 12 (step S101).
- the input information acquisition unit 111 acquires the learning spectrogram X included in the learning data (step S102).
- the configuration information estimation unit 112 estimates the template weight using the weight estimation model based on the learning spectrogram X (step S103).
- the dominant sound source information estimation unit 113 estimates the estimated dominant sound source information V based on the spectrogram template and the template weight (step S104).
- the loss acquisition unit 120 acquires the difference (that is, the loss) between the estimated dominant sound source information V and the learning dominant sound source information Y included in the learning data (step S105).
- the template update unit 130 updates the spectrogram template so as to reduce the loss, and the configuration information estimation unit 112 updates the weight estimation model so as to reduce the loss (step S106).
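- A compact sketch of one unit learning process (steps S103 to S106) written in PyTorch; the tensor shapes, layer sizes, the convolutive composite product, and the expanded affinity loss are assumptions used only to make the flow concrete, not the patent's own implementation:

```python
import torch
import torch.nn.functional as nnF

D, J, F, M, N = 2, 5, 257, 8, 100

templates = torch.nn.Parameter(torch.rand(D, J, F, M))   # spectrogram templates
weight_net = torch.nn.Sequential(                          # weight estimation model
    torch.nn.Conv1d(F, 64, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv1d(64, D * J, kernel_size=3, padding=1), torch.nn.Softplus(),
)
optimizer = torch.optim.Adam([templates, *weight_net.parameters()], lr=1e-3)

def unit_learning_step(X, Y):
    """X: learning spectrogram (F, N); Y: dominant sound source labels (D, F, N)."""
    optimizer.zero_grad()
    W = weight_net(X.unsqueeze(0)).squeeze(0).reshape(D, J, N)       # S103
    T = templates.clamp(min=0)                                        # non-negative
    # S104: composite product H[d, f, n] = sum_{j,m} T[d,j,f,m] * W[d,j,n-m]
    H = torch.zeros(D, F, N)
    for m in range(M):
        shifted = torch.einsum('djf,djn->dfn', T[..., m], W[..., :N - m])
        H = H + nnF.pad(shifted, (m, 0))        # shift by m frames in time
    V = H / (H.pow(2).sum(0, keepdim=True).sqrt() + 1e-12)            # standardize
    Vk, Yk = V.reshape(D, -1).T, Y.reshape(D, -1).T                   # (K, D)
    loss = ((Vk.T @ Vk) ** 2).sum() - 2 * ((Vk.T @ Yk) ** 2).sum() \
           + ((Yk.T @ Yk) ** 2).sum()                                 # S105
    loss.backward()
    optimizer.step()                                                  # S106
    return float(loss)
```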
- FIG. 8 is a flowchart showing an example of the flow of processing executed by the sound source separation device 2 in the embodiment.
- an example of the flow of the process executed by the sound source separation device 2 will be described, taking as an example the case where the user-specified number has been input to the sound source separation device 2 in advance and the input user-specified number has been stored in the storage unit 23.
- the separation target acquisition unit 201 acquires the separation target mixed sound signal input to the input unit 21 or the interface unit 22 (step S201).
- the spectrogram acquisition unit 202 acquires the spectrogram to be separated using the mixed sound signal to be separated (step S202).
- the separation information acquisition unit 203 acquires the estimated dominant sound source information V for each of the user-specified number of non-mixed sound signals included in the mixed sound signal to be separated, by using the sound source separation model based on the separation target spectrogram (step S203).
- the non-mixed sound signal generation unit 204 uses the mixed sound signal to be separated, the spectrogram to be separated, and the estimated dominant sound source information V acquired by the separation information acquisition unit 203, and the non-mixed sound from the mixed sound signal. Separate the signals (step S204).
- the sound signal output control unit 205 controls the operation of the interface unit 22 so that the interface unit 22 outputs the separated non-mixed sound signal (step S205).
- the training data was created as follows. First, a short-time Fourier transform using a Hamming window was applied to each one-utterance signal of speaker 0 and speaker 1. Next, each signal after the short-time Fourier transform was multiplied by a weight generated from the uniform distribution on the closed interval from 0 to 1 to obtain the spectrogram X~(d) for each speaker. In the separation experiment, d is 0 or 1, where 0 indicates speaker 0 and 1 indicates speaker 1. Note that X~ means the symbol represented by the following equation (11).
- X ⁇ (q) means a symbol represented by the following equation (12).
- the input X = (X_{f,n})_{f,n} to the proposed model was scaled so that the maximum value was 1, and the amplitude spectrogram X_{f,n} was obtained.
- the amplitude spectrogram X_{f,n} is represented by the following equation (13).
- the information represented by the following equation (15) was used as the learning dominant sound source information Y indicating the dominant speaker at each time frequency point (f, n).
- the left side of the equation (15) represents the learning dominant sound source information Y used in the separation experiment.
- as test data, 66 utterances of each of the voices of speaker 0 (bdl) and speaker 1 (clb) were used.
- the method of creating the test data is the same as that of the training data, but the weight to be multiplied after applying the short-time Fourier transform is set to 1 for both speakers.
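- A sketch of that data creation procedure, assuming librosa for the Hamming-window short-time Fourier transform; the function name, parameter values, and the handling of utterances of unequal length are assumptions of this example:

```python
import numpy as np
import librosa

def make_pair(utt0, utt1, n_fft=512, hop_length=128, rng=None):
    """Build one (learning spectrogram X, dominant sound source info Y) pair
    from one utterance of each speaker, roughly following the experiment:
    Hamming-window STFT per speaker, weights drawn from U(0, 1), mixture
    scaled so its maximum value is 1, dominant-speaker labels per point."""
    rng = rng or np.random.default_rng()
    specs = []
    for utt in (utt0, utt1):
        S = np.abs(librosa.stft(utt, n_fft=n_fft, hop_length=hop_length,
                                window='hamming'))
        specs.append(rng.uniform(0.0, 1.0) * S)      # weight from U(0, 1)
    n_frames = min(s.shape[1] for s in specs)        # align lengths (assumption)
    specs = np.stack([s[:, :n_frames] for s in specs])    # (2, F, N)
    X = specs.sum(axis=0)
    X = X / X.max()                                   # scale so the max value is 1
    Y = np.stack([(np.argmax(specs, axis=0) == d).astype(np.float32)
                  for d in range(2)])                 # dominant speaker per point
    return X, Y
```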
- FIG. 9 is a diagram showing the first result of the separation experiment in the embodiment. Specifically, FIG. 9 is an example of a spectrogram of test data using a sound source separation model obtained by learning 500 epochs. The result R1 in FIG. 9 is the spectrogram of speaker 0, and the result R2 in FIG. 9 is the spectrogram of speaker 1.
- FIG. 10 is a diagram showing the second result of the separation experiment in the embodiment. Specifically, FIG. 10 shows the dominant sound source information of the correct answer data with respect to the test data of FIG.
- the result R3 in FIG. 10 is the correct answer data corresponding to speaker 0, and the result R4 in FIG. 10 is the correct answer data corresponding to speaker 1.
- FIG. 11 is a diagram showing the third result of the separation experiment in the embodiment. Specifically, FIG. 11 is an estimation result before normalization of the sound source separation device 2 with respect to the test data of FIG.
- the result R5 in FIG. 11 is the estimation result corresponding to the speaker 0, and the result R6 in FIG. 11 is the estimation result corresponding to the speaker 1.
- FIG. 12 is a diagram showing the fourth result of the separation experiment in the embodiment. Specifically, FIG. 12 is an estimation result after normalization of the sound source separation device 2 with respect to the test data of FIG.
- the result R7 in FIG. 12 is the estimation result corresponding to the speaker 0, and the result R8 in FIG. 12 is the estimation result corresponding to the speaker 1.
- FIG. 13 is a diagram showing the fifth result of the separation experiment in the embodiment. Specifically, FIG. 13 shows a spectrogram template acquired by the sound source separation device 2 with respect to the test data of FIG.
- the result R9 in FIG. 13 is the spectrogram template corresponding to the speaker 0, and the result R10 in FIG. 13 is the spectrogram template corresponding to the speaker 1.
- FIG. 13 represents five spectrogram templates in ascending order of j.
- the horizontal axis of each spectrogram template represents time, and the vertical axis represents frequency.
- j is a number for distinguishing a plurality of spectrogram templates.
- FIG. 14 is a diagram showing the sixth result of the separation experiment in the embodiment. Specifically, FIG. 14 shows the template weight corresponding to the speaker 0 acquired by the sound source separation device 2 with respect to the test data of FIG.
- FIG. 15 is a diagram showing the seventh result of the separation experiment in the embodiment. Specifically, FIG. 15 shows the template weight corresponding to the speaker 1 acquired by the sound source separation device 2 with respect to the test data of FIG.
- FIGS. 13 to 15 show how the sound source separation device 2 captured the differences between the speakers when separating them. Therefore, the results of the separation experiment show that the sound source separation system 100 facilitates the interpretation of the trained model.
- the sound source separation system 100 of the embodiment configured in this way estimates the spectrogram template and the template weight, and learns so as to reduce the loss based on the estimation result. Specifically, if the sound source separation system 100 is used, the user can grasp, by looking at the spectrogram template and its weight, information on the frequency patterns used for sound source separation of the input signal and their changes over time.
- the frequency pattern is information representing the distribution of energy according to frequency. Therefore, if the sound source separation system 100 is used, the user can know at least the time change of the frequency pattern as to how the sound sources are separated, and the time change of the frequency pattern can be useful for interpreting the trained model. In this way, the sound source separation system 100 facilitates the interpretation of the trained model.
- the sound source separation system 100 of the embodiment configured in this way learns so that the values of the spectrogram template and the template weight are non-negative values. In such a case, the spectrogram template value and the template weight value are never negative, making it easier to interpret the physical meaning of the spectrogram template and the physical meaning of the template weight. Therefore, the sound source separation system 100 configured in this way facilitates the interpretation of the trained model.
- D(A|B) is a non-negative function that outputs 0 when A and B match, and outputs a larger value as the difference between A and B increases. Therefore, D(A|B) represents an error between A and B.
- ⁇ is a non-negative constant that represents the strength of regularization.
- Equation (17) is a term (regularization term) representing an error between the value obtained by summing the right side of equation (10) for all sound sources d and the spectrogram X for learning.
- the sound source separation model learning device 1 learns so as to reduce the loss represented by equation (16), so that the difference between the sum of the right side of equation (10) over all sound sources d and the learning spectrogram X can be made small. Specifically, if the loss acquisition unit 120 acquires the loss represented by equation (16) instead of the loss represented by equation (3), the sound source separation model learning device 1 can reduce the difference between the sum of the right side of equation (10) over all sound sources d and the learning spectrogram X.
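- A hedged sketch of adding such a regularization term to the loss; a squared error stands in for the divergence D(.|.), and the composite products summed over all sources are used as the reconstruction compared with the learning spectrogram X. These choices are assumptions, and equations (16) and (17) are not reproduced verbatim:

```python
import torch

def regularization_term(composite_products, X, lam=0.1):
    """composite_products: (D, F, N) -- one composite product per source d
    X:                   (F, N)    -- learning spectrogram
    lam: non-negative constant representing the strength of regularization."""
    reconstruction = composite_products.sum(dim=0)    # sum over sound sources d
    return lam * ((X - reconstruction) ** 2).sum()

# Loss in the spirit of equation (16):
# loss = affinity_loss(Vk, Yk) + regularization_term(H, X)
```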
- the sound source separation device 2 does not necessarily have to include the spectrogram acquisition unit 202. In such a case, the separation target spectrogram is input to the sound source separation device 2 as it is.
- the sound source separation model learning device 1 and the sound source separation device 2 may be implemented by using a plurality of information processing devices that are communicably connected via a network. Each functional unit included in the sound source separation model learning device 1 may be distributed and mounted in a plurality of information processing devices.
- the template updating unit 130 may be provided by the dominant sound source information estimation unit 113.
- the non-mixed sound signal generation unit 204 is an example of a separation unit.
- the configuration information estimation unit 112 is an example of the weight estimation unit.
- the spectrogram template makes it easier to interpret the trained model when it is a non-negative value than when it is not a non-negative value, but it does not necessarily have to be a non-negative value.
- similarly, the template weight makes it easier to interpret the trained model when it is a non-negative value than when it is not, but it does not necessarily have to be a non-negative value.
- All or part of each function of the sound source separation model learning device 1 and the sound source separation device 2 may be realized by using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).
- the program may be recorded on a computer-readable recording medium.
- the computer-readable recording medium is, for example, a flexible disk, a magneto-optical disk, a portable medium such as a ROM or a CD-ROM, or a storage device such as a hard disk built in a computer system.
- the program may be transmitted over a telecommunication line.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
According to the present invention, a sound source separation model learning device comprises: a learning data acquisition unit for acquiring a spectrogram of a mixed signal in which multiple types of sounds are mixed and for acquiring dominant sound source information indicating whether or not the target sound source is dominant at each time frequency point in the spectrogram; a weight estimation unit for estimating the weight used for estimating a composite product using a template, the template being information indicating one or more values that are located at time frequency points belonging to one section of the spectrogram divided in the time axis direction and that are related to the spectrogram; a dominant sound source information estimation unit for acquiring an estimation result for the dominant sound source information on the basis of the composite product; and a loss acquisition unit for acquiring the difference between the estimation result and the dominant sound source information. The weight estimation unit learns a machine learning model for estimating the weight so as to reduce said difference.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2022524772A JP7376833B2 (ja) | 2020-05-20 | 2020-05-20 | 音源分離モデル学習装置、音源分離装置、音源分離モデル学習方法及びプログラム |
| PCT/JP2020/019997 WO2021234873A1 (fr) | 2020-05-20 | 2020-05-20 | Dispositif d'apprentissage de modèle de séparation de source sonore, dispositif de séparation de source sonore, procédé d'apprentissage de modèle de séparation de source sonore, et programme |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/019997 WO2021234873A1 (fr) | 2020-05-20 | 2020-05-20 | Dispositif d'apprentissage de modèle de séparation de source sonore, dispositif de séparation de source sonore, procédé d'apprentissage de modèle de séparation de source sonore, et programme |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021234873A1 true WO2021234873A1 (fr) | 2021-11-25 |
Family
ID=78708280
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2020/019997 Ceased WO2021234873A1 (fr) | 2020-05-20 | 2020-05-20 | Dispositif d'apprentissage de modèle de séparation de source sonore, dispositif de séparation de source sonore, procédé d'apprentissage de modèle de séparation de source sonore, et programme |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP7376833B2 (fr) |
| WO (1) | WO2021234873A1 (fr) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2018502319A (ja) * | 2015-07-07 | 2018-01-25 | 三菱電機株式会社 | 信号の1つ又は複数の成分を区別する方法 |
| WO2018042791A1 (fr) * | 2016-09-01 | 2018-03-08 | ソニー株式会社 | Dispositif de traitement d'informations, procédé de traitement d'informations, et support d'enregistrement dispositif de traitement d'informations, procédé de traitement d'informations et support d'enregistrement |
| JP2019144511A (ja) * | 2018-02-23 | 2019-08-29 | 日本電信電話株式会社 | 音響信号モデル学習装置、音響信号解析装置、方法、及びプログラム |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2021234873A1 (fr) | 2021-11-25 |
| JP7376833B2 (ja) | 2023-11-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3292515B1 (fr) | Procédé permettant de distinguer une ou plusieurs composantes d'un signal | |
| Becker et al. | Interpreting and explaining deep neural networks for classification of audio signals | |
| CN1151218A (zh) | 用于语音识别的神经网络的训练方法 | |
| US6224636B1 (en) | Speech recognition using nonparametric speech models | |
| Vignolo et al. | Genetic wavelet packets for speech recognition | |
| WO2018159403A1 (fr) | Dispositif d'apprentissage, système de synthèse vocale et procédé de synthèse vocale | |
| Mansour et al. | Voice recognition using dynamic time warping and mel-frequency cepstral coefficients algorithms | |
| Avci | An expert system for speaker identification using adaptive wavelet sure entropy | |
| WO2019171457A1 (fr) | Dispositif et procédé de séparation de source sonore, et programme de stockage de support non transitoire lisible par ordinateur | |
| JP2020034870A (ja) | 信号解析装置、方法、及びプログラム | |
| Sunny et al. | Recognition of speech signals: an experimental comparison of linear predictive coding and discrete wavelet transforms | |
| Laroche et al. | Drum extraction in single channel audio signals using multi-layer non negative matrix factor deconvolution | |
| JP6099032B2 (ja) | 信号処理装置、信号処理方法及びコンピュータプログラム | |
| Khamsehashari et al. | Voice Privacy-leveraging multi-scale blocks with ECAPA-TDNN SE-Res2NeXt extension for speaker anonymization | |
| Bakhshi et al. | Recognition of emotion from speech using evolutionary cepstral coefficients | |
| Roy et al. | Pathological voice classification using deep learning | |
| JP7498408B2 (ja) | 音声信号変換モデル学習装置、音声信号変換装置、音声信号変換モデル学習方法及びプログラム | |
| CN112967734B (zh) | 基于多声部的音乐数据识别方法、装置、设备及存储介质 | |
| JP7376833B2 (ja) | 音源分離モデル学習装置、音源分離装置、音源分離モデル学習方法及びプログラム | |
| Grais et al. | Initialization of nonnegative matrix factorization dictionaries for single channel source separation | |
| Roy et al. | A hybrid VQ-GMM approach for identifying Indian languages | |
| Kameoka et al. | Nonnegative matrix factorization with basis clustering using cepstral distance regularization | |
| JP7567730B2 (ja) | 音源分離学習装置、音源分離学習方法、及び音源分離学習プログラム | |
| Badeau et al. | Nonnegative matrix factorization | |
| Ito et al. | On-line chord recognition using fifthnet with synchrosqueezing transform |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20936721 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2022524772 Country of ref document: JP Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 20936721 Country of ref document: EP Kind code of ref document: A1 |