WO2021234873A1

WO2021234873A1 - Sound source separation model learning device, sound source separation device, sound source separation model learning method, and program

Info

Publication number: WO2021234873A1
Application number: PCT/JP2020/019997
Authority: WO
Inventors: 千紘渡邊; 弘和亀岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2020-05-20
Filing date: 2020-05-20
Publication date: 2021-11-25
Anticipated expiration: 2022-11-20
Also published as: JPWO2021234873A1; JP7376833B2

Abstract

This sound source separation model learning device comprises: a learning data acquisition unit for acquiring a spectrogram of a mixed signal in which multiple types of sound are mixed and acquiring dominant sound source information indicating whether the target sound source is dominant at each time frequency point in the spectrogram or not; a weight estimation unit for estimating the weight used for estimation of a composition product using a template, the template being information indicating one or more values which are located at time frequency points belonging to one division of the spectrogram divided in the time axis direction and are related to the spectrogram; a dominant sound source information estimation unit for acquiring an estimation result for the dominant sound source information on the basis of the composition product; and a loss acquisition unit for acquiring the difference between the estimation result and the dominant sound source information. The weight estimation unit learns a machine learning model for estimating the weight so as to decrease said difference.

Description

Sound source separation model learning device, sound source separation device, sound source separation model learning method and program

The present invention relates to a sound source separation model learning device, a sound source separation device, a sound source separation model learning method, and a program.

Sound source separation is a technique for separating the signal of each sound source from a monaural mixture of multiple sound sources. One family of such techniques was inspired by a class identification problem: identifying which speaker's energy is dominant at each time-frequency point of the spectrogram of the observed signal. In recent years, machine learning methods have been proposed along these lines, for example sound source separation using a neural network (NN) (see Non-Patent Documents 1 and 2).

One NN-based sound source separation technique uses the deep clustering (DC) method (see Non-Patent Documents 3 and 4). In the DC method, an NN is first used to learn a low-dimensional embedding of each time-frequency point. A time-frequency point is a point in the space spanned by the time axis and the frequency axis (the time-frequency space), that is, an element of the time-frequency space.

Each time-frequency point carries an N-dimensional feature vector for the time and frequency indicated by its position in the time-frequency space (N is an integer of 2 or more). A feature vector is a set of information, among the information obtained from the analysis target, that satisfies a predetermined condition obtained through learning or the like. Learning a low-dimensional embedding means learning a mapping that converts an N-dimensional feature vector into a feature vector of dimension less than N.

This yields a trained model that expresses the low-dimensional embedding of each time-frequency point. The DC method then performs sound source separation by clustering the resulting embedding vectors with an unsupervised clustering method such as k-means. An embedding vector is the feature vector of dimension less than N at each time-frequency point. The DC method has been shown experimentally to separate mixed speech of unknown sound sources with high accuracy.
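The clustering step of the DC method described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited work: the function name `kmeans_masks`, the plain NumPy k-means loop, and the array layout (one embedding vector per row) are assumptions made for the example.

```python
import numpy as np

def kmeans_masks(embeddings, num_sources, num_iters=20, seed=0):
    """Cluster K embedding vectors (shape K x D) into one-hot masks (K x num_sources).

    Hypothetical helper: a plain k-means loop standing in for the
    unsupervised clustering step of the DC method.
    """
    rng = np.random.default_rng(seed)
    K = embeddings.shape[0]
    # Initialise centroids from randomly chosen embedding vectors.
    centroids = embeddings[rng.choice(K, size=num_sources, replace=False)].astype(float)
    for _ in range(num_iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(num_sources):
            members = embeddings[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    # Row k is one-hot: which cluster time-frequency point k was assigned to.
    return np.eye(num_sources, dtype=int)[labels]
```

Each column of the returned array can be read as a binary mask over the time-frequency points, one mask per estimated source.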

John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe, "DEEP CLUSTERING: DISCRIMINATIVE EMBEDDINGS FOR SEGMENTATION AND SEPARATION", In ICASSP, pp. 31-35, 2016
Li Li, Hirokazu Kameoka, "DEEP CLUSTERING WITH GATED CONVOLUTIONAL NETWORKS", In ICASSP, pp. 16-20, 2018

However, with NN-based sound source separation techniques, including the DC method, the trained model (the mapping obtained by learning that performs the separation) can be difficult to interpret. Interpreting a trained model means knowing the basis for its predictions. With the DC method, for example, it can be difficult for a user to judge on what basis an embedding vector was determined.

If trained models became easier to interpret, understanding of the DC method would deepen, and further advances in sound source separation could be expected, such as improved generalization performance and adaptation to sound sources other than speakers. In particular, sound source separation is expected to improve greatly if it becomes possible to visualize, that is, to let the user know, which spectrogram structures are used as cues when embedding vectors are determined.

In view of the above circumstances, an object of the present invention is to provide a technique that facilitates the interpretation of a trained model that performs sound source separation.

One aspect of the present invention is a sound source separation model learning device comprising: a learning data acquisition unit that acquires a spectrogram of a mixed signal in which a plurality of sounds are mixed and dominant sound source information indicating, for each time-frequency point of the spectrogram, whether or not a target sound source is dominant; a weight estimation unit that estimates a weight used for estimating a convolution (composite product) using a template, the template being information that represents one or more values relating to the spectrogram at time-frequency points belonging to one section of the spectrogram divided in the time-axis direction; a dominant sound source information estimation unit that acquires an estimation result of the dominant sound source information based on the convolution; and a loss acquisition unit that acquires a difference between the estimation result and the dominant sound source information, wherein the template and the weight used for estimating the convolution indicate an estimation result relating to the spectrogram of the target sound source, and the weight estimation unit learns a machine learning model that estimates the weight so as to reduce the difference.

The present invention makes it possible to facilitate the interpretation of a trained model that performs sound source separation.

An explanatory diagram illustrating an outline of the sound source separation system 100 of the embodiment.
An explanatory diagram illustrating an outline of the sound source separation model learning device 1 in the embodiment.
A diagram showing an example of the hardware configuration of the sound source separation model learning device 1 in the embodiment.
A diagram showing an example of the functional configuration of the control unit 10 in the embodiment.
A diagram showing an example of the hardware configuration of the sound source separation device 2 in the embodiment.
A diagram showing an example of the functional configuration of the control unit 20 in the embodiment.
A flowchart showing an example of the flow of processing executed by the sound source separation model learning device 1 in the embodiment.
A flowchart showing an example of the flow of processing executed by the sound source separation device 2 in the embodiment.
A diagram showing the first result of the separation experiment in the embodiment.
A diagram showing the second result of the separation experiment in the embodiment.
A diagram showing the third result of the separation experiment in the embodiment.
A diagram showing the fourth result of the separation experiment in the embodiment.
A diagram showing the fifth result of the separation experiment in the embodiment.
A diagram showing the sixth result of the separation experiment in the embodiment.
A diagram showing the seventh result of the separation experiment in the embodiment.

(Embodiment)
An outline of the sound source separation system 100 of the embodiment will be described with reference to FIGS. 1 and 2. FIG. 1 is an explanatory diagram illustrating an outline of the sound source separation system 100 of the embodiment. For simplicity, the sound source separation system 100 is described below using a speech signal as an example of the signal to be processed, but the signal to be processed may be any sound signal. For example, it may be the signal of the sound of a musical instrument. The sound source is a monaural sound source. The sound source separation system 100 separates non-mixed sound signals from the mixed sound signal to be separated. A mixed sound signal is a sound signal in which a plurality of non-mixed sound signals are mixed. Different non-mixed sound signals are signals of sounds from different sound sources.

A mixed sound signal is, for example, a speech signal in which speech uttered by a first person is mixed with speech uttered by a second person. In such a case, the sound source separation system 100 separates the signal of the speech uttered by the first person from the signal of the speech uttered by the second person. Here, the signal of the first person's speech and the signal of the second person's speech are each an example of a non-mixed sound signal. The number of non-mixed sound signals separated by the sound source separation system 100 may be one or more.

The sound source separation system 100 includes a sound source separation model learning device 1 and a sound source separation device 2. The sound source separation model learning device 1 obtains, by machine learning, a trained model that estimates dominant sound source information from a mixed spectrogram (hereinafter referred to as the "sound source separation model").

A mixed spectrogram is the spectrogram of a mixed sound signal. Dominant means that the spectrogram intensity (that is, the sound intensity) is higher than that of the other sound sources. A time-frequency point is one point of the spectrogram, that is, a point in a space in which one axis represents time and the other represents frequency. The value of a spectrogram at a time-frequency point represents the sound intensity.

Dominant sound source information is information indicating, for each time-frequency point of the mixed spectrogram, which of the sound sources contained in the mixed spectrogram is dominant. The sound source separation model is therefore a model that acquires an estimation result of the dominant sound source information (hereinafter referred to as "estimated dominant sound source information") from the mixed spectrogram.

For simplicity, learning hereinafter means suitably adjusting the values of the parameters of a machine learning model (hereinafter referred to as a "machine learning model") based on the input. In the following description, learning so that A holds means that the parameter values of the machine learning model are adjusted so as to satisfy A, where A denotes a predetermined condition. A trained model is the machine learning model after one or more rounds of learning, at the time a predetermined end condition (hereinafter referred to as the "learning end condition") is satisfied.

The sound source separation model learning device 1 performs learning using data for obtaining a trained model (hereinafter referred to as "learning data"). Specifically, the learning data includes a plurality of data pairs. Each data pair consists of a learning spectrogram X and learning dominant sound source information Y.

The learning spectrogram X is a mixed spectrogram used as the explanatory variable when the sound source separation model learning device 1 obtains a trained model. The learning spectrogram X is the information represented by the following equation (1).

In equation (1), f (f is an integer from 0 to F-1; F is an integer of 1 or more) denotes the position of each point of the mixed spectrogram on the frequency axis. In equation (1), n (n is an integer from 0 to N-1; N is an integer of 1 or more) denotes the position of each point of the mixed spectrogram on the time axis. Equation (1) therefore represents a mixed spectrogram with (F × N) time-frequency points. More specifically, the learning spectrogram X is represented by the following equation (2).

In equation (2), k (k is an integer from 1 to K; K is an integer of 1 or more) is an identifier that identifies each time-frequency point.

The learning dominant sound source information Y is the information used as the objective variable when the sound source separation model learning device 1 obtains a trained model. In other words, Y is the ground-truth label in the learning data. For each time-frequency point of the learning spectrogram X, Y indicates whether or not a predetermined sound source (hereinafter referred to as the "learning sound source") is dominant. Whether or not the learning sound source is dominant is expressed, for example, as a binary value of 0 or 1 for each time-frequency point.
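One way to build the learning dominant sound source information Y from known per-source spectrograms is sketched below. The function name `dominant_source_labels`, the (D, F, N) input layout, and the row-major mapping from (f, n) to the identifier k are assumptions for illustration; the text does not fix them.

```python
import numpy as np

def dominant_source_labels(source_spectrograms):
    """Build one-hot dominance labels from per-source magnitude spectrograms.

    source_spectrograms: array of shape (D, F, N), one spectrogram per source.
    Returns Y of shape (K, D) with K = F * N; Y[k, d] is 1 when source d has
    the largest magnitude at time-frequency point k, and 0 otherwise.
    The row-major flattening of (f, n) into k is an assumption.
    """
    D, F, N = source_spectrograms.shape
    dominant = source_spectrograms.argmax(axis=0)   # (F, N): index of the strongest source
    return np.eye(D, dtype=int)[dominant.reshape(-1)]
```

With two speakers, column 0 of Y is then the binary 0/1 indicator for one speaker and column 1 the indicator for the other.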

FIG. 2 is an explanatory diagram illustrating an outline of the sound source separation model learning device 1 in the embodiment. Based on the learning spectrogram X, the sound source separation model learning device 1 estimates the spectrogram template and the template weight described later, and acquires the convolution (composite product) of the estimated spectrogram template and template weight. Based on the convolution, it acquires an estimation result of the learning dominant sound source information Y (hereinafter referred to as "estimated dominant sound source information V"). Then, based on the difference between the estimated dominant sound source information V and the learning dominant sound source information Y, it updates the spectrogram template and the machine learning model that estimates the template weight from the learning spectrogram X (hereinafter referred to as the "weight estimation model").

A spectrogram template is information that represents one or more values relating to the learning spectrogram X at time-frequency points belonging to one section of the learning spectrogram X divided in the time-axis direction (hereinafter referred to as a "time section"). The spectrogram template is the same regardless of the time section.

The spectrogram template is updated by learning. The values relating to the learning spectrogram X that the spectrogram template represents depend on the course of learning by the sound source separation model learning device 1. They may therefore be physical quantities or non-physical values such as statistics, and the kind of value is not determined in advance by the user of the sound source separation model learning device 1.

During the learning stage (that is, until the learning end condition is satisfied), the spectrogram template is updated by learning. It does not change during the stage in which the trained model (that is, the sound source separation model) is used to separate the mixed sound signal to be separated.

The template weight is the weight used, based on the learning spectrogram X, for estimating the convolution using the spectrogram template. Even in the stage in which the trained model (that is, the sound source separation model) is used for separation, the template weight takes values that depend on the mixed sound signal to be separated.

The sound source separation model is a trained model that contains the weight estimation model as of the time the learning end condition is satisfied and that has, as a (trained) parameter, the spectrogram template as of that time.

The sound source separation model learning device 1 includes a sound source separation neural network 110, a loss acquisition unit 120, and a template update unit 130. In the sound source separation model learning device 1, these three components cooperate to perform the learning that yields the sound source separation model.

The sound source separation neural network 110 is a neural network that obtains the sound source separation model by learning based on the loss acquired by the loss acquisition unit 120, described in detail later. The sound source separation neural network 110 includes an input information acquisition unit 111, a configuration information estimation unit 112, and a dominant sound source information estimation unit 113.

The input information acquisition unit 111 acquires the learning spectrogram X. In the sound source separation neural network 110, the input information acquisition unit 111 is the input layer.

The configuration information estimation unit 112 estimates the template weight based on the learning spectrogram X. It may be of any form as long as it can estimate the template weight from the learning spectrogram X and the weight estimation model can be updated by learning. The configuration information estimation unit 112 is, for example, a convolutional neural network (CNN). In the sound source separation neural network 110, it corresponds, for example, to the intermediate layers from the first intermediate layer to the (L-1)-th intermediate layer.

The configuration information estimation unit 112 learns based on the loss acquired by the loss acquisition unit 120, described in detail later. This learning updates the weight estimation model so as to reduce the loss.

The dominant sound source information estimation unit 113 acquires the convolution of the spectrogram template and the template weight, and acquires the estimated dominant sound source information V based on that convolution. In the sound source separation neural network 110, the dominant sound source information estimation unit 113 corresponds, for example, to the L-th intermediate layer and the output layer.

The loss acquisition unit 120 acquires the difference between the estimated dominant sound source information V and the learning dominant sound source information Y. Hereinafter, this difference is referred to as the loss. The loss is expressed, for example, by the following equation (3). The symbol on the left-hand side of equation (3) denotes the loss.
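The image for equation (3) is not reproduced in this text. Given the surrounding description (a Frobenius norm, the transpose V^T, and the matrix YY^T), a form consistent with it is the affinity loss used in deep clustering, sketched here as a reconstruction rather than a quotation; the loss symbol and the squaring of the norm are assumptions.

```latex
\mathcal{L}(V, Y) = \left\| V V^{\mathsf{T}} - Y Y^{\mathsf{T}} \right\|_{F}^{2}
```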

In equation (3), the symbol expressed by the following equation (4) denotes the Frobenius norm. In equation (3), "T" denotes matrix transposition; for example, V^T denotes the transpose of the matrix V.

In equation (3), YY^T is a binary matrix of K rows and K columns whose element in row k and column k' is 1 when the same sound source is dominant at time-frequency points k and k' of the learning spectrogram X, and 0 otherwise. Here k and k' are integers from 1 to K, and K is an integer of 2 or more.
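As an illustration of a loss of this kind, the following sketch computes the squared Frobenius norm of VV^T - YY^T. The squared-norm form and the function name `affinity_loss` are assumptions; the expansion in the comment avoids building the K × K matrices, which matters because K = F × N is typically large.

```python
import numpy as np

def affinity_loss(V, Y):
    """Squared Frobenius norm of V V^T - Y Y^T for V, Y of shape (K, D).

    Uses the identity
      ||VV^T - YY^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2
    so that only D x D matrices are formed.
    """
    V = np.asarray(V, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))
```

The loss is zero when VV^T equals YY^T, that is, when two points have matching rows of V exactly when the same source dominates both.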

The template update unit 130 updates the spectrogram template based on the loss. More specifically, it updates the spectrogram template so as to reduce the loss. Updating the spectrogram template means suitably adjusting the values of the parameters that represent the spectrogram template in the neural network constituting the dominant sound source information estimation unit 113. When updating the spectrogram template, the template update unit 130 updates it to non-negative values.
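The text does not say how non-negativity is enforced during the update. One common option, shown here purely as an assumption, is a gradient step followed by projection onto the non-negative values; `update_template`, `grad`, and `learning_rate` are hypothetical names.

```python
import numpy as np

def update_template(template, grad, learning_rate=0.01):
    """One loss-reducing gradient step, then clip so every entry stays >= 0.

    Assumed scheme (projected gradient descent); the source only states that
    the template is updated to reduce the loss and kept non-negative.
    """
    stepped = template - learning_rate * grad
    return np.maximum(stepped, 0.0)
```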

The spectrogram template at the stage where the sound source separation neural network 110 has not yet been trained at all (that is, the initial value of the spectrogram template) is a predetermined value, for example a value determined in advance using random numbers. There need not be only one spectrogram template; there may be several. The number of spectrogram templates may be a number set in advance by the user, or a number determined in advance by a method such as cross-validation.

A specific example of the processing executed by the sound source separation neural network 110 is now described for a sound source separation neural network 110 that satisfies the following configuration condition: the input layer is the input information acquisition unit 111, the intermediate layers from the first intermediate layer to the (L-1)-th intermediate layer are the configuration information estimation unit 112, and the L-th intermediate layer and the output layer are the dominant sound source information estimation unit 113.

The layers from the first intermediate layer to the (L-1)-th intermediate layer estimate the template weight based on the learning spectrogram X input to the input layer. The output of the (L-1)-th intermediate layer is the template weight. The activation function of the (L-1)-th intermediate layer outputs non-negative values, so the template weight is non-negative. Activation functions that output non-negative values include, for example, the softplus function and the rectified linear function.
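The two activation functions named above can be written as follows; both return non-negative values for any real input, which is what keeps the template weight non-negative. This is a minimal NumPy sketch, with `softplus` and `relu` as illustrative function names.

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)), evaluated stably via logaddexp; strictly positive."""
    return np.logaddexp(0.0, x)

def relu(x):
    """Rectified linear function max(0, x); zero for negative inputs."""
    return np.maximum(x, 0.0)
```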

The layers from the first intermediate layer to the (L-1)-th intermediate layer may be any neural network that can estimate the template weight based on the learning spectrogram X input to the input layer.

In the L-th layer, the convolution of the spectrogram template and the template weight is acquired. Expressed as a formula, this processing is given, for example, by the following equation (5).

In equation (5), H^(L) denotes the output of the L-th layer and H^(L-1) denotes the output of the (L-1)-th layer. Written out for each element of H^(L), equation (5) becomes the following equation (6).

In equation (6), d denotes a sound source. For example, d takes the value 0 or 1, where 1 denotes one of two speakers and 0 denotes the other. In equation (6), m is an integer from 1 to N and denotes a time on the time axis of the learning spectrogram X. In equation (6), j (j is an integer from 1 to J; J is an integer of 1 or more) is an identifier that identifies a spectrogram template for sound source d. J is therefore the total number of spectrogram templates for sound source d.

That is, equation (6) shows that its left-hand side is the sum obtained by shifting each of the J spectrogram templates represented by the following equation (7) by m in the time axis direction and then multiplying it by the value represented by the following equation (8).

Equation (8) represents the template weight in H^(L-1) that multiplies the spectrogram template j of the sound source d at time (n - m).
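The operation described for equation (6) can be sketched as follows. This is only an illustration under stated assumptions: the array layouts, the zero-based indices, the treatment of terms with n - m < 0 as zero, and the name `template_convolution` are choices made for this example and are not taken from the embodiment.

```python
def template_convolution(templates, weights):
    """Return H[d][f][n] = sum_j sum_m templates[d][j][f][m] * weights[d][j][n - m].

    templates: [source d][template j][frequency f][template frame m]
    weights:   [source d][template j][time frame n], all non-negative
    Terms with n - m < 0 are treated as zero (an assumption of this sketch).
    """
    out = []
    for d_templates, d_weights in zip(templates, weights):
        num_freq = len(d_templates[0])
        num_frames = len(d_weights[0])
        h = [[0.0] * num_frames for _ in range(num_freq)]
        for template, weight in zip(d_templates, d_weights):
            for f, row in enumerate(template):
                for m, t in enumerate(row):
                    # Shift the template column by m frames and scale it by the weight.
                    for n in range(m, num_frames):
                        h[f][n] += t * weight[n - m]
        out.append(h)
    return out


# One source with one template spanning 2 frequency bins and 2 frames.
templates = [[[[1.0, 0.5],
               [0.0, 2.0]]]]
# The template is activated at frame 1 with weight 3.
weights = [[[0.0, 3.0, 0.0, 0.0]]]
H = template_convolution(templates, weights)
```

In this toy case the template is simply copied into the output starting at frame 1 and scaled by 3, which is the behavior of a weight that is non-zero at a single time.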

In the final layer, the convolution acquired in the L-th intermediate layer is normalized. The processing of the final layer is expressed, for example, by the following equation (9).

　式（９）をより詳しく、Ｖの要素ごとに表すと以下の式（１０）で表される。 The equation (9) is expressed in more detail by the following equation (10) for each element of V.

In equation (10), ε is a predetermined constant for avoiding division by zero. It is desirable that ε be sufficiently small compared with the other values on the right-hand side of equation (10). Equation (10) expresses that the L2 norm of the estimated dominant sound source information V is 1. However, the estimated dominant sound source information V may be normalized in any way; for example, it may be normalized so that its Lp norm is 1 (p is an integer of 1 or more).

Further, when the numerator on the right-hand side of equation (10) is interpreted as the amplitude spectrogram of the sound source d, the left-hand side of equation (10) can be interpreted as representing a Wiener mask.

Note that the normalization does not necessarily have to be performed in the final layer; the final layer may instead acquire H^(L) as the estimated dominant sound source information V. Because the estimated dominant sound source information V expressed by equation (9) is merely the normalized convolution, the loss is a quantity representing the difference between the convolution and the learning dominant sound source information Y.
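The normalization of the final layer can be sketched as follows. The sketch assumes that the norm is taken over the sound sources d at each time frequency point and that ε is added to the norm in the denominator; the text does not state either detail explicitly, and the name `normalize_over_sources` is hypothetical.

```python
def normalize_over_sources(h, p=2, eps=1e-8):
    """Scale H[d][f][n] so that, at each (f, n), the p-norm over sources d is about 1.

    Assumes non-negative entries, as produced by non-negative templates and weights.
    """
    num_sources = len(h)
    num_freq = len(h[0])
    num_frames = len(h[0][0])
    v = [[[0.0] * num_frames for _ in range(num_freq)] for _ in range(num_sources)]
    for f in range(num_freq):
        for n in range(num_frames):
            norm = sum(h[d][f][n] ** p for d in range(num_sources)) ** (1.0 / p)
            for d in range(num_sources):
                # eps keeps the division defined where every source is zero.
                v[d][f][n] = h[d][f][n] / (norm + eps)
    return v


# Two sources, one frequency bin, two frames.
H = [[[3.0, 0.0]],
     [[4.0, 0.0]]]
V = normalize_over_sources(H)
```

At the first frame the two sources receive values close to 0.6 and 0.8, and at the second frame, where both inputs are zero, the small constant prevents a division by zero.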

The sound source separation device 2 separates non-mixed sound signals from an input mixed sound signal by using the sound source separation model obtained through learning by the sound source separation model learning device 1. The number of non-mixed sound signals separated from the mixed sound signal may be a number specified in advance by the user of the sound source separation device 2 (hereinafter referred to as the "user-specified number"), or may be a number estimated by a technique that estimates the number of sound sources from the mixed sound signal based on some other learning model. One such technique is, for example, the method described in Reference 1 below. For simplicity, the sound source separation system 100 is described below for the case where the number of non-mixed sound signals separated from the mixed sound signal is a number specified in advance by the user.

Reference 1: F. Stoter et al., "CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Volume: 27, Issue: 2, Feb. 2019, pp. 268-282

FIG. 3 is a diagram showing an example of the hardware configuration of the sound source separation model learning device 1 in the embodiment. The sound source separation model learning device 1 includes a control unit 10 that includes a processor 91, such as a CPU (Central Processing Unit), and a memory 92 connected by a bus, and executes a program. By executing the program, the sound source separation model learning device 1 functions as a device including the control unit 10, an input unit 11, an interface unit 12, a storage unit 13, and an output unit 14. More specifically, the processor 91 reads the program stored in the storage unit 13 and stores the read program in the memory 92. When the processor 91 executes the program stored in the memory 92, the sound source separation model learning device 1 functions as a device including the control unit 10, the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14.

The control unit 10 controls the operation of the various functional units of the sound source separation model learning device 1. The control unit 10 executes, for example, a unit learning process. The unit learning process is a series of processes in which a loss is acquired using one piece of learning data, and the spectrogram templates and the weight estimation model are updated based on the acquired loss.

　入力部１１は、マウスやキーボード、タッチパネル等の入力装置を含んで構成される。入力部１１は、これらの入力装置を自装置に接続するインタフェースとして構成されてもよい。入力部１１は、自装置に対する各種情報の入力を受け付ける。入力部１１は、例えば学習の開始を指示する入力を受け付ける。入力部１１は、例えば学習用データの入力を受け付ける。学習の開始の指示は、例えば学習用データが入力されることであってもよい。 The input unit 11 includes an input device such as a mouse, a keyboard, and a touch panel. The input unit 11 may be configured as an interface for connecting these input devices to its own device. The input unit 11 receives input of various information to its own device. The input unit 11 receives, for example, an input instructing the start of learning. The input unit 11 accepts, for example, input of learning data. The instruction to start learning may be, for example, input of learning data.

　インタフェース部１２は、自装置を外部装置に接続するための通信インタフェースを含んで構成される。インタフェース部１２は、有線又は無線を介して外部装置と通信する。外部装置は、例えばＵＳＢ（Universal Serial Bus）メモリ等の記憶装置であってもよい。外部装置が例えば学習用データを出力する場合、インタフェース部１２は外部装置との通信によって外部装置が出力する学習用データを取得する。 The interface unit 12 includes a communication interface for connecting the own device to an external device. The interface unit 12 communicates with an external device via wired or wireless. The external device may be a storage device such as a USB (Universal Serial Bus) memory. When the external device outputs learning data, for example, the interface unit 12 acquires the learning data output by the external device by communicating with the external device.

　インタフェース部１２は、自装置を音源分離装置２に接続するための通信インタフェースを含んで構成される。インタフェース部１２は、有線又は無線を介して音源分離装置２と通信する。インタフェース部１２は、音源分離装置２との通信により、音源分離装置２に音源分離モデルを出力する。 The interface unit 12 includes a communication interface for connecting the own device to the sound source separation device 2. The interface unit 12 communicates with the sound source separation device 2 via wired or wireless. The interface unit 12 outputs a sound source separation model to the sound source separation device 2 by communicating with the sound source separation device 2.

The storage unit 13 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 13 stores various information about the sound source separation model learning device 1. The storage unit 13 stores, for example, the weight estimation model in advance. The storage unit 13 stores, for example, the initial values of the spectrogram templates in advance. The storage unit 13 stores, for example, the spectrogram templates.

　出力部１４は、各種情報を出力する。出力部１４は、例えばＣＲＴ（Cathode Ray Tube）ディスプレイや液晶ディスプレイ、有機ＥＬ（Electro-Luminescence）ディスプレイ等の表示装置を含んで構成される。出力部１４は、これらの表示装置を自装置に接続するインタフェースとして構成されてもよい。出力部１４は、例えば入力部１１に入力された情報を出力する。出力部１４は、例えば学習終了条件が満たされた時点におけるスペクトログラムテンプレートを示す情報を表示してもよい。 The output unit 14 outputs various information. The output unit 14 includes display devices such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display. The output unit 14 may be configured as an interface for connecting these display devices to its own device. The output unit 14 outputs, for example, the information input to the input unit 11. The output unit 14 may display information indicating the spectrogram template at the time when the learning end condition is satisfied, for example.

FIG. 4 is a diagram showing an example of the functional configuration of the control unit 10 in the embodiment. The control unit 10 includes a managed unit 101 and a management unit 102. The managed unit 101 executes the unit learning process. The managed unit 101 includes a sound source separation neural network 110, a loss acquisition unit 120, a template update unit 130, and a learning data acquisition unit 140.

The learning data acquisition unit 140 acquires the learning data input to the input unit 11 or the interface unit 12. Of the acquired learning data, the learning data acquisition unit 140 outputs the learning spectrogram X to the sound source separation neural network 110 and outputs the learning dominant sound source information Y to the loss acquisition unit 120. More specifically, the learning data acquisition unit 140 outputs the learning spectrogram X to the input information acquisition unit 111.

　管理部１０２は、被管理部１０１の動作を制御する。管理部１０２は、被管理部１０１の動作の制御として、例えば単位学習処理の実行を制御する。 The management unit 102 controls the operation of the managed unit 101. The management unit 102 controls, for example, the execution of the unit learning process as the operation control of the managed unit 101.

The management unit 102 controls, for example, the operations of the input unit 11, the interface unit 12, the storage unit 13, and the output unit 14. The management unit 102, for example, reads various information from the storage unit 13 and outputs it to the managed unit 101. The management unit 102, for example, acquires information input to the input unit 11 and outputs it to the managed unit 101. The management unit 102, for example, acquires information input to the input unit 11 and records it in the storage unit 13. The management unit 102, for example, acquires information input to the interface unit 12 and outputs it to the managed unit 101. The management unit 102, for example, acquires information input to the interface unit 12 and records it in the storage unit 13. The management unit 102, for example, causes the output unit 14 to output information input to the input unit 11.

　管理部１０２は、例えば単位学習処理の実行に用いられる情報と単位学習処理の実行によって生じた情報とを記憶部１３に記録する。 The management unit 102 records, for example, the information used for executing the unit learning process and the information generated by executing the unit learning process in the storage unit 13.

　図５は、実施形態における音源分離装置２のハードウェア構成の一例を示す図である。音源分離装置２は、バスで接続されたＣＰＵ等のプロセッサ９３とメモリ９４とを備える制御部２０を備え、プログラムを実行する。音源分離装置２は、プログラムの実行によって制御部２０、入力部２１、インタフェース部２２、記憶部２３及び出力部２４を備える装置として機能する。より具体的には、プロセッサ９３が記憶部２３に記憶されているプログラムを読み出し、読み出したプログラムをメモリ９４に記憶させる。プロセッサ９３が、メモリ９４に記憶させたプログラムを実行することによって、音源分離装置２は、制御部２０、入力部２１、インタフェース部２２、記憶部２３及び出力部２４を備える装置として機能する。 FIG. 5 is a diagram showing an example of the hardware configuration of the sound source separation device 2 in the embodiment. The sound source separation device 2 includes a control unit 20 including a processor 93 such as a CPU connected by a bus and a memory 94, and executes a program. The sound source separation device 2 functions as a device including a control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24 by executing a program. More specifically, the processor 93 reads out the program stored in the storage unit 23, and stores the read program in the memory 94. By executing the program stored in the memory 94 by the processor 93, the sound source separation device 2 functions as a device including a control unit 20, an input unit 21, an interface unit 22, a storage unit 23, and an output unit 24.

The control unit 20 controls the operation of the various functional units of the sound source separation device 2. The control unit 20 uses, for example, the sound source separation model obtained by the sound source separation model learning device 1 to separate the user-specified number of non-mixed sound signals from the mixed sound signal to be separated. For simplicity, the sound source separation device 2 is described below for the case where the user-specified number has already been input before the mixed sound signal to be separated is input to the sound source separation device 2.

The input unit 21 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 21 may be configured as an interface for connecting these input devices to its own device. The input unit 21 accepts input of various information to its own device. The input unit 21 accepts, for example, input of the user-specified number. The input unit 21 accepts, for example, an input instructing the start of the process of separating non-mixed sound signals from the mixed sound signal. The input unit 21 accepts, for example, input of the mixed sound signal to be separated.

The interface unit 22 includes a communication interface for connecting its own device to an external device. The interface unit 22 communicates with the external device by wire or wirelessly. The external device is, for example, the output destination of the non-mixed sound signals separated from the mixed sound signal. In such a case, the interface unit 22 outputs the non-mixed sound signals to the external device by communicating with the external device. The external device to which the non-mixed sound signals are output is, for example, a sound output device such as a speaker.

　外部装置は、例えば音源分離モデルを記憶したＵＳＢメモリ等の記憶装置であってもよい。外部装置が例えば音源分離モデルを記憶しており音源分離モデルを出力する場合、インタフェース部２２は外部装置との通信によって音源分離モデルを取得する。 The external device may be, for example, a storage device such as a USB memory that stores the sound source separation model. When the external device stores, for example, the sound source separation model and outputs the sound source separation model, the interface unit 22 acquires the sound source separation model by communicating with the external device.

　外部装置は、例えば混合音信号の出力元である。このような場合、インタフェース部２２は、外部装置との通信によって外部装置から混合音信号を取得する。 The external device is, for example, an output source of a mixed sound signal. In such a case, the interface unit 22 acquires the mixed sound signal from the external device by communicating with the external device.

　インタフェース部２２は、自装置を音源分離モデル学習装置１に接続するための通信インタフェースを含んで構成される。インタフェース部２２は、有線又は無線を介して音源分離モデル学習装置１と通信する。インタフェース部２２は、音源分離モデル学習装置１との通信により、音源分離モデル学習装置１から音源分離モデルを取得する。 The interface unit 22 includes a communication interface for connecting the own device to the sound source separation model learning device 1. The interface unit 22 communicates with the sound source separation model learning device 1 via wired or wireless. The interface unit 22 acquires a sound source separation model from the sound source separation model learning device 1 by communicating with the sound source separation model learning device 1.

The storage unit 23 is configured using a non-transitory computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 23 stores various information about the sound source separation device 2. The storage unit 23 stores, for example, the sound source separation model acquired via the interface unit 22. The storage unit 23 stores, for example, the user-specified number input via the input unit 21. The storage unit 23 stores the number of spectrogram templates.

　出力部２４は、各種情報を出力する。出力部２４は、例えばＣＲＴディスプレイや液晶ディスプレイ、有機ＥＬディスプレイ等の表示装置を含んで構成される。出力部２４は、これらの表示装置を自装置に接続するインタフェースとして構成されてもよい。出力部２４は、例えば入力部２１に入力された情報を出力する。出力部２４は、例えば混合音信号から非混合音信号を分離した際に用いたスペクトログラムテンプレートとスペクトログラムテンプレートに対応するテンプレート重みとを出力する。 The output unit 24 outputs various information. The output unit 24 includes display devices such as a CRT display, a liquid crystal display, and an organic EL display. The output unit 24 may be configured as an interface for connecting these display devices to the own device. The output unit 24 outputs, for example, the information input to the input unit 21. The output unit 24 outputs, for example, the spectrogram template used when the non-mixed sound signal is separated from the mixed sound signal and the template weight corresponding to the spectrogram template.

　図６は、実施形態における制御部２０の機能構成の一例を示す図である。制御部２０は、分離対象取得部２０１、スペクトログラム取得部２０２、分離情報取得部２０３、非混合音信号生成部２０４、音信号出力制御部２０５及びインタフェース制御部２０６を備える。 FIG. 6 is a diagram showing an example of the functional configuration of the control unit 20 in the embodiment. The control unit 20 includes a separation target acquisition unit 201, a spectrogram acquisition unit 202, a separation information acquisition unit 203, a non-mixed sound signal generation unit 204, a sound signal output control unit 205, and an interface control unit 206.

　分離対象取得部２０１は、分離対象の混合音信号を取得する。分離対象取得部２０１は、例えば入力部２１に入力された混合音信号を取得する。分離対象取得部２０１は、例えばインタフェース部２２に入力された混合音信号を取得する。 The separation target acquisition unit 201 acquires the mixed sound signal to be separated. The separation target acquisition unit 201 acquires, for example, the mixed sound signal input to the input unit 21. The separation target acquisition unit 201 acquires, for example, the mixed sound signal input to the interface unit 22.

The spectrogram acquisition unit 202 acquires a spectrogram of the mixed sound signal acquired by the separation target acquisition unit 201 (hereinafter referred to as the "separation target spectrogram"). Any method may be used to acquire the spectrogram as long as a spectrogram can be acquired from the mixed sound signal. For example, the method may apply a short-time Fourier transform to the waveform of the mixed sound signal and then acquire an amplitude spectrogram by extracting only the amplitude information. The acquired spectrogram is output to the separation information acquisition unit 203.
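One possible way of acquiring such an amplitude spectrogram is sketched below using only the Python standard library. The frame length, the hop size, the symmetric Hamming window, and the name `amplitude_spectrogram` are assumptions chosen for this example; the embodiment does not prescribe them for the spectrogram acquisition unit 202.

```python
import cmath
import math


def amplitude_spectrogram(signal, frame_len=8, hop=4):
    """Return X[f][n], the magnitude of a Hamming-windowed short-time Fourier transform."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    num_bins = frame_len // 2 + 1          # non-negative frequencies only
    starts = range(0, len(signal) - frame_len + 1, hop)
    spec = [[0.0] * len(starts) for _ in range(num_bins)]
    for n, start in enumerate(starts):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        for f in range(num_bins):
            coeff = sum(x * cmath.exp(-2j * math.pi * f * i / frame_len)
                        for i, x in enumerate(frame))
            spec[f][n] = abs(coeff)        # keep the amplitude, discard the phase
    return spec


# A sinusoid that completes two cycles per 8-sample frame peaks in bin 2.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]
X = amplitude_spectrogram(signal)
peak_bin = max(range(len(X)), key=lambda f: X[f][0])
```

The direct summation is quadratic in the frame length and is used here only for readability; a practical implementation would normally rely on a fast Fourier transform.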

The separation information acquisition unit 203 uses the sound source separation model, based on the separation target spectrogram, to acquire the estimated dominant sound source information V for each of the user-specified number of non-mixed sound signals included in the mixed sound signal to be separated. Note that spectrogram templates for all of the sound sources used for learning are available in the sound source separation model. Therefore, when the user-specified number is two or more, the sound source separation model can separate all of the sound sources used for learning.

　非混合音信号生成部２０４は、分離対象の混合音信号と、分離対象スペクトログラムと、分離情報取得部２０３が取得した推定支配音源情報Ｖと、を用いて、非混合音信号を生成する。非混合音信号生成部２０４は、例えば推定支配音源情報Ｖを入力振幅スペクトログラムに乗じ、Ｇｒｉｆｆｉｎ－Ｌｉｍ法等の位相再構成法に基づき位相情報を付加した上で逆短時間フーリエ変換を適用することによって、非混合音信号を生成する。このようにして、非混合音信号生成部２０４は、分離対象の混合音信号から非混合音信号を分離する。分離された非混合音信号は音信号出力制御部２０５に出力される。 The non-mixed sound signal generation unit 204 generates a non-mixed sound signal by using the mixed sound signal to be separated, the spectrogram to be separated, and the estimated dominant sound source information V acquired by the separation information acquisition unit 203. For example, the non-mixed sound signal generation unit 204 multiplies the estimated dominant sound source information V by the input amplitude spectrogram, adds the phase information based on the phase reconstruction method such as the Griffin-Lim method, and then applies the inverse short-time Fourier transform. Generates a non-mixed sound signal. In this way, the non-mixed sound signal generation unit 204 separates the non-mixed sound signal from the mixed sound signal to be separated. The separated non-mixed sound signal is output to the sound signal output control unit 205.
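The masking step of this generation process can be sketched as follows. Only the element-wise multiplication of the estimated dominant sound source information V with the amplitude spectrogram is shown; phase reconstruction by the Griffin-Lim method and the inverse short-time Fourier transform are omitted. The numerical values and the name `apply_mask` are hypothetical.

```python
def apply_mask(mask, amplitude):
    """Multiply V[f][n] by the mixture amplitude X[f][n] element by element."""
    return [[v * x for v, x in zip(mask_row, amp_row)]
            for mask_row, amp_row in zip(mask, amplitude)]


# Hypothetical 2 x 2 amplitude spectrogram of the mixture and per-speaker masks.
mixture_amplitude = [[1.0, 0.5], [0.2, 0.8]]
V_speaker0 = [[0.9, 0.0], [0.5, 0.25]]
V_speaker1 = [[0.1, 1.0], [0.5, 0.75]]
estimate0 = apply_mask(V_speaker0, mixture_amplitude)
estimate1 = apply_mask(V_speaker1, mixture_amplitude)
```

Each result is an estimated amplitude spectrogram of one sound source, to which phase information would then be added before conversion back to a waveform.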

　音信号出力制御部２０５は、インタフェース部２２の動作を制御する。音信号出力制御部２０５は、インタフェース部２２の動作を制御することでインタフェース部２２に分離した非混合音信号を出力させる。 The sound signal output control unit 205 controls the operation of the interface unit 22. The sound signal output control unit 205 controls the operation of the interface unit 22 so that the interface unit 22 outputs a separated non-mixed sound signal.

　図７は、実施形態における音源分離モデル学習装置１が実行する処理の流れの一例を示すフローチャートである。より具体的には、図７は単位学習処理の流れの一例を示すフローチャートである。音源分離モデル学習装置１は、学習用データが入力されるたびに図７に示す単位学習処理を実行し音源分離モデルを得る。 FIG. 7 is a flowchart showing an example of the flow of processing executed by the sound source separation model learning device 1 in the embodiment. More specifically, FIG. 7 is a flowchart showing an example of the flow of the unit learning process. The sound source separation model learning device 1 executes the unit learning process shown in FIG. 7 every time the learning data is input to obtain a sound source separation model.

　入力部１１又はインタフェース部１２に学習用データが入力される（ステップＳ１０１）。次に入力情報取得部１１１が学習用データに含まれる学習用スペクトログラムＸを取得する（ステップＳ１０２）。次に構成情報推定部１１２が、学習用スペクトログラムＸに基づき重み推定モデルを用いてテンプレート重みを推定する（ステップＳ１０３）。 Learning data is input to the input unit 11 or the interface unit 12 (step S101). Next, the input information acquisition unit 111 acquires the learning spectrogram X included in the learning data (step S102). Next, the configuration information estimation unit 112 estimates the template weight using the weight estimation model based on the learning spectrogram X (step S103).

　ステップＳ１０３の次に、支配音源情報推定部１１３が、スペクトログラムテンプレートとテンプレート重みとに基づき推定支配音源情報Ｖを推定する（ステップＳ１０４）。次に損失取得部１２０は、推定支配音源情報Ｖと学習用データに含まれる学習用支配音源情報Ｙとの間の違い（すなわち損失）を取得する（ステップＳ１０５）。次に、テンプレート更新部１３０が損失を小さくするようにスペクトログラムテンプレートを更新し、構成情報推定部１１２が損失を小さくするように重み推定モデルを更新する（ステップＳ１０６）。 Next to step S103, the dominant sound source information estimation unit 113 estimates the estimated dominant sound source information V based on the spectrogram template and the template weight (step S104). Next, the loss acquisition unit 120 acquires the difference (that is, the loss) between the estimated dominant sound source information V and the learning dominant sound source information Y included in the learning data (step S105). Next, the template update unit 130 updates the spectrogram template so as to reduce the loss, and the configuration information estimation unit 112 updates the weight estimation model so as to reduce the loss (step S106).
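Steps S104 to S106 can be illustrated with the following toy sketch. It is heavily simplified and rests on assumptions not stated in the embodiment: a single one-dimensional template, template weights that are held fixed instead of being produced by a learned weight estimation model, a squared-error loss, and plain gradient descent with a hypothetical learning rate `lr`.

```python
def forward(template, weights):
    """Convolve one 1-D template with its weights: V[n] = sum_m template[m] * weights[n - m]."""
    return [sum(t * weights[n - m] for m, t in enumerate(template) if n - m >= 0)
            for n in range(len(weights))]


def loss(v, y):
    """Squared error between the estimate V and the target Y (assumed loss)."""
    return sum((a - b) ** 2 for a, b in zip(v, y))


def unit_learning_step(template, weights, y, lr=0.05):
    """One pass of steps S104 to S106, updating the template only."""
    v = forward(template, weights)                 # S104: estimate V
    current = loss(v, y)                           # S105: acquire the loss
    # S106: analytic gradient of the squared error for each template entry
    grads = [sum(2.0 * (v[n] - y[n]) * weights[n - m]
                 for n in range(m, len(weights)))
             for m in range(len(template))]
    updated = [t - lr * g for t, g in zip(template, grads)]
    return updated, current


weights = [1.0, 0.0, 1.0, 0.0]
target = forward([1.0, 0.5], weights)              # Y generated by a known template
template = [0.2, 0.2]
losses = []
for _ in range(50):
    template, value = unit_learning_step(template, weights, target)
    losses.append(value)
```

Repeating the step drives the template toward the one that generated the target, which mirrors how repeated unit learning processes reduce the loss.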

　図８は、実施形態における音源分離装置２が実行する処理の流れの一例を示すフローチャートである。以下説明の簡単のため、ユーザ指定数は予め音源分離装置２に入力済みであり、入力されたユーザ指定数は記憶部２３に記憶済みである場合を例に音源分離装置２が実行する処理の流れの一例を説明する。 FIG. 8 is a flowchart showing an example of the flow of processing executed by the sound source separation device 2 in the embodiment. For the sake of simplicity of the following explanation, the process executed by the sound source separation device 2 is performed by taking as an example the case where the user-specified number has been input to the sound source separation device 2 in advance and the input user-specified number has been stored in the storage unit 23. An example of the flow will be described.

The separation target acquisition unit 201 acquires the mixed sound signal to be separated that has been input to the input unit 21 or the interface unit 22 (step S201). Next, the spectrogram acquisition unit 202 acquires the separation target spectrogram using the mixed sound signal to be separated (step S202). Next, the separation information acquisition unit 203 uses the sound source separation model, based on the separation target spectrogram, to acquire the estimated dominant sound source information V for each of the user-specified number of non-mixed sound signals included in the mixed sound signal to be separated (step S203).

　次に非混合音信号生成部２０４が、分離対象の混合音信号と、分離対象スペクトログラムと、分離情報取得部２０３が取得した推定支配音源情報Ｖと、を用いて、混合音信号から非混合音信号を分離する（ステップＳ２０４）。次に音信号出力制御部２０５が、インタフェース部２２の動作を制御することでインタフェース部２２に分離した非混合音信号を出力させる（ステップＳ２０５）。 Next, the non-mixed sound signal generation unit 204 uses the mixed sound signal to be separated, the spectrogram to be separated, and the estimated dominant sound source information V acquired by the separation information acquisition unit 203, and the non-mixed sound from the mixed sound signal. Separate the signals (step S204). Next, the sound signal output control unit 205 controls the operation of the interface unit 22 so that the interface unit 22 outputs the separated non-mixed sound signal (step S205).

<Experimental results>
The results of an experiment in which speech was separated using the sound source separation system 100 (hereinafter referred to as the "separation experiment") are described below. In the separation experiment, speech data from the CMU Arctic speech databases (see Reference 2) was used as the mixed sound signal. As the learning data, 1000 utterances each of speaker 0 (bdl) and speaker 1 (clb) were used.

　参考文献２：J. Kominek and A. W. Black,“The CMU Arctic speech databases”, In 5th ISCA Speech Synthesis　Workshop, pp.223-224, 2004. Reference 2: J. Kominek and A. W. Black, “The CMU Arctic speech databases”, In 5th ISCA Speech Synthesis Workshop, pp.223-224, 2004.

The learning data was created as follows. First, a short-time Fourier transform with a Hamming window was applied to the signal of one utterance each from speaker 0 and speaker 1. Next, each signal after the short-time Fourier transform was multiplied by a weight drawn from the uniform distribution on the closed interval from 0 to 1, yielding a spectrogram X{~}^(d) for each speaker. In the separation experiment, d is 0 or 1, where 0 indicates speaker 0 and 1 indicates speaker 1. Note that X{~} denotes the symbol represented by the following equation (11).

　また、Ｘ｛～｝^（ｑ）は、以下の式（１２）で表される記号を意味する。 Further, X {~} ^(q) means a symbol represented by the following equation (12).

Next, the spectrograms X{~}^(d) were summed to calculate the complex spectrogram X{~} of the mixed signal. That is, X{~} = X{~}^(0) + X{~}^(1). Next, the amplitude spectrogram X_{f,n}, scaled so that its maximum value is 1, was acquired as the input X = (X_{f,n})_{f,n} to the proposed model. The amplitude spectrogram X_{f,n} is represented by the following equation (13).
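The mixing and scaling described above can be sketched as follows. The tiny complex spectrograms, the seeded random number generator, and the name `mix_and_scale` are assumptions for this illustration, and the sketch presumes that the mixture is not identically zero.

```python
import random


def mix_and_scale(spec0, spec1, rng):
    """Mix two complex spectrograms with random gains and scale the magnitude to a maximum of 1."""
    gain0 = rng.uniform(0.0, 1.0)      # weight applied to speaker 0
    gain1 = rng.uniform(0.0, 1.0)      # weight applied to speaker 1
    mixture = [[gain0 * a + gain1 * b for a, b in zip(row0, row1)]
               for row0, row1 in zip(spec0, spec1)]
    peak = max(abs(x) for row in mixture for x in row)
    amplitude = [[abs(x) / peak for x in row] for row in mixture]
    return amplitude, (gain0, gain1)


rng = random.Random(0)                 # fixed seed for reproducibility of the example
spec0 = [[1 + 1j, 0j], [0.5j, 2 + 0j]]
spec1 = [[0j, 1 - 1j], [1 + 0j, 0j]]
X, gains = mix_and_scale(spec0, spec1, rng)
```

Dividing by the largest magnitude makes the input independent of the overall level of the mixture while preserving the relative levels of the two speakers.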

　また、分離実験では各時間周波数点（ｆ、ｎ) について、以下の式（１４）を満たすものを無音として扱った。 Also, in the separation experiment, for each time frequency point (f, n), those satisfying the following equation (14) were treated as silence.

　また、分離実験では、各時間周波数点（ｆ、ｎ）の支配的な話者を示す学習用支配音源情報Ｙとして以下の式（１５）で表される情報を用いた。式（１５）の左辺が分離実験で用いた学習用支配音源情報Ｙを表す。 Further, in the separation experiment, the information represented by the following equation (15) was used as the learning dominant sound source information Y indicating the dominant speaker at each time frequency point (f, n). The left side of the equation (15) represents the learning dominant sound source information Y used in the separation experiment.
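A possible construction of such dominant sound source information is sketched below for two speakers. Because equations (14) and (15) are not reproduced here, the silence test (a threshold on the larger of the two amplitudes), the labelling of silent points with 0 for both speakers, and the tie-breaking toward speaker 0 are all assumptions of this example, as is the name `dominant_source_labels`.

```python
def dominant_source_labels(amp0, amp1, silence_threshold=1e-3):
    """Return Y[d][f][n]: 1 where source d is the louder one, 0 elsewhere and at silent points."""
    labels = [[[0] * len(row) for row in amp0] for _ in range(2)]
    for f, (row0, row1) in enumerate(zip(amp0, amp1)):
        for n, (a0, a1) in enumerate(zip(row0, row1)):
            if max(a0, a1) < silence_threshold:    # hypothetical silence test
                continue
            labels[0 if a0 >= a1 else 1][f][n] = 1
    return labels


# Hypothetical per-speaker amplitude spectrograms (2 frequency bins, 2 frames).
amp0 = [[0.9, 0.0], [0.2, 0.0005]]
amp1 = [[0.1, 0.7], [0.2, 0.0001]]
Y = dominant_source_labels(amp0, amp1)
```

Each time frequency point is thus assigned to at most one speaker, and points treated as silent are assigned to neither.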

　テストデータの作成には、話者０（ｂｄｌ）と話者１（ｃｌｂ）の音声をそれぞれ６６発話ずつ用いた。テストデータの作成方法は学習用データと同様であるが、短時間フーリエ変換の適用後に乗じる重みはどちらの話者に関しても１にした。 To create the test data, 66 utterances of each of the voices of speaker 0 (bdl) and speaker 1 (clb) were used. The method of creating the test data is the same as that of the training data, but the weight to be multiplied after applying the short-time Fourier transform is set to 1 for both speakers.

　図９は、実施形態における分離実験の第１の結果を示す図である。具体的には図９は、５００エポックの学習によって得られた音源分離モデルを用いたテストデータのスペクトログラムの一例である。図９の結果Ｒ１が話者０のスペクトログラムであり、図９の結果Ｒ２が話者１のスペクトログラムである。 FIG. 9 is a diagram showing the first result of the separation experiment in the embodiment. Specifically, FIG. 9 is an example of a spectrogram of test data using a sound source separation model obtained by learning 500 epochs. The result R1 in FIG. 9 is the spectrogram of speaker 0, and the result R2 in FIG. 9 is the spectrogram of speaker 1.

FIG. 10 is a diagram showing the second result of the separation experiment in the embodiment. Specifically, FIG. 10 shows the dominant sound source information of the correct answer data for the test data of FIG. 9. The result R3 in FIG. 10 is the correct answer data corresponding to speaker 0, and the result R4 in FIG. 10 is the correct answer data corresponding to speaker 1.

FIG. 11 is a diagram showing the third result of the separation experiment in the embodiment. Specifically, FIG. 11 shows the estimation result of the sound source separation device 2 for the test data of FIG. 9 before normalization. The result R5 in FIG. 11 is the estimation result corresponding to speaker 0, and the result R6 in FIG. 11 is the estimation result corresponding to speaker 1.

FIG. 12 is a diagram showing the fourth result of the separation experiment in the embodiment. Specifically, FIG. 12 shows the estimation result of the sound source separation device 2 for the test data of FIG. 9 after normalization. The result R7 in FIG. 12 is the estimation result corresponding to speaker 0, and the result R8 in FIG. 12 is the estimation result corresponding to speaker 1.
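FIGS. 11 and 12 distinguish estimates before and after normalization, but the normalization formula itself is not given in this passage. As a rough illustration only, the sketch below assumes that normalization means dividing each speaker's estimate at a time-frequency point by the sum over all speakers at that point. The function name `normalize_over_sources` and the stabilizer `eps` are hypothetical.

```python
def normalize_over_sources(est, eps=1e-12):
    """Normalize per-speaker estimates so that they sum to about 1 per point.

    est[d][f][n] is the unnormalized, non-negative estimate for speaker d at
    time-frequency point (f, n). Dividing by the sum over speakers is an
    assumption; eps only guards against division by zero.
    """
    D, F, N = len(est), len(est[0]), len(est[0][0])
    out = [[[0.0] * N for _ in range(F)] for _ in range(D)]
    for f in range(F):
        for n in range(N):
            total = sum(est[d][f][n] for d in range(D))
            for d in range(D):
                out[d][f][n] = est[d][f][n] / (total + eps)
    return out
```

Under this assumption, the normalized values can be read as the share of each speaker at each point, which is convenient for comparison with binary dominant sound source information.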

FIG. 13 is a diagram showing the fifth result of the separation experiment in the embodiment. Specifically, FIG. 13 shows the spectrogram templates acquired by the sound source separation device 2 for the test data of FIG. 9. The result R9 in FIG. 13 shows the spectrogram templates corresponding to speaker 0, and the result R10 in FIG. 13 shows the spectrogram templates corresponding to speaker 1. FIG. 13 shows five spectrogram templates in ascending order of j, where j is a number that distinguishes the spectrogram templates from one another. The horizontal axis of each spectrogram template represents time, and the vertical axis represents frequency.

FIG. 14 is a diagram showing the sixth result of the separation experiment in the embodiment. Specifically, FIG. 14 shows the template weights corresponding to speaker 0 that the sound source separation device 2 acquired for the test data of FIG. 9. In FIG. 14, R11-0, R11-1, R11-2, R11-3, and R11-4 show the template weights corresponding to speaker 0 for j = 0, j = 1, j = 2, j = 3, and j = 4 in FIG. 13, respectively.

FIG. 15 is a diagram showing the seventh result of the separation experiment in the embodiment. Specifically, FIG. 15 shows the template weights corresponding to speaker 1 that the sound source separation device 2 acquired for the test data of FIG. 9. In FIG. 15, R12-0, R12-1, R12-2, R12-3, and R12-4 show the template weights corresponding to speaker 1 for j = 0, j = 1, j = 2, j = 3, and j = 4 in FIG. 13, respectively.

The experimental results of FIGS. 13 to 15 show how the sound source separation device 2 distinguished between the speakers. The results of the separation experiment therefore show that the sound source separation system 100 facilitates interpretation of the trained model.

The sound source separation system 100 of the embodiment configured in this way estimates the spectrogram templates and the template weights, and learns so as to reduce the loss based on the estimation result. Specifically, with the sound source separation system 100, a user can look at the spectrogram templates and their weights to grasp both the frequency patterns used to separate the sound sources of the input signal and how those patterns change over time. Here, a frequency pattern is information representing the distribution of energy over frequency. The user can therefore learn at least how the frequency patterns changed over time when the sound sources were separated, and can use that information to interpret the trained model. In this way, the sound source separation system 100 facilitates interpretation of the trained model.
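The relationship between the templates and their weights can be pictured with a short sketch. It assumes that the "composite product" of a template and its weight is a convolution along the time axis, in the style of convolutive non-negative matrix factorization; the formula is not given in this passage, so this reading, the function name `composite_product`, and the array layout are all assumptions.

```python
def composite_product(templates, weights):
    """Combine spectrogram templates with time-varying weights.

    templates[j][f][t] is template j at frequency f and offset t within one
    time section. weights[j][n] is the weight of template j at frame n.
    Returns out[f][n], the sum over j and t of
    templates[j][f][t] * weights[j][n - t].
    Treating the composite product as this time-axis convolution is an
    assumption.
    """
    J, F, T = len(templates), len(templates[0]), len(templates[0][0])
    N = len(weights[0])
    out = [[0.0] * N for _ in range(F)]
    for j in range(J):
        for f in range(F):
            for t in range(T):
                for n in range(t, N):
                    out[f][n] += templates[j][f][t] * weights[j][n - t]
    return out
```

Read this way, each template is a short frequency pattern and each weight sequence says when, and how strongly, that pattern occurs, which is the information a user inspects in FIGS. 13 to 15.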

Further, the sound source separation system 100 of the embodiment configured in this way learns so that the values of the spectrogram templates and the template weights are non-negative. Because neither the spectrogram template values nor the template weight values can then be negative, the physical meaning of the spectrogram templates and of the template weights is easier to interpret. The sound source separation system 100 configured in this way therefore facilitates interpretation of the trained model.
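This passage does not say how non-negativity is enforced. One common option, shown here purely as an illustrative assumption, is to keep unconstrained parameters internally and pass them through a softplus function so that the resulting template or weight values are never negative. The names `softplus` and `nonnegative_values` are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def nonnegative_values(raw):
    """Map a 2-D grid of unconstrained parameters to non-negative values.

    Using softplus for this purpose is an assumption; clipping at zero after
    each update would be another way to obtain the same property.
    """
    return [[softplus(v) for v in row] for row in raw]
```

Either approach yields templates and weights that can be read directly as energies and activations.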

(Modification)
When the numerator on the right side of equation (10) is interpreted as the amplitude spectrogram of sound source d, it is desirable that the sum of the right side of equation (10) over all sound sources d closely approximates the learning spectrogram X (hereinafter, the "regularization condition"). This regularization condition amounts to reducing the loss represented by the following equation (16).

The first term on the right side of equation (16) is the value of the left side of equation (3). In equation (17), D(A||B) is a non-negative function that outputs 0 when A and B match and outputs a larger value as the difference between A and B increases. For example, D(A||B) may be |A - B|^2. In equation (16), λ is a non-negative constant representing the strength of the regularization.

Equation (17) is a term (regularization term) representing the error between the sum of the right side of equation (10) over all sound sources d and the learning spectrogram X. By learning so as to reduce the loss represented by equation (16), the sound source separation model learning device 1 can reduce the difference between the sum of the right side of equation (10) over all sound sources d and the learning spectrogram X. Specifically, this is achieved when the loss acquisition unit 120 acquires the loss represented by equation (16) instead of the loss represented by equation (3).
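Equations (3), (16), and (17) are not reproduced in this text. The sketch below therefore only illustrates the structure described above, under the assumption that the regularized loss is the original loss plus λ times a squared-error regularization term, with D(A||B) = |A - B|^2 summed over all time-frequency points. The function name `regularized_loss` and its argument names are hypothetical.

```python
def regularized_loss(base_loss, X, source_estimates, lam):
    """Add a reconstruction penalty to an existing loss value.

    base_loss: the loss of equation (3), computed elsewhere.
    X[f][n]: the learning spectrogram.
    source_estimates[d][f][n]: the per-source values whose sum over d
        should approximate X.
    lam: non-negative regularization strength (lambda).
    The additive form and the squared-error choice of D are assumptions.
    """
    D = len(source_estimates)
    penalty = 0.0
    for f in range(len(X)):
        for n in range(len(X[0])):
            total = sum(source_estimates[d][f][n] for d in range(D))
            penalty += (X[f][n] - total) ** 2
    return base_loss + lam * penalty
```

Setting `lam` to 0 recovers the original loss, so this form lets the strength of the regularization condition be tuned without changing the rest of the training procedure.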

The sound source separation device 2 does not necessarily have to include the spectrogram acquisition unit 202. In that case, the spectrogram to be separated is input directly to the sound source separation device 2.

The sound source separation model learning device 1 and the sound source separation device 2 may be implemented using a plurality of information processing devices communicably connected via a network. The functional units of the sound source separation model learning device 1 may be distributed across a plurality of information processing devices. The template update unit 130 may be included in the dominant sound source information estimation unit 113.

The non-mixed sound signal generation unit 204 is an example of the separation unit. The configuration information estimation unit 112 is an example of the weight estimation unit. A trained model is easier to interpret when the spectrogram templates are non-negative than when they are not, but the spectrogram templates do not necessarily have to be non-negative. Likewise, a trained model is easier to interpret when the template weights are non-negative than when they are not, but the template weights do not necessarily have to be non-negative.

All or some of the functions of the sound source separation model learning device 1 and the sound source separation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted over a telecommunication line.

The embodiment of the present invention has been described above in detail with reference to the drawings, but the specific configuration is not limited to this embodiment and also includes designs and the like within a scope that does not depart from the gist of the present invention.

100 ... Sound source separation system, 1 ... Sound source separation model learning device, 2 ... Sound source separation device, 10 ... Control unit, 11 ... Input unit, 12 ... Interface unit, 13 ... Storage unit, 14 ... Output unit, 101 ... Managed unit, 102 ... Management unit, 110 ... Sound source separation neural network, 111 ... Input information acquisition unit, 112 ... Configuration information estimation unit, 113 ... Dominant sound source information estimation unit, 120 ... Loss acquisition unit, 130 ... Template update unit, 140 ... Learning data acquisition unit, 20 ... Control unit, 21 ... Input unit, 22 ... Interface unit, 23 ... Storage unit, 24 ... Output unit, 201 ... Separation target acquisition unit, 202 ... Spectrogram acquisition unit, 203 ... Separation information acquisition unit, 204 ... Non-mixed sound signal generation unit, 205 ... Sound signal output control unit, 206 ... Interface control unit, 91 ... Processor, 92 ... Memory, 93 ... Processor, 94 ... Memory

Claims

A sound source separation model learning device comprising:
a learning data acquisition unit that acquires a spectrogram of a mixed signal in which a plurality of sounds are mixed, and dominant sound source information indicating, for each time-frequency point of the spectrogram, whether or not a target sound source is dominant;
a weight estimation unit that estimates a weight used for estimating a composite product using a template, the template being information representing one or more values that relate to the spectrogram and that are located at time-frequency points belonging to one section of the spectrogram divided in the time axis direction;
a dominant sound source information estimation unit that acquires an estimation result of the dominant sound source information based on the composite product; and
a loss acquisition unit that acquires a difference between the estimation result and the dominant sound source information,
wherein the template and the weight used for estimating the composite product indicate an estimation result regarding a spectrogram of the target sound source, and
the weight estimation unit trains a machine learning model that estimates the weight so as to reduce the difference.

The sound source separation model learning device according to claim 1, further comprising a template update unit that updates the template based on the difference.

The sound source separation model learning device according to claim 1 or 2, wherein the value of the template is a non-negative value.

The sound source separation model learning device according to any one of claims 1 to 3, wherein the weight estimation unit acquires a non-negative value as the value of the weight.

The sound source separation model learning device according to any one of claims 1 to 4, further comprising an output control unit that causes an output unit, which outputs the template and the weight, to output the template and the weight.

A sound source separation device comprising:
a spectrogram acquisition unit that acquires a spectrogram of a mixed signal in which a plurality of sounds are mixed; and
a separation unit that separates sound signals of a plurality of sound sources included in the mixed signal to be separated, using the template and the machine learning model learned by a sound source separation model learning device,
wherein the sound source separation model learning device comprises:
a learning data acquisition unit that acquires a spectrogram of a mixed signal in which a plurality of sounds are mixed, and dominant sound source information indicating, for each time-frequency point of the spectrogram, whether or not a target sound source is dominant;
a weight estimation unit that estimates a weight used for estimating a composite product using a template, the template being information representing one or more values that relate to the spectrogram and that are located at time-frequency points belonging to one section of the spectrogram divided in the time axis direction;
a dominant sound source information estimation unit that acquires an estimation result of the dominant sound source information based on the composite product;
a loss acquisition unit that acquires a difference between the estimation result and the dominant sound source information; and
an output unit that outputs the template and the weight,
the template and the weight used for estimating the composite product indicate an estimation result regarding a spectrogram of the target sound source, and
the weight estimation unit trains a machine learning model that estimates the weight so as to reduce the difference.

A sound source separation model learning method comprising:
a learning data acquisition step of acquiring a spectrogram of a mixed signal in which a plurality of sounds are mixed, and dominant sound source information indicating, for each time-frequency point of the spectrogram, whether or not a target sound source is dominant;
a weight estimation step of estimating a weight used for estimating a composite product using a template, the template being information representing one or more values that relate to the spectrogram and that are located at time-frequency points belonging to one section of the spectrogram divided in the time axis direction;
a dominant sound source information estimation step of acquiring an estimation result of the dominant sound source information based on the composite product; and
a loss acquisition step of acquiring a difference between the estimation result and the dominant sound source information,
wherein the template and the weight used for estimating the composite product indicate an estimation result regarding a spectrogram of the target sound source, and
in the weight estimation step, a machine learning model that estimates the weight is trained so as to reduce the difference.

A program for causing a computer to function as the sound source separation model learning device according to any one of claims 1 to 5.