
WO2022166219A1 - Voice diarization method and voice recording apparatus thereof - Google Patents

Voice diarization method and voice recording apparatus thereof

Info

Publication number
WO2022166219A1
WO2022166219A1 (application PCT/CN2021/120414)
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
label
intermediate state
state
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/120414
Other languages
French (fr)
Chinese (zh)
Inventor
陈文明
陈新磊
张洁
张世明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Emeet Technology Co Ltd
Original Assignee
Shenzhen Emeet Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Emeet Technology Co Ltd filed Critical Shenzhen Emeet Technology Co Ltd
Publication of WO2022166219A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use for comparison or discrimination

Definitions

  • The present invention relates to the technical field of audio, and in particular to the technical field of voice discrimination (speaker diarization).
  • Speaker diarization is one of the issues worth studying in depth. Unlike speech recognition and speech separation, speaker diarization is not concerned with who the speaker is or what the speaker said; it focuses on the question of "who spoke at what time", i.e., on the differences between speakers. Once the user has obtained the separated voices of different speakers, operations such as speech recognition can be performed with improved accuracy.
  • Traditional speaker diarization methods are mainly based on clustering. Most of these algorithms operate offline: a complete piece of speech is obtained and segmented (or framed) with a sliding window; within each window, a Fourier transform is performed, Mel-frequency cepstral coefficient (MFCC) or spectral features are extracted, and the features are mapped to a high-dimensional space. The window is then moved with a certain proportion of overlap, trying to ensure that each window contains the speech of only one speaker, and the embedding of the next window's speech features in the high-dimensional space is computed. By comparing the embedding features of different segments, it is judged whether two segments of speech belong to the same speaker.
  • The difference is usually measured by computing the cosine similarity of the two embeddings or their Euclidean distance in the multi-dimensional space.
  • When the cosine similarity or Euclidean distance is greater than a certain threshold, the two are considered different, i.e., the two segments belong to different speakers; below the threshold, the two segments are considered to belong to the same speaker. The threshold is often set from experience or by testing on some labeled data. (A minimal sketch of this threshold rule follows below.)
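As a concrete illustration of this threshold rule, the following minimal Python sketch (not from the patent; the function name, the default metric, and the threshold value are assumptions) compares two window embeddings:

```python
import numpy as np

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray,
                 threshold: float = 0.4, metric: str = "cosine") -> bool:
    """Decide whether two speech-window embeddings belong to the same speaker.

    A difference score below the threshold means "same speaker"; the
    threshold itself would be tuned on labeled data, as noted above.
    """
    if metric == "cosine":
        # Cosine distance = 1 - cosine similarity, so larger means more different.
        sim = float(emb_a @ emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
        diff = 1.0 - sim
    else:
        diff = float(np.linalg.norm(emb_a - emb_b))  # Euclidean distance
    return diff < threshold
```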
  • However, the speech features used by clustering algorithms, such as spectral and amplitude-spectrum features, do not reflect the differences between speakers well when modeling speaker characteristics.
  • Moreover, once clustering reaches a certain level, further training data brings only a very limited improvement in clustering accuracy, which limits the accuracy of voice discrimination.
  • The present application provides a voice distinguishing method, and a voice recording apparatus thereof, that can improve discrimination accuracy.
  • In one aspect, a speech analysis method includes: obtaining a single-person acoustic feature from multi-channel audio data; obtaining an intermediate state of the single-person acoustic feature using a preset recurrent-recursive neural network (RNN) and storing the intermediate state in a state sequence buffer; in the state sequence buffer, running a clustering algorithm on all intermediate states in the buffer to obtain at least one cluster; calculating the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster; and determining the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature.
  • In another aspect, a voice recording device includes: an acoustic feature acquisition unit that extracts single-person acoustic features from multi-channel audio data; and an intermediate state buffer unit that obtains the intermediate state of the single-person acoustic feature using the preset recurrent-recursive neural network, stores the intermediate state in the state sequence buffer, runs a clustering algorithm on all intermediate states in the state sequence buffer to obtain at least one cluster, calculates the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster, and determines the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature.
  • The beneficial effect of the present application is that the trained neural network predicts the intermediate state of a single person's speech data, the intermediate state is then sent to the state buffer for clustering, and the corresponding cluster label is determined. The clustering process is thereby separated from the neural network, which makes the clustering process easy to optimize and improves discrimination accuracy.
  • FIG. 1 is a flowchart of a method for distinguishing speech according to Embodiment 1 of the present application.
  • FIG. 2 is a schematic diagram of a specific flow of S110 in Embodiment 1 of the present application.
  • FIG. 3 is a flowchart of the supervised training process of the recurrent-recursive neural network in Embodiment 1 of the present application.
  • FIG. 4 is a schematic diagram of the supervised training process of the recurrent-recursive neural network in Embodiment 1 of the present application.
  • FIG. 5 is a schematic diagram of the testing process of the recurrent-recursive neural network in Embodiment 1 of the present application.
  • FIG. 6 is a schematic diagram of an update process in the status buffer in Embodiment 1 of the present application.
  • FIG. 7 is a schematic block diagram of a voice recording apparatus according to Embodiment 2 of the present application.
  • FIG. 8 is a schematic structural diagram of a voice recording apparatus according to Embodiment 3 of the present application.
  • The embodiments of the present application can be applied to various voice recording apparatuses with a voice distinguishing function.
  • For example: a voice recorder, an audio conference terminal, an intelligent conference recording device, or an intelligent electronic device with a recording function, etc.
  • FIG. 1 illustrates a method for distinguishing speech according to Embodiment 1 of the present application.
  • Voice discrimination refers to judging which speaker a piece of voice information corresponds to, that is, distinguishing voice information coming from different sound sources.
  • Sound source distinction does not require the complete speech emitted by the sound source; only a part of it is needed, such as a sentence, or even a single word or fragment within a sentence.
  • The voice distinguishing method 100 includes:
  • S110: obtain single-person acoustic features from multi-channel audio data; optionally, the single-person acoustic features are high-dimensional vector features;
  • As shown in FIG. 2, S110, obtaining single-person acoustic features from multi-channel audio data, includes the following S111 to S117:
  • The microphone array built into the voice recording device collects and reads recording data in real time, block by block, and saves the recording data; the recording data is real-time waveform data. Optionally, the block size can be set according to real-time requirements and is generally 100 to 500 milliseconds. Optionally, the recording data is stored in the memory of the voice recording device to assist in verifying the speaker distinction result or for other purposes.
  • The voice detection module inspects the recording data in the data block and determines whether it contains speech.
  • The voice detection module includes a trained neural network.
  • Its input is the time-domain waveform sequence of the recording data, and its output is a probability value. If the probability is greater than the preset threshold, the data in the block is judged to be voice data; otherwise the recording data of the block is discarded and the next block is awaited. (A minimal sketch of this decision step follows below.)
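A minimal sketch of this decision step, assuming a generic trained model object with a `predict` method returning a speech probability (the model interface and the 0.5 threshold are assumptions):

```python
import numpy as np

VAD_THRESHOLD = 0.5  # preset probability threshold (assumed value)

def is_speech(waveform_block: np.ndarray, vad_model) -> bool:
    """Return True if the block is judged to contain speech; otherwise the
    caller discards the block and waits for the next one."""
    prob = float(vad_model.predict(waveform_block))
    return prob > VAD_THRESHOLD
```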
  • The temporary data buffer receives the recording data.
  • A temporary data buffer is placed after the voice detection module to receive the module's output.
  • The size of the temporary data buffer is configurable. The larger the buffer, the more voice data accumulates, and speaker distinction after feature extraction becomes more accurate; but if the buffer is set too large, filling it with speech takes a long time and hurts the real-time requirements.
  • In practice, the buffer size can therefore be set according to the real-time and accuracy requirements; usually the buffer length does not exceed 3 seconds. (A small sizing sketch follows below.)
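As a small illustration of these sizing rules, the buffer could be capped as follows (the 100 ms block length is one value from the 100-500 ms range above, and the 3-second cap follows the text):

```python
BLOCK_MS = 100                 # block size, within the 100-500 ms range
MAX_BLOCKS = 3000 // BLOCK_MS  # buffer capped at roughly 3 seconds of audio

temp_buffer: list = []

def push_block(block) -> bool:
    """Append a detected speech block; return True once the buffer is full,
    signalling that the accumulated data can move on to the next stage."""
    temp_buffer.append(block)
    return len(temp_buffer) >= MAX_BLOCKS
```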
  • A microphone array algorithm can be applied to determine how many speakers the multi-channel voice data contains.
  • The speaker count judgment module must determine how many speakers are present in the voice data in the temporary data buffer. This is a binary classification task:
  • the voice data is either single-person speech with no overlapping parts, or multi-person speech with overlapping speech. In S115, the speaker count judgment module does not care how many speakers the overlapping part contains; it only needs to judge whether the number of speakers is greater than 1. If the voice data is judged to contain more than one speaker, S116 is executed; otherwise, the voice data is monophonic voice data and S117 can be executed directly.
  • The scanning method divides the spatial plane into different angular regions and probes each region separately. If voice data is detected in several angular regions, multiple speakers are speaking simultaneously in the same period; a beamforming algorithm is then applied in each direction region to enhance the speech from that direction and suppress sounds from other directions. In this way, the voices of different speakers are extracted and enhanced to form the monophonic voice data. (A simplified beamforming sketch follows below.)
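A deliberately simplified delay-and-sum beamforming sketch follows (integer-sample delays and a known steering direction are assumptions; practical systems use frequency-domain beamforming with fractional delays):

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays_samples: list[int]) -> np.ndarray:
    """channels: (num_mics, num_samples) block of multi-channel audio.
    delays_samples: per-microphone delays aligning the target direction;
    summing the aligned channels reinforces that direction and attenuates
    sound arriving from other directions."""
    num_mics, _ = channels.shape
    out = np.zeros(channels.shape[1])
    for m in range(num_mics):
        out += np.roll(channels[m], -delays_samples[m])  # crude integer alignment
    return out / num_mics
```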
  • The acoustic feature is a high-dimensional vector feature that can distinguish different speakers; optionally, a feature extraction module is used to extract it.
  • A short-time Fourier transform is applied to the speech data to convert the time-domain waveform into the frequency domain; the magnitude and phase of the spectrum are then extracted to form an input vector, which is fed into the trained neural network. The network outputs a high-dimensional feature vector of fixed dimension that represents the acoustic features of the speaker in the speech data. (A sketch of this front end follows below.)
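A sketch of this front end, using NumPy only (frame size and hop length are illustrative assumptions; the downstream embedding network is the trained model described above):

```python
import numpy as np

def stft_features(signal: np.ndarray, frame: int = 512, hop: int = 256) -> np.ndarray:
    """Short-time Fourier transform of a mono signal; each frame yields an
    input vector of spectral magnitudes and phases, as described above."""
    window = np.hanning(frame)
    vecs = []
    for start in range(0, len(signal) - frame + 1, hop):
        spec = np.fft.rfft(signal[start:start + frame] * window)
        vecs.append(np.concatenate([np.abs(spec), np.angle(spec)]))
    return np.stack(vecs)  # shape: (num_frames, frame + 2)
```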
  • The recurrent-recursive neural network is obtained by supervised training.
  • In so-called supervised learning, when a neural network is used to train a model, a reference label is provided as ground truth, telling the model that the training target should be as close to the label as possible.
  • FIG. 3 shows the supervised training process of the RNN, and FIG. 4 illustrates that training process schematically.
  • FIG. 5 shows the testing process of the RNN, that is, how it is used.
  • Obtaining the recurrent-recursive neural network by supervised training includes:
  • S121: assign a speaker label to the speech signal, and record the start and end times of the speech signal corresponding to that label. As shown in FIG. 4, the single-person speech data of speaker 1 and of speaker 2 each contain their own voice data and the corresponding speaker label. During training, the speaker identity of the input speech or speech fragment is known; therefore, the speaker of the voice information corresponding to any fragment obtained from this speech is also known;
  • S122: extract the acoustic features of the voice signal. Continuing with FIG. 4, the feature extraction module extracts the acoustic features of speaker 1 and of speaker 2 respectively; at this point, the speaker labels corresponding to these acoustic features remain unchanged, namely the labels assigned in S121;
  • S121 to S123 are the supervised training stage of the RNN. After training is complete, refer to FIG. 5:
  • in use, the speaker of the input voice is unknown, and the trained model must run the clustering operations on it, as described in S120 to S150, to assign it a speaker label. (An illustrative sketch of a single training step follows below.)
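An illustrative supervised training step is sketched below in PyTorch. The architecture, layer sizes, and optimizer settings are assumptions; the text only states that acoustic features and speaker labels are fed into a recurrent network optimized with a loss function and an optimizer:

```python
import torch
import torch.nn as nn

class SpeakerRNN(nn.Module):
    """Hypothetical recurrent model: emits per-step intermediate states and
    a speaker classification used only during supervised training."""
    def __init__(self, feat_dim: int, hidden: int, num_speakers: int):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_speakers)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        states, _ = self.rnn(x)            # intermediate states per time step
        return self.head(states[:, -1]), states

def train_step(model, feats, speaker_ids, optimizer,
               loss_fn=nn.CrossEntropyLoss()):
    optimizer.zero_grad()
    logits, _ = model(feats)
    loss = loss_fn(logits, speaker_ids)    # compare against reference labels
    loss.backward()
    optimizer.step()
    return loss.item()
```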
  • A preset recurrent-recursive neural network is used to obtain the intermediate state of the single-person acoustic feature, and the intermediate state is stored in the state sequence buffer, which specifically includes:
  • using the recurrent-recursive neural network to obtain the intermediate state of the single-person acoustic feature without a speaker label; at this point, it is not yet given which speaker the feature vector of the speech belongs to.
  • S130, running a clustering algorithm on all intermediate states in the state sequence buffer to obtain at least one cluster, includes:
  • after clustering, the state buffer may contain at least one class, i.e., at least one cluster, with each cluster representing one speaker;
  • S140: calculate the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster;
  • each cluster has a cluster center, which is the mean of all intermediate states in the cluster;
  • S150, determining the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature, includes:
  • the weighted mean square error is the weighted Euclidean distance, which characterizes the distance between the intermediate state under test and a cluster: the closer the distance, the higher the probability that the state belongs to that cluster; the farther, the lower. Among the weighted Euclidean distances between the intermediate state under test and all cluster centers, the smallest one indicates the cluster nearest to that state, so the state can be deemed to belong to that cluster and is given that cluster's label.
  • The cluster label is the speaker label. If the intermediate state belongs to a speaker who has appeared before, that cluster label can be assigned to the intermediate state directly; if the intermediate state does not belong to a previously seen speaker, no such cluster label exists in the previous sequence, so a new cluster label must be created for the current speaker and then assigned to the intermediate state. (A sketch of this assignment follows below.)
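Putting the clustering and labeling steps together, a minimal sketch follows (the per-dimension weight vector, the dictionary data structure, and the new-label threshold are assumptions; the text does not fix the clustering algorithm):

```python
import numpy as np

def assign_cluster_label(state: np.ndarray,
                         clusters: dict[int, list[np.ndarray]],
                         weights: np.ndarray,
                         new_label_threshold: float) -> int:
    """Assign the new intermediate state to the cluster whose center has the
    smallest weighted mean square error (weighted Euclidean distance); if no
    cluster is close enough, open a new cluster with a fresh speaker label."""
    best_label, best_err = None, float("inf")
    for label, members in clusters.items():
        center = np.mean(members, axis=0)            # mean of member states
        err = float(np.sum(weights * (state - center) ** 2))
        if err < best_err:
            best_label, best_err = label, err
    if best_label is None or best_err > new_label_threshold:
        best_label = max(clusters, default=-1) + 1   # new speaker label
        clusters[best_label] = []
    clusters[best_label].append(state)
    return best_label
```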
  • The state sequence buffer is dedicated to storing the intermediate states output by the neural network.
  • Since voice data is time-series data, the amount of voice data grows over time, and so does the number of intermediate states output by the neural network; the computational cost of updating and maintaining the buffer grows accordingly. When the recording is long enough, running the clustering algorithm, computing the weighted mean square error, and updating the cluster centers take longer and longer, and real-time performance is greatly reduced.
  • A preset capacity value can be set for the size of the state buffer according to requirements.
  • The intermediate states output by the neural network are stored in the buffer; if the state buffer is full, the buffer is updated according to a certain strategy to keep its size unchanged.
  • In FIG. 6, a circle or ellipse represents a cluster, the solid circles represent the cluster centers, the solid triangle is the currently predicted intermediate state, and the solid diamond is the intermediate state to be discarded.
  • The voice distinguishing method 100 further includes a strategy for updating the state buffer, which specifically includes:
  • The update strategy of the state buffer can be summarized as "nearest first out": when a new intermediate state has been assigned a cluster label, the Euclidean distances between all intermediate states within that class and the cluster center are calculated and sorted, the intermediate state with the smallest Euclidean distance is discarded, the new state is added to the cluster, and the cluster center is recalculated.
  • The size of the buffer is thus kept unchanged, so no matter how long the recording lasts, once the buffer is full the system's response time for distinguishing speakers remains essentially constant, avoiding the accumulation of response delay caused by a growing amount of computation.
  • Adopting the "nearest first out" strategy ensures that discrimination accuracy does not change materially: the closer an intermediate state is to the cluster center, the more likely it belongs to that class and the lower its uncertainty, so the value of keeping it in the state buffer for further judgment is low and it can be discarded; conversely, the farther a state is from the cluster center, the greater its uncertainty and the higher the value of keeping it in the state buffer for further judgment, so it needs to be retained. (A sketch of this update follows below.)
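A sketch of this "nearest first out" update, reusing the per-cluster storage from the sketch above (still an assumption about the data layout):

```python
import numpy as np

def update_full_buffer(clusters: dict[int, list[np.ndarray]],
                       new_state: np.ndarray, label: int) -> None:
    """When the state buffer is full: within the cluster the new state was
    assigned to, discard the member nearest to the cluster center (its
    membership is the most certain, so it is the least valuable to keep),
    then add the new state; the center is recomputed as the mean on next use."""
    members = clusters[label]
    center = np.mean(members, axis=0)
    dists = [float(np.linalg.norm(s - center)) for s in members]
    members.pop(int(np.argmin(dists)))  # drop the nearest-to-center state
    members.append(new_state)           # buffer size stays constant
```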
  • By training the neural network with supervised learning, a large amount of labeled data can be used to improve the accuracy of speaker discrimination: the more labeled training data, the higher the discrimination accuracy. The recurrent-recursive neural network trained in this supervised way then predicts the intermediate state of a single person's speech data; the intermediate state is sent to the state buffer for clustering, and the corresponding cluster label is determined. The clustering process is thus separated from the neural network, which makes the clustering process easy to optimize.
  • The solution provided by the embodiments of the present application also maintains and updates the state buffer according to the "nearest first out" strategy, ensuring that the state buffer does not make the system run ever more slowly through accumulated delay. This solves the delay accumulation problem of the real-time clustering algorithm, achieves real-time speaker discrimination, and improves the real-time performance of the device or system running the voice distinguishing method.
  • So-called real-time speaker discrimination does not require acquiring a complete voice file; while the speaker is speaking, it delivers, with low latency, the judgment of the speaker's identity at the previous moment.
  • FIG. 7 shows a voice recording apparatus 200 according to Embodiment 2 of the present application.
  • The voice recording device 200 includes, but is not limited to, any one of a voice recorder, an audio conference terminal, an intelligent electronic device with a recording function, and the like. It may also omit the voice pickup function and include only the voice discrimination function, in which case it can be realized by a computer or another intelligent electronic device; Embodiment 2 is not limited in this respect.
  • The voice recording device 200 includes:
  • an acoustic feature acquisition unit 210, which extracts single-person acoustic features from multi-channel audio data;
  • an intermediate state buffer unit 220, which obtains the intermediate state of the single-person acoustic feature using the preset recurrent-recursive neural network and stores the intermediate state in the state sequence buffer; runs the clustering algorithm on all intermediate states in the state sequence buffer to obtain at least one cluster; calculates the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster; and determines the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature.
  • The recurrent-recursive neural network is obtained by supervised training.
  • The voice recording device further includes a recurrent-recursive neural network obtaining unit 230, configured to assign a speaker label to the voice signal and record the start and end times of the voice signal corresponding to the speaker label; extract the acoustic features of the voice signal; and feed the acoustic features and the speaker label into the recurrent-recursive neural network, optimizing the network with a loss function and an optimizer.
  • The space of the state sequence buffer has a preset capacity value; the intermediate state buffer unit 220 is further configured, when the space of the state sequence buffer storing the intermediate states reaches the preset capacity value, to calculate the Euclidean distance between all intermediate states in at least one of the clusters and the cluster center of that cluster, and to remove the intermediate state corresponding to the smallest Euclidean distance.
  • The intermediate state buffer unit 220 is further configured to add the new intermediate state and to recalculate the cluster centers of the clusters in the state sequence buffer.
  • The intermediate state buffer unit 220 determining the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature specifically includes:
  • the intermediate state buffer unit 220 is specifically configured to determine, if the cluster with the smallest weighted mean square error already has a label, that the existing label is the cluster label of the intermediate state; and, if the cluster with the smallest weighted mean square error has no label, to assign a new label to that cluster and determine the new label to be the cluster label of the intermediate state.
  • FIG. 8 is a schematic structural diagram of a voice recording apparatus 300 according to Embodiment 3 of the present application.
  • The voice recording apparatus 300 includes a processor 310 and a memory 320.
  • The processor 310 and the memory 320 are communicatively connected through a bus system.
  • The processor 310 invokes the program in the memory 320 to execute any one of the speech analysis methods provided in Embodiment 1 above.
  • The processor 310 may be an independent component or a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one microprocessor (DSP) or at least one programmable gate array (FPGA).
  • The memory 320 is a computer-readable storage medium storing programs executable on the processor 310.
  • The voice recording apparatus 300 further includes a sound pickup device 330 for acquiring voice information.
  • The processor 310, the memory 320, and the sound pickup device 330 are connected to one another through a bus system for communication.
  • The processor 310 invokes the program in the memory 320, executes any one of the speech analysis methods provided in Embodiment 1, and processes the multi-channel speech information acquired by the sound pickup device 330.
  • The functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, the functions may be realized by a processor executing software instructions.
  • The software instructions may consist of corresponding software modules.
  • The software modules may be stored in a computer-readable storage medium, which may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc (DVD)), or semiconductor media (e.g., Solid State Disk (SSD)), etc.
  • The computer-readable storage medium includes, but is not limited to, random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, removable hard disks, compact discs (CD-ROM), or any other form of storage medium known in the art.
  • An exemplary computer-readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the computer-readable storage medium.
  • The computer-readable storage medium can also be an integral part of the processor.
  • The processor and the computer-readable storage medium may reside in an ASIC; additionally, the ASIC may reside in access network equipment, target network equipment, or core network equipment.
  • The processor and the computer-readable storage medium may also exist as discrete components in the access network device, target network device, or core network device. When implemented in software, the functions can also be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • The computer program instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice diarization method and a voice recording apparatus thereof. The method comprises: acquiring a single-person acoustic feature from multi-channel audio data; acquiring an intermediate state of the single-person acoustic feature by using a preset recurrent-recursive neural network, and storing the intermediate state in a state sequence buffer area; in the state sequence buffer area, operating a clustering algorithm on all the intermediate states in the state sequence buffer area and obtaining at least one cluster; calculating the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster; and determining a cluster label of a cluster corresponding to the smallest weighted mean square error to be a cluster label of the intermediate state of the single-person acoustic feature. By means of the solution provided in the present application, a clustering process is conveniently optimized, and the diarization accuracy is improved.

Description

Voice diarization method and voice recording apparatus thereof

Technical Field

The present invention relates to the technical field of audio, and in particular to the technical field of voice discrimination.

Background Art

With the rise of deep learning, more and more portable mobile devices have joined the wave of intelligence. Among the many embedded smart devices, voice is often the key that unlocks the smart world. Whether in home scenarios or company meeting scenarios, voice is the input to many smart devices; by analyzing the speaker's voice, a smart device can capture the speaker's instructions and carry out the next operation.

In such scenarios, however, there is often more than one speaker, and separating the voices of different speakers has become a problem to be solved in the speech field. Speaker diarization is one of the issues worth studying in depth. Unlike speech recognition and speech separation, speaker diarization is not concerned with who the speaker is or what the speaker said; it focuses on the question of "who spoke at what time", i.e., on the differences between speakers. Once the user has obtained the separated voices of different speakers, operations such as speech recognition can be performed with improved accuracy.

Traditional speaker diarization methods are mainly based on clustering. Most of these algorithms operate offline: a complete piece of speech must first be obtained and segmented (or framed) with a sliding window; within each window, a Fourier transform is performed, Mel-frequency cepstral coefficient (MFCC) or spectral features are extracted, and the features are mapped to a high-dimensional space. The window is then moved with a certain proportion of overlap, trying to ensure that each window contains the speech of only one speaker, and the embedding of the next window's speech features in the high-dimensional space is computed. By comparing the embedding features of different segments, it is judged whether two segments of speech belong to the same speaker. The difference is usually measured by computing the cosine similarity of the two embeddings or their Euclidean distance in the multi-dimensional space: when the cosine similarity or Euclidean distance is greater than a certain threshold, the two are considered different, i.e., the two segments belong to different speakers; if it is below the threshold, the two segments are considered to belong to the same speaker. The threshold is often set from experience or by testing on some labeled data.

However, the speech features used by clustering algorithms, such as spectral and amplitude-spectrum features, do not reflect the differences between speakers well when modeling speaker characteristics. Moreover, once clustering reaches a certain level, no matter how much training data is added to the model, the improvement in clustering accuracy is very limited. This affects the accuracy of voice discrimination.

SUMMARY OF THE INVENTION

The present application provides a voice distinguishing method, and a voice recording apparatus thereof, that can improve discrimination accuracy.

The present application provides the following technical solutions:

In one aspect, a speech analysis method is provided, which includes: obtaining a single-person acoustic feature from multi-channel audio data; obtaining an intermediate state of the single-person acoustic feature using a preset recurrent-recursive neural network, and storing the intermediate state in a state sequence buffer; in the state sequence buffer, running a clustering algorithm on all intermediate states in the buffer to obtain at least one cluster; calculating the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster; and determining the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature.

In another aspect, a voice recording device is provided, which includes: an acoustic feature acquisition unit that extracts single-person acoustic features from multi-channel audio data; and an intermediate state buffer unit that obtains the intermediate state of the single-person acoustic feature using a preset recurrent-recursive neural network, stores the intermediate state in a state sequence buffer, runs a clustering algorithm on all intermediate states in the state sequence buffer to obtain at least one cluster, calculates the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster, and determines the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature.

The beneficial effect of the present application is that the trained neural network predicts the intermediate state of a single person's speech data, the intermediate state is then sent to the state buffer for clustering, and the corresponding cluster label is determined. The clustering process is thereby separated from the neural network, which makes the clustering process easy to optimize and improves discrimination accuracy.

Description of the Drawings

FIG. 1 is a flowchart of a voice distinguishing method according to Embodiment 1 of the present application.

FIG. 2 is a schematic flowchart of S110 in Embodiment 1 of the present application.

FIG. 3 is a flowchart of the supervised training process of the recurrent-recursive neural network in Embodiment 1 of the present application.

FIG. 4 is a schematic diagram of the supervised training process of the recurrent-recursive neural network in Embodiment 1 of the present application.

FIG. 5 is a schematic diagram of the testing process of the recurrent-recursive neural network in Embodiment 1 of the present application.

FIG. 6 is a schematic diagram of the update process of the state buffer in Embodiment 1 of the present application.

FIG. 7 is a schematic block diagram of a voice recording apparatus according to Embodiment 2 of the present application.

FIG. 8 is a schematic structural diagram of a voice recording apparatus according to Embodiment 3 of the present application.

Detailed Description of Embodiments

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are intended only to explain the present application, not to limit it. The present application may, however, be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the present disclosure is understood thoroughly and completely.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this application belongs. The terms used in the specification of the present application are for the purpose of describing particular embodiments only and are not intended to limit the application.

It should be understood that the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

The embodiments of the present application can be applied to various voice recording apparatuses with a voice distinguishing function, for example, a voice recorder, an audio conference terminal, an intelligent conference recording device, or an intelligent electronic device with a recording function. The technical solutions of the present application are described below through specific embodiments.

Embodiment 1

Please refer to FIG. 1, which shows a voice distinguishing method according to Embodiment 1 of the present application. Voice distinguishing refers to judging which speaker a piece of voice information corresponds to, that is, distinguishing voice information coming from different sound sources. Sound source distinction does not require the complete speech emitted by a sound source; only a part of it is needed, such as a sentence, or even a single word or fragment within a sentence.

The voice distinguishing method 100 includes:

S110: obtain a single-person acoustic feature from multi-channel audio data; optionally, the single-person acoustic feature is a high-dimensional vector feature;

S120: obtain the intermediate state of the single-person acoustic feature using a preset recurrent-recursive neural network, and store the intermediate state in a state sequence buffer;

S130: in the state sequence buffer, run a clustering algorithm on all intermediate states in the buffer to obtain at least one cluster;

S140: calculate the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster;

S150: determine the cluster label of the cluster with the smallest weighted mean square error to be the cluster label of the intermediate state of the single-person acoustic feature.

Optionally, referring to FIG. 2, S110, obtaining a single-person acoustic feature from multi-channel audio data, includes the following S111 to S117:

S111: the microphone array built into the voice recording device collects and reads recording data in real time, block by block, and saves the recording data; the recording data is real-time waveform data. Optionally, the block size can be set according to real-time requirements and is generally 100 to 500 milliseconds. Optionally, the recording data is stored in the memory of the voice recording device to assist in verifying the speaker distinction result or for other purposes.

S112: the voice detection module inspects the recording data in the data block and determines whether it contains speech.

The voice detection module includes a trained neural network. Its input is the time-domain waveform sequence of the recording data, and its output is a probability value. If the probability is greater than the preset threshold, the neural network judges the data in the block to be voice data; otherwise the recording data of the block is discarded and the next block is awaited.

S113: the temporary data buffer receives the recording data.

To improve the accuracy of speaker distinction, a temporary data buffer is placed after the voice detection module to receive its output. The size of the temporary data buffer is configurable: the larger the buffer, the more voice data accumulates, and speaker distinction after feature extraction becomes more accurate; but if the buffer is set too large, filling it with speech takes a long time and hurts the real-time requirements. In practice, the buffer size can therefore be set according to the real-time and accuracy requirements; usually the buffer length does not exceed 3 seconds.

S114: determine whether the number of data blocks in the temporary data buffer has reached the specified capacity; if so, execute S115; otherwise, execute S111.

S115: send the data blocks in the temporary data buffer to the speaker count judgment module, and judge whether the number of speakers in the data is greater than 1; if so, execute S116; otherwise execute S117;

Since the real-time voice data is multi-channel data collected by the microphone array, a microphone array algorithm can be applied to determine how many speakers the multi-channel voice data contains. In real home and meeting scenarios, however, different speakers may speak at the same time and produce overlapping speech, so the speaker count judgment module must determine how many speakers the voice data in the temporary data buffer contains. This is a binary classification task: the voice data is either single-person speech with no overlapping parts, or multi-person speech with overlapping speech. In S115, the speaker count judgment module does not care how many speakers the overlapping part contains; it only needs to judge whether the number of speakers is greater than 1. If the voice data is judged to contain more than one speaker, S116 is executed; otherwise, the voice data is monophonic voice data and S117 can be executed directly.

S116: send the voice data containing more than one speaker to the array algorithm module to process the overlapping speech and obtain monophonic voice data.

Optionally, when judging the number of speakers, the array algorithm uses a scanning method: the spatial plane is divided into different angular regions, and each region is probed separately. If voice data is detected in several angular regions, multiple speakers are speaking simultaneously in the same period; a beamforming algorithm is then applied in each direction region to enhance the speech from that direction and suppress sounds from other directions. In this way, the voices of different speakers are extracted and enhanced to form the monophonic voice data.

S117: extract the acoustic feature of a single speaker from the monophonic voice data; optionally, the acoustic feature is a high-dimensional vector feature that can distinguish different speakers; optionally, a feature extraction module is used to extract the acoustic feature.

Specifically, a short-time Fourier transform is first applied to the voice data to convert the time-domain waveform signal into the frequency domain; the magnitude and phase of the spectrum are then extracted to form an input vector, which is fed into the trained neural network. The network outputs a high-dimensional feature vector of fixed dimension that characterizes the acoustic features of the speaker in the voice data.

Traditional acoustic algorithms often use Mel-frequency cepstral coefficients (MFCC) or the i-vector model to represent a speaker's acoustic features, but these features are computed from mathematical models, and satisfying the operational requirements of those models usually demands a series of prior assumptions that are not always met in real application scenarios. Traditional acoustic features therefore face a bottleneck in representing speaker uniqueness. A neural network has none of these problems and no pre-assumed restrictions: after the short-time Fourier transform of the voice data, the relevant information is extracted and fed directly into the network, which outputs a high-dimensional vector representing the speaker's characteristics. Human factors are avoided in the process, and the accuracy of speaker distinction can be improved.

Optionally, the recurrent-recursive neural network is obtained by supervised training. In so-called supervised learning, when a neural network is used to train a model, a reference label is provided as ground truth, telling the model that the training target should be as close to the label as possible.

Optionally, refer to FIG. 3 and FIG. 4: FIG. 3 shows the supervised training process of the recurrent-recursive neural network, and FIG. 4 illustrates that training process schematically; FIG. 5 shows the testing process, i.e., how the network is used. Obtaining the recurrent-recursive neural network by supervised training includes:

S121: assign a speaker label to the speech signal, and record the start and end times of the speech signal corresponding to that label. As shown in FIG. 4, the single-person speech data of speaker 1 and of speaker 2 each contain their own voice data and the corresponding speaker label. During training, the speaker identity of the input speech or speech fragment is known; therefore, the speaker of the voice information corresponding to any fragment obtained from this speech is also known;

S122: extract the acoustic features of the voice signal. Continuing with FIG. 4, the feature extraction module extracts the acoustic features of speaker 1 and of speaker 2 respectively; at this point, the speaker labels corresponding to these acoustic features remain unchanged, namely the labels assigned in S121;

S123: feed the acoustic features and their speaker labels into the recurrent-recursive neural network, and optimize the network using a loss function and an optimizer.

S121 to S123 above constitute the supervised training stage of the recurrent-recursive neural network. After training is complete, referring to FIG. 5, during use, i.e., in the testing stage, the speaker of the input speech is unknown; the trained model must perform the clustering operations on it, as described in S120 to S150, to assign it a speaker label.

Optionally, S120, obtaining the intermediate state of the single-person acoustic feature using the preset recurrent-recursive neural network and storing the intermediate state in the state sequence buffer, specifically includes:

using the recurrent-recursive neural network to obtain the intermediate state of the single-person acoustic feature without a speaker label. At this point, it is not yet given which speaker the feature vector of this speech segment belongs to.

Optionally, S130, running a clustering algorithm on all intermediate states in the state sequence buffer to obtain at least one cluster, includes:

running the clustering algorithm on all intermediate states maintained in the state buffer. Although each segment of voice data contains only one speaker, the full audio sequence may contain several speakers speaking in turn; therefore, after clustering, the state buffer may contain at least one class, i.e., at least one cluster, with each cluster representing one speaker;

Optionally, S140, calculating the weighted mean square error between the intermediate state of the single-person acoustic feature and the cluster center of each cluster, includes:

S141: calculate the cluster center of each cluster; each cluster has a cluster center, which is the mean of all intermediate states in the cluster;

S142, compute the weighted mean square error between the intermediate state and each cluster center; the weighted mean square error may also be called a weighted Euclidean distance. A sketch of these two sub-steps is given below.
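A possible sketch of S141 and S142. A per-dimension weight vector w is assumed; the publication does not specify how the weights are chosen, so uniform weights are used here as a placeholder, in which case the measure reduces to the ordinary squared Euclidean distance.

    import numpy as np

    def cluster_centers(states, labels):
        # S141: each center is the mean of all intermediate states in its cluster
        return {k: states[labels == k].mean(axis=0) for k in np.unique(labels)}

    def weighted_mse(state, center, w):
        # S142: weighted mean square error, i.e. a weighted Euclidean distance
        d = state - center
        return float(np.sum(w * d * d))

    w = np.ones(states.shape[1])                 # placeholder: uniform weights
    centers = cluster_centers(states, labels)    # labels from the clustering step
    dists = {k: weighted_mse(states[-1], c, w) for k, c in centers.items()}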

Optionally, S150, determining the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features, includes:

selecting the cluster label of the cluster center with the smallest weighted mean square error as the cluster label of the intermediate state, where the cluster label is the speaker number. Optionally, after the cluster label of the intermediate state is confirmed, the cluster center is updated.

Optionally, in S150, determining the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features includes:

S151, if the cluster corresponding to the smallest weighted mean square error already has a label, determining the existing label as the cluster label of the intermediate state;

S152, if the cluster corresponding to the smallest weighted mean square error has no label, assigning a new label to that cluster, and determining the new label as the cluster label of the intermediate state.

Specifically, the weighted mean square error is a weighted Euclidean distance, which characterizes the distance between the intermediate state under test and a cluster center: the smaller the distance, the more likely the state belongs to that cluster, and the larger the distance, the less likely. Among the weighted Euclidean distances between the intermediate state under test and all cluster centers, the smallest one indicates the cluster nearest to that state; the state can therefore be deemed to belong to that cluster and is assigned that cluster's label. If the intermediate state belongs to a speaker who has already appeared, the cluster label (that is, the speaker label) already exists in the preceding sequence and is assigned to the intermediate state directly. If the intermediate state does not belong to a previously seen speaker, no such cluster label exists in the preceding sequence, so a new cluster label must be created for the current speaker and then assigned to the intermediate state. A sketch of this assignment logic is given below.
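The logic of S151 and S152 might be sketched as follows, reusing weighted_mse from the previous sketch. label_of (a mapping from cluster id to speaker label) and next_label are hypothetical bookkeeping variables introduced here, not names from the publication.

    def assign_label(state, centers, label_of, w, next_label):
        # nearest cluster under the weighted Euclidean distance
        nearest = min(centers, key=lambda k: weighted_mse(state, centers[k], w))
        if nearest in label_of:                  # S151: cluster already labelled
            return label_of[nearest], next_label
        label_of[nearest] = next_label           # S152: mint a new speaker label
        return next_label, next_label + 1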

The state sequence buffer is dedicated to storing the intermediate states output by the neural network. Because speech data is time-series data, the amount of speech grows over time, the number of intermediate states output by the network grows with it, and so does the computational cost of updating and maintaining the buffer. Given a long enough recording, running the clustering algorithm, computing the weighted mean square errors, and updating the cluster centers take ever longer, and the real-time requirement is badly compromised. This is the delay accumulation problem: when a clustering algorithm is used to classify the speaker's speech frames, most clustering methods obtain their result by traversal, so as the recording lengthens and the speaker's speech frames keep accumulating, the delay incurred by each traversal keeps growing. In practice, the system may return speaker discrimination results quickly at the very start of a recording, but respond more and more slowly as the recording proceeds, degrading the real-time behavior.

Therefore, referring to Figure 6, a preset capacity value can optionally be set for the state buffer according to requirements. While the state buffer has free space, the intermediate states output by the neural network are stored in it; if the state buffer is full, it is updated according to a certain strategy so that its size stays constant. In Figure 6, each ring or ellipse represents a cluster, the solid circles mark the cluster centers, the solid triangle is the currently predicted intermediate state, and the solid diamond is the intermediate state to be discarded.

In this case, the speech discrimination method 100 further includes a strategy for updating the state buffer, which specifically includes:

S161, if the space occupied by the state sequence buffer reaches the preset capacity value, computing, in the state sequence buffer used to store intermediate states, the Euclidean distance between every intermediate state in at least one cluster and that cluster's center;

S162, removing the intermediate state corresponding to the smallest Euclidean distance.

Optionally, after S161 and S162, the method further includes:

S163, adding the new intermediate state;

S164, recomputing the cluster centers of the clusters in the state sequence buffer.

The above buffer update strategy can be summarized as a "nearest out first" strategy: when a new intermediate state has been assigned a cluster label, the Euclidean distances between all intermediate states in that cluster and the cluster center are computed and sorted, the intermediate state with the smallest Euclidean distance is discarded, the new state is added to the cluster, and the cluster center is recomputed. This strategy keeps the buffer size constant, so no matter how long the recording runs, once the buffer is full the system's response time for speaker discrimination stays essentially constant, avoiding the accumulation of response delay caused by growing computation. Adopting the "nearest out first" strategy also ensures that discrimination accuracy does not vary much: the closer an intermediate state lies to the cluster center, the more likely it belongs to that cluster and the lower its uncertainty, so keeping it in the state buffer for further judgment is of little value and it can be discarded; conversely, the farther a state lies from the cluster center, the greater its uncertainty, the more valuable it is to retain in the state buffer for further judgment, and so it must be kept. A sketch of this eviction policy is given below.
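An illustrative sketch of the eviction policy of S161 to S164, assuming the buffer stores (state, label) pairs in a Python list. The fallback branch, taken when the newcomer's cluster is still empty, is an assumption of this sketch; the publication does not describe that case.

    import numpy as np

    def update_buffer(buffer, new_state, new_label, capacity):
        # Keep the buffer at a fixed size by evicting the least informative state.
        if len(buffer) >= capacity:
            same = [i for i, (_, lab) in enumerate(buffer) if lab == new_label]
            if same:
                center = np.mean([buffer[i][0] for i in same], axis=0)
                # drop the in-cluster state nearest its center: lowest uncertainty
                drop = min(same, key=lambda i: np.linalg.norm(buffer[i][0] - center))
            else:
                drop = 0                         # fallback: evict the oldest state
            buffer.pop(drop)
        buffer.append((new_state, new_label))    # then recompute centers (S164)
        return buffer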

In the embodiments of the present application, training the neural network by supervised learning allows large amounts of labeled data to improve speaker discrimination accuracy: the more labeled training data, the higher the discrimination accuracy of the algorithm. The recurrent neural network trained by supervised learning then predicts the intermediate state of a single speaker's speech data, the intermediate state is sent to the state buffer for the clustering computation, and the corresponding cluster label is determined. Separating the clustering process from the neural network in this way makes the clustering process convenient to optimize. In addition, the solution provided by the embodiments of the present application maintains and updates the state buffer according to the "nearest out first" strategy, ensuring that the state buffer does not slow the system down through delay accumulation. This solves the delay accumulation problem of the real-time clustering algorithm during operation, achieves the effect of real-time speaker discrimination, and improves the real-time performance of any device or system running the speech discrimination method. Real-time speaker discrimination means that the complete speech file is not needed: while the speaker is talking, the judgment of the speaker's identity at the previous moment is delivered with low latency.

Embodiment 2

Referring to Figure 7, a voice recording apparatus 200 is provided in Embodiment 2 of the present application. The voice recording apparatus 200 includes, but is not limited to, any of a voice recorder, an audio conference terminal, or an intelligent electronic device with a recording function; it may also be a computer or other intelligent electronic device that has no voice pickup function and only the speech discrimination function, which is not limited in Embodiment 2.

The voice recording apparatus 200 includes:

an acoustic feature acquisition unit 210, which extracts single-speaker acoustic features from multi-channel audio data;

an intermediate state buffer unit 220, which uses a preset recurrent neural network to obtain the intermediate state of the single-speaker acoustic features and stores the intermediate state in a state sequence buffer; runs a clustering algorithm on all intermediate states in the state sequence buffer and obtains at least one cluster; calculates the weighted mean square error between the intermediate state of the single-speaker acoustic features and the cluster center of each cluster; and determines the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features.

Optionally, the recurrent neural network is obtained by a supervised learning training method. Optionally, the voice recording apparatus further includes a recurrent neural network obtaining unit 230, configured to assign a speaker label to a speech signal and record the start and end times of the speech signal corresponding to that label; extract the acoustic features of the speech signal; and feed the acoustic features and their speaker labels into the recurrent neural network, optimizing the recurrent neural network with a loss function and an optimizer.

Optionally, the space of the state sequence buffer is a preset capacity value. The intermediate state buffer unit 220 is then further configured to, if the space occupied by the state sequence buffer reaches the preset capacity value, calculate, in the state sequence buffer storing the intermediate states, the Euclidean distance between every intermediate state in at least one cluster and that cluster's center, and remove the intermediate state corresponding to the smallest Euclidean distance.

Optionally, the intermediate state buffer unit 220 is further configured to add the new intermediate state and recalculate the cluster centers of the clusters in the state sequence buffer.

Optionally, the intermediate state buffer unit 220 being configured to determine the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features specifically includes:

the intermediate state buffer unit 220 being specifically configured to: if the cluster corresponding to the smallest weighted mean square error already has a label, determine the existing label as the cluster label of the intermediate state; and if the cluster corresponding to the smallest weighted mean square error has no label, assign a new label to that cluster and determine the new label as the cluster label of the intermediate state.

For details, optimizations, or specific examples not covered in Embodiment 2, refer to the same or corresponding parts of Embodiment 1 above; they are not repeated here.

Embodiment 3

Referring to Figure 8, a schematic structural diagram of a voice recording device 300 is provided in Embodiment 3 of the present application. The voice recording device 300 includes a processor 310 and a memory 320, which communicate with each other via a bus system. The processor 310 invokes the program in the memory 320 to execute any of the speech discrimination methods provided in Embodiment 1 above.

The processor 310 may be a single independent component or a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one microprocessor DSP or at least one field-programmable gate array FPGA. The memory 320 is a computer-readable storage medium storing a program executable on the processor 310.

Optionally, the voice recording device 300 further includes a sound pickup device 330 for acquiring voice information. The processor 310, the memory 320, and the sound pickup device 330 communicate with each other via the bus system. The processor 310 invokes the program in the memory 320, executes any of the speech discrimination methods provided in Embodiment 1, and processes the multi-channel voice information acquired by the sound pickup device 330.

For details not covered in Embodiment 3, refer to the same or corresponding parts of Embodiment 1 above; they are not repeated here.

Those skilled in the art should realize that, in one or more of the above examples, the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized by a processor executing software instructions, which may consist of corresponding software modules. The software modules may be stored in a computer-readable storage medium, which may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc, DVD), or semiconductor media (e.g., Solid State Disk, SSD). Computer-readable storage media include, but are not limited to, random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, removable hard disks, compact discs (CD-ROM), or any other form of storage medium known in the art. An exemplary computer-readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium; of course, the computer-readable storage medium may also be an integral part of the processor. The processor and the computer-readable storage medium may reside in an ASIC, and the ASIC may reside in an access network device, a target network device, or a core network device; alternatively, the processor and the computer-readable storage medium may exist as discrete components in the access network device, the target network device, or the core network device. When implemented in software, the functions may also be realized in whole or in part in the form of a computer program product comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer or on a chip, which may contain a processor, the processes or functions described in the specific embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer program instructions may be stored in the above computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave).

The above embodiments illustrate but do not limit the present invention, and those skilled in the art can devise multiple alternative examples within the scope of the claims. Those skilled in the art should appreciate that the application is not limited to the precise structures described above and illustrated in the accompanying drawings, and that appropriate adjustments, modifications, equivalent substitutions, and improvements may be made to the specific implementations without departing from the scope of the invention as defined by the appended claims. Therefore, any modifications and changes made in accordance with the concept and principles of the present invention fall within the scope of the present invention as defined by the appended claims.

Claims (14)

1. A speech discrimination method, characterized in that the method comprises:
obtaining single-speaker acoustic features from multi-channel audio data;
using a preset recurrent neural network to obtain an intermediate state of the single-speaker acoustic features, and storing the intermediate state in a state sequence buffer;
running a clustering algorithm on all intermediate states in the state sequence buffer and obtaining at least one cluster;
calculating a weighted mean square error between the intermediate state of the single-speaker acoustic features and the cluster center of each cluster; and
determining the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features.

2. The method according to claim 1, characterized in that the recurrent neural network is obtained by a supervised learning training method.

3. The method according to claim 2, characterized in that obtaining the recurrent neural network by the supervised learning training method comprises:
assigning a speaker label to a speech signal, and recording the start and end times of the speech signal corresponding to the speaker label;
extracting acoustic features of the speech signal; and
feeding the acoustic features and their speaker labels into the recurrent neural network, and optimizing the recurrent neural network using a loss function and an optimizer.

4. The method according to any one of claims 1 to 3, characterized in that the space of the state sequence buffer is a preset capacity value, and the method further comprises:
if the space occupied by the state sequence buffer reaches the preset capacity value, calculating, in the state sequence buffer used to store intermediate states, the Euclidean distance between every intermediate state in at least one of the clusters and the cluster center of that cluster; and
removing the intermediate state corresponding to the smallest Euclidean distance.

5. The method according to claim 4, characterized in that the method further comprises:
adding a new intermediate state; and
recalculating the cluster centers of the clusters in the state sequence buffer.

6. The method according to claim 1, characterized in that determining the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features comprises:
if the cluster corresponding to the smallest weighted mean square error already has a label, determining the existing label as the cluster label of the intermediate state; and
if the cluster corresponding to the smallest weighted mean square error has no label, assigning a new label to that cluster, and determining the new label as the cluster label of the intermediate state.

7. A voice recording apparatus, characterized in that the voice recording apparatus comprises:
an acoustic feature acquisition unit, which extracts single-speaker acoustic features from multi-channel audio data; and
an intermediate state buffer unit, which uses a preset recurrent neural network to obtain an intermediate state of the single-speaker acoustic features and stores the intermediate state in a state sequence buffer; runs a clustering algorithm on all intermediate states in the state sequence buffer and obtains at least one cluster; calculates a weighted mean square error between the intermediate state of the single-speaker acoustic features and the cluster center of each cluster; and determines the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features.

8. The voice recording apparatus according to claim 7, characterized in that the recurrent neural network is obtained by a supervised learning training method.

9. The voice recording apparatus according to claim 8, characterized in that the voice recording apparatus further comprises a recurrent neural network obtaining unit, configured to assign a speaker label to a speech signal and record the start and end times of the speech signal corresponding to the speaker label; extract acoustic features of the speech signal; and feed the acoustic features and their speaker labels into the recurrent neural network, optimizing the recurrent neural network using a loss function and an optimizer.

10. The voice recording apparatus according to any one of claims 7 to 9, characterized in that the space of the state sequence buffer is a preset capacity value; and the intermediate state buffer unit is further configured to, if the space occupied by the state sequence buffer reaches the preset capacity value, calculate, in the state sequence buffer storing the intermediate states, the Euclidean distance between every intermediate state in at least one of the clusters and the cluster center of that cluster, and remove the intermediate state corresponding to the smallest Euclidean distance.

11. The voice recording apparatus according to claim 10, characterized in that the intermediate state buffer unit is further configured to add a new intermediate state and recalculate the cluster centers of the clusters in the state sequence buffer.

12. The voice recording apparatus according to claim 7, characterized in that the intermediate state buffer unit being configured to determine the cluster label of the cluster corresponding to the smallest weighted mean square error as the cluster label of the intermediate state of the single-speaker acoustic features specifically comprises: the intermediate state buffer unit being specifically configured to, if the cluster corresponding to the smallest weighted mean square error already has a label, determine the existing label as the cluster label of the intermediate state; and if the cluster corresponding to the smallest weighted mean square error has no label, assign a new label to that cluster and determine the new label as the cluster label of the intermediate state.

13. A voice recording device, characterized in that the voice recording device comprises a processor and a memory, the processor invoking a program in the memory to execute the speech discrimination method according to any one of claims 1 to 6.

14. A computer-readable storage medium, characterized in that a program of a speech analysis method is stored on the computer-readable storage medium, and when the program of the speech analysis method is executed by a processor, the speech discrimination method according to any one of claims 1 to 6 is implemented.
PCT/CN2021/120414 2021-02-04 2021-09-24 Voice diarization method and voice recording apparatus thereof Ceased WO2022166219A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110154978.3 2021-02-04
CN202110154978.3A CN112992175B (en) 2021-02-04 2021-02-04 Voice distinguishing method and voice recording device thereof

Publications (1)

Publication Number Publication Date
WO2022166219A1 true WO2022166219A1 (en) 2022-08-11

Family

ID=76346965

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120414 Ceased WO2022166219A1 (en) 2021-02-04 2021-09-24 Voice diarization method and voice recording apparatus thereof

Country Status (2)

Country Link
CN (1) CN112992175B (en)
WO (1) WO2022166219A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992175B (en) * 2021-02-04 2023-08-11 深圳壹秘科技有限公司 Voice distinguishing method and voice recording device thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180166066A1 (en) * 2016-12-14 2018-06-14 International Business Machines Corporation Using long short-term memory recurrent neural network for speaker diarization segmentation
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110299150A (en) * 2019-06-24 2019-10-01 中国科学院计算技术研究所 A kind of real-time voice speaker separation method and system
CN110444223A (en) * 2019-06-26 2019-11-12 平安科技(深圳)有限公司 Speaker's separation method and device based on Recognition with Recurrent Neural Network and acoustic feature
CN110853666A (en) * 2019-12-17 2020-02-28 科大讯飞股份有限公司 A speaker separation method, device, equipment and storage medium
CN112992175A (en) * 2021-02-04 2021-06-18 深圳壹秘科技有限公司 Voice distinguishing method and voice recording device thereof

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5779032B2 (en) * 2011-07-28 2015-09-16 株式会社東芝 Speaker classification apparatus, speaker classification method, and speaker classification program
CN109658948B (en) * 2018-12-21 2021-04-16 南京理工大学 Migratory bird migration activity-oriented acoustic monitoring method
CN110136723A (en) * 2019-04-15 2019-08-16 深圳壹账通智能科技有限公司 Data processing method and device based on voice messaging
CN110289002B (en) * 2019-06-28 2021-04-27 四川长虹电器股份有限公司 End-to-end speaker clustering method and system
CN111063341B (en) * 2019-12-31 2022-05-06 思必驰科技股份有限公司 Segmentation and clustering method and system for multi-person speech in complex environment
CN112233680B (en) * 2020-09-27 2024-02-13 科大讯飞股份有限公司 Speaker character recognition method, speaker character recognition device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112992175A (en) 2021-06-18
CN112992175B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
US11900947B2 (en) Method and system for automatically diarising a sound recording
US11727939B2 (en) Voice-controlled management of user profiles
WO2022134833A1 (en) Speech signal processing method, apparatus and device, and storage medium
WO2021139425A1 (en) Voice activity detection method, apparatus and device, and storage medium
JP2019211749A (en) Method and apparatus for detecting starting point and finishing point of speech, computer facility, and program
WO2019080639A1 (en) Object identifying method, computer device and computer readable storage medium
CN107993663A (en) A kind of method for recognizing sound-groove based on Android
CN109036393A (en) Wake-up word training method, device and the household appliance of household appliance
TWI753576B (en) Model constructing method for audio recognition
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
US20200135211A1 (en) Information processing method, information processing device, and recording medium
CN118173094B (en) Wake-up word recognition method, device, equipment and medium combined with dynamic time regularization
CN108231063A (en) A kind of recognition methods of phonetic control command and device
CN108538312B (en) A method for automatic location of digital audio tampering points based on Bayesian information criterion
CN119673173A (en) A streaming speaker log method and system
CN113593597B (en) Voice noise filtering method, device, electronic equipment and medium
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
WO2022166219A1 (en) Voice diarization method and voice recording apparatus thereof
CN114822557A (en) Distinguishing method, device, equipment and storage medium of different sounds in classroom
JP6716513B2 (en) VOICE SEGMENT DETECTING DEVICE, METHOD THEREOF, AND PROGRAM
CN113689858B (en) Control method and device of cooking equipment, electronic equipment and storage medium
CN114613370A (en) Training method, recognition method and device of voice object recognition model
CN114333784A (en) Information processing method, information processing device, computer equipment and storage medium
CN114171059B (en) Noise identification method, device, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21924216

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21924216

Country of ref document: EP

Kind code of ref document: A1