
WO2019237517A1 - Speaker clustering method, apparatus, computer device, and storage medium - Google Patents


Info

Publication number
WO2019237517A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
vector
clustered
universal
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/103824
Other languages
English (en)
French (fr)
Inventor
涂宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2019237517A1 publication Critical patent/WO2019237517A1/zh
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L2015/0636 Threshold criteria for the updating

Definitions

  • the present application relates to the field of voiceprint recognition, and in particular, to a speaker clustering method, device, computer equipment, and storage medium.
  • the speaker clustering method is based on certain characteristics of the speaker, such as the gender, age, and accent of the speaker.
  • the speakers in the training set are divided into several subsets based on their voice characteristics.
  • the speakers in each subset share a certain type of speech characteristic with high similarity; an acoustic model is then trained specifically for each subset, ultimately forming an acoustic model library that stores several class clusters.
  • all the stored acoustic models in the acoustic model library are sequentially judged for similarity with the speech to be clustered to confirm which class cluster the speech to be clustered belongs to.
  • a speaker clustering method includes:
  • if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered;
  • the current general speech vector is stored in a preset acoustic model library, and the speech to be clustered is classified into a clustering cluster corresponding to the current general speech vector.
  • a speaker clustering device includes:
  • a speech descending ordering module configured to sort at least two speeches to be clustered in descending order of speech duration
  • a universal vector acquisition module configured to sequentially perform speech recognition on each speech to be clustered and each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered;
  • a training current vector module, configured to, if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, perform model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered;
  • the storage current vector module is configured to store the current general speech vector in a preset acoustic model library, and classify the speech to be clustered into a clustering cluster corresponding to the current general speech vector.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and is characterized in that when the processor executes the computer-readable instructions, the following steps are implemented:
  • if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered;
  • the current general speech vector is stored in a preset acoustic model library, and the speech to be clustered is classified into a clustering cluster corresponding to the current general speech vector.
  • One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered;
  • the current general speech vector is stored in a preset acoustic model library, and the speech to be clustered is classified into a clustering cluster corresponding to the current general speech vector.
  • FIG. 1 is a schematic diagram of an application environment of a speaker clustering method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speaker clustering method according to an embodiment of the present application.
  • FIG. 3 is another flowchart of a speaker clustering method according to an embodiment of the present application.
  • FIG. 4 is another flowchart of a speaker clustering method according to an embodiment of the present application.
  • FIG. 5 is another flowchart of a speaker clustering method according to an embodiment of the present application.
  • FIG. 6 is another flowchart of a speaker clustering method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a speaker clustering device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the speaker clustering method provided in the embodiments of the present application may be applied in the application environment shown in FIG. 1, where a computer device for collecting speech to be clustered communicates with a recognition server through a network.
  • computer equipment includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the recognition server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a speaker clustering method is provided.
  • the method is applied to the recognition server in FIG. 1 as an example, and includes the following steps:
  • the speech to be clustered is speaker speech that is evaluated against class-cluster features and is to be assigned to the corresponding class cluster.
  • each speech to be clustered may differ in duration, from a few minutes to a few seconds, owing to factors such as speaking rate and recorded content. Understandably, the longer the duration of the speech to be clustered, the more distinct and accurate the speech features that can be extracted. Therefore, in step S10, the recognition server arranges the speeches to be clustered in descending order of speech duration to form a queue and determines the class cluster of each speech in queue order, which improves classification accuracy.
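  • A minimal sketch of this ordering step follows (a sketch only, assuming each speech to be clustered is available as a 1-D array of audio samples at a known sample rate; the helper name order_by_duration is illustrative and not taken from the patent):

```python
import numpy as np

def order_by_duration(utterances, sample_rate=16000):
    """Sort speeches to be clustered in descending order of duration (step S10).

    `utterances` is assumed to be a list of 1-D numpy arrays of audio samples;
    longer recordings come first because they yield more reliable speech features.
    """
    return sorted(utterances, key=lambda wav: len(wav) / sample_rate, reverse=True)

# Example: recordings of 3 s, 30 s, and 12 s are queued as 30 s, 12 s, 3 s.
queue = order_by_duration([np.zeros(3 * 16000), np.zeros(30 * 16000), np.zeros(12 * 16000)])
print([len(w) // 16000 for w in queue])  # [30, 12, 3]
```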
  • the preset acoustic model library stores original universal speech vectors that are respectively established according to the cluster-like features of all existing clusters.
  • the preset acoustic model library can store original universal speech vectors divided and saved according to the age of the speaker, with birth to 10 years old as the first class cluster, 11 to 20 years old as the second class cluster, 21 to 30 years old as the third class cluster, and so on.
  • the original universal speech vector is a feature vector representing speakers of the same cluster.
  • the target universal speech vector is an original universal speech vector that has the highest similarity with its own speech features in the preset acoustic model library.
  • in step S20, the recognition server compares each speech to be clustered in turn with each original universal speech vector in the preset acoustic model library and can match the target universal speech vector with the highest similarity, which facilitates further determining whether the speech to be clustered belongs to the same class cluster as the target universal speech vector and helps improve the accuracy of clustering the speech to be clustered.
  • the similarity of speech features is the similarity ratio obtained after comparing the speech to be clustered with the target universal speech vector.
  • the preset threshold is a threshold set according to actual experience, and the threshold can be used to limit the minimum similarity of speech features when the speech to be clustered and the target universal speech vector belong to the same cluster.
  • the preset threshold can be set to 0.75, that is, when the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than 0.75, the step of performing model training with the speech to be clustered to obtain the current universal speech vector corresponding to the speech to be clustered is executed.
  • the current universal speech vector is a new vector established according to the class-cluster attributes of the speech to be clustered when the speech to be clustered has matched, in the preset acoustic model library, the target universal speech vector with the highest similarity to its own speech features, but that similarity is not greater than the preset threshold. For example, a preset acoustic model library divided by age may contain only a first class cluster from birth to 10 years old, a second class cluster from 11 to 20 years old, and a third class cluster from 21 to 30 years old, while the speaker of the speech to be clustered is 35 years old.
  • if the speech to be clustered does not match any class-cluster vector in the preset acoustic model library whose similarity with its own speech features is greater than the preset threshold, a fourth class cluster covering ages 31 to 40 can be established, according to the age bracket of the speaker, as the corresponding current universal speech vector.
  • in step S30, when the speech to be clustered does not match a target universal speech vector similar to its own speech features in the preset acoustic model library, the recognition server may establish a new current universal speech vector for the speech to be clustered according to its class-cluster attributes, which adds flexibility to the preset acoustic model library and improves the accuracy of classifying the speech to be clustered.
  • the current general speech vector is stored in a preset acoustic model library, and the speech to be clustered is classified into a clustering cluster corresponding to the current general speech vector.
  • the current general speech vector is the speech vector obtained in step S30
  • the preset acoustic model library is the database including multiple clusters obtained in step S20
  • the speech to be clustered is the speech data input to the recognition server in step S10.
  • the recognition server may store the current universal speech vector newly generated from the speech to be clustered in the preset acoustic model library, which expands the range of class clusters the preset acoustic model library can recognize, improves the flexibility and scalability of the preset acoustic model library, and at the same time improves the accuracy of classifying the speech to be clustered.
  • in the speaker clustering method provided in this embodiment, at least two speeches to be clustered are arranged in descending order of speech duration; when the speech feature similarity of a speech to be clustered with respect to the target universal speech vector is not greater than the preset threshold, a current universal speech vector corresponding to the speech to be clustered is generated, which improves the accuracy of classifying the speech to be clustered; and the current universal speech vector is stored in the preset acoustic model library, which expands the range of class clusters the preset acoustic model library can recognize and improves its flexibility and scalability.
  • the speaker clustering method further includes:
  • the speech to be clustered is speaker speech that is evaluated against class-cluster features and is to be assigned to the corresponding class cluster; the target universal speech vector is the original universal speech vector in the preset acoustic model library whose speech features are most similar to those of the speech to be clustered; and the speech feature similarity is the similarity obtained after comparing the speech to be clustered with the target universal speech vector.
  • the preset threshold is a threshold set according to actual experience, and the threshold can be used to limit the minimum similarity of speech features when the speech to be clustered and the target universal speech vector belong to the same cluster.
  • the preset threshold may be set to 0.75, that is, when the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is greater than 0.75, the speech to be clustered is classified into the class cluster corresponding to the target universal speech vector.
  • in step S50, when the speech to be clustered matches, in the preset acoustic model library, a target universal speech vector similar to its own speech features, and the speech feature similarity of the speech to be clustered relative to the target universal speech vector is greater than the preset threshold, the recognition server can automatically classify the speech to be clustered into the class cluster corresponding to the target universal speech vector, which improves the clustering speed of speech recognition.
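  • The overall decision logic of steps S20 through S50 can be sketched as follows (a sketch, not the patent's implementation: the 0.75 threshold comes from the text, while score_similarity and train_universal_vector are assumed placeholder callables standing in for the scoring steps S23-S25 and the training steps S31-S34):

```python
def cluster_speech(speech, model_library, score_similarity, train_universal_vector, threshold=0.75):
    """Assign one speech to an existing class cluster or create a new one.

    model_library maps cluster id -> original universal speech vector.
    """
    # S20: find the target universal speech vector with the highest similarity.
    target_id, best_sim = max(
        ((cid, score_similarity(speech, vec)) for cid, vec in model_library.items()),
        key=lambda item: item[1],
    )
    if best_sim > threshold:
        # S50: similarity above the preset threshold, so reuse the existing cluster.
        return target_id
    # S30/S40: train a new current universal speech vector and store it in the library.
    new_id = "cluster_%d" % (len(model_library) + 1)
    model_library[new_id] = train_universal_vector(speech)
    return new_id
```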
  • in an embodiment, sequentially performing speech recognition on each speech to be clustered against each original universal speech vector in the preset acoustic model library to obtain the target universal speech vector corresponding to the speech to be clustered specifically includes the following steps:
  • the preset rule is a rule for setting a duration for dividing the speech to be clustered into a first speech segment and a second speech segment.
  • the first speech segment is the speech segment used for speech adaptation with each original universal speech vector in the preset acoustic model library, and the second speech segment is the speech segment used for comparison with the adaptive speech features generated after adaptation with the first speech segment.
  • understandably, the longer the first speech segment used for adaptation, the more accurate the adaptive speech features generated after adaptation; therefore, the preset rule follows the principle that the duration percentage of the first speech segment is greater than that of the second speech segment.
  • the duration percentage of the first speech segment used for adaptation can be set to 75%, and the duration percentage of the second speech segment to 25%, the latter being used for class-cluster speech feature similarity scoring.
  • step S21 divides the speech to be clustered into a first speech segment for speech adaptation and a second speech segment for scoring, which improves the accuracy of the subsequent class-cluster decision based on these two speech segments of the speech to be clustered.
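  • A sketch of the 75%/25% division described above, assuming the speech to be clustered is a 1-D array of samples (the function name is illustrative):

```python
def split_for_adaptation(wav, adapt_fraction=0.75):
    """Divide a speech to be clustered into a first (adaptation) segment and a second (scoring) segment."""
    cut = int(len(wav) * adapt_fraction)
    return wav[:cut], wav[cut:]
```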
  • the voice features mentioned in this embodiment represent the voice features of the clusters of this class that are different from other clusters.
  • Mel-Frequency Cepstral Coefficients (hereinafter referred to as MFCC features) are generally used as speech features.
  • it has been found that the human ear acts like a filter bank and only attends to certain specific frequency components (human hearing is non-linear with respect to frequency), meaning that the human ear receives sound-frequency signals over a limited range.
  • these filters are not uniformly distributed on the frequency axis. There are many filters in the low frequency region, and they are densely distributed. However, in the high frequency region, the number of filters becomes relatively small and the distribution is sparse.
  • the resolution of the Mel-scale filter bank is high in the low-frequency part, which is consistent with the hearing characteristics of the human ear; therefore, Mel-frequency cepstral coefficients are used as the speech features, which can well reflect the speech characteristics of a class cluster.
  • the first speech feature is the MFCC feature corresponding to the first speech segment of the speech to be clustered for the adaptive part
  • the second speech feature is the MFCC feature corresponding to the second speech segment for scoring.
  • the implementation process of obtaining the first speech feature includes: preprocessing the first speech segment to obtain preprocessed speech data; the preprocessed speech data is pre-emphasized speech data, and pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the speech signal is heavily attenuated during transmission, and in order to obtain a good signal waveform at the receiving end, the damaged speech signal needs to be compensated.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the transmitting end of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission, so that the receiving end can obtain a better voice signal waveform.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • cepstral analysis on the Mel power spectrum converts the features, whose original dimensionality is too high to use directly, into easy-to-use MFCC feature vectors for training or recognition; as the first speech feature, the MFCC features serve as coefficients for distinguishing different voices.
  • the first voice feature can reflect the difference between voices and can be used to identify and distinguish training voice data.
  • in step S22, feature extraction is performed on the first speech segment and the second speech segment to obtain the first speech feature and the second speech feature, which can accurately reflect the features of the speech to be clustered; using the two for adaptation and scoring respectively can improve the accuracy of the class-cluster clustering of the speech to be clustered.
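  • A sketch of this feature-extraction step using the usual pre-emphasis plus MFCC pipeline; librosa is assumed to be available, and the 0.97 pre-emphasis coefficient and 20-coefficient setting are illustrative defaults rather than values taken from the patent:

```python
import numpy as np
import librosa

def extract_mfcc(wav, sr=16000, n_mfcc=20, preemph=0.97):
    """Pre-emphasize a speech segment and extract its MFCC features."""
    # Pre-emphasis boosts high-frequency components: y[t] = x[t] - a * x[t-1].
    emphasized = np.append(wav[0], wav[1:] - preemph * wav[:-1])
    # Mel filter bank and cepstral analysis are handled inside librosa.feature.mfcc.
    return librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)

# first_speech_feature = extract_mfcc(first_segment)
# second_speech_feature = extract_mfcc(second_segment)
```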
  • the first speech feature is input to each original universal speech vector in the preset acoustic model library for speech adaptation, and an adaptive speech feature corresponding to each original universal speech vector is obtained.
  • the preset acoustic model library stores original universal speech vectors that are respectively established according to the cluster-like features of all existing clusters.
  • speech adaptation starts from the trained original universal speech vector and uses the first speech feature to adjust it, improving the modeling accuracy of the original universal speech model so that the speech recognition rate approaches the level of a model fully trained on the first speech feature. A widely used speech adaptation algorithm performs parameter re-estimation based on the MAP (Maximum a Posteriori) method. This method uses the prior probabilities of the original universal speech vector parameters and, taking maximization of the posterior probability of those parameters as the criterion, re-estimates the parameters of the original universal speech vector, thereby improving the adaptation effect. Understandably, the adaptive speech feature is the speech vector corresponding to the new first speech feature formed after re-estimating the parameters of the original universal speech vector.
  • the implementation of the MAP re-estimation method is as follows:
  • let $O = \{O_1, O_2, \ldots, O_r\}$ be a sequence of observations of the first speech feature with probability density $p(O \mid \lambda)$, where $\lambda$ is the parameter set of the original universal speech vector that defines the distribution and $p(\lambda)$ is its prior distribution.
  • the re-estimation problem is the process of re-estimating $\lambda$ given the training data sequence $O$, implemented with the following formula (1): $\hat{\lambda}_{MAP} = \arg\max_{\lambda} p(\lambda \mid O) = \arg\max_{\lambda} p(O \mid \lambda)\, p(\lambda)$
  • $p(\lambda)$ is the prior distribution of the original universal speech vector parameters, where $\lambda$ is a random variable that follows the prior distribution $p(\lambda)$.
  • in step S23, an adaptive speech feature corresponding to each original universal speech vector can be acquired, which provides the technical basis for further determining the class cluster based on this feature.
  • the recognition similarity is the degree of similarity between two vectors.
  • the cosine value can be obtained by calculating the cosine space distance of the two vectors, so the value is from -1 to 1. Where -1 indicates that the two vectors are in opposite directions, 1 indicates that the two vectors point in the same direction, and 0 indicates that the two vectors are independent. Between -1 and 1 represents the similarity or dissimilarity between the two vectors. Understandably, the closer the similarity is to 1, the closer the two vectors are.
  • the recognition server may obtain and record the recognition similarity corresponding to each original universal speech vector, and may determine the clustering cluster where the closest voice to be clustered is located based on the recognition similarity.
  • the target universal speech vector is an original universal speech vector that has the highest similarity with its own speech features in the preset acoustic model library.
  • in step S26, by selecting the original universal speech vector with the highest recognition similarity as the target universal speech vector corresponding to the speech to be clustered, the existing class cluster in the preset acoustic model library to which the speech to be clustered most likely belongs can be preliminarily determined.
  • the voice to be clustered is divided into a first voice segment and a second voice segment for feature extraction, and the first voice feature and the second voice feature are obtained, which can accurately reflect the features of the voice to be clustered, and Both are used for adaptive and scoring respectively, which can improve the accuracy of clustering clusters of speech to be clustered.
  • step S24 the similarity calculation is performed on the adaptive speech feature and the second speech feature to obtain the recognition similarity corresponding to each original universal speech vector, which specifically includes the following steps :
  • the adaptive speech feature is a new first speech feature formed by re-estimating the parameters of the original universal speech vector.
  • the second speech feature is a speech feature of a second speech segment corresponding to the speech to be clustered for scoring.
  • the recognition i-vector and the second i-vector are two fixed-length vector representations obtained by mapping the adaptive speech feature and the second speech feature, respectively, into a low-dimensional total variability space.
  • the i-vector approach is also called the identity vector (identity factor) method. It does not attempt to forcibly separate the speaker space and the channel space, but directly sets a global change (total variability) space that contains all possible information in the voice data; the load factors on this space, obtained through factor analysis, are called the i-vector.
  • in step S241, by obtaining the recognition i-vector and the second i-vector corresponding respectively to the adaptive speech feature and the second speech feature, the vector-space distance between the recognition i-vector and the second i-vector can be further obtained based on these two vector representations.
  • the recognition similarity between the recognition i-vector and the second i-vector can be determined by the cosine value obtained with the following formula: $\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$, where $A_i$ and $B_i$ are the components of vector A and vector B, respectively.
  • the similarity ranges from -1 to 1, where -1 indicates that the two vectors point in opposite directions, 1 indicates that they point in the same direction, and 0 indicates that they are independent; values between -1 and 1 indicate intermediate degrees of similarity or dissimilarity. Understandably, the closer the similarity is to 1, the closer the two vectors are.
  • the recognition server may use the cosine similarity algorithm to obtain the recognition similarity between the recognition i-vector vector and the second i-vector vector, which is simple and fast.
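  • A direct implementation of the cosine score above, assuming the two i-vectors are numpy arrays of equal length:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between the recognition i-vector and the second i-vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Values close to 1 mean the two i-vectors point in nearly the same direction,
# -1 means opposite directions, and 0 means the vectors are orthogonal.
```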
  • step S30 the model training is performed using the speech to be clustered to obtain the current universal speech vector corresponding to the speech to be clustered, which specifically includes the following steps:
  • the speech to be clustered is speaker speech that is evaluated against class-cluster features and is to be assigned to the corresponding class cluster.
  • the test speech feature is the speech feature of the class cluster represented by the speech to be clustered that distinguishes it from other class clusters; specifically, it is the speech feature obtained after feature extraction from the speech to be clustered. In this embodiment, Mel-Frequency Cepstral Coefficients (hereinafter referred to as MFCC features) may be used as the test speech features.
  • step S31 the recognition server prepares technical support for establishing the current universal voice vector by extracting the test voice features of the voice to be clustered.
  • the simplified model algorithm is used to simplify the processing of test voice features to obtain simplified voice features.
  • the simplified model algorithm refers to a Gaussian Blur (Gaussian Blur) processing algorithm, which is used to reduce the sound noise and level of detail of a voice file.
  • Simplified speech features are relatively pure speech features obtained by simplification of the simplified model algorithm to remove sound noise.
  • in step S32, a simplified model algorithm is used to simplify the test speech features: the two-dimensional normal distribution of the test speech features can be obtained first, and then all phonemes of the two-dimensional normal distribution are blurred to obtain purer simplified speech features. The simplified speech features largely reflect the characteristics of the test speech features, which helps improve the efficiency of the subsequent training of the current universal speech vector.
  • the Maximum Expectation Algorithm (hereinafter referred to as the EM algorithm) is an iterative algorithm used in statistics to find maximum likelihood estimates of the parameters of a probability model that depends on unobservable latent variables.
  • the total variability subspace (hereinafter referred to as T space) is a global change mapping matrix that is set directly to contain all possible speaker information in the voice data.
  • the speaker space and channel space are not separated in the T space.
  • T-space can map high-dimensional full statistics (supervectors) to i-vectors (identity-vectors) that can be used as low-dimensional speaker representations, and play a role in reducing dimensions.
  • the training process of the T space includes: based on a preset UBM model, using factor analysis and the EM (Expectation Maximization) algorithm to compute the T space until convergence.
  • the EM algorithm is used to iteratively simplify the speech features.
  • the realization process of obtaining T space is as follows:
  • given a preset sample set $x = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$ of m independent samples, where the latent class $z^{(i)}$ of each sample $x^{(i)}$ is unknown, the joint probability distribution $p(x, z \mid \theta)$ must be considered, and suitable $\theta$ and $z$ must be found to maximize the likelihood $L(\theta)$, with a maximum number of iterations J:
  • Step E: compute the conditional expectation of the joint distribution. Using the initial value of the parameter $\theta$, or the parameter value obtained in the previous iteration, compute the posterior probability of the latent variable (that is, the expectation of the latent variable), taking $Q_i(z^{(i)}) := P(z^{(i)} \mid x^{(i)}; \theta_j)$ as the current estimate of the latent variable.
  • Step M: maximize $L(\theta, \theta_j) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)} \mid \theta)}{Q_i(z^{(i)})}$ to obtain $\theta_{j+1}$ (maximize the likelihood function to obtain new parameter values).
  • Step c): if $\theta_{j+1}$ has converged, the algorithm ends; otherwise, return to the E step and continue iterating.
  • the total variability subspace obtained in step S33 does not distinguish between the speaker space and the channel space; it merges the information of the speaker space and the channel space into one space, which reduces computational complexity and facilitates further obtaining, based on this subspace, the current universal speech vector corresponding to the simplified speech features.
  • the simplified speech feature is the speech feature obtained after processing by the simplified model algorithm obtained in step S32.
  • the current universal speech vector is a fixed-length vector representation obtained by projecting simplified speech features onto a low-dimensional overall change subspace, which is used to represent a speech vector formed by multiple speakers belonging to the same cluster.
  • the recognition server uses a simplified model algorithm to simplify the processing of test voice features. After obtaining the simplified voice features, and then projecting the simplified voice features into the overall change subspace, a more pure and simple current universal voice vector can be obtained. In order to perform subsequent voice clustering on the speaker's voice data based on the current universal voice vector, the complexity of performing voice clustering is reduced, and the efficiency of voice clustering is accelerated.
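  • A simplified sketch of the projection step: once a total variability matrix T has been trained, a feature supervector is reduced to a low-dimensional vector w such that supervector ≈ mean + T·w. The least-squares solve below is a simplification of the full i-vector posterior computation and is only meant to illustrate the dimensionality reduction:

```python
import numpy as np

def project_to_subspace(supervector, mean_supervector, T):
    """Project a simplified-feature supervector onto the total variability subspace.

    T : (supervector_dim, ivector_dim) total variability matrix.
    Returns the low-dimensional vector w minimizing ||supervector - mean - T @ w||.
    """
    residual = supervector - mean_supervector
    w, *_ = np.linalg.lstsq(T, residual, rcond=None)  # least-squares solve for w
    return w
```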
  • step S32 a simplified model algorithm is used to simplify the processing of test voice features to obtain simplified voice features, which specifically include the following steps:
  • the Gaussian filter can perform linear smooth filtering on the input test voice features, is suitable for eliminating Gaussian noise, and is widely used in the noise reduction process.
  • the process of Gaussian filter processing test speech features is specifically a process of weighted average of test speech features. Taking the phonemes in test speech features as an example, the value of each phoneme is weighted by itself and other phoneme values in the neighborhood. Obtained after averaging.
  • the two-dimensional normal distribution (also known as the two-dimensional Gaussian distribution) has a density function with the following characteristics: it is symmetric about μ, reaches its maximum at μ, tends to 0 at positive (negative) infinity, and has inflection points at μ ± σ; the shape of the two-dimensional normal distribution is high in the middle and low on both sides, and its curve is a bell curve above the x-axis.
  • the Gaussian filter processes the test speech features by scanning each phoneme in the training speech data with a 3 * 3 mask and replacing the value at the template center with the weighted average of the phonemes in the neighborhood determined by the mask; the phoneme values form a two-dimensional normal distribution of the training speech data.
  • the weighted average of each phoneme is computed from its own value and the values of the other phonemes in the neighborhood determined by the mask.
  • in step S321, the noise in the test speech features can be removed through linear smoothing filtering, and a purer two-dimensional normal distribution is output for further processing.
  • a simplified model algorithm is used to simplify the two-dimensional normal distribution to obtain simplified speech features.
  • the simplified model algorithm may use a Gaussian fuzzy algorithm to simplify the two-dimensional normal distribution.
  • in the Gaussian blur, each phoneme takes the average value of the surrounding phonemes; that is, the "middle point" takes the average value of the "peripheral points". Numerically this is a kind of "smoothing", and in the result the "middle point" loses detail.
  • the recognition server can obtain the simplified voice feature of the two-dimensional normal distribution corresponding to the test voice feature through the simplified model algorithm, which can further reduce the voice details of the test voice feature and simplify the voice feature.
  • the recognition server may sequentially denoise and reduce details of the test voice features to obtain pure and simplified simplified voice features, which is beneficial to improving the recognition efficiency of voice clustering.
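  • A sketch of this simplification step with a Gaussian filter, here via scipy; the sigma value is an illustrative assumption. Every output value is a Gaussian-weighted average of its neighborhood, which removes noise and fine detail from the test speech features:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simplify_features(test_features, sigma=1.0):
    """Denoise and smooth a 2-D array of test speech features (frames x coefficients)."""
    # Each point is replaced by the weighted average of its neighbors, with weights
    # following a two-dimensional normal distribution (Gaussian blur).
    return gaussian_filter(np.asarray(test_features, dtype=float), sigma=sigma)
```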
  • in the speaker clustering method provided in this embodiment, at least two speeches to be clustered are arranged in descending order of speech duration; when the speech feature similarity of a speech to be clustered with respect to the target universal speech vector is not greater than the preset threshold, a current universal speech vector corresponding to the speech to be clustered is generated, which improves the accuracy of classifying the speech to be clustered; and the current universal speech vector is stored in the preset acoustic model library, which expands the range of class clusters the preset acoustic model library can recognize and improves the flexibility and scalability of the preset acoustic model library.
  • the recognition server divides the speech segment to be clustered into a first speech segment for speech adaptation and a second speech segment for scoring, and performs feature extraction on the first speech segment and the second speech segment, respectively.
  • Obtaining the first speech feature and the second speech feature can accurately reflect the features of the speech to be clustered, and use the two for adaptive and scoring respectively, which can improve the accuracy of clustering clusters of the speech to be clustered.
  • the recognition server can preliminarily determine the existing class cluster in the preset acoustic model library to which the speech to be clustered most likely belongs by selecting the original universal speech vector with the highest recognition similarity as the target universal speech vector corresponding to the speech to be clustered.
  • the recognition server uses the cosine similarity algorithm to obtain the recognition similarity between the recognition i-vector vector and the second i-vector vector, which is simple and fast.
  • the recognition server uses a simplified model algorithm to simplify the test speech features; after obtaining the simplified speech features and projecting them into the total variability subspace, a purer and simpler current universal speech vector can be obtained, so that subsequent speech clustering of the speaker's speech data based on the current universal speech vector has lower complexity and higher efficiency.
  • a speaker clustering device is provided, and the speaker clustering device corresponds to the speaker clustering method in the embodiment described above.
  • the speaker clustering device includes a speech descending ordering module 10, a universal vector acquisition module 20, a training current vector module 30, and a current vector storage module 40.
  • the functional modules are described in detail as follows:
  • the speech descending ordering module 10 is configured to arrange at least two speeches to be clustered in descending order of speech duration.
  • the universal vector obtaining module 20 is configured to sequentially perform speech recognition for each speech to be clustered and each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered.
  • the training current vector module 30 is configured to, if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, perform model training with the speech to be clustered to obtain the current universal speech vector corresponding to the speech to be clustered.
  • the storage current vector module 40 is configured to store the current universal speech vector in a preset acoustic model library, and classify the speech to be clustered into a clustering cluster corresponding to the current universal speech vector.
  • in an embodiment, the speaker clustering device further includes a clustering classification unit 50.
  • a clustering clustering unit 50 is configured to classify the speech to be clustered to the target if the speech feature similarity of the speech to be clustered in the target universal speech vector is greater than a preset threshold. In the cluster class corresponding to the universal speech vector.
  • the universal vector acquisition module 20 includes a speech segment division unit 21, an acquisition speech feature unit 22, an acquisition recognition feature unit 23, an acquisition recognition similarity unit 24, and a selected speech model unit 25.
  • the speech segment unit 21 is configured to sequentially divide each of the speeches to be clustered into a first speech segment and a second speech segment according to a preset rule.
  • the voice feature obtaining unit 22 is configured to perform feature extraction on the first voice segment and the second voice segment, respectively, to obtain a first voice feature and a second voice feature.
  • An identification feature obtaining unit 23 is configured to input the first speech feature into each original universal speech vector in a preset acoustic model library for speech adaptation, and obtain an adaptive speech feature corresponding to each original universal speech vector.
  • the recognition similarity unit 24 is configured to perform similarity calculation on the adaptive speech feature and the second speech feature to obtain a recognition similarity corresponding to each original universal speech vector.
  • the speech model unit 25 is selected to select an original universal speech vector with the highest recognition similarity as a target universal speech vector corresponding to the speech to be clustered.
  • the recognition similarity unit 24 includes a recognition vector subunit 241 and a recognition similarity subunit 242.
  • a recognition vector subunit 241 is configured to obtain a recognition i-vector vector and a second i-vector vector corresponding to the adaptive speech feature and the second speech feature, respectively.
  • the recognition similarity subunit 242 is configured to obtain a recognition similarity between the recognition i-vector vector and the second i-vector vector by using a cosine similarity algorithm.
  • the training current vector module 30 includes an extraction test feature unit 31, a simplified feature unit 32, a changed subspace unit 33, and a universal vector unit 34.
  • the extraction test feature unit 31 is configured to extract test speech features of speech to be clustered.
  • the simplified feature unit 32 is used to simplify processing test voice features by using a simplified model algorithm, and to obtain simplified voice features.
  • a variation subspace obtaining unit 33 is configured to iteratively simplify speech features by using a maximum expectation algorithm to obtain an overall variation subspace.
  • the universal vector obtaining unit 34 is configured to project the simplified speech feature onto the overall change subspace to obtain the current universal speech vector corresponding to the cluster identifier.
  • the simplified feature unit 32 includes a normal distribution obtaining subunit 321 and a simplified feature obtaining subunit 322.
  • a normal distribution subunit 321 is used to process a test voice feature using a Gaussian filter to obtain a corresponding two-dimensional normal distribution.
  • a simplified feature sub-unit 322 is used to use a simplified model algorithm to simplify the two-dimensional normal distribution and obtain simplified voice features.
  • Each module in the speaker clustering device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the hardware in or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and the internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operating system and computer-readable instructions in a non-volatile storage medium to execute.
  • the computer equipment database is used to store speech data related to the speaker clustering method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a speaker clustering method.
  • a computer device which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: arranging at least two speeches to be clustered in descending order of speech duration; sequentially performing speech recognition on each speech to be clustered against each original universal speech vector in the preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered; if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered; and storing the current universal speech vector in the preset acoustic model library and classifying the speech to be clustered into the class cluster corresponding to the current universal speech vector.
  • when the processor executes the computer-readable instructions, the following step is further implemented: if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is greater than a preset threshold, classifying the speech to be clustered into the class cluster corresponding to the target universal speech vector.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: sequentially dividing each speech to be clustered into a first speech segment and a second speech segment according to a preset rule; performing feature extraction on the first speech segment and the second speech segment respectively to obtain a first speech feature and a second speech feature; inputting the first speech feature into each original universal speech vector in the preset acoustic model library for speech adaptation to obtain an adaptive speech feature corresponding to each original universal speech vector; performing similarity calculation on the adaptive speech feature and the second speech feature to obtain a recognition similarity corresponding to each original universal speech vector; and selecting the original universal speech vector with the highest recognition similarity as the target universal speech vector corresponding to the speech to be clustered.
  • the processor executes the computer-readable instructions, the following steps are further implemented: obtaining the recognition i-vector vector and the second i-vector vector corresponding to the adaptive speech feature and the second speech feature, respectively; using a cosine similarity algorithm Obtain the recognition similarity between the recognition i-vector vector and the second i-vector vector.
  • the processor when the processor executes the computer-readable instructions, the following steps are further implemented: extracting test voice features of the speech to be clustered; using a simplified model algorithm to simplify processing the test voice features to obtain simplified voice features; using a maximum An algorithm is expected to iterate the simplified speech feature to obtain an overall change subspace; project the simplified speech feature to the overall change subspace to obtain the current universal speech vector corresponding to the class cluster identifier.
  • the processor when the processor executes the computer-readable instructions, the following steps are implemented: processing the test voice feature using a Gaussian filter to obtain a corresponding two-dimensional normal distribution; and simplifying the two-dimensional normal distribution using a simplified model algorithm To get simplified speech features.
  • one or more non-volatile readable storage media storing computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, cause the one or more processors to perform the following steps : Arrange at least two speeches to be clustered in descending order of speech duration; sequentially perform speech recognition for each speech to be clustered and each original universal speech vector in a preset acoustic model library to obtain a target corresponding to the speech to be clustered Universal speech vector; if the similarity of the speech features of the speech to be clustered in the target universal speech vector is not greater than a preset threshold, use the speech to be clustered for model training, and the current universal speech vector corresponding to the speech to be clustered; The universal speech vector is stored in a preset acoustic model library, and the speech to be clustered is classified into a clustering cluster corresponding to the current universal speech vector.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors further perform the following step: if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is greater than a preset threshold, classifying the speech to be clustered into the class cluster corresponding to the target universal speech vector.
  • the one or more processors when the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps: sequentially dividing each speech to be clustered into a first speech according to a preset rule Segment and second speech segment; perform feature extraction on the first speech segment and the second speech segment, respectively, to obtain the first speech feature and the second speech feature; input the first speech feature into each of the original commons in the preset acoustic model library
  • the speech vector is subjected to speech adaptation to obtain the adaptive speech features corresponding to each original universal speech vector; similarity calculation is performed on the adaptive speech feature and the second speech feature to obtain the recognition similarity corresponding to each original universal speech vector;
  • the original universal speech vector with the highest similarity is identified as the target universal speech vector corresponding to the speech to be clustered.
  • the one or more processors when the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps: obtaining the recognition i-vector vectors corresponding to the adaptive speech feature and the second speech feature, respectively. And the second i-vector vector; the cosine similarity algorithm is used to obtain the recognition similarity between the recognition i-vector vector and the second i-vector vector.
  • the one or more processors when the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps: extracting test voice features of the speech to be clustered; using a simplified model algorithm to simplify processing The test voice feature is used to obtain simplified voice features; the maximum expectation algorithm is used to iterate the simplified voice feature to obtain an overall change subspace; and the simplified voice feature is projected to the overall change subspace to obtain the cluster-like identifier The corresponding current universal speech vector.
  • the one or more processors when the computer-readable instructions are executed by one or more processors, the one or more processors further perform the following steps: processing the test voice feature using a Gaussian filter to obtain a corresponding two-dimensional normal Distribution; a simplified model algorithm is used to simplify the two-dimensional normal distribution to obtain simplified speech features.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), dual data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a speaker clustering method, apparatus, computer device, and storage medium. The speaker clustering method includes: arranging at least two speeches to be clustered in descending order of speech duration; sequentially performing speech recognition on each speech to be clustered against each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered; if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered; and storing the current universal speech vector in the preset acoustic model library and classifying the speech to be clustered into the corresponding class cluster. By automatically generating a current universal speech vector corresponding to the speech to be clustered when its speech feature similarity is determined to be not greater than the preset threshold, the present application improves classification accuracy.

Description

Speaker clustering method, apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese invention application No. 2018110592867.9, filed on June 11, 2018 and entitled "Speaker clustering method, apparatus, computer device, and storage medium".
Technical Field
The present application relates to the field of voiceprint recognition, and in particular to a speaker clustering method, apparatus, computer device, and storage medium.
Background
A speaker clustering method divides the speakers in a training set into several subsets according to their speech characteristics, directly based on certain speaker attributes such as gender, age, and accent; the speakers within each subset share a certain speech characteristic with high similarity. An acoustic model is then trained specifically for each subset, ultimately forming an acoustic model library that stores several class clusters. When testing a speaker's speech to be clustered, all stored acoustic models in the acoustic model library are compared in turn with the speech to be clustered for similarity, to determine which class cluster the speech to be clustered belongs to.
Existing speaker clustering methods can only classify the speech to be clustered based on a known acoustic model library, which limits the clustering range of the speech to be clustered and may lead to inaccurate classification.
Summary
On this basis, it is necessary to provide, in view of the above technical problem, a speaker clustering method, apparatus, computer device, and storage medium that can improve the accuracy of speaker clustering.
A speaker clustering method includes:
arranging at least two speeches to be clustered in descending order of speech duration;
sequentially performing speech recognition on each speech to be clustered against each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered;
if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered; and
storing the current universal speech vector in the preset acoustic model library, and classifying the speech to be clustered into the class cluster corresponding to the current universal speech vector.
A speaker clustering apparatus includes:
a speech descending ordering module, configured to arrange at least two speeches to be clustered in descending order of speech duration;
a universal vector acquisition module, configured to sequentially perform speech recognition on each speech to be clustered against each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered;
a current vector training module, configured to, if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, perform model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered; and
a current vector storage module, configured to store the current universal speech vector in the preset acoustic model library and classify the speech to be clustered into the class cluster corresponding to the current universal speech vector.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
arranging at least two speeches to be clustered in descending order of speech duration;
sequentially performing speech recognition on each speech to be clustered against each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered;
if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered; and
storing the current universal speech vector in the preset acoustic model library, and classifying the speech to be clustered into the class cluster corresponding to the current universal speech vector.
One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
arranging at least two speeches to be clustered in descending order of speech duration;
sequentially performing speech recognition on each speech to be clustered against each original universal speech vector in a preset acoustic model library to obtain a target universal speech vector corresponding to the speech to be clustered;
if the speech feature similarity of the speech to be clustered with respect to the target universal speech vector is not greater than a preset threshold, performing model training with the speech to be clustered to obtain a current universal speech vector corresponding to the speech to be clustered; and
storing the current universal speech vector in the preset acoustic model library, and classifying the speech to be clustered into the class cluster corresponding to the current universal speech vector.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required for describing the embodiments. Apparently, the drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of a speaker clustering method according to an embodiment of the present application;
FIG. 2 is a flowchart of a speaker clustering method according to an embodiment of the present application;
FIG. 3 is another flowchart of a speaker clustering method according to an embodiment of the present application;
FIG. 4 is another flowchart of a speaker clustering method according to an embodiment of the present application;
FIG. 5 is another flowchart of a speaker clustering method according to an embodiment of the present application;
FIG. 6 is another flowchart of a speaker clustering method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a speaker clustering apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Apparently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
The speaker clustering method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which a computer device used to collect speech to be clustered communicates with a recognition server through a network. The computer device includes, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The recognition server can be implemented by an independent server or by a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a speaker clustering method is provided. The method is described taking its application to the recognition server in FIG. 1 as an example, and includes the following steps:
S10. Arrange at least two speeches to be clustered in descending order of speech duration.
The speech to be clustered is speaker speech that is evaluated against class-cluster features and is to be assigned to the corresponding class cluster.
The duration of each speech to be clustered is not necessarily the same, ranging from a few minutes to a few seconds, owing to factors such as speaking rate and recorded content. Understandably, the longer the duration of the speech to be clustered, the more distinct and accurate the speech features that can be extracted. Therefore, in step S10, the recognition server arranges the speeches to be clustered in descending order of speech duration to form a queue, and determines the class cluster of each speech to be recognized in queue order, which improves classification accuracy.
S20.依序将每一待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与待聚类语音对应的目标通用语音向量。
其中,预设声学模型库中存储有根据现有所有类簇的类簇特征分别建立的原始通用语音向量。比如,预设声学模型库中可以保存按说话人年龄特征来进行划分并保存的原始通用语音向量,以出生到10岁为第一聚类类簇,以11岁至20岁为第二聚类类簇,以21岁至30岁为第三聚类类簇,此类类推。
原始通用语音向量是表示同一类簇说话人的特征向量。
目标通用语音向量是待聚类语音在预设声学模型库中匹配到与自身语音特征相似度最高的一原始通用语音向量。
步骤S20中,识别服务器依序将每一待聚类语音与预设声学模型库中每一原始通用语音向量进行识别对比,可匹配到与其语音特征相似度最高的一目标通用语音向量,利于进一步判定待聚类语音是否与该目标通用语音向量属于同一类簇,有助于提高对待聚类语音进行聚类的准确性。
S30.若待聚类语音在目标通用语音向量中的语音特征相似度不大于预设阈值,则采用待聚类语音进行模型训练,获取与待聚类语音对应的当前通用语音向量。
其中,语音特征相似度是待聚类语音和目标通用语音向量进行对比后得到的相似度比值。
预设阈值是根据实际经验设定的阈值,该阈值可以用于限定待聚类语音和目标通用语音向量属于同一类簇时,其语音特征相似度的最小值。应用于本实施例,可将预设阈值设定为0.75,即当待聚类语音在目标通用语音向量中的语音特征相似度不大于0.75时,执行采用待聚类语音进行模型训练,获取与待聚类语音对应的当前通用语音向量的步骤。
当前通用语音向量是在待聚类语音于预设声学模型库中匹配到与自身语音特征相似度最高的目标通用语音向量、但该待聚类语音在目标通用语音向量中的语音特征相似度不大于预设阈值时,根据待聚类语音自身具有的类簇属性而建立的新的通用语音向量。比如,以年龄进行划分的预设声学模型库中仅存有出生到10岁的第一聚类类簇、从11岁到20岁的第二聚类类簇,从21岁到30岁的第三聚类类簇。而待聚类语音的说话人为35岁,当待聚类语音在该预设声学模型库中未匹配到与自身语音特征相似度大于预设阈值的类簇向量,可根据说话人的年龄所处的划分段,建立以31岁至40岁为第四聚类类簇,作为对应的当前通用语音向量。
步骤S30中,当待聚类语音在预设声学模型库中未匹配到与自身语音特征相似的目标通用语音向量时,识别服务器可根据待聚类语音自身具有的类簇属性为待聚类语音建立新的当前通用语音向量,提高了预设声学模型库的灵活性和对待聚类语音进行划分的分类准确性。
S40.将当前通用语音向量存储在预设声学模型库中,并将待聚类语音归类到当前通用语音向量对应的聚类类簇中。
其中,当前通用语音向量即步骤S30得到的语音向量,预设声学模型库即步骤S20得到的包括多个聚类类簇的数据库,待聚类语音就是步骤S10输入识别服务器的语音数据。
步骤S40中,识别服务器可将待聚类语音新生成的当前通用语音向量存储到预设声学模型库中,扩大预设声学模型库的可识别聚类类簇的范围,提高预设声学模型库的灵活性和可扩展性,同时提高对待聚类语音进行分类的准确性。
本申请实施例提供的说话人聚类方法,通过将至少两个待聚类语音按语音时长降序排列,并在待聚类语音相对目标通用语音向量的语音特征相似度不大于预设阈值时,生成与待聚类语音对应的当前通用语音向量,提高了对待聚类语音进行分类的准确性;将当前通用语音向量存储在预设声学模型库中,扩大了预设声学模型库可识别聚类类簇的范围,提高了预设声学模型库的灵活性和可扩展性。
在一实施例中,在步骤S20之后,即在获取与待聚类语音对应的目标通用语音向量的步骤之后,说话人聚类方法还包括:
S50.若待聚类语音在目标通用语音向量中的语音特征相似度大于预设阈值,则将待聚类语音归类到目标通用语音向量对应的聚类类簇中。
其中,待聚类语音是用于按类簇特征进行判定,待划分到对应类簇的说话人语音。目标通用语音向量是待聚类语音在预设声学模型库中匹配到的与自身语音特征相似度最高的一原始通用语音向量。语音特征相似度是待聚类语音和目标通用语音向量进行对比后得到的相似度比值。
预设阈值是根据实际经验设定的阈值,该阈值可以用于限定待聚类语音和目标通用语音向量属于同一类簇时,其语音特征相似度的最小值。应用于本实施例,可将预设阈值设定为0.75,即当待聚类语音在目标通用语音向量中的语音特征相似度大于0.75时,执行将待聚类语音归类到目标通用语音向量对应的聚类类簇中。
步骤S50中,当待聚类语音在预设声学模型库中匹配到与自身语音特征相似的目标通用语音向量,且待聚类语音相对目标通用语音向量的语音特征相似度大于预设阈值,识别服务器可自动将待聚类语音归类到目标通用语音向量对应的聚类类簇中,提高语音识别的聚类速度。
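为便于整体理解步骤S10至S50的判定流程,下面给出一个示意性的Python草图;其中match_best_vector、train_new_vector等函数名以及speeches、model_store的数据结构均为假设的占位,阈值0.75取自本实施例的示例值,并非本申请限定的实现:

```python
THRESHOLD = 0.75  # 本实施例示例的预设阈值

def cluster_speeches(speeches, model_store, match_best_vector, train_new_vector):
    """示意:依序处理按时长降序排列的待聚类语音。
    match_best_vector / train_new_vector 为假设的外部函数(占位)。"""
    clusters = {}
    for speech in sorted(speeches, key=lambda s: s["duration"], reverse=True):
        target_id, similarity = match_best_vector(speech, model_store)
        if similarity > THRESHOLD:
            # 相似度大于预设阈值:归入目标通用语音向量对应的聚类类簇
            clusters.setdefault(target_id, []).append(speech["id"])
        else:
            # 相似度不大于预设阈值:训练当前通用语音向量并存入预设声学模型库
            new_id, new_vector = train_new_vector(speech)
            model_store[new_id] = new_vector
            clusters.setdefault(new_id, []).append(speech["id"])
    return clusters
```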
在一实施例中,如图3所示,在步骤S20中,即依序将每一待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与待聚类语音对应的目标通用语音向量,具体包括如下步骤:
S21.依序将每一待聚类语音按预设规则划分成第一语音段和第二语音段。
其中,预设规则是用以设定将待聚类语音划分为第一语音段和第二语音段的时长的规则。
第一语音段是用以和预设声学模型库中每一原始通用语音向量进行语音自适应的语音段,第二语音段是用来与第一语音段进行自适应后生成的自适应语音特征进行对比的语音段。
可以理解地,用于进行自适应的第一语音段的时长越长,则自适应后生成的自适应语音特征准确性越高。因此,该预设规则遵循的原则是第一语音段的时长百分比大于第二语音段的时长百分比。应用于本实施例,可将用于自适应的第一语音段的时长百分比设定为75%;第二语音段的时长百分比设定为25%,用来进行聚类类簇的语音特征相似度打分。
步骤S21将待聚类语音段划分为用以进行语音自适应的第一语音段和用以进行打分的第二语音段,利于后续基于上述待聚类语音的两个语音段进行聚类类簇判定的准确性。
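以下为按75%/25%比例划分第一语音段与第二语音段的示意代码;其中以numpy数组表示采样点仅为假设,比例值取自本实施例:

```python
import numpy as np

def split_speech(samples: np.ndarray, ratio: float = 0.75):
    """示意:按预设规则把一条待聚类语音划分为第一语音段(用于自适应)和第二语音段(用于打分)。"""
    cut = int(len(samples) * ratio)
    return samples[:cut], samples[cut:]

# 用法示例:16kHz 采样率下 10 秒的伪造信号,划分为 7.5 秒与 2.5 秒两段
signal = np.random.randn(16000 * 10)
first_seg, second_seg = split_speech(signal)
```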
S22.分别对第一语音段和第二语音段进行特征提取,获取第一语音特征和第二语音特征。
其中,本实施例中提到的语音特征是代表本类簇区别于其它类簇的语音特征。一般采用梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,以下简称MFCC特征)作为语音特征。检测发现人耳像一个滤波器组,只关注某些特定的频率分量(人的听觉对频率是非线性的),也就是说人耳接收声音频率的信号是有限的。然而这些滤波器在频率坐标轴上却不是统一分布的,在低频区域有很多的滤波器,它们分布比较密集,但在高频区域,滤波器的数目就变得比较少,分布很稀疏。梅尔刻度滤波器组在低频部分的分辨率高,跟人耳的听觉特性是相符的,因此将采用梅尔频率倒谱系数作为语音特征,可以很好地体现聚类类簇的语音特征。
由上述对于语音特征的定义可知,第一语音特征是待聚类语音用于自适应部分的第一语音段对应的MFCC特征,第二语音特征是用于打分的第二语音段对应的MFCC特征。
本实施例中,获取第一语音特征的实现过程包括:对第一语音段进行预处理,获取预处理语音数据;预处理语音数据就是预加重语音数据,预加重是一种在发送端对输入信号高频分量进行补偿的信号处理方式。随着信号速率的增加,语音信号在传输过程中受损很大,为了使接收端能得到比较好的信号波形,就需要对受损的语音信号进行补偿。预加重技术的思想就是在传输线的发送端增强信号的高频成分,以补偿高频分量在传输过程中的过大衰减,使得接收端能够得到较好的语音信号波形。预加重对噪声并没有影响,因此能够有效提高输出信噪比。
对预处理语音数据作快速傅里叶变换,获取第一语音段的频谱,并根据频谱获取第一语音段的功率谱;采用梅尔刻度滤波器组处理第一语音段的功率谱,获取第一语音段的梅尔功率谱;在梅尔功率谱上进行倒谱分析,获取第一语音段的梅尔频率倒谱系数,也即获得第一语音段的MFCC特征。
对梅尔功率谱进行倒谱分析,根据倒谱的结果,分析并获取第一语音段的MFCC特征。通过该倒谱分析,可以将原本特征维度过高、难以直接使用的梅尔功率谱中包含的特征,转换成易于使用的特征(用来进行训练或识别的MFCC特征向量)。该MFCC特征能够作为第一语音特征对不同语音进行区分的系数,该第一语音特征可以反映语音之间的区别,可以用来识别和区分语音数据。
由于获取第二语音特征的实现过程与获取第一语音特征的过程相同,不再赘述。
步骤S22中,分别对第一语音段和第二语音段进行特征提取,获取第一语音特征和第二语音特征,能够准确地体现待聚类语音的特征,并将两者分别用于自适应和打分,可提高对待聚类语音进行聚类类簇的准确性。
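下面给出一个提取MFCC特征的示意片段;此处假设借助第三方库librosa完成加载、梅尔滤波与倒谱分析,预加重系数0.97为常用经验值,文件名均为假设,并非本申请限定的实现:

```python
import numpy as np
import librosa  # 假设使用 librosa 完成梅尔滤波与倒谱分析

def extract_mfcc(path: str, n_mfcc: int = 13, pre_emph: float = 0.97) -> np.ndarray:
    """示意:预加重 -> 快速傅里叶变换/功率谱 -> 梅尔刻度滤波 -> 倒谱分析,得到 MFCC 特征矩阵。"""
    y, sr = librosa.load(path, sr=None)
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])           # 预加重,补偿高频分量的衰减
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # 返回 (n_mfcc, 帧数) 的特征矩阵

# first_feature = extract_mfcc("first_segment.wav")    # 第一语音特征(假设的文件名)
# second_feature = extract_mfcc("second_segment.wav")  # 第二语音特征(假设的文件名)
```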
S23.将第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征。
其中,预设声学模型库中存储有根据现有所有类簇的类簇特征分别建立的原始通用语音向量。
语音自适应是在已经训练好的原始通用语音向量的基础上,用第一语音特征对原始通用语音向量进行调整,以提高原始通用语音模型的建模精度,从而使语音识别率接近于对第一语音特征经过充分训练的水平。目前广泛使用的语音自适应算法是基于MAP(Maximum a Posteriori,最大后验概率方法)方法进行参数重估。该方法利用原始通用语音向量参数的先验概率,以原始通用语音向量参数的后验概率最大为准则,重新估计原始通用语音向量的参数,从而提高自适应效果。可以理解地,自适应语音特征就是重新估计原始通用语音向量的参数后形成的新的第一语音特征对应的语音向量。MAP重估方法的实现过程如下:
设 O={O_1, O_2, ..., O_r} 是第一语音特征的概率密度函数为 p(O) 的一系列观察值,λ_estimate 是定义分布的原始通用语音向量的参数集合,p(λ|O) 是原始通用语音向量参数的后验分布。重估问题也即是给定训练数据序列 O,重新估计 λ_estimate 的过程。这个过程采用下述公式(1)实现:

$$\lambda_{\text{estimate}} = \arg\max_{\lambda} p(\lambda \mid O) \tag{1}$$
应用贝叶斯准则可得:
$$p(\lambda \mid O) = \frac{p(O \mid \lambda)\, p(\lambda)}{p(O)} \tag{2}$$
式中p(λ)是原始通用语音向量参数的先验分布,其中,λ是符合先验分布p(λ)的随机变量。
将(2)式代入(1)式可得到:
$$\lambda_{\text{estimate}} = \arg\max_{\lambda} \frac{p(O \mid \lambda)\, p(\lambda)}{p(O)} = \arg\max_{\lambda} p(O \mid \lambda)\, p(\lambda) \tag{3}$$
步骤S23可获取每一原始通用语音向量对应的自适应语音特征,为进一步基于该特征进行聚类类簇的判定提供技术基础。
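作为上述MAP参数重估思想的一个极简示意,下面给出对角协方差GMM均值的MAP自适应草图;其中相关性因子relevance=16为常用经验值,函数名与参数形状均为假设,并非本申请限定的实现:

```python
import numpy as np

def map_adapt_means(frames, weights, means, covars, relevance=16.0):
    """示意:用第一语音特征 frames (T, D) 对原始通用模型的均值 means (C, D) 做 MAP 重估。
    weights (C,) 与对角协方差 covars (C, D) 为原始通用语音向量的参数(假设)。"""
    # 计算每帧对每个高斯分量的后验占有率(先验 + 似然,再归一化)
    diff = frames[:, None, :] - means[None, :, :]                               # (T, C, D)
    log_prob = -0.5 * np.sum(diff ** 2 / covars + np.log(2 * np.pi * covars), axis=2)
    log_prob += np.log(weights)
    post = np.exp(log_prob - log_prob.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                                     # (T, C)

    # 依据统计量在先验均值与数据均值之间插值,得到自适应后的均值
    n_c = post.sum(axis=0)
    ex_c = post.T @ frames / np.maximum(n_c[:, None], 1e-10)
    alpha = n_c / (n_c + relevance)
    return alpha[:, None] * ex_c + (1.0 - alpha[:, None]) * means
```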
S24.对自适应语音特征和第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度。
其中,识别相似度是两个向量之间的相似程度,可通过计算两个向量的余弦空间距离从而得到余弦值,因此数值是从-1到1之间的。其中-1表示两个向量方向相反,1表示两个向量指向相同;0表示两个向量是独立的。在-1和1之间表示两个向量之间的相似性或相异性,可以理解地,相似度越接近1表示两个向量越接近。
步骤S24中,识别服务器可获取并记录每一原始通用语音向量对应的识别相似度,可基于该识别相似度判定出最接近的待聚类语音所在的聚类类簇。
S25.选取识别相似度最高的原始通用语音向量作为与待聚类语音对应的目标通用语音向量。
其中,目标通用语音向量是待聚类语音在预设声学模型库中匹配到与自身语音特征相似度最高的一原始通用语音向量。
可以理解地,两个向量的识别相似度最高说明两个向量最接近。步骤S25中通过选取识别相似度最高的原始通用语音向量作为与待聚类语音对应的目标通用语音向量,可暂时判定出待聚类语音在预设声学模型库中最有可能属于的已有的聚类类簇。
步骤S21至S25中,将待聚类语音划分为第一语音段和第二语音段进行特征提取,获取第一语音特征和第二语音特征,能够准确地体现待聚类语音的特征,并将两者分别用于自适应和打分,可提高对待聚类语音进行聚类类簇的准确性;通过选取识别相似度最高的原始通用语音向量作为与待聚类语音对应的目标通用语音向量,可暂时判定出待聚类语音在预设声学模型库中最有可能属于的已有的聚类类簇。
在一实施例中,如图4所示,在步骤S24中,即对自适应语音特征和第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度,具体包括如下步骤:
S241.分别获取自适应语音特征和第二语音特征对应的识别i-vector向量和第二i-vector向量。
其中,自适应语音特征就是重新估计原始通用语音向量的参数后形成的新的第一语音特征。第 二语音特征是用于打分的待聚类语音对应的第二语音段的语音特征。
识别i-vector向量和第二i-vector向量分别是将自适应语音特征和第二语音特征降维映射到一个低维的总变量空间后得到的两个固定长度的矢量表征。
具体地,获取I-Vector向量的过程,也称身份因子方法,它不尝试去强制分开说话人空间和信道空间,而是直接设置一个全局变化空间,它包含了语音数据中所有可能的信息。然后通过因子分析的方法,得到全局变化空间的载荷因子,这个就叫做I-Vector向量。
步骤S241,通过分别获取自适应语音特征和第二语音特征对应的识别i-vector向量和第二i-vector向量,可基于这两个矢量表征来进一步获取识别i-vector向量和第二i-vector向量的空间距离。
S242.采用余弦相似度算法获取识别i-vector向量和第二i-vector向量的识别相似度。
具体地,获取识别i-vector向量和第二i-vector向量的识别相似度,可由以下公式获得的余弦值进行判定:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

其中,A_i 和 B_i 分别代表向量A和向量B的各个分量。由上式可知,相似度范围从-1到1,其中-1表示两个向量方向相反,1表示两个向量指向相同;0表示两个向量是独立的。在-1和1之间表示两个向量之间的相似性或相异性,可以理解地,相似度越接近1表示两个向量越接近。
步骤S241至S242,识别服务器可采用余弦相似度算法获取识别i-vector向量和第二i-vector向量的识别相似度,简单快捷。
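余弦相似度的计算本身十分简单,下面给出一个numpy实现的示意(向量名为假设):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """示意:计算识别 i-vector 与第二 i-vector 的余弦相似度,取值范围为 [-1, 1]。"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# score = cosine_similarity(ivector_adapt, ivector_second)  # 越接近 1 表示两个向量越接近
```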
在一实施例中,如图5所示,在步骤S30中,即采用待聚类语音进行模型训练,获取与待聚类语音对应的当前通用语音向量,具体包括如下步骤:
S31.提取待聚类语音的测试语音特征。
其中,待聚类语音是用于按类簇特征进行判定,待划分到对应类簇的说话人语音。
测试语音特征是待聚类语音代表的聚类类簇区别于其它类簇的语音特征,具体是指对待聚类语音进行特征提取后获取的语音特征,应用于本实施例,可采用梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,以下简称MFCC特征)作为测试语音特征。
步骤S31中,识别服务器通过提取待聚类语音的测试语音特征,为建立当前通用语音向量准备技术支持。
S32.采用简化模型算法简化处理测试语音特征,获取简化语音特征。
其中,简化模型算法是指高斯模糊(Gaussian Blur,高斯平滑)处理算法,用于降低语音文件的声音噪声和细节层次。简化语音特征是经简化模型算法简化后去除声音噪声,得到的较为纯净的语音特征。
步骤S32中,采用简化模型算法简化处理测试语音特征,具体可先获取测试语音特征的二维正态分布,再模糊二维正态分布的所有音素,以获取更纯净的简化语音特征,该简化语音特征可以在很大程度上体现测试语音特征的特性,有助于提高后续训练当前通用语音向量的效率。
S33.采用最大期望算法迭代简化语音特征,获取总体变化子空间。
其中,最大期望算法(Expectation Maximization Algorithm,最大期望算法,以下简称EM算法)是一种迭代算法,在统计学中被用于寻找依赖于不可观察的隐性变量的概率模型中参数的最大似然估计。
总体变化子空间(Total Variability Space,以下简称T空间),是直接设置一个全局变化的映射矩阵,用以包含语音数据中说话人所有可能的信息,在T空间内不分开说话人空间和信道空间。T空间能把高维充分统计量(超矢量)映射到可以作为低维说话人表征的i-vector(identity-vector,身份认证向量),起到降维作用。T空间的训练过程包括:根据预设UBM模型,利用向量分析和EM(Expectation Maximization Algorithm,最大期望)算法,从其中收敛计算出T空间。
采用EM算法迭代简化语音特征,获取T空间的实现过程如下:
预先设置样本集 x = (x^(1), x^(2), ..., x^(m)) 包含 m 个独立样本,每个样本 x^(i) 对应的类别 z^(i) 是未知的,需要估计联合分布概率模型 p(x,z|θ) 和条件分布概率模型 p(z|x,θ) 的参数 θ,即需要找到合适的 θ 和 z 让 L(θ) 最大,其中,最大迭代次数为 J:
1)随机初始化简化语音特征的模型参数 θ,初值为 θ_0;
2)for j from 1 to J,开始EM算法迭代:
a)E步:计算联合分布的条件概率期望,根据参数 θ 初始值或上一次迭代所得参数值来计算出隐性变量的后验概率(即隐性变量的期望)Q_i(z^(i)),作为隐性变量的现估计值:
$$Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}, \theta_j)$$
$$L(\theta, \theta_j) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log P(x^{(i)}, z^{(i)} \mid \theta)$$
b)M步:极大化 L(θ, θ_j),得到 θ_{j+1}(将似然函数最大化以获得新的参数值):
$$\theta_{j+1} = \arg\max_{\theta} L(\theta, \theta_j)$$
c)如果 θ_{j+1} 已收敛,则算法结束;否则继续回到步骤a)进行E步迭代。
3)输出:T空间的模型参数 θ。
步骤S33获取的总体变化子空间不区分说话人空间和信道空间,将说话人空间的信息和信道空间的信息收敛于一个空间,以降低计算复杂度,便于进一步基于总体变化子空间,获取简化的当前通用语音向量。
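为直观展示E步/M步的迭代框架,下面给出一维两分量高斯混合模型的EM简化实现;它只用于说明“期望-最大化”的迭代思想,初始化方式与迭代次数均为假设,并非本申请T空间训练的权威实现:

```python
import numpy as np

def em_gmm_1d(x: np.ndarray, iters: int = 50):
    """示意:一维两分量 GMM 的 EM 迭代,演示 E 步(隐变量后验)与 M 步(参数极大化)。"""
    mu = np.array([x.min(), x.max()], dtype=float)      # 粗略初始化模型参数
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E 步:计算每个样本属于各分量的后验概率(隐性变量的期望)
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = p / p.sum(axis=1, keepdims=True)
        # M 步:以后验为权重重估参数,使似然函数最大化
        n = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / n
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
        w = n / len(x)
    return w, mu, var
```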
S34.将简化语音特征投影到总体变化子空间,以获取类簇标识对应的当前通用语音向量。
其中,简化语音特征就是由步骤S32获取的经简化模型算法处理后获取的语音特征。
当前通用语音向量是将简化语音特征投影到低维的总体变化子空间,获取的一个固定长度的矢量表征,用以表示属于同一类簇的多个说话人形成的语音向量。
步骤S31至S34中,识别服务器采用简化模型算法简化处理测试语音特征,获取简化语音特征后,再将简化语音特征投影到总体变化子空间后,可得更为纯净和简单的当前通用语音向量,以便后续基于当前通用语音向量对说话人的语音数据进行语音聚类,以降低进行语音聚类的复杂性,同时加快语音聚类的效率。
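下面用一次最小二乘投影示意“将简化语音特征映射到低维总体变化子空间、得到固定长度矢量表征”的思想;真实的i-vector估计还会利用UBM的后验统计量,此处的矩阵形状与函数名均为假设,仅作说明:

```python
import numpy as np

def project_to_subspace(supervector: np.ndarray, T: np.ndarray) -> np.ndarray:
    """示意:用总体变化矩阵 T(形状 高维×低维)把高维统计量(超矢量)压缩为低维向量。"""
    # 求解 T @ v ≈ supervector 的最小二乘解 v,作为固定长度的矢量表征(示意)
    v, *_ = np.linalg.lstsq(T, supervector, rcond=None)
    return v

# current_vector = project_to_subspace(supervector, T)  # 即示意中的"当前通用语音向量"
```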
在一实施例中,如图6所示,在步骤S32中,即采用简化模型算法简化处理测试语音特征,获取简化语音特征,具体包括如下步骤:
S321.采用高斯滤波器处理测试语音特征,获取对应的二维正态分布。
其中,高斯滤波器可对输入的测试语音特征进行线性平滑滤波,适用于消除高斯噪声,广泛应用于减噪过程。高斯滤波器处理测试语音特征的过程具体为对测试语音特征进行加权平均的过程,以测试语音特征中的音素为例,每一个音素的值,都由其本身和邻域内的其他音素值经过加权平均后得到。
二维正态分布(又名二维高斯分布),是满足如下密度函数特点:关于μ对称,在μ处达到最大值,在正(负)无穷远处取值为0,在μ±σ处有拐点;二维正态分布的形状是中间高两边低,图像是一条位于x轴上方的钟形曲线。
具体地,高斯滤波器对测试语音特征进行处理的具体操作是:用一个3*3掩模扫描测试语音特征中的每一个音素,用掩模确定的邻域内音素的加权平均值去替代模板中心音素的值,形成有关测试语音特征的二维正态分布,其中,每一个音素的加权平均值的计算过程包括:
(1)求各音素的权值总和。(2)逐个扫描测试语音特征中的音素,根据音素中各位置的权值求其邻域的加权平均值,并将求得的加权平均值赋给当前位置对应的音素。(3)循环步骤(2),直到处理完测试语音特征的全部音素。
经步骤S321,可去除测试语音特征中的噪音,输出为线性平滑的声音滤波,以获取纯净的声音滤波进行进一步处理。
S322.采用简化模型算法简化二维正态分布,获取简化语音特征。
应用于本实施例,简化模型算法可采用高斯模糊算法来简化二维正态分布。
具体地,高斯模糊算法简化二维正态分布的实现过程包括:每一个音素都取周边音素的平均值,"中间点"取"周围点"的平均值。在数值上,这是一种"平滑化"。在图形上,就相当于产生"模糊"效果,"中间点"失去细节。显然,计算平均值时,取值范围越大,"模糊效果"越强烈。
步骤S322中,识别服务器通过简化模型算法可获取测试语音特征对应的二维正态分布的简化语音特征,可进一步降低测试语音特征的语音细节,简化语音特征。
步骤S321至S322,识别服务器可依次将测试语音特征进行除噪和降低细节,以得到纯净简单的简化语音特征,利于提高语音聚类的识别效率。
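下面给出对测试语音特征做高斯平滑的示意片段;此处假设借助scipy的gaussian_filter实现邻域加权平均,sigma取值仅为示例,并非本申请限定的实现:

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # 假设使用 scipy 实现高斯平滑

def simplify_features(feature_map: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """示意:对按 帧×维度 排布的测试语音特征做高斯模糊,
    相当于用邻域加权平均替代"中间点",以去除噪声并降低细节层次。"""
    return gaussian_filter(feature_map, sigma=sigma)

# simplified = simplify_features(mfcc_matrix, sigma=1.5)  # sigma 越大,"模糊效果"越强
```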
本申请实施例提供的说话人聚类方法,通过将至少两个待聚类语音按语音时长降序排列,并在待聚类语音相对目标通用语音向量的语音特征相似度不大于预设阈值时,生成与待聚类语音对应的当前通用语音向量,提高了对待聚类语音进行分类的准确性;将当前通用语音向量存储在预设声学模型库中,扩大了预设声学模型库可识别聚类类簇的范围,提高了预设声学模型库的灵活性和可扩展性。
优选地,识别服务器将待聚类语音段划分为用以进行语音自适应的第一语音段和用以进行打分的第二语音段,分别对第一语音段和第二语音段进行特征提取,获取第一语音特征和第二语音特征,能够准确地体现待聚类语音的特征,并将两者分别用于自适应和打分,可提高对待聚类语音进行聚类类簇的准确性。识别服务器通过选取识别相似度最高的原始通用语音向量作为与待聚类语音对应的目标通用语音向量,可暂时判定出待聚类语音在预设声学模型库中最有可能属于的已有的聚类类簇。识别服务器采用余弦相似度算法获取识别i-vector向量和第二i-vector向量的识别相似度,简单快捷。识别服务器采用简化模型算法简化处理测试语音特征,获取简化语音特征后,再将简化语音特征投影到总体变化子空间后,可得更为纯净和简单的当前通用语音向量,以便后续基于当前通用语音向量对说话人的语音数据进行语音聚类,以降低进行语音聚类的复杂性,同时加快语音聚类的效率。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在一实施例中,提供一种说话人聚类装置,该说话人聚类装置与上述实施例中说话人聚类方法一一对应。如图7所示,该说话人聚类装置包括语音降序排列模块10、获取通用向量模块20、训练当前向量模块30和存储当前向量模块40,各功能模块详细说明如下:
语音降序排列模块10,用于将至少两个待聚类语音按语音时长降序排列。
获取通用向量模块20,用于依序将每一待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与待聚类语音对应的目标通用语音向量。
训练当前向量模块30,用于若待聚类语音在目标通用语音向量中的语音特征相似度不大于预设阈值,则采用待聚类语音进行模型训练,获取与待聚类语音对应的当前通用语音向量。
存储当前向量模块40,用于将当前通用语音向量存储在预设声学模型库中,并将待聚类语音归类到当前通用语音向量对应的聚类类簇中。
优选地,该说话人聚类装置还包括归类聚类类簇单元50。
归类聚类类簇单元50,用于若所述待聚类语音在所述目标通用语音向量中的语音特征相似度大于预设阈值,则将所述待聚类语音归类到所述目标通用语音向量对应的聚类类簇中。
优选地,获取通用向量模块20包括划分语音段单元21、获取语音特征单元22、获取识别特征单元23、获取识别相似度单元24和选取语音模型单元25。
划分语音段单元21,用于依序将每一所述待聚类语音按预设规则划分成第一语音段和第二语音段。
获取语音特征单元22,用于分别对所述第一语音段和所述第二语音段进行特征提取,获取第一语音特征和第二语音特征。
获取识别特征单元23,用于将所述第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征。
获取识别相似度单元24,用于对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度。
选取语音模型单元25,用于选取识别相似度最高的原始通用语音向量作为与所述待聚类语音对应的目标通用语音向量。
优选地,获取识别相似度单元24包括获取识别向量子单元241和获取识别相似度子单元242。
获取识别向量子单元241,用于分别获取所述自适应语音特征和所述第二语音特征对应的识别i-vector向量和第二i-vector向量。
获取识别相似度子单元242,用于采用余弦相似度算法获取所述识别i-vector向量和所述第二i-vector向量的识别相似度。
优选地,训练当前向量模块30包括提取测试特征单元31、获取简化特征单元32、获取变化子空间单元33和获取通用向量单元34。
提取测试特征单元31,用于提取待聚类语音的测试语音特征。
获取简化特征单元32,用于采用简化模型算法简化处理测试语音特征,获取简化语音特征。
获取变化子空间单元33,用于采用最大期望算法迭代简化语音特征,获取总体变化子空间。
获取通用向量单元34,用于将简化语音特征投影到总体变化子空间,以获取类簇标识对应的当前通用语音向量。
优选地,该获取简化特征单元32包括获取正态分布子单元321和获取简化特征子单元322。
获取正态分布子单元321,用于采用高斯滤波器处理测试语音特征,获取对应的二维正态分布。
获取简化特征子单元322,用于采用简化模型算法简化二维正态分布,获取简化语音特征。
关于说话人聚类装置的具体限定可以参见上文中对于说话人聚类方法的限定,在此不再赘述。上述说话人聚类装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一实施例中,提供一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图8所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储与说话人聚类方法相关的语音数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种说话人聚类方法。
在一实施例中,提供一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,处理器执行计算机可读指令时实现以下步骤:将至少两个待聚类语音按语音时长降序排列;依序将每一待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与待聚类语音对应的目标通用语音向量;若待聚类语音在目标通用语音向量中的语音特征相似度不大于预设阈值,则采用待聚类语音进行模型训练,获取与待聚类语音对应的当前通用语音向量;将当前通用语音向量存储在预设声学模型库中,并将待聚类语音归类到当前通用语音向量对应的聚类类簇中。
在一实施例中,在获取与待聚类语音对应的目标通用语音向量的步骤之后,处理器执行计算机可读指令时还实现以下步骤:若待聚类语音在目标通用语音向量中的语音特征相似度大于预设阈值,则将待聚类语音归类到目标通用语音向量对应的聚类类簇中。
在一实施例中,处理器执行计算机可读指令时还实现以下步骤:依序将每一待聚类语音按预设规则划分成第一语音段和第二语音段;分别对第一语音段和第二语音段进行特征提取,获取第一语音特征和第二语音特征;将第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征;对自适应语音特征和第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度;选取识别相似度最高的原始通用语音向量作为与待聚类语音对应的目标通用语音向量。
在一实施例中,处理器执行计算机可读指令时还实现以下步骤:分别获取自适应语音特征和第二语音特征对应的识别i-vector向量和第二i-vector向量;采用余弦相似度算法获取识别i-vector向量和第二i-vector向量的识别相似度。
在一实施例中,处理器执行计算机可读指令时还实现以下步骤:提取所述待聚类语音的测试语音特征;采用简化模型算法简化处理所述测试语音特征,获取简化语音特征;采用最大期望算法迭代所述简化语音特征,获取总体变化子空间;将所述简化语音特征投影到所述总体变化子空间,以获取所述类簇标识对应的所述当前通用语音向量。
在一实施例中,处理器执行计算机可读指令时实现以下步骤:采用高斯滤波器处理所述测试语音特征,获取对应的二维正态分布;采用简化模型算法简化所述二维正态分布,获取简化语音特征。
在一实施例中,一个或多个存储有计算机可读指令的非易失性可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行如下步骤:将至少两个待聚类语音按语音时长降序排列;依序将每一待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与待聚类语音对应的目标通用语音向量;若待聚类语音在目标通用语音向量中的语音特征相似度不大于预设阈值,则采用待聚类语音进行模型训练,获取与待聚类语音对应的当前通用语音向量;将当前通用语音向量存储在预设声学模型库中,并将待聚类语音归类到当前通用语音向量对应的聚类类簇中。
在一实施例中,在获取与待聚类语音对应的目标通用语音向量的步骤之后,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行如下步骤:若待聚类语音在目标通用语音向量中的语音特征相似度大于预设阈值,则将待聚类语音归类到目标通用语音向量对应的聚类类簇中。
在一实施例中,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行如下步骤:依序将每一待聚类语音按预设规则划分成第一语音段和第二语音段;分别对第一语音段和第二语音段进行特征提取,获取第一语音特征和第二语音特征;将第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征;对自适应语音特征和第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度;选取识别相似度最高的原始通用语音向量作为与待聚类语音对应的目标通用语音向量。
在一实施例中,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行如下步骤:分别获取自适应语音特征和第二语音特征对应的识别i-vector向量和第二i-vector向量;采用余弦相似度算法获取识别i-vector向量和第二i-vector向量的识别相似度。
在一实施例中,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行如下步骤:提取所述待聚类语音的测试语音特征;采用简化模型算法简化处理所述测试语音特征,获取简化语音特征;采用最大期望算法迭代所述简化语音特征,获取总体变化子空间;将所述简化语音特征投影到所述总体变化子空间,以获取所述类簇标识对应的所述当前通用语音向量。
在一实施例中,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行如下步骤:采用高斯滤波器处理所述测试语音特征,获取对应的二维正态分布;采用简化模型算法简化所述二维正态分布,获取简化语音特征。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种说话人聚类方法,其特征在于,包括:
    将至少两个待聚类语音按语音时长降序排列;
    依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量;
    若所述待聚类语音在所述目标通用语音向量中的语音特征相似度不大于预设阈值,则采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量;
    将所述当前通用语音向量存储在所述预设声学模型库中,并将所述待聚类语音归类到所述当前通用语音向量对应的聚类类簇中。
  2. 如权利要求1所述的说话人聚类方法,其特征在于,在获取与所述待聚类语音对应的目标通用语音向量的步骤之后,所述说话人聚类方法还包括:
    若所述待聚类语音在所述目标通用语音向量中的语音特征相似度大于预设阈值,则将所述待聚类语音归类到所述目标通用语音向量对应的聚类类簇中。
  3. 如权利要求1所述的说话人聚类方法,其特征在于,所述依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量,包括:
    依序将每一所述待聚类语音按预设规则划分成第一语音段和第二语音段;
    分别对所述第一语音段和所述第二语音段进行特征提取,获取第一语音特征和第二语音特征;
    将所述第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征;
    对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度;
    选取识别相似度最高的原始通用语音向量作为与所述待聚类语音对应的目标通用语音向量。
  4. 如权利要求3所述的说话人聚类方法,其特征在于,所述对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度,包括:
    分别获取所述自适应语音特征和所述第二语音特征对应的识别i-vector向量和第二i-vector向量;
    采用余弦相似度算法获取所述识别i-vector向量和所述第二i-vector向量的识别相似度。
  5. 如权利要求1所述的说话人聚类方法,其特征在于,所述采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量,包括:
    提取所述待聚类语音的测试语音特征;
    采用简化模型算法简化处理所述测试语音特征,获取简化语音特征;
    采用最大期望算法迭代所述简化语音特征,获取总体变化子空间;
    将所述简化语音特征投影到所述总体变化子空间,以获取所述类簇标识对应的所述当前通用语音向量。
  6. 如权利要求5所述的说话人聚类方法,其特征在于,所述采用简化模型算法简化处理所述测试语音特征,获取简化语音特征,包括:
    采用高斯滤波器处理所述测试语音特征,获取对应的二维正态分布;
    采用简化模型算法简化所述二维正态分布,获取简化语音特征。
  7. 一种说话人聚类装置,其特征在于,包括:
    语音降序排列模块,用于将至少两个待聚类语音按语音时长降序排列;
    获取通用向量模块,用于依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量;
    训练当前向量模块,用于若所述待聚类语音在所述目标通用语音向量中的语音特征相似度不大于预设阈值,则采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量;
    存储当前向量模块,用于将所述当前通用语音向量存储在所述预设声学模型库中,并将所述待聚类语音归类到所述当前通用语音向量对应的聚类类簇中。
  8. 如权利要求7所述的说话人聚类装置,其特征在于,所述说话人聚类装置还包括:
    归类聚类类簇模块,用于若所述待聚类语音在所述目标通用语音向量中的语音特征相似度大于预设阈值,则将所述待聚类语音归类到所述目标通用语音向量对应的聚类类簇中。
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:
    将至少两个待聚类语音按语音时长降序排列;
    依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量;
    若所述待聚类语音在所述目标通用语音向量中的语音特征相似度不大于预设阈值,则采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量;
    将所述当前通用语音向量存储在所述预设声学模型库中,并将所述待聚类语音归类到所述当前通用语音向量对应的聚类类簇中。
  10. 如权利要求9所述的计算机设备,其特征在于,在获取与所述待聚类语音对应的目标通用语音向量的步骤之后,所述处理器执行所述计算机可读指令时还实现如下步骤:
    若所述待聚类语音在所述目标通用语音向量中的语音特征相似度大于预设阈值,则将所述待聚类语音归类到所述目标通用语音向量对应的聚类类簇中。
  11. 如权利要求9所述的计算机设备,其特征在于,所述依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量,包括:
    依序将每一所述待聚类语音按预设规则划分成第一语音段和第二语音段;
    分别对所述第一语音段和所述第二语音段进行特征提取,获取第一语音特征和第二语音特征;
    将所述第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征;
    对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度;
    选取识别相似度最高的原始通用语音向量作为与所述待聚类语音对应的目标通用语音向量。
  12. 如权利要求11所述的计算机设备,其特征在于,所述对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度,包括:
    分别获取所述自适应语音特征和所述第二语音特征对应的识别i-vector向量和第二i-vector向量;
    采用余弦相似度算法获取所述识别i-vector向量和所述第二i-vector向量的识别相似度。
  13. 如权利要求9所述的计算机设备,其特征在于,所述采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量,包括:
    提取所述待聚类语音的测试语音特征;
    采用简化模型算法简化处理所述测试语音特征,获取简化语音特征;
    采用最大期望算法迭代所述简化语音特征,获取总体变化子空间;
    将所述简化语音特征投影到所述总体变化子空间,以获取所述类簇标识对应的所述当前通用语音向量。
  14. 如权利要求13所述的计算机设备,其特征在于,所述采用简化模型算法简化处理所述测试语音特征,获取简化语音特征,包括:
    采用高斯滤波器处理所述测试语音特征,获取对应的二维正态分布;
    采用简化模型算法简化所述二维正态分布,获取简化语音特征。
  15. 一个或多个存储有计算机可读指令的非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:
    将至少两个待聚类语音按语音时长降序排列;
    依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量;
    若所述待聚类语音在所述目标通用语音向量中的语音特征相似度不大于预设阈值,则采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量;
    将所述当前通用语音向量存储在所述预设声学模型库中,并将所述待聚类语音归类到所述当前通用语音向量对应的聚类类簇中。
  16. 如权利要求15所述的非易失性可读存储介质,其特征在于,在获取与所述待聚类语音对应的目标通用语音向量的步骤之后,所述一个或多个处理器还执行如下步骤:
    若所述待聚类语音在所述目标通用语音向量中的语音特征相似度大于预设阈值,则将所述待聚类语音归类到所述目标通用语音向量对应的聚类类簇中。
  17. 如权利要求15所述的非易失性可读存储介质,其特征在于,所述依序将每一所述待聚类语音与预设声学模型库中每一原始通用语音向量进行语音识别,获取与所述待聚类语音对应的目标通用语音向量,包括:
    依序将每一所述待聚类语音按预设规则划分成第一语音段和第二语音段;
    分别对所述第一语音段和所述第二语音段进行特征提取,获取第一语音特征和第二语音特征;
    将所述第一语音特征输入到预设声学模型库中每一原始通用语音向量进行语音自适应,获取每一原始通用语音向量对应的自适应语音特征;
    对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度;
    选取识别相似度最高的原始通用语音向量作为与所述待聚类语音对应的目标通用语音向量。
  18. 如权利要求17所述的非易失性可读存储介质,其特征在于,所述对所述自适应语音特征和所述第二语音特征进行相似度计算,获取每一原始通用语音向量对应的识别相似度,包括:
    分别获取所述自适应语音特征和所述第二语音特征对应的识别i-vector向量和第二i-vector向量;
    采用余弦相似度算法获取所述识别i-vector向量和所述第二i-vector向量的识别相似度。
  19. 如权利要求15所述的非易失性可读存储介质,其特征在于,所述采用所述待聚类语音进行模型训练,获取与所述待聚类语音对应的当前通用语音向量,包括:
    提取所述待聚类语音的测试语音特征;
    采用简化模型算法简化处理所述测试语音特征,获取简化语音特征;
    采用最大期望算法迭代所述简化语音特征,获取总体变化子空间;
    将所述简化语音特征投影到所述总体变化子空间,以获取所述类簇标识对应的所述当前通用语音向量。
  20. 如权利要求19所述的非易失性可读存储介质,其特征在于,所述采用简化模型算法简化处理所述测试语音特征,获取简化语音特征,包括:
    采用高斯滤波器处理所述测试语音特征,获取对应的二维正态分布;
    采用简化模型算法简化所述二维正态分布,获取简化语音特征。
PCT/CN2018/103824 2018-06-11 2018-09-03 说话人聚类方法、装置、计算机设备及存储介质 Ceased WO2019237517A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810592867.9A CN109065028B (zh) 2018-06-11 2018-06-11 说话人聚类方法、装置、计算机设备及存储介质
CN201810592867.9 2018-06-11

Publications (1)

Publication Number Publication Date
WO2019237517A1 true WO2019237517A1 (zh) 2019-12-19

Family

ID=64820020

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/103824 Ceased WO2019237517A1 (zh) 2018-06-11 2018-09-03 说话人聚类方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN109065028B (zh)
WO (1) WO2019237517A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052270A (zh) * 2021-05-10 2021-06-29 清华大学 分类精度评价方法、装置、计算机设备和存储介质
CN113470695A (zh) * 2021-06-30 2021-10-01 平安科技(深圳)有限公司 声音异常检测方法、装置、计算机设备及存储介质
CN114220419A (zh) * 2021-12-31 2022-03-22 科大讯飞股份有限公司 一种语音评价方法、装置、介质及设备
CN114596863A (zh) * 2022-02-11 2022-06-07 厦门快商通科技股份有限公司 一种交互式的声纹聚类方法、系统、电子设备及存储介质
CN116631432A (zh) * 2023-06-21 2023-08-22 中信银行股份有限公司 一种音频分离和话术违规提醒方法、装置及计算机设备
CN117725273A (zh) * 2023-09-21 2024-03-19 书行科技(北京)有限公司 样本标签生成方法、装置、计算机设备和存储介质
CN118197324A (zh) * 2024-05-16 2024-06-14 江西广播电视网络传媒有限公司 对话语料提取方法、系统、计算机及存储介质
CN118824276A (zh) * 2024-09-14 2024-10-22 北京云行在线软件开发有限责任公司 一种基于声纹聚类的网约车音频角色识别方法及设备

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545229B (zh) * 2019-01-11 2023-04-21 华南理工大学 一种基于语音样本特征空间轨迹的说话人识别方法
CN109961794B (zh) * 2019-01-14 2021-07-06 湘潭大学 一种基于模型聚类的提高说话人识别效率的方法
CN109800299B (zh) * 2019-02-01 2021-03-09 浙江核新同花顺网络信息股份有限公司 一种说话人聚类方法及相关装置
CN112204657B (zh) * 2019-03-29 2023-12-22 微软技术许可有限责任公司 利用提前停止聚类的讲话者分离
CN110119762B (zh) * 2019-04-15 2023-09-26 华东师范大学 基于聚类的人类行为依赖分析方法
CN110782879B (zh) * 2019-09-18 2023-07-07 平安科技(深圳)有限公司 基于样本量的声纹聚类方法、装置、设备及存储介质
CN110942765B (zh) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 一种构建语料库的方法、设备、服务器和存储介质
CN111414511B (zh) * 2020-03-25 2023-08-22 合肥讯飞数码科技有限公司 自动声纹建模入库方法、装置以及设备
CN111599346B (zh) * 2020-05-19 2024-02-20 科大讯飞股份有限公司 一种说话人聚类方法、装置、设备及存储介质
CN111754982B (zh) * 2020-06-19 2024-11-05 平安科技(深圳)有限公司 语音通话的噪声消除方法、装置、电子设备及存储介质
CN111933152B (zh) * 2020-10-12 2021-01-08 北京捷通华声科技股份有限公司 注册音频的有效性的检测方法、检测装置和电子设备
CN112530409B (zh) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 基于几何学的语音样本筛选方法、装置及计算机设备
CN114023349B (zh) * 2021-10-29 2025-07-29 北京百度网讯科技有限公司 语音处理方法、装置、电子设备及存储介质
CN114141253B (zh) * 2021-12-14 2025-02-11 青岛海尔科技有限公司 一种语音识别的方法及装置、电子设备、存储介质
CN114464194A (zh) * 2022-03-12 2022-05-10 云知声智能科技股份有限公司 声纹聚类方法、装置、存储介质及电子装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103370920A (zh) * 2011-03-04 2013-10-23 高通股份有限公司 用于基于背景相似度对客户端装置进行分组的方法和设备
CN105989849A (zh) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 一种语音增强方法、语音识别方法、聚类方法及装置
CN108091326A (zh) * 2018-02-11 2018-05-29 张晓雷 一种基于线性回归的声纹识别方法及系统

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6073096A (en) * 1998-02-04 2000-06-06 International Business Machines Corporation Speaker adaptation system and method based on class-specific pre-clustering training speakers
KR100612840B1 (ko) * 2004-02-18 2006-08-18 삼성전자주식회사 모델 변이 기반의 화자 클러스터링 방법, 화자 적응 방법및 이들을 이용한 음성 인식 장치
ES2535858T3 (es) * 2007-08-24 2015-05-18 Deutsche Telekom Ag Procedimiento y dispositivo para la clasificación de interlocutores
CN102479511A (zh) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 一种大规模声纹认证方法及其系统
CN103871413A (zh) * 2012-12-13 2014-06-18 上海八方视界网络科技有限公司 基于svm和hmm混合模型的男女说话声音分类方法
CN103258535A (zh) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 基于声纹识别的身份识别方法及系统
CN105469784B (zh) * 2014-09-10 2019-01-08 中国科学院声学研究所 一种基于概率线性鉴别分析模型的说话人聚类方法及系统
CN106971713B (zh) * 2017-01-18 2020-01-07 北京华控智加科技有限公司 基于密度峰值聚类和变分贝叶斯的说话人标记方法与系统
CN107342077A (zh) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 一种基于因子分析的说话人分段聚类方法及系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103370920A (zh) * 2011-03-04 2013-10-23 高通股份有限公司 用于基于背景相似度对客户端装置进行分组的方法和设备
CN105989849A (zh) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 一种语音增强方法、语音识别方法、聚类方法及装置
CN108091326A (zh) * 2018-02-11 2018-05-29 张晓雷 一种基于线性回归的声纹识别方法及系统

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052270A (zh) * 2021-05-10 2021-06-29 清华大学 分类精度评价方法、装置、计算机设备和存储介质
CN113470695A (zh) * 2021-06-30 2021-10-01 平安科技(深圳)有限公司 声音异常检测方法、装置、计算机设备及存储介质
CN113470695B (zh) * 2021-06-30 2024-02-09 平安科技(深圳)有限公司 声音异常检测方法、装置、计算机设备及存储介质
CN114220419A (zh) * 2021-12-31 2022-03-22 科大讯飞股份有限公司 一种语音评价方法、装置、介质及设备
CN114596863A (zh) * 2022-02-11 2022-06-07 厦门快商通科技股份有限公司 一种交互式的声纹聚类方法、系统、电子设备及存储介质
CN116631432A (zh) * 2023-06-21 2023-08-22 中信银行股份有限公司 一种音频分离和话术违规提醒方法、装置及计算机设备
CN117725273A (zh) * 2023-09-21 2024-03-19 书行科技(北京)有限公司 样本标签生成方法、装置、计算机设备和存储介质
CN118197324A (zh) * 2024-05-16 2024-06-14 江西广播电视网络传媒有限公司 对话语料提取方法、系统、计算机及存储介质
CN118824276A (zh) * 2024-09-14 2024-10-22 北京云行在线软件开发有限责任公司 一种基于声纹聚类的网约车音频角色识别方法及设备

Also Published As

Publication number Publication date
CN109065028B (zh) 2022-12-30
CN109065028A (zh) 2018-12-21

Similar Documents

Publication Publication Date Title
CN109065028B (zh) 说话人聚类方法、装置、计算机设备及存储介质
CN111712874B (zh) 用于确定声音特性的方法、系统、装置和存储介质
CN108922544B (zh) 通用向量训练方法、语音聚类方法、装置、设备及介质
Chou et al. Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations
CN108922543B (zh) 模型库建立方法、语音识别方法、装置、设备及介质
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
Ittichaichareon et al. Speech recognition using MFCC
TW201935464A (zh) 基於記憶性瓶頸特徵的聲紋識別的方法及裝置
WO2019227586A1 (zh) 语音模型训练方法、说话人识别方法、装置、设备及介质
JP6845489B2 (ja) 音声処理装置、音声処理方法、および音声処理プログラム
CN109065022B (zh) i-vector向量提取方法、说话人识别方法、装置、设备及介质
WO2019227574A1 (zh) 语音模型训练方法、语音识别方法、装置、设备及介质
KR102026226B1 (ko) 딥러닝 기반 Variational Inference 모델을 이용한 신호 단위 특징 추출 방법 및 시스템
WO2022143723A1 (zh) 语音识别模型训练方法、语音识别方法及相应装置
CN111933187B (zh) 情感识别模型的训练方法、装置、计算机设备和存储介质
CN113345464B (zh) 语音提取方法、系统、设备及存储介质
WO2020045313A1 (ja) マスク推定装置、マスク推定方法及びマスク推定プログラム
CN114765028A (zh) 声纹识别方法、装置、终端设备及计算机可读存储介质
CN111462762A (zh) 一种说话人向量正则化方法、装置、电子设备和存储介质
CN119339714B (zh) 多语言语音识别方法、装置、设备及介质
KR20200114705A (ko) 음성 신호 기반의 사용자 적응형 스트레스 인식 방법
Mccree et al. Language Recognition for Telephone and Video Speech: The JHU HLTCOE Submission for NIST LRE17.
JP2016162437A (ja) パターン分類装置、パターン分類方法およびパターン分類プログラム
CN115101055A (zh) 语音情绪识别模型训练方法、装置、计算机设备及介质

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 26/03/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18922895

Country of ref document: EP

Kind code of ref document: A1