CN111951791B - Voiceprint recognition model training method, electronic equipment and storage medium - Google Patents
- Publication number
- CN111951791B (application CN202010869727.9A)
- Authority
- CN
- China
- Prior art keywords
- audio data
- pieces
- voiceprint
- recognition model
- persons
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Abstract
The application provides a voiceprint recognition model training method, a voiceprint recognition method, and electronic equipment, and belongs to the technical field of voiceprint recognition. M pieces of audio data, different from the original N persons' pieces of audio data, are added on the basis of those N persons' data. A model trained with the M pieces of audio data as negative example data can more effectively judge which voiceprint features do not belong to the same person, which greatly reduces the probability of misjudging two persons as the same person and effectively improves the accuracy of the model's judgments.
Description
Technical Field
The application relates to the technical field of voiceprint recognition, and in particular to a training method and a recognition method for a voiceprint recognition model of an unknown speaker, electronic equipment, and a storage medium.
Background
Voiceprint recognition, also known as speaker recognition, is a type of biometric technology. It converts an acoustic signal into an electrical signal, which a computer then recognizes.
In existing voiceprint recognition, voiceprint recognition algorithms based on deep learning need to acquire a large amount of data for training, and the data for a single person must span channels and time and also contain relatively long audio in order to improve the accuracy of model training. Even so, in actual use of the model, erroneous judgments often occur, for example two persons being treated as one person or one person being treated as two, which affects the recognition accuracy of the model.
Disclosure of Invention
In view of the above, the present application provides a voiceprint recognition model training method, a voiceprint recognition method, an electronic device, and a storage medium, which improve the judgment accuracy of a voiceprint recognition model by adding, as negative examples, audio data different from the original model training set.
Some embodiments of the application provide a voiceprint recognition model training method. The application is described below in terms of several aspects, whose embodiments and advantages may be cross-referenced.
In a first aspect, the present application provides a voiceprint recognition model training method, including: the electronic equipment trains based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio, some or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1; the electronic equipment acquires M pieces of audio data, wherein the M pieces of audio data are derived from speakers of undefined identity; the electronic equipment extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, wherein the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1; the electronic equipment takes the M voiceprint features as fixed weights and adds them into the speaker classification algorithm of the first voiceprint recognition model to obtain a second voiceprint recognition model; and the electronic equipment trains the second voiceprint recognition model based on the plurality of pieces of audio data of the N persons to obtain an unknown speaker voiceprint recognition model.
According to the voiceprint recognition model training method of the embodiments of the application, M pieces of audio data, different from the original N persons' audio data, are added on the basis of that data. A model trained with the M pieces of audio data as negative example data can more effectively judge which voiceprint features do not belong to the same person, which greatly reduces the probability of misjudging two persons as the same person and effectively improves the accuracy of the model's judgments.
In an embodiment of the first aspect of the present application, the electronic device adds the M voiceprint features as fixed weights into the speaker classification algorithm of the first voiceprint recognition model to obtain a new speaker classification algorithm formula L_i as follows:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \quad \cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}, \quad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person, and ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M].
In an embodiment of the first aspect of the present application, the electronic device performs training based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, including:
The electronic equipment acquires a plurality of pieces of audio data of N persons, intercepts audio of a set duration from each piece of audio data, and converts each intercepted audio clip of the set duration into a multidimensional spectrogram;
the electronic equipment sends the multidimensional spectrogram into a convolutional neural network to obtain voiceprint characteristics of a plurality of pieces of audio data of N persons;
and the electronic equipment performs speaker classification calculation based on voiceprint characteristics of the plurality of pieces of audio data of the N persons, and obtains a first voiceprint recognition model.
In an embodiment of the first aspect of the application, the method further comprises: the electronic equipment extracts voiceprint features from the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
In a second aspect, the present application provides a voiceprint recognition method of audio data, applied to an electronic device, the method comprising: the electronic equipment acquires audio data to be identified; the electronic equipment extracts the voiceprint feature of the audio data to be identified based on an unknown speaker voiceprint recognition model; the electronic device performs a 1:1 comparison calculation between the voiceprint feature F1 of the audio data to be identified and a standard voiceprint feature F2 and calculates the cosine similarity of voiceprint feature F1 and voiceprint feature F2; when the cosine similarity is greater than or equal to a set threshold, the persons corresponding to voiceprint feature F1 and voiceprint feature F2 are judged to be the same person, and when the cosine similarity is less than the set threshold, they are judged not to be the same person.
In an embodiment of the second aspect of the present application, the training method of the unknown speaker voiceprint recognition model includes: the electronic equipment trains based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio, some or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1; the electronic equipment acquires M pieces of audio data, wherein the M pieces of audio data are derived from speakers of undefined identity; the electronic equipment extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, wherein the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1; the electronic equipment takes the M voiceprint features as fixed weights and adds them into the speaker classification algorithm of the first voiceprint recognition model to obtain a second voiceprint recognition model; and the electronic equipment trains the second voiceprint recognition model based on the plurality of pieces of audio data of the N persons to obtain the unknown speaker voiceprint recognition model.
According to the embodiments of the application, the unknown speaker voiceprint recognition model obtained by the voiceprint recognition model training method of the first aspect can more effectively judge which voiceprint features do not belong to the same person, which greatly reduces the probability of misjudging two persons as the same person and effectively improves the accuracy of the model's judgments.
In an embodiment of the second aspect of the present application, the electronic device adds the M voiceprint features as fixed weights into the speaker classification algorithm of the first voiceprint recognition model to obtain a new speaker classification algorithm formula L_i as:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \quad \cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}, \quad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person, and ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M].
In an embodiment of the second aspect of the present application, the electronic device performs training based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, including: the electronic equipment acquires a plurality of pieces of audio data of N persons, intercepts audio of a set duration from each piece of audio data, and converts each intercepted audio clip of the set duration into a multidimensional spectrogram; the electronic equipment sends the multidimensional spectrograms into a convolutional neural network to obtain voiceprint features of the pieces of audio data of the N persons; and the electronic equipment performs speaker classification calculation based on the voiceprint features of the pieces of audio data of the N persons to obtain the first voiceprint recognition model.
In an embodiment of the second aspect of the application, the method further comprises: the electronic equipment extracts voiceprint features from the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
In a third aspect, the present application also provides an electronic device, including: a processing module, configured to train based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio, some or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1; and an acquisition module, configured to acquire M pieces of audio data, wherein the M pieces of audio data are derived from speakers of undefined identity. The processing module extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, wherein the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1; the processing module takes the M voiceprint features as fixed weights and adds them into the speaker classification algorithm of the first voiceprint recognition model to obtain a second voiceprint recognition model; and the processing module trains the second voiceprint recognition model based on the plurality of pieces of audio data of the N persons to obtain an unknown speaker voiceprint recognition model.
According to the electronic equipment provided by the embodiments of the application, M pieces of audio data, different from the original N persons' audio data, are added on the basis of that data, and the model trained with the M pieces as negative example data can more effectively judge which voiceprint features do not belong to the same person, which greatly reduces the probability of misjudging two persons as the same person and effectively improves the accuracy of the model's judgments.
In an embodiment of the third aspect of the present application, the processing module adds the M voiceprint features as fixed weights into the speaker classification algorithm of the first voiceprint recognition model to obtain a new speaker classification algorithm formula L_i as:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \quad \cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}, \quad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person, and ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M].
In an embodiment of the third aspect of the present application, the processing module performs training based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, including: the processing module intercepts audio of a set duration from each of the pieces of audio data of the N persons, and converts each intercepted audio clip of the set duration into a multidimensional spectrogram; the processing module sends the multidimensional spectrograms into a convolutional neural network to obtain voiceprint features of the pieces of audio data of the N persons; and the processing module performs speaker classification calculation based on the voiceprint features of the pieces of audio data of the N persons to obtain the first voiceprint recognition model.
In an embodiment of the third aspect of the present application, the electronic device further comprises: the processing module extracts voiceprint features from the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
In a fourth aspect, the present application also provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of the embodiments of the first and second aspects.
Drawings
FIG. 1 is a scene diagram of a voiceprint recognition system in accordance with one embodiment of the present application;
FIG. 2 is a flow chart of a voiceprint recognition model training method in accordance with one embodiment of the present application;
FIG. 3 is a spectrogram of audio data according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for identifying voiceprints of audio data according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 6 is a block diagram of an apparatus according to some embodiments of the application;
Fig. 7 is a block diagram of a system on a chip (SoC) in accordance with some embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
It is to be appreciated that as used herein, the term "module" may refer to or include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality, or may be part of such hardware components.
It is to be appreciated that in various embodiments of the application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single core processor, a multi-core processor, or the like, and/or any combination thereof.
Embodiments of the present application will be described below in conjunction with specific scenarios.
Referring to fig. 1, fig. 1 shows a scene diagram of a voiceprint recognition system. The scenario includes an electronic device 110 and a cloud server 120. The electronic device 110 may obtain a plurality of pieces of audio data of N persons and M pieces of audio data from the cloud server 120, wherein each of the N persons has a plurality of pieces of audio, and some or all of the N persons have cross-time and cross-channel audio data. The M pieces of audio data are different from the pieces of audio data of the N persons; they may consist of short single-person audio of about 10 s that crosses neither channel nor time, and they are derived from speakers of unknown identity, that is, the identity of the specific speaker of each of the M pieces of audio data is unknown. The electronic device 110 may train an unknown speaker voiceprint recognition model based on the pieces of audio data of the N persons, the M pieces of audio data, and a convolutional neural network (Convolutional Neural Network, CNN) or a recurrent neural network (Recurrent Neural Network, RNN).
When the electronic device 110 performs voiceprint recognition, the audio data to be recognized can be input into the unknown speaker voiceprint recognition model to obtain more accurate voiceprint features of the audio data to be recognized.
Because the model is trained with M pieces of audio data added on the basis of the original N persons' audio data, and the M pieces differ from the N persons' audio data and serve as negative example data, the model can more effectively judge which voiceprint features do not belong to the same person. This greatly reduces the probability of misjudging two persons as the same person and effectively improves the accuracy of the model's judgments.
In another embodiment of the present application, the model training process may also be performed at the cloud server 120, and the electronic device 110 may communicate with the cloud server to obtain a trained unknown speaker voiceprint recognition model. This is not intended to be limiting.
The electronic device in the application may be a device with a voiceprint recognition function such as a mobile phone, a notebook computer, a tablet computer, a desktop computer, a laptop computer, an Ultra-mobile Personal Computer (UMPC), a handheld computer, a netbook, a Personal Digital Assistant (PDA), a wearable electronic device, and the like.
For ease of understanding and distinction hereinafter, the initial voiceprint recognition model referred to in the present application is called the first voiceprint recognition model. The electronic device 110 extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, adds the M voiceprint features as fixed weights into the speaker classification algorithm to obtain a second voiceprint recognition model, and trains the second voiceprint recognition model based on the pieces of audio data of the N persons to obtain the unknown speaker voiceprint recognition model.
Embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
FIG. 2 shows a flow chart of a voiceprint recognition model training method. The method is performed by an electronic device. As shown in fig. 2, the flowchart includes:
Step 210, acquiring a plurality of pieces of audio data of N persons. Each of the N persons has a plurality of pieces of audio; for example, 2000 persons may have 50000 pieces of audio data in total. Some or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1. The pieces of audio data of the N persons may be obtained by the electronic equipment from a cloud server, which in turn may collect them from a plurality of devices communicatively connected to it.
The cross-time audio data in the application refer to the audio data of the same speaker in different time periods, and the cross-channel audio data refer to the audio data input by the same speaker through different signal transmission media.
Step 220, training to obtain a first voiceprint recognition model based on the plurality of pieces of audio data of the N persons and a neural network model. The neural network model may be a CNN model or an RNN model. The speaker classification formula of the first voiceprint recognition model is derived as follows:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)}}$$

wherein

$$\cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}$$

In the above formula, the larger the numerator term, the better; that is, x_i should resemble w_i, where x_i denotes the voiceprint feature of the i-th person obtained in a given training sample and w_i denotes the standard voiceprint feature of the i-th person. The smaller the other denominator terms, the better; that is, x_i should be unlike the other w_j, where j belongs to [1, i-1] or [i+1, N]. In other words, the more alike two voiceprint features are in a 1:1 comparison, the greater their cosine similarity cos(θ_i); the less alike x_i and another w_j are, the smaller their cosine similarity cos(θ_j).
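For illustration only, the following Python sketch computes this classification probability with PyTorch; the function name, the stacking of the standard voiceprint features into a matrix W, and the use of PyTorch itself are assumptions of this sketch rather than details of the application.

```python
import torch
import torch.nn.functional as F

def speaker_probability(x_i: torch.Tensor, W: torch.Tensor, i: int) -> torch.Tensor:
    """L_i: softmax over cosine similarities between a voiceprint feature
    x_i (shape (D,)) and the N standard voiceprint features W (shape (N, D))."""
    cos = F.normalize(W, dim=1) @ F.normalize(x_i, dim=0)  # cos(theta_j), shape (N,)
    return torch.softmax(cos, dim=0)[i]                    # numerator over denominator
```

Training drives L_i toward 1 for the correct speaker, which simultaneously increases cos(θ_i) and decreases the other cos(θ_j).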
In step 230, M pieces of audio data are acquired. The M pieces of audio data originate from speakers of undefined identity, the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1; there may be, for example, 10000 or 20000 pieces of audio data.
Step 240, inputting the M pieces of audio data into the first voiceprint recognition model obtained in step 220, and extracting the voiceprint features W_k of the M pieces of audio data, where k belongs to [1, M].
Step 250, taking the M corresponding voiceprint features W_k as fixed weights and adding them into the speaker classification algorithm to form a new speaker classification algorithm formula:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \qquad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person, and ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M].
Step 260, obtaining a second voiceprint recognition model. The second voiceprint recognition model is obtained by adding the fixed-weight terms of the M pieces of audio data into the speaker classification formula above.
Step 270, inputting the pieces of audio data of the N persons into the second voiceprint recognition model. After the M voiceprint features have been added, training the second voiceprint recognition model on the N persons' pieces of audio data yields a new voiceprint model with higher accuracy, namely the unknown speaker voiceprint recognition model of step 280. Because the added M pieces of audio data differ from the N persons' audio data, they form negative examples, which strengthens the model's ability to determine which voiceprint features are not the same person. Further, in the voiceprint recognition process, even when the voiceprint feature of some audio data cannot be confirmed to be person A (a specific person), it can at least be excluded as belonging to person B (a person other than A). The accuracy of the voiceprint recognition model is thus improved by adding negative examples.
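For illustration, the following PyTorch sketch shows one way such a classification layer could be realized: the M extracted voiceprint features are stored as a frozen buffer whose cosine-similarity columns enter only the softmax denominator, since no training label ever points at them. The class name and the absence of a logit scale factor are simplifications of this sketch, not details from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NegativeAugmentedClassifier(nn.Module):
    """Speaker classification over N persons plus M fixed negative weights."""

    def __init__(self, feat_dim: int, n_speakers: int, m_fixed_feats: torch.Tensor):
        super().__init__()
        # Learned weights w_j for the N known speakers
        self.weight = nn.Parameter(torch.randn(n_speakers, feat_dim))
        # Fixed weights from the M unknown-identity audio pieces (never updated)
        self.register_buffer("fixed_weight", F.normalize(m_fixed_feats, dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=1)                             # (B, D)
        cos_known = x @ F.normalize(self.weight, dim=1).t()   # (B, N)
        cos_fixed = x @ self.fixed_weight.t()                 # (B, M)
        # Labels always lie in [0, N), so the M extra columns act purely
        # as negative examples in the softmax denominator.
        return torch.cat([cos_known, cos_fixed], dim=1)       # (B, N + M)

# Usage: loss = F.cross_entropy(classifier(features), labels)  # labels in [0, N)
```

In practice, cosine-based classification losses usually multiply the logits by a scale factor before the softmax; it is omitted here for brevity.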
In the embodiment of the present application, after training for a period of time, the method may also return to steps 240-270, with the first voiceprint recognition model replaced by the latest voiceprint recognition model: the voiceprint features of the M pieces of audio data are updated using the latest model, the updated M voiceprint features are used as fixed weights, and training on the audio data of the N persons is performed again. The unknown speaker voiceprint recognition model is thereby continuously refined, improving its voiceprint recognition accuracy.
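For illustration, this refresh step could look as follows, continuing the assumptions of the sketch above (extractor stands for the feature-extraction part of the latest model, classifier for its classification layer; both names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def refresh_fixed_weights(extractor, classifier, m_audio_batch):
    """Re-extract the M voiceprint features with the latest model and
    install them as the new fixed weights before retraining on N persons."""
    new_feats = extractor(m_audio_batch)                        # (M, D)
    classifier.fixed_weight.copy_(F.normalize(new_feats, dim=1))
```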
In one embodiment of the present application, step 220 further comprises the steps of:
Step 221, intercepting audio of a set duration from each piece of audio data of the N persons to obtain multidimensional spectrograms. For example, 3 seconds of audio is truncated from each piece of audio data, yielding an F×D-dimensional spectrogram for each 3-second clip; see the multidimensional spectrogram of 3 seconds of audio shown in fig. 3 (a sketch of steps 221-222 is given after step 223 below).
Step 222, extracting voiceprint features from the pieces of audio data of the N persons. The F×D spectrogram is fed into a CNN or RNN, and the outputs are aggregated to obtain the voiceprint features.
In step 223, speaker classification is performed on the voiceprint features to obtain a first voiceprint recognition model. Speaker classification means associating the extracted voiceprint features with the N persons. For example, with 2000 persons and 50000 pieces of audio data, the 50000 pieces of audio data are respectively associated with the 2000 persons.
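For illustration, the following Python sketch shows one way steps 221-222 could turn a set-duration clip into an F×D spectrogram. The use of torchaudio for loading and the n_fft and hop_length values are assumptions of this sketch, not values given in the application.

```python
import torch
import torchaudio

def clip_to_spectrogram(path: str, clip_seconds: int = 3,
                        n_fft: int = 512, hop_length: int = 160) -> torch.Tensor:
    """Truncate one piece of audio to a set duration and compute an
    F x D log-magnitude spectrogram for it."""
    wav, sr = torchaudio.load(path)               # (channels, samples)
    wav = wav.mean(dim=0)[: clip_seconds * sr]    # mono, first 3 seconds
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs().log1p()                     # shape (F, D)
```

Each such spectrogram is then fed through the CNN (or RNN), and the frame-level outputs are aggregated into a single voiceprint feature vector per clip.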
A voiceprint recognition method using audio data of an unknown speaker according to an embodiment of the present application will be described with reference to the accompanying drawings.
Referring to fig. 4, fig. 4 shows a flowchart of a voiceprint recognition method of audio data, and the method is applied to an electronic device, and specifically includes the following steps:
In step 410, audio data to be identified is obtained, for example a piece of audio data downloaded at random from the network.
Step 420, extracting voiceprint features of the audio data to be identified based on the unknown speaker voiceprint recognition model. That is, the audio data to be identified is input into a trained unknown speaker voiceprint identification model to obtain voiceprint features of the audio data to be identified. The training method of the voiceprint recognition model of the unknown speaker can refer to the training process shown in fig. 2, and will not be described herein.
Step 430, performing a 1:1 comparison calculation between the voiceprint feature F1 of the audio data to be identified and a standard voiceprint feature F2, and calculating the cosine similarity between voiceprint feature F1 and voiceprint feature F2, for example using the formula from the training process:

$$\cos(\theta) = \frac{F1 \cdot F2}{\|F1\|\,\|F2\|}$$
Step 440, comparing the calculation result of step 430 with set thresholds to determine whether the two voiceprint features belong to the same person, or can at least be excluded as belonging to some other specified person. For example, assume that a cosine similarity of 1 means the voiceprint feature F1 and the voiceprint feature F2 necessarily belong to the same person, and that the first threshold is set to 0.8: the electronic equipment determines that F1 and F2 are the same person whenever the cosine similarity is 0.8 or more. If the second threshold is set to 0.5, the electronic equipment determines that F1 and F2 are not the same person when the cosine similarity is below 0.5. When the cosine similarity lies between the two thresholds, for example at 0.7, the two features may or may not come from the same person; in this uncertain case, it can still first be excluded that the feature to be identified belongs to certain persons other than the one currently being compared. Interfering voiceprint features can thus be eliminated in subsequent comparisons, improving the accuracy of identification.
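For illustration, the decision logic of steps 430-440 can be sketched as follows; the thresholds of 0.8 and 0.5 follow the example above and would in practice be tuned on held-out data.

```python
import torch
import torch.nn.functional as F

def compare_1_to_1(f1: torch.Tensor, f2: torch.Tensor,
                   t_same: float = 0.8, t_diff: float = 0.5) -> str:
    """1:1 voiceprint comparison with the two thresholds described above."""
    cos = F.cosine_similarity(f1, f2, dim=0).item()
    if cos >= t_same:
        return "same person"
    if cos < t_diff:
        return "not the same person"
    return "undetermined"  # may still be excluded against other persons later
```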
According to the voiceprint recognition method of audio data of the embodiments of the application, M pieces of audio data, different from the original N persons' audio data, are added on the basis of that data, and the model is trained with the M pieces as negative example data. This effectively improves the judgment of which voiceprint features do not belong to the same person, greatly reduces the probability of misjudging two persons as the same person, and effectively improves the accuracy of the model's judgments.
Based on the above description, an electronic device of the present application for performing the above-described method embodiments is specifically described below. Fig. 5 shows a schematic structural diagram of the electronic device. As shown in fig. 5, the electronic device includes:
the processing module 510 is used for training based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio, and part or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1;
the acquiring module 520 acquires M pieces of audio data;
The processing module 510 extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, where the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1;
the processing module 510 takes the M voiceprint features as fixed weights and adds the M voiceprint features to a speaker classification algorithm of the first voiceprint recognition model to obtain a second voiceprint recognition model;
The processing module 510 trains the second voiceprint recognition model based on the pieces of audio data of the N persons to obtain an unknown speaker voiceprint recognition model.
According to one embodiment of the present application, the processing module 510 adds the M voiceprint features as fixed weights into the speaker classification algorithm of the first voiceprint recognition model to obtain a new speaker classification algorithm formula L_i as:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \quad \cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}, \quad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person, and ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M].
According to one embodiment of the present application, the processing module 510 performs training based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, including:
The processing module 510 intercepts audio of a set duration from each of the pieces of audio data of the N persons, and converts each intercepted audio clip of the set duration into a multidimensional spectrogram;
the processing module 510 sends the multidimensional spectrogram into a convolutional neural network to obtain voiceprint characteristics of a plurality of pieces of audio data of N persons;
The processing module 510 performs speaker classification calculation based on voiceprint features of pieces of audio data of N persons, and obtains a first voiceprint recognition model.
According to one embodiment of the application, the electronic device further comprises: the processing module extracts voiceprint features from the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
The specific roles of the modules of the electronic device in the embodiment of the present application are described in detail in the above embodiment, and the method shown in fig. 2 and fig. 4 of the above embodiment may be specifically referred to, which is not described herein.
The electronic equipment provided by the embodiments of the application performs the above method: M pieces of audio data, different from the original N persons' audio data, are added on the basis of that data, and the model trained with the M pieces as negative example data can more effectively judge which voiceprint features do not belong to the same person, greatly reducing the probability of misjudging two persons as the same person and effectively improving the accuracy of the model's judgments.
The application also provides an electronic device, comprising:
a memory for storing instructions for execution by one or more processors of the device, and
A processor for performing the methods described in fig. 2 and 4 of the above embodiments.
The present application also provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method described in fig. 2 and 4 of the above embodiments.
Referring now to FIG. 6, shown is a block diagram of an apparatus 1200 in accordance with one embodiment of the present application. The device 1200 may include one or more processors 1201 coupled to a controller hub 1203. For at least one embodiment, the controller hub 1203 communicates with the processor 1201 via a multi-drop Bus, such as a Front Side Bus (FSB), a point-to-point interface, such as a fast channel interconnect (Quick Path Interconnect, QPI), or similar connection 1206. The processor 1201 executes instructions that control general types of data processing operations. In one embodiment, controller Hub 1203 includes, but is not limited to, a graphics memory controller Hub (Graphics Memory Controller Hub, GMCH) (not shown) and an Input Output Hub (IOH) (which may be on separate chips) (not shown), where the GMCH includes memory and graphics controllers and is coupled to the IOH.
The device 1200 may also include a coprocessor 1202 and memory 1204 coupled to the controller hub 1203. Alternatively, one or both of the memory and the GMCH may be integrated within the processor (as described in the present application), with the memory 1204 and the coprocessor 1202 directly coupled to the processor 1201 and the controller hub 1203 in a single chip with the IOH. The memory 1204 may be, for example, a Dynamic Random Access Memory (DRAM), a Phase Change Memory (PCM), or a combination of the two. In one embodiment, the coprocessor 1202 is a special-purpose processor, such as, for example, a high-throughput Many Integrated Core (MIC) processor, a network or communication processor, a compression engine, a graphics processor, a general-purpose graphics processor (GPGPU), an embedded processor, or the like. The optional nature of the coprocessor 1202 is shown in fig. 6 with dashed lines.
Memory 1204, as a computer-readable storage medium, may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. For example, memory 1204 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives.
In one embodiment, the device 1200 may further include a network interface (Network Interface Controller, NIC) 1206. The network interface 1206 may include a transceiver to provide a radio interface for the device 1200 to communicate with any other suitable device (e.g., front end module, antenna, etc.). In various embodiments, the network interface 1206 may be integrated with other components of the device 1200. The network interface 1206 may implement the functions of the communication units in the above-described embodiments.
Device 1200 may further include an Input/Output (I/O) device 1205. The I/O device 1205 may include: a user interface designed to enable a user to interact with the device 1200; a peripheral component interface designed to enable peripheral components to interact with the device 1200; and/or sensors designed to determine environmental conditions and/or location information associated with the device 1200.
It is noted that fig. 6 is merely exemplary. That is, although the device 1200 is shown in fig. 6 as including a plurality of components such as the processor 1201, the controller hub 1203, and the memory 1204, in practical applications a device using the methods of the present application may include only some of the components of the device 1200, for example only the processor 1201 and the NIC 1206. The optional components are shown in dashed lines in fig. 6.
According to some embodiments of the present application, the memory 1204, as a computer-readable storage medium, stores instructions that, when executed on a computer, cause the device 1200 to perform the methods according to the above embodiments; for details, refer to the methods of the above embodiments, which are not repeated here.
Referring now to fig. 7, shown is a block diagram of a SoC (System on Chip) 1300 in accordance with an embodiment of the present application. In fig. 7, similar parts have the same reference numerals, and the dashed boxes are optional features of a more advanced SoC. In fig. 7, the SoC 1300 includes: an interconnect unit 1350 coupled to the application processor 1310; a system agent unit 1380; a bus controller unit 1390; an integrated memory controller unit 1340; a set of one or more coprocessors 1320, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random Access Memory (SRAM) unit 1330; and a Direct Memory Access (DMA) unit 1360. In one embodiment, the coprocessor 1320 includes a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
The Static Random Access Memory (SRAM) unit 1330 may include one or more computer-readable media for storing data and/or instructions. The computer-readable storage medium may store instructions, in particular temporary and permanent copies of the instructions. These instructions, when executed by at least one unit in the processor, cause the SoC 1300 to perform the methods according to the above embodiments; for details, refer to the methods of the above embodiments, which are not repeated here.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of the present application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope by any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, Compact Disc Read-Only Memories (CD-ROMs), magneto-optical disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), Erasable Programmable Read-Only Memories (EPROMs), Electrically Erasable Programmable Read-Only Memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet in an electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the drawings of the specification. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logic unit/module, and in physical terms, one logic unit/module may be one physical unit/module, or may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules, where the physical implementation manner of the logic unit/module itself is not the most important, and the combination of functions implemented by the logic unit/module is only a key for solving the technical problem posed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above-described device embodiments of the present application do not introduce units/modules that are less closely related to solving the technical problems posed by the present application, which does not indicate that the above-described device embodiments do not have other units/modules.
It should be noted that in the examples and descriptions of this patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application.
Claims (10)
1. A voiceprint recognition model training method applied to electronic equipment, the method comprising:
the electronic equipment trains based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio, some or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1; the cross-channel audio data refer to audio data input by the same speaker through different signal transmission media;
the electronic equipment acquires M pieces of audio data, wherein the M pieces of audio data are derived from speakers with undefined identities;
the electronic equipment extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, wherein the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1;
the electronic equipment takes M voiceprint features as fixed weights and adds the M voiceprint features into a speaker classification algorithm of a first voiceprint recognition model so as to obtain a second voiceprint recognition model;
the electronic equipment trains the second voiceprint recognition model based on the plurality of pieces of audio data of the N persons to obtain an unknown speaker voiceprint recognition model;
the electronic equipment takes the M voiceprint features as fixed weights and adds them into the speaker classification algorithm of the first voiceprint recognition model to obtain a new speaker classification algorithm formula L_i as follows:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \quad \cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}, \quad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person; ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M]; x_i denotes the voiceprint feature obtained for the i-th person in a given training sample; and w_i denotes the standard voiceprint feature of the i-th person.
2. The method of claim 1, wherein the electronic device is trained based on the plurality of pieces of audio data of the N persons as a training sample set to obtain a first voiceprint recognition model, comprising:
the electronic equipment acquires a plurality of pieces of audio data of N persons, intercepts audio of a set duration from each piece of audio data, and converts each intercepted audio clip of the set duration into a multidimensional spectrogram;
the electronic equipment sends the multidimensional spectrogram into a neural network to obtain voiceprint characteristics of a plurality of pieces of audio data of N persons;
and the electronic equipment performs speaker classification calculation based on voiceprint characteristics of the plurality of pieces of audio data of the N persons, and obtains a first voiceprint recognition model.
3. The method as recited in claim 2, further comprising:
the electronic equipment extracts voiceprint features from the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
4. A voiceprint recognition method of audio data, applied to an electronic device, the method comprising:
the electronic equipment acquires audio data to be identified;
The electronic equipment extracts voiceprint features of the audio data to be identified based on an unknown speaker voiceprint identification model;
the electronic equipment performs a 1:1 comparison calculation between the voiceprint feature F1 of the audio data to be identified and a standard voiceprint feature F2, and calculates the cosine similarity of voiceprint feature F1 and voiceprint feature F2,
when the cosine similarity is greater than or equal to a set first threshold, the persons corresponding to voiceprint feature F1 and voiceprint feature F2 are the same person,
when the cosine similarity is less than a set second threshold, the persons corresponding to voiceprint feature F1 and voiceprint feature F2 are not the same person, wherein the first threshold is greater than the second threshold;
The training method of the unknown speaker voiceprint recognition model comprises the following steps:
the electronic equipment trains based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio, some or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1; the cross-channel audio data refer to audio data input by the same speaker through different signal transmission media;
the electronic equipment acquires M pieces of audio data, wherein the M pieces of audio data are derived from speakers with undefined identities;
the electronic equipment extracts voiceprint features from the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features, wherein the M pieces of audio data are different from the pieces of audio data of the N persons, and M is a natural number greater than or equal to 1;
the electronic equipment takes M voiceprint features as fixed weights and adds the M voiceprint features into a speaker classification algorithm of a first voiceprint recognition model so as to obtain a second voiceprint recognition model;
the electronic equipment trains the second voiceprint recognition model based on the plurality of pieces of audio data of the N persons to obtain the unknown speaker voiceprint recognition model;
the electronic equipment takes the M voiceprint features as fixed weights and adds them into the speaker classification algorithm of the first voiceprint recognition model to obtain a new speaker classification algorithm formula L_i as follows:

$$L_i = \frac{e^{\cos(\theta_i)}}{e^{\cos(\theta_i)} + \sum_{j \neq i} e^{\cos(\theta_j)} + \sum_{k=1}^{M} e^{\cos(\varphi_k)}}, \quad \cos(\theta_j) = \frac{x_i \cdot w_j}{\|x_i\|\,\|w_j\|}, \quad \cos(\varphi_k) = \frac{x_i \cdot \hat{w}_k}{\|x_i\|\,\|\hat{w}_k\|}$$

wherein w_j denotes the learned weights of the pieces of audio data of the N persons, with j belonging to [1, i-1] or [i+1, N] and i denoting the i-th person; ŵ_k denotes the fixed weights of the M pieces of audio data, with k belonging to [1, M]; x_i denotes the voiceprint feature obtained for the i-th person in a given training sample; and w_i denotes the standard voiceprint feature of the i-th person.
5. The method of claim 4, wherein the electronic equipment performing training based on the plurality of pieces of audio data of the N persons as a training sample set to obtain the first voiceprint recognition model comprises:
the electronic equipment acquires the plurality of pieces of audio data of the N persons, intercepts audio of a set duration from each piece of audio data, and converts the audio of each set duration into a multidimensional spectrogram;
the electronic equipment feeds the multidimensional spectrograms into a convolutional neural network to obtain the voiceprint features of the plurality of pieces of audio data of the N persons;
and the electronic equipment performs speaker classification calculation based on the voiceprint features of the plurality of pieces of audio data of the N persons to obtain the first voiceprint recognition model.
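A runnable sketch of the pipeline in claim 5 (fixed-duration clip → multidimensional spectrogram → convolutional network → voiceprint feature). The clip length, FFT parameters, and the tiny placeholder network are assumptions; the patent does not specify them here.

```python
import numpy as np
import torch
import torch.nn as nn

def to_spectrogram(wave: np.ndarray, n_fft: int = 512, hop: int = 160) -> torch.Tensor:
    """Log-magnitude short-time spectrogram of a fixed-duration clip."""
    frames = [wave[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(wave) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return torch.log1p(torch.from_numpy(mag).float().T)   # (freq, time)

# Placeholder CNN standing in for the patent's unspecified network.
embedder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))

clip = np.random.randn(3 * 16000)          # assumed 3 s clip at 16 kHz
spec = to_spectrogram(clip)[None, None]    # (batch, channel, freq, time)
voiceprint = embedder(spec)                # (1, 128) voiceprint feature
```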
6. The method as recited in claim 5, further comprising:
The electronic equipment extracts the voiceprint features in the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features of the M pieces of audio data, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
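Schematically, the refresh round in claim 6 might look like the loop below; `extract_feature`, `set_fixed_weights`, and `train_on_known` are hypothetical helpers, since the patent does not name these operations.

```python
from typing import Callable, Iterable, List

def refresh_round(model, unknown_audio: Iterable, known_dataset,
                  extract_feature: Callable, train_on_known: Callable):
    """One refresh round per claim 6: re-extract the M features with
    the trained unknown-speaker model, install them as the new fixed
    weights, then retrain on the N persons' audio data."""
    new_fixed: List = [extract_feature(model, a) for a in unknown_audio]
    model.set_fixed_weights(new_fixed)        # assumed setter on the model
    return train_on_known(model, known_dataset)
```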
7. An electronic device, comprising:
the processing module is used for performing training based on a plurality of pieces of audio data of N persons as a training sample set to obtain a first voiceprint recognition model, wherein each of the N persons has a plurality of pieces of audio data, part or all of the N persons have cross-time and cross-channel audio data, and N is a natural number greater than or equal to 1; the cross-channel audio data refer to audio data input by the same speaker through different signal transmission media;
the acquisition module is used for acquiring M pieces of audio data;
the processing module extracts the voiceprint features in the M pieces of audio data based on the first voiceprint recognition model to obtain M voiceprint features of the M pieces of audio data, wherein the M pieces of audio data are different from the plurality of pieces of audio data of the N persons, and M is a natural number greater than or equal to 1;
the processing module takes the M voiceprint features as fixed weights and adds them into a speaker classification algorithm of the first voiceprint recognition model to obtain a second voiceprint recognition model;
the processing module trains the second voiceprint recognition model based on the plurality of pieces of audio data of the N persons to obtain an unknown speaker voiceprint recognition model;
the processing module takes the M voiceprint features as fixed weights and adds them into the speaker classification algorithm of the first voiceprint recognition model, so that the new speaker classification algorithm formula $L_i$ is:

$$L_i = -\log \frac{e^{w_i^{\top} x_i}}{e^{w_i^{\top} x_i} + \sum_{j \neq i} e^{w_j^{\top} x_i} + \sum_{k=1}^{M} e^{\tilde{w}_k^{\top} x_i}}$$

wherein $w_j$ represents the learnable weight of the plurality of pieces of audio data of the N persons, $j$ belongs to $[1, i-1] \cup [i+1, N]$, and $i$ represents the $i$-th person; $\tilde{w}_k$ represents the fixed weight of the M pieces of audio data, $k$ belongs to $[1, M]$; $x_i$ represents the voiceprint feature obtained by the $i$-th person in a certain training sample; and $w_i$ represents the standard voiceprint feature of the $i$-th person.
8. The electronic device of claim 7, wherein the processing module performing training based on the plurality of pieces of audio data of the N persons as a training sample set to obtain the first voiceprint recognition model comprises:
the processing module intercepts audio of a set duration from each piece of audio data in the plurality of pieces of audio data of the N persons, and converts the audio of each set duration into a multidimensional spectrogram;
the processing module feeds the multidimensional spectrograms into a convolutional neural network to obtain the voiceprint features of the plurality of pieces of audio data of the N persons;
and the processing module performs speaker classification calculation based on the voiceprint features of the plurality of pieces of audio data of the N persons to obtain the first voiceprint recognition model.
9. The electronic device of claim 8, further comprising:
The processing module extracts the voiceprint features in the M pieces of audio data again based on the unknown speaker voiceprint recognition model to obtain M new voiceprint features of the M pieces of audio data, takes the M new voiceprint features as fixed weights, and trains the unknown speaker voiceprint recognition model again based on the plurality of pieces of audio data of the N persons.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010869727.9A CN111951791B (en) | 2020-08-26 | 2020-08-26 | Voiceprint recognition model training method, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111951791A CN111951791A (en) | 2020-11-17 |
| CN111951791B true CN111951791B (en) | 2024-05-17 |
Family
ID=73366732
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010869727.9A Active CN111951791B (en) | 2020-08-26 | 2020-08-26 | Voiceprint recognition model training method, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111951791B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113035203A (en) * | 2021-03-26 | 2021-06-25 | 合肥美菱物联科技有限公司 | Control method for dynamically changing voice response style |
| CN114299920B (en) * | 2021-09-01 | 2025-06-10 | 腾讯科技(深圳)有限公司 | Training of language model for speech recognition, speech recognition method and device |
| CN114333849B (en) * | 2022-02-21 | 2025-09-26 | 百果园技术(新加坡)有限公司 | Voiceprint model training, voiceprint extraction method, device, equipment and storage medium |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107610707A (en) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | A kind of method for recognizing sound-groove and device |
| CN107731233A (en) * | 2017-11-03 | 2018-02-23 | 王华锋 | A kind of method for recognizing sound-groove based on RNN |
| CN108170650A (en) * | 2016-12-07 | 2018-06-15 | 北京京东尚科信息技术有限公司 | Text comparative approach and text comparison means |
| CN109243466A (en) * | 2018-11-12 | 2019-01-18 | 成都傅立叶电子科技有限公司 | A kind of vocal print authentication training method and system |
| CN109903774A (en) * | 2019-04-12 | 2019-06-18 | 南京大学 | A Voiceprint Recognition Method Based on Angular Separation Loss Function |
| CN110289003A (en) * | 2018-10-10 | 2019-09-27 | 腾讯科技(深圳)有限公司 | A voiceprint recognition method, model training method and server |
| CN110648669A (en) * | 2019-09-30 | 2020-01-03 | 上海依图信息技术有限公司 | Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium |
| CN111243603A (en) * | 2020-01-09 | 2020-06-05 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
| CN111462760A (en) * | 2019-01-21 | 2020-07-28 | 阿里巴巴集团控股有限公司 | Voiceprint recognition system, method and device and electronic equipment |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102760434A (en) * | 2012-07-09 | 2012-10-31 | 华为终端有限公司 | Method for updating voiceprint feature model and terminal |
| CN106847292B (en) * | 2017-02-16 | 2018-06-19 | 平安科技(深圳)有限公司 | Method for recognizing sound-groove and device |
Non-Patent Citations (1)
| Title |
|---|
| Research on Voiceprint Recognition Based on Weighted Clustering Recognition SVM Algorithm; Yang Wu et al.; 2017 Chinese Automation Congress; pp. 1144-1148 * |
Similar Documents
| Publication | Title |
|---|---|
| US20200372905A1 (en) | Mixed speech recognition method and apparatus, and computer-readable storage medium |
| US10540988B2 (en) | Method and apparatus for sound event detection robust to frequency change |
| US9589560B1 (en) | Estimating false rejection rate in a detection system |
| CN111951791B (en) | Voiceprint recognition model training method, electronic equipment and storage medium |
| Kobayashi et al. | Acoustic feature extraction by statistics based local binary pattern for environmental sound classification |
| US20200243067A1 (en) | Environment classifier for detection of laser-based audio injection attacks |
| CN110111812B (en) | Adaptive recognition method and system for keyboard keystroke content |
| CN111968625A (en) | Sensitive audio recognition model training method and recognition method fusing text information |
| CN111667843B (en) | Voice wake-up method and system for terminal equipment, electronic equipment and storage medium |
| CN115497481B (en) | False voice recognition method and device, electronic equipment and storage medium |
| TWI659410B (en) | Audio recognition method and device |
| CN111860130A (en) | Audio-based gesture recognition method, device, terminal device and storage medium |
| US20170294185A1 (en) | Segmentation using prior distributions |
| CN112037772A (en) | Multi-mode-based response obligation detection method, system and device |
| CN110808030A (en) | Voice awakening method, system, storage medium and electronic equipment |
| CN109545226B (en) | Voice recognition method, device and computer readable storage medium |
| CN118212927B (en) | Identity recognition method and system based on sound characteristics, storage medium and electronic equipment |
| CN111554288A (en) | Awakening method and device of intelligent device, electronic device and medium |
| CN110544468B (en) | Application awakening method and device, storage medium and electronic equipment |
| CN117037772A (en) | Voice audio segmentation method, device, computer equipment and storage medium |
| CN113436633B (en) | Speaker recognition method, speaker recognition device, computer equipment and storage medium |
| WO2025108022A1 (en) | Anti-theft detection for model |
| WO2025031170A1 (en) | Voiceprint recognition system evaluation method and apparatus, storage medium, and electronic device |
| CN113407425B (en) | Internal user behavior detection method based on BiGAN and OTSU |
| CN110634492A (en) | Login verification method and device, electronic equipment and computer readable storage medium |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |