
WO2019156427A1 - Method for identifying a speaker based on a spoken word and associated apparatus, and context-based voice model management apparatus and associated method - Google Patents


Info

Publication number
WO2019156427A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
model
speaker
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2019/001355
Other languages
English (en)
Korean (ko)
Inventor
이태훈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gonghoon Co Ltd
Original Assignee
Gonghoon Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020180016444A external-priority patent/KR101888058B1/ko
Priority claimed from KR1020180016663A external-priority patent/KR101888059B1/ko
Application filed by Gonghoon Co Ltd filed Critical Gonghoon Co Ltd
Publication of WO2019156427A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; Pattern matching strategies
    • G10L 17/12: Score normalisation

Definitions

  • The present invention relates to a method and apparatus for identifying a speaker based on a spoken word, and more particularly, to grasping the voice characteristics of a speaker (for example, a user of a device) from the spoken word and determining that the utterance pattern of a word whose voice characteristics show high similarity to the voice characteristics stored in a database (DB) built from such characteristics is the speaker's updated utterance pattern.
  • The present invention also relates to a context-based voice model management apparatus and a method of operating the apparatus, and more particularly, to an apparatus that manages a voice model usable in a voice authentication system by updating it based on the speaker's context-specific voice characteristics and at predetermined intervals, and to a method of operating that apparatus.
  • Voice is vulnerable to imitation and to recording/playback by others, and may change from time to time depending on the user's pronunciation state and the passage of time, so its use as a means of recognition and authentication can be restricted.
  • Nevertheless, because voice offers near-optimal conditions for the interface between machines and human beings, its range of use is gradually increasing.
  • For this reason, the speaker's voice is used in parallel with other authentication means such as iris, fingerprint, and password, which hampers the effectiveness of authentication by voice.
  • Existing speaker identification recognizes the user by forming data from feature elements common to all the voices the user has spoken, and is therefore limited in how far it can raise the recognition rate for the speaker.
  • This conventional speaker identification method also causes considerable inconvenience for users who need identification (authentication) results instantly, in that it takes quite a long time to identify the speaker accurately.
  • Moreover, the speaker's voice is not permanent: depending on various factors such as aging of the vocal muscles over time, changes in living environment (e.g., area, workplace), and changes in health (e.g., catching a cold), it may change temporarily, continuously, or over time.
  • The present invention has been made as a countermeasure to the above problems, and is intended to enhance the effectiveness of voice recognition and authentication by increasing the accuracy of speech recognition and speaker identification (e.g., authentication).
  • The speaker's voice tone may change temporarily or for a period of time depending on the speaker's emotion, the surrounding environment (e.g., noise), or the speaker's state of health (e.g., a sore throat). The present invention provides a method and apparatus for improving identification accuracy by reflecting the possibility of such voice changes in the speaker identification process.
  • The present invention also provides a method and apparatus for updating a user's context (word) voice model in the matrix DB containing the user's context (word) voice models usable in a context (word) presentation system, which is an implementation of a voice authentication system, in consideration of whether and to what degree the voice input from the speaker has changed.
  • a method and apparatus for identifying a speaker based on a spoken word can be provided.
  • A method for identifying a speaker based on a spoken word may include: receiving a spoken voice from a speaker; extracting a word included in the received voice and voice information of the word; searching for the word in a pre-built database (DB); if the word does not exist in the DB, adding the word and its voice information to the DB, and if the word exists in the DB, comparing the voice information of the spoken word with each piece of reference voice information stored in the DB; estimating the similarity according to each comparison; determining an utterance pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity has been received; and identifying the speaker based on the determined utterance pattern.
  • the voice information of the word may include at least one of a frequency, pitch, formant, speech time, and speech speed of the speech.
  • In the comparing step, it may be determined whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and the similarity is estimated according to the determination result. If the estimated similarity is less than a first reference value, new reference voice information is generated and stored in the DB; if the estimated similarity is greater than or equal to the first reference value, the matching count of the reference voice information having that similarity may be incremented.
  • A new voice spoken by the speaker is then received and the similarity estimation is repeated; when the counted matching number of certain reference voice information becomes greater than or equal to a second reference value, that reference voice information may be determined to be the utterance pattern for the speaker's word. That is, the utterance pattern is determined by establishing a voice model of the speaker based on the voice information corresponding to the similarity whose matching count is greater than or equal to the second reference value.
  • In the identifying step, who the speaker of the spoken voice is may be identified based on the utterance pattern determined through the above-described steps for the spoken voice.
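  • As a concrete illustration of the first/second reference value logic described above, the following is a minimal Python sketch, not the patent's prescribed implementation: the names (VOICE_DB, FIRST_REF, SECOND_REF) and the cosine-similarity measure are assumptions for illustration only.

    # Illustrative sketch of the word-based matching flow described above.
    # FIRST_REF / SECOND_REF stand in for the first and second reference values;
    # cosine similarity is an assumed measure, not mandated by the source text.
    import math

    FIRST_REF = 0.7   # first reference value (similarity threshold)
    SECOND_REF = 5    # second reference value (matching-count threshold)

    VOICE_DB = {}     # word -> list of {"features": [...], "count": int}

    def similarity(a, b):
        """Cosine similarity between two feature vectors (assumed measure)."""
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def process_utterance(word, features):
        """Add or match reference voice information; return matured patterns."""
        entries = VOICE_DB.setdefault(word, [])
        best = max(entries, key=lambda e: similarity(features, e["features"]), default=None)
        if best is None or similarity(features, best["features"]) < FIRST_REF:
            # below the first reference value: store new reference voice information
            entries.append({"features": features, "count": 1})
        else:
            # at or above the first reference value: increment the matching count
            best["count"] += 1
        # entries whose count reaches the second reference value become the
        # speaker's utterance pattern (basis for the speaker's voice model)
        return [e for e in entries if e["count"] >= SECOND_REF]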
  • An apparatus for identifying a speaker based on a spoken word may include: a voice receiver for receiving a spoken voice from a speaker; an information extraction unit for extracting a word included in the received voice and voice information of the word; an information search unit that searches for the word in a pre-built database (DB) and, if the word does not exist in the DB, adds the word and its voice information to the DB; a comparison unit for comparing the voice information of the word with each piece of reference voice information stored in the DB; a similarity estimation unit for estimating the similarity according to each comparison; an utterance pattern determination unit that determines the utterance pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity has been received; and a speaker identification unit that identifies the speaker based on the determined utterance pattern.
  • the voice information about the word may include at least one of the frequency, pitch, formant, speech time, and speech speed of the speech.
  • The comparison unit determines whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and the similarity estimation unit estimates the similarity according to the result of the determination. If the estimated similarity is less than the first reference value, new reference voice information is generated and stored in the DB; if it is greater than or equal to the first reference value, the matching count of the reference voice information having that similarity may be incremented.
  • The utterance pattern determination unit receives new voices spoken by the speaker and repeats the similarity estimation process; when the counted matching number reaches the second reference value, it may determine the corresponding reference voice information to be the utterance pattern for the speaker's word.
  • The utterance pattern is determined by the utterance pattern determination unit by establishing a voice model of the speaker based on the voice information corresponding to a similarity whose counted matching number is greater than or equal to the second reference value, and the speaker identification unit may identify who the speaker is based on the utterance pattern determined for the spoken voice.
  • a computer-readable recording medium having recorded thereon a program for executing the above method on a computer may be provided.
  • a context-based speech model management apparatus and a method of operating the apparatus may be provided.
  • An apparatus for managing a context-based voice model may be linked to a context-presenting speaker identification system, and may include: a storage unit for storing individual voice data generated each time a voice is received from the speaker; a similarity estimation unit that extracts each piece of individual voice data from the storage unit and estimates the similarity between the pieces of individual voice data; a voice model generation unit for generating a first voice model of the speaker from at least one piece of individual voice data selected based on the similarity estimated by the similarity estimation unit; a determination unit that determines whether a comparison voice model corresponding to the first voice model exists in the storage unit of the context-presenting speaker identification system, provides the first voice model to that storage unit for storage if it does not exist, and has the comparison similarity between the first voice model and the comparison voice model estimated through the similarity estimation unit if it does; and a voice model editing unit that replaces the comparison voice model with the first voice model when the comparison similarity is greater than or equal to a predetermined reference value and generates a second voice model by combining the first voice model and the comparison voice model when it is less than the reference value. The second voice model may be provided again to the determination unit and the voice model editing unit.
  • The context-presenting speaker identification system may include: a voice receiver for receiving a voice from the speaker; a voice feature extraction unit for extracting voice characteristics from the received voice; a context voice model generation unit for generating a voice model based on the extracted voice characteristics; a storage unit in which the generated voice models are stored in matrix form; a random number generation unit for generating a random number to be used for speaker identification; a voice model extraction unit for extracting the voice model at the position on the matrix-form DB of the storage unit corresponding to the generated random number; a voice utterance request unit for requesting a predetermined utterance from the speaker based on the extracted voice model; and a speaker identification unit for identifying the speaker by comparing the voice uttered by the speaker with the extracted voice model. The predetermined utterance may be the sound of a word or sentence preset at the position on the matrix-form DB of the storage unit corresponding to the generated random number.
  • The individual voice data includes at least one of the frequency, pitch, formant, speech time, and speech speed of each of the speaker's utterances, and the similarity estimation unit of the context-based voice model management apparatus may evaluate the similarity between the pieces of individual voice data for each of the speaker's utterances.
  • The apparatus further includes a period setting unit for setting a management period for the voice models. When all voice models are updated within the set management period, the voice model editing unit maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system; when at least one voice model is not updated within the set management period, the voice model editing unit may delete or maintain part of the existing matrix-form voice model DB based on whether a new first voice model associated with the speaker exists.
  • Specifically, if no new first voice model associated with the speaker exists, the voice model editing unit deletes the at least one un-updated voice model from the matrix-form voice model DB; if a new first voice model exists, the at least one un-updated voice model is compared with the new first voice model, and if the difference is within a predetermined range the voice model editing unit maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system, while if it is outside the range the at least one un-updated voice model may be deleted from the matrix-form voice model DB.
  • A method of managing a voice model using a context-based voice model management apparatus may include: (a) generating and storing individual voice data each time a voice is received from a speaker; (b) when a plurality of pieces of individual voice data are stored, extracting each piece and estimating the similarity between them; (c) generating the speaker's first voice model from at least one piece of individual voice data selected based on the estimated similarity; (d) determining whether a comparison voice model corresponding to the first voice model exists in the storage unit of the context-presenting speaker identification system, providing the first voice model to that storage unit for storage if it does not exist, and estimating the comparison similarity between the first voice model and the comparison voice model through the similarity estimation unit if it does; and (e) replacing the comparison voice model with the first voice model when the comparison similarity is greater than or equal to a predetermined reference value, and generating a second voice model by combining the first voice model and the comparison voice model when it is less than the reference value. Steps (d) and (e) may be performed repeatedly for the second voice model, as shown in the sketch below.
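  • The following compact sketch walks through steps (d) and (e) under stated assumptions: REF stands in for the predetermined reference value, a plain element-wise average is assumed for combining models (the text leaves the combination method open), and similarity() is reused from the earlier sketch.

    # Hypothetical sketch of steps (d)-(e); reuses similarity() from the sketch above.
    REF = 0.75   # assumed predetermined reference value

    def combine(model_a, model_b):
        """Combine two voice models; an element-wise average is assumed."""
        return [(x + y) / 2 for x, y in zip(model_a, model_b)]

    def manage(first_model, storage, key):
        model = first_model
        while True:
            comparison = storage.get(key)              # step (d)
            if comparison is None:
                storage[key] = model                   # no comparison model: store
                return
            if similarity(model, comparison) >= REF:   # step (e): replace
                storage[key] = model
                return
            model = combine(model, comparison)         # combine into a 2nd model, repeat (d)-(e)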
  • The method according to an embodiment of the present invention may further include setting the management period of the voice models through the period setting unit of the above-described apparatus. If all voice models are updated within the set management period, the voice model editing unit of the apparatus maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system; if at least one voice model is not updated within the set management period, the voice model editing unit may delete or maintain part of the existing matrix-form voice model DB based on whether a new first voice model associated with the speaker exists.
  • Likewise, if no new first voice model associated with the speaker exists, the voice model editing unit deletes the at least one un-updated voice model from the matrix-form voice model DB; if a new first voice model exists, the at least one un-updated voice model is compared with the new first voice model, and if the difference is within a predetermined range the voice model editing unit maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system, while if it is outside the range the at least one un-updated voice model may be deleted from the matrix-form voice model DB.
  • a computer-readable recording medium having recorded thereon a program for executing the above method on a computer may be provided.
  • According to an embodiment of the present invention, the accuracy and reliability of speaker recognition and authentication can be improved by extracting and matching the user's utterance pattern (e.g., the voice characteristics of each utterance).
  • Although the speaker's voice may change continuously or for a period of time due to temporal factors (e.g., aging) or environmental factors (e.g., concert halls), according to an embodiment of the present invention the voice model usable in the speaker identification (or voice authentication) system can be kept up to date by updating it based on the speaker's voice characteristics and a predetermined period.
  • FIG. 1 is a view showing a conventional speaker identification system.
  • FIG. 2 is a diagram illustrating a conventional context (word) presentation speaker identification system.
  • FIG. 3 shows a conventional leveling system for speech.
  • FIG. 4 is a flowchart illustrating a method for identifying a speaker based on a spoken word according to an embodiment of the present invention.
  • FIG. 5 is a flowchart illustrating a specific speaker identification method according to an embodiment of the present invention.
  • FIG. 6 is a block diagram illustrating an apparatus for identifying a speaker based on a spoken word according to an embodiment of the present invention.
  • FIG. 7 is a diagram illustrating a leveling system for speech according to an embodiment of the present invention.
  • FIG. 8 is a view showing a leveling process based on the speaker's utterance similarity according to an embodiment of the present invention.
  • FIG. 9 is a block diagram of an apparatus for context-based speech model management according to an embodiment of the present invention.
  • FIG. 10 is a block diagram of a context-based speech model management apparatus and a context-presenting speaker identification system interoperable with the context-based speech model management apparatus according to an embodiment of the present invention.
  • FIG. 11 shows an example of the operation of the context-presenting speaker identification system.
  • FIG. 12 is a flowchart illustrating an operation example of a context-based speech model management apparatus according to an embodiment of the present invention.
  • FIG. 13 illustrates an operation example of a context-based speech model management apparatus according to another embodiment of the present invention.
  • FIG. 14 is a flowchart illustrating a voice model management method using a context-based voice model management apparatus according to an embodiment of the present invention.
  • When any part of the specification is said to "include" a component, this means that it may further include other components rather than excluding them, unless otherwise stated.
  • The terms "...unit", "module", etc. described in the specification mean a unit for processing at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.
  • When a part of the specification is "connected" to another part, this includes not only being "directly connected" but also being "connected with other elements in between".
  • FIG. 1 is a view showing a conventional speaker identification system.
  • A conventional speaker identification system first obtains a plurality of voice samples from the speaker to be identified (e.g., A of FIG. 1), extracts characteristic values such as frequency and pitch from each voice, and overlaps them; the speech is then leveled based on the overlapping portion. After leveling, a voice model is established for the speaker. After collecting an acoustic signal such as a human voice, noise can be removed from the collected signal, and the characteristics of the voice signal can be extracted and stored in a database. In other words, through this voice model establishment process for the specific speaker (A of FIG. 1), information about that speaker's voice may be collected in advance and a DB constructed (e.g., the blue dashed box of FIG. 1).
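  • As a rough illustration of the leveling described above, characteristic values from several samples can be averaged into a single reference; the element-wise mean below is an assumption for illustration, not the conventional system's exact procedure.

    # Assumed illustration of "leveling": average the characteristic values
    # (e.g., frequency, pitch) extracted from several voice samples of speaker A.
    def level(samples):
        """Element-wise mean of per-sample feature vectors (assumed method)."""
        n = len(samples)
        return [sum(col) / n for col in zip(*samples)]

    # e.g., three samples of the same utterance -> one leveled reference vector
    reference = level([[120.0, 210.0], [118.0, 205.0], [122.0, 215.0]])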
  • For speaker verification, voice characteristic parameters and the like are extracted and formed in the same manner for a newly input voice of an unspecified speaker (e.g., B of FIG. 1) as for the verification target speaker (A of FIG. 1). If the comparison of this data with the voice model of the speaker to be verified exceeds a predetermined threshold value, the input voice of the unspecified speaker is determined to belong to the same person as the speaker to be verified.
  • However, this conventional voice comparison method takes a long time, and does not reflect cases in which the voice of the speaker to be verified has changed due to temporal or environmental factors.
  • FIG. 2 is a diagram illustrating a conventional context (word) presentation speaker identification system.
  • Conventional speaker identification systems may be classified into context (word) fixed systems, which use a sentence or word designated by the user, and context-free systems, which place no limitation on what the user pronounces. In a context (word) fixed system, efficiency is good, but security is weak due to the risk of exposure of the given context (word) and of illegal methods such as recordings impersonating the user. In a context-free system, a large amount of training data is required to identify the user, making the system less efficient in terms of time and resource utilization. To complement both, the context (word) presentation system shown in FIG. 2 has emerged.
  • This system asks the user to pronounce a different word or sentence each time, performs speech recognition on the requested word or sentence, and checks whether the text matches; the speaker's unique feature values are then extracted from the pronunciation of the requested word or sentence and compared with the speaker's predefined voice feature values. This reduces the risk that a user-designated sentence or word is memorized, or that a recording impersonating the user is used, while in terms of performance it can achieve the same efficiency as the context-fixed form.
  • FIG. 3 shows a conventional leveling system for speech.
  • The user's voice, a continuous waveform, can be digitized through a sampling process.
  • The system samples a plurality of voice data rather than a single user voice to generate reference data for speaker identification (or authentication), and then extracts common data (e.g., normalized data) from the digitized voice data (the red region in FIG. 3), typically using techniques such as linear predictive coding (LPC) or Mel-frequency cepstral coefficients (MFCC). However, the user's everyday voice tone, that is, frequency and pitch, can vary, and a voice model built on simply leveled data can distort the common characteristic values according to the user's living environment, which can rather act as a barrier to accurate speaker identification.
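  • For reference, MFCC features of the kind mentioned above can be extracted with an off-the-shelf library; the sketch below assumes the librosa package, a local file voice.wav, and a 16 kHz sample rate, none of which the source text specifies.

    # Hedged example: extracting MFCC features from a voice sample with librosa.
    import librosa

    y, sr = librosa.load("voice.wav", sr=16000)         # digitize/resample the waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 Mel-frequency cepstral coefficients
    print(mfcc.shape)                                   # (13, number_of_frames)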
  • FIG. 4 is a flowchart illustrating a method for identifying a speaker based on a spoken word according to an embodiment of the present invention, and FIG. 5 is a flowchart illustrating a specific speaker identification method according to an embodiment of the present invention.
  • Referring to FIG. 4, a method for identifying a speaker based on a spoken word may include: receiving a spoken voice from a speaker (S110); extracting a word included in the received voice and voice information of the word (S120); searching for the word in a pre-built database (DB) (S130); if the word does not exist in the DB, adding the word and its voice information to the DB, and if the word exists in the DB, comparing the voice information of the spoken word with each piece of reference voice information stored in the DB (S140); estimating the similarity according to each comparison; determining an utterance pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity has been received (S160); and identifying the speaker based on the determined utterance pattern (S170).
  • the voice information of the word according to an embodiment of the present invention may include at least one of the frequency, pitch, formant, speech time, and speech speed of the speech.
  • Pitch refers to the height of the note, i.e., the fundamental frequency of the voiced sound. All oscillation sources have unique vibration characteristics (e.g., resonance characteristics). Human articulation organs (e.g., the vocal cords) also have a resonance characteristic at each moment that changes with articulation, and the sound from the vocal cords is filtered and expressed according to these resonance characteristics. Looking at the frequency spectrum of a particular sound (e.g., a vowel), it can be seen that a plurality of resonance bands exists when the resonance characteristic is expressed. Such a plurality of resonant frequency bands is referred to as formants.
  • If the word does not exist in the DB, the word and its voice information may be added to the DB. The added voice information may then serve as reference voice information, i.e., reference data against which voice information is compared when a voice from a speaker is received later.
  • If the word exists in the DB, the voice information of the spoken word may be compared with each piece of reference voice information stored in the DB; in this comparison step (S140), it may be determined whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB.
  • The similarity is estimated according to the result of the above determination, and when the estimated similarity is less than the first reference value, new reference voice information may be generated and stored in the DB. In this case, the estimated similarity information may be included in the voice information and stored together in the DB.
  • For example, the first reference value may be 70% (or 0.7), and it may be set variably according to the user's setting. Even if the same word is spoken by the same speaker, the voice information may change according to the speaker's state and environmental conditions, so the speaker's utterance patterns need to be tracked and managed.
  • When the estimated similarity is greater than or equal to the first reference value, the matching count of the reference voice information having that similarity may be incremented.
  • This is because the speaker is highly likely to speak again in his or her current utterance pattern. That is, as in an embodiment of the present invention, by grasping (collecting) the frequency of the speaker's utterance patterns and using it for speaker recognition (identification), not only can a high level of accuracy and reliability be obtained, but the speaker's voice information can also be kept up to date.
  • When the counted matching number becomes greater than or equal to a second reference value, the reference voice information may be determined to be the utterance pattern for the speaker's word. The second reference value may, for example, lie in the range of 5 to 10.
  • In the determining step (S160), an utterance pattern may be determined by establishing the speaker's voice model based on the voice information corresponding to a similarity whose counted matching number is greater than or equal to the second reference value. In other words, reference voice information whose counted matching number is greater than or equal to the second reference value may be established as the speaker's voice model, and the utterance pattern thereby determined.
  • In the identifying step (S170), the speaker may be identified based on the utterance pattern determined through the above-described steps for the spoken voice. That is, reference voice information exceeding the first and second reference values determines the utterance pattern of the speaker to be verified, and when a voice is input (received), whether the speaker who uttered it is the same person as the target speaker or someone else can be identified quickly and accurately according to the determined utterance pattern.
  • FIG. 6 is a block diagram illustrating an apparatus for identifying a speaker based on a spoken word according to an embodiment of the present invention.
  • Referring to FIG. 6, the apparatus 1000 for identifying a speaker based on a spoken word may include: a voice receiver 1100 for receiving a spoken voice from a speaker; an information extraction unit 1200 for extracting a word included in the received voice and voice information of the word; an information search unit 1300 that searches for the word in a pre-built database (DB) and, if the word does not exist in the DB, adds the word and its voice information to the DB; a comparison unit 1400 that, if the word exists in the DB, compares the voice information of the spoken word with each piece of reference voice information stored in the DB; a similarity estimation unit 1500 that estimates the similarity according to each comparison; an utterance pattern determination unit 1600 that determines the utterance pattern for the speaker's word based on the number of times voice information corresponding to the estimated similarity has been received; and a speaker identification unit 1700 for identifying the speaker based on the determined utterance pattern.
  • the voice information about the word may include at least one of the frequency, pitch, formant, speech time, and speech speed of the speech.
  • For example, tag information (e.g., U000) may be assigned as an identifier for the first user, and voice information (e.g., vector property information such as V_Inof000) for that user's data may be stored and managed in the DB in association with the tag information U000. The utterance matching count information described above may also be stored and managed together with the tag information U000 and the voice information V_Inof000 (e.g., "2" in FIG. 6). Likewise, the tag information (e.g., U000) and the voice information V_Inof003 for the spoken word "bank" may be stored and managed with the matching count information (e.g., "7" in FIG. 6). Tag information for a second user (second speaker) may be assigned as, for example, U011.
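  • The tag / voice information / matching count layout of FIG. 6 might be represented as plain records such as the following; the field names and the V_Inof011 row are hypothetical, inferred from the labels in the figure.

    # Assumed record layout mirroring FIG. 6: user tag, per-word voice
    # information identifier, and the utterance matching count.
    speaker_db = [
        {"tag": "U000", "word": "...",  "voice_info": "V_Inof000", "match_count": 2},
        {"tag": "U000", "word": "bank", "voice_info": "V_Inof003", "match_count": 7},
        {"tag": "U011", "word": "...",  "voice_info": "V_Inof011", "match_count": 1},  # hypothetical 2nd-user row
    ]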
  • In the apparatus according to an embodiment of the present invention, the comparison unit 1400 determines whether the voice information of the word spoken by the speaker is similar to at least one piece of reference voice information stored in the DB, and the similarity estimation unit 1500 estimates the similarity according to the result of the determination. If the estimated similarity is less than the first reference value, new reference voice information is generated and stored in the DB; if it is greater than or equal to the first reference value, the matching count of the reference voice information having that similarity may be incremented.
  • The utterance pattern determination unit 1600 receives new voices spoken by the speaker and repeats the similarity estimation process; when the counted matching number reaches the second reference value, the corresponding reference voice information can be determined to be the utterance pattern for the speaker's word.
  • An utterance pattern is determined by the utterance pattern determination unit 1600 by establishing the speaker's voice model based on the voice information corresponding to a similarity whose counted matching number is greater than or equal to the second reference value, and the speaker identification unit 1700 may identify who the speaker is based on the utterance pattern determined for the spoken voice.
  • FIG. 7 is a diagram illustrating a leveling system for speech according to an embodiment of the present invention.
  • Initially, the system knows nothing about the user's everyday utterance patterns or speech states. Accordingly, for each voice the user speaks, a separate reference voice information DB is built for each voice property. Thereafter, a newly input voice is classified by characteristic and compared against the constructed reference voice information DBs to determine characteristic similarity. If the similarity to a reference voice information DB is greater than or equal to a predetermined reference value (e.g., a third reference value), the matching count of that reference voice information DB is incremented by 1, forming a DB of reference voice information similar to the user's voice and enabling analysis of the user's voice similarity pattern. If the characteristic similarity is less than the third reference value, a new DB may be created with the voice as a new reference voice information value.
  • Among the reference voice information DBs, when a DB attains high similarity above a predetermined reference value (e.g., a fourth reference value), the corresponding reference voice information is recognized as the utterance pattern for the specific context (word), and that DB of reference voice information is used as the basic voice data for establishing the speaker's voice model. This effectively eliminates distortion errors caused by the speaker's various voice state transitions and can normalize the voice pattern for the context (word) of a particular speaker.
  • FIG. 8 is a view showing a leveling process based on the speaker's utterance similarity according to an embodiment of the present invention.
  • The voice graphs of FIG. 8 are similar to one another, so it can be seen that there is little difference among the individual voice data.
  • In this case, a voice model may be established based on the common content (e.g., the hatched region of FIG. 8), and speaker identification may be performed by comparing and matching a newly input voice of an unspecified speaker against it. For voice data outside the common region, the difference between the maximum and minimum values of the corresponding voice data may be applied as an error range; if the input comparison value converges within the error range, the speaker who uttered the voice may be recognized as a legitimate speaker (i.e., the same person) corresponding to the reference voice information DB.
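  • The common-region matching and error-range test can be sketched as follows; treating the per-dimension [min, max] spread of the enrolled voice data as the error range is an assumption consistent with, but not dictated by, the description above.

    # Sketch of the FIG. 8 matching rule: accept an input value if it falls
    # within the [min, max] spread (error range) of the enrolled voice data
    # for each dimension. The exact form of the rule is assumed.
    def within_error_range(enrolled, candidate):
        for dim, value in enumerate(candidate):
            lo = min(sample[dim] for sample in enrolled)
            hi = max(sample[dim] for sample in enrolled)
            if not (lo <= value <= hi):
                return False
        return True   # converges within the error range -> same person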
  • The above-described method may be applied to the apparatus as well; accordingly, for the apparatus, descriptions of content identical to the method described above are omitted.
  • Meanwhile, the method for identifying a speaker based on spoken words described above can be written as a program executable on a computer, and can be implemented in a general-purpose digital computer that runs the program using a computer-readable medium.
  • the structure of the data used in the above-described method can be recorded on the computer-readable medium through various means.
  • a recording medium for recording an executable computer program or code for performing various methods of the present invention should not be understood to include temporary objects, such as carrier waves or signals.
  • the computer readable medium may include a storage medium such as a magnetic storage medium (eg, a ROM, a floppy disk, a hard disk, etc.), an optical reading medium (eg, a CD-ROM, a DVD, etc.).
  • A user's general tone of speech, that is, frequency and pitch, can vary with temporal factors (e.g., aging) and environmental factors (e.g., concert halls). Although the voice spoken by the user may thus change in specific environments and states, identifying the user's voice with a fixed voice model, as in conventional methods, takes no account of the possibility of voice fluctuation due to the user's living environment and the like, so reliability in voice recognition may be seriously degraded.
  • FIG. 9 is a block diagram of a context-based voice model management apparatus according to an embodiment of the present invention, FIG. 10 is a block diagram of the context-based voice model management apparatus and a context-presenting speaker identification system interoperable with it according to an embodiment of the present invention, FIG. 11 shows an example of the operation of the context-presenting speaker identification system, FIG. 12 is a flowchart illustrating an operation example of the context-based voice model management apparatus according to an embodiment of the present invention, and FIG. 13 illustrates an operation example of the context-based voice model management apparatus according to another embodiment of the present invention.
  • Referring to FIGS. 9 and 10, the context-based voice model management apparatus 3000 may interwork with the context-presenting speaker identification system 4000, and the apparatus 3000 may include: a storage unit 3100 for storing individual voice data generated whenever a voice is received from the speaker; a similarity estimation unit 3200 that extracts each piece of individual voice data from the storage unit 3100 and estimates the similarity between the pieces of individual voice data; a voice model generation unit 3300 for generating the speaker's first voice model from at least one piece of individual voice data selected based on the similarity estimated by the similarity estimation unit 3200; a determination unit 3400 that determines whether a comparison voice model corresponding to the first voice model exists in the storage unit 4400 of the context-presenting speaker identification system 4000, provides the first voice model to the storage unit 4400 for storage if it does not exist, and has the comparison similarity between the first voice model and the comparison voice model estimated through the similarity estimation unit 3200 if it does; and a voice model editing unit 3500 that replaces the comparison voice model with the first voice model when the comparison similarity estimated by the similarity estimation unit 3200 is greater than or equal to a predetermined reference value, and generates a second voice model by combining the first voice model and the comparison voice model when it is less than the reference value. The second voice model may be provided again to the determination unit 3400 and the voice model editing unit 3500.
  • The context-presenting speaker identification system 4000 may include: a voice receiver 4100 for receiving a voice from the speaker; a voice feature extraction unit 4200 for extracting voice characteristics from the received voice; a context voice model generation unit 4300 for generating a voice model based on the extracted voice characteristics; a storage unit 4400 in which the generated voice models are stored in matrix form; a random number generation unit 4500 for generating a random number to be used for speaker identification; a voice model extraction unit 4600 for extracting the voice model at the position on the matrix-form voice model DB of the storage unit corresponding to the generated random number; a voice utterance request unit 4700 for requesting a predetermined utterance from the speaker based on the extracted voice model; and a speaker identification unit 4800 that identifies the speaker by comparing the voice uttered by the speaker with the extracted voice model. The predetermined utterance may be the sound of a word or sentence preset at the position on the matrix-form DB of the storage unit corresponding to the generated random number.
  • For example, the word 'bank' and a spoken voice model of the word are stored in advance in the matrix-form DB of the storage unit 4400, and when the user's utterance of the word 'bank' is required for user identification (verification) by voice, the voice utterance request unit 4700 may request that the user pronounce the word 'bank'. Such a request may be presented to the user by voice, picture, message, or the like.
  • A voice model according to an embodiment of the present invention refers to a data set including utterance pattern information, such as a context and the speaker's way of pronouncing that context.
  • A context refers to a particular word (e.g., "bank") as well as to a series of sentences containing the word.
  • the word 'bank' and the spoken speech model of the word may be stored on a matrix position of a predetermined matrix DB.
  • When user voice identification is required, the random number generation unit 4500 generates a random number, and the word at the matrix position of the matrix-form DB corresponding to that random number may be presented to the user as the word to be uttered.
  • The context-presented voice model matrix DB may be configured in NxM form (where N and M are the same or different positive integers); for example, a context-presented voice model may be constructed as a DB in a 20x5 matrix.
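  • A minimal sketch of presenting a random challenge word from the NxM matrix DB follows; the 20x5 size comes from the example above, while the placeholder word list is an illustrative assumption (a real DB would also hold the corresponding voice models).

    # Sketch of random-number word presentation from an NxM (here 20x5) matrix DB.
    import random

    N, M = 20, 5
    matrix_db = [[f"word_{r}_{c}" for c in range(M)] for r in range(N)]  # hypothetical contents

    def request_utterance():
        """Pick a random matrix position and return the word to present."""
        r, c = random.randrange(N), random.randrange(M)
        return matrix_db[r][c]   # e.g., presented to the user by voice, picture, or message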
  • The context-based voice model management apparatus 3000 may communicate, through its communication unit 3700, with other electronic devices included in a network; in particular, the apparatus 3000 may exchange data with the communication unit 4900 of the context-presenting speaker identification system 4000.
  • The context-based voice model management apparatus 3000 is depicted separately from the context-presenting speaker identification system 4000 for convenience of description; however, the context-based voice model management apparatus 3000 may be implemented so as to constitute a part of the context-presenting speaker identification system 4000.
  • The communication units 3700 and 4900 may include a Bluetooth communication module, a Bluetooth Low Energy (BLE) communication module, a near field communication unit, a Wi-Fi communication module, a Zigbee communication module, an infrared data association (IrDA) communication module, a Wi-Fi Direct (WFD) communication module, an ultra wideband (UWB) communication module, an Ant+ communication module, and the like, but are not limited thereto.
  • Individual voice data includes at least one of the frequency, pitch, formant, speech time, and speech speed of each of the speaker's utterances, and the similarity estimation unit 3200 of the context-based voice model management apparatus 3000 may evaluate the similarity between the pieces of individual voice data for each of the speaker's utterances.
  • When a specific speaker (e.g., user B of FIG. 11) utters a predetermined word (e.g., "bank"), the spoken voice is received by the voice receiver 4100, speech characteristics are extracted from it, and the extracted characteristics constitute the individual voice data.
  • The similarity estimator 3200 may estimate the similarity between the individual voice data of each of the speaker's utterances (e.g., for the word "bank", a voice uttered two weeks ago, a voice uttered one week ago, and a voice uttered yesterday).
  • A first voice model of the speaker (e.g., user B of FIG. 11) may be generated from at least one piece of individual voice data selected based on the similarity estimated by the similarity estimator 3200, as sketched below.
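One plausible realization of this selection-and-generation step is sketched below, reusing IndividualVoiceData and similarity from the earlier sketch; keeping only mutually similar utterances and averaging their features into the first model is an assumption made for illustration.

```python
from dataclasses import astuple

def build_first_model(samples, min_sim=0.9):
    """Hypothetical sketch: keep utterances that are mutually similar enough
    (min_sim is an assumed cutoff) and average their features into a first
    voice model of the speaker."""
    selected = [
        s for s in samples
        if all(similarity(s, t) >= min_sim for t in samples if t is not s)
    ]
    if not selected:                 # no mutually similar subset: use everything
        selected = list(samples)
    n = len(selected)
    return IndividualVoiceData(
        *(sum(values) / n for values in zip(*(astuple(s) for s in selected)))
    )
```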
  • The determination unit 3400 determines whether a comparison speech model corresponding to the first speech model exists in the storage unit 4400 of the context-presenting speaker identification system 4000. If it does not exist, the first speech model is provided to the storage unit 4400 and stored therein; if it does exist, the comparison similarity between the first speech model and the comparison speech model may be estimated by the similarity estimator 3200.
  • If the comparison similarity is greater than or equal to a predetermined reference value, the voice model editing unit 3500 replaces the comparison voice model with the first voice model; if it is less than the predetermined reference value, a second voice model may be generated by combining the first voice model and the comparison voice model.
  • The predetermined reference value may be at least 51% (or 0.51), and preferably at least 75% (or 0.75); above this reference value, the voice model is reliable enough to be edited (replaced).
  • The second voice model may again be provided to the determination unit 3400 and the voice model editing unit 3500: the determination unit 3400 determines whether a comparison speech model corresponding to the second voice model (the newly generated voice model) exists in the storage unit 4400 of the context-presenting speaker identification system 4000; if not, the second speech model is provided to the storage unit 4400 for storage, and if so, the comparison similarity between the second speech model and the comparison speech model may be estimated by the similarity estimator 3200. This process can be performed repeatedly, and through such an iterative process a speech model optimized for the speaker's current speech state may be stored and managed in the matrix DB; a minimal sketch of this update rule follows below.
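The replace-or-combine rule and its repetition could look like the following sketch, which reuses the earlier helpers; the equal-weight blending of the two models and the dictionary-backed store are assumptions made only for illustration.

```python
from dataclasses import astuple

def update_model(store, key, model, threshold=0.75):
    """Sketch of the editing rule: replace the comparison model when the
    comparison similarity reaches the (preferred) 0.75 reference value,
    otherwise blend both into a second model and repeat the check."""
    comparison = store.get(key)
    if comparison is None:
        store[key] = model                     # no comparison model yet: store
        return model
    if similarity(model, comparison) >= threshold:
        store[key] = model                     # replace the comparison model
        return model
    second = IndividualVoiceData(*(
        (x + y) / 2.0 for x, y in zip(astuple(model), astuple(comparison))
    ))
    return update_model(store, key, second, threshold)  # iterate with second model
```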
  • The apparatus further includes a period setting unit 3600 for setting a management period of the voice models. When all voice models are updated within the set management period, the voice model editing unit 3500 maintains the existing matrix-form speech model DB on the storage unit 4400 of the context-presenting speaker identification system 4000; when at least one voice model is not updated within the set management period, the voice model editing unit 3500 may delete or maintain a part of the existing matrix-form speech model DB based on a new first speech model associated with the speaker.
  • The management period according to an embodiment of the present invention may be one day, one week, or one month, and may be set individually according to the user's intention.
  • For example, the voice model for a particular word ("bank") may be managed on a weekly cycle, while a particular user's models are managed on a daily cycle and another user's on a monthly cycle; that is, the management period may be set individually for each word and for each user.
  • If a new first voice model related to the speaker does not exist, the voice model editing unit 3500 deletes the at least one un-updated voice model from the matrix-form voice model DB. If a new first voice model does exist, the editing unit compares the at least one un-updated voice model with the new first voice model; if the comparison yields a difference within a predetermined range, the voice model editing unit 3500 maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system, and if the difference is outside that range, the at least one un-updated voice model can be deleted from the matrix-form voice model DB.
  • The allowable range of the difference value may be greater than 0 and up to 15% (or 0.15); depending on whether the difference falls within this range, the specific un-updated voice model (e.g., voice model 8 of FIG. 13) may be kept or deleted. A minimal sketch of this period-based maintenance follows below.
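A sketch of the period-based maintenance is given below; the timestamped entry layout, the use of 1 − similarity as the difference value, and all names are assumptions introduced for illustration, reusing the similarity helper from earlier.

```python
import time

def prune_stale_models(entries, period_s, new_first_model=None,
                       max_diff=0.15, now=None):
    """Delete entries not updated within the management period, unless a new
    first model exists and differs from them by at most max_diff (<= 15%).

    ``entries`` is assumed to map word -> {"model": IndividualVoiceData,
    "updated_at": unix_time}; period_s is the management period in seconds."""
    now = time.time() if now is None else now
    for word, entry in list(entries.items()):
        if now - entry["updated_at"] <= period_s:
            continue                              # updated in time: keep
        if new_first_model is None:
            del entries[word]                     # stale and nothing to compare
            continue
        diff = 1.0 - similarity(entry["model"], new_first_model)
        if diff > max_diff:                       # outside the allowable range
            del entries[word]
```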
  • FIG. 14 is a flowchart illustrating a voice model management method using a context-based voice model management apparatus according to an embodiment of the present invention.
  • A method of managing a speech model using a context-based speech model management apparatus includes: (a) generating and storing individual voice data each time an utterance is received from a speaker (S210); (b) when a plurality of individual voice data are stored, extracting each individual voice data and estimating the similarity between them (S220); (c) generating a first speech model of the speaker from at least one individual voice data selected based on the estimated similarity (S230); (d) determining whether a comparison speech model corresponding to the first speech model exists in the storage unit of the context-presenting speaker identification system, providing the first speech model to that storage unit for storage if it does not, and estimating the comparison similarity between the first speech model and the comparison speech model if it does (S240); and (e) replacing the comparison speech model with the first speech model if the comparison similarity is greater than or equal to a predetermined reference value, and combining the first speech model and the comparison speech model to generate a second speech model if the comparison similarity is less than the reference value (S250).
  • Steps (d) (S240) and (e) (S250) described above may be repeatedly performed with respect to the second voice model; an illustrative end-to-end sketch follows below.
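Purely as an illustration of the flow from S210 through S250 (not the disclosed implementation), the earlier sketches can be tied together as follows.

```python
def on_utterance(features, samples, store, key):
    """Hypothetical driver: ``features`` is an IndividualVoiceData instance
    for the newly received utterance; ``samples`` accumulates per-speaker data."""
    samples.append(features)                       # (a) S210: store individual voice data
    if len(samples) < 2:
        return None                                # similarity needs plural data
    first_model = build_first_model(samples)       # (b) S220 + (c) S230
    return update_model(store, key, first_model)   # (d) S240 + (e) S250
```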
  • The method for managing a voice model may further include setting a management period of the voice models by the period setting unit of the aforementioned context-based voice model management apparatus (S10).
  • The setting of the management period may be performed before S210, or the management period may be set by the user at any time.
  • When all voice models are updated within the set management period, the voice model editing unit 3500 of the apparatus 3000 maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system 4000; if at least one voice model is not updated within the set management period, the voice model editing unit 3500 may delete or maintain a part of the existing matrix-form voice model DB based on the new first voice model associated with the speaker.
  • As described above, if a new first voice model related to the speaker does not exist, the voice model editing unit 3500 deletes the at least one un-updated voice model from the matrix-form voice model DB; if a new first voice model exists, it compares the at least one un-updated voice model with the new first voice model, maintains the existing matrix-form voice model DB on the storage unit of the context-presenting speaker identification system 4000 if the difference is within the predetermined range, and deletes the at least one un-updated voice model from the matrix-form voice model DB if the difference is outside that range.
  • The above description of the context-based speech model management apparatus applies to this method; accordingly, descriptions of contents identical to those given for the apparatus are omitted.
  • The above-described method of operating the context-based speech model management apparatus may be written as a program executable on a computer and implemented in a general-purpose digital computer that runs the program using a computer-readable medium.
  • The structure of the data used in the above-described method can be recorded on a computer-readable medium through various means.
  • A recording medium storing an executable computer program or code for performing the various methods of the present invention should not be understood to include transitory objects such as carrier waves or signals.
  • The computer-readable medium may include storage media such as magnetic storage media (e.g., ROM, floppy disks, hard disks) and optically readable media (e.g., CD-ROM, DVD).

Abstract

An embodiment of the present invention relates to a method for identifying a speaker on the basis of a spoken word and to an apparatus therefor. In addition, an apparatus for managing a voice model on the basis of context according to an embodiment of the present invention can interact with a text-prompted speaker identification system. The apparatus and its method may be configured such that individual voice data generated each time a voice is received from a speaker are stored in a storage unit; when multiple individual voice data are stored in the storage unit, each individual voice data is extracted from the storage unit and the similarity between the individual voice data is estimated so as to generate a voice model, which is then managed on the basis of the context of the user's utterance.
PCT/KR2019/001355 2018-02-09 2019-01-31 Method for identifying a speaker on the basis of a spoken word and apparatus therefor, and apparatus for managing a voice model on the basis of context and method therefor Ceased WO2019156427A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2018-0016444 2018-02-09
KR1020180016444A KR101888058B1 (ko) 2018-02-09 2018-02-09 Method for identifying a speaker based on a spoken word and apparatus therefor
KR1020180016663A KR101888059B1 (ko) 2018-02-12 2018-02-12 Context-based voice model management apparatus and method therefor
KR10-2018-0016663 2018-02-12

Publications (1)

Publication Number Publication Date
WO2019156427A1 true WO2019156427A1 (fr) 2019-08-15

Family

ID=67548542

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2019/001355 Ceased WO2019156427A1 (fr) 2019-01-31 Method for identifying a speaker on the basis of a spoken word and apparatus therefor, and apparatus for managing a voice model on the basis of context and method therefor

Country Status (1)

Country Link
WO (1) WO2019156427A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09198084A (ja) * 1996-01-16 1997-07-31 Nippon Telegr & Teleph Corp <Ntt> Speaker recognition method with model updating and apparatus therefor
KR20000037106A (ko) * 2000-04-07 2000-07-05 이상건 Network-based speaker learning and speaker verification method and apparatus
KR20030013855A (ko) * 2001-08-09 2003-02-15 삼성전자주식회사 Voice registration method and system, and voice recognition method and system based thereon
KR20070060581A (ko) * 2005-12-09 2007-06-13 한국전자통신연구원 Speaker adaptation method and apparatus
JP2017223848A (ja) * 2016-06-16 2017-12-21 Panasonic Intellectual Property Corporation of America Speaker recognition device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KIM, KYUNG WHA ET AL.: "Forensic Automatic Speaker Identification System for Korean Speakers", PHONETICS AND SPEECH SCIENCES, vol. 4, no. 3, September 2012 (2012-09-01), pages 95 - 101, XP055631005 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220301554A1 (en) * 2019-01-28 2022-09-22 Pindrop Security, Inc. Unsupervised keyword spotting and word discovery for fraud analytics
US11810559B2 (en) * 2019-01-28 2023-11-07 Pindrop Security, Inc. Unsupervised keyword spotting and word discovery for fraud analytics

Similar Documents

Publication Publication Date Title
WO2020139058A1 Voiceprint recognition across devices
WO2020207035A1 Crank call interception method, apparatus and device, and storage medium
WO2016129930A1 Operating method for a voice function and electronic device supporting the same
WO2020034526A1 Quality inspection method, apparatus and device for insurance recording, and computer storage medium
CN110047481B Method and apparatus for speech recognition
WO2015068947A1 Voice content analysis system based on keyword extraction from recorded voice data, indexing method using the system, and voice content analysis method
WO2015005679A1 Speech recognition method, apparatus, and system
WO2018070780A1 Electronic device and method for controlling the same
KR101888058B1 Method for identifying a speaker based on a spoken word and apparatus therefor
WO2019208860A1 Method for recording and outputting a multi-party conversation using speech recognition technology, and device therefor
WO2020151317A1 Voice verification method and apparatus, computer device, and storage medium
KR102389995B1 Method for generating spontaneous speech and computer program recorded on a recording medium for executing the same
CN113129895B Voice detection processing system
WO2020246641A1 Speech synthesis method and speech synthesis device capable of determining a plurality of speakers
CN109887508A Voiceprint-based automatic meeting recording method, electronic device, and storage medium
WO2023063718A1 Method and system for analyzing device characteristics to improve user experience
WO2019172734A2 Data mining device, and speech recognition method and system using the same
WO2022203152A1 Speech synthesis method and device based on multi-speaker training datasets
CN108364655B Voice processing method, medium, apparatus, and computing device
CN111462754A Method for building a speech recognition model for power system dispatch control
WO2020159140A1 Electronic device and control method therefor
WO2020091123A1 Method and device for providing context-based voice recognition service
WO2019088635A1 Voice synthesis device and method
WO2020096078A1 Method and device for providing voice recognition service
WO2019156427A1 Method for identifying a speaker on the basis of a spoken word and apparatus therefor, and apparatus for managing a voice model on the basis of context and method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19750986; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 19750986; Country of ref document: EP; Kind code of ref document: A1)