WO2024226024A1 - Systems and methods for personalized sound localization in a conversation - Google Patents
- Publication number
- WO2024226024A1 (PCT application PCT/US2023/019559, US2023019559W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- participants
- computing device
- audio
- conversation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
- H04N7/155—Conference systems involving storage of or access to video conference sessions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/40—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
- H04R1/406—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/005—Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/20—Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
- H04R2430/23—Direction finding using a sum-delay beam-former
Definitions
- a useful feature of such a transcription is to be able to identify the individual participants in the group conversation, and label the transcription so as to reflect the participants corresponding to different portions of the transcription.
- Another useful feature is to provide a visual display of the participants and their respective physical locations as they participate in the group conversation.
- Some individual participants may utter speech in different ways. For example, some participants may have accents, and others may have speech characteristics that may cause variations in their uttered speech.
- a general speech recognition transcription may reflect such utterances as errors in transcription.
- Accuracy of on-device recognition of user input can be challenging when the input includes terms that are not recognized as common terms. For example, when the input is in the form of speech, one or more phrases may be pronounced in a way that significantly depends on the participant.
- a computer-implemented method includes receiving, by a computing device, first and second audio signals from respective first and second audio input devices, wherein the first and second audio signals correspond to speech input of a conversation between two participants. The method further includes estimating, by the computing device and based on the first and second audio signals, a time delay in respective arrival times for the speech input at the first and second audio input devices.
- the method additionally includes estimating, by the computing device, respective directions for two audio sources based on the estimated time delay in the respective arrival times.
- the method also includes associating, based on the estimated directions of the two audio sources, respective portions of a speech-to-text transcript of the conversation with the respective participants.
- the method further includes displaying, by the computing device and based on the associating of the respective portions, a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
- a computing device includes one or more processors and data storage.
- the data storage has stored thereon computer-executable instructions that, when executed by one or more processors, cause the computing device to carry out functions.
- the functions include receiving, by the computing device, first and second audio signals from respective first and second audio input devices, wherein the first and second audio signals correspond to speech input of a conversation between two participants.
- the functions further include estimating, by the computing device and based on the first and second audio signals, a time delay in respective arrival times for the speech input at the first and second audio input devices.
- the functions additionally include estimating, by the computing device, respective directions for two audio sources based on the estimated time delay in the respective arrival times.
- the functions also include associating, based on the estimated directions of the two audio sources, respective portions of a speech-to-text transcript of the conversation with the respective participants.
- the functions further include displaying, by the computing device and based on the associating of the respective portions, a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
- an article of manufacture includes one or more computer readable media having computer-readable instructions stored thereon that, when executed by one or more processors of a computing device, cause the computing device to carry out functions.
- the functions include receiving, by the computing device, first and second audio signals from respective first and second audio input devices, wherein the first and second audio signals correspond to speech input of a conversation between two participants.
- the functions further include estimating, by the computing device and based on the first and second audio signals, a time delay in respective arrival times for the speech input at the first and second audio input devices.
- the functions additionally include estimating, by the computing device, respective directions for two audio sources based on the estimated time delay in the respective arrival times.
- the functions also include associating, based on the estimated directions of the two audio sources, respective portions of a speech-to-text transcript of the conversation with the respective participants.
- the functions further include displaying, by the computing device and based on the associating of the respective portions, a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
- the system includes means for receiving, by a computing device, first and second audio signals from respective first and second audio input devices, wherein the first and second audio signals correspond to speech input of a conversation between two participants; means for estimating, by the computing device and based on the first and second audio signals, a time delay in respective arrival times for the speech input at the first and second audio input devices; means for estimating, by the computing device, respective directions for two audio sources based on the estimated time delay in the respective arrival times; means for associating, based on the estimated directions of the two audio sources, respective portions of a speech-to-text transcript of the conversation with the respective participants; and means for displaying, by the computing device and based on the associating of the respective portions, a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
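- As a rough, non-authoritative illustration of how these steps could fit together, the following Python sketch estimates an inter-microphone delay by cross-correlation, maps it to an azimuth under a simple far-field assumption, and labels transcript segments accordingly; the microphone spacing, sampling rate, and all function names are assumed values for illustration only.

```python
# Hypothetical end-to-end sketch of the described pipeline; all names and constants are
# illustrative assumptions, not values from the disclosure.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.14       # assumed distance between the two microphones, in meters
SAMPLE_RATE = 44_100     # Hz

def estimate_delay(sig1: np.ndarray, sig2: np.ndarray) -> float:
    """Estimate the arrival-time difference (seconds) between two channels via cross-correlation."""
    corr = np.correlate(sig1, sig2, mode="full")
    lag = np.argmax(corr) - (len(sig2) - 1)   # lag in samples; sign says which channel heard it first
    return lag / SAMPLE_RATE

def delay_to_angle(delay_s: float) -> float:
    """Map a delay to an azimuth (degrees) measured from the axis joining the two microphones."""
    x = np.clip(delay_s * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arccos(x)))

def label_transcript(segments, angles, boundary_deg=90.0):
    """Attach a participant label to each (text, angle) pair by which half-plane it came from."""
    return [f"{'A' if angle < boundary_deg else 'B'}: {text}"
            for text, angle in zip(segments, angles)]

# Usage with synthetic audio: the second channel is a delayed copy of the first.
rng = np.random.default_rng(0)
base = rng.standard_normal(4410)
delay = estimate_delay(base, np.roll(base, 12))
print(label_transcript(["Hi, how are you doing?"], [delay_to_angle(delay)]))
```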
- FIG. 1 illustrates a computing device for speaker identification based on sound localization, in accordance with example embodiments.
- FIG.2A illustrates localization with two microphones, in accordance with example embodiments.
- FIG. 2B illustrates example audio input components of a computing device for sound localization, in accordance with example embodiments.
- FIG.2C illustrates localization with three microphones, in accordance with example embodiments.
- FIG. 3 illustrates example sample audio signals captured by three audio input devices, in accordance with example embodiments.
- FIG.4A illustrates an example for sound localization, in accordance with example embodiments.
- FIG.4B illustrates an example flowchart for speaker identification, in accordance with example embodiments.
- FIG. 5 illustrates an example split-screen display for speaker identification based on sound localization, in accordance with example embodiments.
- FIG.6 is a diagram illustrating training and inference phases of a machine learning model, in accordance with example embodiments.
- FIG. 7 depicts a distributed computing architecture, in accordance with example embodiments.
- FIG. 8 is a block diagram of a computing device, in accordance with example embodiments.
- FIG.9 is a flowchart of a method, in accordance with example embodiments.
- FIG. 10 is a flowchart of another method, in accordance with example embodiments.
- DETAILED DESCRIPTION
- Overview
- Speech-to-text capabilities on mobile devices are helpful for hearing and speech accessibility, language translation, note taking, transcribing meetings, and so forth. However, the usefulness of such capabilities is constrained by an inability to distinguish between multiple speakers, to track which direction speech is coming from, and to provide acceptable performance in noisy environments.
- In the context of speech-to-text transcription, diarization methods separate the speech according to the speakers.
- Some real-time diarization techniques are based on dedicated microphone clips that are attached to each speaker. Such approaches are not suitable for applications with off-the-shelf mobile phones. Some transcription software provides diarization that cannot be performed in real-time, and/or may depend on network connectivity. Real-time diarization may be performed using deep learning models based on d-vectors, and these can provide a unique acoustic signature. Such techniques can be useful in a single microphone setting.
- sound localization techniques and beamforming on embedded multi-microphone hardware may be used at a mobile device to identify a sound source corresponding to speakers (or participants) in a conversation, and then apply a speech characteristics diarization (e.g., a personalized speech recognition model) to the participants located in directions corresponding to their identified sound sources.
- the device combines the speech characteristics diarization based on acoustic characteristics (e.g. d-vector/speaker embeddings followed by clustering or principal components analysis) and sound localization to improve the detection of a speaker turn or identification of a speaker identity (ID).
- a person may have hearing challenges, and/or may not be proficient in a particular language. Such a person may use an application on their mobile device for real- time speech-to-text. However, such an application would be unable to distinguish between speakers, and the transcription may be a concatenation of the speech into a single paragraph, making it difficult for the individual to follow. Additionally, as the person is more likely to be viewing the display on their mobile device, they may not be able to track speaker changes effectively. For example, the person will likely struggle to follow the turn taking among the participants in the conversation. Finally, the text generated by automatic speech recognition (ASR) may be difficult to understand since background noise may introduce errors.
- a speech-to-text transcript may visually separate different speakers, based on the direction of the speech.
- the screen may display the direction of the source of the sound.
- Beamforming may be used to enhance audio coming from certain directions and suppress audio from non-selected directions.
- Microphone array processing has generally been limited to high-end audio applications, because its computational intensity makes it difficult to meet the latency and power consumption requirements of mobile and/or wearable devices.
- recent advances in computational power allow such techniques to be used in low-power devices with low latency as well.
- low-latency localization and beamforming algorithms are described herein. For example, 180-degree localization is possible using an existing two-microphone placement on an off-the-shelf mobile phone with no additional hardware.
- the prototype may be configured for multi-channel audio over USB-C, localization data over USB-C, and provide analog audio over the headphone jack.
- a trained machine learning model can work on a variety of computing devices, including but not limited to, mobile computing devices (e.g., smart phones, tablet computers, cell phones, laptop computers), stationary computing devices (e.g., desktop computers), and server computing devices.
- a machine learning model such as a convolutional neural network, can be trained using training data (e.g., speech data) to perform one or more aspects as described herein.
- the neural network can be arranged as an encoder/decoder neural network.
- a trained machine learning model can process the input data to predict output data comprising one or more words and/or phrases that are associated with the input data.
- (a copy of) the trained neural network can reside on a mobile computing device.
- the mobile computing device can include a microphone that can capture input voice data.
- the trained neural network can generate a predicted output word that is associated with the input voice data.
- a first trained machine learning model can perform input recognition to generate a transcription of an input
- a second trained machine learning model can perform personalization of the transcription based on speaker-based acoustic models. Although the personalization and labeling of the transcription is performed by an on-device system, the speech recognition can be performed by a remote server.
- the on-device system can receive the speech and send the speech to the remote server.
- the remote server can apply the first trained machine learning model to generate the transcription of the speech.
- the on-device system can receive the transcription from the remote server, and perform the personalization and labeling of the transcription.
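- One way this division of labor could be organized is sketched below, where a placeholder remote-recognizer client returns unlabeled transcript segments and an on-device labeler attaches participant labels from locally estimated directions; the class and method names are hypothetical and do not refer to any real speech-recognition service API.

```python
# Hypothetical sketch of the on-device / remote split; class, method, and field names are placeholders
# and do not refer to any real speech-recognition service API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TranscriptSegment:
    text: str
    start_s: float
    end_s: float

class RemoteAsrClient:
    """Stand-in for a remote recognizer; in a real system transcribe() would be a network call."""
    def transcribe(self, audio_bytes: bytes) -> List[TranscriptSegment]:
        return [TranscriptSegment("hello there", 0.0, 1.2)]   # canned result for the sketch

class OnDeviceLabeler:
    """Runs locally: attaches participant labels to segments using locally estimated directions."""
    def __init__(self, direction_for_interval: Callable[[float, float], float]):
        self.direction_for_interval = direction_for_interval   # maps (start_s, end_s) -> azimuth

    def label(self, segments: List[TranscriptSegment]) -> List[str]:
        labeled = []
        for seg in segments:
            azimuth = self.direction_for_interval(seg.start_s, seg.end_s)
            speaker = "A" if azimuth < 90.0 else "B"           # simple two-participant split
            labeled.append(f"{speaker}: {seg.text}")
        return labeled

# Usage: the remote service returns plain text; labeling happens on the device.
labeler = OnDeviceLabeler(direction_for_interval=lambda start, end: 15.0)
print(labeler.label(RemoteAsrClient().transcribe(b"")))   # ['A: hello there']
```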
- the herein-described techniques can improve speech recognition by accurately identifying speakers based on estimation of audio sources, comparison of speech characteristics, or both. Enhancing the actual and/or perceived quality of speech recognition can provide benefits to downstream applications that depend on an output of a speech recognition system (e.g., voice enabled applications). These techniques are flexible, and so can apply to a wide variety of user input, such as human voices in various languages, dialects, and accents.
- FIG. 1 illustrates a computing device 100 for participant identification based on sound localization, in accordance with example embodiments.
- Device 105 may be configured with an on-device system 110 that can include the one or more components that perform the functions described herein. Two or more individuals may be participating in a group conversation and device 105 may capture the conversation and transcribe the speech.
- Device 105 may include one or more audio input devices, such as microphone(s) 115. Although microphone(s) 115 are shown as part of hardware 120 of device 105, this is for purposes of illustration only. One or more of microphone(s) 115 may be located on device 105. However, microphone(s) 115 may be external audio input devices that capture the conversation. For example, a conference room may include microphone(s) 115 installed at various locations to capture the conversation, and the captured speech can be converted to an audio signal (for each microphone) and be transmitted to device 105.
- Device 105 may include computing devices such as a laptop, a desktop computer, a smart television, an electronic reading device, a streaming content device, a gaming console, a tablet device, or other related computing devices that are configured to execute software instructions and application programs.
- on-device system 110 can be configured to run on an operating system of device 105.
- On-device system 110 can include an interface (e.g., by an application programming interface (API)) to communicate with one or more application programs on device 105.
- on-device system 110 can communicate with an automatic speech recognition system, and/or a transcription system, one or both of which may be located on a cloud server.
- On-device system 110 can interface with one or more hardware components 120 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), memory, input and/or output devices, and/or camera(s) of device 105).
- on-device system 110 can interface with one or more cameras, or with microphone(s) 115.
- the speech input by the user can be captured by microphone(s) 115.
- Microphone(s) 115 may be a part of device 105, or may be an audio input device (e.g., a wired or wireless microphone) separate from device 105, and communicatively linked to device 105.
- Device 105 may be configured to receive first and second audio signals from respective first and second audio input devices (e.g., microphone(s) 115).
- the first and second audio signals correspond to speech input of a conversation between two participants (e.g., participants A and B).
- When microphone(s) 115 are in close proximity to one another, they will receive substantially the same speech, and therefore generate substantially similar audio signals.
- When microphone(s) 115 are located at a distance apart from one another, each may produce a different audio signal for the same conversation.
- Suppose microphone(s) 115 are two microphones, say M1 and M2, located at or near the top and bottom of a smartphone, with participant A nearer microphone M1 and participant B nearer microphone M2.
- In that case, microphone M1 will likely receive participant A’s speech signal earlier than participant B’s speech signal, and microphone M2 will likely receive participant B’s speech signal earlier than participant A’s speech signal.
- For example, microphone M1, which is farther from a sound source (e.g., participant B), is likely to generate a relatively lower amplitude and/or slightly phase-delayed signal relative to microphone M2, which is closer to that sound source. Similar logic applies to microphone M2 and participant A.
- the delay estimation may be performed using cross-correlation. For example, generalized cross-correlation with phase transform (GCC-PHAT) may be applied to the synchronized first and second audio signals. In some examples, the following relationship may be utilized:

  $$\hat{R}_{12}(k) = \mathcal{F}^{-1}\!\left\{ \frac{X_1(f)\,X_2^{*}(f)}{\left| X_1(f)\,X_2^{*}(f) \right|} \right\}(k) \qquad \text{(Eqn. 1)}$$

  where $X_1(f)$ and $X_2(f)$ are the frequency-domain transforms of the first and second audio signals and $^{*}$ denotes the complex conjugate.
- the time delay between two microphones may be determined from the cross-correlation based on the following relationship:

  $$\Delta t = \frac{k_{\max}}{f_s} \qquad \text{(Eqn. 2)}$$

  where $f_s$ is the audio sampling frequency (44.1 kHz), and $k_{\max}$ is the index of the maximum peak in the cross-correlation, corresponding to the delay in samples.
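- A minimal sketch of a GCC-PHAT delay estimator consistent with Equations 1 and 2 is shown below, assuming two synchronized channels sampled at 44.1 kHz; the function name and test signal are illustrative.

```python
# Illustrative GCC-PHAT delay estimator consistent with Eqns. 1 and 2 (not the disclosed implementation).
import numpy as np

def gcc_phat(sig1: np.ndarray, sig2: np.ndarray, fs: float = 44_100.0) -> float:
    """Return the estimated delay, in seconds, of sig1 relative to sig2."""
    n = len(sig1) + len(sig2) - 1
    X1 = np.fft.rfft(sig1, n=n)
    X2 = np.fft.rfft(sig2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(len(sig2) - 1):], cc[:len(sig1)]))     # reorder to lags -(N2-1)..(N1-1)
    k_max = int(np.argmax(np.abs(cc))) - (len(sig2) - 1)             # index of the maximum peak, in samples
    return k_max / fs                                                # Eqn. 2: delay = k_max / f_s

# Usage: a 20-sample shift at 44.1 kHz corresponds to roughly 0.45 ms.
rng = np.random.default_rng(1)
s = rng.standard_normal(8192)
print(gcc_phat(np.roll(s, 20), s))   # approximately 20 / 44100 seconds
```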
- FIG.2A illustrates localization with two microphones, in accordance with example embodiments.
- the sound will arrive at microphone two, mic 2 210, before microphone one, mic 1 205. This time difference may be used to estimate an angle of arrival. However, with two microphones, there may be some potential ambiguity, as an actual source 215 may appear to be at an inverse angle, $-\theta_1$, shown as a potential source 220.
- the graph 225 illustrates a kernel density estimation (KDE) with the actual source 215 and the potential source 220.
- the horizontal axis of the graph represents the azimuth angle and the vertical axis represents the kernel density.
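- As a concrete illustration (under an assumed far-field model, with the azimuth measured from the axis joining the two microphones, which is not spelled out in the figure description above), the delay-to-angle relationship and its mirror ambiguity can be written as:

$$\Delta t = \frac{d\cos\theta}{c}, \qquad \theta = \arccos\!\left(\frac{c\,\Delta t}{d}\right),$$

where $d$ is the spacing between mic 1 205 and mic 2 210 and $c$ is the speed of sound. Since $\cos(-\theta) = \cos(\theta)$, a source mirrored across the microphone axis at $-\theta$ yields the same delay, which is why the actual source 215 and the potential source 220 cannot be told apart from this pair alone.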
- Such angle ambiguity may be fixed with three microphones.
- the KDE from multiple microphone pairs will have the highest peak at the correct source, as illustrated in Figure 2C.
- FIG. 2B illustrates example audio input components of a computing device 230 for sound localization, in accordance with example embodiments.
- Device 230 is shown with three microphones.
- device 230 can be a smartphone, and a view of the rear side of the smartphone is illustrated.
- Device 230 is shown to be configured with three microphones 235, 240, and 245.
- Microphones 240 and 245 are located proximate to each other near the top of device 230, whereas microphone 235 is located near the bottom of device 230, at some distance away from microphones 240 and 245.
- Participant A may be located proximate to microphone 235
- participant B may be located proximate to microphones 240 and 245.
- Pairwise distances between the three microphones 235, 240, and 245 are known, as indicated by the three edges labeled “1” (distance between microphones 235 and 245), “2” (distance between microphones 235 and 240), and “3” (distance between microphones 240 and 245).
- distances between each participant and device 230 may be determined. For example, arrow “4” captures the first distance between participant A and the bottom of device 230, whereas arrow “5” captures a second distance between participant B and the top of device 230.
- Device 230 can record audio of the conversation between the two participants A and B.
- a time-difference-of-arrival (TDOA) may be determined for each pair of microphones.
- FIG. 2C illustrates localization with three microphones, in accordance with example embodiments. Three microphones, mic 1 250, mic 2 255, and mic 3 260, are shown. The distance between mic 1 250 and mic 2 255 may be denoted as $d_{12}$ and the angle between them may be denoted as $\theta_{12}$. Similarly, the distance between mic 2 255 and mic 3 260 may be denoted as $d_{23}$ and the angle between them as $\theta_{23}$, and the distance between mic 1 250 and mic 3 260 may be denoted as $d_{13}$ and the angle between them as $\theta_{13}$.
- a statistical approach may be used to determine an actual source location (e.g., source 265).
- the TDOA for each microphone pair (six unique pairs for four microphones) may be determined, and a second potential TDOA may be added.
- a coordinate transform may be applied so that each angle of arrival is aligned with global azimuth angles.
- a highest peak will likely correspond to the angle of arrival. As shown in graph 270, the highest peak indicates source 265.
- Graph 270 comprises a horizontal axis that represents azimuth angle, and a vertical axis that represents density.
- the KDE may be wrapped around 0 and 360 degrees to address a discontinuity around 0/360 degrees.
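- The sketch below illustrates one way the wrapped KDE vote could be computed: per-pair azimuth candidates are replicated at plus and minus 360 degrees so the density is continuous across the wrap point, and the grid point with the highest density is taken as the source azimuth. The candidate angles, the bandwidth value, and the use of scipy's gaussian_kde are assumptions for illustration.

```python
# Illustrative fusion of per-pair azimuth candidates with a KDE wrapped at 0/360 degrees.
# The candidate angles, bandwidth, and use of scipy's gaussian_kde are assumptions for this sketch.
import numpy as np
from scipy.stats import gaussian_kde

def wrapped_kde_peak(azimuths_deg: np.ndarray, grid_step_deg: float = 1.0) -> float:
    """Return the azimuth (degrees) at the highest peak of a KDE that is continuous across 0/360."""
    # Replicating the samples at +/-360 degrees removes the discontinuity at the wrap point.
    samples = np.concatenate([azimuths_deg - 360.0, azimuths_deg, azimuths_deg + 360.0])
    kde = gaussian_kde(samples, bw_method=0.05)          # illustrative smoothing; tune per setup
    grid = np.arange(0.0, 360.0, grid_step_deg)
    return float(grid[np.argmax(kde(grid))])

# Example: three microphone pairs each contribute a candidate angle plus a spurious mirror angle.
per_pair_estimates = np.array([12.0, 348.0, 10.0, 170.0, 14.0, 200.0])
print(wrapped_kde_peak(per_pair_estimates))   # the densest direction falls near the 10-14 degree cluster
```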
- Although more resource-intensive techniques, including advanced machine learning models, can be applied to identify participants during a conversation, such techniques involve higher compute resources, may require dedicated microphone clips attached to each participant, may depend on network connectivity, and may fail to identify participants in real time as a conversation unfolds.
- a more efficient technique may be employed based on a time-difference-of-arrival (TDOA) for pairs of microphones.
- on-device system 110 can interface with location determination system 125.
- Location determination system 125 can be configured to estimate a time delay in respective arrival times for the speech input at the first and second audio input devices. For example, localization based on cross-correlation of the first and second audio signals (or pairs of audio signals in the case of more than two audio input devices) can result in faster compute time, and can be performed on device 105, as it involves signal processing. As such, there is no need for dedicated microphones attached to individual participants, or availability of network connectivity. Many mobile devices are equipped with two microphones. These two microphones can be leveraged to generate the first and second audio signals and determine the time delay in respective arrival times for the speech input at the first and second audio input devices.
- Location determination system 125 can be configured to estimate respective directions for two audio sources based on the estimated time delay in the respective arrival times.
- cross- correlation for localization can be used to estimate two different audio sources, such as sources corresponding to participants A and B. Accordingly, device 105 is able to identify that the conversation is between two individuals.
- the time delay can be determined for each microphone pair. For example, three microphones result in three unique pairs, and four microphones result in six unique pairs, and so forth. Similarly, where there are more than two participants in the conversation, the time delay in respective arrival times for the audio signal can be used to identify the multiple audio sources.
- participant A may say, “Hi, how are you doing?” and participant B may respond, “Very well, sank yo. Are you heeded to ze meeting now?”
- Microphone(s) 115 may receive this speech input, and each microphone may convert this speech input to an audio signal.
- Location determination system 125 may estimate the audio sources by determining a time delay between these audio signals.
- speaker-based acoustic model 135 may be configured to predict speech characteristics of participant B. Accordingly, based on one or both of the estimated audio sources and the predicted speech characteristics, speaker determination system 130 may identify that there are two speakers, and also identify one of the two speakers as participant B (based on predicted speech characteristics). Transcription system 140 may then transcribe the input speech. In particular, in conjunction with speaker-based acoustic model 135, transcription system 140 may be able to accurately transcribe the words or phrases.
- transcription system 140 transcribes the speech input and outputs a labeled transcript as “A: Hi, how are you doing?, B: Very well, thank you. Are you headed to the meeting now?”. The remainder of the conversation may be transcribed in like manner, and on-device system 110 displays the transcript via an interactive graphical display of device 105.
- FIG. 3 illustrates example sample audio signals captured by three audio input devices, in accordance with example embodiments.
- Each of the three graphs 305, 310, and 315 represents a raw audio signal from three microphones.
- the vertical axis represents an amplitude of the audio signal
- the horizontal axis represents time.
- beamforming based on two microphones in a phone may be used for localization.
- the bottom microphone endfire represented by graph 315 and the top microphone endfire represented by graph 310 have different maximum and minimum peaks.
- Techniques for beamforming include a filter and sum (FAS) beamformer, and/or a minimum variance distortionless response (MVDR) beamformer.
- the term “broadside” means that the audio is arriving at the same time at each mic, so there is no delay. This is generally the case when the audio source is right in front of the microphone array.
- the term “endfire” means that the audio wave is arriving at one microphone much earlier than the other. This means the source is generally on the left or right side of the microphone array.
- Beamforming is a spatial filter where sound delays can be used to boost audio arriving from certain directions.
- a Filter and Sum (FAS) beamforming approach may be used, which is generally more computationally intensive than Delay and Sum (DAS), but is more suitable for broadband speech signals.
- the time-domain audio channels may be converted to the frequency domain.
- Each frequency bin can be filtered individually by multiplying by a steering weight (Eqn. 5), calculated from the microphone geometry (Eqn. 6).
- Multiplication in the frequency domain corresponds to filtering (convolution) in the time domain, so the steering weights act as filters.
- the spectral domain signals can be summed in their respective frequency bins and transformed to single-output audio in the time domain.
- a Hamming window with approximately 50% overlap may be used to reconstruct the audio.
- the FAS filter can be described in vector form as:

  $$y(\omega) = \mathbf{w}^{H}\,\mathbf{x}(\omega) = \sum_{m=1}^{M} w_m^{*}\, x_m(\omega) \qquad \text{(Eqn. 5)}$$

  where $y(\omega)$ is the single-channel audio output, $\mathbf{w}$ is the complex steering weight, as described in Equation 7, $\mathbf{x}(\omega)$ is the input audio transformed to the frequency domain, $m$ is the microphone, with $M$ total microphones, and $H$ is the conjugate transpose. Generally, the beam may be steered to any azimuth angle ($\phi$) by adjusting the steering weights.
- the individual complex steering weight of a microphone at a specific frequency may be computed as a phase delay in the following way:

  $$w(f, \phi, m) = e^{-j\,2\pi f\,\frac{r}{c}\cos\left(\phi - \phi_m\right)} \qquad \text{(Eqn. 6)}$$

  where $c$ is the speed of sound, $w$ is the steering weight, $f$ is the frequency, and $\phi$ is the desired steering azimuth angle. $\phi_m$ is the azimuth angle of the microphone in the circular array, and $r$ is the radius of the circular array.
- For a 4-microphone array, there can be a complex collection of steering weights as vectors, abbreviated as $\mathbf{W}$:

  $$\mathbf{W} = \begin{bmatrix} w(f_0, m_0) & w(f_0, m_1) & w(f_0, m_2) & w(f_0, m_3) \\ \vdots & \vdots & \vdots & \vdots \\ w(f_N, m_0) & w(f_N, m_1) & w(f_N, m_2) & w(f_N, m_3) \end{bmatrix} \qquad \text{(Eqn. 7)}$$
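- To make Equations 5 through 7 concrete, the sketch below forms steering weights for an assumed four-microphone circular array and applies a filter-and-sum beam to a single frame; the array radius, microphone azimuths, steering angle, and frame length are arbitrary example values, and the Hamming windowing and 50% overlap-add reconstruction mentioned above are omitted for brevity.

```python
# Illustrative filter-and-sum (FAS) beamformer for a single frame, following Eqns. 5-7.
# The array radius, microphone azimuths, steering angle, and frame length are example values;
# windowing and 50%-overlap reconstruction are omitted for brevity.
import numpy as np

C = 343.0                                               # speed of sound, m/s
FS = 44_100                                             # sampling rate, Hz
R = 0.04                                                # assumed circular-array radius, m
MIC_AZIMUTHS = np.deg2rad([0.0, 90.0, 180.0, 270.0])    # phi_m for each of the 4 microphones

def steering_weights(freqs_hz: np.ndarray, steer_deg: float) -> np.ndarray:
    """Eqn. 6: complex weight per (frequency bin, microphone) for a desired steering azimuth."""
    phi = np.deg2rad(steer_deg)
    delays = (R / C) * np.cos(phi - MIC_AZIMUTHS)                         # per-microphone time offsets
    return np.exp(-2j * np.pi * freqs_hz[:, None] * delays[None, :])      # shape: (bins, mics)

def fas_beamform(frame: np.ndarray, steer_deg: float) -> np.ndarray:
    """Eqn. 5: y(w) = w^H x(w) per frequency bin; `frame` has shape (samples, microphones)."""
    X = np.fft.rfft(frame, axis=0)                       # frequency-domain channels
    freqs = np.fft.rfftfreq(frame.shape[0], d=1.0 / FS)
    W = steering_weights(freqs, steer_deg)               # Eqn. 7: the collection of weights
    Y = np.sum(np.conj(W) * X, axis=1)                   # filter each bin, then sum across microphones
    return np.fft.irfft(Y, n=frame.shape[0])             # single-channel time-domain output

# Usage: steer a 1024-sample, 4-channel frame toward 45 degrees azimuth.
frame = np.random.default_rng(2).standard_normal((1024, 4))
out = fas_beamform(frame, steer_deg=45.0)
```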
- the Filter and Sum (FAS) beamformer may be further enhanced with Minimum Variance Distortionless Response Filter (MVDR).
- the curve represented using circles may correspond to the audio signal captured by microphone 240 (of Figure 2B)
- the curve represented using squares may correspond to the audio signal captured by microphone 245 (of Figure 2B)
- the curve represented using triangles may correspond to the audio signal captured by microphone 235 (of Figure 2B).
- microphones 240 and 245 are in close proximity near the top of device 230, and therefore the corresponding curves (represented with circles and squares) are substantially similar.
- microphone 235 is located toward the bottom of device 230, and therefore the corresponding curve (represented by triangles) has a different amplitude characteristic than the curves represented with circles and squares.
- FIG. 4A illustrates an example for sound localization, in accordance with example embodiments.
- estimating time differences for sound arrival at top and bottom microphones may be used to estimate a direction to a sound source.
- the horizontal axis represents time and the vertical axis represents angle for an audio source.
- the piecewise linear curve is ground truth data indicating two separate sources.
- the observed data is represented by the curve with triangles. As illustrated, the observed data also indicates two separate audio sources.
- the device may be placed on a motorized rotating platform capable of collecting the ground truth data, as the device is rotated 180 degrees. The angles may then be determined from the motor position.
- on-device system 110 can interface with speaker determination system 130.
- Speaker determination system 130 can be configured to associate, based on the estimated directions of the two audio sources, respective portions of a speech-to- text transcript of the conversation with the respective participants.
- FIG. 4B illustrates an example flowchart for speaker identification, in accordance with example embodiments. An example flow for detecting a change in participant is described. Block 410 involves identifying if participants are located in different spatial directions. This can be done by determining whether the estimated audio sources are spatially separated.
- a spatial threshold can be used to determine whether the estimated audio sources represent different participants. Upon a determination that the estimated audio sources are spatially separated by more than the spatial threshold, the process proceeds to block 425.
- Block 425 involves inferring that the estimated audio sources represent separate participants, thereby implying a change in participant.
- Upon a determination that the estimated audio sources are not spatially separated by more than the spatial threshold, the process proceeds to block 415. Block 415 involves determining whether the audio signals from the estimated audio sources correspond to different acoustic characteristics. For example, this may involve determining whether acoustic characteristics for the estimated audio sources differ by more than an acoustic threshold.
- Upon a determination that the acoustic characteristics differ by more than the acoustic threshold, the process proceeds to block 425, which involves inferring that the estimated audio sources represent separate participants, thereby implying a change in participant.
- Upon a determination that the acoustic characteristics fail to differ by more than the acoustic threshold, the process proceeds to block 420.
- Block 420 involves determining whether a sum of the estimated spatial separation and an acoustic separation (e.g., a difference of the acoustic characteristics from the audio sources) exceeds a cumulative threshold. Upon a determination that the sum of the estimated spatial separation and the acoustic separation exceeds the cumulative threshold, the process proceeds to block 425, which involves inferring that the estimated audio sources represent separate participants, thereby implying a change in participant.
- Finally, upon a determination that the sum of the estimated spatial separation and the acoustic separation fails to exceed the cumulative threshold, the process proceeds to block 430. Block 430 involves inferring that the estimated audio sources do not represent separate participants, thereby implying no change in participants.
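- A compact way to express the decision flow of blocks 410 through 430 is sketched below; the threshold values and the normalization used to combine the spatial and acoustic separations are illustrative assumptions, not values from the disclosure.

```python
# Illustrative participant-change test combining spatial and acoustic separation.
# Threshold values and the normalization are placeholder assumptions, not disclosed values.
SPATIAL_THRESHOLD_DEG = 30.0     # blocks 410/425: minimum angular separation on its own
ACOUSTIC_THRESHOLD = 0.6         # blocks 415/425: minimum acoustic (embedding) distance on its own
CUMULATIVE_THRESHOLD = 1.2       # block 420: combined-evidence threshold

def participant_changed(spatial_sep_deg: float, acoustic_sep: float) -> bool:
    """Return True when the evidence implies a change in participant (block 425), else False (block 430)."""
    if spatial_sep_deg > SPATIAL_THRESHOLD_DEG:      # sources clearly in different directions
        return True
    if acoustic_sep > ACOUSTIC_THRESHOLD:            # voices clearly sound different
        return True
    # Block 420: neither cue is decisive on its own, so weigh them together.
    # (Normalizing makes degrees and embedding distance comparable before summing.)
    combined = spatial_sep_deg / SPATIAL_THRESHOLD_DEG + acoustic_sep / ACOUSTIC_THRESHOLD
    return combined > CUMULATIVE_THRESHOLD

# Usage: a modest angular gap plus a modest acoustic gap can still indicate a new speaker.
print(participant_changed(spatial_sep_deg=20.0, acoustic_sep=0.4))   # True (0.67 + 0.67 > 1.2)
```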
- inferring that a change in participant has occurred may trigger on-device system 110 to identify the participant.
- prior information about user location may be used to identify the participant. For example, a display of the user locations (via an interactive graphical user interface) may be used to identify the participant.
- prior information about speech characteristics of a participant may be used to identify the participant. In case such individual methods fail to identify the participant, the two methods may be combined to identify the participant.
- the estimating of the respective directions for the two audio sources can involve synchronizing the speech-to-text transcript of the conversation with the synchronized first and second audio signals.
- a likely delay for respective transcript segments may be determined based on one or more of a histogram analysis or a kernel density estimate (KDE). Audio source estimation, and/or participant identification may be based on applying a threshold to the likely delay.
- the transcript may be synchronized with TDOA data. Accordingly, diarization can be based on the angle of arrival of speech, assuming that the speakers are located at different places.
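- One possible way to implement the segment-level association described above is sketched below: per-frame TDOA estimates are binned over each transcript segment's time span, the most populated bin gives the segment's likely delay, and a threshold on that delay assigns the segment to a participant. The data layout, function names, and numeric values are assumptions for illustration.

```python
# Illustrative association of transcript segments with participants from per-frame TDOA data.
# The data layout, names, and numeric values are assumptions for this sketch.
import numpy as np

def likely_delay(frame_times_s: np.ndarray, frame_delays_s: np.ndarray,
                 seg_start_s: float, seg_end_s: float) -> float:
    """Histogram the delays observed while a transcript segment was spoken; return the densest bin center."""
    mask = (frame_times_s >= seg_start_s) & (frame_times_s < seg_end_s)
    if not np.any(mask):
        return 0.0
    counts, edges = np.histogram(frame_delays_s[mask], bins=21)
    peak = int(np.argmax(counts))
    return 0.5 * (edges[peak] + edges[peak + 1])

def label_segment(delay_s: float, threshold_s: float = 0.0) -> str:
    """Threshold the likely delay: in a two-person setup, each participant sits on one side of the array."""
    return "A" if delay_s < threshold_s else "B"

# Usage with synthetic TDOA frames: participant A speaks for the first two seconds, then participant B.
t = np.arange(0.0, 4.0, 0.05)
delays = np.where(t < 2.0, -3e-4, 3e-4) + np.random.default_rng(3).normal(0.0, 5e-5, t.size)
print(label_segment(likely_delay(t, delays, 0.0, 2.0)),   # expected "A"
      label_segment(likely_delay(t, delays, 2.0, 4.0)))   # expected "B"
```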
- the aforementioned techniques allow flexibility in trading off the level of accuracy required of an acoustic model (for personalized speech characteristics) against the level of accuracy required of the spatial directions for participants. For example, when the participants in the group conversation are seated in different directions, less accurate acoustic methods may be used to distinguish them as different participants. Also, for example, when the participants sound very different (have different speech characteristics), a less accurate level of sound localization may suffice.
- quantifiable metrics may be determined to quantify accuracy of sound localization vis-a-vis identification of acoustic and/or participant models.
- sound localization can be used to identify an identity of a participant, and to select an acoustic model for participants that are located in directions corresponding to their sound sources.
- a user can identify the location and/or the identity of one or more of the participants in the conversation via a digital image that captures the relative positions of the different participants. For example, one or more cameras may be used to capture digital images of the participants in a meeting. Also, for example, the digital images may be analyzed to detect movements, and participant identification may be updated accordingly.
- a digital image of four persons seated around a circular table may be generated and displayed on a user’s device.
- the user may interact with the digital image and identify the other participants.
- the user may add nicknames to each of the participants.
- on-device system 110 may associate the estimated audio sources with the nicknames.
- the user may also orient a displayed image of the device (with the microphones) that may be placed on the circular table. This can reinforce the relative positions of the participants with reference to the microphones, resulting in better sound localization and tracking.
- the digital image may be an actual image or live video captured by one or more cameras.
- speaker determination system 130 can receive, from transcription system 140, a transcription of the audio signal.
- Transcription system 140 can be configured to recognize speech and convert it to text to generate a speech-to-text transcript.
- transcription system 140 can be a speech recognition system that is configured to recognize speech.
- transcription system 140 can reside on device 105.
- transcription system 140 can be an interface (e.g., an application programming interface) at device 105 for an input recognition system residing at a remote server.
- a trained machine learning model can perform speech recognition to generate a transcription of the speech.
- Some embodiments involve receiving, from an input recognition model, a transcription of the input.
- transcription system 140 such as, for example, a speech recognition system, can transcribe the speech input received by microphone(s) 115 and provide the transcription.
- speech recognition can generally refer to any process that recognizes audio input and converts the audio input to a textual format.
- microphone(s) 115 can receive the speech input in the form of human speech, and transcription system 140 can transcribe the human speech into text.
- transcription system 140 can include one or more machine learning models that are trained to recognize speech.
- Recognizing speech can be a challenging task.
- speech recognition can be especially challenging when a particular term is not recognized as a common term in a dictionary associated with a language.
- speech recognition systems can be trained to recognize different languages, accents, dialects, and so forth, such speech recognition systems may still be unable to recognize a word, or a phrase uttered by a user.
- speech recognition systems may be located at a server remote from the device, and various data processing restrictions may limit an appropriate personalization of such systems, and/or network limitations may limit access to such remote servers.
- speech recognition can be performed by a remote server that interfaces with device 105 by an API (e.g., transcription system 140).
- on-device system 110 can receive the audio signal and send the audio signal to the remote server.
- the remote server can apply a trained machine learning model to generate the transcription of the audio signal.
- on-device system 110 can receive the transcription from the remote server, and perform modifications of the transcription as needed.
- on-device system 110 can perform one or more operations, including scanning the transcription, filtering out common terms, performing contextual analysis, and so forth. On-device system 110 may then determine that the transcription is accurate (within appropriate thresholds of allowable accuracy), and no corrections are to be made. In some embodiments, on-device system 110 may identify one or more candidate terms that are likely to have been mistranscribed.
- In some embodiments, accuracy of transcription of speech can be enhanced based on acoustic models that are personalized to a specific user. Accordingly, on-device system 110 can include speaker-based acoustic model 135 that can correct the transcription of the input by transcription system 140.
- speaker-based acoustic model 135 can be trained to learn speech characteristics of a particular user.
- speech input can be received in the form of an audio signal.
- a feature extraction system can be configured to extract one or more features of the audio signal.
- Speaker-based acoustic model 135 can be trained to associate relationships between audio signal and phonemes or other linguistic properties that form speech audio.
- speaker-based acoustic model 135 can be trained to identify and associate certain received utterances that exhibit acoustical characteristics that align with the acoustics associated with a spoken word or phrase.
- speaker-based acoustic model 135 can be configured to use a language model.
- the language model can specify or identify certain word combinations or sequences.
- the language model can be configured to generate a word sequence probability factor which can be used to indicate a likely occurrence or existence of particular word sequences or word combinations.
- the identified word sequences may correspond primarily to sequences that are specific to a speech corpus.
- Transcription system 140 can be configured to receive input from speaker-based acoustic model 135 and the speaker determination system 130 to generate a transcript of the conversation.
- transcription system 140 can be configured to include speech recognition logic, programmed instructions, and/or algorithms that are executed by one or more processors to transcribe the conversation.
- transcription system 140 can execute program code to manage identification, extraction, and analysis of characteristics of the received audio signal. Further, transcription system 140 can execute comparator logic to compare characteristics of the received audio signal to various model parameters stored in speaker-based acoustic model 135 and the language model. Results of the comparison can yield text transcription outputs that correspond substantially to the conversation.
- In some embodiments, transcription system 140 can be configured to generate a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript with each of the distinct participants, and identifies a change in participant for accurate labeling. In some embodiments, on-device system 110 is configured to display the modified speech-to-text transcript, for example, in substantial real-time, contemporaneously with the conversation.
- Although speaker-based acoustic model 135 resides on device 105, one or more of the other components may also reside on device 105, at a remote server, or both.
- transcription system 140 may reside on device 105.
- transcription system 140 may reside at a remote server (e.g., a cloud server).
- certain portions of transcription system 140 may reside on device 105, whereas other portions can reside at the remote server.
- the speaker-based acoustic model 135 may be stored in a local repository of device 105.
- on-device system 110 can receive the audio signal, and send the audio signal to a remote server for processing. Feature extraction can be performed by the remote server. Also, for example, at the remote server, transcription system 140 can utilize a language model to generate a transcription based on the audio signal, and send the transcription to device 105. On-device system 110 may then identify the participant based on estimated audio sources, and may generate a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
- a speaker-based acoustic model 135 may be utilized to further modify the transcript based on speech characteristics of a user.
- Variations in user accent can be a source of errors in automatic speech recognition.
- a geographical or regional accent of a user can sometimes be determined by a user’s location information
- a user’s ethnicity can be indicative of ethnic dialects (e.g., ethnolect), variations in spellings, and so forth.
- ethnic dialects can indicate an influence from a first language of the user, among other things.
- Some variational factors may also be inferred from user information.
- preferences provided by the user can be used.
- speech characteristics may be determined by generating vector embeddings of the speech input (e.g. d-vector/speaker embeddings), followed by clustering methods or principal components analysis.
- the acoustic model may be combined with sound localization to improve detection of a change in participant, and/or identification of a participant.
- speech characteristics can be combined with sound localization by identifying audio sources and then applying the speech characteristics.
- One or more deep neural networks may be trained to predict the speech characteristics based on sample training speech data.
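- As a rough illustration of combining the two cues, the sketch below clusters placeholder speaker embeddings (standing in for d-vectors) with k-means and then attaches a median azimuth to each cluster; the embedding dimension, cluster count, and synthetic data are assumptions, and scikit-learn's KMeans is used purely as an example clusterer.

```python
# Illustrative combination of speaker embeddings and sound localization.
# The embeddings below are random stand-ins for d-vectors, the two-cluster assumption reflects a
# two-participant conversation, and scikit-learn's KMeans is used purely as an example clusterer.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# One embedding and one azimuth estimate per speech segment (synthetic placeholder data).
embeddings = np.vstack([rng.normal(0.0, 1.0, (5, 16)),    # segments from one voice
                        rng.normal(3.0, 1.0, (5, 16))])   # segments from a different voice
azimuths_deg = np.array([20, 22, 18, 21, 19, 200, 205, 198, 202, 199], dtype=float)

# Step 1: acoustic clustering of the embeddings (speech-characteristics diarization).
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# Step 2: attach a direction to each cluster via the median azimuth of its segments, so that each
# participant ends up associated with both an acoustic signature and a spatial direction.
for cid in np.unique(cluster_ids):
    direction = float(np.median(azimuths_deg[cluster_ids == cid]))
    print(f"participant cluster {cid}: approx. azimuth {direction:.0f} degrees")
```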
- a user may be provided with controls allowing the user to make an election as to both whether and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user’s ethnicity, gender, social network, social contacts, or activities, a user’s preferences, or a user’s current location, and so forth), and whether the user is sent content or communications from a server.
- certain data may be treated in one or more ways before it is stored or used, so that personal data is removed, secured, encrypted, and so forth.
- a user’s identity may be treated so that no user data can be determined for the user, or a user’s geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined.
- the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.
- In situations in which user information is used for speech recognition, such user information is restricted to the user’s device, and is not shared with a server and/or with other devices.
- the user may have an ability to delete or modify any user information.
- two categories of participants may be identified during a conversation.
- a first category may include a first user who has non-standard speech, and is proximate to device 105.
- a second category may include other participants located further away from device 105 (relative to the first user).
- device 105 may be equipped with microphones at the top and the bottom (e.g., see Figure 2). Based on beamforming techniques and cross-correlation between the audio inputs from the two microphones, device 105 may identify when the first user is speaking, and when one of the other participants may be speaking. Accordingly, display of the transcript may be split into two screens, one screen displaying the portions of the transcript attributed to the first user, and a second screen displaying portions of the transcript attributed to the other participants.
- top and bottom microphones of a smart device may be used to distinguish a user of the device from other participants located in front of the user.
- Display transcription may be provided on a split-screen and the text may be displayed with different characteristics (font, color, size, highlighted, etc.) based on sound localization and identification of the participants.
- the split-screen may be configured based on the reader of the relevant portion of the transcript.
- the first user who has non-standard speech may be provided with a view of the transcript corresponding to what the other participants are saying, whereas the other participants may be provided with a view of the transcript corresponding to what the first user is saying.
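- A minimal sketch of this split-screen routing follows; the angular band used to decide that speech came from the device's user, along with all names, is an illustrative assumption.

```python
# Illustrative routing of transcript lines to the two split-screen portions.
# The angular band treated as "speech from the device's user" is an assumed example value.
from typing import List, Tuple

NEAR_USER_BAND_DEG = (150.0, 210.0)   # assumed: directions corresponding to the user holding the device

def route_lines(lines: List[Tuple[str, float]]) -> Tuple[List[str], List[str]]:
    """Split (text, azimuth) pairs into the portion the user reads and the portion the others read."""
    shown_to_user, shown_to_others = [], []
    for text, azimuth_deg in lines:
        if NEAR_USER_BAND_DEG[0] <= azimuth_deg <= NEAR_USER_BAND_DEG[1]:
            shown_to_others.append(text)   # the user spoke; the other participants read it
        else:
            shown_to_user.append(text)     # another participant spoke; the user reads it
    return shown_to_user, shown_to_others

# Usage: one line arriving from the far side of the device, one from the user's side.
user_view, others_view = route_lines([("How was the trip to Boulder?", 10.0),
                                      ("It was phenomenal.", 180.0)])
print(user_view, others_view)
```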
- device 505 is shown with a single display screen 505A.
- participants A and B are shown located at opposite ends of device 510.
- Participant A may say, “How was the trip to Boulder?”
- single display screen 505A is now displayed in a split-screen format comprising a first portion 510A and a second portion 510B divided by divider 510C.
- Speech uttered by participant A is displayed on the second portion 510B. This provides participant B with a view of the transcript corresponding to what participant A is saying.
- participant B responds by saying, “It was phenomenal. We took the roads through the Rockies and did some hiking.”
- a first portion 515A of the split-screen format displays a transcription of the speech uttered by participant B. This provides participant A with a view of the transcript corresponding to what participant B is saying.
- the second portion 515B of the split-screen format is not updated as participant A has not spoken.
- the first portion 515A and the second portion 515B are divided by divider 515C.
- the second portion 520B of the split-screen format is not updated as participant A has not spoken.
- the first portion 520A and the second portion 520B are divided by divider 520C.
- the first portion 520A continues to be contemporaneously updated.
- participant B may say, “The town was strange too.”
- the first portion 520A is updated with a transcription of what participant B says.
- divider 520C may be adjusted dynamically (e.g., moved vertically down) to allocate more space to the first portion 520A.
- divider 520C may be adjusted dynamically (e.g., moved vertically up) to allocate more space to the second portion 520B.
- speaker-based acoustic model 135 (of Figure 1) may be customized for participant B to suitably modify the transcript to accurately transcribe the speech uttered by participant B.
- participants A and B may be located at a public place with several background speakers. In such circumstances, participant B, who is associated with the mobile device, may be able to identify participant A as a participant in the conversation.
- on-device system 110 may then transcribe the speech uttered by participant A and participant B and ignore other speech.
- participant B may be provided, via an interactive graphical user interface of the device, a digital rendering of the participants indicating a relative physical location of the participants based on the respective directions for the two audio sources.
- the digital rendering may capture individuals located at a meeting, and participant B may be able to identify participants in the meeting, and/or identify who the relevant participants are.
- on-device system 110 may utilize a facial recognition algorithm to identify faces in a digitally captured image, and participant B may identify the individuals that are participants in the conversation.
- FIG.6 shows diagram 600 illustrating a training phase 602 and an inference phase 604 of trained machine learning model(s) 632, in accordance with example embodiments.
- Some machine learning techniques involve training one or more machine learning algorithms on an input set of training data to recognize patterns in the training data and provide output inferences and/or predictions about (patterns in the) training data.
- the resulting trained machine learning algorithm can be termed as a trained machine learning model.
- FIG.6 shows training phase 602 where machine learning algorithm(s) 620 are being trained on training data 610 to become trained machine learning model(s) 632.
- trained machine learning model(s) 632 can receive input data 630 and one or more inference/prediction requests 640 (perhaps as part of input data 630) and responsively provide as an output one or more inferences and/or prediction(s) 650.
- trained machine learning model(s) 632 can include one or more models of machine learning algorithm(s) 620.
- Machine learning algorithm(s) 620 may include, but are not limited to: an artificial neural network (e.g., a herein-described convolutional neural network), a recurrent neural network, a Bayesian network, a hidden Markov model, a Markov decision process, a logistic regression function, a support vector machine, a suitable statistical machine learning algorithm, and/or a heuristic machine learning system.
- Machine learning algorithm(s) 620 may be supervised or unsupervised, and may implement any suitable combination of online and offline learning.
- machine learning algorithm(s) 620 and/or trained machine learning model(s) 632 can be accelerated using on-device coprocessors, such as graphic processing units (GPUs), tensor processing units (TPUs), digital signal processors (DSPs), and/or application specific integrated circuits (ASICs).
- on-device coprocessors can be used to speed up machine learning algorithm(s) 620 and/or trained machine learning model(s) 632.
- trained machine learning model(s) 632 can be trained, reside and execute to provide inferences on a particular computing device, and/or otherwise can make inferences for the particular computing device.
- machine learning algorithm(s) 620 can be trained by providing at least training data 610 as training input using unsupervised, supervised, semi-supervised, and/or reinforcement learning techniques.
- Unsupervised learning involves providing a portion (or all) of training data 610 to machine learning algorithm(s) 620 and machine learning algorithm(s) 620 determining one or more output inferences based on the provided portion (or all) of training data 610.
- Supervised learning involves providing a portion of training data 610 to machine learning algorithm(s) 620, with machine learning algorithm(s) 620 determining one or more output inferences based on the provided portion of training data 610, and the output inference(s) are either accepted or corrected based on correct results associated with training data 610.
- supervised learning of machine learning algorithm(s) 620 can be governed by a set of rules and/or a set of labels for the training input, and the set of rules and/or set of labels may be used to correct inferences of machine learning algorithm(s) 620.
- Semi-supervised learning involves having correct results for part, but not all, of training data 610. During semi-supervised learning, supervised learning is used for a portion of training data 610 having correct results, and unsupervised learning is used for a portion of training data 610 not having correct results.
- Reinforcement learning involves machine learning algorithm(s) 620 receiving a reward signal regarding a prior inference, where the reward signal can be a numerical value.
- machine learning algorithm(s) 620 can output an inference and receive a reward signal in response, where machine learning algorithm(s) 620 are configured to try to maximize the numerical value of the reward signal.
- reinforcement learning also utilizes a value function that provides a numerical value representing an expected total of the numerical values provided by the reward signal over time.
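- For illustration only, a minimal tabular Q-learning sketch of a reward signal driving updates to a value function; the state/action counts and constants are arbitrary assumptions:

```python
import numpy as np

# Tabular Q-learning: the algorithm outputs an action (an "inference"), receives a
# numerical reward signal, and updates a value function estimating expected total
# reward over time.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))            # value function (state-action values)
alpha, gamma = 0.1, 0.9                        # learning rate, discount factor

def q_update(state, action, reward, next_state):
    """One reward-driven update toward maximizing expected cumulative reward."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)
```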
- machine learning algorithm(s) 620 and/or trained machine learning model(s) 632 can be trained using other machine learning techniques, including but not limited to, incremental learning and curriculum learning.
- machine learning algorithm(s) 620 and/or trained machine learning model(s) 632 can use transfer learning techniques.
- transfer learning techniques can involve trained machine learning model(s) 632 being pre-trained on one set of data and additionally trained using training data 610.
- machine learning algorithm(s) 620 can be pre-trained on data from one or more computing devices and a resulting trained machine learning model provided to a particular computing device, where the particular computing device is intended to execute the trained machine learning model during inference phase 604. Then, during training phase 602, the pre-trained machine learning model can be additionally trained using training data 610, where training data 610 can be derived from kernel and non-kernel data of the particular computing device.
- This further training of machine learning algorithm(s) 620 and/or the pre-trained machine learning model using training data 610 derived from the particular computing device's data can be performed using either supervised or unsupervised learning.
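- As an illustrative sketch of such transfer learning (a toy model, not the actual trained machine learning model(s) 632), assuming PyTorch; the checkpoint path and layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

# A feature extractor pre-trained elsewhere (e.g., on data from many devices).
feature_extractor = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 32))
# feature_extractor.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint
for p in feature_extractor.parameters():
    p.requires_grad = False                    # freeze the pre-trained weights

# A new task-specific head is additionally trained on the particular device's data.
head = nn.Linear(32, 2)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 40)                        # placeholder on-device features
y = torch.randint(0, 2, (16,))                 # placeholder on-device labels
for _ in range(10):                            # brief additional training
    optimizer.zero_grad()
    loss = loss_fn(head(feature_extractor(x)), y)
    loss.backward()
    optimizer.step()
```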
- training phase 602 can be completed.
- the resulting trained machine learning model can be utilized as at least one of trained machine learning model(s) 632.
- trained machine learning model(s) 632 can be provided to a computing device, if not already on the computing device.
- Inference phase 604 can begin after trained machine learning model(s) 632 are provided to the particular computing device.
- trained machine learning model(s) 632 can receive input data 630 and generate and output one or more corresponding inferences and/or prediction(s) 650 about input data 630.
- input data 630 can be used as an input to trained machine learning model(s) 632 for providing corresponding inference(s) and/or prediction(s) 650 to kernel components and non-kernel components.
- trained machine learning model(s) 632 can generate inference(s) and/or prediction(s) 650 in response to one or more inference/prediction requests 640.
- trained machine learning model(s) 632 can be executed by a portion of other software.
- trained machine learning model(s) 632 can be executed by an inference or prediction daemon to be readily available to provide inferences and/or predictions upon request.
- Input data 630 can include data from the particular computing device executing trained machine learning model(s) 632 and/or input data from one or more computing devices other than the particular computing device.
- Input data 630 can include a non-common term. Other types of input data are possible as well.
- Inference(s) and/or prediction(s) 650 can include one or more alternative terms that represent phonetically similar alternatives of pronouncing a non-common term.
- Inference(s) and/or prediction(s) 650 can include other output data produced by trained machine learning model(s) 632 operating on input data 630 (and training data 610).
- trained machine learning model(s) 632 can use output inference(s) and/or prediction(s) 650 as input feedback 660.
- Trained machine learning model(s) 632 can also rely on past inferences as inputs for generating new inferences.
- Convolutional neural networks and/or deep neural networks used herein can be an example of machine learning algorithm(s) 620. After training, the trained version of a convolutional neural network can be an example of trained machine learning model(s) 632.
- FIG. 7 depicts a distributed computing architecture 700, in accordance with example embodiments.
- Distributed computing architecture 700 includes server devices 708, 710 that are configured to communicate, via network 706, with programmable devices 704a, 704b, 704c, 704d, 704e.
- Network 706 may correspond to a local area network (LAN), a wide area network (WAN), a WLAN, a WWAN, a corporate intranet, the public Internet, or any other type of network configured to provide a communications path between networked computing devices.
- Network 706 may also correspond to a combination of one or more LANs, WANs, corporate intranets, and/or the public Internet.
- although FIG. 7 only shows five programmable devices, distributed application architectures may serve tens, hundreds, or thousands of programmable devices.
- programmable devices 704a, 704b, 704c, 704d, 704e may be any sort of computing device, such as a mobile computing device, desktop computer, wearable computing device, head-mountable device (HMD), network terminal, and so on.
- programmable devices 704a, 704b, 704c, 704e can be directly connected to network 706.
- programmable device 704d can be indirectly connected to network 706 via an associated computing device, such as programmable device 704c.
- programmable device 704c can act as an associated computing device to pass electronic communications between programmable device 704d and network 706.
- a computing device can be part of and/or inside a vehicle, such as a car, a truck, a bus, a boat or ship, an airplane, etc.
- a programmable device can be both directly and indirectly connected to network 706.
- Server devices 708, 710 can be configured to perform one or more services, as requested by programmable devices 704a-704e.
- server device 708 and/or 710 can provide content to programmable devices 704a-704e.
- the content can include, but is not limited to, web pages, hypertext, scripts, binary data such as compiled software, images, audio, and/or video.
- the content can include compressed and/or uncompressed content.
- the content can be encrypted and/or unencrypted.
- server devices 708 and/or 710 can provide programmable devices 704a-704e with access to software for database, search, computation, graphical, audio, video, World Wide Web/Internet utilization, and/or other functions. Many other examples of server devices are possible as well.
- FIG.8 is a block diagram of an example computing device 800, in accordance with example embodiments.
- computing device 800 shown in FIG.8 can be configured to perform at least one function of and/or related to speaker identification based on sound localization as described herein, including in method 900, and/or method 1000.
- Computing device 800 may include a user interface module 801, a network communications module 802, one or more processors 803, data storage 804, one or more camera(s) 818, one or more sensors 820, and power system 822, all of which may be linked together via a system bus, network, or other connection mechanism 805.
- User interface module 801 can be operable to send data to and/or receive data from external user input/output devices.
- user interface module 801 can be configured to send and/or receive data to and/or from user input devices such as a touch screen, a computer mouse, a keyboard, a keypad, a touch pad, a trackball, a joystick, a voice recognition module, and/or other similar devices.
- User interface module 801 can also be configured to provide output to user display devices, such as one or more cathode ray tubes (CRT), liquid crystal displays, light emitting diodes (LEDs), displays using digital light processing (DLP) technology, printers, light bulbs, and/or other similar devices, either now known or later developed.
- User interface module 801 can also be configured to generate audible outputs, with devices such as a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices. User interface module 801 can further be configured with one or more haptic devices that can generate haptic outputs, such as vibrations and/or other outputs detectable by touch and/or physical contact with computing device 800. In some examples, user interface module 801 can be used to provide a graphical user interface (GUI) for utilizing computing device 800. The GUI may display a transcript, a digital rendering of participants in a conversation, and receive user inputs.
- Network communications module 802 can include one or more devices that provide wireless interface(s) 807 and/or wireline interface(s) 808 that are configurable to communicate via a network.
- Wireless interface(s) 807 can include one or more wireless transmitters, receivers, and/or transceivers, such as a Bluetooth™ transceiver, a Zigbee® transceiver, a Wi-Fi™ transceiver, a WiMAX™ transceiver, an LTE™ transceiver, and/or other type of wireless transceiver configurable to communicate via a wireless network.
- Wireline interface(s) 808 can include one or more wireline transmitters, receivers, and/or transceivers, such as an Ethernet transceiver, a Universal Serial Bus (USB) transceiver, or similar transceiver configurable to communicate via a twisted pair wire, a coaxial cable, a fiber-optic link, or a similar physical connection to a wireline network.
- network communications module 802 can be configured to provide reliable, secured, and/or authenticated communications.
- information for facilitating reliable communications can be provided, perhaps as part of a message header and/or footer (e.g., packet/message sequencing information, encapsulation headers and/or footers, size/time information, and transmission verification information such as cyclic redundancy check (CRC) and/or parity check values).
- Communications can be made secure (e.g., be encoded or encrypted) and/or decrypted/decoded using one or more cryptographic protocols and/or algorithms, such as, but not limited to, Data Encryption Standard (DES), Advanced Encryption Standard (AES), a Rivest-Shamir-Adelman (RSA) algorithm, a Diffie-Hellman algorithm, a secure sockets protocol such as Secure Sockets Layer (SSL) or Transport Layer Security (TLS), and/or Digital Signature Algorithm (DSA).
- Other cryptographic protocols and/or algorithms can be used as well or in addition to those listed herein to secure (and then decrypt/decode) communications.
- One or more processors 803 can include one or more general purpose processors, and/or one or more special purpose processors (e.g., digital signal processors, tensor processing units (TPUs), graphics processing units (GPUs), application specific integrated circuits, etc.).
- One or more processors 803 can be configured to execute computer-readable instructions 806 that are contained in data storage 804 and/or other instructions as described herein.
- Data storage 804 can include one or more non-transitory computer-readable storage media that can be read and/or accessed by at least one of one or more processors 803.
- the one or more computer-readable storage media can include volatile and/or non-volatile storage components, such as optical, magnetic, organic or other memory or disc storage, which can be integrated in whole or in part with at least one of one or more processors 803.
- data storage 804 can be implemented using a single physical device (e.g., one optical, magnetic, organic or other memory or disc storage unit), while in other examples, data storage 804 can be implemented using two or more physical devices.
- Data storage 804 can include computer-readable instructions 806 and perhaps additional data.
- data storage 804 can include storage required to perform at least part of the herein-described methods, scenarios, and techniques and/or at least part of the functionality of the herein-described devices and networks.
- data storage 804 can include storage for a trained neural network model 812 (e.g., a model of trained convolutional neural networks).
- computer-readable instructions 806 can include instructions that, when executed by the one or more processors 803, enable computing device 800 to provide for some or all of the functionality of trained neural network model 812.
- computing device 800 can include camera(s) 818.
- Camera(s) 818 can include one or more image capture devices, such as still and/or video cameras, equipped to capture light and record the captured light in one or more images; that is, camera(s) 818 can generate image(s) of captured light.
- the one or more images can be one or more still images and/or one or more images utilized in video imagery.
- Camera(s) 818 can capture light and/or electromagnetic radiation emitted as visible light, infrared radiation, ultraviolet light, and/or as one or more other frequencies of light.
- camera(s) 818 can be used to capture digital views of participants in a conversation for purposes of rendering the participants via a display, and/or for purposes of applying facial recognition algorithms to identify a participant, based on prior authorization from the participant.
- computing device 800 can include one or more sensors 820. Sensors 820 can be configured to measure conditions within computing device 800 and/or conditions in an environment of computing device 800 and provide data about these conditions.
- sensors 820 can include one or more of: (i) sensors for obtaining data about computing device 800, such as, but not limited to, a thermometer for measuring a temperature of computing device 800, a battery sensor for measuring power of one or more batteries of power system 822, and/or other sensors measuring conditions of computing device 800; (ii) an identification sensor to identify other objects and/or devices, such as, but not limited to, a Radio Frequency Identification (RFID) reader, proximity sensor, one-dimensional barcode reader, two-dimensional barcode (e.g., Quick Response (QR) code) reader, and a laser tracker, where the identification sensors can be configured to read identifiers, such as RFID tags, barcodes, QR codes, and/or other devices and/or objects configured to be read and provide at least identifying information; (iii) sensors to measure locations and/or movements of computing device 800, such as, but not limited to, a tilt sensor, a gyroscope, an accelerometer, a Doppler sensor, and/or a GPS device; (iv) sensors to measure conditions in an environment of computing device 800, such as, but not limited to, an infrared sensor, an optical sensor, a light sensor, a biosensor, a capacitive sensor, a touch sensor, a temperature sensor, a wireless sensor, a radio sensor, a movement sensor, a microphone, a sound sensor, an ultrasound sensor, and/or a smoke sensor; and/or (v) a force sensor to measure one or more forces (e.g., inertial forces and/or G-forces) acting about computing device 800, such as, but not limited to, one or more sensors that measure: forces in one or more dimensions, torque, ground force, friction, and/or a zero moment point (ZMP) sensor that identifies ZMPs and/or locations of the ZMPs.
- Power system 822 can include one or more batteries 824 and/or one or more external power interfaces 826 for providing electrical power to computing device 800.
- Each battery of the one or more batteries 824 can, when electrically coupled to the computing device 800, act as a source of stored electrical power for computing device 800.
- One or more batteries 824 of power system 822 can be configured to be portable. Some or all of one or more batteries 824 can be readily removable from computing device 800. In other examples, some or all of one or more batteries 824 can be internal to computing device 800, and so may not be readily removable from computing device 800. Some or all of one or more batteries 824 can be rechargeable.
- a rechargeable battery can be recharged via a wired connection between the battery and another power supply, such as by one or more power supplies that are external to computing device 800 and connected to computing device 800 via the one or more external power interfaces.
- some or all of one or more batteries 824 can be non-rechargeable batteries.
- One or more external power interfaces 826 of power system 822 can include one or more wired-power interfaces, such as a USB cable and/or a power cord, that enable wired electrical power connections to one or more power supplies that are external to computing device 800.
- One or more external power interfaces 826 can include one or more wireless power interfaces, such as a Qi wireless charger, that enable wireless electrical power connections to one or more external power supplies. Once an electrical power connection is established to an external power source using one or more external power interfaces 826, computing device 800 can draw electrical power from the external power source over the established electrical power connection.
- power system 822 can include related sensors, such as battery sensors associated with one or more batteries or other types of electrical power sensors.
- FIG.9 is a flowchart of a method 900, in accordance with example embodiments. Method 900 can be executed by a computing device, such as computing device 800.
- Method 900 can begin at block 910, where the method involves receiving, by a computing device, first and second audio signals from respective first and second audio input devices, wherein the first and second audio signals correspond to speech input of a conversation between two participants.
- At block 920, the method further involves estimating, by the computing device and based on the first and second audio signals, a time delay in respective arrival times for the speech input at the first and second audio input devices.
- At block 930, the method also involves estimating, by the computing device, respective directions for two audio sources based on the estimated time delay in the respective arrival times.
- the method additionally involves associating, based on the estimated directions of the two audio sources, respective portions of a speech-to-text transcript of the conversation with the respective participants.
- the method further involves displaying, by the computing device and based on the associating of the respective portions, a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
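- As a minimal illustrative sketch (not the claimed display logic), the labeling step could render per-segment speaker assignments into a modified transcript; the data shapes and names here are assumptions:

```python
def label_transcript(segments, speaker_names):
    """Render a labeled transcript from (speaker_id, text) segments.

    `speaker_names` maps a speaker_id (e.g., a direction-derived cluster label)
    to a display name, for example provided via user input."""
    lines = []
    for speaker_id, text in segments:
        name = speaker_names.get(speaker_id, "Unknown speaker")
        lines.append(f"{name}: {text}")
    return "\n".join(lines)

# Example usage with two direction-derived speaker labels.
print(label_transcript([(0, "Hi, how are you?"), (1, "Doing well, thanks.")],
                       {0: "Participant A", 1: "Participant B"}))
```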
- the displaying of the modified speech-to-text transcript involves displaying the modified speech-to-text transcript in a split-screen format comprising a respective window for each of the respective participants, wherein each respective window displays text corresponding to the respective participant contemporaneously as the respective participant speaks.
- Some embodiments also involve displaying a digital rendering of the two participants indicating a relative physical location of the two participants based on the respective directions for the two audio sources. Such embodiments may involve receiving, at the computing device, user input indicating an identity of the two participants, where the digital rendering comprises the identity of the two participants. Such embodiments may involve detecting a movement of one or both of the two participants. Such embodiments may also involve adjusting the digital rendering based on the detected movement. In some embodiments, the displaying of the digital rendering may be based on a digital image of the two participants as captured by a camera.
- In some embodiments, the computing device may be associated with a first participant of the two participants.
- Such embodiments involve applying, by the computing device, an acoustic model associated with the first participant, where the first participant is proximate to the computing device.
- the modified transcript may be based on the acoustic model.
- the acoustic model may be stored in a local repository of the computing device. Such embodiments involve restricting access to contents of the local repository to within the computing device.
- the conversation comprises one or more additional speakers. Such embodiments involve receiving user indication that the one or more additional speakers are not participants of the conversation.
- the speech-to-text transcript does not transcribe speech input from the one or more additional speakers.
- the first and second audio signals are synchronized.
- the estimating of the time delay in respective arrival times involves applying generalized cross-correlation with phase transform (GCC-PHAT) to the synchronized first and second audio signals.
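- For illustration only, a minimal NumPy sketch of GCC-PHAT delay estimation between two synchronized signals; the function and parameter names are assumptions, not the claimed implementation:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=1):
    """Estimate the delay (in seconds) of `sig` relative to `ref` using
    generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = sig.shape[0] + ref.shape[0]             # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15                      # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)
```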
- the estimating of the respective directions for two audio sources involves synchronizing the speech-to-text transcript of the conversation with the synchronized first and second audio signals.
- Such embodiments also involve determining sample delays for respective transcript segments based on the synchronizing of the speech-to-text transcript.
- the determining of the sample delays for respective transcript segments involves determining a likely delay based on one or more of a histogram analysis or a kernel density estimate (KDE).
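- As a hedged sketch of that statistic (assuming SciPy; the grid size and fallback are arbitrary choices), the most likely per-segment delay could be taken as the peak of a kernel density estimate over the per-frame delay estimates:

```python
import numpy as np
from scipy.stats import gaussian_kde

def most_likely_delay(segment_delays, bandwidth=None):
    """Return the most likely delay for a transcript segment from noisy
    per-frame delay estimates, using a kernel density estimate (KDE)."""
    delays = np.asarray(segment_delays, dtype=float)
    if delays.size < 3 or np.allclose(delays, delays[0]):
        return float(np.median(delays))         # KDE needs spread; fall back to the median
    kde = gaussian_kde(delays, bw_method=bandwidth)
    grid = np.linspace(delays.min(), delays.max(), 512)
    return float(grid[np.argmax(kde(grid))])    # a histogram mode would work similarly
```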
- the estimating of the respective directions for two audio sources involves applying a threshold to the determined likely delay.
- In some embodiments, the estimating of the respective directions for the two audio sources involves correlating the time delay in respective arrival times with relative locations of the first and second audio input devices.
- Some embodiments involve receiving, from an automatic speech recognition model, the speech-to-text transcript of the conversation.
- In some embodiments, the conversation may include a third participant. In such embodiments, the estimating of the time delay involves estimating, by the computing device, respective directions for three audio sources based on the estimated time delay in the respective arrival times.
- Some embodiments also involve associating, based on the estimated directions of the three audio sources, the respective portions of the speech-to-text transcript of the conversation with each of the three participants.
- the modified speech-to-text transcript of the conversation may include labels identifying each of the three participants.
- Some embodiments involve receiving speech input at a third audio input device.
- the estimating of the time delay in respective arrival times comprises estimating pairwise time delays for respective audio signals from each pair of the three audio input devices.
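- A minimal sketch of such pairwise estimation (assuming synchronized, equal-length signals and a delay estimator like the gcc_phat sketch above; the names are illustrative):

```python
from itertools import combinations

def pairwise_delays(mic_signals, fs, delay_fn):
    """Estimate a time delay for every pair of microphones.

    `mic_signals` maps a microphone id to its synchronized 1-D signal array, and
    `delay_fn(sig, ref, fs)` returns a delay in seconds (e.g., GCC-PHAT)."""
    delays = {}
    for (id_a, sig_a), (id_b, sig_b) in combinations(mic_signals.items(), 2):
        delays[(id_a, id_b)] = delay_fn(sig_a, sig_b, fs)
    return delays
```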
- Some embodiments involve selecting, for the respective participants corresponding to the estimated directions for the two audio sources, a respective acoustic model indicative of speech characteristics of a respective participant.
- the modified speech-to-text transcript of the conversation may be based on the selected acoustic models.
- the selecting of the respective acoustic model may involve generating a d-vector embedding for the speech input.
- Such embodiments also involve classifying the two participants based on the embedding, wherein the classifying comprises one of a clustering technique or a principal component analysis.
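- As an illustrative sketch only (assuming scikit-learn, and assuming per-segment d-vector embeddings are produced by a separate speaker-embedding model), segments could be grouped by speaker with PCA followed by clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_speakers(dvectors, n_speakers=2):
    """Group per-segment speaker embeddings (d-vectors) into `n_speakers` clusters.

    `dvectors` has shape (n_segments, embedding_dim); returns one label per segment."""
    n_components = min(10, dvectors.shape[0], dvectors.shape[1])
    reduced = PCA(n_components=n_components).fit_transform(dvectors)
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(reduced)
```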
- FIG.10 is a flowchart of a method 1000, in accordance with example embodiments.
- Method 1000 can be executed by a computing device, such as computing device 800.
- Method 1000 can begin at block 1010, where the method involves collecting synchronized audio signals from a plurality of microphones in a fixed buffer.
- the method further involves applying generalized cross-correlation with phase transform (GCC-PHAT) to each pair of audio signals (corresponding to a respective pair of microphones) to determine respective sample delays between the audio buffers.
- the method also involves collecting sample delays for transcript segments, by synchronization with the speech-to-text.
- the method additionally involves running statistics on a transcript segment, such as histogram or kernel density estimate (KDE), to find a most likely delay.
- the method additionally involves correlating the speakers with an angle of arrival (e.g., based on TDOA) of the audio.
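- For illustration only, mapping a per-segment delay to an angle of arrival for a two-microphone pair and thresholding it into a speaker label; the speed of sound, threshold, and names are assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assuming room temperature

def tdoa_to_angle(tau, mic_distance):
    """Convert a time difference of arrival (seconds) into an angle of arrival
    (degrees) for two microphones separated by `mic_distance` meters."""
    arg = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(arg)))

def assign_speaker(tau, threshold_s=5e-5):
    """Attribute a transcript segment to one of two sources by thresholding the delay."""
    if tau > threshold_s:
        return "participant_A"
    if tau < -threshold_s:
        return "participant_B"
    return "unknown"
```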
- each block and/or communication may represent a processing of information and/or a transmission of information in accordance with example embodiments.
- Alternative embodiments are included within the scope of these example embodiments. In these alternative embodiments, for example, functions described as blocks, transmissions, communications, requests, responses, and/or messages may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
- a block that represents a processing of information may correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a block that represents a processing of information may correspond to a module, a segment, or a portion of program code (including related data).
- the program code may include one or more instructions executable by a processor for implementing specific logical functions or actions in the method or technique.
- the program code and/or related data may be stored on any type of computer readable medium such as a storage device including a disk or hard drive or other storage medium.
- the computer readable medium may also include non-transitory computer readable media that store data for short periods of time, like register memory, processor cache, and random access memory (RAM).
- the computer readable media may also include non-transitory computer readable media that store program code and/or data for longer periods of time, such as secondary or persistent long-term storage, like read only memory (ROM), optical or magnetic disks, or compact-disc read only memory (CD-ROM).
- the computer readable media may also be any other volatile or non- volatile storage systems.
- a computer readable medium may be considered a computer readable storage medium, for example, or a tangible storage device.
- a block that represents one or more information transmissions may correspond to information transmissions between software and/or hardware modules in the same physical device. However, other information transmissions may be between software modules and/or hardware modules in different physical devices.
- a user may be provided with controls allowing the user to make an election as to both if and when systems, programs, or features described herein may enable collection of user information (e.g., information about a user’s social network, social actions, or activities, profession, a user’s preferences, a user’s demographic information, a user’s current location, or other personal information), and if the user is sent content or communications from a server.
- certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed.
- a user’s identity may be treated so that no personally identifiable information can be determined for the user, or a user’s geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined.
- the user may have control over what information is collected about the user, how that information is used, and what information is provided to the user.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Otolaryngology (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
An example method includes receiving first and second audio signals from respective first and second audio input devices. The first and second audio signals correspond to speech input of a conversation between two participants. The method includes estimating, based on the first and second audio signals, a time delay in respective arrival times for the speech input at the first and second audio input devices. The method includes estimating respective directions for two audio sources based on the estimated time delay in the respective arrival times. The method includes associating, based on the estimated directions of the two audio sources, respective portions of a speech-to-text transcript of the conversation with the respective participants. The method includes displaying, based on the associating of the respective portions, a modified speech-to-text transcript of the conversation that labels the respective portions of the speech-to-text transcript associated with the respective participants.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2023/019559 WO2024226024A1 (fr) | 2023-04-24 | 2023-04-24 | Systèmes et procédés de localisation sonore personnalisée dans une conversation |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2023/019559 WO2024226024A1 (fr) | 2023-04-24 | 2023-04-24 | Systèmes et procédés de localisation sonore personnalisée dans une conversation |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024226024A1 true WO2024226024A1 (fr) | 2024-10-31 |
Family
ID=86386916
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/019559 Pending WO2024226024A1 (fr) | 2023-04-24 | 2023-04-24 | Systèmes et procédés de localisation sonore personnalisée dans une conversation |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024226024A1 (fr) |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190341050A1 (en) * | 2018-05-04 | 2019-11-07 | Microsoft Technology Licensing, Llc | Computerized intelligent assistant for conferences |
| US20220115020A1 (en) * | 2020-10-12 | 2022-04-14 | Soundhound, Inc. | Method and system for conversation transcription with metadata |
Non-Patent Citations (3)
| Title |
|---|
| LIU CONGGUI ET AL: "A unified network for multi-speaker speech recognition with multi-channel recordings", 2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), IEEE, 12 December 2017 (2017-12-12), pages 1304 - 1307, XP033315617, DOI: 10.1109/APSIPA.2017.8282233 * |
| RAJ DESH ET AL: "Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis", 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), IEEE, 19 January 2021 (2021-01-19), pages 897 - 904, XP033891225, DOI: 10.1109/SLT48900.2021.9383556 * |
| YOSHIOKA TAKUYA ET AL: "Advances in Online Audio-Visual Meeting Transcription", 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), IEEE, 14 December 2019 (2019-12-14), pages 276 - 283, XP033718873, DOI: 10.1109/ASRU46091.2019.9003827 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12380894B2 (en) * | 2023-05-24 | 2025-08-05 | Ringcentral, Inc. | Systems and methods for contextual modeling of conversational data |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20250266053A1 (en) | Identifying input for speech recognition engine | |
| US11915699B2 (en) | Account association with device | |
| CN112074901B (zh) | 语音识别登入 | |
| US11854550B2 (en) | Determining input for speech processing engine | |
| US10621991B2 (en) | Joint neural network for speaker recognition | |
| US10847162B2 (en) | Multi-modal speech localization | |
| US10468030B2 (en) | Speech recognition method and apparatus | |
| US20210327431A1 (en) | 'liveness' detection system | |
| US10354642B2 (en) | Hyperarticulation detection in repetitive voice queries using pairwise comparison for improved speech recognition | |
| CN116547746A (zh) | 针对多个用户的对话管理 | |
| US20190392858A1 (en) | Intelligent voice outputting method, apparatus, and intelligent computing device | |
| Minotto et al. | Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM | |
| US20150364129A1 (en) | Language Identification | |
| KR20210044475A (ko) | 대명사가 가리키는 객체 판단 방법 및 장치 | |
| US12211509B2 (en) | Fusion of acoustic and text representations in RNN-T | |
| JP2023546703A (ja) | マルチチャネル音声アクティビティ検出 | |
| WO2024226024A1 (fr) | Systèmes et procédés de localisation sonore personnalisée dans une conversation | |
| Aripin et al. | Indonesian Lip‐Reading Detection and Recognition Based on Lip Shape Using Face Mesh and Long‐Term Recurrent Convolutional Network | |
| US20250259003A1 (en) | Machine Learning Based Context Aware Correction for User Input Recognition | |
| WO2024196613A1 (fr) | Système et procédé de génération de profils synthétiques pour l'apprentissage de systèmes de vérification biométrique | |
| Xu et al. | Speaker identification and speech recognition using phased arrays | |
| Huu et al. | A Novel Sentence‐Level Visual Speech Recognition System for Vietnamese Language Using ResNet3D and Zipformer | |
| Tran | Study on speech detection method under heavy noise conditions using spectro-temporal modulation analysis | |
| WO2024144811A1 (fr) | Systèmes et procédés de séparation de pistes audio et de mise en correspondance avec une légende texte |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23724547; Country of ref document: EP; Kind code of ref document: A1 |