WO2024176673A1 - Program, information processing method, information processing device, and robot - Google Patents
Program, information processing method, information processing device, and robot
- Publication number
- WO2024176673A1 WO2024176673A1 PCT/JP2024/001328 JP2024001328W WO2024176673A1 WO 2024176673 A1 WO2024176673 A1 WO 2024176673A1 JP 2024001328 W JP2024001328 W JP 2024001328W WO 2024176673 A1 WO2024176673 A1 WO 2024176673A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- data
- user
- output
- classification information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
Definitions
- This technology relates to a program, an information processing method, an information processing device, and a robot.
- the operating device described in Patent Document 1 includes a voice recognition unit and a control unit that controls lighting fixtures based on any of the recognition results by the voice recognition unit.
- However, the device of Patent Document 1 cannot determine whether the acquired voice data is voice spoken by the user or voice output by a device.
- This disclosure has been made in light of these circumstances, and aims to provide a program or the like that is capable of determining whether voice data is voice spoken by a user or voice output by a device.
- a program acquires voice data and causes a computer to execute a process of inputting the acquired voice data into a learning model that has been trained to output voice classification information indicating whether the voice data is a voice spoken by a user or a voice output by a device when the voice data is input, thereby outputting the voice classification information.
- a program generates a mel spectrogram based on the audio data, and inputs the generated mel spectrogram into the learning model as the audio data.
- a program inputs the mel spectrogram converted to grayscale into the learning model as the audio data.
- a program extracts text data based on the mel spectrogram if the voice classification information indicates voice spoken by a user.
- a program acquires collected sound data picked up by a microphone, inputs the acquired collected sound data into a voice activity detection module to extract, as the voice data, the data of sections containing voice from the collected sound data, inputs the extracted data into the learning model as the voice data, and outputs the voice classification information.
- the program according to one embodiment of the present disclosure cancels, from the collected sound data, the voice output by a robot that executes response control in response to the user's voice instructions.
- a program commands a robot to execute response control corresponding to an instruction by the voice uttered by the user when the voice classification information indicates a voice uttered by the user, and does not command the robot to execute the response control when the voice classification information indicates a voice output by a device.
- the learning model outputs the voice classification information indicating whether the voice data is a voice spoken by a user, a voice output by a device, or a mixture of both, and if the voice classification information indicates that the voice data includes both a voice spoken by a user and a voice output by a device, the program commands a robot that executes response control corresponding to voice instructions from the user to execute voice reacquisition control that prompts the user to input voice again.
- a program selects one of a plurality of learning models according to the type of environment of the location where the voice contained in the voice data is picked up, inputs the voice data into the selected learning model, and outputs the voice classification information.
- An information processing method acquires voice data, and outputs the voice classification information by inputting the acquired voice data into a learning model that has been trained to output voice classification information indicating whether the voice data is a voice spoken by a user or a voice output by a device when the voice data is input.
- An information processing device includes a control unit that acquires voice data and outputs voice classification information by inputting the acquired voice data into a learning model that has been trained to output voice classification information indicating whether the voice data is a voice spoken by a user or a voice output by a device when the voice data is input.
- a robot includes a microphone that picks up ambient sounds, and a control unit that executes response control in response to a user's voice instruction.
- When voice data based on the sound picked up by the microphone is input, the control unit outputs the voice classification information by inputting the voice data into a learning model that has been trained to output voice classification information indicating whether the voice data is a voice spoken by a user or a voice output by a device, and executes the response control if the voice classification information indicates a voice spoken by a user.
- According to the program of one aspect of the present disclosure, it is possible to determine whether the voice data is voice spoken by a user or voice output by a device.
- FIG. 1 is an explanatory diagram illustrating an example of a voice determination system.
- FIG. 2 is a block diagram showing an example of the configuration of an information processing device.
- FIG. 3 is a block diagram showing an example of the configuration of a voice response device.
- FIG. 4 is an explanatory diagram illustrating an example of a voice activity detection module.
- FIG. 5 is an explanatory diagram illustrating an example of a fast Fourier transform module.
- FIG. 6 is an explanatory diagram illustrating an example of a learning model.
- FIG. 7 is a flowchart illustrating an example of processing by an information processing device.
- FIG. 8 is an explanatory diagram showing an example of a learning model according to the second embodiment.
- FIG. 9 is a flowchart illustrating an example of processing by an information processing device according to the second embodiment.
- FIG. 10 is an explanatory diagram illustrating an example of a voice determination system according to the third embodiment.
- FIG. 11 is a block diagram showing an example of the configuration of an information processing device according to the third embodiment.
- FIG. 12 is a block diagram showing an example of the configuration of a user terminal.
- FIG. 13 is an explanatory diagram illustrating an example of an environment type table.
- FIG. 14 is an explanatory diagram illustrating an example of an environment type input screen.
- FIG. 15 is a flowchart illustrating an example of processing by an information processing device according to the third embodiment.
- FIG. 16 is a block diagram showing an example of the configuration of a voice response device according to the fourth embodiment.
- FIG. 17 is a flowchart illustrating an example of processing by a voice response device according to the fourth embodiment.
- (Embodiment 1) FIG. 1 is an explanatory diagram showing an example of a voice determination system S.
- the voice determination system S includes an information processing device 1 and a voice response device 2.
- the information processing device 1 is capable of communicating with the voice response device 2 via a network N using wide-area wireless communication.
- the voice response device 2 is, for example, a smart speaker equipped with a computer that executes response control corresponding to the user's voice instructions.
- the voice response device 2 may be a robot that operates in response to the user's voice instructions, or an electrical appliance such as a washing machine, dishwasher, vacuum cleaner, refrigerator, or air conditioner.
- the functions of the information processing device 1 and the voice response device 2 may be realized by an AI assistant function or a voice chatbot function built into a smartphone.
- the voice response device 2 is described as being a smart speaker.
- the voice response device 2 is installed, for example, in an indoor space (room) where the user resides, and executes response control such as replying by voice output or switching the power of a communicable electrical appliance in response to the user's voice instructions.
- a device (voice output device) 3 that outputs sound such as a television or a radio may be installed in the indoor space.
- the voice response device 2 collects sounds in the indoor space and transmits the collected sound data to the information processing device 1.
- the information processing device 1 receives the collected sound data and obtains from the collected sound data voice data including a voice uttered by a user in the indoor space or a voice output by the voice output device 3, and determines whether the voice included in the voice data is a voice uttered by the user or a voice output by the voice output device 3. If the voice included in the voice data is a voice uttered by a user, the information processing device 1 causes the voice response device 2 to execute response control.
- the voice response device 2 may be installed in a car.
- the voice output device 3 is a radio, and the voice response device 2 determines whether the voice included in the voice data is a voice uttered by a user or a voice output by the voice output device (radio) 3.
- FIG. 2 is a block diagram showing an example configuration of the information processing device 1.
- the information processing device 1 is, for example, a server computer, and includes a control unit 11, a storage unit 12, and a communication unit 13.
- the control unit 11 is configured with a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), or a quantum processor, and performs various control processes, arithmetic processes, and the like by reading and executing a program P (program product) and a database pre-stored in the storage unit 12.
- the functions of the information processing device 1 may be realized by multiple server devices or computers.
- the information processing device 1 may correspond to a node on a blockchain.
- the storage unit 12 of the information processing device 1 includes, for example, volatile memory and non-volatile memory.
- the storage unit 12 stores a program P, a voice section detection module M1, a fast Fourier transform module M2, and a learning model M3.
- the program P may be provided to the information processing device 1 using a storage medium 12a in which the program P is stored in a computer-readable manner.
- the storage medium 12a is, for example, a portable memory. Examples of the portable memory include a CD-ROM, a USB (Universal Serial Bus) memory, an SD card, a micro SD card, and a compact flash memory (registered trademark).
- the processing element of the control unit 11 may read the program P from the storage medium 12a using a reading device not shown.
- the read program P is written to the storage unit 12.
- the program P may be provided to the information processing device 1 by the communication unit 13 communicating with an external device. Details of the voice section detection module M1, the fast Fourier transform module M2, and the learning model M3 will be described later.
- the communication unit 13 is a communication module or communication interface for communicating with the voice response unit 2 by wire or wirelessly, and is, for example, a wide-area wireless communication module such as LTE (registered trademark), 4G, or 5G.
- the control unit 11 communicates with the voice response unit 2 via the communication unit 13 through an external network N such as the Internet.
- FIG. 3 is a block diagram showing an example of the configuration of the voice response device 2.
- the voice response device 2 includes a device control unit 21, a storage unit 22, a communication unit 23, a microphone 24, and a voice output unit 25.
- the device control unit 21 is configured with a CPU or an MPU, etc., and performs various control processes, arithmetic processes, etc.
- the voice response device 2 may perform some or all of the processes executed by the information processing device 1 according to this embodiment.
- the storage unit 22 stores an application program Pa that transmits (outputs) the sound data collected by the microphone 24 to the information processing device 1 and executes response control based on instructions from the information processing device 1.
- the application program Pa is provided to the voice response unit 2 using, for example, a storage medium 22a.
- the device control unit 21 of the voice response unit 2 may obtain the application program Pa using the Internet and store it in the storage unit 22.
- the communication unit 23 is a communication module or communication interface for wirelessly communicating with the information processing device 1.
- the device control unit 21 communicates with the information processing device 1 through the external network N via the communication unit 23.
- the microphone 24 picks up sounds around the voice response unit 2, i.e., sounds within the indoor space in which the voice response unit 2 is installed.
- the device control unit 21 converts the sounds picked up by the microphone 24 into picked-up data and transmits (outputs) it to the information processing device 1.
- the voice response unit 2 may be connected to an external microphone and acquire the picked-up data from the microphone.
- the voice output unit 25 responds (notifies) the user by outputting a voice related to the response control.
- the voice output from the voice output unit 25 is a voice generated by the information processing device 1 or the voice response device 2, or a voice selected from a plurality of voice templates stored in the information processing device 1 or the voice response device 2.
- When the device control unit 21 receives (acquires) an instruction to execute response control from the information processing device 1, it executes the response control by outputting voice from the voice output unit 25.
- When the voice output from the voice output unit 25 is included in the collected sound data picked up by the microphone 24, the control unit 11 of the information processing device 1 deletes the data representing that voice by a canceling process.
- the voice response unit 2 may be connected to an external speaker or the like and output voice from the speaker or the like.
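- The publication does not detail how the canceling process works; one common realization is adaptive echo cancellation against the known playback signal. The following is a minimal sketch, assuming the playback (reference) samples and the microphone samples are available as NumPy arrays; the function name `nlms_cancel` and all parameter values are illustrative and not taken from the source.

```python
import numpy as np

def nlms_cancel(mic, ref, taps=256, mu=0.5, eps=1e-8):
    """Subtract an estimate of the device's own playback (ref) from the
    microphone signal (mic) using a normalized LMS adaptive filter."""
    mic = np.asarray(mic, dtype=float)
    ref = np.asarray(ref, dtype=float)
    w = np.zeros(taps)              # adaptive filter coefficients
    buf = np.zeros(taps)            # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n] if n < len(ref) else 0.0
        y = w @ buf                 # estimated echo at sample n
        e = mic[n] - y              # residual: user voice + ambient sound
        w += (mu / (eps + buf @ buf)) * e * buf   # NLMS coefficient update
        out[n] = e
    return out
```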
- FIG. 4 is an explanatory diagram showing an example of a voice activity detection module.
- the voice activity detection (VAD) module M1 extracts data of a section containing voice from the collected sound data, and sets it as voice data.
- the voice activity detection module M1 cuts out of the collected sound data the sections that contain environmental sounds, such as door opening and closing sounds, operating sounds of electrical appliances, and outdoor noise, but no voice, and extracts the remaining data as the voice data.
- the collected sound data according to this embodiment and the voice data extracted from the collected sound data are represented by a voice signal with the vertical axis representing amplitude and the horizontal axis representing time.
- the voice data represented by the voice signal is referred to as signal voice data.
- the control unit 11 of the information processing device 1 constantly acquires collected sound data from the voice response device 2 and inputs it to the voice activity detection module M1.
- the voice activity detection module M1 extracts data of a section from the collected sound data, from the time when the amplitude of the voice signal first exceeds a predetermined threshold to the time when the amplitude last exceeds the threshold, as signal voice data.
- the voice activity detection module M1 may extract, as signal voice data, a section in which the number of times the amplitude of the voice signal switches between positive and negative (number of zero crossings) is equal to or greater than a predetermined number.
- data of a section delimited by a dashed line shown in the collected sound data is extracted as signal voice data.
- the voice activity detection module M1 may be stored in the voice response unit 2, and the device control unit 21 of the voice response unit 2 may extract the signal voice data from the collected sound data, and transmit (output) the extracted signal voice data to the information processing device 1.
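- As a rough illustration of the extraction rules described above (amplitude threshold and zero-crossing count), a sketch follows; the threshold value is an arbitrary assumption and the function names are not from the publication.

```python
import numpy as np

def extract_voice_section(signal, threshold=0.02):
    """Return the slice from the first to the last sample whose absolute
    amplitude exceeds `threshold`, or None if no sample does."""
    idx = np.flatnonzero(np.abs(signal) > threshold)
    if idx.size == 0:
        return None                  # no voiced section detected
    return signal[idx[0]:idx[-1] + 1]

def zero_crossings(segment):
    """Count sign changes; a section may be kept as voice data when this
    count is at or above a predetermined number."""
    signs = np.signbit(segment).astype(np.int8)
    return int(np.count_nonzero(np.diff(signs)))
```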
- FIG. 5 is an explanatory diagram showing an example of a fast Fourier transform module.
- the fast Fourier transform (FFT) module M2 converts the signal audio data into a mel spectrogram representation.
- audio data represented as a mel spectrogram is referred to as mel spectrogram audio data.
- the control unit 11 of the information processing device 1 inputs the signal audio data output by the audio section detection module M1 to the fast Fourier transform module M2.
- the fast Fourier transform module M2 performs a discrete Fourier transform on the audio signal data for each frame separated by a certain time period to obtain the audio intensity (dB) at each frequency, and generates a spectrogram of the audio data (spectrogram audio data) by mapping based on the audio intensity with the vertical axis being frequency and the horizontal axis being time.
- the fast Fourier transform module M2 generates and outputs mel spectrogram audio data by changing the vertical axis of the spectrogram audio data to the mel scale.
- FIG. 5A shows mel spectrogram audio data when the audio included in the signal audio data is audio uttered by the user.
- the mel spectrogram audio data generated in this embodiment is shown as a grayscale image, but may also be shown as an RGB three-channel color image.
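- A minimal sketch of the conversion performed by the fast Fourier transform module, assuming the `librosa` library; the frame length, hop size, and number of mel bands are illustrative defaults rather than values stated in the publication.

```python
import numpy as np
import librosa

def to_grayscale_mel(signal, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Convert signal voice data into a grayscale mel spectrogram image
    (frequency on the vertical axis, time on the horizontal axis)."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # intensity in dB
    rng = mel_db.max() - mel_db.min() + 1e-8
    img = (mel_db - mel_db.min()) / rng             # normalize to 0..1
    return (img * 255).astype(np.uint8)             # grayscale pixel values
```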
- FIG. 6 is an explanatory diagram showing an example of the learning model M3.
- the learning model M3 is, for example, a model that extracts image features, such as a CNN (Convolutional Neural Network) or a Vision Transformer.
- the input layer included in the learning model M3 has a plurality of neurons that accept input of pixel values of an image of mel spectrogram audio data (audio data image), and passes the input pixel values to the intermediate layer.
- the intermediate layer has a plurality of neurons that extract image features of the audio data image, and passes the extracted image features to the output layer.
- the output layer has one or more neurons that output audio classification information indicating whether the input audio data is audio uttered by the user or audio output by the audio output device 3, and outputs the audio classification information based on the image features output from the intermediate layer.
- the learning model M3 may input the image features output from the intermediate layer to an SVM (support vector machine) to output the audio classification information.
- the neural network (learning model M3) trained using the training data is expected to be used as a program module forming part of artificial intelligence software. As described above, the learning model M3 is used by the control unit 11 (CPU, etc.), which has the necessary arithmetic processing capability, and is executed by the control unit 11 in this manner to form a neural network system.
- control unit 11 performs calculations to extract features of the audio data image input to the input layer in accordance with instructions from the learning model M3 stored in the memory unit 12, and outputs audio classification information from the output layer.
- grayscale mel spectrogram audio data is input to the learning model M3, but RGB three-channel color spectrogram audio data may also be input.
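- A CNN of the kind described could be sketched as below in PyTorch, assuming single-channel (grayscale) spectrogram images resized to 64x64; the layer sizes are assumptions, since the publication does not specify the architecture.

```python
import torch
import torch.nn as nn

class VoiceClassifier(nn.Module):
    """Outputs the probability that the input audio data image is voice
    uttered by the user (1) rather than device output (0)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # intermediate layers: image features
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(                # output layer
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (batch, 1, 64, 64)
        return self.head(self.features(x))       # -> (batch, 1)
```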
- the learning model M3 is trained using training data in which a label of a correct output value (1 or 0) is associated with mel spectrogram audio data.
- the control unit 11 of the information processing device 1 inputs a plurality of training data, including training data created by labeling the mel spectrogram audio data shown in FIG. 5A with the correct output value (1), or training data created by labeling the mel spectrogram audio data shown in FIG. 5B with the correct output value (0), to the learning model M3, and trains the learning model M3.
- the learning model M3 adjusts the bias in each layer and the weight between each layer so as to output the correct output value, and is trained.
- the learning model M3 may be trained by a computer different from the information processing device 1.
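- Training with labels 1 (user voice) and 0 (device output) could look like the sketch below, reusing the `VoiceClassifier` from the previous example; the loss function and optimizer settings are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, images, labels, epochs=10):
    """images: (N, 1, 64, 64) grayscale mel spectrograms; labels: (N,) of 1/0."""
    loader = DataLoader(TensorDataset(images, labels.float()),
                        batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.BCELoss()                 # matches the sigmoid output
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x).squeeze(1), y)
            loss.backward()                      # adjust weights and biases
            opt.step()                           # so the correct label is output
    return model
```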
- the learning model M3 outputs a binary value of 1 or 0 as the voice classification information. That is, the control unit 11 outputs the voice classification information using the learning model M3.
- If the voice classification information is 1, the control unit 11 determines that the voice data is a voice uttered by the user.
- If the voice classification information is 0, the control unit 11 determines that the voice data is a voice output by the voice output device 3.
- Alternatively, the learning model M3 may output, as the voice classification information, the probability that the voice data is a voice uttered by the user.
- In that case, the control unit 11 determines that the voice data is a voice uttered by the user when the probability is 0.5 or more, and determines that the voice data is a voice output by the voice output device 3 when the probability is less than 0.5. That is, the control unit 11 outputs the voice classification information using the learning model M3.
- FIG. 7 is a flowchart showing an example of processing of the information processing device.
- the control unit 11 of the information processing device 1 acquires the collected voice data collected by the microphone 24 from the voice response unit 2 (S1).
- the control unit 11 performs a canceling process on the collected voice data to delete the voice output from the voice output unit 25 of the voice response unit 2 (S2).
- the control unit 11 inputs the collected voice data after the canceling process to the voice section detection module M1 (S3) and extracts the signal voice data (voice data) (S4).
- the control unit 11 inputs the signal voice data to the fast Fourier transform module M2 (S5) and generates mel spectrogram voice data (S6).
- the control unit 11 inputs the mel spectrogram voice data (voice data) to the learning model M3 (S7) and outputs voice classification information indicating whether the voice data is a voice uttered by the user or a voice output by the voice output device 3 (S8).
- the control unit 11 determines whether the voice data is a voice uttered by the user (or a voice output by the voice output device 3) (S9). In S9, if the voice classification information is 1, the control unit 11 determines that the voice data is voice uttered by the user, and if the voice classification information is 0, the control unit 11 determines that the voice data is voice output by the voice output device 3.
- If the voice data is a voice uttered by the user (S9: YES), the control unit 11 extracts text data of the content of the voice uttered by the user based on the mel spectrogram voice data (S10).
- the control unit 11 extracts the text data by voice recognition processing using processing modules such as an acoustic model, a pronunciation dictionary, and a language model.
- the control unit 11 recognizes an instruction by the voice generated by the user based on the extracted text data (S11).
- the control unit 11 recognizes the instruction by a trained model that performs natural language processing, such as GPT (Generative Pre-Training), GPT2, GPT3, Transformer, BERT (Bidirectional Encoder Representations from Transformers), BART (Bidirectional Auto-Regressive Transformer), or T5 (Text-to-Text Transfer Transformer).
- the control unit 11 determines the content of the response control corresponding to the recognized instruction (S12).
- the control unit 11 determines the content of the response control, for example, by referring to a database that links instructions with the content of the response control.
- the control unit 11 transmits (outputs) a command to execute the response control whose content has been determined to the voice response unit 2 (S13), and ends the process.
- If the voice data is a voice output by the voice output device 3 (S9: NO), the control unit 11 ends the process without transmitting (outputting) a command to execute response control to the voice response unit 2.
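- Putting steps S1 to S13 together, the server-side flow could be organized as in the sketch below; the callables passed in (`cancel_own_output`, `vad`, `fft`, `model`, and so on) and the dictionary `response_db` stand in for the modules and database described above and are not names from the publication.

```python
def handle_collected_sound(collected, cancel_own_output, vad, fft, model,
                           recognize_text, recognize_instruction,
                           response_db, send_command):
    """One pass of the processing of FIG. 7 (S1-S13), as a rough sketch."""
    cleaned = cancel_own_output(collected)      # S2: remove the device's own voice
    segment = vad(cleaned)                      # S3-S4: extract the voiced section
    if segment is None:
        return
    mel = fft(segment)                          # S5-S6: mel spectrogram image
    if model(mel) < 0.5:                        # S7-S9: voice classification
        return                                  # device output: no response control
    text = recognize_text(mel)                  # S10: speech recognition
    instruction = recognize_instruction(text)   # S11: instruction recognition (NLP)
    action = response_db.get(instruction)       # S12: look up response content
    if action is not None:
        send_command(action)                    # S13: command the voice response device
```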
- the above configuration and processing make it possible to determine whether the acquired voice data is a voice uttered by the user or a voice output by the voice output device 3. In addition, by deciding whether or not to execute response control depending on the result of the determination, it is possible to cause the voice response unit 2 to execute response control with high accuracy only in response to voice instructions from the user.
- (Embodiment 2) When the voice data includes both the voice uttered by the user and the voice output by the voice output device 3, the information processing device 1 according to the second embodiment commands the voice response unit 2 to execute voice reacquisition control to prompt the user to input voice again.
- the other configurations, except for the configuration described below, are common to the first embodiment. Therefore, the same reference symbols as those in the first embodiment are used for the components common to the first embodiment, and the description of those components will be omitted.
- FIG. 8 is an explanatory diagram showing an example of the learning model M3 according to the second embodiment.
- the learning model M3 according to the second embodiment outputs the probability that the mel spectrogram voice data is a voice uttered by the user (first probability), the probability that the voice is a voice output by the voice output device 3 (second probability), and the probability that the mel spectrogram voice data includes both a voice uttered by the user and a voice output by the voice output device 3 (third probability) as voice classification information so that the sum of the probabilities is 1.
- the input layer included in the learning model M3 has a plurality of neurons that accept input of pixel values of an image of the mel spectrogram voice data (voice data image), and passes the input pixel values to the intermediate layer.
- the intermediate layer has a plurality of neurons that extract image features of the voice data image, and passes the extracted image features to the output layer.
- the output layer has a plurality of neurons that output each probability, and outputs each probability based on the image features output from the intermediate layer.
- the control unit 11 of the information processing device 1 performs voice determination based on the highest probability among the various probabilities included in the voice classification information. In the example shown in FIG. 8, since the third probability is the highest, the control unit 11 determines that the mel spectrogram voice data includes both voice uttered by the user and voice output by the voice output device 3.
- the learning model M3 may output the probability that the mel spectrogram voice data is voice uttered by the user as the voice classification information.
- In this case, the control unit 11 of the information processing device 1 determines that the voice data is voice output by the voice output device 3 when the probability is 0 or more and less than 0.34, determines that the voice data includes both voice uttered by the user and voice output by the voice output device 3 when the probability is 0.34 or more and less than 0.67, and determines that the voice data is voice uttered by the user when the probability is 0.67 or more and 1 or less.
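- Under either variant, the decision reduces to a small amount of logic; the sketch below shows both the argmax rule over the three probabilities and the single-probability thresholds quoted above. The class ordering is an assumption.

```python
import numpy as np

USER, DEVICE, BOTH = 0, 1, 2   # assumed ordering of the three output probabilities

def decide_from_three(probs):
    """probs: (first, second, third probability), summing to 1."""
    cls = int(np.argmax(probs))
    if cls == BOTH:
        return "voice_reacquisition"   # prompt the user to speak again (S30)
    return "respond" if cls == USER else "ignore"

def decide_from_single(p):
    """Variant with a single probability output, using the stated thresholds."""
    if p < 0.34:
        return "ignore"                # voice output by the voice output device
    if p < 0.67:
        return "voice_reacquisition"   # both user voice and device output
    return "respond"                   # voice uttered by the user
```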
- the learning model M3 is trained using training data that links mel spectrogram voice data with labels for each probability. For example, when the mel spectrogram voice data included in the training data is a voice uttered by a user, the control unit 11 of the information processing device 1 inputs training data created by linking labels of first probability: 1, second probability: 0, and third probability: 0 to the mel spectrogram voice data to the learning model M3, and trains the learning model M3.
- the learning model M3 is trained by adjusting the bias in each layer and the weights between each layer so as to output a probability equal to the label. Note that the learning model M3 may be trained by a computer different from the information processing device 1.
- FIG. 9 is a flowchart showing an example of processing of the information processing device 1 according to the second embodiment. Steps S21 to S28 are similar to steps S1 to S8 in FIG. 7.
- the control unit 11 of the information processing device 1 determines whether the voice data includes both voice uttered by the user and voice output by the voice output device 3 (S29). In S29, if the voice classification information output by the learning model M3 is 0.5, the control unit 11 determines that the voice data includes both voice uttered by the user and voice output by the voice output device 3, and if the voice classification information is 0 or 1, determines that the voice data does not include both voice uttered by the user and voice output by the voice output device 3.
- If the voice data includes both voice uttered by the user and voice output by the voice output device 3 (S29: YES), the control unit 11 transmits (outputs) a command to execute voice re-acquisition control to the voice response unit 2 (S30) and ends the process.
- When the voice response unit 2 receives (acquires) a command to execute voice re-acquisition control from the information processing device 1, it executes voice re-acquisition control to prompt the user to input voice again.
- the voice response unit 2 executes voice re-acquisition control by, for example, outputting a voice such as "I couldn't hear you. Please speak again when it is quiet around you" via the voice output unit 25.
- S31 to S35 are the same processes as S9 to S13 in FIG. 7.
- (Embodiment 3) The information processing device 1 according to the third embodiment selects one learning model M3 from the multiple learning models M3 according to the environment type of the location where the voice included in the voice data is collected, inputs the voice data to the selected learning model M3, and outputs the voice classification information.
- the following describes the differences between the third embodiment and the first embodiment.
- the other configurations, except for the configurations described below, are common to the first embodiment. For this reason, the same reference symbols as those in the first embodiment are used for the components common to the first embodiment, and the description of those components will be omitted.
- FIG. 10 is an explanatory diagram showing an example of a voice determination system S according to embodiment 3.
- the voice determination system S according to embodiment 3 includes a user terminal 4.
- the user terminal 4 is a terminal device owned by a user, such as a smartphone, a tablet terminal, or a personal computer.
- the following description will be given assuming that the user terminal 4 is a smartphone.
- the audio output device 3 is a television.
- the user terminal 4 accepts input from the user of the environmental type of the location where the voice included in the voice data is picked up, i.e., the location where the voice response unit 2 is installed, and transmits (outputs) the input environmental type to the information processing device 1.
- the information processing device 1 selects a learning model M3 to be used for outputting voice classification information from a plurality of stored learning models M3 according to the received (acquired) environmental type. Details of the environmental type will be described later. Note that when sound is picked up by a microphone installed outside the voice response unit 2, the location where the voice is picked up is the location where the microphone is installed.
- FIG. 11 is a block diagram showing an example of the configuration of an information processing device 1 according to embodiment 3.
- the storage unit 12 of the information processing device 1 according to embodiment 3 stores an environment type table T and a plurality of learning models M3 (M3a to M3f).
- FIG. 12 is a block diagram showing an example of the configuration of the user terminal 4.
- the user terminal 4 includes a terminal control unit 41, a storage unit 42, a communication unit 43, an input unit 44, and a display unit 45.
- the terminal control unit 41 is configured with a CPU or an MPU, and performs various control processes, arithmetic processes, and the like. Note that the functions of the user terminal 4 may be realized by multiple devices.
- the storage unit 42 stores an application program Pb that receives an environment type input by the user at the input unit 44 and transmits (outputs) the input environment type to the information processing device 1.
- the application program Pb is provided to the user terminal 4 using, for example, a storage medium 42a.
- the terminal control unit 41 of the user terminal 4 may obtain the application program Pb using the Internet and store it in the storage unit 42.
- the communication unit 43 is a communication module or communication interface for wirelessly communicating with the information processing device 1.
- the terminal control unit 41 communicates with the information processing device 1 through the external network N via the communication unit 43.
- the input unit 44 accepts input from the user.
- the user can use the input unit 44 to input the type of environment in which the voice response unit 2 will be installed.
- the display unit 45 displays an input screen for the environment type.
- the user terminal 4 is a smartphone, and the input unit 44 and the display unit 45 are integrated into a touch panel.
- FIG. 13 is an explanatory diagram showing an example of an environment type table T.
- the environment types are classified by a number of items.
- the environment types are classified by two items: the channel that the user frequently watches using the audio output device (television) 3, and the distance between the voice response device 2 and the audio output device (television) 3.
- the environment types may be classified by three or more items, such as the user's family composition, the location of the user when speaking to the voice response device 2, the floor number of the indoor space (room) in which the voice response device 2 is installed, or the traffic volume of roads near the indoor space (room) in which the voice response device 2 is installed.
- the management items (fields) of the environment type table T include, for example, a channel field, a distance field, and a learning model field.
- the channel field stores the channel that the user frequently watches through the audio output device 3.
- the distance field stores the distance between the voice response unit 2 and the audio output device 3.
- the learning model field stores the type of learning model M3 selected for the information on the environment type stored in the channel field and the distance field.
- the control unit 11 selects the learning model M3 to be used for outputting the audio classification information based on the environment type input to the user terminal 4 and the environment type table T.
- In this embodiment, the environment types are divided into three channels and two distance divisions, but the classification is not limited to this.
- the environment types may be divided by two or four or more channels, or by three or more distance divisions.
- the number of learning models M3 stored in the memory unit 12 is not limited to six.
- the memory unit 12 may store two to five learning models M3, or seven or more learning models M3.
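- The selection described here is essentially a table lookup; the sketch below models the environment type table T as a dictionary keyed by (channel, distance class). The concrete keys and model identifiers are placeholders.

```python
# Assumed layout of the environment type table T:
# (frequently watched channel, distance class) -> learning model identifier
ENV_TABLE = {
    ("channel A", "under 3 m"): "M3a", ("channel A", "3 m or more"): "M3b",
    ("channel B", "under 3 m"): "M3c", ("channel B", "3 m or more"): "M3d",
    ("channel C", "under 3 m"): "M3e", ("channel C", "3 m or more"): "M3f",
}

def select_model(channel, distance_m, models):
    """Pick the learning model matching the environment type reported by the user terminal."""
    distance_class = "under 3 m" if distance_m < 3.0 else "3 m or more"
    return models[ENV_TABLE[(channel, distance_class)]]
```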
- the terminal control unit 41 of the user terminal 4 causes the display unit 45 to display the environment type input screen.
- the input unit 44 accepts input, for example, by selection, for each item of the environment type displayed on the environment type input screen. In the example shown in FIG. 14, the input unit 44 accepts a selection from channel A, channel B, or channel C for the channel that the user frequently watches through the audio output device 3.
- the input unit 44 also accepts a selection of either less than 3 m or 3 m or more for the distance between the voice response unit 2 and the audio output device 3.
- a command (transmission command) for instructing transmission of the input environment type is displayed on the environment type input screen. When the transmission command is selected, the terminal control unit 41 transmits the input (selected) environment type to the information processing device 1.
- FIG. 15 is a flowchart showing an example of processing of the information processing device 1 according to the third embodiment.
- the control unit 11 of the information processing device 1 acquires the input environmental type of the location where the voice response unit 2 is installed from the user terminal 4 (S41).
- the control unit 11 reads the environmental type table T from the storage unit 12 (S42).
- the control unit 11 refers to the environmental type table T and selects a learning model M3 to be used for outputting voice classification information based on the input environmental type of the location where the voice response unit 2 is installed (S43).
- Steps S44 to S56 are the same as steps S1 to S13 shown in FIG. 7.
- the control unit 11 inputs mel spectrogram voice data to the learning model M3 selected in S43.
- In this embodiment, the control unit 11 of the information processing device 1 determines the learning model M3 to be used to output the voice classification information by referring to the environment type table T, but the method is not limited to this.
- the control unit 11 may determine the learning model M3 to be used to output the voice classification information by a trained model such as a decision tree or neural network that outputs a type of learning model M3 appropriate for use in outputting the voice classification information when each item of the environment type is input.
- the model is trained to output the correct answer type when each item of the environment type is input, for example, using training data that links each item of the environment type with the type of learning model M3 to be output (correct answer type).
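- A decision tree trained on such data might look like the following scikit-learn sketch; the categorical encoding and the training rows are illustrative assumptions.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Training data linking environment-type items to the model type to output (correct answer type)
X_raw = [["channel A", "under 3 m"], ["channel A", "3 m or more"],
         ["channel B", "under 3 m"], ["channel B", "3 m or more"],
         ["channel C", "under 3 m"], ["channel C", "3 m or more"]]
y = ["M3a", "M3b", "M3c", "M3d", "M3e", "M3f"]

enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)                 # encode the categorical items as numbers
selector = DecisionTreeClassifier().fit(X, y)

# Choose a model type for a newly reported environment
model_type = selector.predict(enc.transform([["channel B", "under 3 m"]]))[0]
```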
- (Embodiment 4) The voice response unit 2 according to the fourth embodiment executes the process executed by the information processing device 1 according to the first embodiment.
- the following describes the differences between the first embodiment and the fourth embodiment. Except for the configuration described below, the other configurations are common to the first embodiment. Therefore, the same reference symbols as those in the first embodiment are used for the components common to the first embodiment, and the description of those components is omitted.
- FIG. 16 is a block diagram showing an example of the configuration of a voice response device 2 according to embodiment 4.
- the voice response device 2 according to embodiment 4 is, for example, a robot that transports an object to a location where a user is located by a drive unit (arm) 26.
- the memory unit 22 of the voice response device 2 according to embodiment 4 stores a voice activity detection module M1, a fast Fourier transform module M2, and a learning model M3.
- the memory unit 22 may store multiple learning models M3 and an environment type table T, and may execute the same processing as the information processing device 1 according to embodiment 3.
- the driving unit (arm) 26 executes the specified response control under the control of the device control unit 21. For example, when text data such as "I want to drink water" is extracted from the user's voice, the device control unit 21 causes the driving unit (arm) 26 to grasp a container containing water and transport the grasped container to the location where the user is located.
- the device control unit 21 of the voice response device 2 acquires the collected voice data collected by the microphone 24 (S61).
- the device control unit 21 performs a canceling process on the collected voice data to delete the voice output from the voice output unit 25 of the voice response device 2 (S62).
- the device control unit 21 inputs the collected voice data after the canceling process to the voice section detection module M1 (S63) and extracts the signal voice data (voice data) (S64).
- the device control unit 21 inputs the signal voice data to the fast Fourier transform module M2 (S65) and generates mel spectrogram voice data (S66).
- the device control unit 21 inputs the mel spectrogram voice data (voice data) to the learning model M3 (S67) and outputs voice classification information indicating whether the voice data is a voice uttered by the user or a voice output by the voice output device 3 (S68).
- the device control unit 21 may output, as the voice classification information, the probability that the voice is a voice uttered by the user (first probability), the probability that the voice is a voice output by the voice output device 3 (second probability), and the probability that the voice data includes both the voice uttered by the user and the voice output by the voice output device 3 (third probability) using the learning model M3.
- the device control unit 21 determines whether the voice data is a voice uttered by the user (whether the voice is a voice output by the voice output device 3) (S69). In S69, if the voice classification information is 1, the device control unit 21 determines that the voice data is a voice uttered by the user, and if the voice classification information is 0, determines that the voice data is a voice output by the voice output device 3.
- the device control unit 21 extracts text data of the content of the voice generated by the user based on the mel spectrogram voice data (S70). In S70, the device control unit 21 extracts the text data by voice recognition processing using processing modules such as an acoustic model, a pronunciation dictionary, and a language model. The device control unit 21 recognizes an instruction by the voice generated by the user based on the extracted text data (S71). In S71, the device control unit 21 recognizes the instruction by a trained model that performs natural language processing such as GPT, GPT2, GPT3, Transformer, BERT, BART, or T5.
- the device control unit 21 determines the content of the response control corresponding to the recognized instruction (S72). In S72, the device control unit 21 determines the content of the response control by referring to a database that links instructions and the content of the response control, for example. The device control unit 21 executes the response control whose content has been determined (S73) and ends the process. If the audio data is audio output by the audio output device 3 (S69: NO), the device control unit 21 ends the process without performing response control.
- the device control unit 21 recognizes an instruction to transport a container filled with water in S71.
- the device control unit 21 determines the transport of a container filled with water as the content of the response control in S72, and causes the drive unit (arm) 26 to carry out the transport of a container filled with water as the response control in S73.
- the voice response unit 2 operates the drive unit (arm) to carry out the transport of a container filled with water as the response control. Note that if a similar voice is output from the voice output device 3, the voice response unit 2 does not carry out the response control.
- 1 Information processing device, 11 Control unit, 12 Storage unit, 12a Storage medium, 13 Communication unit, 2 Voice response device, 21 Device control unit, 22 Storage unit, 22a Storage medium, 23 Communication unit, 24 Microphone, 25 Voice output unit, 3 Voice output device (device), 4 User terminal, M1 Voice activity detection module, M2 Fast Fourier transform module, M3 Learning model, N Network, P Program, Pa Application program, Pb Application program, S Voice determination system, T Environment type table
Abstract
A program according to one embodiment of the present invention causes a computer to execute processing to acquire voice data, input the acquired voice data into a trained model that is trained so as to output, when voice data is input, voice classification information indicating whether the voice data is voice uttered by a user or voice output by a device, and thereby output the voice classification information.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023026150 | 2023-02-22 | ||
| JP2023-026150 | 2023-02-22 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024176673A1 true WO2024176673A1 (fr) | 2024-08-29 |
Family
ID=92500693
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2024/001328 Ceased WO2024176673A1 (fr) | Program, information processing method, information processing device, and robot | 2023-02-22 | 2024-01-18 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024176673A1 (fr) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2016511475A (ja) * | 2013-03-05 | 2016-04-14 | Alibaba Group Holding Limited | Method and system for distinguishing humans from machines |
| JP2018517927A (ja) * | 2015-09-04 | 2018-07-05 | Google LLC | Neural networks for speaker verification |
| JP2019514045A (ja) * | 2016-03-21 | 2019-05-30 | Amazon Technologies, Inc. | Speaker verification method and system |
| JP2020527758A (ja) * | 2017-07-25 | 2020-09-10 | Google LLC | Utterance classifier |
| JP2020154061A (ja) * | 2019-03-19 | 2020-09-24 | FueTrek Co., Ltd. | Speaker identification device, speaker identification method, and program |
| JP2021524063A (ja) * | 2018-05-17 | 2021-09-09 | Google LLC | Speech synthesis from text in a target speaker's voice using a neural network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24759980; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |