WO2023273064A1 - Object speaking detection method and apparatus, electronic device and storage medium - Google Patents
Object speaking detection method and apparatus, electronic device and storage medium
- Publication number
- WO2023273064A1 (application PCT/CN2021/127097)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target object
- face
- area
- sound signal
- driver
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Definitions
- The present disclosure relates to the technical field of smart vehicle cabins, and in particular to an object speaking detection method and apparatus, an electronic device, and a storage medium.
- Cabin intelligence, which includes multi-modal interaction, personalized services, safety perception, etc., is an important direction in the current development of the automotive industry.
- Multi-modal interaction in the cabin is intended to provide occupants with a comfortable interactive experience. The means of multi-modal interaction include voice recognition, gesture recognition, etc.; among them, speech recognition occupies a significant market share in the field of vehicle interaction.
- the present disclosure proposes a technical solution for object speech detection.
- An object speaking detection method is provided, comprising: acquiring a video stream in the vehicle cabin and a sound signal collected by a vehicle-mounted microphone; performing face detection on each of multiple video frames of the video stream, and determining the face area of the target object in the vehicle in each video frame; determining the lip movement recognition result of the target object's lips according to the face areas of the target object in N of the video frames, where N is an integer greater than 1; and determining the speaking detection result of the target object according to the lip movement recognition result and a first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
- In a possible implementation, determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal includes: determining that the target object is in a speaking state when the lip movement recognition result is that lip movement occurs and the first sound signal includes speech.
- In a possible implementation, the method further includes: when the target object is in a speaking state, performing content recognition on the first sound signal to determine the speech content corresponding to the first sound signal; and when the speech content includes a preset voice command, executing a control function corresponding to the voice command.
- In a possible implementation, the target object includes the driver, and executing the control function corresponding to the voice command includes: when the voice command corresponds to multiple directional control functions, determining the gaze direction of the target object according to the face areas of the target object in the N video frames; determining a target control function from the multiple control functions according to the gaze direction; and executing the target control function.
- In a possible implementation, the video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin, and performing face detection on each of the multiple video frames of the video stream includes: detecting the driver's face based on each of multiple first video frames of the first video stream; and/or detecting faces in the cabin based on each of multiple second video frames of the second video stream, and determining the driver's face in each second video frame according to the positions of the detected faces in the cabin.
- In a possible implementation, obtaining the video stream in the cabin includes: obtaining the first video stream of the driver area collected by a camera of the driver detection system (DMS); and/or obtaining the second video stream of the occupant area in the cabin collected by a camera of the occupant detection system (OMS).
- In a possible implementation, the method further includes: determining the first seat area of the target object according to each of the multiple video frames; wherein determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal includes: when the first sound signal includes speech, performing sound zone localization on the first sound signal to determine the second seat area corresponding to the first sound signal; and when the lip movement recognition result is that lip movement occurs and the first seat area is the same as the second seat area, determining that the target object is in a speaking state.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and determining the first seat area of the target object includes: in response to detecting a face in the driver area according to each video frame, determining that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of the occupant area in the vehicle cabin, the target object includes the driver and/or passengers, and determining the first seat area of the target object according to each of the multiple video frames includes: performing face detection on each of the multiple video frames; and determining the first seat area of the target object according to the detected face positions.
- In a possible implementation, the method further includes: when the target object is in a speaking state, performing content recognition on the first sound signal to determine the speech content corresponding to the first sound signal; when the speech content includes a preset voice command, determining an area control function corresponding to the voice command according to the first seat area of the target object; and executing the area control function.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and determining the speaking detection result of the target object according to the lip movement recognition result includes: when sound zone localization on the first sound signal determines that the first sound signal comes from the driver area, determining whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determining that the speaking detection result of the target object is that the driver is in a non-speaking state.
- In a possible implementation, determining the lip movement recognition result of the target object's lips according to the face areas of the target object in the N video frames includes: performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object; fusing the N face features to obtain the face fusion feature of the target object; and performing lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
- In a possible implementation, performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object includes: for the i-th video frame among the N video frames, performing face key point extraction on the face area of the target object in the i-th video frame to obtain multiple face key points, where 1 ≤ i ≤ N; and normalizing the multiple face key points to obtain the i-th face feature of the target object.
- An object speaking detection apparatus is provided, including: a signal acquisition module, configured to acquire the video stream in the cabin and the sound signal collected by the vehicle-mounted microphone; a face detection module, configured to perform face detection on each of multiple video frames of the video stream and determine the face area of the target object in the vehicle in each video frame; a lip movement recognition module, configured to determine the lip movement recognition result of the target object's lips according to the face areas of the target object in N video frames among the multiple video frames, where N is an integer greater than 1; and a speaking detection module, configured to determine the speaking detection result of the target object according to the lip movement recognition result and the first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
- In a possible implementation, the speaking detection module is configured to: determine that the target object is in a speaking state when the lip movement recognition result is that lip movement occurs and the first sound signal includes speech.
- In a possible implementation, the apparatus further includes: a content recognition module, configured to perform content recognition on the first sound signal when the target object is in a speaking state and determine the speech content corresponding to the first sound signal; and a function execution module, configured to execute a control function corresponding to the voice command when the speech content includes a preset voice command.
- In a possible implementation, the target object includes the driver, wherein the function execution module is configured to: when the voice command corresponds to multiple directional control functions, determine the gaze direction of the target object according to the face areas of the target object in the N video frames; determine a target control function from the multiple control functions according to the gaze direction; and execute the target control function.
- In a possible implementation, the video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin, and the face detection module is configured to: detect the driver's face based on each of multiple first video frames of the first video stream; and/or detect faces in the vehicle cabin based on each of multiple second video frames of the second video stream, and determine the driver's face in each second video frame according to the positions of the detected faces in the cabin.
- In a possible implementation, the signal acquisition module is configured to: acquire the first video stream of the driver area captured by a camera of the driver detection system (DMS); and/or acquire the second video stream of the occupant area in the cabin captured by a camera of the occupant detection system (OMS).
- In a possible implementation, the apparatus further includes: a seat area determination module, configured to determine the first seat area of the target object according to each of the multiple video frames; wherein the speaking detection module is configured to: when the first sound signal includes speech, perform sound zone localization on the first sound signal to determine the second seat area corresponding to the first sound signal; and when the lip movement recognition result is that lip movement occurs and the first seat area is the same as the second seat area, determine that the target object is in a speaking state.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and the seat area determination module is configured to: in response to detecting a face in the driver area according to each video frame, determine that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of the occupant area in the vehicle cabin, the target object includes the driver and/or passengers, and the seat area determination module is configured to: perform face detection on each of the multiple video frames; and determine the first seat area of the target object according to the detected face positions.
- In a possible implementation, the apparatus further includes: a content recognition module, configured to perform content recognition on the first sound signal when the target object is in a speaking state and determine the speech content corresponding to the first sound signal; a function determination module, configured to determine the area control function corresponding to the voice command according to the first seat area of the target object when the speech content includes a preset voice command; and an area function execution module, configured to execute the area control function.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and the speaking detection module is configured to: when sound zone localization determines that the first sound signal comes from the driver area, determine whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determine that the speaking detection result of the target object is that the driver is in a non-speaking state.
- In a possible implementation, the lip movement recognition module is configured to: perform feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object; fuse the N face features to obtain the face fusion feature of the target object; and perform lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
- In a possible implementation, performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object includes: for the i-th video frame among the N video frames, performing face key point extraction on the face area of the target object in the i-th video frame to obtain multiple face key points, where 1 ≤ i ≤ N; and normalizing the multiple face key points to obtain the i-th face feature of the target object.
- An electronic device is provided, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to execute the above method.
- a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented.
- a computer program including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.
- In embodiments of the present disclosure, the video stream and sound signal in the cabin can be acquired; face detection is performed on each of multiple video frames of the video stream to determine the face area of the object; the lip movement recognition result is determined according to the face areas in N video frames; and whether the object is speaking is judged according to the lip movement recognition result and the sound signal, thereby improving the accuracy of object speaking detection and reducing the false alarm rate of speech recognition.
- Fig. 1 shows a flow chart of a method for detecting a speaking of an object according to an embodiment of the present disclosure.
- FIG. 2 shows a schematic diagram of lip movement recognition in the object speaking detection method according to an embodiment of the present disclosure.
- Fig. 3 shows a block diagram of an object speaking detection device according to an embodiment of the present disclosure.
- Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
- Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
- In in-vehicle voice interaction, the voice detection function usually runs in real time in the vehicle, and its false alarm rate needs to be kept at a very low level.
- In the related art, a detection method based purely on the voice signal is usually used; such a method has difficulty suppressing false voice alarms, resulting in a high false alarm rate and a poor interaction experience.
- In embodiments of the present disclosure, computer vision technology can be used to analyze the video stream in the cabin, determine the object in the video frames, and recognize the object's lip movement state; the lip movement state and the sound signal are then used jointly to determine whether someone is speaking, thereby improving the accuracy of object speaking detection, reducing the false alarm rate of speech recognition, and improving the user experience.
- The object speaking detection method may be performed by an electronic device such as a terminal device or a server; the terminal device may be a vehicle-mounted device, a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a wearable device, etc.
- the method can be implemented by calling a computer-readable instruction stored in a memory by a processor.
- The vehicle-mounted device may be the head unit, a domain controller, or a processor in the cabin, and may also be a device host in the DMS (driver detection system) or OMS (occupant detection system) that executes image and data processing operations.
- Fig. 1 shows a flow chart of the object speaking detection method according to an embodiment of the present disclosure. As shown in Fig. 1, the object speaking detection method includes:
- In step S11, the video stream in the cabin and the sound signal collected by the vehicle-mounted microphone are obtained;
- In step S12, face detection is performed on each of the multiple video frames of the video stream, and the face area of the target object in the vehicle is determined in each video frame;
- In step S13, the lip movement recognition result of the target object's lips is determined according to the face areas of the target object in N video frames among the multiple video frames, where N is an integer greater than 1;
- In step S14, the speaking detection result of the target object is determined according to the lip movement recognition result and the first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
- embodiments of the present disclosure may be applied to any type of vehicle, such as passenger cars, taxis, shared cars, buses, freight vehicles, subways, trains, and the like.
- the video stream in the vehicle cabin may be collected through the vehicle camera, and the sound signal may be collected through the vehicle microphone.
- The vehicle-mounted camera can be any camera installed in the vehicle; the number can be one or more, and the type can be a DMS camera, an OMS camera, an ordinary camera, etc.
- the vehicle-mounted microphone can also be arranged at any position in the vehicle, and the number can be one or more. The present disclosure does not limit the location, quantity and type of the vehicle-mounted camera and the vehicle-mounted microphone.
- multiple video frames may be obtained from the video stream, and the multiple video frames may be a group of continuous video frames in the video stream; they may also be obtained by sampling a video frame sequence of the video stream.
- In step S12, face detection can be performed on each of the multiple video frames to determine the face frame in each video frame; the face frames are then tracked to associate those belonging to the same identity, so as to determine the face frame of each in-vehicle object in each video frame.
- Face detection can be implemented, for example, by face key point recognition or face contour detection; face tracking can, for example, determine objects belonging to the same identity according to the intersection-over-union of the face frames in adjacent video frames.
- face detection and tracking can be implemented in any manner in the related art, which is not limited in the present disclosure.
- Through face detection and tracking, the face area of each object may be obtained.
- The face areas of the target object in N video frames among the multiple video frames may then be determined, where N is an integer greater than 1; that is, N video frames corresponding to a certain duration (for example, 2s) are selected from the multiple video frames for subsequent lip movement detection.
- the N video frames may be the latest N video frames collected in the video stream. N may be, for example, 10, 15, 20, etc., which is not limited in the present disclosure.
- If the lip shape of a person's lips changes within a certain period of time (for example, 2s), it can be considered that the person's lips have lip movement; this period of time is the duration corresponding to the N video frames.
- the duration can be set as 1s, 2s or 3s, for example, which is not limited in the present disclosure.
- the lip movement recognition result of the target object's lips may be determined according to the face area of the target object in the N video frames.
- the lip movement recognition result includes lip movement of the target object or no lip movement.
- For example, the region images of the target object's face areas in the N video frames may be directly input into a preset lip movement recognition network for processing, which outputs the lip movement recognition result; alternatively, feature extraction may be performed on the region images of the target object's face areas in the N video frames to obtain face features, and the face features are input into the preset lip movement recognition network for processing, which outputs the lip movement recognition result.
- the present disclosure does not limit the specific processing manner.
- the lip movement recognition network may be, for example, a convolutional neural network, and the present disclosure does not limit the specific network structure of the lip movement recognition network.
- the speech detection result of the target object may be jointly determined according to the lip movement recognition result of the target object and the first sound signal of the time period corresponding to the N video frames.
- The first sound signal includes the sound signal of the time period corresponding to the N video frames; for example, if the time period corresponding to the N video frames is the latest 2s (from 2s ago to now), the first sound signal is the sound signal of the latest 2s.
- Voice detection is performed on the first sound signal to determine whether the first sound signal includes voice, and the present disclosure does not limit the implementation of voice detection.
- The speaking detection result of the target object includes that the target object is in a speaking state or in a non-speaking state. For example, if the lip movement recognition result is that lip movement occurs and the first sound signal includes speech, the target object is considered to be in a speaking state; if the first sound signal includes speech but the lip movement recognition result is that no lip movement occurs, the target object is considered to be in a non-speaking state; if the lip movement recognition result is that lip movement occurs but the first sound signal does not include speech, the target object is also considered to be in a non-speaking state.
- the present disclosure does not limit the specific determination method.
- In this way, the video stream and sound signal in the cabin are acquired; face detection is performed on each of multiple video frames of the video stream to determine the object's face area; the lip movement recognition result is determined according to the face areas in the N video frames; and whether the object is speaking is judged according to the lip movement recognition result and the sound signal, thereby improving the accuracy of object speaking detection and reducing the false alarm rate of speech recognition.
- In step S11, the video stream in the cabin collected by the vehicle-mounted camera and the sound signal collected by the vehicle-mounted microphone can be obtained. In a possible implementation, step S11 may include the following.
- the on-board camera may include, for example, a driver detection system DMS camera, and/or an occupant detection system OMS camera.
- the video stream collected by the DMS camera is the video stream of the driver's area (called the first video stream), and the video stream collected by the OMS camera is the video stream of the occupant area in the cabin (called the second video stream).
- the video stream acquired in step S11 may include the first video stream and/or the second video stream.
- The first video stream of the driver area and the second video stream of the occupant area in the cabin may also be obtained from a camera installed in the cabin that is not dedicated to driver detection or occupant detection.
- For example, the above-mentioned first video stream may be obtained by cropping the video information of the driver area from the captured video stream.
- In this way, the video streams of different areas in the cabin can be obtained for subsequent separate processing, improving the comprehensiveness of object speaking detection.
- In a possible implementation, performing face detection on each of the multiple video frames of the video stream in step S12 may include: detecting the driver's face based on each of multiple first video frames of the first video stream; and/or detecting faces in the vehicle cabin based on each of multiple second video frames of the second video stream, and determining the driver's face in each second video frame according to the positions of the detected faces in the cabin.
- the first video stream and the second video stream may be processed respectively.
- Multiple first video frames may be obtained from the first video stream; they may be a group of continuous video frames in the first video stream, or a group of video frames obtained by sampling its video frame sequence. The first video frames correspond to the driver area, which includes only the driver.
- face detection and tracking may be performed on each of the multiple first video frames to obtain the driver's face area in each first video frame.
- Similarly, multiple second video frames may be obtained from the second video stream; they may be a group of continuous video frames in the second video stream, or a group of video frames obtained by sampling its video frame sequence. The second video frames correspond to the occupant area in the vehicle cabin, which includes the driver and/or passengers.
- In this case, face detection and tracking can be performed on each of the multiple second video frames to obtain the face area of each occupant in the cabin in each second video frame; according to the position of each occupant's face area in each second video frame, the driver's face and each passenger's face in each second video frame are then determined.
- For example, if the camera capturing the second video stream is installed at the top front of the cabin (such as at the top of the front windshield) facing the rear of the cabin, the face at the lower right position in the second video frame is determined to be the driver's face.
- the present disclosure does not limit the specific manner of determining each occupant.
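- As an illustration only, the position-based driver identification described above can be sketched as follows; the bounding-box format (x, y, w, h) and the pixel coordinate convention (y grows downward) are assumptions, not specifics from the disclosure.

```python
# Hypothetical sketch: pick the driver's face among detected faces by position,
# assuming the camera is mounted at the top front of the cabin so the driver
# appears toward the lower right of the second video frame.

def pick_driver_face(face_boxes):
    """face_boxes: list of (x, y, w, h) tuples in image coordinates.
    Returns the box whose center lies furthest toward the lower right."""
    if not face_boxes:
        return None
    # lower right = large x-center plus large y-center (y grows downward)
    return max(face_boxes, key=lambda b: (b[0] + b[2] / 2) + (b[1] + b[3] / 2))

# e.g. with a driver box and a front-passenger box:
print(pick_driver_face([(100, 300, 80, 80), (400, 320, 80, 80)]))  # right-hand box
```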
- In this way, the driver's face can be determined in the first video stream, and the faces of different occupants (driver and/or passengers) can be determined in the second video stream, so that lip movement recognition and speaking detection can be performed for each occupant and corresponding responses made, further improving the accuracy of speaking detection and making subsequent responses more targeted.
- In a possible implementation, step S13 may include: performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object; fusing the N face features to obtain the face fusion feature of the target object; and performing lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
- feature extraction may be performed on images of face regions of the target object in N video frames to obtain face features of the target object in N video frames.
- the manner of feature extraction may be, for example, human face key point extraction.
- In a possible implementation, the step of respectively performing feature extraction on the face areas of the target object in the N video frames may include: for the i-th video frame among the N video frames, extracting face key points from the face area of the target object in the i-th video frame to obtain multiple face key points (landmarks-i); the number of face key points can be, for example, 106, which is not limited in the present disclosure.
- In a possible implementation, the mean (mean-i) and standard deviation (std-i) of the multiple face key points (landmarks-i) can be computed, and the key points can then be normalized by the following formula (1) to obtain the normalized face key points landmarks-i-normalize:
- landmarks-i-normalize = (landmarks-i − mean-i) / std-i    (1)
- The normalized face key points landmarks-i-normalize can be directly used as the i-th face feature of the target object; alternatively, some key points can be selected from landmarks-i-normalize, such as the key points of the lower half of the face or the mouth key points, as the i-th face feature of the target object. The present disclosure does not limit this.
- In this way, the face features can be made more standardized, improving the accuracy of subsequent lip movement recognition.
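- A minimal sketch of this per-frame normalization, assuming the key points are stored as a NumPy array of (x, y) coordinates and that the mean and standard deviation are taken over all values (the disclosure does not specify the exact statistics layout):

```python
import numpy as np

def normalize_landmarks(landmarks_i: np.ndarray) -> np.ndarray:
    """Apply formula (1): (landmarks-i - mean-i) / std-i."""
    mean_i = landmarks_i.mean()
    std_i = landmarks_i.std()
    return (landmarks_i - mean_i) / std_i

# e.g. 106 face key points, each an (x, y) pixel coordinate
landmarks = np.random.rand(106, 2) * 640
feature_i = normalize_landmarks(landmarks)  # i-th face feature of the target object
```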
- N video frames are respectively processed to obtain N facial features of the target object.
- the first N-1 video frames among the N video frames may have been processed in the previous lip movement recognition.
- In this case, feature extraction can be performed on the face area of the target object in the latest, N-th video frame to obtain the N-th face feature, and the N−1 face features already obtained in the previous processing are read, so as to obtain the N face features of the target object. In this way, the amount of calculation can be reduced and processing efficiency improved.
- In a possible implementation, the N face features may be fused to obtain the face fusion feature of the target object: the face features landmarks-1-normalize, landmarks-2-normalize, ..., landmarks-N-normalize are sorted in frame order and spliced or superimposed to obtain the face fusion feature, denoted face-video-feature.
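- The caching-and-splicing scheme above might look like the following sketch; the window length and concatenation-style fusion come from the text, while the deque-based cache is an implementation assumption:

```python
from collections import deque
from typing import Optional
import numpy as np

N = 15  # window length; the text suggests e.g. 10, 15, or 20 frames
feature_cache: deque = deque(maxlen=N)  # keeps the most recent N per-frame features,
                                        # so the previous N-1 need not be recomputed

def fuse_features(new_feature: np.ndarray) -> Optional[np.ndarray]:
    """Append the newest per-frame feature; once N features are available,
    splice them in frame order into one face-video-feature vector."""
    feature_cache.append(new_feature.ravel())
    if len(feature_cache) < N:
        return None  # not enough frames yet
    return np.concatenate(list(feature_cache))
```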
- lip movement recognition may be performed on face fusion features to obtain a lip movement recognition result of the target object.
- A lip movement recognition neural network can be preset; the face fusion feature is input into the network for processing, and the lip movement recognition result of the target object is output.
- The lip movement recognition neural network can be, for example, a convolutional neural network that includes multiple fully connected layers, a softmax layer, etc., and is used for binary classification of the face fusion feature. The face fusion feature is input into the fully connected layers of the network to obtain a two-dimensional output corresponding to lip movement and no lip movement respectively; after softmax processing, a normalized score or confidence is obtained. A preset threshold (for example, 0.8) may be set for the lip movement score or confidence: if the score exceeds the threshold, it is determined that the target object's lips have lip movement; otherwise, no lip movement occurs.
- The present disclosure does not limit the network structure or training method of the lip movement recognition neural network, nor the specific value of the preset threshold.
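- For illustration only, a stand-in for such a binary lip-motion head is sketched below in PyTorch; the layer sizes, class ordering, and the 0.8 threshold are assumptions or example values from the text, not the disclosed network:

```python
import torch
import torch.nn as nn

class LipMotionHead(nn.Module):
    """Fully connected layers ending in a 2-way softmax
    (lip movement vs. no lip movement), as the text describes."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, fused_feature: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(fused_feature), dim=-1)  # normalized scores

THRESHOLD = 0.8  # example threshold from the text

def has_lip_movement(scores: torch.Tensor) -> bool:
    # assumed ordering: index 0 = lip movement, index 1 = no lip movement
    return scores[..., 0].item() > THRESHOLD

head = LipMotionHead(in_dim=15 * 212)          # e.g. N=15 frames x 106 (x, y) key points
scores = head(torch.randn(15 * 212))           # single fused feature vector
print(has_lip_movement(scores))
```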
- FIG. 2 shows a schematic diagram of lip movement recognition in the object speaking detection method according to an embodiment of the present disclosure.
- As shown in Fig. 2, face detection can be performed on the N video frames respectively to determine the face areas of the target object in the N video frames; the face features of the target object in the N video frames are extracted respectively to obtain N face features; feature fusion is performed on the N face features to obtain the face fusion feature of the target object; and the face fusion feature is input into the lip movement recognition neural network for processing, which outputs the lip movement recognition result of the target object, i.e. lip movement or no lip movement.
- In this way, lip movement recognition of the target object's lips can be realized, assisting in judging whether the object is speaking and improving the accuracy of speaking detection.
- In a possible implementation, step S14 may include: when the lip movement recognition result is that lip movement occurs and the first sound signal includes speech, determining that the target object is in a speaking state.
- That is, the target object is judged to be speaking only when the two conditions of lip movement and detected speech are both met; if only one condition or neither is satisfied, the target object is judged not to be speaking.
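- The joint decision reduces to a simple conjunction; a minimal sketch, with illustrative names for the two detector outputs:

```python
def speaking_detection(lip_movement: bool, voice_present: bool) -> str:
    """lip_movement: output of the lip movement recognizer for the target object.
    voice_present: whether voice activity was detected in the first sound signal.
    The target object is in a speaking state only when both cues agree."""
    return "speaking" if (lip_movement and voice_present) else "not speaking"

# e.g. lip movement detected but no speech in the sound signal -> "not speaking"
print(speaking_detection(True, False))
```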
- In a possible implementation, the object speaking detection method according to embodiments of the present disclosure may further include: when the target object is in a speaking state, performing content recognition on the first sound signal to determine the speech content corresponding to the first sound signal; and when the speech content includes a preset voice command, executing a control function corresponding to the voice command.
- That is, when the target object is judged to be speaking, the speech recognition function can be activated to perform content recognition on the first sound signal and determine the speech content corresponding to the first sound signal; the present disclosure does not limit the implementation of content recognition.
- Various voice commands may be preset; if the recognized speech content includes a preset voice command, the control function corresponding to that voice command can be executed. For example, if the speech content is recognized to include the voice command "play music", the in-vehicle music player can be controlled to play music; if the speech content includes the voice command "open the left window", the left window can be controlled to open.
- In this way, voice interaction with the occupants in the vehicle can be realized, so that users can trigger various control functions by voice, improving convenience and the user experience.
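- A hedged sketch of such command-to-function dispatch; the command strings and handler names are purely illustrative:

```python
def play_music():
    print("in-vehicle music player: play")

def open_left_window():
    print("left window: open")

# preset voice commands mapped to their control functions
VOICE_COMMANDS = {
    "play music": play_music,
    "open the left window": open_left_window,
}

def dispatch(voice_content: str) -> None:
    """Execute the control function whose preset command the
    recognized speech content includes."""
    for command, handler in VOICE_COMMANDS.items():
        if command in voice_content:
            handler()
            return

dispatch("please play music")  # -> in-vehicle music player: play
```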
- In a possible implementation, the identity of the target object may not be distinguished; that is, whenever any target object is judged to be speaking, speech recognition is started and the corresponding control function is executed. Alternatively, the identity of the target object may be distinguished: for example, responding only to the driver's voice, performing voice recognition when the driver is judged to be speaking and not responding to passengers' voices; or, according to the seat area where a passenger is located, performing voice recognition when the passenger is judged to be speaking and executing area control functions for that passenger's seat area.
- the video stream includes the first video stream of the driver's area, and the target object includes the driver.
- In this case, step S14 may include: when sound zone localization on the first sound signal determines that the first sound signal comes from the driver area, determining whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determining that the speaking detection result of the target object is that the driver is in a non-speaking state.
- the first video frame of the first video stream is an image of the driver's area.
- Face detection and tracking are performed in step S12 to obtain the driver's face area in each first video frame.
- lip movement recognition can be performed in step S13 to obtain the driver's lip movement recognition result, that is, lip movement occurs or does not occur.
- sound zone positioning may be performed on the first sound signal to determine an area corresponding to the first sound signal, for example, a driver area or a passenger area.
- the present disclosure does not limit the implementation of sound zone positioning.
- If sound zone localization on the first sound signal determines that the first sound signal comes from the driver area, whether the driver has lip movement can be determined according to the lip movement recognition result; if the driver has no lip movement, it can be judged that the driver is not speaking, that is, the driver is in a non-speaking state, and in this case the voice may not be responded to. Conversely, if the driver has lip movement, the voice recognition function can be activated to respond to the voice.
- the video stream includes a first video stream of a driver area, and/or a second video stream of a passenger area in a vehicle cabin, and the target object includes a driver.
- In step S12, face detection and tracking may be performed on the first video frames of the first video stream to obtain the driver's face area in each first video frame; face detection and tracking can also be carried out on the second video frames of the second video stream to determine the face areas, and the driver's face in each second video frame is determined according to the positions of the face areas.
- Lip movement recognition can then be performed in step S13 to obtain the driver's lip movement recognition result, i.e. lip movement or no lip movement, and the driver's speaking state is determined in step S14; if the driver is speaking, the voice recognition function is started to determine the speech content corresponding to the first sound signal.
- the step of executing the control function corresponding to the voice command may include:
- when the voice command corresponds to multiple directional control functions, determining the gaze direction of the target object according to the face areas of the target object in the N video frames; determining a target control function from the multiple control functions according to the gaze direction of the target object; and executing the target control function.
- a voice command may correspond to multiple control functions with directionality.
- the voice command "open the window” may correspond to the windows in both directions of left and right, and multiple control functions include “open the window on the left”. side window” and “open the right window”; it can also correspond to the windows in the four directions of left front, left rear, right front and right rear.
- the multiple control functions include "open the left front window”, “ Open the front right window”, “Open the rear left window”, “Open the rear right window”.
- the corresponding control function can be determined in conjunction with image recognition.
- the gaze direction of the target object may be determined according to the face areas of the target object in N video frames.
- feature extraction can be performed on the images of the face areas of the target object in N video frames, respectively, to obtain the face features of the target object in N video frames; the N face features are fused , to obtain the face fusion feature of the target object; input the face fusion feature into the preset gaze direction recognition network for processing, and obtain the gaze direction of the target object, that is, the gaze direction of the target object's eyes.
- the gaze direction recognition network may be, for example, a convolutional neural network, including a convolutional layer, a fully connected layer, a softmax layer, and the like.
- the disclosure does not limit the network structure and training method of the gaze direction recognition network.
- The target control function may be determined from the multiple control functions according to the gaze direction of the target object. For example, if the voice command is "open the window" and the gaze direction of the target object is determined to be toward the right, the target control function may be determined to be "open the right window"; the targeted control function can then be executed, such as opening the right window.
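- A sketch of this disambiguation step, assuming the gaze recognizer emits a coarse direction label such as "left" or "right" (the disclosure does not fix the label set):

```python
from typing import Optional

# (voice command, gaze direction) -> target control function; entries illustrative
DIRECTIONAL_FUNCTIONS = {
    ("open the window", "left"): "open the left window",
    ("open the window", "right"): "open the right window",
}

def select_target_function(voice_command: str, gaze_direction: str) -> Optional[str]:
    """Pick one control function among the directional candidates
    using the target object's gaze direction."""
    return DIRECTIONAL_FUNCTIONS.get((voice_command, gaze_direction))

print(select_target_function("open the window", "right"))  # open the right window
```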
- the video stream includes a first video stream of the driver area, and/or a second video stream of the passenger area in the vehicle cabin, and the target objects include the driver and/or the passenger.
- the object speaking detection method according to the embodiment of the present disclosure may further include:
- a first seating area of the target object is determined according to each of the plurality of video frames.
- In this case, the seat area of each occupant (referred to as the first seat area) can be determined according to the position of each occupant's face area in each of the multiple video frames.
- the first seat area includes the driver area or the passenger area.
- The passenger area may include the front passenger area, the left rear seat area, the right rear seat area, etc.; the present disclosure does not limit the division of seat areas.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and determining the first seat area of the target object includes: in response to detecting a face in the driver area according to each video frame, determining that the first seat area of the target object is the driver area.
- In a possible implementation, the video stream includes a second video stream of the occupant area in the vehicle cabin, the target object includes the driver and/or passengers, and determining the first seat area of the target object according to each of the multiple video frames includes: performing face detection on each of the multiple video frames; and determining the first seat area of the target object according to the detected face positions.
- That is, when the video stream includes the first video stream of the driver area, if a face is detected in the driver area in each video frame, the first seat area of the target object is determined to be the driver area, completing the determination of the first seat area.
- When the video stream includes the second video stream of the occupant area, face detection and tracking can be performed on each of the multiple video frames to obtain the face area of each occupant in each video frame; the first seat area of the target object can then be determined according to the detected face positions.
- For example, the face at the lower right position in the video frame can be determined as the driver's face, and the first seat area of that target object is determined to be the driver area; the face at the lower left position in the video frame is determined as the front passenger's face, and the first seat area of that target object is determined to be the front passenger area.
- the present disclosure does not limit the specific manner of determining each occupant.
- In this way, the speaking state of the target object can be jointly judged based on the first seat area in step S14.
- In a possible implementation, step S14 may include: when the first sound signal includes speech, performing sound zone localization on the first sound signal to determine the second seat area corresponding to the first sound signal; and when the lip movement recognition result is that lip movement occurs and the first seat area is the same as the second seat area, determining that the target object is in a speaking state.
- That is, if the first sound signal includes speech, sound zone localization can be performed on the first sound signal to determine the seat area corresponding to the first sound signal (referred to as the second seat area), such as the driver area or a passenger area.
- the present disclosure does not limit the implementation of sound zone positioning.
- If the lip movement recognition result of the target object is that lip movement occurs, and the first seat area is the same as the second seat area (that is, the seat area located through the image is consistent with the seat area located through the sound zone), it can be determined that the target object is in a speaking state; the voice recognition function can then be activated to respond to the voice.
- In this way, when the seat area located through the image is consistent with the seat area located through the sound zone and lip movement occurs, the object is judged to be speaking, which further improves the accuracy of object speaking detection; the speaking object can also be located so as to execute targeted control functions, further improving user convenience.
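- The stricter cross-check reduces to a conjunction over both localization results; a minimal sketch, with illustrative area names:

```python
def speaking_with_zone_check(lip_movement: bool,
                             first_seat_area: str,
                             second_seat_area: str) -> bool:
    """The target object is judged to be speaking only if lip movement is
    detected AND the image-located seat area (first) matches the
    sound-zone-located seat area (second)."""
    return lip_movement and first_seat_area == second_seat_area

print(speaking_with_zone_check(True, "driver", "driver"))     # True
print(speaking_with_zone_check(True, "driver", "left rear"))  # False
```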
- In a possible implementation, the object speaking detection method according to embodiments of the present disclosure may further include: when the target object is in a speaking state, performing content recognition on the first sound signal to determine the speech content corresponding to the first sound signal; when the speech content includes a preset voice command, determining the area control function corresponding to the voice command according to the first seat area of the target object; and executing the area control function.
- That is, when the target object is judged to be speaking, the speech recognition function can be activated to perform content recognition on the first sound signal and determine the speech content corresponding to the first sound signal; the present disclosure does not limit the implementation of content recognition.
- The area control function corresponding to the voice command may be determined according to the first seat area of the target object. For example, if the speech content is recognized to include the voice command "open the window" and the first seat area of the target object is the left rear seat area, the corresponding area control function can be determined to be "open the left rear window"; this area control function can then be executed, for example controlling the left rear window to open.
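- A sketch of resolving the area control function from the command plus the speaker's seat area; the command strings and area names are illustrative only:

```python
# (voice command, first seat area) -> area control function
AREA_FUNCTIONS = {
    ("open the window", "driver"): "open the driver-side window",
    ("open the window", "left rear"): "open the left rear window",
    ("open the window", "right rear"): "open the right rear window",
}

def area_control(voice_command: str, first_seat_area: str) -> None:
    """Execute the control function scoped to the speaker's own seat area."""
    function = AREA_FUNCTIONS.get((voice_command, first_seat_area))
    if function is not None:
        print(f"executing: {function}")

area_control("open the window", "left rear")  # executing: open the left rear window
```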
- In embodiments of the present disclosure, the video stream and sound signal in the cabin can be obtained; face detection is performed on each of multiple video frames of the video stream to determine the object's face area; the lip movement recognition result is determined according to the face areas in the N video frames; and whether the object is speaking is judged according to the lip movement recognition result and the sound signal, thereby improving the accuracy of object speaking detection and reducing the false alarm rate of speech recognition.
- The object speaking detection method can be applied to an intelligent cabin perception system: it analyzes the lip movement state of the occupants in the cabin through intelligent video analysis algorithms and judges, together with the sound signal, whether someone in the cabin is speaking, thereby effectively avoiding false alarms caused by relying solely on the sound signal, ensuring that voice recognition is triggered normally, and improving the interaction experience.
- It is understood that the present disclosure also provides an object speaking detection apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any object speaking detection method provided in the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.
- FIG. 3 shows a block diagram of an object speaking detection device according to an embodiment of the present disclosure. As shown in FIG. 3 , the device includes:
- a signal acquisition module 31, configured to acquire the video stream in the cabin and the sound signal collected by the vehicle-mounted microphone;
- a face detection module 32, configured to perform face detection on each of multiple video frames of the video stream and determine the face area of the target object in the vehicle in each video frame;
- a lip movement recognition module 33, configured to determine the lip movement recognition result of the target object's lips according to the face areas of the target object in N video frames among the multiple video frames, where N is an integer greater than 1;
- a speaking detection module 34, configured to determine the speaking detection result of the target object according to the lip movement recognition result and the first sound signal, wherein the first sound signal includes the sound signal in the time period corresponding to the N video frames, and the speaking detection result includes that the target object is in a speaking state or in a non-speaking state.
- In a possible implementation, the speaking detection module is configured to: determine that the target object is in a speaking state when the lip movement recognition result is that lip movement occurs and the first sound signal includes speech.
- the device further includes:
- a content recognition module, configured to perform content recognition on the first sound signal when the target object is in a speaking state and determine the speech content corresponding to the first sound signal;
- a function execution module, configured to execute a control function corresponding to the voice command when the speech content includes a preset voice command.
- In a possible implementation, the target object includes the driver, wherein the function execution module is configured to: when the voice command corresponds to multiple directional control functions, determine the gaze direction of the target object according to the face areas of the target object in the N video frames; determine a target control function from the multiple control functions according to the gaze direction; and execute the target control function.
- In a possible implementation, the video stream includes a first video stream of the driver area and/or a second video stream of the occupant area in the cabin, and the face detection module is configured to: detect the driver's face based on each of multiple first video frames of the first video stream; and/or detect faces in the vehicle cabin based on each of multiple second video frames of the second video stream, and determine the driver's face in each second video frame according to the positions of the detected faces in the cabin.
- In a possible implementation, the signal acquisition module is configured to: acquire the first video stream of the driver area captured by a camera of the driver detection system (DMS); and/or acquire the second video stream of the occupant area in the cabin captured by a camera of the occupant detection system (OMS).
- In a possible implementation, the apparatus further includes: a seat area determination module, configured to determine the first seat area of the target object according to each of the multiple video frames; wherein the speaking detection module is configured to: when the first sound signal includes speech, perform sound zone localization on the first sound signal to determine the second seat area corresponding to the first sound signal; and when the lip movement recognition result is that lip movement occurs and the first seat area is the same as the second seat area, determine that the target object is in a speaking state.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and the seat area determination module is configured to: in response to detecting a face in the driver area according to each video frame, determine that the first seat area of the target object is the driver area; and/or the video stream includes a second video stream of the occupant area in the vehicle cabin, the target object includes the driver and/or passengers, and the seat area determination module is configured to: perform face detection on each of the multiple video frames; and determine the first seat area of the target object according to the detected face positions.
- In a possible implementation, the apparatus further includes: a content recognition module, configured to perform content recognition on the first sound signal when the target object is in a speaking state and determine the speech content corresponding to the first sound signal; a function determination module, configured to determine the area control function corresponding to the voice command according to the first seat area of the target object when the speech content includes a preset voice command; and an area function execution module, configured to execute the area control function.
- In a possible implementation, the video stream includes a first video stream of the driver area, the target object includes the driver, and the speaking detection module is configured to: when sound zone localization determines that the first sound signal comes from the driver area, determine whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determine that the speaking detection result of the target object is that the driver is in a non-speaking state.
- In a possible implementation, the lip movement recognition module is configured to: perform feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object; fuse the N face features to obtain the face fusion feature of the target object; and perform lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
- In a possible implementation, performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object includes: for the i-th video frame among the N video frames, performing face key point extraction on the face area of the target object in the i-th video frame to obtain multiple face key points, where 1 ≤ i ≤ N; and normalizing the multiple face key points to obtain the i-th face feature of the target object.
- In some embodiments, the functions or modules of the apparatus provided by the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above; for specific implementations, refer to the descriptions of the method embodiments above, which are not repeated here for brevity.
- Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
- Computer readable storage media may be volatile or nonvolatile computer readable storage media.
- An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.
- An embodiment of the present disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
- An embodiment of the present disclosure also provides a computer program, including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.
- Electronic devices may be provided as terminals, servers, or other forms of devices.
- FIG. 4 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure.
- the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
- electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814 , and the communication component 816.
- the processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
- the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802 .
- the memory 804 is configured to store various types of data to support operations at the electronic device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like.
- the memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
- the power supply component 806 provides power to various components of the electronic device 800 .
- Power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 800 .
- the multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
- the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
- the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may have a fixed optical lens system or have focus and optical zoom capability.
- the audio component 810 is configured to output and/or input audio signals.
- the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 804 or sent via communication component 816 .
- the audio component 810 also includes a speaker for outputting audio signals.
- the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
- Sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of electronic device 800 .
- the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800.
- Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
- Sensor assembly 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications.
- the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
- the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
- the electronic device 800 can access a wireless network based on a communication standard, such as wireless fidelity (WiFi), second-generation mobile communication technology (2G), or third-generation mobile communication technology (3G), or a combination thereof.
- the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
- the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication.
- the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.
- the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
- a non-volatile computer-readable storage medium such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.
- FIG. 5 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
- electronic device 1900 may be provided as a server.
- electronic device 1900 includes processing component 1922 , which further includes one or more processors, and a memory resource represented by memory 1932 for storing instructions executable by processing component 1922 , such as application programs.
- the application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions.
- the processing component 1922 is configured to execute instructions to perform the above method.
- Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input-output (I/O) interface 1958 .
- the electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Microsoft's server operating system (Windows Server™), Apple's graphical user interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
- a non-transitory computer-readable storage medium such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.
- the present disclosure can be a system, method and/or computer program product.
- a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
- a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
- a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- Computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the above.
- computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
- Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
- the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
- Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
- Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
- in some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
- These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- These computer-readable program instructions can also be stored in a computer-readable storage medium, where they cause computers, programmable data processing apparatuses, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
- each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
- the computer program product can be specifically realized by means of hardware, software or a combination thereof.
- the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
Claims (16)
- 1. A method for detecting object speaking, comprising: acquiring a video stream in a vehicle cabin and a sound signal collected by a vehicle-mounted microphone; performing face detection on each of a plurality of video frames of the video stream, and determining a face area of a target object in the vehicle in each video frame; determining a lip movement recognition result of the target object's lips according to the face areas of the target object in N of the video frames, N being an integer greater than 1; and determining a speaking detection result of the target object according to the lip movement recognition result and a first sound signal, wherein the first sound signal comprises the sound signal in a time period corresponding to the N video frames, and the speaking detection result comprises the target object being in a speaking state or in a non-speaking state.
- 2. The method according to claim 1, wherein determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal comprises: determining that the target object is in a speaking state in the case where the lip movement recognition result is that lip movement occurs and the first sound signal includes voice.
- 3. The method according to claim 1 or 2, further comprising: in the case where the target object is in a speaking state, performing content recognition on the first sound signal and determining voice content corresponding to the first sound signal; and in the case where the voice content includes a preset voice command, executing a control function corresponding to the voice command.
- 4. The method according to claim 3, wherein the target object comprises a driver, and executing the control function corresponding to the voice command in the case where the voice content includes the preset voice command comprises: in the case where the voice command corresponds to a plurality of directional control functions, determining a gaze direction of the target object according to the face areas of the target object in the N video frames; determining a target control function from the plurality of control functions according to the gaze direction of the target object; and executing the target control function.
- 5. The method according to any one of claims 1 to 4, wherein the video stream comprises a first video stream of a driver area and/or a second video stream of an occupant area in the vehicle cabin; and performing face detection on each of the plurality of video frames of the video stream comprises: detecting the driver's face based on each of a plurality of first video frames of the first video stream; and/or detecting faces in the vehicle cabin based on each of a plurality of second video frames of the second video stream, and determining the driver's face in each second video frame according to the positions of the detected faces in the vehicle cabin.
- 6. The method according to any one of claims 1 to 5, wherein acquiring the video stream in the vehicle cabin comprises: acquiring a first video stream of the driver area collected by a driver detection system (DMS) camera; and/or acquiring a second video stream of the occupant area in the vehicle cabin collected by an occupant detection system (OMS) camera.
- 7. The method according to any one of claims 1 to 6, further comprising: determining a first seat area of the target object according to each of the plurality of video frames; wherein determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal comprises: in the case where the first sound signal includes voice, performing sound zone localization on the first sound signal and determining a second seat area corresponding to the first sound signal; and in the case where the lip movement recognition result is that lip movement occurs and the first seat area is the same as the second seat area, determining that the target object is in a speaking state.
- 8. The method according to claim 7, wherein the video stream comprises a first video stream of a driver area, the target object comprises a driver, and determining the first seat area of the target object according to each of the plurality of video frames comprises: in response to detecting a face in the driver area according to each video frame, determining that the first seat area of the target object is the driver area; and/or the video stream comprises a second video stream of an occupant area in the vehicle cabin, the target object comprises a driver and/or a passenger, and determining the first seat area of the target object according to each of the plurality of video frames comprises: performing face detection on each of the plurality of video frames; and determining the first seat area of the target object according to the position of the detected face.
- 9. The method according to claim 7 or 8, further comprising: in the case where the target object is in a speaking state, performing content recognition on the first sound signal and determining voice content corresponding to the first sound signal; in the case where the voice content includes a preset voice command, determining an area control function corresponding to the voice command according to the first seat area of the target object; and executing the area control function.
- 10. The method according to any one of claims 1 to 6, wherein the video stream comprises a first video stream of a driver area and the target object comprises a driver; and determining the speaking detection result of the target object according to the lip movement recognition result and the first sound signal comprises: in the case where sound zone localization performed on the first sound signal determines that the first sound signal comes from the driver area, determining whether the driver has lip movement according to the lip movement recognition result; and in response to the driver having no lip movement, determining that the speaking detection result of the target object is that the driver is in a non-speaking state.
- 11. The method according to any one of claims 1 to 10, wherein determining the lip movement recognition result of the target object's lips according to the face areas of the target object in the N video frames comprises: performing feature extraction on the face areas of the target object in the N video frames respectively to obtain N face features of the target object; fusing the N face features to obtain a face fusion feature of the target object; and performing lip movement recognition on the face fusion feature to obtain the lip movement recognition result of the target object.
- 12. The method according to claim 11, wherein performing feature extraction on the face areas of the target object in the N video frames respectively to obtain the N face features of the target object comprises: for the i-th video frame of the N video frames, performing face key point extraction on the face area of the target object in the i-th video frame to obtain a plurality of face key points, where 1 ≤ i ≤ N; and normalizing the plurality of face key points to obtain the i-th face feature of the target object.
- 13. A device for detecting object speaking, comprising: a signal acquisition module configured to acquire a video stream in a vehicle cabin and a sound signal collected by a vehicle-mounted microphone; a face detection module configured to perform face detection on each of a plurality of video frames of the video stream and determine a face area of a target object in the vehicle in each video frame; a lip movement recognition module configured to determine a lip movement recognition result of the target object's lips according to the face areas of the target object in N of the plurality of video frames, N being an integer greater than 1; and a speaking detection module configured to determine a speaking detection result of the target object according to the lip movement recognition result and a first sound signal, wherein the first sound signal comprises the sound signal in a time period corresponding to the N video frames, and the speaking detection result comprises the target object being in a speaking state or in a non-speaking state.
- 14. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 12.
- 15. A computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 12.
- 16. A computer program, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method according to any one of claims 1 to 12.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110735963.6A CN113486760A (zh) | 2021-06-30 | 2021-06-30 | Object speaking detection method and device, electronic device, and storage medium |
| CN202110735963.6 | 2021-06-30 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023273064A1 true WO2023273064A1 (zh) | 2023-01-05 |
Family
ID=77937073
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/127097 Ceased WO2023273064A1 (zh) | 2021-06-30 | 2021-10-28 | 对象说话检测方法及装置、电子设备和存储介质 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN113486760A (zh) |
| WO (1) | WO2023273064A1 (zh) |
Families Citing this family (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113486760A (zh) * | 2021-06-30 | 2021-10-08 | 上海商汤临港智能科技有限公司 | 对象说话检测方法及装置、电子设备和存储介质 |
| CN114299944B (zh) * | 2021-12-08 | 2023-03-24 | 天翼爱音乐文化科技有限公司 | 视频处理方法、系统、装置及存储介质 |
| CN114170559A (zh) * | 2021-12-14 | 2022-03-11 | 北京地平线信息技术有限公司 | 车载设备的控制方法、装置和车辆 |
| CN114615534B (zh) * | 2022-01-27 | 2024-11-15 | 海信视像科技股份有限公司 | 显示设备及音频处理方法 |
| CN115410566A (zh) * | 2022-03-10 | 2022-11-29 | 北京罗克维尔斯科技有限公司 | 一种车辆控制方法、装置、设备及存储介质 |
| CN114734942A (zh) * | 2022-04-01 | 2022-07-12 | 深圳地平线机器人科技有限公司 | 调节车载音响音效的方法及装置 |
| CN115063867A (zh) * | 2022-06-30 | 2022-09-16 | 上海商汤临港智能科技有限公司 | 说话状态识别方法及模型训练方法、装置、车辆、介质 |
| CN115361481B (zh) * | 2022-08-01 | 2025-04-08 | 北京达佳互联信息技术有限公司 | 提示文本显示方法、装置、电子设备及存储介质 |
| CN115880744B (zh) * | 2022-08-01 | 2023-10-20 | 北京中关村科金技术有限公司 | 一种基于唇动的视频角色识别方法、装置及存储介质 |
| CN116259100B (zh) * | 2022-09-28 | 2024-09-03 | 北京中关村科金技术有限公司 | 基于唇动跟踪的识别方法、装置、存储介质及电子设备 |
| CN115909128B (zh) * | 2022-10-17 | 2025-10-10 | 北京达佳互联信息技术有限公司 | 视频识别方法、装置、电子设备及存储介质 |
| CN116013298A (zh) * | 2023-01-03 | 2023-04-25 | 重庆长安汽车股份有限公司 | 一种车内免唤醒语音交互方法、装置、设备及存储介质 |
| CN116206614A (zh) * | 2023-01-30 | 2023-06-02 | 北京百度网讯科技有限公司 | 一种对话检测方法及装置 |
| CN118571219B (zh) * | 2024-08-02 | 2024-10-15 | 成都赛力斯科技有限公司 | 座舱内人员对话增强方法、装置、设备及存储介质 |
| CN119479649B (zh) * | 2025-01-13 | 2025-07-22 | 比亚迪股份有限公司 | 车辆控制方法、装置、存储介质、控制器及车辆 |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108320739B (zh) * | 2017-12-22 | 2022-03-01 | 景晖 | Method and device for assisting voice command recognition based on position information |
| CN108831462A (zh) * | 2018-06-26 | 2018-11-16 | 北京奇虎科技有限公司 | Vehicle-mounted speech recognition method and device |
| CN110857067B (zh) * | 2018-08-24 | 2023-04-07 | 上海汽车集团股份有限公司 | Human-vehicle interaction device and human-vehicle interaction method |
| CN109814448A (zh) * | 2019-01-16 | 2019-05-28 | 北京七鑫易维信息技术有限公司 | Vehicle-mounted multi-modal control method and system |
| CN110545396A (zh) * | 2019-08-30 | 2019-12-06 | 上海依图信息技术有限公司 | Speech recognition method and device based on localization denoising |
| CN110503957A (zh) * | 2019-08-30 | 2019-11-26 | 上海依图信息技术有限公司 | Speech recognition method and device based on image denoising |
| CN111240477A (zh) * | 2020-01-07 | 2020-06-05 | 北京汽车研究总院有限公司 | Vehicle-mounted human-computer interaction method and system, and vehicle having the system |
| CN111341350A (zh) * | 2020-01-18 | 2020-06-26 | 南京奥拓电子科技有限公司 | Human-computer interaction control method and system, intelligent robot, and storage medium |
| WO2021217572A1 (zh) * | 2020-04-30 | 2021-11-04 | 华为技术有限公司 | In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device, and vehicle |
| CN112102546A (zh) * | 2020-08-07 | 2020-12-18 | 浙江大华技术股份有限公司 | Human-computer interaction control method, intercom call method, and related device |
| CN112026790B (zh) * | 2020-09-03 | 2022-04-15 | 上海商汤临港智能科技有限公司 | Control method and device for vehicle-mounted robot, vehicle, electronic device, and medium |
| CN112397065A (zh) * | 2020-11-04 | 2021-02-23 | 深圳地平线机器人科技有限公司 | Voice interaction method and device, computer-readable storage medium, and electronic device |
- 2021-06-30: CN application CN202110735963.6A filed; published as CN113486760A (status: pending)
- 2021-10-28: PCT application PCT/CN2021/127097 filed; published as WO2023273064A1 (status: ceased)
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110071830A1 (en) * | 2009-09-22 | 2011-03-24 | Hyundai Motor Company | Combined lip reading and voice recognition multimodal interface system |
| CN105700676A (zh) * | 2014-12-11 | 2016-06-22 | 现代自动车株式会社 | Wearable glasses, control method thereof, and vehicle control system |
| CN109410957A (zh) * | 2018-11-30 | 2019-03-01 | 福建实达电脑设备有限公司 | Computer vision-assisted frontal human-computer interaction speech recognition method and system |
| CN110544491A (zh) * | 2019-08-30 | 2019-12-06 | 上海依图信息技术有限公司 | Method and device for associating a speaker with the speaker's speech recognition result in real time |
| CN110750152A (zh) * | 2019-09-11 | 2020-02-04 | 云知声智能科技股份有限公司 | Human-computer interaction method and system based on lip movement |
| CN113486760A (zh) * | 2021-06-30 | 2021-10-08 | 上海商汤临港智能科技有限公司 | Object speaking detection method and device, electronic device, and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113486760A (zh) | 2021-10-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2023273064A1 (zh) | | Object speaking detection method and device, electronic device, and storage medium |
| CN113488043B (zh) | | Occupant speaking detection method and device, electronic device, and storage medium |
| WO2022183661A1 (zh) | | Event detection method and device, electronic device, storage medium, and program product |
| CN112124073B (zh) | | Intelligent driving control method and device based on alcohol detection |
| JP7526897B2 (ja) | | Vehicle cabin occupant detection method and device, electronic device, and storage medium |
| CN112026790B (zh) | | Control method and device for vehicle-mounted robot, vehicle, electronic device, and medium |
| WO2022048119A1 (zh) | | Vehicle control method and device, electronic device, storage medium, and vehicle |
| CN113989889B (zh) | | Sun visor adjustment method and device, electronic device, and storage medium |
| JP7594677B2 (ja) | | Dangerous behavior identification method and device, electronic device, and storage medium |
| US10108334B2 (en) | | Gesture device, operation method for same, and vehicle comprising same |
| US20220206567A1 (en) | | Method and apparatus for controlling vehicle display screen, and storage medium |
| WO2023071174A1 (zh) | | In-vehicle person detection method and device, electronic device, and storage medium |
| WO2023231211A1 (zh) | | Speech recognition method and device, electronic device, storage medium, and product |
| WO2022183663A1 (zh) | | Event detection method and device, electronic device, storage medium, and program product |
| CN114495073A (zh) | | Hands-off steering wheel detection method and device, electronic device, and storage medium |
| CN114332941A (zh) | | Alarm prompting method and device based on ride object detection, and electronic device |
| CN114005103B (zh) | | Method and device for associating people and objects in a vehicle, electronic device, and storage medium |
| CN117666794A (zh) | | Vehicle interaction method and device, electronic device, and storage medium |
| HK40051880A (zh) | | Object speaking detection method and device, electronic device, and storage medium |
| HK40051878A (zh) | | Occupant speaking detection method and device, electronic device, and storage medium |
| CN114708577A (zh) | | Vehicle window control method and device, electronic device, and storage medium |
| CN113449693A (zh) | | Elevator riding behavior detection method and device, electronic device, and storage medium |
| CN113361361B (zh) | | Method and device for interacting with occupants, vehicle, electronic device, and storage medium |
| CN115848111B (zh) | | Vehicle sunroof control method and device, electronic device, and readable storage medium |
| WO2024046353A2 (en) | | Presentation control method, device for in-vehicle glass of vehicle, and vehicle |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21947991; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31/05/2024) |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21947991; Country of ref document: EP; Kind code of ref document: A1 |