CN107705785A

CN107705785A - Sound source localization method for a smart speaker, smart speaker, and computer-readable medium

Info

Publication number: CN107705785A
Application number: CN201710647123.8A
Authority: CN
Inventors: 高聪
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-08-01
Filing date: 2017-08-01
Publication date: 2018-02-16

Abstract

The invention provides a sound source localization method for a smart speaker, a smart speaker, and a computer-readable medium. The method includes: when it is determined that the voice signal emitted by a target sound source needs to be collected, acquiring the first voice signal, carrying a preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker; acquiring, for each signal receiving module pair, the time difference between the two signal receiving modules of the pair in receiving the first voice signal; and determining the direction of the target sound source that emitted the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair receives it. The technical solution of the invention can locate the target sound source in a scene with many sound sources, so that the smart speaker can collect only the voice signal of the target sound source from the located direction and then serve the user corresponding to that target sound source. It also effectively enriches the functions of the smart speaker, making it more flexible and convenient to use.

Description

Sound source localization method for smart speaker, smart speaker and computer readable medium

【Technical field】

The invention relates to the technical field of computer applications, and in particular to a sound source localization method for a smart speaker, a smart speaker, and a computer-readable medium.

【Background】

With the development of technology, smart devices are gradually entering users' homes and forming a smart home environment. A smart speaker, as one smart device in the smart home, can help users look up music, check the weather, chat, hold a dialogue, and so on. A smart speaker therefore needs functions such as speech recognition, semantic parsing, content services, response generation, and text-to-speech (TTS) broadcast feedback.

In the prior art, every smart speaker is provided with a default wake-up word, and the smart speaker can stay dormant when it is not working. When the user needs the smart speaker to start, the user speaks the wake-up word; the smart speaker detects its own wake-up word, wakes, and enters the working state. It then receives the voice query input by the user, performs speech recognition and semantic parsing, looks up the result of the voice query, generates the wording of the feedback information from the query result, converts that wording with TTS, and broadcasts the feedback information to the user by voice.

However, in the prior art the smart speaker cannot localize a sound source. If several users around the smart speaker call it at the same time, they act as several sound sources, which confuses the smart speaker's operation so that the voice query cannot be completed.

【Summary of the invention】

The invention provides a sound source localization method for a smart speaker, a smart speaker, and a computer-readable medium, which enable the smart speaker to localize a sound source.

The invention provides a sound source localization method for a smart speaker, the method comprising:

when it is determined that the voice signal emitted by a target sound source needs to be collected, acquiring the first voice signal, carrying a preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker, the preset wake-up word being used by the target sound source to wake up the smart speaker;

acquiring the time difference between the two signal receiving modules of each signal receiving module pair in receiving the first voice signal;

determining the direction of the target sound source that emitted the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal.

Further optionally, in the method described above, after determining the direction of the target sound source that emitted the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal, the method further comprises:

rotating a positioning indication mark to the direction of the target sound source and/or lighting a positioning indicator lamp toward the direction of the target sound source, to inform the user corresponding to the target sound source that the direction of the target sound source has been determined.

Further optionally, in the method described above, acquiring the time difference between the two signal receiving modules of each signal receiving module pair in receiving the first voice signal specifically comprises:

taking the first signal receiving module pair of the at least two signal receiving module pairs as the reference, selecting a candidate direction θ of the target sound source;

for each signal receiving module pair, obtaining, according to the candidate direction θ of the target sound source, the time difference t0 between the two signal receiving modules of that pair in receiving the first voice signal, where t0 is a function of θ.

Further optionally, in the method described above, determining the direction of the target sound source that emitted the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal specifically comprises:

aligning in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of that pair;

calculating, for each signal receiving module pair, the correlation of the two aligned first voice signals;

summing the correlations of all signal receiving module pairs to obtain a total correlation;

taking the candidate direction θ at which the total correlation reaches its maximum as the target direction of the target sound source.

Further optionally, in the method described above, aligning in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of that pair specifically comprises:

for each signal receiving module pair, delaying the earlier-received of the two first voice signals by the time difference t0, or advancing the later-received of the two first voice signals by the time difference t0, so that the two first voice signals are aligned in time.

Further optionally, in the method described above, before acquiring the first voice signal, carrying the preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker, the method further comprises:

determining that the voice signal emitted by the target sound source needs to be collected;

Further optionally, determining that the voice signal emitted by the target sound source needs to be collected specifically comprises:

acquiring the first voice signal, carrying the preset wake-up word, emitted by the target sound source;

extracting the preset wake-up word from the first voice signal;

extracting the voiceprint feature of the first voice signal;

judging, according to a pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, whether the preset wake-up word and the voiceprint feature of the first voice signal match;

if they match, determining that the voice signal emitted by the target sound source needs to be collected.

Further optionally, in the method described above, before acquiring the first voice signal, carrying the preset wake-up word, emitted by the target sound source, the method further comprises:

receiving a second voice signal, carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

extracting the preset wake-up word from the second voice signal;

extracting the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

establishing and storing the correspondence between the preset wake-up word and the voiceprint feature of the target sound source.
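The enrollment and matching steps above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how voiceprint features are represented or compared, so the fixed-length feature vectors, the cosine-similarity test, the `threshold` value, and the names `enroll` and `should_collect` are all hypothetical.

```python
import math

# Hypothetical store: preset wake-up word -> enrolled voiceprint feature vector.
voiceprint_store = {}

def enroll(wake_word, features):
    """Store the correspondence between a wake-up word and a voiceprint (second voice signal)."""
    voiceprint_store[wake_word] = list(features)

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def should_collect(wake_word, features, threshold=0.8):
    """Return True if the first voice signal's voiceprint matches the enrolled one.

    The similarity measure and threshold are assumptions, not taken from the source.
    """
    enrolled = voiceprint_store.get(wake_word)
    if enrolled is None:
        return False
    return cosine_similarity(enrolled, features) >= threshold
```

In this sketch, a call to `enroll` corresponds to the second voice signal, and `should_collect` is the check applied to each later first voice signal before localization starts.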

The invention provides a smart speaker, the smart speaker comprising:

a signal acquisition module, configured to acquire, when it is determined that the voice signal emitted by a target sound source needs to be collected, the first voice signal, carrying a preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker, the preset wake-up word being used by the target sound source to wake up the smart speaker;

a time difference acquisition module, configured to acquire the time difference between the two signal receiving modules of each signal receiving module pair in receiving the first voice signal;

a positioning module, configured to determine the direction of the target sound source that emitted the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal.

Further optionally, the smart speaker described above further comprises:

a positioning indication module, configured to rotate a positioning indication mark to the direction of the target sound source and/or light a positioning indicator lamp toward the direction of the target sound source, to inform the user corresponding to the target sound source that the direction of the target sound source has been determined.

Further optionally, in the smart speaker described above, the time difference acquisition module is specifically configured to:

Further optionally, in the smart speaker described above, the positioning module is specifically configured to:

Further optionally, in the smart speaker described above, the positioning module is specifically configured to delay, for each signal receiving module pair, the earlier-received of the two first voice signals by the time difference t0, or to advance the later-received of the two first voice signals by the time difference t0, so that the two first voice signals are aligned in time.

Further optionally, in the smart speaker described above, the smart speaker further comprises:

a determination module, configured to determine that the voice signal emitted by the target sound source needs to be collected;

Further, the determination module is specifically configured to:

a receiving module, configured to receive a second voice signal, carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

an extraction module, configured to extract the preset wake-up word from the second voice signal;

the extraction module being further configured to extract the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

an establishment module, configured to establish and store the correspondence between the preset wake-up word and the voiceprint feature of the target sound source.

The invention also provides a smart speaker comprising a plurality of microphones for sending and receiving signals; the smart speaker further comprises:

one or more processors;

a memory for storing one or more programs,

where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the sound source localization method for a smart speaker described above.

The invention also provides a computer-readable medium storing a computer program which, when executed by a processor, implements the sound source localization method for a smart speaker described above.

With the sound source localization method for a smart speaker, the smart speaker, and the computer-readable medium of the invention, when it is determined that the voice signal emitted by a target sound source needs to be collected, the first voice signal, carrying a preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker is acquired; the time difference between the two signal receiving modules of each pair in receiving the first voice signal is acquired; and the direction of the target sound source that emitted the first voice signal is determined according to the first voice signal and the time difference with which each pair receives it. The technical solution of the invention can locate the target sound source in a scene with many sound sources, so that the smart speaker can collect only the voice signal of the target sound source from the located direction and then serve the user corresponding to that target sound source. It also effectively enriches the functions of the smart speaker, making it more flexible and convenient to use.

【Description of drawings】

FIG. 1 is a flowchart of Embodiment 1 of the sound source localization method for a smart speaker of the invention.

FIG. 2 is a diagram of one application scenario of the sound source localization method for a smart speaker of the invention.

FIG. 3 is a diagram of another application scenario of the sound source localization method for a smart speaker of the invention.

FIG. 4 is a flowchart of Embodiment 2 of the sound source localization method for a smart speaker of the invention.

FIG. 5 is a structural diagram of Embodiment 1 of the smart speaker of the invention.

FIG. 6 is a structural diagram of Embodiment 2 of the smart speaker of the invention.

FIG. 7 is a structural diagram of Embodiment 3 of the smart speaker of the invention.

FIG. 8 is an example diagram of a smart speaker provided by the invention.

【Detailed description】

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flowchart of Embodiment 1 of the sound source localization method for a smart speaker of the invention. As shown in FIG. 1, the sound source localization method of this embodiment may specifically include the following steps:

100. When it is determined that the voice signal emitted by a target sound source needs to be collected, acquire the first voice signal, carrying a preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker;

The method of this embodiment is executed by the smart speaker. The preset wake-up word is used by the target sound source to wake up the smart speaker. The target sound source is preferably a user interacting with the smart speaker. To receive the user's voice query and broadcast feedback information based on it, the smart speaker of this embodiment is provided with a signal receiving module and a signal sending module. For example, the two modules can be integrated into one unit, such as a microphone built into the smart speaker, which receives signals from all directions and broadcasts feedback information in all directions. Optionally, in this embodiment an even number of microphones may be arranged symmetrically on the smart speaker, for example four microphones, or four groups of two microphones each, spaced evenly around the smart speaker.

In the usage scenario of this embodiment, several sound sources may call the smart speaker. Because the smart speaker cannot process several voice queries at the same time, it determines through its own analysis which target sound source's voice signal to collect. For example, it may decide to collect the voice signal received first, or use another strategy to pick the target sound source from the several sound sources. The target sound source then needs to be localized. In this embodiment, the first step is to acquire the first voice signal, carrying the preset wake-up word, that is emitted by the target sound source and received by at least two signal receiving module pairs on the smart speaker, so that the first voice signals received by those pairs can be used to localize the target sound source.

A signal receiving module pair in this embodiment may be a microphone pair on the smart speaker. Each microphone pair consists of two microphones, which may be any two microphones on the smart speaker. A single microphone pair may place the target sound source in either of two symmetric directions, so localization with one pair is not accurate enough. In this embodiment, at least two microphone pairs are therefore selected to localize the direction of the target sound source. The relationship between the selected pairs is not restricted: for example, one pair may be two adjacent microphones, and the other pair may be adjacent microphones or two opposite microphones on a diagonal. Because the microphones are at different distances from the target sound source, they receive the same voice signal from the target sound source at different times.

In this embodiment, when it is determined that the voice signal emitted by the target sound source needs to be collected, the first voice signal carrying the preset wake-up word, as received by at least two signal receiving module pairs on the smart speaker, is acquired. For example, if the preset wake-up word of the smart speaker is "Dabai" and the user says "Dabai, Dabai", that utterance can serve as the first voice signal emitted by the target sound source. In other words, the target sound source can be localized from the first voice signal with which it wakes the smart speaker, without collecting any further voice signal from the target sound source.

101. Acquire the time difference between the two signal receiving modules of each signal receiving module pair in receiving the first voice signal;

For example, when the signal receiving module pairs are microphone pairs, the microphones are at different distances from the target sound source, so the two microphones of each pair receive the first voice signal of the target sound source with a time difference. Optionally, step 101 may specifically include: taking the first signal receiving module pair of the at least two signal receiving module pairs as the reference, selecting a candidate direction θ of the target sound source; and, for each signal receiving module pair, obtaining, according to the candidate direction θ, the time difference t0 between the two signal receiving modules of that pair in receiving the first voice signal, where t0 is a function of θ.

Specifically, because a microphone pair cannot know this time difference exactly in advance, this embodiment first presets a candidate direction θ of the target sound source and then expresses, in terms of θ, the time difference t0 with which the two microphones of each pair receive the first voice signal. The target sound source in this embodiment is a far-field source: the distance between the target sound source and the microphones is much greater than the distance between the microphones, so the voice signal can be treated as reaching the microphones along parallel lines. The first of the at least two microphone pairs is chosen as the reference, and the candidate direction of the target sound source relative to that pair is taken as θ. Since the distance between the two microphones of each pair is known, the time difference t0 with which the two microphones of each pair receive the first voice signal can be expressed from the candidate direction and that distance. Here t0 is a function of the candidate direction θ, and θ can be any angle from 0 to 360 degrees; that is, the target sound source may lie in any direction from 0 to 360 degrees in space.

For example, FIG. 2 is a diagram of one application scenario of the sound source localization method for a smart speaker of the invention. As shown in FIG. 2, A, B and C are microphones on the smart speaker. In this embodiment, A and B, and A and C, can be taken as microphone pairs, and the first voice signal emitted by the target sound source travels to the microphones as a parallel wave. Taking microphone pair A and B as the reference, the candidate direction of the target sound source is taken as θ. An auxiliary line BO is drawn perpendicular to the propagation direction of the parallel wave; AO is then the difference between the distances the first voice signal travels to reach microphone B and microphone A, Δd = L×cosθ, where L is the distance between microphones A and B. The time difference t0 with which microphones A and B receive the first voice signal equals the length of AO divided by the speed of sound V, so t0 = Δd/V = L×cosθ/V; that is, t0 is a function of θ.

For the other microphone pair in FIG. 2, A and C, the distance AC also equals L. On the smart speaker, the line through A and C is perpendicular to the line through A and B, so by the geometry of the triangle the time difference t0 with which microphones A and C receive the first voice signal can be expressed as t0 = L×sinθ/V.
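As a quick numeric illustration of the two FIG. 2 formulas, the sketch below evaluates t0 = L×cosθ/V for pair A and B and t0 = L×sinθ/V for pair A and C. The function name `fig2_delays`, the 5 cm spacing, and the speed of sound of 343 m/s are assumptions chosen for the example; the source gives no numeric values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value (not from the source)

def fig2_delays(theta_deg, spacing_m, speed=SPEED_OF_SOUND):
    """Return (t_ab, t_ac) in seconds for the perpendicular pairs A-B and A-C of FIG. 2.

    theta_deg is the candidate direction measured against the A-B line;
    spacing_m is the common microphone spacing L.
    """
    theta = math.radians(theta_deg)
    t_ab = spacing_m * math.cos(theta) / speed  # t0 = L*cos(theta)/V
    t_ac = spacing_m * math.sin(theta) / speed  # t0 = L*sin(theta)/V
    return t_ab, t_ac

# Example: candidate direction of 60 degrees, microphones 5 cm apart (hypothetical values).
t_ab, t_ac = fig2_delays(60.0, 0.05)
print(f"A-B: {t_ab * 1e6:.1f} microseconds, A-C: {t_ac * 1e6:.1f} microseconds")
```

With spacings of a few centimetres the delays are on the order of tens to hundreds of microseconds, which is why the sampling rate limits the angular resolution of any sample-based alignment.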

For example, FIG. 3 is a diagram of another application scenario of the sound source localization method for a smart speaker of the invention. As in FIG. 2, the time difference t0 with which microphones A and B receive the first voice signal from the target sound source can be expressed as t0 = Δd/V = L×cosθ/V. If the line through microphones A and C is parallel to the propagation direction of the first voice signal, the time difference t0 with which microphones A and C receive the first voice signal equals the length L' of AC divided by the speed of sound V.

FIG. 2 and FIG. 3 are only two special cases. In practice, for the at least two signal receiving module pairs of a smart speaker in any scenario, the first pair can always be chosen as the reference to define the candidate direction θ of the target sound source, and the time difference with which the two signal receiving modules of each pair receive the first voice signal can then be expressed in terms of θ from the positional relationship of the pairs on the smart speaker.
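One way to express the time difference for an arbitrary pair, under the far-field assumption stated earlier, is to project the vector between the two microphones onto the candidate direction. The sketch below is an illustration rather than the patent's own formulation: the 2-D coordinates, the function name `pair_delay`, the sign convention (a positive result means the first microphone hears the wavefront later), and the 343 m/s speed of sound are all assumptions.

```python
import math

def pair_delay(mic_a, mic_b, theta_deg, speed=343.0):
    """Far-field time difference for one microphone pair, in seconds.

    mic_a, mic_b: (x, y) positions in metres, with the x axis along the
    reference pair. theta_deg: candidate source direction against the x axis.
    Returns how much later the wavefront reaches mic_a than mic_b
    (negative if mic_a hears it first).
    """
    theta = math.radians(theta_deg)
    ux, uy = math.cos(theta), math.sin(theta)   # unit vector toward the source
    dx, dy = mic_b[0] - mic_a[0], mic_b[1] - mic_a[1]
    return (dx * ux + dy * uy) / speed          # path difference / speed of sound

# FIG. 2 layout as a special case: A at the origin, B on the x axis, C on the y axis.
A, B, C = (0.0, 0.0), (0.05, 0.0), (0.0, 0.05)
t_ab = pair_delay(A, B, 60.0)  # reduces to L*cos(theta)/V
t_ac = pair_delay(A, C, 60.0)  # reduces to L*sin(theta)/V
```

For the perpendicular layout of FIG. 2 this projection reduces to the cosine and sine expressions given above, and for FIG. 3 it gives L'/V when the pair lies along the propagation direction.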

102. Determine the direction of the target sound source that emitted the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal.

When the signal receiving module pairs are microphone pairs, the time difference with which the two microphones of each pair receive the first voice signal can be expressed as a function of the candidate direction θ of the target sound source, and θ can be any angle in the range 0 to 360 degrees. If the first voice signals received by the two microphones are aligned in time using the time difference for the correct direction, the two signals have their strongest correlation. By traversing the candidate angles and finding the angle at which the correlation of the two first voice signals is greatest, the direction angle of the target sound source is obtained.

That is, step 102 may specifically include the following steps:

(a1) According to the time difference t0 with which each signal receiving module pair receives the first voice signal, align in time the first voice signals received by the two signal receiving modules of that pair;

In this embodiment, the alignment can be done by delaying, for each signal receiving module pair, the earlier-received of the two first voice signals by the time difference t0, or by advancing the later-received of the two first voice signals by the time difference t0, so that the two first voice signals are aligned in time.

(a2)计算各组信号接收模块对对应的两个对齐处理后的第一语音信号的相关性；(a2) calculating the correlation of each group of signal receiving modules to the corresponding two aligned first speech signals;

(a3)将各信号接收模块对对应的相关性叠加，得到总相关性；(a3) superimposing the corresponding correlations of each signal receiving module to obtain the total correlation;

(a4)获取总相关性取最大值时目标声源对应的候选方向θ为目标声源的目标方向。(a4) Obtain the candidate direction θ corresponding to the target sound source when the total correlation takes the maximum value as the target direction of the target sound source.

If only one microphone pair is selected when determining the orientation of the target sound source that emitted the first voice signal, such as microphones A and C in FIG. 2, then, as shown in FIG. 2, there may be a candidate target sound source on the left side of AC that is symmetric to the target sound source about AC. For that candidate the correlation of the aligned first voice signals also reaches its maximum, so the orientation of the target sound source cannot be uniquely determined. If the microphone pair A and B is also selected, so that the two pairs jointly determine the orientation, the orientation of the target sound source can be uniquely determined. Therefore, in this embodiment, at least two signal receiving module pairs, such as microphone pairs, are needed to uniquely determine the orientation of the target sound source that emitted the first voice signal.
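The ambiguity of a single pair can be checked numerically. The following sketch assumes a far-field source and hypothetical microphone coordinates, with A and C on the x axis and B on the y axis. These coordinates are not the layout of FIG. 2, and `pair_delay` is an illustrative helper rather than part of the described method.

```python
import math


def pair_delay(mic_a, mic_b, theta_deg, c=343.0):
    """Arrival time at mic_b minus arrival time at mic_a (far-field assumption)."""
    theta = math.radians(theta_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    return ((mic_a[0] - mic_b[0]) * ux + (mic_a[1] - mic_b[1]) * uy) / c


# Hypothetical positions in metres: A and C lie on the x axis, B on the y axis.
A, B, C = (0.0, 0.0), (0.0, 0.1), (0.1, 0.0)

theta = 60.0            # direction of the real source
mirror = 360.0 - theta  # its mirror image about the line AC

# Pair (A, C) sees the same time difference for both directions ...
same_on_ac = math.isclose(pair_delay(A, C, theta), pair_delay(A, C, mirror), abs_tol=1e-12)
# ... while pair (A, B) sees different time differences, which resolves the ambiguity.
differs_on_ab = not math.isclose(pair_delay(A, B, theta), pair_delay(A, B, mirror), abs_tol=1e-12)
```

With these positions the pair (A, C) cannot tell the source from its mirror image, whereas adding the pair (A, B) separates the two directions.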

Specifically, for each signal receiving module pair, the first voice signals received by the two signal receiving modules in the pair are aligned in the manner described above. For example, for a microphone pair a and b, suppose a receives the first voice signal Y1 first and b receives the first voice signal Y2 later, with time difference t0; then Y1 can be delayed by t0, or Y2 can be advanced by t0. In this embodiment, take the case in which the first voice signal received first is delayed, and denote the delayed signal Y1'. The correlation of Y1' and Y2 can then be calculated. The corresponding correlation is obtained in this way for each microphone pair, and the correlations of all microphone pairs are added to give the total correlation. Because Y1' is delayed by t0, and t0 is a function of θ, step (a4) can traverse every θ, determine the total correlation for each candidate direction θ, and take the candidate direction θ at which the total correlation is maximal as the target direction of the target sound source.

In other words, if the selected candidate direction θ is exactly the true direction of the sound source, then after the first voice signals received by the two microphones are aligned on the time axis by the time difference calculated from θ, the two signals should have the strongest correlation. Conversely, if the selected candidate direction θ is not the true direction of the sound source, then after alignment by the time difference calculated from θ, the correlation between the first voice signals received by the two microphones is weaker. Therefore, by measuring the correlation of the aligned first voice signals of the two microphones for each candidate direction θ, the likelihood that the first voice signal comes from each candidate direction θ can be judged: the stronger the correlation, the more likely it is that the sound is incident from direction θ.
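The traversal described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: a far-field source, integer-sample alignment, a plain dot product as the correlation measure, hypothetical microphone coordinates, and a synthetic noise burst in place of the first voice signal. The names `pair_delay`, `aligned_correlation`, `locate` and `simulate` are introduced here for illustration only.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed


def pair_delay(mic_a, mic_b, theta_deg, c=SPEED_OF_SOUND):
    """t0(theta): arrival time at mic_b minus arrival time at mic_a (far field)."""
    theta = math.radians(theta_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    return ((mic_a[0] - mic_b[0]) * ux + (mic_a[1] - mic_b[1]) * uy) / c


def aligned_correlation(y1, y2, shift):
    """Steps (a1)-(a2): correlate y1[n] with y2[n + shift] over the overlapping samples."""
    start = max(0, -shift)
    stop = min(len(y1), len(y2) - shift)
    return sum(y1[n] * y2[n + shift] for n in range(start, stop))


def locate(signals, mics, pairs, fs, step_deg=1):
    """Steps (a3)-(a4): return the candidate theta with the largest total correlation."""
    def total(theta):
        score = 0.0
        for i, j in pairs:
            shift = round(pair_delay(mics[i], mics[j], theta) * fs)  # t0 in samples
            score += aligned_correlation(signals[i], signals[j], shift)
        return score
    return max(range(0, 360, step_deg), key=total)


def simulate(mics, theta_deg, fs, n=1024, seed=0):
    """Synthetic input: the same noise burst, delayed by whole samples per microphone."""
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(n)]
    pad = 64
    out = []
    for m in mics:
        d = pad + round(pair_delay(mics[0], m, theta_deg) * fs)
        out.append([0.0] * d + source + [0.0] * (2 * pad - d))
    return out


mics = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0)]  # hypothetical positions in metres
pairs = [(0, 1), (0, 2)]                      # two pairs sharing microphone 0
fs = 48000                                    # sample rate in Hz, assumed
signals = simulate(mics, theta_deg=60, fs=fs)
estimate = locate(signals, mics, pairs, fs)
```

Because the alignment is rounded to whole samples, several neighbouring candidate angles can give the same total correlation, and the sketch returns the first of them. A finer estimate would need a higher sample rate, a larger microphone spacing, or fractional-delay alignment.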

Further optionally, after step 102, "determine the orientation of the target sound source that emitted the first voice signal, according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal", the method may also include: rotating a positioning indicator mark to the orientation of the target sound source and/or lighting a positioning indicator light toward the orientation of the target sound source, to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.

That is, after localizing the target sound source, the smart speaker needs to give some feedback to the user of that target sound source, informing the user that the target sound source has been localized and that the user can now interact with the smart speaker and be served by it. In this embodiment, the smart speaker may be provided with a positioning indicator mark, which may be a rotatable pointer. After localizing the target sound source, the smart speaker can rotate the positioning indicator mark to the orientation of the target sound source, so that the user at that orientation can see that the target sound source has been localized. Alternatively, the smart speaker may be provided with a positioning indicator light, which may for example be mounted on the pointer of the positioning indicator mark. After localizing the orientation of the target sound source, the smart speaker can then light the positioning indicator light toward that orientation, to inform the user corresponding to the target sound source that the orientation has been determined. The two approaches can be used separately or in combination.
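Purely as an illustration of how a localized azimuth might drive such an indicator, the sketch below computes the shortest rotation for a pointer and the nearest light on a ring of evenly spaced lights. The ring of 12 lights, the placement of light 0 at 0 degrees, and both function names are assumptions of the sketch; the embodiment does not specify the indicator hardware.

```python
def pointer_rotation(current_deg, target_deg):
    """Signed shortest rotation, in degrees within [-180, 180), from current to target."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0


def nearest_led(theta_deg, led_count=12):
    """Index of the light closest to theta_deg on a ring of evenly spaced lights."""
    spacing = 360.0 / led_count
    return round((theta_deg % 360.0) / spacing) % led_count
```

For example, a pointer at 350 degrees reaches a target at 10 degrees by turning 20 degrees across the 0-degree mark instead of 340 degrees the other way.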

In the sound source localization method for a smart speaker of this embodiment, when it is determined that the voice signal emitted by the target sound source needs to be collected, the first voice signal carrying the preset wake-up word, sent by the target sound source and received by at least two signal receiving module pairs on the smart speaker, is acquired; the time difference with which the two signal receiving modules in each pair receive the first voice signal is acquired; and the orientation of the target sound source that emitted the first voice signal is determined according to the first voice signal and these time differences. The technical solution of this embodiment can localize the target sound source in a scene with many sound sources, so that the smart speaker can collect only the voice signal of the target sound source in the localized direction and then provide services to the user corresponding to that target sound source. It also effectively enriches the functions of the smart speaker, making it more flexible and convenient to use.

FIG. 4 is a flowchart of Embodiment 2 of the sound source localization method for a smart speaker of the present invention. Building on the technical solution of the above embodiment, the method of this embodiment describes the technical solution of the present invention in further detail. As shown in FIG. 4, the sound source localization method for a smart speaker of this embodiment may specifically include the following steps:

200. Receive a second voice signal carrying a preset wake-up word, input by voice by the user corresponding to the target sound source;

201. Extract the preset wake-up word from the second voice signal;

202. Extract the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

203. Establish and store the correspondence between the preset wake-up word and the voiceprint feature of the target sound source;

In the application scenario of the sound source localization method of this embodiment, the smart speaker may support adding preset wake-up words. For example, whereas the wake-up word of a prior-art smart speaker cannot be changed, in this embodiment the owner of the smart speaker can set preset wake-up words in the smart speaker for himself or herself or for family members. For example, if the default wake-up word of the smart speaker is "Xiao A" (小A) and the owner, after purchasing the smart speaker, is the first person to talk to it, the default wake-up word can serve as the owner's private wake-up word. With the owner's consent, other family members can also set their own private wake-up words on the smart speaker. For example, in a similar manner, when the user corresponding to the target sound source sets a private wake-up word, that user can input by voice a second voice signal carrying the preset wake-up word; if the user calls "Xiao Ke'ai" (小可爱), then "Xiao Ke'ai" is the preset wake-up word that this user sets for the smart speaker. The smart speaker then extracts the preset wake-up word from the second voice signal, extracts the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source, and establishes and stores the correspondence between the preset wake-up word and the voiceprint feature of the target sound source. That is, a voice signal that calls the preset wake-up word must have the voiceprint feature recorded in the correspondence, or, conversely, a voice signal with that voiceprint feature must carry the preset wake-up word recorded in the correspondence; otherwise the smart speaker may ignore it.

The above process of establishing and storing the correspondence between the preset wake-up word and the voiceprint feature of the target sound source can be performed in advance, so that the correspondence can be used directly for subsequent detection.
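Steps 200 to 203 can be sketched as a small enrollment routine. The embodiment does not say how the wake-up word or the voiceprint feature is extracted, so the sketch takes both extractors as caller-supplied functions and stores the correspondence in an ordinary dictionary. The function name `enroll`, the dictionary layout, and the stand-in signal used in the example are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Voiceprint = Tuple[float, ...]


def enroll(
    registry: Dict[str, Voiceprint],
    second_voice_signal: object,
    extract_wake_word: Callable[[object], str],
    extract_voiceprint: Callable[[object], Sequence[float]],
) -> str:
    """Store the wake-up word / voiceprint correspondence for one user."""
    wake_word = extract_wake_word(second_voice_signal)            # step 201
    voiceprint = tuple(extract_voiceprint(second_voice_signal))   # step 202
    registry[wake_word] = voiceprint                              # step 203
    return wake_word


# Example with stand-in extractors; a real device would run speech recognition
# and a speaker-embedding model here (both outside the scope of this sketch).
registry: Dict[str, Voiceprint] = {}
fake_signal = {"text": "xiao ke ai", "features": [0.1, 0.9]}
enrolled = enroll(
    registry,
    fake_signal,
    extract_wake_word=lambda s: s["text"],
    extract_voiceprint=lambda s: s["features"],
)
```

Keying the dictionary by wake-up word matches the one-private-wake-up-word-per-user usage described above. A design that let several users share one wake-up word would need a different layout.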

204. Acquire the first voice signal carrying the preset wake-up word emitted by the target sound source;

205. Extract the preset wake-up word from the first voice signal;

206. Extract the voiceprint feature of the first voice signal;

207. According to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, determine whether the preset wake-up word matches the voiceprint feature of the first voice signal; if they match, perform step 208; otherwise, perform step 209;

208. Determine that the voice signal emitted by the target sound source needs to be collected; end.

That is, according to the correspondence between the preset wake-up word and the voiceprint feature of the target sound source obtained in step 203, it is determined whether the preset wake-up word obtained in step 205 matches the voiceprint feature extracted in step 206. If they match, it is determined that the voice signal emitted by the target sound source needs to be collected, and the technical solution of the embodiment shown in FIG. 1 can then be carried out.

209. Determine that the preset wake-up word does not match the voiceprint feature, and perform no operation for the time being.

Alternatively, if the smart speaker determines that the current preset wake-up word does not match the voiceprint feature, it may also give a voice prompt such as "Sorry, the wake-up word you used is incorrect, so service cannot be provided for now", or a similar message.
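The matching decision of step 207 can be sketched as below. The embodiment does not specify how two voiceprint features are compared, so the cosine similarity and the threshold of 0.8 used here are assumptions chosen only to make the example concrete. `should_collect` and the dictionary-based registry are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_collect(registry, wake_word, voiceprint, threshold=0.8):
    """Step 207: True only if the wake-up word is enrolled and the voiceprint matches it."""
    stored = registry.get(wake_word)
    if stored is None:
        return False  # unknown wake-up word: step 209
    return cosine_similarity(stored, voiceprint) >= threshold  # step 208 if True, else step 209


# Correspondence stored earlier during enrollment (illustrative values).
registry = {"xiao ke ai": (0.1, 0.9)}
```

A caller would run the localization of the first embodiment only when `should_collect` returns True, and otherwise either stay silent or play the optional prompt.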

By adopting the above solution, the sound source localization method for a smart speaker of this embodiment can further use the voiceprint together with the preset wake-up word to decide whether a voice signal should be collected. In this embodiment, different users of the same smart speaker can set different wake-up words, and the smart speaker can store the correspondence between each user's wake-up word and that user's voiceprint feature, so that each user can wake up and interact with the smart speaker only with his or her private wake-up word. The smart speaker can thus discriminate both the voiceprint feature and the wake-up word of each voice signal, which improves the accuracy with which it localizes voice signals and greatly improves the user experience.

FIG. 5 is a structural diagram of Embodiment 1 of the smart speaker of the present invention. As shown in FIG. 5, the smart speaker of this embodiment may specifically include a signal acquisition module 10, a time difference acquisition module 11 and a positioning module 12.

The signal acquisition module 10 is configured to, when it is determined that the voice signal emitted by the target sound source needs to be collected, acquire the first voice signal carrying the preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the smart speaker; the preset wake-up word is used by the target sound source to wake up the smart speaker;

The time difference acquisition module 11 is configured to acquire the time difference with which the two signal receiving modules in each signal receiving module pair receive the first voice signal acquired by the signal acquisition module 10;

The positioning module 12 is configured to determine the orientation of the target sound source that emitted the first voice signal, according to the first voice signal acquired by the signal acquisition module 10 and the time differences, acquired by the time difference acquisition module 11, with which each signal receiving module pair receives the first voice signal.

The implementation principle and technical effect of sound source localization by the above modules in the smart speaker of this embodiment are the same as those of the related method embodiments above. For details, refer to the description of those embodiments; they are not repeated here.

FIG. 6 is a structural diagram of Embodiment 2 of the smart speaker of the present invention. As shown in FIG. 6, the smart speaker of this embodiment, building on the technical solution of the embodiment shown in FIG. 5, may further include the following.

As shown in FIG. 6, the smart speaker of this embodiment may further include a positioning indication module 13.

The positioning indication module 13 is configured to rotate the positioning indicator mark to the orientation of the target sound source localized by the positioning module 12 and/or light the positioning indicator light toward that orientation, to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.

Further optionally, in the smart speaker of this embodiment, the time difference acquisition module 11 is specifically configured to:

take the first signal receiving module pair among the at least two signal receiving module pairs as the reference, and select the candidate direction θ of the target sound source;

for each signal receiving module pair, obtain, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules in that pair receive the first voice signal, where t0 is a function of θ.

Further optionally, in the smart speaker of this embodiment, the positioning module 12 is specifically configured to:

align in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules in each signal receiving module pair;

calculate, for each signal receiving module pair, the correlation of the corresponding two aligned first voice signals;

sum the correlations corresponding to the signal receiving module pairs to obtain the total correlation;

take the candidate direction θ at which the total correlation reaches its maximum as the target direction of the target sound source.

Further optionally, in the smart speaker of this embodiment, the positioning module 12 is specifically configured to delay by the time difference t0 whichever of the two first voice signals received by each signal receiving module pair was received first, or to advance by t0 whichever was received later, so that the two first voice signals are aligned in time.

Further optionally, as shown in FIG. 6, the smart speaker of this embodiment also includes:

a determination module 14, configured to determine that the voice signal emitted by the target sound source needs to be collected.

Further optionally, in the smart speaker of this embodiment, the determination module 14 is specifically configured to:

acquire the first voice signal carrying the preset wake-up word emitted by the target sound source;

extract the preset wake-up word from the first voice signal;

extract the voiceprint feature of the first voice signal;

determine, according to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, whether the preset wake-up word matches the voiceprint feature of the first voice signal;

if they match, determine that the voice signal emitted by the target sound source needs to be collected.

Correspondingly, when the determination module 14 determines that the voice signal emitted by the target sound source needs to be collected, it triggers the signal acquisition module 10 to acquire the first voice signal carrying the preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the smart speaker.

The receiving module 15 is configured to receive the second voice signal carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

The extraction module 16 is configured to extract the preset wake-up word from the second voice signal received by the receiving module 15;

The extraction module 16 is further configured to extract the voiceprint feature of the second voice signal received by the receiving module 15 as the voiceprint feature of the target sound source;

The establishment module 17 is configured to establish and store the correspondence between the preset wake-up word extracted by the extraction module 16 and the voiceprint feature of the target sound source.

Correspondingly, the determination module 14 is configured to determine, according to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source established by the establishment module 17, whether the preset wake-up word matches the voiceprint feature of the first voice signal.

FIG. 7 is a structural diagram of Embodiment 3 of the smart speaker of the present invention. As shown in FIG. 7, the smart speaker of this embodiment includes multiple microphones (not shown in the figure) for sending and receiving signals. For example, the microphones may be evenly distributed on the housing of the smart speaker; they are used to receive the user's voice query and also to broadcast feedback information to the user based on that query. The smart speaker of this embodiment further includes one or more processors 30 and a memory 40. The memory 40 is used to store one or more programs which, when executed by the one or more processors 30, cause the one or more processors 30 to implement the sound source localization method for a smart speaker of the embodiments shown in FIG. 1 to FIG. 4. The embodiment shown in FIG. 7 takes multiple processors 30 as an example.

For example, FIG. 8 is an example diagram of a smart speaker provided by the present invention. FIG. 8 shows a block diagram of an exemplary smart speaker 12a suitable for implementing embodiments of the present invention. The smart speaker 12a shown in FIG. 8 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in FIG. 8, the smart speaker 12a of this embodiment takes the form of a general-purpose computing device. The components of the smart speaker 12a may include, but are not limited to: one or more processors 16a, a system memory 28a, and a bus 18a connecting the different system components (including the system memory 28a and the processors 16a).

The bus 18a represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The smart speaker 12a typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the smart speaker 12a, including volatile and non-volatile media and removable and non-removable media.

The system memory 28a may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30a and/or cache memory 32a. The smart speaker 12a may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34a may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 18a through one or more data media interfaces. The system memory 28a may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of FIG. 1 to FIG. 6 of the present invention.

A program/utility 40a having a set of (at least one) program modules 42a may be stored, for example, in the system memory 28a. Such program modules 42a include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42a generally perform the functions and/or methods of the embodiments of FIG. 1 to FIG. 6 described in the present invention.

The smart speaker 12a may also communicate with one or more external devices 14a (such as a keyboard, a pointing device, a display 24a, etc.), with one or more devices that enable a user to interact with the smart speaker 12a, and/or with any device (such as a network card, a modem, etc.) that enables the smart speaker 12a to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22a. In addition, the smart speaker 12a may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 20a. As shown in the figure, the network adapter 20a communicates with the other modules of the smart speaker 12a through the bus 18a. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the smart speaker 12a, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processor 16a executes various functional applications and data processing by running programs stored in the system memory 28a, for example implementing the sound source localization method for a smart speaker shown in the above embodiments.

本发明还提供一种计算机可读介质，其上存储有计算机程序，该程序被处理器执行时实现如上述实施例所示的智能音箱的声源定位方法。The present invention also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processor, the method for localizing the sound source of the smart speaker as shown in the above-mentioned embodiments is realized.

The computer-readable medium of this embodiment may include the RAM 30a, and/or the cache memory 32a, and/or the storage system 34a in the system memory 28a of the embodiment shown in FIG. 8 above.

With the development of technology, the distribution of computer programs is no longer limited to tangible media; programs can also be downloaded directly from a network or obtained in other ways. Therefore, the computer-readable medium in this embodiment may include not only tangible media but also intangible media.

The computer-readable medium of this embodiment may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.

Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

An integrated unit implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional units are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A sound source localization method for a smart speaker, characterized in that the method comprises:

if it is determined that a voice signal sent by a target sound source needs to be collected, acquiring a first voice signal, carrying a preset wake-up word and sent by the target sound source, as received by at least two groups of signal receiving module pairs on the smart speaker; the preset wake-up word is used by the target sound source to wake up the smart speaker;

acquiring the time difference with which the two signal receiving modules in each group of signal receiving module pairs receive the first voice signal;

determining the orientation of the target sound source that sends the first voice signal according to the first voice signal and the time difference with which each group of signal receiving module pairs receives the first voice signal.

2. The method according to claim 1, characterized in that, after determining the orientation of the target sound source that sends the first voice signal according to the first voice signal and the time difference with which each group of signal receiving module pairs receives the first voice signal, the method further comprises:

rotating a positioning indicator mark to the orientation of the target sound source and/or lighting a positioning indicator light at the orientation of the target sound source, so as to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.

3. The method according to claim 1, characterized in that acquiring the time difference with which the two signal receiving modules in each group of signal receiving module pairs receive the first voice signal specifically comprises:

taking the first group of signal receiving module pairs among the at least two groups of signal receiving module pairs as a reference object, selecting a candidate direction θ of the target sound source;

for each group of signal receiving module pairs, acquiring, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules in the corresponding pair receive the first voice signal, wherein t0 is a function of θ.
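As a non-limiting illustration of how t0 can be a function of θ, the sketch below assumes a far-field (plane-wave) model, which the claim itself does not specify. The function name `pair_time_difference`, the parameters `spacing` and `axis_angle`, and the speed-of-sound constant are all assumptions introduced for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed value)


def pair_time_difference(theta, spacing, axis_angle=0.0, c=SPEED_OF_SOUND):
    """Far-field time difference t0 (seconds) for one pair of receiving modules.

    theta      -- candidate direction of the source, in radians, measured in
                  the same reference frame as axis_angle
    spacing    -- distance between the two modules of the pair, in metres
    axis_angle -- direction of the line from module A to module B, in radians
    Returns the arrival time at module A minus the arrival time at module B;
    it is positive when module B hears the source first.
    """
    return spacing * math.cos(theta - axis_angle) / c
```

Under this model the time difference is largest when the source lies on the axis of the pair and zero when the source is broadside to it.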

4. The method according to claim 3, characterized in that determining the orientation of the target sound source that sends the first voice signal, according to the first voice signal and the time difference with which each group of signal receiving module pairs receives the first voice signal, specifically comprises:

performing time alignment on the first voice signals received by the two signal receiving modules in each group of signal receiving module pairs, according to the time difference t0 with which that group receives the first voice signal;

calculating, for each group of signal receiving module pairs, the correlation of the corresponding two aligned first voice signals;

superimposing the correlations corresponding to each group of signal receiving module pairs to obtain a total correlation;

taking the candidate direction θ at which the total correlation reaches its maximum value as the target direction of the target sound source.
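The four steps of claim 4 (align, correlate, superimpose, take the maximum) can be sketched as a search over candidate directions. This is a minimal illustration under stated assumptions, not the patented implementation: t0 is computed from a far-field model, the alignment uses a circular shift by a whole number of samples, and the correlation is a plain dot product. The names `circular_delay` and `localize_by_total_correlation` and the tuple layout of `pairs` are hypothetical.

```python
import math


def circular_delay(signal, n):
    """Delay a list of samples by n samples with wrap-around (n may be negative)."""
    n %= len(signal)
    return signal[-n:] + signal[:-n] if n else list(signal)


def localize_by_total_correlation(pairs, fs, candidates, c=343.0):
    """Return the candidate direction (radians) with the largest total correlation.

    pairs      -- list of (sig_a, sig_b, spacing, axis_angle), one entry per
                  pair of receiving modules; sig_a and sig_b are equal-length lists
    fs         -- sampling rate in Hz
    candidates -- iterable of candidate directions theta, in radians
    """
    best_theta, best_total = None, -math.inf
    for theta in candidates:
        total = 0.0
        for sig_a, sig_b, spacing, axis_angle in pairs:
            # Assumed far-field model: arrival time at A minus arrival time at B.
            t0 = spacing * math.cos(theta - axis_angle) / c
            shift = round(t0 * fs)
            # Time alignment: shift B by t0 so that it lines up with A.
            aligned_b = circular_delay(sig_b, shift)
            # Correlation of the two aligned signals, superimposed across pairs.
            total += sum(x * y for x, y in zip(sig_a, aligned_b))
        if total > best_total:
            best_theta, best_total = theta, total
    return best_theta
```

A single pair cannot distinguish a direction from its mirror image about the pair's axis; summing the correlations of at least two differently oriented pairs is what resolves that ambiguity in this sketch.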

5. The method according to claim 4, characterized in that performing time alignment on the first voice signals received by the two signal receiving modules in each group of signal receiving module pairs, according to the time difference t0 with which each group receives the first voice signal, specifically comprises:

delaying, by the time difference t0, the earlier-received of the two first voice signals received by each group of signal receiving module pairs, or advancing, by the time difference t0, the later-received of the two first voice signals received by each group of signal receiving module pairs, so that the two first voice signals are aligned in time.
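The two alternatives in claim 5 (delay the earlier signal, or advance the later one) are equivalent up to a common time offset. The sketch below illustrates this for sampled signals, assuming t0 has already been converted to a whole number of samples; the helper name `shift_samples` and the zero-padding at the edges are assumptions made for this example.

```python
def shift_samples(signal, n):
    """Shift a list of samples by n samples, padding with zeros.

    n > 0 delays the signal, n < 0 advances it; the length is preserved.
    """
    size = len(signal)
    if n >= 0:
        pad = min(n, size)
        return [0.0] * pad + list(signal[:size - pad])
    pad = min(-n, size)
    return list(signal[pad:]) + [0.0] * pad


# The same pulse arrives two samples earlier at one module than at the other.
earlier = [1.0, 2.0, 3.0, 0.0, 0.0]
later = [0.0, 0.0, 1.0, 2.0, 3.0]

delayed_earlier = shift_samples(earlier, 2)   # option 1: delay the earlier signal
advanced_later = shift_samples(later, -2)     # option 2: advance the later signal
```

After either operation the two signals of the pair coincide sample for sample, which is the precondition for the correlation step.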

6. The method according to claim 1, characterized in that, before acquiring the first voice signal, carrying the preset wake-up word and sent by the target sound source, as received by the at least two groups of signal receiving module pairs on the smart speaker, the method further comprises:

determining that the voice signal sent by the target sound source needs to be collected;

further, determining that the voice signal sent by the target sound source needs to be collected specifically comprises:

acquiring the first voice signal carrying the preset wake-up word sent by the target sound source;

extracting the preset wake-up word from the first voice signal;

extracting voiceprint features of the first voice signal;

judging whether the preset wake-up word matches the voiceprint feature of the first voice signal according to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source;

if they match, determining that the voice signal sent by the target sound source needs to be collected.

7. The method according to claim 6, wherein before acquiring the first voice signal carrying the preset wake-up word sent by the target sound source, the method further comprises:

receiving a second voice signal, carrying the preset wake-up word, that is input by voice by the user corresponding to the target sound source;

extracting the preset wake-up word from the second voice signal;

extracting the voiceprint feature of the second speech signal as the voiceprint feature of the target sound source;

establishing and storing the correspondence between the preset wake-up word and the voiceprint feature of the target sound source.
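The enrollment steps of claim 7 and the matching steps of claim 6 can be illustrated together. The claims do not say how voiceprint features are extracted or compared, so this sketch treats a voiceprint as a plain numeric vector and compares vectors by cosine similarity against a threshold; the class `WakeWordRegistry`, its method names and the threshold value are all assumptions introduced for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class WakeWordRegistry:
    """Stores the correspondence between a wake-up word and a voiceprint."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity threshold
        self._voiceprints = {}

    def enroll(self, wake_word, voiceprint):
        # Claim 7: establish and store the correspondence.
        self._voiceprints[wake_word] = list(voiceprint)

    def should_collect(self, wake_word, voiceprint):
        # Claim 6: the voice signal is collected only if the voiceprint of the
        # first voice signal matches the one stored for this wake-up word.
        stored = self._voiceprints.get(wake_word)
        if stored is None:
            return False
        return cosine_similarity(stored, voiceprint) >= self.threshold


registry = WakeWordRegistry()
registry.enroll("hello speaker", [1.0, 0.0, 0.0])
same_user = registry.should_collect("hello speaker", [0.9, 0.1, 0.0])
other_user = registry.should_collect("hello speaker", [0.0, 1.0, 0.0])
```

Keying the store by wake-up word means a different speaker who utters the same wake-up word is rejected, which is how this sketch singles out the target sound source.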

8. A smart speaker, characterized in that the smart speaker includes:

a signal acquisition module, configured to acquire, if it is determined that a voice signal sent by a target sound source needs to be collected, a first voice signal, carrying a preset wake-up word and sent by the target sound source, as received by at least two groups of signal receiving module pairs on the smart speaker; the preset wake-up word is used by the target sound source to wake up the smart speaker;

a time difference acquisition module, configured to acquire the time difference with which the two signal receiving modules in each group of signal receiving module pairs receive the first voice signal;

a positioning module, configured to determine the orientation of the target sound source that sends the first voice signal according to the first voice signal and the time difference with which each group of signal receiving module pairs receives the first voice signal.

9. The smart speaker according to claim 8, further comprising:

a positioning indication module, configured to rotate a positioning indicator mark to the orientation of the target sound source and/or light a positioning indicator light at the orientation of the target sound source, so as to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.

10. The smart speaker according to claim 8, wherein the time difference acquisition module is specifically used for:

for each group of signal receiving module pairs, acquiring, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules in the corresponding pair receive the first voice signal, wherein t0 is a function of θ.

11. The smart speaker according to claim 10, wherein the positioning module is specifically used for:

12. The smart speaker according to claim 11, wherein the positioning module is specifically configured to delay, by the time difference t0, the earlier-received of the two first voice signals received by each group of signal receiving module pairs, or to advance, by the time difference t0, the later-received of the two first voice signals received by each group of signal receiving module pairs, so that the two first voice signals are aligned in time.

13. The smart speaker according to claim 8, wherein the smart speaker further comprises:

a determination module, configured to determine that the voice signal sent by the target sound source needs to be collected;

Further, the determining module is specifically used for:

extracting the preset wake-up word from the first voice signal;

extracting voiceprint features of the first voice signal;

14. The smart speaker according to claim 13, wherein the smart speaker further comprises:

A receiving module, configured to receive a second voice signal carrying the preset wake-up word input by the user's voice corresponding to the target sound source;

An extraction module, configured to extract the preset wake-up word from the second voice signal;

The extraction module is further configured to extract the voiceprint feature of the second speech signal as the voiceprint feature of the target sound source;

The establishment module is used to establish and store the corresponding relationship between the preset wake-up word and the voiceprint feature of the target sound source.

15. A smart speaker, comprising a plurality of microphones for sending and receiving signals, characterized in that the smart speaker further comprises:

one or more processors;

a memory for storing one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-7.

16. A computer-readable medium, on which a computer program is stored, wherein, when the program is executed by a processor, the method according to any one of claims 1-7 is implemented.