Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a voiceprint recognition test system and a voiceprint recognition test method.
The embodiments of the specification provide a voiceprint recognition test system. Aiming at the problem of missing voiceprint data for intelligent glasses, the voiceprint recognition test system can realize a standardized test scheme for the voiceprint recognition capability and the physical anti-counterfeiting capability of the intelligent glasses, ensure that the intelligent glasses meet product standards, and support a robustness test mode carried out in a laboratory based on the voiceprint recognition test system, so that intelligent glasses with different hardware can be tested rapidly. The system comprises:
a playing device for sequentially playing a plurality of sound source files;
the environment sound box is used for sequentially playing a plurality of environment sound source files;
the intelligent glasses are used for collecting a live sounding audio file corresponding to each sound source file, and marking the live sounding audio file with a corresponding environment label according to a target environment sound source file which is played synchronously with the sound source file among the plurality of environment sound source files, wherein the live sounding audio file is used for conducting a voiceprint recognition test on the intelligent glasses based on the environment label.
Further, the smart glasses are further configured to obtain first start playing indication information of the playing device about each of the plurality of source files, and start to collect a live sounding audio file corresponding to the source file according to the first start playing indication information.
Further, the smart glasses are further configured to obtain second start playing indication information of the ambient sound box about each of the plurality of ambient sound source files, and determine, according to the second start playing indication information, a target ambient sound source file that is played synchronously with the source file.
Further, the playing device comprises a first type playing device, the first type playing device is located at a preset distance below the intelligent glasses, the sound source files comprise real person sound source files, and the live sounding audio file is used for conducting a voiceprint recognition passing test on the intelligent glasses based on the corresponding environment label.
Further, the playing device comprises a second type playing device, the second type playing device is located in a preset range near the intelligent glasses, the sound source files comprise real person sound source files and synthesized sound source files, and the intelligent glasses are further used for obtaining relative position information of the second type playing device relative to the intelligent glasses during the playing of each sound source file, and marking the live sounding audio file with a corresponding position label according to the relative position information, wherein the live sounding audio file is used for conducting a voiceprint recognition breakthrough test on the intelligent glasses based on the environment label and the position label.
Further, the playing device comprises a first type playing device and a second type playing device, the first type playing device is located at a preset distance below the intelligent glasses, the second type playing device is located in a preset range near the intelligent glasses, the sound source files comprise real person sound source files and synthesized sound source files, and the system is used for controlling the first type playing device to sequentially play a plurality of sound source files so as to conduct a voiceprint recognition passing test on the intelligent glasses, and controlling the second type playing device to sequentially play a plurality of sound source files so as to conduct a voiceprint recognition breakthrough test on the intelligent glasses.
Further, the first type playing device is an artificial mouth.
Further, the system further comprises:
the mechanical arm is used for clamping the second-type playing device, and sequentially moves to a plurality of designated positions according to a plurality of preset pose parameters so as to change the relative position information of the second-type playing device relative to the intelligent glasses, wherein the system is also used for controlling the movement of the mechanical arm.
Further, the mechanical arm is further configured to obtain end playing indication information of the second type playing device about a currently played sound source file, determine a latest pose parameter from the preset plurality of pose parameters in response to the end playing indication information, and move to a designated position according to the latest pose parameter.
Further, the second type playing device is further used for obtaining movement ending indication information of the mechanical arm about movement to a designated position, and playing a next sound source file in response to the movement ending indication information.
The embodiment of the specification also provides a voiceprint recognition test method, which comprises the following steps:
sequentially playing a plurality of sound source files through a playing device;
sequentially playing a plurality of environment sound source files through an environment sound box;
and collecting a live sounding audio file corresponding to each sound source file through intelligent glasses, and marking the live sounding audio file with a corresponding environment label according to a target environment sound source file which is played synchronously with the sound source file among the plurality of environment sound source files, wherein the live sounding audio file is used for conducting a voiceprint recognition test on the intelligent glasses based on the environment label.
The embodiment of the specification also provides a voiceprint recognition test method, which comprises the following steps:
Sequentially playing a plurality of first sound source files through a first type playing device, wherein the first type playing device is positioned at a preset distance below the intelligent glasses, and the first sound source files comprise real person sound source files;
Sequentially playing a plurality of first environment sound source files through an environment sound box while the first type playing device executes playing operation;
Collecting a first on-site sounding audio file corresponding to each first sounding source file through the intelligent glasses, and marking the first on-site sounding audio file with a corresponding environment label according to a target first environment sound source file synchronously played with the first sounding source file, wherein the first on-site sounding audio file is used for conducting voiceprint recognition passing test on the intelligent glasses based on the environment label;
Sequentially playing a plurality of second sound source files through a second type playing device, wherein the second type playing device is positioned in a preset range near the intelligent glasses, and the second sound source files comprise real person sound source files and synthesized sound source files;
Sequentially playing a plurality of second environment sound source files through the environment sound box while the second type playing device executes playing operation;
collecting second on-site sounding audio files corresponding to each second sound source file through the intelligent glasses, marking the second on-site sounding audio files with corresponding environment labels according to target second environment sound source files played synchronously with the second sound source files, and marking the second on-site sounding audio files with corresponding position labels according to the relative position information of the second type playing device relative to the intelligent glasses during the playing of each second sound source file, wherein the second on-site sounding audio files are used for conducting a voiceprint recognition breakthrough test on the intelligent glasses based on the environment labels and the position labels.
According to the scheme of the embodiments of the specification, a voiceprint recognition test system is provided. Aiming at the problem of missing voiceprint data for the intelligent glasses, the voiceprint recognition test system can realize a standardized test scheme for the voiceprint recognition capability and the physical anti-counterfeiting capability of the intelligent glasses, ensure that the intelligent glasses meet product standards, and support a robustness test mode carried out in a laboratory based on the voiceprint recognition test system, so that intelligent glasses with different hardware can be tested rapidly.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Referring to fig. 1, a schematic diagram of a voiceprint recognition testing system according to an embodiment of the present disclosure is provided. The following will describe the functional modules shown in fig. 1 in detail, and the voiceprint recognition test system includes a playing device 1, an ambient sound box 2, and smart glasses 3.
A playing device 1 for sequentially playing a plurality of sound source files.
In some embodiments, the voiceprint recognition test system is located in an acoustic laboratory whose sound isolation performance meets preset conditions, so as to facilitate audio acquisition and avoid noise interference. In some embodiments, the playing device includes, but is not limited to, devices with an audio playing function such as an artificial mouth, a mobile phone, or a sound box, and the specific type of the playing device is not particularly limited in this example embodiment. In some embodiments, when the playing device sequentially plays the plurality of sound source files, it may begin playing the next sound source file immediately after the previous sound source file finishes according to a preset playlist, or it may begin playing the next sound source file after a preset time period has elapsed since the previous sound source file finished. In some embodiments, the audio duration of each sound source file may be the same or may be different, which is not particularly limited in this example embodiment. In some embodiments, the sound source files include, but are not limited to, real person sound source files (e.g., a recording file obtained by live recording of a real user), synthesized sound source files (e.g., an audio file containing human voice synthesized by AI (Artificial Intelligence)), and the like, and the specific file format of the sound source files is not particularly limited in this example embodiment. In some embodiments, identical (i.e., duplicate) sound source files may exist among the plurality of sound source files, which is not particularly limited in this example embodiment.
In some embodiments, the voiceprint recognition test system further includes a management and control platform, where the management and control platform establishes a connection with the playing device in a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) manner and communicates with it, so that the management and control platform can perform playback control on the playing device, that is, control when the playing device ends playing the previous sound source file, when it starts playing the next sound source file, and so on. In some embodiments, the plurality of sound source files may correspond to a plurality of different sounding users, or, on that basis, a plurality of sound source files corresponding to the same sounding user may further correspond to a plurality of different sound intensities.
And the environment sound box 2 is used for sequentially playing a plurality of environment sound source files.
In some embodiments, the environment sound box includes, but is not limited to, any sound box specifically responsible for playing environmental sound, ambient sound, and surround effect sound; the environment sound box functions to create a sense of space, envelopment, and immersion, so that a listener feels "on the scene" rather than merely hearing sound from the front, and the specific type of the environment sound box is not particularly limited in this example embodiment. In some embodiments, the environment sound box is a high fidelity (Hi-Fi) sound box, that is, a sound box designed to faithfully reproduce the original sound through professional techniques. In some embodiments, when the environment sound box sequentially plays the plurality of environment sound source files (i.e., sound source files of environmental sound), it may begin playing the next environment sound source file immediately after the previous one finishes according to a preset playlist, or it may begin playing the next environment sound source file after a preset time period has elapsed since the previous one finished. In some embodiments, the audio duration of each environment sound source file may be the same or may be different, which is not particularly limited in this example embodiment. In some embodiments, it is necessary to ensure that a sound source file and the environment sound source file played synchronously (i.e., simultaneously) with it have the same playing duration, that is, both begin playing at the same time and end playing at the same time.
In some embodiments, each sound source file in the playlist of the playing device is played in synchronization with the environment sound source file in the corresponding playing order in the playlist of the environment sound box; for example, the third sound source file in the playlist of the playing device is played in synchronization with the third environment sound source file in the playlist of the environment sound box. In some embodiments, the environment sound source file may be a recording file containing environmental sound obtained by recording a real environment on site, or may be an audio file containing environmental sound synthesized by AI (Artificial Intelligence), and the specific file format of the environment sound source file is not limited in this example embodiment. In some embodiments, the environment type corresponding to the environment sound source file includes, but is not limited to, a mall environment, a restaurant environment, a public transportation environment, a road environment, etc., which is not particularly limited in this example embodiment. In some embodiments, identical (i.e., duplicate) environment sound source files may exist among the plurality of environment sound source files, which is not particularly limited in this example embodiment. In some embodiments, the voiceprint recognition test system further includes a management and control platform, where the management and control platform establishes a connection with the environment sound box in a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) manner and communicates with it, so that the management and control platform can perform playback control on the environment sound box, that is, control when the environment sound box ends playing the previous environment sound source file, when it starts playing the next environment sound source file, and so on.
In some embodiments, the plurality of environment sound source files may correspond to background sounds of a plurality of different scenes, or, on that basis, a plurality of environment sound source files corresponding to the same scene may further correspond to background sounds of a plurality of different sound intensities.
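The order-based pairing of the two playlists described above can be sketched as follows. This is a minimal illustration only, not part of the claimed embodiments; the file names, the duration table, and the helper function `pair_playlists` are all hypothetical.

```python
# Illustrative sketch: pair each sound source file with the environment sound
# source file in the same playlist position, and check that every synchronized
# pair has the same duration (both must start and end playing together).
def pair_playlists(source_playlist, ambient_playlist, durations):
    """Return (source, ambient) pairs played synchronously, by playlist order."""
    if len(source_playlist) != len(ambient_playlist):
        raise ValueError("playlists must have the same number of entries")
    pairs = []
    for src, amb in zip(source_playlist, ambient_playlist):
        if durations[src] != durations[amb]:
            raise ValueError(f"{src} and {amb} differ in duration")
        pairs.append((src, amb))
    return pairs

# Hypothetical file names and durations (seconds).
durations = {"voice_01.wav": 5.0, "voice_02.wav": 7.5,
             "mall_bg.wav": 5.0, "road_bg.wav": 7.5}
pairs = pair_playlists(["voice_01.wav", "voice_02.wav"],
                       ["mall_bg.wav", "road_bg.wav"], durations)
print(pairs)  # [('voice_01.wav', 'mall_bg.wav'), ('voice_02.wav', 'road_bg.wav')]
```

A mismatch in list length or in any pair's duration raises an error, reflecting the requirement that synchronized files begin and end at the same time.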
And the smart glasses 3 are used for collecting a live sounding audio file corresponding to each sound source file, and marking the live sounding audio file with a corresponding environment label according to a target environment sound source file which is played synchronously with the sound source file among the plurality of environment sound source files, wherein the live sounding audio file is used for conducting a voiceprint recognition test on the smart glasses 3 based on the environment label. In some embodiments, the smart glasses are placed at the corresponding human eye position of a mannequin or face model located in the laboratory. In some embodiments, the smart glasses include a sound collection module (e.g., a microphone) through which the live sounding audio file during the playing of each sound source file can be collected. In some embodiments, the sound collection module may be multi-microphone array hardware to implement directional pickup and composite noise reduction, where the multi-microphone array hardware is an acoustic system with multiple microphone units arranged according to a specific geometry and integrated with a dedicated processing chip, whose core is to implement accurate pickup through spatial sound wave collection and real-time signal processing. In some embodiments, the smart glasses further denoise the collected live sounding audio file by a composite noise reduction algorithm.
In some embodiments, the smart glasses need to determine the target environment sound source file played synchronously with each sound source file during the playing of that sound source file. For example, if the playing device and the environment sound box play the sound source files and the environment sound source files in order according to preset playlists, and each plays the next file immediately after the previous one finishes, or begins playing the next file a fixed time period after the previous one finishes, then, since the duration of each sound source file and each environment sound source file is known, the smart glasses can determine the target environment sound source file played synchronously with the sound source file according to the acquisition time of the live sounding audio file. For another example, a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection is established between the smart glasses and the environment sound box, and the environment sound box can send identification information (e.g., file name, file ID, etc.) of the currently played environment sound source file to the smart glasses over the connection, so that the smart glasses can determine the target environment sound source file played synchronously with the sound source file. In some embodiments, after the live sounding audio file is collected, it is further necessary to determine which sound source file the live sounding audio file corresponds to, that is, the sound source file corresponding to the live sounding audio file needs to be determined; the specific determination manner is similar to the manner of determining the target environment sound source file described above, and is not described here again.
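The first determination manner above, in which the currently playing file is inferred from the acquisition time and the known file durations, can be sketched as follows. This is an illustrative assumption, not the claimed implementation; the playlist, durations, and gap value are hypothetical.

```python
def file_playing_at(playlist, durations, gap, t):
    """Return the playlist entry playing at elapsed time t (seconds after the
    first file begins), assuming sequential playback with a fixed gap between
    files; return None if t falls inside a gap or after the playlist ends."""
    elapsed = 0.0
    for name in playlist:
        if elapsed <= t < elapsed + durations[name]:
            return name
        elapsed += durations[name] + gap
    return None

# Hypothetical environment-sound playlist with a 1-second gap between files.
ambient = ["mall_bg.wav", "road_bg.wav"]
durations = {"mall_bg.wav": 5.0, "road_bg.wav": 7.5}
print(file_playing_at(ambient, durations, 1.0, 6.2))  # road_bg.wav
print(file_playing_at(ambient, durations, 1.0, 5.5))  # None (inside the gap)
```

An acquisition timestamped 6.2 s after the session start therefore maps to the second environment sound source file, which becomes the target file for labeling.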
In some embodiments, since the smart glasses collect live sounding audio files under different environmental sounds, the smart glasses need to mark each live sounding audio file with a corresponding environment label according to the target environment sound source file played synchronously with its sound source file (i.e., the target environment sound source file corresponding to the live sounding audio file), where the environment label includes, but is not limited to, the file name, file ID, environment type, environment description, and the like corresponding to the environment sound source file, and this example embodiment is not limited thereto. In some embodiments, the smart glasses may collect live sounding audio files under different environmental sounds and different combinations (combinations of real person sound sources and environmental sounds). In some embodiments, the smart glasses provide the collected live sounding audio files with their corresponding environment labels to a server, so that the server performs a voiceprint recognition test on the smart glasses based on the environment labels, where the voiceprint recognition test includes, but is not limited to, a voiceprint recognition passing test and a voiceprint recognition breakthrough test.
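The labeling step above amounts to attaching structured metadata to each capture. A minimal sketch follows; the record type `LiveAudioRecord`, its field names, and the sample values are illustrative assumptions rather than the claimed data format.

```python
from dataclasses import dataclass, field

# Hypothetical record for one collected capture; field names are illustrative.
@dataclass
class LiveAudioRecord:
    audio_path: str       # path of the collected live sounding audio file
    source_file: str      # sound source file this capture corresponds to
    labels: dict = field(default_factory=dict)

def tag_environment(record, ambient_file, env_type):
    """Attach an environment label derived from the synchronously played
    target environment sound source file."""
    record.labels["environment"] = {"file": ambient_file, "env_type": env_type}
    return record

rec = tag_environment(
    LiveAudioRecord("capture_001.wav", "voice_01.wav"),
    ambient_file="mall_bg.wav", env_type="mall")
print(rec.labels["environment"]["env_type"])  # mall
```

The tagged records would then be uploaded to the server, which groups test results by environment label when computing per-environment statistics.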
In some embodiments, the voiceprint recognition passing test is aimed at testing whether the smart glasses can recognize the wearing user's own voice from the live sounding audio file through voiceprint recognition technology, and the recognition passing rate of the voiceprint recognition of the smart glasses is counted according to the test results, where the recognition passing rate is the probability that the smart glasses correctly recognize the sounding user from the live sounding audio files; specifically, the passing rate is the proportion of all attempts in which the sounding user is correctly recognized, and the higher the value, the better the voiceprint recognition experience of the smart glasses. In some embodiments, the voiceprint recognition breakthrough test is aimed at testing whether the smart glasses can recognize fake audio (i.e., illegal audio, fraudulent audio, or physical attack audio, such as a recording or AI-synthesized audio, rather than actual sounding of the user wearing the smart glasses) from the live sounding audio file through voiceprint recognition technology, and the recognition breakthrough rate of the voiceprint recognition of the smart glasses is counted according to the test results, where the recognition breakthrough rate is the probability that the smart glasses successfully recognize the fake audio from the live sounding audio files; specifically, the breakthrough rate is the proportion of all attempts in which the fake audio is successfully recognized, and the higher the value, the better the voiceprint anti-counterfeiting performance or voiceprint security performance of the smart glasses.
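Both statistics above reduce to the same arithmetic: the fraction of attempts with the desired outcome. A worked sketch with hypothetical outcome lists:

```python
def rate(outcomes):
    """Fraction of attempts marked True; outcomes is a non-empty list of bools."""
    return sum(outcomes) / len(outcomes)

# Passing test: True when the wearing user is correctly recognized from a
# genuine (real person) sample. Values below are hypothetical.
pass_outcomes = [True, True, False, True]
# Breakthrough test: True when a forged sample is successfully flagged as fake.
fake_outcomes = [True, False, True, True, True]

print(f"recognition passing rate: {rate(pass_outcomes):.0%}")       # 75%
print(f"recognition breakthrough rate: {rate(fake_outcomes):.0%}")  # 80%
```

In both cases a higher value is better: the passing rate measures recognition experience on genuine audio, while the breakthrough rate measures anti-counterfeiting performance on fake audio.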
In some embodiments, the specific test manner may be to input the live sounding audio file into a voiceprint recognition model used by the smart glasses (the model may be located at the glasses end or at a server end corresponding to the smart glasses) and obtain a corresponding test result according to the recognition result output by the model; or the live sounding audio file may be provided to the glasses end or the server end, the glasses end or the server end invokes a corresponding internal or external interface to perform recognition, and a corresponding test result is obtained according to the recognition result provided by the glasses end or the server end. In some embodiments, the voiceprint recognition test system further includes a management and control platform, where the management and control platform establishes a connection with the smart glasses in a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) manner and communicates with them, so that the management and control platform can perform collection control on the smart glasses, that is, control when the smart glasses end collecting the previous live sounding audio file, when they start collecting the next live sounding audio file, and so on.
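The two test paths above (a local model versus an interface invoked at the glasses end or server end) can share one harness by treating the recognizer as pluggable. A sketch under that assumption; `fake_local_model` is a stand-in, not a real voiceprint model.

```python
def run_voiceprint_test(audio_path, recognizer):
    """Run one test sample through a pluggable recognizer. `recognizer` may be
    a local model's predict function or a thin client around a glasses-end or
    server-end interface; both are assumed to return a result string."""
    result = recognizer(audio_path)
    return {"audio": audio_path, "result": result}

# Hypothetical local stand-in: classifies by file name for illustration only.
def fake_local_model(path):
    return "wearer" if "genuine" in path else "forged"

print(run_voiceprint_test("genuine_01.wav", fake_local_model)["result"])  # wearer
print(run_voiceprint_test("replay_01.wav", fake_local_model)["result"])   # forged
```

Swapping `fake_local_model` for a remote-interface client changes nothing else in the harness, which is the point of routing both test manners through one call site.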
According to the scheme of the embodiments of the specification, a voiceprint recognition test system is provided. Aiming at the problem of missing voiceprint data for the smart glasses, the voiceprint recognition test system can realize a standardized test scheme for the voiceprint recognition capability and the physical anti-counterfeiting capability of the smart glasses, ensure that the smart glasses meet product standards, and support a robustness test mode carried out in a laboratory based on the voiceprint recognition test system, so that smart glasses with different hardware can be tested rapidly.
In some embodiments, the smart glasses 3 are further configured to obtain first start playing indication information of the playing device 1 about each of the plurality of sound source files, and start to collect the live sounding audio file corresponding to the sound source file according to the first start playing indication information. In some embodiments, a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection is established between the smart glasses and the playing device, and when a certain sound source file starts playing, the playing device may send first start playing indication information corresponding to the sound source file (i.e., indicating that the sound source file starts playing) to the smart glasses, so that after receiving the first start playing indication information, the smart glasses start to collect the live sounding audio file corresponding to the sound source file. In some embodiments, the first start playing indication information includes identification information (e.g., file name, file ID, etc.) of the sound source file, so that the smart glasses can determine the sound source file corresponding to the currently collected live sounding audio file based on the identification information. In some embodiments, the playing device may further send first end playing indication information corresponding to the sound source file (i.e., indicating that the sound source file ends playing) to the smart glasses when the sound source file ends playing; after receiving the first end playing indication information, the smart glasses stop collecting the live sounding audio file corresponding to the sound source file and store it.
In some embodiments, the smart glasses 3 are further configured to obtain second start playing indication information of the ambient sound box 2 about each of the plurality of environment sound source files, and determine, according to the second start playing indication information, the target environment sound source file that is played synchronously with the sound source file. In some embodiments, a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection is also established between the smart glasses and the environment sound box, where the environment sound box may send second start playing indication information corresponding to a certain environment sound source file (i.e., indicating that the environment sound source file starts playing) to the smart glasses when that environment sound source file starts playing, where the second start playing indication information includes identification information (e.g., file name, file ID, etc.) of the environment sound source file, so that the smart glasses can determine the target environment sound source file played synchronously with the sound source file after receiving the second start playing indication information.
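The indication-driven collection described in the two paragraphs above behaves like a small state machine: capture begins on a start-play indication carrying the file's identification information, and stops and stores on the matching end-play indication. A toy sketch, with hypothetical message-handler names:

```python
class GlassesCollector:
    """Toy state machine for indication-driven audio collection."""
    def __init__(self):
        self.current = None   # sound source file currently being captured
        self.stored = []      # identifiers of completed, stored captures

    def on_start_play(self, source_id):
        # Start-play indication received: begin collecting for this file.
        self.current = source_id

    def on_end_play(self, source_id):
        # End-play indication received: stop and store the matching capture.
        if self.current == source_id:
            self.stored.append(source_id)
            self.current = None

c = GlassesCollector()
c.on_start_play("voice_01.wav")
c.on_end_play("voice_01.wav")
print(c.stored)  # ['voice_01.wav']
```

Carrying the file ID in both indications lets the glasses ignore an end-play message that does not match the capture in progress, which guards against out-of-order or duplicated messages.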
In some embodiments, the playing device 1 comprises a first type playing device located at a preset distance below the smart glasses 3, the sound source files comprise real person sound source files, and the live sounding audio file is used for conducting a voiceprint recognition passing test on the smart glasses 3 based on the corresponding environment label. In some embodiments, the first type playing device is located at a preset distance below the smart glasses to simulate the actual relative position relationship between the smart glasses worn by a user and the user's mouth. In some embodiments, the first type playing device is placed at the corresponding mouth position of a mannequin or face model in the laboratory on which the smart glasses have been worn. In some embodiments, the smart glasses may collect live sounding audio files under different environmental sounds and different combinations (combinations of real person sound sources and environmental sounds). In some embodiments, the voiceprint recognition passing test is aimed at testing whether the smart glasses can recognize the wearing user's own voice from the live sounding audio file through voiceprint recognition technology, and the recognition passing rate of the voiceprint recognition of the smart glasses is counted according to the test results, where the recognition passing rate is the probability that the smart glasses correctly recognize the sounding user from the live sounding audio files; specifically, the passing rate is the proportion of all attempts in which the sounding user is correctly recognized, and the higher the value, the better the voiceprint recognition experience of the smart glasses.
In some embodiments, the specific test manner may be to input the live sounding audio file with its corresponding environment label into the voiceprint recognition model used by the smart glasses (the model may be located at the glasses end or at a server end corresponding to the smart glasses) and obtain a corresponding test result according to the recognition result output by the model; or the live sounding audio file may be provided to the glasses end or the server end, and the glasses end or the server end invokes a corresponding internal or external interface to perform recognition and obtains a corresponding test result according to the recognition result it provides, which is not limited in this example embodiment.
In some embodiments, the playing device 1 includes a second type playing device, the second type playing device is located within a preset range near the smart glasses 3, the sound source files include real person sound source files and synthesized sound source files, and the smart glasses 3 are further configured to obtain relative position information of the second type playing device relative to the smart glasses 3 during the playing of each sound source file and mark the live sounding audio file with a corresponding position label according to the relative position information, where the live sounding audio file is used to perform a voiceprint recognition breakthrough test on the smart glasses 3 based on the environment label and the position label. In some embodiments, the second type playing device is located within a preset range near the smart glasses (for example, it may be located in the laboratory where the smart glasses are located, or within a range with a radius of, for example, 5 meters centered on the smart glasses), or, on that basis, the distance between the second type playing device and the smart glasses may further be required to be greater than or equal to a preset distance threshold. In some embodiments, the second type playing device is different from the first type playing device, and the second type playing device includes, but is not limited to, a mobile phone, a sound box, and other devices having an audio playing function.
In some embodiments, the relative position of the playing device with respect to the smart glasses may vary, i.e., the playing device may be movable (e.g., placed on a movable cart), so that the playing positions or playing angles of the respective sound source files played by the second type playing device may differ; however, during the playing of any one sound source file, the relative position of the playing device with respect to the smart glasses is fixed, i.e., each field sounding audio file corresponds to one relative position, and different field sounding audio files may correspond to different relative positions. In some embodiments, the playing device itself may also vary, i.e., the plurality of sound source files may be played by different playing devices respectively. In some embodiments, the smart glasses may thus collect field sounding audio files at different sounding positions and under different ambient sounds. Because the smart glasses collect the field sounding audio files at different sounding positions and under different ambient sounds, besides the environment tags, the smart glasses need to obtain relative position information of the playing device with respect to the smart glasses (including, but not limited to, distance information and angle information) during the playing of each sound source file, i.e., the relative position information corresponding to each collected field sounding audio file, and then mark the field sounding audio files accordingly, where the position tags include, but are not limited to, specific values of the relative distance and of the relative angle.
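One way to represent a collected file together with its environment and position tags might look like the following sketch; all field names and the tag format are illustrative assumptions rather than part of the system:

```python
from dataclasses import dataclass

@dataclass
class LiveAudioRecord:
    """A collected field sounding audio file with its test labels (illustrative)."""
    path: str            # where the glasses stored the recording
    env_label: str       # ambient sound played in sync (e.g. "street-noise")
    distance_m: float    # playing device's distance from the glasses
    angle_deg: float     # playing device's angle relative to the glasses

    def position_tag(self) -> str:
        # encode the relative position as a compact tag string
        return f"d={self.distance_m:.1f}m,a={self.angle_deg:.0f}deg"
```

Keeping distance and angle as separate numeric fields lets later analysis slice breakthrough results by playing position as well as by ambient sound.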
In some embodiments, image recognition may be performed on a field picture or field video captured by the shooting module of the smart glasses during the playing of each sound source file, and the relative position information of the playing device with respect to the smart glasses may be calculated from the recognition result. Alternatively, a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection may be established between the smart glasses and the playing device; the playing device obtains its movement information (movement distance and/or movement angle) relative to its initial position during the playing of each sound source file, calculates the relative position information of its current position with respect to the smart glasses from the pre-known initial relative position information of the initial position with respect to the smart glasses (the smart glasses themselves do not move), and then sends the relative position information to the smart glasses. In some embodiments, the voiceprint recognition breakthrough test tests whether the smart glasses can, through voiceprint recognition technology, recognize fake audio (i.e., illegal audio, fraudulent audio or physical-attack audio, such as a recording or AI-synthesized audio, rather than actual sounding by the user wearing the smart glasses) from the field sounding audio files, and counts the recognition breakthrough rate of the voiceprint recognition of the smart glasses according to the test results, where the recognition breakthrough rate is the probability that the smart glasses successfully recognize the fake audio from the field sounding audio files; specifically, it is the proportion of all attempts in which the fake audio is successfully recognized, and the higher the value, the better the voiceprint anti-counterfeiting or voiceprint security performance of the smart glasses. In some embodiments, the specific test mode may be to input the field sounding audio file marked with the corresponding environment label and position label into the voiceprint recognition model used by the smart glasses (the model may be located at the glasses end or at a server end corresponding to the smart glasses) and to obtain a corresponding test result according to the recognition result output by the model; or the field sounding audio file may be provided to the glasses end or the server end, which invokes a corresponding internal or external interface to perform recognition, the corresponding test result then being obtained according to the recognition result it provides.
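The position update from a known initial relative position plus reported movement information can be sketched as follows; the polar-coordinate convention (glasses at the origin, bearings in degrees) and the function name are illustrative assumptions:

```python
import math

def current_relative_position(init_dist, init_angle_deg, move_dist, move_angle_deg):
    """Update the device-to-glasses relative position after a move.

    The initial relative position is given in polar form (distance, bearing)
    with the stationary smart glasses at the origin; the move is a straight
    displacement of `move_dist` metres along absolute bearing `move_angle_deg`.
    Returns the new (distance, bearing) of the playing device.
    """
    # convert the initial polar position to cartesian coordinates
    x = init_dist * math.cos(math.radians(init_angle_deg))
    y = init_dist * math.sin(math.radians(init_angle_deg))
    # apply the displacement reported by the playing device or robotic arm
    x += move_dist * math.cos(math.radians(move_angle_deg))
    y += move_dist * math.sin(math.radians(move_angle_deg))
    return math.hypot(x, y), math.degrees(math.atan2(y, x))
```

For example, a device initially 1 m away at bearing 0° that moves 1 m along bearing 90° ends up √2 m away at bearing 45°.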
In some embodiments, the playing device 1 includes a first type playing device and a second type playing device, where the first type playing device is located at a preset distance below the smart glasses 3 and the second type playing device is located in a preset range near the smart glasses 3, and the sound source files include a real person sound source file and a synthesized sound source file. The system is configured to control the first type playing device to sequentially play a plurality of sound source files so as to perform a voiceprint recognition pass test on the smart glasses 3, and to control the second type playing device to sequentially play a plurality of sound source files so as to perform a voiceprint recognition breakthrough test on the smart glasses 3. In some embodiments, the voiceprint recognition test system first controls the first type playing device to sequentially play a plurality of sound source files while the environment sound box synchronously plays a plurality of environment sound source files; the smart glasses then collect a field sounding audio file corresponding to each sound source file and mark it with a corresponding environment tag, and the file is used for the voiceprint recognition pass test of the smart glasses based on its environment tag, in the specific manner described in detail above. The voiceprint recognition test system then controls the second type playing device (the first type playing device having stopped playing) to sequentially play a plurality of sound source files while the environment sound box still synchronously plays a plurality of environment sound source files; the smart glasses then collect a field sounding audio file corresponding to each sound source file and mark it with a corresponding environment tag and position tag, and the file is used for the voiceprint recognition breakthrough test of the smart glasses based on its environment tag and position tag, in the specific manner described in detail above. In some embodiments, the first type playing device and the second type playing device may play the same batch of sound source files respectively, or may play two different batches of sound source files respectively, which is not limited in this example embodiment. In some embodiments, the environment sound box may play the same batch of environment sound source files in the voiceprint recognition pass test and in the voiceprint recognition breakthrough test, or may play two different batches, which is not limited in this example embodiment. In some embodiments, the voiceprint recognition test system further includes a management and control platform, which establishes a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection with, and communicates with, the first type playing device and the second type playing device; the voiceprint recognition test system can control the playback of both devices through the management and control platform.
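The two-phase sequencing described above (pass test with the first type playing device, then breakthrough test with the second) might be sketched as follows; the `play` and `collect` calls are hypothetical stand-ins for the control-platform and glasses interfaces, not real APIs:

```python
def run_two_phase_test(first_device, second_device, env_speaker, glasses,
                       source_files, env_files):
    """Sequence the pass test, then the breakthrough test (illustrative).

    `first_device`, `second_device`, `env_speaker`, and `glasses` are
    hypothetical handles exposed by the management and control platform.
    """
    results = []
    for phase, device in (("pass", first_device), ("breakthrough", second_device)):
        for src, env in zip(source_files, env_files):
            device.play(src)       # the other device stays silent in this phase
            env_speaker.play(env)  # ambient sound plays in sync
            audio = glasses.collect(env_label=env)
            results.append((phase, audio))
    return results
```

Each collected file is tagged with the phase it belongs to, so pass-rate and breakthrough-rate statistics can be computed separately afterwards.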
In some embodiments, the first type playing device is an artificial mouth. In the field of acoustic testing, an artificial mouth (also called a dummy mouth) is a specially designed sound source device that simulates the acoustic characteristics of a real human mouth when sounding: through a specially designed speaker and cavity structure, it reproduces the acoustic radiation characteristics of the human mouth in the near field (such as directivity and radiation pattern), ensuring that the test environment approximates a real human-voice scene; as an integrated hardware/software device, it approximates the frequency response of real speech with as little distortion as possible. In some embodiments, the artificial mouth is placed at the corresponding mouth location of a mannequin or head model in the laboratory on which the smart glasses are worn. In some embodiments, the artificial mouth is integrated directly into the mannequin or head model, i.e., the artificial mouth is part of the mannequin or head model.
In some embodiments, the voiceprint recognition test system further comprises a mechanical arm, where the mechanical arm is used for clamping the second type playing device and for sequentially moving to a plurality of designated positions according to a plurality of preset pose parameters so as to change the relative position information of the second type playing device with respect to the smart glasses 3, and the system is further used for controlling the movement of the mechanical arm. In some embodiments, because the mechanical arm can sequentially move to a plurality of designated positions according to a plurality of preset pose parameters, the second type playing device it clamps moves to those positions correspondingly, thereby changing the relative position information of the second type playing device with respect to the smart glasses; automatically changing the playing position or playing angle of the sound source files through the mechanical arm improves test efficiency. In some embodiments, the pose parameters include, but are not limited to, position and pose, where position represents a coordinate point (e.g., x, y, z) of the mechanical arm in three-dimensional space and pose represents a rotational orientation of the mechanical arm (e.g., its rotation angle about an axis). In some embodiments, the base of the mechanical arm is not itself movable, i.e., the mechanical arm can only move within a preset range around its in-situ position; alternatively, the base of the mechanical arm is itself movable, for example with wheels mounted on it that move according to preset parameters so as to carry the mechanical arm along, while the mechanical arm can additionally move within a preset range around the current position of its base. In some embodiments, the mechanical arm may move between motion points on a 360-degree spatial sphere. In some embodiments, a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection is established between the mechanical arm and the smart glasses, and the mechanical arm may obtain the relative position information of the mechanical arm with respect to the smart glasses (i.e., the relative position information of the second type playing device with respect to the smart glasses) during the playing of each sound source file and send it to the smart glasses; or the mechanical arm may obtain its current relative position information with respect to the smart glasses after each movement to a designated position is completed, and then send it to the smart glasses.
In some embodiments, the mechanical arm may calculate the movement information (movement distance and/or movement angle) of its current position relative to its initial position according to the pose parameters corresponding to the current or most recent movement, then calculate the relative position information of the playing device's current position with respect to the smart glasses from the pre-known initial relative position information of the initial position with respect to the smart glasses (the smart glasses themselves do not move), and then send the relative position information to the smart glasses. In some embodiments, after moving to a designated position based on one pose parameter, the mechanical arm waits for the end of playback of the sound source file currently played by the second type playing device it clamps before starting to move to the next designated position based on the next pose parameter, or starts to move to the next designated position after a preset time interval. In some embodiments, the voiceprint recognition test system further includes a management and control platform connected to, and communicating with, the mechanical arm in a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) manner, and the voiceprint recognition test system can perform motion control of the mechanical arm through the management and control platform, for example controlling when the mechanical arm starts to move.
In some embodiments, the mechanical arm is further configured to obtain end-of-play indication information from the second type playing device about the currently played sound source file, determine the next pose parameter among the preset plurality of pose parameters in response to the end-of-play indication information, and move to the corresponding designated position according to that pose parameter. In some embodiments, a wired or wireless (e.g., Wi-Fi, Bluetooth, etc.) connection is established between the mechanical arm and the second type playing device; the second type playing device sends end-of-play indication information for the currently playing sound source file when its playback ends, and after receiving it, the mechanical arm determines the next pose parameter (e.g., the pose parameter arranged immediately after the current one in the preset order) from among the preset plurality of pose parameters and then starts to move to the next designated position accordingly.
In some embodiments, the second type playing device is further configured to obtain end-of-motion indication information about the mechanical arm having moved to a designated position, and to play the next sound source file in response to the end-of-motion indication information. In some embodiments, after the mechanical arm moves to the designated position according to the current pose parameter, it sends the corresponding end-of-motion indication information to the second type playing device, and after receiving it, the second type playing device finishes with the currently played sound source file and starts playing the next one, e.g., the sound source file arranged immediately after the current one in the preset playing sequence.
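The alternation of arm movement and playback driven by the end-of-play and end-of-motion indications can be sketched as a simple loop; `move_to` and `play` are hypothetical blocking calls that return when the corresponding indication arrives, not real device APIs:

```python
def run_breakthrough_sequence(pose_params, source_files, move_to, play):
    """Alternate arm moves and playbacks via end-of-X indications (illustrative).

    `move_to(pose)` blocks until the arm's end-of-motion indication;
    `play(src)` blocks until the device's end-of-play indication.
    """
    log = []
    for pose, src in zip(pose_params, source_files):
        move_to(pose)   # device waits for the end-of-motion indication
        log.append(("moved", pose))
        play(src)       # arm waits for the end-of-play indication
        log.append(("played", src))
    return log
```

Modeling each step as a blocking call makes the handshake explicit: no file plays while the arm is moving, and the arm never moves mid-playback.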
Referring to fig. 2, a flow chart of a voiceprint recognition testing method is provided in an embodiment of the present disclosure. In this embodiment of the specification, the voiceprint recognition testing method is applied to a voiceprint recognition test system comprising a playing device, an environment sound box and smart glasses. The flow shown in fig. 2 is described in detail below; the voiceprint recognition testing method may specifically include the following steps:
S102, sequentially playing a plurality of sound source files through a playing device. In some embodiments, the playing device includes, but is not limited to, devices with audio playing functions such as an artificial mouth, a mobile phone or a sound box, and the specific type of the playing device is not particularly limited in this example embodiment. The specific embodiments are described in detail above and will not be repeated here.
S104, sequentially playing a plurality of environment sound source files through the environment sound box. In some embodiments, the environmental speakers include, but are not limited to, any speaker specifically responsible for playing environmental sound, ambient sound, and surround effect sound, and the specific type of environmental speakers is not particularly limited by the present example embodiment. The specific embodiments are described in detail above and will not be repeated here.
S106, collecting a field sounding audio file corresponding to each sound source file through the smart glasses, and marking the field sounding audio file with a corresponding environment label according to the target environment sound source file, among the environment sound source files, that is played synchronously with the sound source file, where the field sounding audio files are used for conducting voiceprint recognition tests on the smart glasses based on the environment labels. In some embodiments, the smart glasses may collect field sounding audio files under different ambient sounds and different combinations (combinations of real person sound sources and ambient sounds). In some embodiments, the voiceprint recognition tests include, but are not limited to, a voiceprint recognition pass test and a voiceprint recognition breakthrough test. In some embodiments, the voiceprint recognition pass test tests whether the smart glasses can, through voiceprint recognition technology, identify the sounding user (the person wearing the smart glasses) from the field sounding audio files, and counts the recognition pass rate of the voiceprint recognition of the smart glasses according to the test results, where the recognition pass rate is the probability that the smart glasses correctly recognize the sounding user from the field sounding audio files; specifically, it is the proportion of all attempts in which the sounding user is correctly recognized, and the higher the value, the better the voiceprint recognition experience of the smart glasses. In some embodiments, the voiceprint recognition breakthrough test tests whether the smart glasses can, through voiceprint recognition technology, recognize fake audio (i.e., illegal audio, fraudulent audio or physical-attack audio, such as a recording or AI-synthesized audio, rather than actual sounding by the user wearing the smart glasses) from the field sounding audio files, and counts the recognition breakthrough rate of the voiceprint recognition of the smart glasses according to the test results, where the recognition breakthrough rate is the probability that the smart glasses successfully recognize the fake audio from the field sounding audio files; specifically, it is the proportion of all attempts in which the fake audio is successfully recognized, and the higher the value, the better the voiceprint anti-counterfeiting or voiceprint security performance of the smart glasses. In some embodiments, the specific test mode may be to input the field sounding audio file into the voiceprint recognition model used by the smart glasses (the model may be located at the glasses end or at a server end corresponding to the smart glasses) and to obtain a corresponding test result according to the recognition result output by the model; or the field sounding audio file may be provided to the glasses end or the server end, which invokes a corresponding internal or external interface to perform recognition, the corresponding test result then being obtained according to the recognition result it provides. The specific embodiments are described in detail above and will not be repeated here.
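The two rates defined above are simple proportions, and computing them can be sketched as follows; the `(is_fake, recognized_as_wearer)` result format is an illustrative assumption about how per-attempt outcomes are recorded:

```python
def recognition_rates(results):
    """Compute the recognition pass rate and breakthrough rate (illustrative).

    `results` is a list of (is_fake, recognized_as_wearer) pairs. The pass
    rate is the share of genuine attempts accepted as the sounding user;
    the breakthrough rate, as defined above, is the share of fake attempts
    that the glasses successfully recognize as fake (i.e. reject).
    """
    genuine_hits = [recognized for fake, recognized in results if not fake]
    fake_hits = [not recognized for fake, recognized in results if fake]
    pass_rate = sum(genuine_hits) / len(genuine_hits) if genuine_hits else 0.0
    breakthrough_rate = sum(fake_hits) / len(fake_hits) if fake_hits else 0.0
    return pass_rate, breakthrough_rate
```

Keeping the two attempt pools separate avoids a common pitfall: a model that rejects everything would score a perfect breakthrough rate but a pass rate of zero, so both numbers are needed to judge the glasses.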
Referring to fig. 3, a flow chart of a voiceprint recognition testing method is provided in an embodiment of the present disclosure. In this embodiment of the specification, the voiceprint recognition testing method is applied to a voiceprint recognition test system comprising a playing device, an environment sound box and smart glasses. The flow shown in fig. 3 is described in detail below; the voiceprint recognition testing method may specifically include the following steps:
S202, sequentially playing a plurality of first sound source files through a first type playing device, where the first type playing device is located at a preset distance below the smart glasses and the first sound source files comprise real person sound source files. In some embodiments, the first type playing device includes, but is not limited to, devices with audio playing capabilities such as an artificial mouth, a mobile phone or a speaker; preferably, the first type playing device is an artificial mouth. The specific embodiments are described in detail above and will not be repeated here.
S204, while the first type playing device executes the playing operation, sequentially playing a plurality of first environment sound source files through the environment sound box. In some embodiments, the environment sound box includes, but is not limited to, any speaker specifically responsible for playing environmental sound, ambient sound and surround effect sound. The specific embodiments are described in detail above and will not be repeated here.
S206, collecting a first field sounding audio file corresponding to each first sound source file through the smart glasses, and marking the first field sounding audio file with a corresponding environment label according to the target first environment sound source file played synchronously with the first sound source file, where the first field sounding audio files are used for conducting a voiceprint recognition pass test on the smart glasses based on the environment labels. The specific embodiments are described in detail above and will not be repeated here.
S208, sequentially playing a plurality of second sound source files through a second type playing device, where the second type playing device is located in a preset range near the smart glasses and the second sound source files comprise a real person sound source file and a synthesized sound source file. In some embodiments, the second type playing device is different from the first type playing device and includes, but is not limited to, devices with audio playing capabilities such as a mobile phone or a speaker; preferably, the second type playing device is a playing device other than an artificial mouth. The specific embodiments are described in detail above and will not be repeated here.
S210, while the second type playing device executes the playing operation, sequentially playing a plurality of second environment sound source files through the environment sound box. The specific embodiments are described in detail above and will not be repeated here.
S212, collecting second field sounding audio files corresponding to each second sounding source file through the intelligent glasses, marking corresponding environment labels for the second field sounding audio files according to target second environment sound source files synchronously played with the second sounding source files, and marking corresponding position labels for the second field sounding audio files according to relative position information of the second type playing equipment relative to the intelligent glasses in the playing process of each second sounding source file, wherein the second field sounding audio files are used for carrying out voiceprint recognition breakthrough testing on the intelligent glasses based on the environment labels and the position labels. The specific embodiments are described in detail above and will not be repeated here.
Fig. 4 is a schematic flow chart of a voiceprint recognition passing test method according to an embodiment of the present disclosure.
As shown in fig. 4, the voiceprint recognition test system includes an artificial mouth, a high-fidelity environment sound box and smart glasses. The artificial mouth sequentially selects and plays real person audio files while the high-fidelity environment sound box sequentially selects and plays environment sound audio files; the glasses end collects the sound data and stores the sound files, which are marked with labels containing the environment information; the sound files are used for a voiceprint recognition pass test on the smart glasses based on the corresponding environment labels, and the recognition pass rate of the smart glasses is obtained.
Fig. 5 is a flow chart of a voiceprint recognition breakthrough test method according to an embodiment of the present disclosure.
As shown in fig. 5, the voiceprint recognition test system includes a mechanical arm, a connected device (a playing device attached to the mechanical arm), a high-fidelity environment sound box and smart glasses. The connected device sequentially selects and plays real/synthesized audio files while the high-fidelity environment sound box sequentially selects and plays environment sound audio files; mechanical arm angles are sequentially selected so that the mechanical arm moves to the designated positions; the glasses end collects the sound data and stores the sound files, which are marked with labels containing the environment information and angle information; the sound files are used for voiceprint recognition breakthrough tests on the smart glasses based on the corresponding environment labels and angle labels, and the recognition breakthrough rate of the smart glasses is obtained.
The embodiment of the specification also provides a schematic structural diagram of the electronic device, shown in fig. 6. At the hardware level, as shown in fig. 6, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the method described above.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment mainly describes its differences from the other embodiments. In particular, the system embodiments, being substantially similar to the method embodiments, are described relatively simply, and for relevant parts reference may be made to the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.