WO2024055299A1 - Real-time speech translation method, system, device, and storage medium - Google Patents
Real-time speech translation method, system, device, and storage medium
- Publication number
- WO2024055299A1 (PCT/CN2022/119375)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- speech
- voice
- synthesized
- real
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/01—Correction of time axis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
Definitions
- the present invention relates to the technical field of online translation, and specifically to a real-time speech translation method, system, equipment and storage medium.
- the robot voice translator is responsible for translating the speaker's content into the listener's language for playback.
- the translated content may deviate somewhat from the speaker's original intent, and there is a time lag between the spoken content and the real-time translated voice.
- because the speaker does not understand the translated voice, the speaker cannot know in time whether, and at what point, what was just said has been correctly heard by the listening party.
- as a result, the speaker cannot confirm the timeliness and correctness of what was said, and has to pause deliberately, wait for the speech translation to finish, and then guess whether the listener has heard and understood correctly; the efficiency and accuracy of communication between the two parties therefore cannot be guaranteed.
- the purpose of the present invention is to provide a real-time speech translation method, system, equipment and storage medium, which improves the accuracy of cross-language translation and the communication efficiency of both parties in cross-language communication.
- the present invention provides a real-time speech translation method, which method includes the following steps:
- the first synthesized voice is output to the first user, and the second synthesized voice is simultaneously output to the second user.
- the translating the first speech information to obtain a second synthesized speech corresponding to the second language category includes:
- obtaining the first synthesized speech corresponding to the first language category based on the first text data includes:
- the method also includes:
- the method also includes:
- corresponding interest tags are marked on the first text data and the second text data respectively.
- before outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user, the method includes:
- outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user includes:
- performing speech recognition on the first voice information input by the first user to obtain the first text data includes:
- the initial text data is modified using the preceding data to obtain first text data.
- before outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user, the method includes:
- when it is detected that the two earphones of a pair of earphones used for outputting voice are worn by different users, the earphones are controlled to work in a first state; when it is detected that the first user and the second user each wear a pair of earphones, the earphones are controlled to work in a second state;
- in the first state, the two earphones output different voices, and the single earphone worn by a user serves as that user's voice output channel; in the second state, both earphones of a pair output the same voice, and the pair of earphones worn by a user serves as that user's voice output channel.
- the pair of earphones is provided with a UWB communication module; before outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user, the method includes:
- the pair of earphones detects the distance between the two earphones in real time based on the UWB communication module;
- when the distance between the two earphones is greater than a first preset threshold, the earphones are controlled to work in the first state; when the distance between the two earphones is less than the first preset threshold, the earphones are controlled to work in the second state.
- the method also includes:
- the playback speed corresponding to the first synthesized speech increases as the number of audio frames in the audio adjustment library increases, and decreases as the number of audio frames in the audio adjustment library decreases.
- the present invention also provides a real-time speech translation system for realizing the above-mentioned real-time speech translation method.
- the system includes:
- the first text data generation module performs speech recognition on the first voice information input by the first user to obtain the first text data
- a second synthesized speech generation module translates the first speech information to obtain a second synthesized speech corresponding to the second language category
- a first synthesized speech generation module obtains, based on the first text data, a first synthesized speech corresponding to the first language category;
- the synthesized voice playing module outputs the first synthesized voice to the first user, and simultaneously outputs the second synthesized voice to the second user.
- the invention also provides a real-time speech translation device, including:
- a memory in which an executable program of the processor is stored
- the processor is configured to execute the steps of any one of the above real-time speech translation methods by executing the executable program.
- the present invention also provides a computer-readable storage medium for storing a program which, when executed by a processor, implements the steps of any one of the above real-time speech translation methods.
- the present invention has the following advantages and outstanding effects:
- the real-time speech translation method, system, device, and storage medium provided by the present invention synthesize a first synthesized speech from the first text data obtained by speech recognition of the speaker's content, and play the first synthesized speech synchronously to the speaker while the second synthesized speech is played to the listening party.
- in this way, errors in the translated content caused by mistakes in the speech recognition process can be noticed by the speaker in time, which improves the accuracy of cross-language translation and the communication efficiency of both parties, making cross-language communication smoother.
- Figure 1 is a schematic diagram of a real-time speech translation method disclosed in an embodiment of the present invention
- Figure 2 is a schematic diagram of an application scenario involved in the real-time speech translation method disclosed in an embodiment of the present invention
- Figure 3 is a schematic diagram of another application scenario involved in the real-time speech translation method disclosed in an embodiment of the present invention.
- Figure 4 is a schematic diagram of another application scenario involved in the real-time speech translation method disclosed in an embodiment of the present invention.
- Figure 5 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention.
- Figure 6 is a schematic diagram of the synchronous working principle of a real-time speech translation method disclosed in an embodiment of the present invention.
- Figure 7 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention.
- Figure 8 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention.
- Figure 9 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention.
- Figure 10 is a schematic structural diagram of a real-time speech translation system disclosed in an embodiment of the present invention.
- Figure 11 is a schematic structural diagram of a real-time speech translation device disclosed in an embodiment of the present invention.
- Example embodiments will now be described more fully with reference to the accompanying drawings.
- Example embodiments may, however, be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
- the same reference numerals in the drawings represent the same or similar structures, and thus their repeated description will be omitted.
- one embodiment of the present invention discloses a real-time speech translation method.
- This embodiment can be applied to situations where speech recognition and translation are performed while two parties communicate by voice in real time across languages.
- This method can be executed by an online real-time speech translation apparatus.
- the apparatus can be implemented in hardware and/or software, and can be configured in any terminal or network element.
- through online real-time voice translation, translation services can be provided in real time for users who speak different languages, allowing the two parties to communicate smoothly.
- the method specifically includes the following steps:
- S110 Perform speech recognition on the first voice information input by the first user to obtain first text data.
- the first user 21 and the second user 22 conduct cross-language language communication.
- the first user 21 can input the first voice information by speaking in the first language, and the online voice translation device can perform speech recognition on it to obtain the first text data. The voice uttered by the first user 21 can be picked up through the earphones worn by the first user 21.
- This embodiment takes the first user 21 as the speaker at the beginning of the cross-language communication and the second user 22 as the listener at the beginning of the communication. Therefore, at the beginning of the communication, this step collects the first voice information input by the first user 21 and then recognizes that user's voice information.
- the above-mentioned online real-time speech translation device may be the translation device 23 in FIG. 2 or a server. This application does not limit this.
- when the online real-time speech translation device is a translation device, it is not limited to a single translation device.
- the first user 21 and the second user 22 can each use a translation device.
- the first user 21 uses the first translation device 24 and the second user 22 uses the second translation device 25.
- Data transmission and communication are performed between the first translation device 24 and the second translation device 25.
- Text translation between different languages can be completed by the first translation device 24 or the second translation device 25.
- the first translation device 24 and the second translation device 25 are also used to complete speech synthesis for respective corresponding users.
- Figure 4 shows a schematic diagram of a long-distance online translation scenario.
- the first user 21 and the second user 22 respectively perform online translation based on laptop computers.
- step S120 Translate the above-mentioned first speech information to obtain a second synthesized speech corresponding to the second language category.
- online speech translation generally involves two stages: the first is speech recognition, i.e. recognizing the first-language speech input by the first user as text; the second is translating that text based on a translation corpus and then generating second-language speech or text to provide to the second user. Therefore, in some embodiments, step S120 includes translating the first text data to obtain second text data corresponding to the second language category (S121), and obtaining, based on the second text data, the second synthesized speech corresponding to the second language category (S122). A minimal sketch of this pipeline follows.
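As an illustration only, a minimal Python sketch of this two-stage pipeline is shown below; the `recognize_speech`, `translate_text`, and `synthesize_speech` helpers are hypothetical placeholders for whatever recognition, translation, and text-to-speech engines an implementation actually uses, not APIs defined by this disclosure.

```python
def recognize_speech(audio_frames, language):
    """Hypothetical ASR call: audio in `language` -> text."""
    raise NotImplementedError

def translate_text(text, source_language, target_language):
    """Hypothetical MT call: text in source language -> text in target language."""
    raise NotImplementedError

def synthesize_speech(text, language):
    """Hypothetical TTS call: text -> synthesized audio frames."""
    raise NotImplementedError

def translate_turn(first_voice_frames, first_language, second_language):
    # S110: recognize the first user's speech as first text data.
    first_text = recognize_speech(first_voice_frames, first_language)
    # S121: translate the first text data into second text data.
    second_text = translate_text(first_text, first_language, second_language)
    # S122: synthesize the second synthesized speech from the second text data.
    second_synth = synthesize_speech(second_text, second_language)
    # S130: synthesize the first synthesized speech from the first text data.
    first_synth = synthesize_speech(first_text, first_language)
    return first_text, first_synth, second_synth
```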
- S130 Based on the above-mentioned first text data, obtain the first synthesized speech corresponding to the first language category.
- the above-mentioned first synthesized speech allows the first user, who is the speaker at this point, to know the playback progress of the translated speech through that user's machine voice and to promptly detect errors in the speech recognition process, so that the speaker can accurately learn whether the translated content is correct and when it was conveyed to the other party; errors in cross-language communication can then be corrected in time, improving interactivity.
- step S130 includes:
- S131 Calculate the estimated time required for the second synthesized speech to finish playing, to obtain a first estimated duration.
- S132 Obtain second timestamp information about the second synthesized speech based on the first estimated duration; and
- S133 Perform speech synthesis based on the first text data to obtain a first synthesized speech, and synchronize the first timestamp corresponding to the first synthesized speech with the second timestamp information.
- This embodiment uses the above steps to estimate the time required for speech synthesis of the second language, and performs timestamp synchronization based on this.
- each of the two synthesized voices can be divided into multiple audio paragraphs, and then each audio paragraph of the first synthesized voice is timestamp-synchronized with the corresponding audio paragraph of the second synthesized voice, thereby realizing timestamp synchronization of the two synthesized voices.
- this application can establish a synchronization mechanism between the synthesized speech of the own language content and the synthesized speech of the other party's language content, so that the playback content of the two parties can maintain the same progress.
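A rough sketch, under assumed data structures, of how the estimated playback duration of the second synthesized speech could be used to align the timestamps of corresponding audio paragraphs; the frame length and the paragraph representation are illustrative assumptions, not values fixed by the disclosure.

```python
from dataclasses import dataclass

FRAME_MS = 20  # assumed duration of one audio frame in milliseconds

@dataclass
class Paragraph:
    frames: list        # audio frames of this paragraph
    start_ms: int = 0   # timestamp assigned during synchronization

def estimate_duration_ms(paragraphs):
    """S131: first estimated duration, i.e. total playback time of the second synthesized speech."""
    return sum(len(p.frames) * FRAME_MS for p in paragraphs)

def synchronize(first_paragraphs, second_paragraphs):
    """S132/S133: assign the second speech's paragraph timestamps, then align the
    corresponding first-speech paragraphs to the same start times."""
    t = 0
    for p in second_paragraphs:          # second timestamp information
        p.start_ms = t
        t += len(p.frames) * FRAME_MS
    for p1, p2 in zip(first_paragraphs, second_paragraphs):
        p1.start_ms = p2.start_ms        # first timestamps synchronized to the second
    return first_paragraphs, second_paragraphs
```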
- S140 Output the first synthesized voice to the first user, and simultaneously output the second synthesized voice to the second user.
- This embodiment plays two synthesized voices synchronously, which is beneficial to ensuring that the playback content maintains the same progress, allowing the speaker to accurately obtain the correctness of the translated content and the time it was conveyed to the other party, thereby allowing communication in different languages to correct errors in a timely manner and improve interactivity.
- the speaker can continue to speak, or can temporarily stop speaking and listen to the first synthesized speech. If the speaker continues to speak, the playback volume of the first synthesized speech is set to a fourth preset threshold no higher than the speaker's own speaking volume, so that it does not interfere with the speaker's normal expression while still letting the speaker hear, through the earphones, what has been said.
- since real-time online translation covers two scenarios, on-site face-to-face translation and off-site long-distance translation, in a specific implementation of the face-to-face scenario the two users can each wear one earphone of a pair (i.e. share a pair of headphones) or each wear their own pair. When each wears one earphone, the first synthesized voice can be output to the first user through the left-channel earphone and the second synthesized voice simultaneously output to the second user through the right-channel earphone. In the off-site long-distance scenario, the two users each wear a pair of headphones.
- Figure 6 is a schematic diagram of the principle of synchronously playing synthesized speech.
- user A is the first user, and A's voice agent refers to the robot voice used to play the first synthesized voice to user A.
- user B is the second user, and B's voice agent refers to the robot voice used to play the second synthesized voice to user B.
- Sx (including S1, S2 and S3) is the original voice of the first user.
- S1, S2 and S3 can respectively represent the first, second and third pieces of speech sent by user A.
- Sx' is the machine speech synthesized in the first language from the recognized content of Sx, i.e. the post-recognition synthesized speech.
- S1', S2' and S3' represent the first-language synthesized voices played to user A that correspond to the first, second and third speech segments S1, S2 and S3 respectively.
- TSx' is the voice played by B's voice agent, i.e. the machine voice synthesized after the recognized content of Sx is translated into second-language text, that is, the translated voice.
- TS1', TS2' and TS3' represent the second-language synthesized voices played to user B that correspond to the first, second and third speech segments S1, S2 and S3 respectively.
- step S150 Determine whether positive feedback information from the first user on the first synthesized voice is received. If yes, step S160 is executed. Otherwise, execute step S170.
- S160 After receiving the positive feedback information of the first synthesized voice from the first user, continue to collect the next first voice message input by the first user.
- S170 Collect the first voice information re-inputted by the first user, and output prompt information about the re-input by the first user to the second user.
- the next segment of the first voice message input by the first user continues to be collected.
- the first user confirms the translation error
- the first user is prompted to re-express the content and re-collect it, and the second user is informed that the second synthesized speech just played has an error and will be played again.
- alternatively, the translation can be determined to be accurate if no preset corrective utterance from the first user is received within a preset time period after the first synthesized voice finishes playing; otherwise, it is determined that there is an error in the translation. This allows errors in cross-language communication to be corrected in a timely manner and improves interactivity. A simple sketch of this feedback branch follows.
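The branching of steps S150 to S170 could look roughly like the sketch below; the timeout-based confirmation is the variant just described, and the collection and notification helpers are hypothetical callables supplied by the surrounding system.

```python
def handle_feedback(feedback, collect_next, recollect, notify_listener):
    """S150-S170: route on the first user's feedback about the first synthesized voice.

    feedback: "positive", "negative", or None (no preset utterance received within
    the preset time window, treated here as confirmation).
    """
    if feedback == "positive" or feedback is None:
        # S160: translation judged accurate, keep collecting the next segment.
        return collect_next()
    # S170: translation judged wrong, re-collect and warn the second user.
    notify_listener("The previous translated segment was wrong and will be replayed.")
    return recollect()
```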
- step S110 includes:
- Perform voiceprint detection on the first voice information, and perform speech recognition only on the audio paragraphs uttered by the first user to obtain the first text data. That is, speech recognition is not performed on audio paragraphs that are not uttered by the first user.
- Step S120 includes: performing voiceprint detection on the first voice information, and translating only the corresponding audio paragraphs uttered by the first user to obtain a second synthesized voice. That is, during the translation process, when an audio segment that does not belong to the first user's utterance is detected, the translation is paused. When an audio segment belonging to the first user's utterance is detected, translation is continued until the translation is completed.
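A simplified sketch of this voiceprint gate, assuming a hypothetical `matches_voiceprint` detector; only segments attributed to the first user are passed on to recognition and translation, and all other segments are skipped.

```python
def matches_voiceprint(segment, enrolled_voiceprint):
    """Hypothetical detector: True if the audio segment was uttered by the enrolled user."""
    raise NotImplementedError

def filter_first_user_segments(audio_segments, first_user_voiceprint):
    """Keep only the audio paragraphs uttered by the first user; recognition and
    translation are paused for every other segment."""
    return [s for s in audio_segments
            if matches_voiceprint(s, first_user_voiceprint)]
```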
- the method further includes:
- the environmental voice data of the first user and the second user are respectively collected in real time.
- first preset prompt information is sent to the first user.
- the first preset prompt message is used to remind the first user by voice that there are outsiders around and to pay attention to protecting privacy.
- the translation of the first voice information is suspended, and the second preset prompt information is sent to the second user.
- this is beneficial to protecting the privacy of the first user and the second user.
- before step S140, the method further includes the following steps: extracting respective interest tags and corresponding timestamp information based on the first text data and the second text data, and marking the corresponding interest tags on the first text data and the second text data according to the timestamp information.
- these steps may be located between step S130 and step S140.
- Step S140 includes: when playing the first synthesized voice and the second synthesized voice, providing a voice prompt for the voice associated with the interest tag.
- the above-mentioned interest tags and timestamps have a one-to-one correspondence. These interest tags help users quickly find key content when listening back to the voice.
- the above interest tags can be displayed in the form of text or voice.
- This application activates the interest tag annotation function to record the time and corresponding content for content that is repeatedly mentioned, content flagged by speech tone or semantic analysis, sensitive or private content, or cases where the listener is temporarily unable to concentrate but wants to review the content later.
- when the user later reviews the interest tags, the points of interest of the translated call can be displayed, and the user can jump to the content before and after a point of interest, or re-listen to the conversation history around it, improving the user experience. Interest tags can also be used to draw both parties' attention to key content so that it is not misunderstood or ignored, which helps improve the cross-language communication experience.
- before step S140, the method further includes the following steps:
- the voiceprint information of the first user and the second user is collected respectively.
- the identity information of the first user and the second user is respectively identified.
- a relationship type between the first user and the second user is determined.
- the above-extracted interest tags are filtered, and the retained ones are used as target interest tags.
- these steps may be located between step S130 and step S140.
- Step S140 includes: when playing the first synthesized voice and the second synthesized voice, providing a voice prompt for the voice associated with the target interest tag.
- the determination of the relationship type between the first user and the second user can be based on a preset identity relationship database.
- step S110 includes:
- S111 Obtain preceding-context data about the first voice information.
- S112 Perform speech recognition on the first voice information input by the first user to obtain initial text data.
- S113 Correct the initial text data using the preceding-context data to obtain the first text data.
- obtaining other data uttered by the user before the first voice information was uttered, and correcting the recognized text based on that data, helps improve the accuracy of translation and thereby the smoothness of cross-language communication. For example, if the preceding data rendered a given piece of speech as a first word while this step recognizes the same speech as a second word, the second word is corrected to the first word. An illustrative correction pass follows.
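One illustrative way to apply such preceding-context correction is a simple term-consistency pass like the sketch below; representing the preceding data as a mapping from newly recognized variants back to previously used terms is an assumption made for the example, not something mandated by the method.

```python
def correct_with_context(initial_text, preceding_terms):
    """Replace newly recognized variants with the terms already established in the
    preceding data, e.g. {"second word": "first word"}.

    preceding_terms: assumed mapping built from the user's earlier utterances.
    """
    corrected = initial_text
    for variant, preferred in preceding_terms.items():
        corrected = corrected.replace(variant, preferred)
    return corrected

# Example: an earlier utterance established "gateway", so a later mis-recognition
# "gate way" is corrected back to it:
# correct_with_context("open the gate way", {"gate way": "gateway"}) -> "open the gateway"
```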
- before step S140, the following steps are also included:
- in the first state, the two earphones output different voices, and the single earphone worn by a user serves as that user's voice output channel; in the second state, both earphones of a pair output the same voice, and the pair of earphones worn by a user serves as that user's voice output channel.
- these steps may be located between step S130 and step S140.
- Step S140 is replaced with step S141: outputting the first synthesized voice to the first user based on the earphone worn by the first user, and simultaneously outputting the second synthesized voice to the second user based on the earphone worn by the second user.
- the pair of earphones can be controlled to work in a face-to-face translation state.
- the pair of headphones is controlled to work in a long-distance translation state.
- the user's synthesized speech needs to be encoded into the audio channel corresponding to the headset worn by the user.
- the first synthesized speech is encoded as left-channel audio.
- the second synthesized speech is encoded as right channel audio.
- the earphone worn by the first user is used as the first voice output channel
- the earphone worn by the second user is used as the second voice output channel.
- the first synthesized voice is output to the first user based on the first voice output channel
- the second synthesized voice is simultaneously output to the second user based on the second voice output channel.
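In the shared-headset state, the routing described above can be sketched as stereo channel assignment; the frame pairing shown here is only an illustrative way to express "left channel for the first user, right channel for the second user" and does not reflect any particular audio codec.

```python
def route_to_channels(first_synth_frames, second_synth_frames):
    """First state: encode the first synthesized speech on the left channel
    (the first user's earphone) and the second synthesized speech on the right
    channel (the second user's earphone), as (left, right) stereo frame pairs."""
    length = max(len(first_synth_frames), len(second_synth_frames))
    silence = b"\x00\x00"  # assumed 16-bit silent sample used for padding
    stereo = []
    for i in range(length):
        left = first_synth_frames[i] if i < len(first_synth_frames) else silence
        right = second_synth_frames[i] if i < len(second_synth_frames) else silence
        stereo.append((left, right))
    return stereo
```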
- Step S180 includes:
- the above pair of headphones are based on the UWB communication module and detect the distance between the two headphones in real time.
- when the distance between the two earphones is greater than the first preset threshold, the earphones are controlled to operate in the first state; when the distance between the two earphones is less than the first preset threshold, the earphones are controlled to operate in the second state.
- when the distance between the two earphones is greater than the first preset threshold, it means that the two earphones of a pair are worn by two different users.
- when the distance between the two earphones is less than the first preset threshold, it indicates that the pair of earphones is worn by the same user.
- the above-mentioned first preset threshold may be 25cm.
- the distance between the earphones can also be detected through the built-in ultrasonic signal module in the earphones, which is not limited in this application.
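The distance-based state switch could be expressed as the small helper below; the 25 cm value comes from the example threshold above, while the distance reading itself would come from the UWB (or ultrasonic) module and is left as an assumed input.

```python
FIRST_PRESET_THRESHOLD_CM = 25  # example threshold from the embodiment above

def select_earphone_state(distance_cm, threshold_cm=FIRST_PRESET_THRESHOLD_CM):
    """Return "first" when the two earphones are far apart (worn by two users)
    and "second" when they are close together (worn by the same user)."""
    if distance_cm > threshold_cm:
        return "first"   # shared pair: one earphone per user
    return "second"      # each user wears a full pair
```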
- the method includes the steps:
- S190 Combine the unplayed audio frames of the first synthesized speech with the audio frames that have been collected as first voice information but not yet converted into the first synthesized speech, to form an audio adjustment library.
- S200 Adjust the playback speed of the first synthesized speech based on the number of audio frames in the audio adjustment library.
- the playback speech speed corresponding to the first synthesized speech increases as the number of audio frames in the audio adjustment library increases, and decreases as the number of audio frames in the audio adjustment library decreases.
- the playback speed of the second synthesized voice can also be adjusted synchronously.
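A minimal sketch of this backlog-driven speed control: the playback rate grows with the number of frames waiting in the audio adjustment library and shrinks as the library drains. The specific mapping (a clamped step function) and all constants are assumptions for illustration only.

```python
def playback_rate(frames_in_library, base_rate=1.0,
                  frames_per_step=100, rate_step=0.05,
                  min_rate=0.8, max_rate=2.0):
    """Map the audio-adjustment-library backlog to a playback speed multiplier.

    The rate increases as the number of queued audio frames increases and
    decreases as it decreases; the constants here are illustrative defaults.
    """
    rate = base_rate + (frames_in_library // frames_per_step) * rate_step
    return max(min_rate, min(max_rate, rate))

# e.g. playback_rate(0) -> 1.0, playback_rate(400) -> 1.2
```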
- the method further includes:
- the environmental sound data of the environment where the first user is located is collected, and multiple pieces of voice data are extracted from the environmental sound data.
- the language category corresponding to the plurality of voice data is determined as the second language category.
- Prompt information is played to the external environment based on the other pair of earphones; the above prompt information is used to prompt the second user to wear the other pair of earphones.
- the above step S140 includes:
- the second synthesized voice is output to the second user through the second earphone.
- an embodiment of the present invention also discloses a real-time speech translation system 10.
- the system includes:
- the first text data generating module 101 performs speech recognition on the first voice information input by the first user to obtain the first text data.
- the second synthesized speech generation module 102 translates the first speech information to obtain a second synthesized speech corresponding to the second language category.
- the first synthesized speech generation module 103 obtains a first synthesized speech corresponding to the first language category based on the first text data.
- the synthesized speech playing module 104 outputs the first synthesized speech to the first user, and simultaneously outputs the second synthesized speech to the second user.
- the real-time speech translation system of the present invention also includes other existing functional modules that support the operation of the real-time speech translation system.
- the real-time speech translation system shown in Figure 10 is only an example and should not impose any restrictions on the functions and scope of use of the embodiments of the present invention.
- the real-time speech translation system in this embodiment is used to implement the above-mentioned real-time speech translation method. Therefore, for the specific implementation steps of the real-time speech translation system, reference can be made to the above description of the real-time speech translation method, which will not be described again here.
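For orientation, the module decomposition of system 10 could be mirrored in code roughly as below; the module objects simply wrap the hypothetical recognition, translation, synthesis, and playback helpers used in the earlier sketches, and the names are not defined by this disclosure.

```python
class RealTimeSpeechTranslationSystem:
    """Sketch of system 10 with the four modules 101-104 described above."""

    def __init__(self, recognizer, translator, synthesizer, player):
        self.recognizer = recognizer    # first text data generation module 101
        self.translator = translator    # second synthesized speech generation module 102
        self.synthesizer = synthesizer  # first synthesized speech generation module 103
        self.player = player            # synthesized speech playback module 104

    def run_once(self, first_voice, first_lang, second_lang):
        first_text = self.recognizer(first_voice, first_lang)
        second_synth = self.translator(first_voice, first_text, second_lang)
        first_synth = self.synthesizer(first_text, first_lang)
        # Module 104: output both synthesized voices synchronously.
        self.player(first_synth, second_synth)
```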
- An embodiment of the present invention also discloses a real-time speech translation device, which includes a processor and a memory, wherein the memory stores an executable program of the processor; the processor is configured to execute the steps of the above real-time speech translation method by executing the executable program.
- Figure 11 is a schematic structural diagram of the real-time speech translation device disclosed in the present invention.
- An electronic device 600 according to this embodiment of the present invention is described below with reference to FIG. 11 .
- the electronic device 600 shown in FIG. 11 is only an example and should not impose any limitations on the functions and usage scope of the embodiments of the present invention.
- electronic device 600 is embodied in the form of a general computing device.
- the components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
- the storage unit stores program code, and the program code can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the above-mentioned real-time speech translation method part of this specification.
- processing unit 610 may perform steps as shown in FIG. 1 .
- the storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 6201 and/or a cache storage unit 6202, and may further include a read-only storage unit (ROM) 6203.
- Storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205, including but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
- Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
- Electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with electronic device 600, and/or with any device (e.g., router, modem, etc.) that enables electronic device 600 to communicate with one or more other computing devices. This communication may occur through an input/output (I/O) interface 650.
- the electronic device 600 may also communicate with one or more networks (eg, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 660.
- Network adapter 660 may communicate with other modules of electronic device 600 via bus 630.
- the invention also discloses a computer-readable storage medium for storing a program; when the program is executed, the steps of the above real-time speech translation method are implemented.
- various aspects of the present invention can also be implemented in the form of a program product, which includes program code.
- when the program product runs on a terminal device, the program code causes the terminal device to execute the steps described in the above real-time speech translation method according to various exemplary embodiments of the present invention.
- the first synthesized speech is synthesized from the first text data obtained by speech recognition of the speaker's content, and while the second synthesized speech is played to the listening party, the first synthesized voice is simultaneously played to the speaker, so that errors in the translated content caused by mistakes in the speech recognition process can be noticed by the speaker in time, improving the accuracy of cross-language translation and the communication efficiency of both parties in cross-language communication.
- An embodiment of the invention discloses a computer-readable storage medium.
- the storage medium may be a program product implementing the above method, which may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer.
- the program product of the present invention is not limited thereto.
- a readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus or device.
- the Program Product may take the form of one or more readable media in any combination.
- the readable medium may be a readable signal medium or a readable storage medium.
- the readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
- a readable signal medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.
- program code contained on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
- Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
- the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
- the real-time speech translation method, system, device, and storage medium provided by the embodiments of the present invention synthesize a first synthesized speech from the first text data obtained by speech recognition of the speaker's content, and play the first synthesized speech synchronously to the speaker while the second synthesized speech is played to the listening party.
- in this way, errors in the translated content caused by mistakes in the speech recognition process can be noticed by the speaker in time, improving the accuracy of cross-language translation and the communication efficiency of both parties, and making cross-language communication smoother.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
The present invention relates to the technical field of online translation, and specifically to a real-time speech translation method, system, device, and storage medium.
With the increase in international exchanges, communication across different languages is becoming more and more frequent. During such communication, the parties have to rely on their own language skills or on human interpreters, which is inconvenient for both sides. Therefore, real-time online translation using translation devices or wearable devices together with earphones and a machine voice will become the most convenient approach.
During a call, the machine voice translator is responsible for translating the speaker's content into speech in the listener's language for playback. However, the translated content may deviate somewhat from the speaker's original intent, and there is a time lag between the spoken content and the real-time translated voice. Because the speaker does not understand the translated voice, the speaker cannot know in time whether, and at what point, what was just said has been correctly heard by the listening party. As a result, the speaker cannot confirm the timeliness and correctness of what was said, and has to pause deliberately, wait for the speech translation to finish, and then guess whether the listener has heard and understood correctly. The efficiency and accuracy of communication between the two parties therefore cannot be guaranteed.
Summary of the invention
In view of the problems in the prior art, the purpose of the present invention is to provide a real-time speech translation method, system, device, and storage medium, which improve the accuracy of cross-language translation and the communication efficiency of both parties in cross-language communication.
To achieve the above objectives, the present invention provides a real-time speech translation method, which includes the following steps:
performing speech recognition on first voice information input by a first user to obtain first text data;
translating the first voice information to obtain a second synthesized voice corresponding to a second language category;
based on the first text data, obtaining a first synthesized voice corresponding to a first language category;
outputting the first synthesized voice to the first user, and synchronously outputting the second synthesized voice to a second user.
Optionally, translating the first voice information to obtain the second synthesized voice corresponding to the second language category includes:
translating the first text data obtained by speech recognition of the first voice information to obtain second text data corresponding to the second language category;
based on the second text data, obtaining the second synthesized voice corresponding to the second language category.
Optionally, obtaining the first synthesized voice corresponding to the first language category based on the first text data includes:
calculating the estimated time required for the second synthesized voice to finish playing, to obtain a first estimated duration;
based on the first estimated duration, obtaining second timestamp information about the second synthesized voice;
performing speech synthesis based on the first text data to obtain the first synthesized voice, and synchronizing a first timestamp corresponding to the first synthesized voice with the second timestamp information.
Optionally, the method further includes:
after receiving positive feedback information on the first synthesized voice from the first user, continuing to collect the next segment of first voice information input by the first user;
after receiving negative feedback information on the first synthesized voice from the first user, collecting the first voice information re-input by the first user, and outputting to the second user prompt information about the first user's re-input.
Optionally, the method further includes:
extracting respective interest tags and corresponding timestamp information based on the first text data and the second text data respectively;
according to the timestamp information, marking the corresponding interest tags on the first text data and the second text data respectively.
Optionally, before outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user, the method includes:
collecting voiceprint information of the first user and the second user respectively;
based on the voiceprint information, identifying the identity information of the first user and the second user respectively;
based on the identity information of the first user and the second user, determining the relationship type between the first user and the second user;
based on the relationship type between the first user and the second user and a preset identity-relationship interest library, filtering the interest tags and using the retained interest tags as target interest tags;
outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user includes:
providing voice prompts for the voices in the first synthesized voice and the second synthesized voice that are associated with the target interest tags.
Optionally, performing speech recognition on the first voice information input by the first user to obtain the first text data includes:
obtaining preceding-context data about the first voice information;
performing speech recognition on the first voice information input by the first user to obtain initial text data;
correcting the initial text data using the preceding-context data to obtain the first text data.
Optionally, before outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user, the method includes:
when it is detected that the two earphones of a pair of earphones used for outputting voice are worn by different users, controlling the earphones to work in a first state; when it is detected that the first user and the second user each wear a pair of earphones, controlling the earphones to work in a second state;
wherein, in the first state, the two earphones output different voices, and the single earphone worn by a user serves as that user's voice output channel; in the second state, both earphones of a pair output the same voice, and the pair of earphones worn by a user serves as that user's voice output channel.
Optionally, the pair of earphones is provided with a UWB communication module; before outputting the first synthesized voice to the first user and synchronously outputting the second synthesized voice to the second user, the method includes:
the pair of earphones detecting the distance between the two earphones in real time based on the UWB communication module;
when the distance between the two earphones is greater than a first preset threshold, controlling the earphones to work in the first state; when the distance between the two earphones is less than the first preset threshold, controlling the earphones to work in the second state.
Optionally, the method further includes:
combining the unplayed audio frames of the first synthesized speech with the audio frames that have been collected as first voice information but not yet converted into the first synthesized speech, to form an audio adjustment library;
based on the number of audio frames in the audio adjustment library, adjusting the playback speed of the first synthesized speech;
wherein the playback speed corresponding to the first synthesized speech increases as the number of audio frames in the audio adjustment library increases, and decreases as the number of audio frames in the audio adjustment library decreases.
The present invention also provides a real-time speech translation system for implementing the above real-time speech translation method. The system includes:
a first text data generation module, which performs speech recognition on the first voice information input by the first user to obtain the first text data;
a second synthesized speech generation module, which translates the first voice information to obtain the second synthesized speech corresponding to the second language category;
a first synthesized speech generation module, which obtains the first synthesized speech corresponding to the first language category based on the first text data;
a synthesized speech playback module, which outputs the first synthesized speech to the first user and synchronously outputs the second synthesized speech to the second user.
The present invention also provides a real-time speech translation device, including:
a processor;
a memory in which an executable program of the processor is stored;
wherein the processor is configured to execute the steps of any one of the above real-time speech translation methods by executing the executable program.
The present invention also provides a computer-readable storage medium for storing a program which, when executed by a processor, implements the steps of any one of the above real-time speech translation methods.
Compared with the prior art, the present invention has the following advantages and outstanding effects:
The real-time speech translation method, system, device, and storage medium provided by the present invention synthesize a first synthesized speech from the first text data obtained by speech recognition of the speaker's content, and play the first synthesized speech synchronously to the speaker while the second synthesized speech is played to the listening party, so that errors in the translated content caused by mistakes in the speech recognition process can be noticed by the speaker in time. This improves the accuracy of cross-language translation and the communication efficiency of both parties, making cross-language communication smoother.
Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of the non-limiting embodiments with reference to the following drawings.
Figure 1 is a schematic diagram of a real-time speech translation method disclosed in an embodiment of the present invention;
Figure 2 is a schematic diagram of an application scenario involved in the real-time speech translation method disclosed in an embodiment of the present invention;
Figure 3 is a schematic diagram of another application scenario involved in the real-time speech translation method disclosed in an embodiment of the present invention;
Figure 4 is a schematic diagram of another application scenario involved in the real-time speech translation method disclosed in an embodiment of the present invention;
Figure 5 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention;
Figure 6 is a schematic diagram of the synchronous working principle of a real-time speech translation method disclosed in an embodiment of the present invention;
Figure 7 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention;
Figure 8 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention;
Figure 9 is a schematic diagram of a real-time speech translation method disclosed in another embodiment of the present invention;
Figure 10 is a schematic structural diagram of a real-time speech translation system disclosed in an embodiment of the present invention;
Figure 11 is a schematic structural diagram of a real-time speech translation device disclosed in an embodiment of the present invention.
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的实施方式。相反,提供这些实施方式使得本发明将全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。在图中相同的附图标记表示相同或类似的结构,因而将省略对它们的重复描述。Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings represent the same or similar structures, and thus their repeated description will be omitted.
如图1所示,本发明一实施例公开了一种语音实时翻译方法,本实施例可适用于双方跨语言实时进行语音交流时进行语音识别和翻译的情况,该方法可以由在线实时语音翻译装置来执行,该装置可以由硬件和/或软件来实现,该装置可以配置在任意终端或网元中。通过在线实时语音翻译,可以为使用不同语言的用户双方实时提供翻译服务,使双方进行顺利沟通。该方法具体包括如下步骤:As shown in Figure 1, an embodiment of the present invention discloses a real-time speech translation method. This embodiment is applicable to speech recognition and translation when two parties hold a real-time voice conversation across languages. The method may be executed by an online real-time speech translation apparatus, which may be implemented in hardware and/or software and configured in any terminal or network element. Through online real-time speech translation, a translation service can be provided in real time to two users who speak different languages, allowing both parties to communicate smoothly. The method specifically includes the following steps:
S110,对第一用户输入的第一语音信息进行语音识别,获得第一文本数据。S110: Perform speech recognition on the first voice information input by the first user to obtain first text data.
参考图2,在上述操作中,第一用户21和第二用户22进行跨语种的语言交流,第一用户21可以采用基于第一语种的第一语音输入第一语音信息,在线语音翻译装置可以对其进行语音识别,得到第一文本数据。对第一用户21发出的语音进行收音可以通过第一用户21佩戴的耳机实现。Referring to Figure 2, in the above operation the first user 21 and the second user 22 communicate across languages. The first user 21 may input the first voice information as first speech in a first language, and the online speech translation apparatus performs speech recognition on it to obtain the first text data. The speech uttered by the first user 21 may be picked up through an earphone worn by the first user 21.
本实施例将第一用户21作为跨语言沟通刚开始时的说话方,将第二用户22作为沟通刚开始时的聆听方,因此在沟通刚开始时,该步骤采集的是第一用户21输入的第一语音信息,然后对该用户的语音信息进行识别。This embodiment treats the first user 21 as the speaking party at the very start of the cross-language communication and the second user 22 as the listening party at that moment. Therefore, at the start of the communication, this step collects the first voice information input by the first user 21 and then recognizes that user's speech.
由于沟通是双向的,所以在后续持续的沟通时,该步骤也将采集第二用户22输入的第二语音信息,然后对第二语音信息进行识别。其中,上述在线实时语音翻译装置可以为图2中的翻译设备23,也可以为服务器。本申请对此不作限制。Since communication is two-way, during subsequent continuous communication this step will also collect the second voice information input by the second user 22 and then recognize the second voice information. The above-mentioned online real-time speech translation apparatus may be the translation device 23 in Figure 2, or a server; this application does not limit this.
另一方面,当在线实时语音翻译装置为翻译设备时,可以不限于一台翻译设备。比如参考图3,第一用户21和第二用户22可以各自使用一台翻译设备,比如第一用户21使用第一翻译设备24,第二用户22使用第二翻译设备25。第一翻译设备24和第二翻译设备25之间进行数据传输和通信。不同语种之间的文本翻译可以由第一翻译设备24或者第二翻译设备25完成,第一翻译设备24和第二翻译设备25还用于完成各自对应用户的语音合成。On the other hand, when the online real-time speech translation apparatus is a translation device, it is not limited to a single translation device. For example, referring to Figure 3, the first user 21 and the second user 22 may each use one translation device: the first user 21 uses the first translation device 24 and the second user 22 uses the second translation device 25. The first translation device 24 and the second translation device 25 transmit data to and communicate with each other. Text translation between the different languages may be completed by either the first translation device 24 or the second translation device 25, and the two devices also perform speech synthesis for their respective users.
需要说明的是,上述图2和图3示出的是面对面实时翻译场景。本实施例还可以应用于远距离在线翻译场景。图4示出的是一种远距离在线翻译场景示意图。该图示中,第一用户21和第二用户22分别基于笔记本电脑进行在线翻译。It should be noted that Figures 2 and 3 above show face-to-face real-time translation scenarios. This embodiment can also be applied to long-distance online translation scenarios. Figure 4 is a schematic diagram of such a long-distance online translation scenario, in which the first user 21 and the second user 22 each perform online translation on a laptop computer.
S120,对上述第一语音信息进行翻译,获得对应于第二语种类别的第二合成语音。具体而言,在线语音翻译一般涉及两个环节:第一个环节是进行语音识别,即将第一用户输入的第一语种语音识别为文字信息;第二个环节是将文字信息基于翻译语料库进行翻译,再生成第二语种的语音信息或文字信息,提供给第二用户。因此,在一些实施例中,步骤S120包括:S120: Translate the above-mentioned first voice information to obtain a second synthesized speech corresponding to the second language category. Specifically, online speech translation generally involves two links: the first link is speech recognition, i.e. recognizing the first-language speech input by the first user as text information; the second link is translating the text information based on a translation corpus and then generating voice information or text information in the second language, which is provided to the second user. Therefore, in some embodiments, step S120 includes:
S121,对上述第一文本数据进行翻译,获得对应于第二语种类别的第二文本数据。S121. Translate the above-mentioned first text data to obtain second text data corresponding to the second language category.
S122,基于上述第二文本数据,获得对应于第二语种类别的第二合成语音。上述步骤即为将第一语种的第一文本数据翻译为第二语种的第二文本数据,然后基于第二文本数据进行语音合成。S122. Based on the above-mentioned second text data, obtain a second synthesized speech corresponding to the second language category. The above steps are to translate the first text data in the first language into the second text data in the second language, and then perform speech synthesis based on the second text data.
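To make the flow of steps S110 to S122 concrete, the sketch below shows one possible way to wire the recognize, translate and synthesize stages together. The `asr`, `translate` and `tts` callables are assumptions standing in for whichever speech-recognition, machine-translation and text-to-speech engines an implementation actually uses; none of these names come from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranslationResult:
    first_text: str        # first text data (speaker's language)
    second_text: str       # second text data (listener's language)
    first_speech: bytes    # first synthesized speech, played back to the speaker
    second_speech: bytes   # second synthesized speech, played to the listener

def translate_utterance(
    audio: bytes,
    asr: Callable[[bytes], str],              # assumed ASR engine
    translate: Callable[[str, str], str],     # assumed MT engine
    tts: Callable[[str, str], bytes],         # assumed TTS engine
    first_lang: str = "zh",
    second_lang: str = "en",
) -> TranslationResult:
    # S110: recognize the first voice information into first text data.
    first_text = asr(audio)
    # S121: translate the first text data into the second language.
    second_text = translate(first_text, second_lang)
    # S122: synthesize the second synthesized speech for the listener.
    second_speech = tts(second_text, second_lang)
    # S130: synthesize the first synthesized speech for the speaker,
    # so that recognition errors become audible to the speaker.
    first_speech = tts(first_text, first_lang)
    return TranslationResult(first_text, second_text, first_speech, second_speech)
```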
S130,基于上述第一文本数据,获得对应于第一语种类别的第一合成语音。其中,上述第一合成语音是用于让此时作为说话方的第一用户,能够通过本方的机器人获知翻译语音的播放进度,以及及时发现语音识别过程中出现的错误,从而可以让说话方准确的获得翻译内容的正确性和传达给对方的时间,从而让语言不同的交流方能实现及时纠正错误和提升互动性。S130: Based on the above first text data, obtain a first synthesized speech corresponding to the first language category. The first synthesized speech allows the first user, who is the speaking party at this moment, to learn through the robot on his or her own side the playback progress of the translated speech, and to promptly notice errors that arise during speech recognition. The speaker can thus accurately judge whether the translated content is correct and when it is conveyed to the other party, so that the two parties speaking different languages can correct errors in time and interact more effectively.
为了让说话方的机器人与聆听方的机器人播放内容保持相同进度,就需要将第一合成语音和第二合成语音进行时间戳同步。于是,在一些实施例中,参考图5,步骤S130包括:In order to keep the content played by the speaker-side robot and the listener-side robot at the same progress, the first synthesized speech and the second synthesized speech need to be timestamp-synchronized. Thus, in some embodiments, referring to Figure 5, step S130 includes:
S131,计算上述第二合成语音完成播放的预计耗时,得到第一预估时长。S131. Calculate the estimated time required to complete the playback of the second synthesized voice, and obtain the first estimated duration.
S132,基于上述第一预估时长,获得关于上述第二合成语音的第二时间戳信息。以及S132: Based on the above first estimated duration, obtain second timestamp information about the above second synthesized speech; and
S133,基于上述第一文本数据进行语音合成,获得第一合成语音;并将上述第一合成语音对应的第一时间戳与上述第二时间戳信息进行同步。S133. Perform speech synthesis based on the first text data to obtain a first synthesized speech; and synchronize the first timestamp corresponding to the first synthesized speech with the second timestamp information.
该实施例利用上述步骤对第二语种语音合成耗时进行预估,从而基于此进行时间戳同步。具体实施时,示例性地,可以将两个合成语音各自划分为多个音频段落,然后将第一合成语音的音频段落与第二合成语音对应的音频段落进行时间戳同步,由此实现两个合成语音的时间戳同步。This embodiment uses the above steps to estimate the time consumed by the second-language speech synthesis and performs timestamp synchronization on that basis. In a specific implementation, for example, the two synthesized speeches can each be divided into multiple audio segments, and each audio segment of the first synthesized speech is then timestamp-synchronized with the corresponding audio segment of the second synthesized speech, thereby synchronizing the timestamps of the two synthesized speeches.
这样本申请即实现在本方语种内容的合成语音和对方语种内容的合成语音之间建立同步机制,实现两方的播放内容保持相同进度。In this way, this application can establish a synchronization mechanism between the synthesized speech of the own language content and the synthesized speech of the other party's language content, so that the playback content of the two parties can maintain the same progress.
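A minimal way to picture the synchronization of S131 to S133 is to estimate how long each listener-side segment will take to play and then pin the matching speaker-side segment to the same timestamp. The per-segment durations, the segment naming and the assumption that both utterances are split into the same number of segments are illustrative choices, not values taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    audio_id: str
    start: float     # timestamp in seconds at which the segment should start playing
    duration: float

def synchronized_segments(
    first_durations: List[float],
    second_durations: List[float],
) -> Tuple[List[Segment], List[Segment]]:
    """Align first-language segments to second-language segments.

    S131: the sum of second_durations is the first estimated duration.
    S132/S133: each speaker-side segment reuses the timestamp of the
    corresponding listener-side segment, so both robots stay in step.
    """
    assert len(first_durations) == len(second_durations)
    first_segments, second_segments, t = [], [], 0.0
    for i, (d1, d2) in enumerate(zip(first_durations, second_durations)):
        second_segments.append(Segment(f"TS{i + 1}'", t, d2))  # listener side
        first_segments.append(Segment(f"S{i + 1}'", t, d1))    # speaker side, same start time
        t += d2
    return first_segments, second_segments

# Example: three sentences whose translations take slightly longer to speak.
firsts, seconds = synchronized_segments([1.2, 0.8, 1.5], [1.4, 0.9, 1.6])
```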
S140,将上述第一合成语音向第一用户输出,以及同步将上述第二合成语音向第二用户输出。S140: Output the first synthesized voice to the first user, and simultaneously output the second synthesized voice to the second user.
本实施例将两个合成语音同步播放,利于保证播放内容保持相同进度,可以让说话方准确的获得翻译内容的正确性和传达给对方的时间,从而让语言不同的交流方能实现及时纠正错误和提升互动性。This embodiment plays the two synthesized speeches synchronously, which helps keep the played content at the same progress and lets the speaker accurately judge whether the translated content is correct and when it is conveyed to the other party, so that the two parties speaking different languages can correct errors in time and interact more effectively.
并且,同步播放时,说话方可以持续继续说话,也可以暂时不说话,聆听该第一合成语音。若说话方继续说话,那么将该第一合成语音的播放音量设置为不高于说话方发音音量的第四预设阈值,这样可以不干扰说话方的正常表达,又能让说话方通过耳机听见所说内容。Moreover, during synchronous playback the speaker may keep talking, or may pause and listen to the first synthesized speech. If the speaker keeps talking, the playback volume of the first synthesized speech is set to a fourth preset threshold that is no higher than the speaker's own speaking volume, so that it does not interfere with the speaker's normal expression while still allowing the speaker to hear the spoken content through the earphone.
由于实时在线翻译分为现场面对面和非现场远距离翻译两种场景,因此具体实施时,在现场面对面翻译场景下,两个用户可以分别佩戴一副耳机中的一个,即共享一副耳机。也可以分别佩戴一副耳机。当各戴一个耳机时,可以将第一合成语音通过左声道耳机向第一用户输出,同步将上述第二合成语音通过右声道耳机向第二用户输出。在非现场远距离翻译场景下,两个用户分别佩戴一副耳机。Since real-time online translation covers two scenarios, on-site face-to-face translation and off-site long-distance translation, in a specific implementation of the face-to-face scenario the two users may each wear one earbud of a single pair, i.e. share one pair of earphones, or may each wear a separate pair. When each wears one earbud, the first synthesized speech can be output to the first user through the left-channel earbud while the second synthesized speech is synchronously output to the second user through the right-channel earbud. In the off-site long-distance translation scenario, the two users each wear their own pair of earphones.
图6为同步播放合成语音的原理示意图。参考图6,A用户为第一用户,A的语音人是指用于向A用户播放第一合成语音的机器人。B用户为第二用户,B的语音人是指用于向B用户播放第二合成语音的机器人。Figure 6 is a schematic diagram of the principle of synchronously playing the synthesized speech. Referring to Figure 6, user A is the first user, and A's voice agent is the robot used to play the first synthesized speech to user A; user B is the second user, and B's voice agent is the robot used to play the second synthesized speech to user B.
Sx(包括S1、S2与S3)为第一用户的原声。S1、S2与S3可以分别表示A用户发出的第一段、第二段与第三段语音。Sx’为A的语音人根据Sx识别为第一语种的内容后合成的机器语音,即识别后的合成的语音。S1’、S2’与S3’分别表示向A用户播放的,且分别与第一段语音S1对应的第一语种合成语音、与第二段语音S2对应的第一语种合成语音以及与第三段语音S3对应的第一语种合成语音。Sx (including S1, S2 and S3) is the first user's original voice; S1, S2 and S3 respectively denote the first, second and third speech segments uttered by user A. Sx' is the machine speech synthesized by A's voice agent from the content of Sx recognized in the first language, i.e. the post-recognition synthesized speech. S1', S2' and S3' respectively denote the first-language synthesized speech played to user A that corresponds to the first speech segment S1, the second speech segment S2 and the third speech segment S3.
TSx’为B的语音人根据Sx识别内容翻译成第二语种的文本后,合成的机器语音,即翻译后的语音。那么,TS1’、TS2’以及TS3’分别表示向B用户播放的,且分别与第一段语音S1对应的第二语种合成语音、与第二段语音S2对应的第二语种合成语音以及与第三段语音S3对应的第二语种合成语音。TSx' is the machine speech synthesized by B's voice agent after the content recognized from Sx has been translated into text in the second language, i.e. the translated speech. TS1', TS2' and TS3' respectively denote the second-language synthesized speech played to user B that corresponds to the first speech segment S1, the second speech segment S2 and the third speech segment S3.
在本申请的另一实施例中,公开了另一种语音实时翻译方法。参考图7,该方法在上述图1对应实施例的基础上,还包括步骤:In another embodiment of the present application, another real-time speech translation method is disclosed. Referring to Figure 7, based on the above-mentioned corresponding embodiment of Figure 1, the method also includes steps:
S150,判断是否接收到第一用户对上述第一合成语音的正反馈信息。若是,则执行步骤S160。否则执行步骤S170。S150: Determine whether positive feedback information from the first user on the first synthesized voice is received. If yes, step S160 is executed. Otherwise, execute step S170.
S160,当接收到第一用户对上述第一合成语音的正反馈信息后,继续采集第一用户输入的下一段第一语音消息。S160: After receiving the positive feedback information of the first synthesized voice from the first user, continue to collect the next first voice message input by the first user.
S170,采集第一用户重新输入的上述第一语音信息,以及向第二用户输出关于上述第一用户重新输入的提示信息。S170: Collect the first voice information re-inputted by the first user, and output prompt information about the re-input by the first user to the second user.
具体而言,也即当第一用户确认翻译准确之后,继续采集第一用户输入的下一段第一语音消息。当第一用户确认翻译错误之后,提示第一用户重新表达内容并重新采集,并告知第二用户刚才播放的第二合成语音存在错误,将会重新播放。Specifically, that is, after the first user confirms that the translation is accurate, the next segment of the first voice message input by the first user continues to be collected. When the first user confirms the translation error, the first user is prompted to re-express the content and re-collect it, and the second user is informed that the second synthesized speech just played has an error and will be played again.
关于判断翻译是否准确的实现方式,在其他实施例中,还可以通过当第一合成语音完成播放后的一预设时间段内,未接收到第一用户的预设发声语音时,就确定翻译准确。否则,就确定翻译存在错误。从而让语言不同的交流方能实现及时纠正错误和提升互动性。As for how to determine whether the translation is accurate, in other embodiments the translation may also be determined to be accurate when no preset utterance from the first user is received within a preset period after the first synthesized speech finishes playing; otherwise, the translation is determined to contain an error. In this way, the two parties speaking different languages can correct errors in time and interact more effectively.
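One hedged reading of S150 to S170, combined with the timeout variant just described, is the control loop sketched below. The feedback-reading helpers, the prompt callbacks and the 3-second confirmation window are assumptions introduced only for illustration.

```python
import time
from typing import Callable, Optional

def confirmation_loop(
    play_first_speech: Callable[[], None],
    read_feedback: Callable[[], Optional[bool]],   # True / False / None (no input yet)
    prompt_reinput: Callable[[], None],
    notify_listener_of_error: Callable[[], None],
    timeout_s: float = 3.0,                        # assumed confirmation window
) -> bool:
    """Return True if the speaker accepted the translation, False otherwise."""
    play_first_speech()                            # S140: the speaker hears what was recognized
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        feedback = read_feedback()                 # S150: explicit confirmation, if any
        if feedback is True:
            return True                            # S160: go on collecting the next utterance
        if feedback is False:
            prompt_reinput()                       # S170: ask the speaker to restate the content
            notify_listener_of_error()             # tell the listener the last playback was wrong
            return False
        time.sleep(0.05)
    # Timeout variant: silence after playback is treated as acceptance.
    return True
```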
在一些实施例中,在上述图1对应实施例的基础上,步骤S110包括:In some embodiments, based on the above-mentioned corresponding embodiment of Figure 1, step S110 includes:
对第一语音信息进行声纹侦测,仅对属于第一用户发声的对应音频段落进行语音识别,得到第一文本数据。也即对不属于第一用户发声的对应音频段落,不进行语音识别。Perform voiceprint detection on the first voice information and perform speech recognition only on the audio segments uttered by the first user to obtain the first text data; that is, no speech recognition is performed on audio segments not uttered by the first user.
步骤S120包括:对第一语音信息进行声纹侦测,仅对属于第一用户发声的对应音频段落进行翻译,得到第二合成语音。也即,翻译过程中,当侦测到不属于第一用户发声的音频段落时,暂停翻译。当侦测到属于第一用户发声的音频段落时,继续进行翻译,直至完成翻译。Step S120 includes: performing voiceprint detection on the first voice information, and translating only the corresponding audio paragraphs uttered by the first user to obtain a second synthesized voice. That is, during the translation process, when an audio segment that does not belong to the first user's utterance is detected, the translation is paused. When an audio segment belonging to the first user's utterance is detected, translation is continued until the translation is completed.
这样利于保证跨语言沟通时实时翻译的准确性。This helps ensure the accuracy of real-time translation during cross-language communication.
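The voiceprint gating described above can be pictured as a filter that drops audio segments whose speaker embedding does not match the first user before recognition or translation runs. The embedding extractor, the cosine-similarity test and the 0.75 threshold are all assumptions made for this sketch.

```python
import math
from typing import Callable, Iterable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keep_first_user_segments(
    segments: Iterable[bytes],
    embed: Callable[[bytes], Sequence[float]],   # assumed voiceprint embedding model
    first_user_print: Sequence[float],           # enrolled voiceprint of the first user
    threshold: float = 0.75,                     # assumed similarity threshold
) -> List[bytes]:
    """Only segments attributed to the first user are passed on to ASR / translation."""
    kept = []
    for seg in segments:
        if cosine(embed(seg), first_user_print) >= threshold:
            kept.append(seg)   # recognized and translated
        # otherwise the segment is skipped: recognition/translation pauses for it
    return kept
```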
在一些实施例中,在上述图1对应实施例的基础上,在步骤S140之前,该方法还包括:In some embodiments, based on the corresponding embodiment of Figure 1 above, before step S140, the method further includes:
分别实时采集第一用户和第二用户的环境语音数据。The environmental voice data of the first user and the second user are respectively collected in real time.
对环境语音数据进行声纹侦测。Perform voiceprint detection on environmental voice data.
当侦测到对应第一用户的环境语音数据存在其他人的声纹信息时,向第一用户发送第一预设提示信息。该第一预设提示信息用于语音提醒第一用户周围有外人,注意保护隐私。When it is detected that other people's voiceprint information corresponds to the first user's ambient voice data, first preset prompt information is sent to the first user. The first preset prompt message is used to remind the first user by voice that there are outsiders around and to pay attention to protecting privacy.
当侦测到对应第二用户的环境语音数据存在其他人的声纹信息时,暂停对第一语音信息进行翻译,并向第二用户发送第二预设提示信息。When it is detected that the voiceprint information of another person exists in the environmental voice data of the second user, the translation of the first voice information is suspended, and the second preset prompt information is sent to the second user.
直至在第二用户侧未侦测到其他人的声纹信息时,继续对第一语音信息进行翻译。Translation of the first voice information resumes only when no other person's voiceprint information is detected on the second user's side.
这样利于对第一用户和第二用户的隐私保护。示例性地,上述步骤可以位于步骤S130和步骤S140之间。This is beneficial to protecting the privacy of the first user and the second user. For example, the above steps may be located between step S130 and step S140.
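A minimal sketch of this privacy check is given below. The bystander-detection helper and the wording of the two prompts are assumptions; the disclosure only specifies that a first preset prompt warns the first user and a second preset prompt accompanies a pause on the second user's side.

```python
from typing import Callable

def monitor_privacy(
    speaker_env_audio: bytes,
    listener_env_audio: bytes,
    contains_unknown_voice: Callable[[bytes], bool],  # assumed voiceprint-based detector
    warn_speaker: Callable[[str], None],
    warn_listener: Callable[[str], None],
) -> bool:
    """Return True if translation may continue, False if it should pause."""
    if contains_unknown_voice(speaker_env_audio):
        # First preset prompt: remind the speaker that bystanders are present.
        warn_speaker("Bystanders detected nearby; mind your privacy.")
    if contains_unknown_voice(listener_env_audio):
        # Second preset prompt: pause translation until the listener side is clear.
        warn_listener("Translation paused: another voice detected on your side.")
        return False
    return True
```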
在一些实施例中,在上述图1对应实施例的基础上,该方法在步骤S140之前,还包括步骤:In some embodiments, based on the above-mentioned corresponding embodiment of Figure 1, before step S140, the method further includes the steps:
分别基于上述第一文本数据和上述第二文本数据,提取各自的兴趣标签和对应的时间戳信息。Based on the above-mentioned first text data and the above-mentioned second text data, respective interest tags and corresponding timestamp information are extracted.
依据上述时间戳信息,将对应的兴趣标签分别标注于上述第一文本数据和上述第二文本数据。According to the above timestamp information, corresponding interest tags are marked on the above first text data and the above second text data respectively.
示例性地,上述步骤可以位于步骤S130和步骤S140之间。For example, the above steps may be located between step S130 and step S140.
步骤S140包括:在播放第一合成语音以及第二合成语音时,对兴趣标签关联的语音进行语音提示。Step S140 includes: when playing the first synthesized voice and the second synthesized voice, providing a voice prompt for the voice associated with the interest tag.
其中,上述兴趣标签和时间戳具有一一对应的关系。利用该兴趣标签可以方便用户在回听语音时,可以快速找到重点内容。上述兴趣标签可以以文本或者语音的形式进行展示。Among them, the above-mentioned interest tags and timestamps have a one-to-one correspondence. Using this interest tag can facilitate users to quickly find key content when listening back to the voice. The above interest tags can be displayed in the form of text or voice.
本申请对反复提及的内容或说话音调或语义分析内容重点,敏感隐私,又或聆听方临时无法专注倾听但又想后续可回顾内容时,主动启动兴趣标签标注功能,记录时间戳和对应的兴趣标签并存储。当用户回看兴趣标签时,可以展示出通话翻译的兴趣点,可跳转查看兴趣点前后内容,或重听兴趣点前后的交谈历史,提升用户体验。还可以依靠兴趣标签使得交流两方关注到重点内容,避免被错误理解或者忽视,利于提高跨语言交流体验。For content that is repeatedly mentioned, content highlighted by tone of voice or semantic analysis, sensitive or private content, or when the listening party is temporarily unable to concentrate but wants to review the content later, this application proactively activates the interest-tag annotation function and records and stores the timestamps together with the corresponding interest tags. When the user later reviews the interest tags, the points of interest of the translated conversation can be displayed, and the user can jump to the content before and after a point of interest, or listen again to the conversation history around it, improving the user experience. Interest tags also help both parties focus on the key content so that it is not misunderstood or overlooked, which improves the cross-language communication experience.
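As one narrow illustration of the interest-tag idea, the sketch below tags words that are repeated often enough to count as points of interest and pairs each tag with a timestamp. The repetition threshold and the word-level granularity are assumptions; the disclosure also mentions tone, semantic analysis and privacy as triggers, which are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class InterestTag:
    label: str
    timestamp: float   # seconds into the conversation (one-to-one with the tag)

def extract_interest_tags(
    timed_words: List[Tuple[str, float]],   # (word, timestamp) pairs from ASR
    min_repeats: int = 3,                   # assumed "repeatedly mentioned" threshold
) -> List[InterestTag]:
    """Tag words repeated at least min_repeats times as points of interest."""
    counts: Dict[str, int] = {}
    first_seen: Dict[str, float] = {}
    for word, ts in timed_words:
        counts[word] = counts.get(word, 0) + 1
        first_seen.setdefault(word, ts)
    return [InterestTag(w, first_seen[w]) for w, n in counts.items() if n >= min_repeats]

# Example: "contract" is mentioned three times, so it becomes a point of interest.
tags = extract_interest_tags([("contract", 1.0), ("price", 2.5),
                              ("contract", 9.0), ("contract", 30.2)])
```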
在一些实施例中,在上述实施例的基础上,该方法在步骤S140之前,还包括步骤:In some embodiments, based on the above embodiments, before step S140, the method further includes the steps:
分别采集第一用户和第二用户的声纹信息。The voiceprint information of the first user and the second user is collected respectively.
基于所述声纹信息,分别识别得到第一用户和第二用户的身份信息。Based on the voiceprint information, the identity information of the first user and the second user is respectively identified.
基于所述第一用户和第二用户的身份信息,确定第一用户和第二用户之间的关系类型。Based on the identity information of the first user and the second user, a relationship type between the first user and the second user is determined.
基于第一用户和第二用户之间的关系类型,以及预设身份关系兴趣库,对上述提取的兴趣标签进行筛选,保留下来的作为目标兴趣标签。Based on the relationship type between the first user and the second user and the preset identity relationship interest library, the above-extracted interest tags are filtered, and the retained ones are used as target interest tags.
示例性地,上述步骤可以位于步骤S130和步骤S140之间。For example, the above steps may be located between step S130 and step S140.
步骤S140包括:在播放第一合成语音以及第二合成语音时,对目标兴趣标签关联的语音进行语音提示。Step S140 includes: when playing the first synthesized voice and the second synthesized voice, providing a voice prompt for the voice associated with the target interest tag.
其中,第一用户和第二用户之间的关系类型的确定,可以依据预设身份关系数据库实现。The determination of the relationship type between the first user and the second user can be based on a preset identity relationship database.
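The filtering step that keeps only relationship-relevant tags could look like the sketch below. The relationship names and the preset identity-relationship interest library contents are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set

# Assumed preset identity-relationship interest library: which tag categories
# are worth keeping for a given relationship type between the two users.
RELATIONSHIP_INTERESTS: Dict[str, Set[str]] = {
    "business": {"price", "contract", "delivery"},
    "friends": {"travel", "food", "schedule"},
}

def filter_tags_by_relationship(tags: Iterable[str], relationship: str) -> List[str]:
    """Keep only the interest tags relevant to the identified relationship type."""
    allowed = RELATIONSHIP_INTERESTS.get(relationship, set())
    return [t for t in tags if t in allowed]  # the survivors are the target interest tags
```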
在一些实施例中,在上述图1对应实施例的基础上,步骤S110包括:In some embodiments, based on the above-mentioned corresponding embodiment of Figure 1, step S110 includes:
S111,获取关于上述第一语音信息的前文数据。S111. Obtain previous data about the first voice information.
S112,对第一用户输入的第一语音信息进行语音识别,获得初始文本数据。S112: Perform speech recognition on the first voice information input by the first user to obtain initial text data.
S113,利用前文数据对初始文本数据进行修正,得到第一文本数据。S113, use the previous data to modify the initial text data to obtain the first text data.
具体而言,获取用户在发声第一语音信息之前发声的其他数据,基于这些数据对识别的文本进行修正,利于提高翻译的准确性,进而提高跨语言交流时的顺畅度。比如,当前文数据对同一语音翻译为第一词语,而该步骤将该语音翻译为第二词语时,此时就将第二词语修正为第一词语。Specifically, other data uttered by the user before the first voice information is obtained, and the recognized text is corrected based on this data, which helps improve translation accuracy and thus the smoothness of cross-language communication. For example, when the preceding data rendered a given utterance as a first word while this step renders the same utterance as a second word, the second word is corrected to the first word.
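A small sketch of the correction in S113 is shown below. The token-level lookup derived from the preceding text, and the example names, are assumptions used only to illustrate the "second word corrected back to the first word" behavior.

```python
from typing import Dict, List

def correct_with_context(
    initial_tokens: List[str],
    context_preferences: Dict[str, str],
) -> List[str]:
    """S113: replace a newly recognized token with the variant already used earlier.

    context_preferences maps a token the recognizer just produced to the token
    the same speaker's earlier utterances settled on (an assumed, pre-built
    lookup derived from the preceding data).
    """
    return [context_preferences.get(tok, tok) for tok in initial_tokens]

# Example: earlier context rendered a name as "Zhang San", so a later
# mis-recognition "Zhang Shan" is corrected back to the established form.
fixed = correct_with_context(["hello", "Zhang Shan"], {"Zhang Shan": "Zhang San"})
```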
在一些实施例中,在上述图1对应实施例的基础上,参考图8,在步骤S140之前,还包括步骤:In some embodiments, based on the above-mentioned corresponding embodiment of Figure 1, with reference to Figure 8, before step S140, the steps are also included:
S180,当检测到用于输出语音的一对耳机分别由不同用户佩戴时,控制上述耳机工作于第一状态;当检测到第一用户和第二用户各佩戴一对耳机时,控制上述耳机工作于第二状态。S180, when it is detected that a pair of earphones for outputting voice are worn by different users respectively, control the above-mentioned earphones to work in the first state; when it is detected that the first user and the second user each wear a pair of earphones, control the above-mentioned earphones to work in the second state.
其中,在第一状态下,两个耳机分别输出不同的语音,被用户佩戴的一耳机作为该用户的语音输出通道。在第二状态下,一对耳机中的两个输出相同的语音,被用户佩戴的一对耳机作为该用户的语音输出通道。In the first state, the two earphones output different voices respectively, and the earphone worn by the user serves as the user's voice output channel. In the second state, two of the pair of earphones output the same voice, and the pair of earphones worn by the user serves as the user's voice output channel.
示例性地,上述步骤可以位于步骤S130和步骤S140之间。For example, the above steps may be located between step S130 and step S140.
步骤S140替换为步骤S141:将第一合成语音基于第一用户佩戴的耳机向第一用户输出,同步将第二合成语音基于第二用户佩戴的耳机向第二用户输出。Step S140 is replaced with step S141: outputting the first synthesized voice to the first user based on the earphone worn by the first user, and simultaneously outputting the second synthesized voice to the second user based on the earphone worn by the second user.
具体而言,比如一副耳机中的两个耳机分别被两个用户佩戴时,即第一用户佩戴其中一个,第二用户佩戴另一个,此时可以控制该副耳机工作于面对面翻译状态。当每个用户各佩戴一副耳机时,控制该副耳机工作于远距离翻译状态。面对面翻译状态时,需要将对应该用户的合成语音,编码为该用户佩戴的耳机对应的声道音频。比如,当第一用户佩戴一副耳机中的左声道耳机时,就将第一合成语音编码为左声道音频。相应地,将第二合成语音编码为右声道音频。Specifically, for example, when two earphones in a pair of earphones are worn by two users respectively, that is, the first user wears one and the second user wears the other, then the pair of earphones can be controlled to work in a face-to-face translation state. When each user wears a pair of headphones, the pair of headphones is controlled to work in a long-distance translation state. In the face-to-face translation state, the user's synthesized speech needs to be encoded into the audio channel corresponding to the headset worn by the user. For example, when the first user wears a left-channel earphone in a pair of earphones, the first synthesized speech is encoded as left-channel audio. Accordingly, the second synthesized speech is encoded as right channel audio.
将第一用户佩戴的耳机作为第一语音输出通道,将第二用户佩戴的耳机作为第二语音输出通道。将第一合成语音基于第一语音输出通道向第一用户输出,同步将第二合成语音基于第二语音输出通道向第二用户输出。The earphone worn by the first user is used as the first voice output channel, and the earphone worn by the second user is used as the second voice output channel. The first synthesized voice is output to the first user based on the first voice output channel, and the second synthesized voice is simultaneously output to the second user based on the second voice output channel.
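The channel routing described above can be pictured as follows; the left/right assignment in the shared-pair case and the state names are illustrative assumptions, and a real implementation would route the second pair's audio on its own device.

```python
from enum import Enum
from typing import Tuple

class EarbudMode(Enum):
    SHARED_PAIR = 1      # first state: the two earbuds of one pair are worn by different users
    SEPARATE_PAIRS = 2   # second state: each user wears a full pair

def route_audio(mode: EarbudMode, first_speech: bytes, second_speech: bytes) -> Tuple[bytes, bytes]:
    """Return (left_channel, right_channel) payloads for the first user's earbud pair.

    In the shared-pair (face-to-face) state, the speaker's echo goes to one
    earbud and the translation to the other; the left/right choice here is an
    assumption.  In the separate-pairs state, this pair plays the first user's
    stream on both channels, and the second user's own pair would likewise
    carry second_speech on both channels.
    """
    if mode is EarbudMode.SHARED_PAIR:
        return first_speech, second_speech
    return first_speech, first_speech
```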
在一些实施例中,在上述实施例的基础上,上述耳机设有UWB(Ultra Wide Band,超宽带无线载波通信)通信模块。步骤S180包括:In some embodiments, based on the above embodiments, the above-mentioned earphones are provided with a UWB (Ultra Wide Band, ultra-wideband wireless carrier communication) communication module. Step S180 includes:
上述一对耳机基于UWB通信模块,实时侦测两个耳机之间的距离。The above pair of headphones are based on the UWB communication module and detect the distance between the two headphones in real time.
当两个耳机之间的距离大于第一预设阈值时,控制上述耳机工作于第一状态;当两个耳机之间的距离小于第一预设阈值时,控制上述耳机工作于第二状态。When the distance between the two earphones is greater than the first preset threshold, the earphones are controlled to operate in the first state; when the distance between the two earphones is less than the first preset threshold, the earphones are controlled to operate in the second state.
具体而言,当两个耳机之间的距离大于第一预设阈值时,说明一副耳机中的两个耳机分别被两个用户佩戴。当两个耳机之间的距离小于第一预设阈值时,说明一副耳机由同一用户佩戴。Specifically, when the distance between the two earphones is greater than the first preset threshold, it means that the two earphones in a pair of earphones are respectively worn by two users. When the distance between the two earphones is less than the first preset threshold, it indicates that one pair of earphones is worn by the same user.
示例性地,上述第一预设阈值可以为25cm。在其他实施例中,耳机之间的距离也可以通过耳机中内设的超声波信号模块实现侦测,本申请对此不作限制。For example, the above-mentioned first preset threshold may be 25cm. In other embodiments, the distance between the earphones can also be detected through the built-in ultrasonic signal module in the earphones, which is not limited in this application.
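A minimal sketch of the distance-based state selection, using the 25 cm example mentioned above as the first preset threshold, is given below. The state labels are illustrative names, not terms from the disclosure.

```python
def select_mode(distance_m: float, threshold_m: float = 0.25) -> str:
    """Pick the earphone working state from the UWB-measured earbud-to-earbud distance."""
    # Farther apart than the threshold: the two earbuds of one pair are worn
    # by two different users, so the pair works in the first (face-to-face) state.
    if distance_m > threshold_m:
        return "first_state_face_to_face"
    # Both earbuds close together: the pair is on a single user, so it works
    # in the second (long-distance) state.
    return "second_state_remote"

# Example: 0.60 m apart selects the face-to-face state, 0.10 m the remote state.
modes = (select_mode(0.60), select_mode(0.10))
```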
在一些实施例中,在上述图1对应实施例的基础上,参考图9,该方法包括步骤:In some embodiments, based on the above-mentioned corresponding embodiment of Figure 1, with reference to Figure 9, the method includes the steps:
S190,根据第一合成语音中未播放的音频帧,以及已采集生成第一语音信息且未转换为第一合成语音的音频帧,组合形成音频调节库。S190: Combine the audio frames of the first synthesized speech that have not yet been played with the audio frames that have been captured as the first voice information but not yet converted into the first synthesized speech, to form an audio adjustment library.
S200,基于上述音频调节库中的音频帧数量,调节第一合成语音的播放语速。其中,上述第一合成语音对应的播放语速随着上述音频调节库中的音频帧数量的增大而增大,且随着上述音频调节库中的音频帧数量的减小而减小。在一些实施例中,也可以同步对第二合成语音的播放语速进行调节。S200: Adjust the playback speed of the first synthesized speech based on the number of audio frames in the audio adjustment library. Wherein, the playback speech speed corresponding to the first synthesized speech increases as the number of audio frames in the audio adjustment library increases, and decreases as the number of audio frames in the audio adjustment library decreases. In some embodiments, the playback speed of the second synthesized voice can also be adjusted synchronously.
这样可以避免比如当用户已经说到第S6个音频段落时,耳机里还在回放第S3个音频段落,那样将不利于说话方及时发现语音识别过程中出现的错误,从而让语言不同的交流方能实现及时纠正错误。This avoids situations where, for example, the user has already spoken the sixth audio segment S6 while the earphone is still playing back the third segment S3, which would prevent the speaker from promptly noticing errors in the speech recognition process and thus prevent the two parties speaking different languages from correcting errors in time.
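One way to realize S200 is to map the size of the audio adjustment library to a playback rate, so the echo speeds up when the backlog grows and slows down when it shrinks. The nominal frame count and the rate bounds below are illustrative assumptions, not values from the disclosure.

```python
def playback_rate(backlog_frames: int,
                  nominal_frames: int = 100,
                  min_rate: float = 0.9,
                  max_rate: float = 1.5) -> float:
    """S200: the more frames queued in the audio adjustment library, the faster
    the first synthesized speech is played, so the echo never lags far behind
    the speaker; the rate is clamped to a comfortable range."""
    rate = backlog_frames / nominal_frames
    return max(min_rate, min(max_rate, rate))

# Example: a large backlog (180 frames) is capped at 1.5x, a small one (60) is floored at 0.9x.
fast, slow = playback_rate(180), playback_rate(60)
```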
在一些实施例中,在上述图1对应实施例的基础上,在上述步骤S110之前,该方法还包括:In some embodiments, based on the above-mentioned corresponding embodiment of Figure 1, before the above-mentioned step S110, the method further includes:
当检测到第一用户仅佩戴一副耳机中的一个,以及该对耳机中两个耳机的距离大于第一预设阈值时,采集第一用户所处环境的环境声数据,并从上述环境声数据中提取出多个语音数据。When it is detected that the first user is wearing only one earbud of a pair and that the distance between the two earbuds of that pair is greater than the first preset threshold, ambient sound data of the environment in which the first user is located is collected, and multiple pieces of voice data are extracted from the ambient sound data.
确定上述多个语音数据对应的语种类别,作为第二语种类别。The language category corresponding to the plurality of voice data is determined as the second language category.
基于该副耳机中的另一个向外界环境中播放提示信息;上述提示信息用于提示第二用户佩戴该副耳机中的另一个。Prompt information is played to the surrounding environment through the other earbud of the pair; the prompt information is used to prompt the second user to wear that other earbud.
上述步骤S140包括:The above step S140 includes:
在检测到第二用户佩戴第二个耳机后,将第二合成语音通过上述第二个耳机向第二用户输出。After detecting that the second user is wearing the second earphone, the second synthesized voice is output to the second user through the second earphone.
这样可以便于在第一用户处于一个陌生语种的环境中,比如去另一个国家旅游时,与第二用户需要面对面翻译的场景下,帮助与第二用户快速建立沟通,利于提高跨语言沟通的顺畅度。This makes it easier, when the first user is in an environment with an unfamiliar language (for example, when travelling to another country) and needs face-to-face translation with the second user, to quickly establish communication with the second user, which helps improve the smoothness of cross-language communication.
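The determination of the second language category from the ambient voice data could be approached as a simple majority vote over the extracted clips, as sketched below. The language-identification model and the voting strategy are assumptions; the disclosure only states that the language category of the extracted voice data is determined and used as the second language category.

```python
from collections import Counter
from typing import Callable, Iterable

def guess_second_language(
    ambient_voice_clips: Iterable[bytes],
    identify_language: Callable[[bytes], str],   # assumed language-ID model
) -> str:
    """Pick the dominant language heard around the first user as the second language.

    Assumes at least one voice clip was extracted from the ambient sound data.
    """
    votes = Counter(identify_language(clip) for clip in ambient_voice_clips)
    language, _count = votes.most_common(1)[0]
    return language
```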
需要说明的是,本申请中公开的上述所有实施例可以进行自由组合,组合后得到的技术方案也在本申请的保护范围之内。It should be noted that all the above-mentioned embodiments disclosed in this application can be freely combined, and the technical solution obtained after the combination is also within the protection scope of this application.
如图10所示,本发明一实施例还公开了一种语音实时翻译系统10,该系统包括:As shown in Figure 10, an embodiment of the present invention also discloses a real-time speech translation system 10, which includes:
第一文本数据生成模块101,对第一用户输入的第一语音信息进行语音识别,获得第一文本数据。a first text data generation module 101, which performs speech recognition on the first voice information input by the first user to obtain the first text data;
第二合成语音生成模块102,对所述第一语音信息进行翻译,获得对应于第二语种类别的第二合成语音。a second synthesized speech generation module 102, which translates the first voice information to obtain a second synthesized speech corresponding to the second language category;
第一合成语音生成模块103,基于所述第一文本数据,获得对应于第一语种类别的第一合成语音。a first synthesized speech generation module 103, which obtains, based on the first text data, a first synthesized speech corresponding to the first language category; and
合成语音播放模块104,将所述第一合成语音向第一用户输出,以及同步将所述第二合成语音向第二用户输出。a synthesized speech playback module 104, which outputs the first synthesized speech to the first user and synchronously outputs the second synthesized speech to the second user.
可以理解的是,本发明的语音实时翻译系统还包括其他支持语音实时翻译系统运行的现有功能模块。图10显示的语音实时翻译系统仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。It can be understood that the real-time speech translation system of the present invention also includes other existing functional modules that support the operation of the real-time speech translation system. The real-time speech translation system shown in Figure 10 is only an example and should not impose any restrictions on the functions and scope of use of the embodiments of the present invention.
本实施例中的语音实时翻译系统用于实现上述的语音实时翻译的方法,因此对于语音实时翻译系统的具体实施步骤可以参照上述对语音实时翻译的方法的描述,此处不再赘述。The real-time speech translation system in this embodiment is used to implement the above-mentioned real-time speech translation method. Therefore, for the specific implementation steps of the real-time speech translation system, reference can be made to the above description of the real-time speech translation method, which will not be described again here.
本发明一实施例还公开了一种语音实时翻译设备,包括处理器和存储器,其中存储器存储有所述处理器的可执行程序;处理器配置为经由执行可执行程序来执行上述语音实时翻译方法中的步骤。图11是本发明公开的语音实时翻译设备的结构示意图。下面参照图11来描述根据本发明的这种实施方式的电子设备600。图11显示的电子设备600仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。An embodiment of the present invention also discloses a real-time speech translation device, which includes a processor and a memory, wherein the memory stores an executable program of the processor, and the processor is configured to perform the steps of the above real-time speech translation method by running the executable program. Figure 11 is a schematic structural diagram of the real-time speech translation device disclosed in the present invention. The electronic device 600 according to this embodiment of the present invention is described below with reference to Figure 11. The electronic device 600 shown in Figure 11 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
如图11所示,电子设备600以通用计算设备的形式表现。电子设备600的组件可以包括但不限于:至少一个处理单元610、至少一个存储单元620、连接不同平台组件(包括存储单元620和处理单元610)的总线630、显示单元640等。As shown in Figure 11, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, and so on.
其中,存储单元存储有程序代码,程序代码可以被处理单元610执行,使得处理单元610执行本说明书上述语音实时翻译方法部分中描述的根据本发明各种示例性实施方式的步骤。例如,处理单元610可以执行如图1中所示的步骤。The storage unit stores program code, and the program code can be executed by the processing unit 610, so that the processing unit 610 performs the steps, described in the real-time speech translation method section of this specification, according to various exemplary embodiments of the present invention. For example, the processing unit 610 may perform the steps shown in Figure 1.
存储单元620可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)6201和/或高速缓存存储单元6202,还可以进一步包括只读存储单元(ROM)6203。The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
存储单元620还可以包括具有一组(至少一个)程序模块6205的程序/实用工具6204,这样的程序模块6205包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include but are not limited to: an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
总线630可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
电子设备600也可以与一个或多个外部设备700(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该电子设备600交互的设备通信,和/或与使得该电子设备600能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口650进行。并且,电子设备600还可以通过网络适配器660与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。网络适配器660可以通过总线630与电子设备600的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备600使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储平台等。The electronic device 600 may also communicate with one or more external devices 700 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650. Moreover, the electronic device 600 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms and the like.
本发明还公开了一种计算机可读存储介质,用于存储程序,所述程序被执行时实现上述语音实时翻译方法中的步骤。在一些可能的实施方式中,本发明的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在终端设备上运行时,程序代码用于使终端设备执行本说明书上述语音实时翻译方法中描述的根据本发明各种示例性实施方式的步骤。The present invention also discloses a computer-readable storage medium for storing a program, where the program, when executed, implements the steps of the above real-time speech translation method. In some possible implementations, various aspects of the present invention may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps, described in the real-time speech translation method section of this specification, according to various exemplary embodiments of the present invention.
如上所示,该实施例的计算机可读存储介质的程序在执行时,对说话方的说话内容,基于语音识别得到的第一文本数据,合成第一合成语音,在向聆听方播放第二合成语音的同时,向说话方同步播放第一合成语音,使得识别语音过程中出现错误导致翻译内容时产生的误差,能够及时被说话方知晓,提高了跨语言翻译的准确率以及跨语言交流双方的沟通效率。As described above, when the program on the computer-readable storage medium of this embodiment is executed, a first synthesized speech is synthesized for the speaker's utterance from the first text data obtained by speech recognition, and while the second synthesized speech is played to the listener, the first synthesized speech is simultaneously played to the speaker. Errors in the translated content caused by mistakes in the speech recognition process can therefore be noticed by the speaker in time, which improves the accuracy of cross-language translation and the communication efficiency of both parties.
本发明一实施例公开了一种计算机可读存储介质。该存储介质是实现上述方法的程序产品,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本发明的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。An embodiment of the invention discloses a computer-readable storage medium. The storage medium is a program product that implements the above method, which can be a portable compact disk read-only memory (CD-ROM) and include program code, and can be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, a readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus or device.
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The Program Product may take the form of one or more readable media in any combination. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any combination thereof. More specific examples (non-exhaustive list) of readable storage media include: electrical connection with one or more conductors, portable disk, hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
计算机可读存储介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读存储介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。可读存储介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。A computer-readable storage medium may include a data signal propagated in baseband or as part of a carrier wave carrying the readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A readable storage medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a readable storage medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
可以以一种或多种程序设计语言的任意组合来编写用于执行本发明操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the 'C' language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
本发明实施例提供的语音实时翻译方法、系统、设备以及存储介质对说话方的说话内容,基于语音识别得到的第一文本数据,合成第一合成语音,在向聆听方播放第二合成语音的同时,向说话方同步播放第一合成语音,使得识别语音过程中出现错误导致翻译内容时产生的误差,能够及时被说话方知晓,提高了跨语言翻译的准确率以及跨语言交流双方的沟通效率;使得在跨语言交流中沟通更加顺畅。The real-time speech translation method, system, device and storage medium provided by the embodiments of the present invention synthesize, for the speaker's utterance, a first synthesized speech from the first text data obtained by speech recognition, and play the first synthesized speech to the speaker while the second synthesized speech is played to the listener. Errors in the translated content caused by mistakes in the speech recognition process can therefore be noticed by the speaker in time, which improves the accuracy of cross-language translation and the communication efficiency of both parties, making cross-language communication smoother.
以上内容是结合具体的优选实施方式对本发明所作的进一步详细说明,不能认定本发明的具体实施只局限于这些说明。对于本发明所属技术领域的普通技术人员来说,在不脱离本发明构思的前提下,还可以做出若干简单推演或替换,都应当视为属于本发明的保护范围。The above content is a further detailed description of the present invention in combination with specific preferred embodiments, and it cannot be concluded that the specific implementation of the present invention is limited to these descriptions. For those of ordinary skill in the technical field to which the present invention belongs, several simple deductions or substitutions can be made without departing from the concept of the present invention, and all of them should be regarded as belonging to the protection scope of the present invention.
Claims (13)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/119375 WO2024055299A1 (en) | 2022-09-16 | 2022-09-16 | Real-time speech translation method, system, device, and storage medium |
| CN202280003469.3A CN116097347A (en) | 2022-09-16 | 2022-09-16 | Voice real-time translation method, system, equipment and storage medium |
| TW111147207A TWI842261B (en) | 2022-09-16 | 2022-12-08 | Voice rael-time translating method, system and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/119375 WO2024055299A1 (en) | 2022-09-16 | 2022-09-16 | Real-time speech translation method, system, device, and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024055299A1 true WO2024055299A1 (en) | 2024-03-21 |
Family
ID=86201123
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/119375 Ceased WO2024055299A1 (en) | 2022-09-16 | 2022-09-16 | Real-time speech translation method, system, device, and storage medium |
Country Status (3)
| Country | Link |
|---|---|
| CN (1) | CN116097347A (en) |
| TW (1) | TWI842261B (en) |
| WO (1) | WO2024055299A1 (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001222531A (en) * | 2000-02-08 | 2001-08-17 | Atr Interpreting Telecommunications Res Lab | Voice translation device and computer readable recording medium with recorded voice translation processing program with feedback function |
| US20060293874A1 (en) * | 2005-06-27 | 2006-12-28 | Microsoft Corporation | Translation and capture architecture for output of conversational utterances |
| CN101025735A (en) * | 2006-02-20 | 2007-08-29 | 株式会社东芝 | Apparatus and method for supporting in communication through translation between different languages |
| US20090006082A1 (en) * | 2007-06-29 | 2009-01-01 | Microsoft Corporation | Activity-ware for non-textual objects |
| US20100299147A1 (en) * | 2009-05-20 | 2010-11-25 | Bbn Technologies Corp. | Speech-to-speech translation |
| CN110446132A (en) * | 2019-08-07 | 2019-11-12 | 深圳市和信电子有限公司 | A kind of real time translation TWS bluetooth headset and its application method |
| CN112423106A (en) * | 2020-11-06 | 2021-02-26 | 四川长虹电器股份有限公司 | Method and system for automatically translating accompanying sound |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20050109919A (en) * | 2002-12-10 | 2005-11-22 | 텔어바웃 인크 | Content creation, distribution, interaction, and monitoring system |
| TW200538969A (en) * | 2004-02-11 | 2005-12-01 | America Online Inc | Handwriting and voice input with automatic correction |
| DE102004050785A1 (en) * | 2004-10-14 | 2006-05-04 | Deutsche Telekom Ag | Method and arrangement for processing messages in the context of an integrated messaging system |
| US8972268B2 (en) * | 2008-04-15 | 2015-03-03 | Facebook, Inc. | Enhanced speech-to-speech translation system and methods for adding a new word |
| US8775156B2 (en) * | 2010-08-05 | 2014-07-08 | Google Inc. | Translating languages in response to device motion |
Application events (2022):
- 2022-09-16: WO application PCT/CN2022/119375 filed, published as WO2024055299A1 (status: not active, Ceased)
- 2022-09-16: CN application CN202280003469.3A filed, published as CN116097347A (status: active, Pending)
- 2022-12-08: TW application TW111147207A filed, published as TWI842261B (status: active)
Also Published As
| Publication number | Publication date |
|---|---|
| TW202414384A (en) | 2024-04-01 |
| TWI842261B (en) | 2024-05-11 |
| CN116097347A (en) | 2023-05-09 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22958489; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 11202407806S; Country of ref document: SG |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22958489; Country of ref document: EP; Kind code of ref document: A1 |