CN111899706A - Audio production method, device, equipment and storage medium - Google Patents
- Publication number
- CN111899706A (application number CN202010753002.3A)
- Authority
- CN
- China
- Prior art keywords
- audio
- lyrics
- lyric
- human voice
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0033—Recording/reproducing or transmission of music for electrophonic musical instruments
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
The application discloses an audio production method, device, equipment and storage medium, belonging to the technical field of audio processing. The method comprises the following steps: displaying an audio editing interface of a first audio, the audio editing interface comprising at least one lyric of the first audio and a lyric editing control, the at least one lyric comprising first lyrics; receiving a lyric editing operation on the lyric editing control for the first lyrics, wherein the lyric editing operation comprises inputting second lyrics; and replacing the first lyrics in the first audio with the second lyrics to generate a second audio, wherein the second audio comprises human voice audio generated according to the second lyrics. The method can simplify the audio production steps.
Description
Technical Field
The embodiment of the invention relates to the technical field of multimedia, in particular to an audio production method, device, equipment and storage medium.
Background
When a user hears a favorite song in audio software, the user may want to re-create it based on that song, for example by changing the lyrics to produce music unique to the user.
In the related art, a user who wants to re-create a song needs to use professional audio editing software to separate the accompaniment and the vocals from the song according to professional audio editing methods, re-edit the accompaniment or the vocals, and synthesize the edited vocals with the accompaniment to obtain a new song.
The song production method in the related art requires the user to have professional audio editing skills, and the audio production steps are overly complex.
Disclosure of Invention
The embodiment of the invention provides an audio production method, device, equipment and storage medium, which can simplify the audio production steps. The technical scheme is as follows:
in one aspect, a method of audio production is provided, the method comprising:
displaying an audio editing interface of a first audio, the audio editing interface comprising at least one lyric of the first audio and a lyric editing control, the at least one lyric comprising first lyrics;
receiving a lyric editing operation on the lyric editing control for the first lyrics, wherein the lyric editing operation comprises inputting second lyrics;
replacing the first lyrics in the first audio with the second lyrics to generate a second audio, wherein the second audio comprises human voice audio generated according to the second lyrics.
Optionally, the method further comprises:
acquiring a target tone, wherein the target tone is used for generating the human voice audio;
the replacing the first lyrics in the first audio with the second lyrics to generate a second audio, comprising:
and replacing the first lyrics in the first audio with the second lyrics according to the target timbre to generate the second audio.
Optionally, the replacing the first lyrics in the first audio with the second lyrics according to the target timbre to generate the second audio includes:
generating the human voice audio containing the second lyrics according to the target timbre, the phoneme of the second lyrics and the notes corresponding to the first lyrics in the first audio;
acquiring a template audio of the first audio, wherein the template audio comprises at least one of an accompaniment audio and a main melody audio;
and generating the second audio according to the template audio and the human voice audio.
Optionally, the generating the human voice audio containing the second lyrics according to the target timbre, the phoneme of the second lyrics and the note corresponding to the first lyrics in the first audio comprises:
inputting the timbre identification of the target timbre, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a mel frequency spectrum;
and calling a vocoder to convert the Mel frequency spectrum into the human voice audio.
Optionally, the second audio comprises:
the audio duration is less than that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
the audio duration is equal to that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
the audio duration is less than that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre;
or,
the audio duration is equal to that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre.
Optionally, the method further comprises:
obtaining training data, the training data comprising: at least one of a phoneme of the training lyrics, a note of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, a timbre identification of the training audio, and a mel frequency spectrum of the training audio;
and training an initial model according to the training data to obtain the acoustic model.
Optionally, the method further comprises:
displaying an audio playing interface of the second audio, wherein the audio playing interface comprises a playing control;
and playing the second audio in response to receiving a playing operation triggering the playing control.
Optionally, the obtaining the target tone includes:
displaying a tone selection interface, the tone selection interface comprising at least one candidate tone and a selection control;
in response to receiving a selection operation for triggering the selection control, determining the target tone color from the candidate tone colors according to the selection operation;
after the replacing the first lyrics in the first audio with the second lyrics according to the target timbre and generating the second audio containing the second lyrics, the method further comprises:
and playing the second audio.
In another aspect, there is provided an audio producing apparatus, the apparatus including:
the display module is used for displaying an audio editing interface of a first audio, the audio editing interface comprises at least one lyric of the first audio and a lyric editing control, and the at least one lyric comprises first lyrics;
the interaction module is used for receiving a lyric editing operation on the first lyric on the lyric editing control, wherein the lyric editing operation comprises inputting second lyrics;
the generating module is used for replacing the first lyrics in the first audio with the second lyrics to generate a second audio, and the second audio comprises a human voice audio generated according to the second lyrics.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring a target tone, and the target tone is used for generating the human voice audio;
the generating module is further configured to replace the first lyrics in the first audio with the second lyrics according to the target timbre, and generate the second audio.
Optionally, the generating module is further configured to generate the human voice audio including the second lyrics according to the target timbre, a phoneme of the second lyrics, and a note corresponding to the first lyrics in the first audio;
the obtaining module is further configured to obtain a template audio of the first audio, where the template audio includes at least one of an accompaniment audio and a main melody audio;
the generating module is further configured to generate the second audio according to the template audio and the human voice audio.
Optionally, the generating module includes:
the model submodule is used for inputting the timbre identification of the target timbre, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a mel frequency spectrum;
a vocoder submodule for invoking a vocoder to convert the Mel spectrum to the human voice audio.
Optionally, the second audio comprises:
the audio duration is less than that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
the audio duration is equal to that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
the audio duration is less than that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre;
or,
the audio duration is equal to that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre.
Optionally, the apparatus further comprises:
the obtaining module is further configured to obtain training data, where the training data includes: at least one of a phoneme of the training lyrics, a note of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, a timbre identification of the training audio, and a mel frequency spectrum of the training audio;
and the training module is used for training an initial model according to the training data to obtain the acoustic model.
Optionally, the apparatus further comprises:
the display module is further configured to display an audio playing interface of the second audio, where the audio playing interface includes a playing control;
the interaction module is further used for receiving a play operation for triggering the play control;
and the playing module is used for playing the second audio in response to receiving a playing operation triggering the playing control.
Optionally, the apparatus further comprises:
the display module is further configured to display a tone selection interface, where the tone selection interface includes at least one candidate tone and a selection control;
the interaction module is further used for receiving selection operation for triggering the selection control;
the obtaining module is further configured to determine, in response to receiving a selection operation that triggers the selection control, the target tone color from the candidate tone colors according to the selection operation.
And the playing module is used for playing the second audio.
In another aspect, a computer device is provided, the computer device comprising: the audio production device comprises a processor and a memory, wherein the memory stores instructions which are executed by the processor to realize the audio production method.
In another aspect, a computer-readable storage medium is provided, which stores instructions that, when executed by a processor, implement the audio production method described above.
In another aspect, a computer program product comprising instructions is provided, which when run on a computer, causes the computer to perform the audio production method described above.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the method has the advantages that the change of the lyrics of the song by the user is received on the audio editing interface, the changed song is generated according to the changed lyrics of the user and the original song, so that the user can modify the lyrics of the song by one key to quickly generate a new song, the operation steps of generating the audio by the user are simplified, and the audio editing efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a block diagram illustrating the structure of a computer system in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of audio production according to another exemplary embodiment;
FIG. 3 is a schematic diagram of an audio editing interface, shown in accordance with another exemplary embodiment;
FIG. 4 is a flow chart illustrating a method of audio production according to another exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a tone selection interface in accordance with another illustrative embodiment;
FIG. 6 is a flow chart illustrating a method of audio production according to another exemplary embodiment;
FIG. 7 is a flow chart illustrating a method of audio production according to another exemplary embodiment;
FIG. 8 is a flow chart illustrating a method of acoustic model training in accordance with another exemplary embodiment;
FIG. 9 is a schematic diagram illustrating the structure of an audio production device according to another exemplary embodiment;
fig. 10 is a schematic diagram illustrating a structure of a terminal according to another exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Before the embodiments of the present invention are described in detail, application scenarios and implementation environments related to the embodiments of the present invention are briefly described.
First, terms related to the embodiments of the present invention are briefly explained.
User Interface (UI) controls, any visual control or element that can be seen on a User Interface of an application, such as controls for pictures, input boxes, text boxes, buttons, labels, etc., some of which are responsive to User operations, such as a User triggering an edit control to enter text. The UI control referred to in the embodiments of the present application includes, but is not limited to: lyric editing control, playing control and selecting control.
Phoneme (phone): the smallest phonetic unit, divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable; one articulatory action forms one phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable ā contains one phoneme, ài contains two phonemes, and dài contains three phonemes. A phoneme is the smallest unit or smallest speech segment constituting a syllable, and is the smallest linear speech unit divided from the viewpoint of sound quality; phonemes are concretely existing physical phenomena. The symbols of the International Phonetic Alphabet (IPA, established and published by the International Phonetic Association in 1888 and revised many times since; also called the "international phonetic letters" or "universal phonetic letters") correspond one-to-one to the phonemes of human languages. Phonemes are generally transcribed with the IPA; transcriptions indicating phonetic detail are enclosed in square brackets [ ], and phonemic transcriptions are enclosed in slashes / /.
Next, the implementation environment to which the embodiments of the present invention relate is briefly described.
Referring to fig. 1, a schematic structural diagram of a computer system provided in an exemplary embodiment of the present application is shown, the computer system including a terminal 120 and a server 140. The terminal 120 and the server 140 are connected to each other through a wired or wireless network.
Alternatively, the terminal 120 may be at least one of a laptop computer, a desktop computer, a smart phone, a tablet computer, a smart speaker, and a smart robot.
The terminal 120 includes a first memory and a first processor. The first memory stores a first program; the first program is called and executed by the first processor to implement the audio production method. The first memory may include, but is not limited to: Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
The first processor may be comprised of one or more integrated circuit chips. Alternatively, the first processor may be a general purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP). Alternatively, the first processor may implement the audio production method or the training method of the acoustic model provided by the present application.
The server 140 includes a second memory and a second processor. The second memory stores a second program, and the second program is called by the second processor to realize the audio production method provided by the application. Illustratively, the second memory has stored therein a second program; the second program is called and executed by the second processor to realize the audio production method. Optionally, the second memory may include, but is not limited to, the following: RAM, ROM, PROM, EPROM, EEPROM. Alternatively, the second processor may be a general purpose processor, such as a CPU or NP.
Illustratively, the audio production method provided by the application can be applied to scenes such as song recomposition, song production, song preview and the like.
The audio production method provided by the embodiment of the invention can be executed by the terminal, or the terminal and the server; the terminal has an audio production function, and further has an audio playing function. In some embodiments, the terminal may be a mobile phone, a tablet computer, a desktop computer, a portable computer, and the like, which is not limited in this embodiment of the present invention.
Fig. 2 is a flowchart illustrating an audio production method according to another exemplary embodiment, which is exemplified by being applied to a terminal, and the audio production method may include the following steps:
First, the client displays an audio editing interface of the first audio. The audio editing interface is used for editing the first audio. The client may also display an audio selection interface, which is used for determining the target audio to be edited; in this embodiment, the user chooses to edit the first audio.
Illustratively, the audio editing interface is used to present audio information of the first audio. The audio information includes at least one of the lyrics of the first audio, an MV (Music Video), a music track, a music score, a pitch map (for indicating the pitch of the main melody), a time domain signal, a frequency domain signal, audio producer information, related pictures (album cover, singer pictures, etc.), a play progress bar, and the audio duration.
Illustratively, the audio editing interface further comprises an editing control for editing the first audio. The editing control comprises at least one of a lyric editing control, a tone editing control, a music score editing control, a resetting control, a listening test control, a finishing control, a saving control, a tone color selecting control, a sharing control and a clipping control (selecting control).
Illustratively, the lyric editing control is used for editing lyrics of the first audio, and optionally, the lyric editing control is used for displaying a lyric editing interface after being triggered, wherein the lyric editing interface allows a user to input lyrics, and the client receives the lyrics input by the user at the lyric editing interface. Illustratively, the lyric editing interface may be a new user interface, or may refer to an editing interface located on an audio editing interface; the lyric editing interface comprises an editing box, and the editing box is used for receiving text information input by a user. For example, each lyric of the first audio corresponds to a lyric editing control, the client displays a lyric editing interface corresponding to the lyric after receiving a triggering operation on the lyric editing control, and for example, a section of lyric of the first audio corresponds to a lyric editing control, the client displays a lyric editing interface corresponding to a section of lyric after receiving a triggering operation on the lyric editing control, and a user can edit the lyric in the lyric editing interface. For example, there may be only one lyric editing control, and after receiving a trigger operation on the lyric editing control, the client displays a whole lyric editing interface of the first audio, where a user may edit all lyrics of the first audio. For example, the lyric editing control may be an invisible UI control arranged on the audio editing interface and bound with the lyrics, and the user triggers the lyric editing control corresponding to the lyrics to enter the lyric editing interface of the lyrics by clicking, double-clicking or long-pressing the lyrics or the area corresponding to the lyrics. The lyric editing control can also be an icon visible on the audio editing interface, and the user can trigger the lyric editing control to enter the lyric editing interface by clicking, double clicking or long pressing.
Illustratively, the pitch editing control is used to edit the pitch of the first audio, for example, to adjust the pitch of the human voice audio corresponding to a lyric in the first audio. Illustratively, the melody editing control is used for editing the main melody, the accompaniment melody, or the vocal tune of the first audio. The timbre selection control is used to select the timbre of the human voice used to generate the second audio. Illustratively, the client provides the user with the timbres of different virtual singers, from which the user can select a preferred timbre to generate a new song; for example, the virtual-singer timbres include a youth voice, a loli voice, a mature female voice, an uncle voice, and the like. Illustratively, the reset control is used for clearing the user's historical editing operations on the first audio so that the user can edit the first audio again. The audition control is used for playing the second audio obtained from the user's modifications to the first audio. The finishing control is used for finishing the audio editing and generating the second audio. The saving control is used for saving the generated second audio. The sharing control is used for sharing the second audio. The clip control is used to select an audio segment from the first audio and generate the second audio based on that segment.
Illustratively, the first audio is the audio of a song. Illustratively, the first audio includes at least one of vocal audio, main melody audio, and accompaniment audio. The human voice audio is an audio that sings the lyrics of the first audio. The main melody audio is an audio of a main melody tone of the first audio. Illustratively, the first audio is audio synthesized from at least two of human voice audio, main melody audio, and accompaniment audio. Illustratively, the first audio corresponds to at least one lyric, and the lyric refers to the character information sung by the first audio.
For example, fig. 3 shows an audio editing interface of the first audio, in which the lyrics 301 of the first audio are displayed together with a lyric editing control 302 corresponding to the first lyrics. When the client receives a trigger operation on the lyric editing control 302, a lyric editing interface 303 of the first lyrics is displayed, and the user may input the second lyrics on the lyric editing interface 303 to replace the first lyrics.
The client receives a lyric editing operation of a user on a lyric editing control for a first lyric, the user calls a lyric editing interface for the first lyric through the lyric editing control, and the client receives a second lyric input by the user in the lyric editing interface.
For example, the second lyrics may have the same number of words as the first lyrics or a different number of words. To guarantee the audio quality of the second audio generated from the second lyrics, the number of words of the second lyrics input by the user may be limited, for example to within five words more or fewer than the number of words of the first lyrics.
For example, the first lyrics may be at least one lyric in the first audio, and the second lyrics may be at least one lyric corresponding to the first lyrics.
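As a small illustration of the word-count constraint mentioned above, the following Python sketch checks that a replacement lyric stays within five characters of the original; the function name and the fixed limit of five are assumptions for illustration, since the description only gives five as an example limit.

```python
def validate_second_lyrics(first_lyrics: str, second_lyrics: str, max_delta: int = 5) -> bool:
    """Accept the second lyrics only if their length stays within max_delta
    characters of the first lyrics, as suggested in the example above."""
    return abs(len(second_lyrics) - len(first_lyrics)) <= max_delta

# Usage: reject an edit that drifts too far from the original line length.
print(validate_second_lyrics("ABCDE", "ABCDEFG"))      # True  (2 extra characters)
print(validate_second_lyrics("ABCDE", "ABCDEFGHIJK"))  # False (6 extra characters)
```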
Step 250: replacing the first lyrics in the first audio with the second lyrics to generate a second audio, wherein the second audio comprises human voice audio generated according to the second lyrics.
According to the second lyrics input by the user, the client replaces the human voice audio segment corresponding to the first lyrics in the first audio with the human voice audio segment of the second lyrics to obtain the second audio. That is, the second audio is synthesized from the accompaniment audio and/or main melody audio of the first audio and the human voice audio of the second lyrics.
Illustratively, the audio duration of the second audio is not greater than that of the first audio; that is, the duration of the second audio may be equal to or less than that of the first audio. If the duration of the second audio equals that of the first audio, the second audio is synthesized from all of the accompaniment audio or main melody audio of the first audio, the human voice audio corresponding to the lyrics of the first audio other than the first lyrics, and the human voice audio corresponding to the second lyrics. If the duration of the second audio is less than that of the first audio, the second audio is generated from an audio clip intercepted from the first audio; the intercepted clip has the same duration as the second audio, and the second audio is synthesized from the accompaniment audio or main melody audio of the clip, the human voice audio corresponding to the lyrics of the clip other than the first lyrics, and the human voice audio corresponding to the second lyrics.
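A minimal sketch of the replacement described above, assuming the human voice track is available as a numpy array at a known sample rate and that the start and end times of the first lyrics are already known; the helper names are illustrative and are not taken from the patent.

```python
import numpy as np

def replace_vocal_segment(vocal: np.ndarray, new_segment: np.ndarray,
                          start_s: float, end_s: float, sr: int = 44100) -> np.ndarray:
    """Cut the human-voice samples of the first lyrics out of the original vocal
    track and splice in the newly generated segment for the second lyrics."""
    start, end = int(start_s * sr), int(end_s * sr)
    return np.concatenate([vocal[:start], new_segment, vocal[end:]])

def mix_second_audio(template: np.ndarray, edited_vocal: np.ndarray) -> np.ndarray:
    """Overlay the edited vocal track on the accompaniment/main-melody template;
    both tracks are assumed to be time-aligned."""
    n = min(len(template), len(edited_vocal))
    return template[:n] + edited_vocal[:n]
```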
For example, the step of generating the second audio may be performed by a client on the terminal, or may be performed by a server, and the server sends the second audio to the client after generating the second audio.
Illustratively, the human voice audio of the second lyrics is computer-generated audio according to a user-selected timbre. As shown in fig. 4, step 250 is preceded by step 240, and step 250 further comprises step 251.
Step 240: acquiring a target timbre, wherein the target timbre is used for generating the human voice audio.
The target timbre is used for enabling the client to generate the voice audio according to the voice characteristic of the target timbre.
Illustratively, the target timbre may be a default timbre of the client or a timbre selected by the user from a plurality of candidate timbres. Illustratively, one timbre represents one voice characteristic, and the client may denote different timbres by different virtual singers. For example, virtual singer A corresponds to the voice of a child, and virtual singer B corresponds to the voice of an adult male.
For example, the client may display a timbre selection interface, the timbre selection interface including at least one candidate timbre and a selection control; and the client responds to the received selection operation for triggering the selection control, and determines a target tone color from the candidate tone colors according to the selection operation. For example, the client may generate the second audio in real time according to the tone selected by the user, and play the second audio.
For example, fig. 5 shows a timbre selection interface that includes two candidate timbres, virtual singer A and virtual singer B, and a selection control 401; the user can select one of them as the target timbre for generating the second audio. For example, the currently selected timbre is virtual singer A, and the client is playing the second audio generated with the timbre of virtual singer A. If the user wants to generate the second audio with the timbre of virtual singer B, the user can tap virtual singer B and then tap the selection control 401 to switch the target timbre to virtual singer B.
For example, the human voice audio of the second audio may be generated entirely with the target timbre, or only the second lyrics (or only the segment containing the second lyrics) may be generated with the target timbre. That is, the human voice audio of the second audio may contain one timbre (the target timbre) or two timbres (the target timbre and the original timbre of the human voice audio of the first audio).
Then, the second audio comprises one of the following: the audio duration is less than that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio; or the audio duration is equal to that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio; or the audio duration is less than that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre; or the audio duration is equal to that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre.
Step 251: replacing the first lyrics in the first audio with the second lyrics according to the target timbre to generate the second audio.
Illustratively, the client replaces the vocal audio of the first lyrics in the first audio according to the vocal audio of the second lyrics generated by using the target timbre, and synthesizes the second audio by using the accompaniment audio or/and the main melody audio of the first audio.
For example, given a method for generating human voice audio using a target tone color, as shown in fig. 6, step 251 further includes steps 2511 to 2513.
In step 2511, the client generates the human voice audio of the second lyrics using the target timbre, the phonemes of the second lyrics, and the notes corresponding to the first lyrics in the first audio. If the human voice audio of the other lyrics in the second audio is also to be generated with the target timbre, it is generated using the target timbre, the phonemes of those lyrics, and the notes corresponding to those lyrics in the first audio. If the human voice audio of the other lyrics reuses the human voice audio of the first audio, the human voice audio of the first audio can be cut and then spliced with the human voice audio of the second lyrics to obtain the complete human voice audio of the second audio.
Illustratively, the synthesis of the human voice audio further uses the position information of the phonemes, the position information of the notes, and the like. The position information of the phoneme is used to label the position of the phoneme in the audio. For example, the first phoneme occupies the positions of the 1 st frame to the 100 th frame in the audio, the second phoneme occupies the positions of the 101 st frame to the 200 th frame in the audio, and so on. The position information of the musical note is used to mark the position of the musical note in the audio, for example, the position of the first musical note in the audio from the 50 th frame to the 200 th frame. Illustratively, according to the above-mentioned position relationship, information of the phoneme and the note corresponding to each frame of the audio can be obtained.
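The frame-level position labels described above can be represented, for example, as (label, start_frame, end_frame) spans expanded into one label per frame; the sketch below is an assumed data layout for illustration, not the patent's actual format.

```python
from typing import List, Tuple

def expand_to_frames(spans: List[Tuple[str, int, int]], n_frames: int, fill: str = "sil") -> List[str]:
    """Expand (label, start_frame, end_frame) spans into one label per frame,
    e.g. a phoneme occupying frames 1-100 and the next occupying frames 101-200."""
    frames = [fill] * n_frames
    for label, start, end in spans:
        for i in range(start - 1, min(end, n_frames)):  # spans are 1-indexed and inclusive
            frames[i] = label
    return frames

phoneme_frames = expand_to_frames([("a", 1, 100), ("i", 101, 200)], n_frames=200)
note_frames    = expand_to_frames([("C4", 50, 200)], n_frames=200)
# Each frame now carries its phoneme and note, as described above.
print(phoneme_frames[0], phoneme_frames[120], note_frames[60])  # a i C4
```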
In step 2512, a template audio of the first audio is obtained, wherein the template audio includes at least one of an accompaniment audio and a main melody audio.
Illustratively, the client obtains a template audio of the first audio, where the template audio is an audio of the first audio other than the human voice audio. Illustratively, the template audio includes at least one of accompaniment audio and main melody audio.
Illustratively, for each song, the template audio needs to be made in advance. The template audio production comprises the following steps: 1. Acquiring the accompaniment audio of the song; illustratively, the accompaniment can be separated from the audio of the original song, or the accompaniment audio of the song can be obtained directly. Illustratively, only a portion of the song's accompaniment may be clipped, for example the chorus of the song. 2. Manual transcription: a MIDI (Musical Instrument Digital Interface) file is produced by hand and used to make the main melody audio. 3. Making the template audio: the accompaniment audio and the main melody audio are aligned and synthesized to obtain the template audio.
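Step 3 above (aligning and mixing the accompaniment with the main melody rendered from the MIDI file) can be sketched as follows; accompaniment separation and MIDI rendering are assumed to have happened elsewhere, and the offset parameter is only an illustrative way to express the alignment.

```python
import numpy as np

def make_template_audio(accompaniment: np.ndarray, main_melody: np.ndarray,
                        offset_samples: int = 0) -> np.ndarray:
    """Align the main-melody audio (rendered from the hand-made MIDI transcription)
    with the accompaniment and mix them into the template audio."""
    melody = np.pad(main_melody, (offset_samples, 0))      # shift melody to the alignment point
    n = max(len(accompaniment), len(melody))
    acc = np.pad(accompaniment, (0, n - len(accompaniment)))
    mel = np.pad(melody, (0, n - len(melody)))
    return acc + mel
```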
In step 2513, the client synthesizes the template audio of the first audio with the human voice audio containing the second lyrics to obtain the second audio.
The second audio has the same melody and tune as the first audio but different lyrics.
After obtaining the second audio, the client may also play the second audio. As shown in fig. 4, step 260 and step 270 are also included after step 250.
Step 260: displaying an audio playing interface of the second audio, wherein the audio playing interface comprises a playing control.
Illustratively, the audio playback interface and the audio editing interface may be the same interface. That is, the client plays the second audio immediately after generating the second audio, so that the user can preview the generated second audio in the audio editing interface.
Illustratively, the audio playing interface and the audio editing interface may also be different interfaces. That is, after the client generates the second audio, the user may click the completion control to jump to the audio playing interface and play, store, or share the second audio. In step 270, in response to receiving a playing operation triggering the playing control, the client plays the second audio.
In summary, in the method provided by this embodiment, the user's change to the song's lyrics is received on the audio editing interface, and the modified song is generated from the user's changed lyrics and the original song. The user can thus modify the lyrics of a song with one tap to quickly generate a new song, which simplifies the operation steps for generating audio and improves audio editing efficiency.
In the method provided by this embodiment, after the user changes the lyrics, the user also selects a virtual singer; the lyrics adapted by the user are turned into human voice audio using the voice of that virtual singer, and the human voice audio of the original lyrics in the original song is replaced with it to obtain a new song. The user can select different virtual singers to sing the adapted lyrics and thereby obtain different new songs, which enriches the user's ability to edit songs, simplifies the steps for generating audio, and improves editing efficiency.
In the method provided by this embodiment, the timbre of the virtual singer, the phonemes of the lyrics adapted by the user, and the notes of the original lyrics are first used to generate human voice audio in which the adapted lyrics are sung to the tune of the original lyrics; the accompaniment of the original song and the newly generated human voice audio are then synthesized into a new song. This allows the song's lyrics to be replaced with one tap to generate a new song, simplifying the user's audio-generation operations.
In the method provided by this embodiment, the new song may be a part of the original song or the whole song, and it may use the virtual singer voice designated by the user only for the changed lyric portion, or use the designated virtual singer voice for the entire song.
In the method provided by this embodiment, after the user selects a virtual singer, a new song is generated according to the selected virtual singer and then played immediately, so the user can preview in real time the song generated with the current virtual singer and, if not satisfied with the result, switch virtual singers in real time.
According to the method provided by the embodiment, after the new song is generated, a preview playing interface of the new song can be displayed, so that the user can preview the generated new song.
Illustratively, the present embodiment presents a method for deriving human voice audio using a neural network model. Fig. 7 is a flowchart illustrating an audio production method according to another exemplary embodiment, which is exemplified by applying the audio production method to a terminal, and the step 2511 further includes steps 2511-1 to 2511-2.
Step 2511-1, the timbre identification of the target timbre, the phoneme of the second lyrics and the note corresponding to the first lyrics in the first audio are input into the acoustic model to obtain the Mel frequency spectrum.
Illustratively, the acoustic model is a deep neural network acoustic model. The acoustic model is used to generate a mel-frequency spectrum (mel-frequency spectrum) from the input two-dimensional text information. Illustratively, the acoustic model is a neural network model that employs a long short-Term Memory network (LSTM) structure.
The mel frequency spectrum describes the frequency-domain features of audio using the mel scale. It is obtained by framing and windowing the time-domain waveform of the audio signal, applying a Fourier transform to obtain the frequency-domain signal of each frame, stacking the frequency-domain signals of all frames to obtain the spectrogram of the audio signal, and mapping the frequencies in the spectrogram to the mel scale to obtain the mel spectrum of the audio signal. The audio signal can also be restored from its mel spectrum. The mel scale was named by Stevens, Volkmann, and Newman in 1937. The unit of frequency is hertz (Hz), and the audible frequency range of the human ear is 20-20000 Hz, but the human ear's perception of frequency in Hz is not linear. For example, if one adapts to a 1000 Hz tone and the frequency is then raised to 2000 Hz, the human ear perceives only a small increase in pitch, not a doubling. If the ordinary frequency scale is converted into the mel scale, the human ear's perception of frequency becomes approximately linear: on the mel scale, if the mel frequencies of two pieces of speech differ by a factor of two, the pitch perceived by the human ear also differs by roughly a factor of two. The mapping between the mel scale and the frequency scale is:
mel(f)=2595*log10(1+f/700)
where mel (f) is the mel scale and f is the frequency.
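The formula above can be applied directly; the sketch below implements it together with its inverse, which is the relationship that a mel filterbank or mel-spectrum inversion relies on. The function names are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700), as in the formula above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

# On the mel scale, equal steps correspond to roughly equal perceived pitch steps.
print(hz_to_mel(1000.0))            # ~1000 mel, by construction of the scale
print(mel_to_hz(hz_to_mel(2000.0))) # ~2000 Hz, round trip
```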
Illustratively, the timbre identification of the target timbre, the phonemes of the second lyrics, the position information of the phonemes, the notes corresponding to the first lyrics in the first audio, and the position information of the notes are input into the acoustic model to obtain the mel frequency spectrum.
The position information of the phonemes of the second lyrics may be determined from the position information of the phonemes of the first lyrics. When the second lyrics have the same number of words as the first lyrics, the word slots (positions of the individual words) of the second lyrics can be taken directly from the word slots of the first lyrics, and the phoneme positions are then determined from the word slots. When the number of words differs, several word-slot layouts can be designed for the first lyrics in advance: for example, if the first lyrics originally have 5 words and therefore 5 word slots, a 6-slot layout and an 8-slot layout can be prepared beforehand, and the words of the second lyrics are placed into the preset layout in order, which determines the word positions and hence the phoneme positions. Alternatively, the word of the first lyrics occupying the longest duration can be selected, its duration divided equally according to the number of words by which the second lyrics exceed the first lyrics, and the resulting slots inserted among the word slots of the first lyrics to obtain a layout matching the word count of the second lyrics; the words of the second lyrics are then filled into the slots in order, which determines the word positions and hence the phoneme positions. For example, the first lyrics contain the three words "ABC" and the second lyrics contain the five words "12345", where A of the first lyrics lasts 3 seconds and B and C last 1 second each. The 3-second duration of A is divided equally according to the word-count difference between the two lyrics, giving one slot for the first second, one for the second second, and one for the third second; this yields two additional slots, which together with the original slots of the first lyrics gives five slots. The five words of the second lyrics are filled into the five slots, determining the word positions and thereby the phoneme positions.
Alternatively, the same method may be applied directly at the phoneme level: the positions of the phonemes of the second lyrics are determined from the positions of the phonemes of the first lyrics, and when the phoneme counts differ, new slots are created among the original phoneme positions of the first lyrics so that the extra phonemes of the second lyrics can be filled into them.
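The slot-splitting rule from the "ABC" to "12345" example above can be sketched as follows; representing word slots as (start, end) times and always splitting the longest slot are assumptions, since the description only works through one example.

```python
from typing import List, Tuple

def redistribute_word_slots(slots: List[Tuple[float, float]], n_new_words: int) -> List[Tuple[float, float]]:
    """Produce n_new_words (start, end) slots from the original word slots by
    splitting the longest original slot evenly, as in the 'ABC' -> '12345' example.
    Only the case of more new words than original slots is handled here."""
    extra = n_new_words - len(slots)
    if extra <= 0:
        return slots[:n_new_words]
    # Find the original word occupying the most time and split its duration evenly.
    i = max(range(len(slots)), key=lambda k: slots[k][1] - slots[k][0])
    start, end = slots[i]
    step = (end - start) / (extra + 1)
    pieces = [(start + k * step, start + (k + 1) * step) for k in range(extra + 1)]
    return slots[:i] + pieces + slots[i + 1:]

# 'A' lasts 3 s, 'B' and 'C' 1 s each; five slots are produced for '12345'.
print(redistribute_word_slots([(0.0, 3.0), (3.0, 4.0), (4.0, 5.0)], 5))
# [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
```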
Illustratively, the acoustic models respectively correspond to different timbres, and the client calls the acoustic model corresponding to the timbre identification according to the inputted timbre identification of the target timbre to obtain the mel-frequency spectrum.
Step 2511-2, the vocoder is invoked to convert the mel spectrum to human audio.
The client calls the vocoder to convert the Mel frequency spectrum to obtain the human voice audio. Illustratively, the vocoder may use a WaveRNN vocoder or a WaveGlow vocoder.
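A minimal sketch of the two-stage pipeline of steps 2511-1 and 2511-2, written with PyTorch; the specific architecture (embeddings for the timbre identification, phonemes, and notes feeding a bidirectional LSTM projected to 80 mel bins) is an assumption standing in for the unspecified deep neural network, and the vocoder call is left as a placeholder for a trained WaveRNN or WaveGlow model.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Per-frame (timbre id, phoneme id, note id) -> 80-bin mel spectrum.
    A minimal LSTM-based stand-in for the acoustic model described above."""
    def __init__(self, n_phonemes=100, n_notes=128, n_timbres=8, hidden=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, 64)
        self.note_emb = nn.Embedding(n_notes, 32)
        self.timbre_emb = nn.Embedding(n_timbres, 16)
        self.lstm = nn.LSTM(64 + 32 + 16, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, phonemes, notes, timbre_id):
        # phonemes, notes: (batch, frames); timbre_id: (batch,)
        t = self.timbre_emb(timbre_id).unsqueeze(1).expand(-1, phonemes.size(1), -1)
        x = torch.cat([self.phoneme_emb(phonemes), self.note_emb(notes), t], dim=-1)
        out, _ = self.lstm(x)
        return self.proj(out)                       # (batch, frames, n_mels)

def vocode(mel: torch.Tensor) -> torch.Tensor:
    """Placeholder for step 2511-2: a neural vocoder such as WaveRNN or WaveGlow
    would convert the mel spectrum into a waveform here."""
    raise NotImplementedError("plug in a trained WaveRNN/WaveGlow vocoder")
```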
In summary, in the method provided by this embodiment, to obtain the human voice audio, a mel spectrum of the human voice is first obtained using an acoustic model; the acoustic model is a deep neural network model used to generate the mel spectrum of the audio from the input two-dimensional text information. After the mel spectrum is obtained, it is converted into the human voice audio by a vocoder, yielding audio in which the designated virtual singer's voice sings the lyrics.
Illustratively, the present embodiment presents a method of training an acoustic model. Fig. 8 is a flowchart illustrating an acoustic model training method according to another exemplary embodiment, which is illustrated as a method applied to a terminal and includes the following steps.
Step 310: obtaining training data.
Illustratively, the training audio includes singing audio (a cappella singing data), and the training lyrics are the lyrics of the singing audio. Illustratively, the client acquires the singing data as training data; the singing data contains only human voice audio. The singing data is then manually labeled with phonemes, notes, phoneme position information, note position information, a timbre identification, and the like, and the mel frequency spectrum of the singing data is generated from the singing data, so that one piece of singing data yields a corresponding set of training data (phonemes, notes, phoneme position information, note position information, timbre identification, mel frequency spectrum). Illustratively, the client acquires multiple sets of training data corresponding to multiple pieces of singing data.
Step 320: training the initial model according to the training data to obtain the acoustic model.
The client takes the mel frequency spectrum as the expected output, inputs the phonemes, the notes, the phoneme position information, and the note position information into the initial model, and trains the initial model to obtain the acoustic model.
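A sketch of the training step above, reusing the assumed AcousticModel from the earlier sketch; treating the mel spectrum as a regression target with an L1 loss is a common choice and an assumption here, since the description does not name a loss function.

```python
import torch
import torch.nn as nn

def train_acoustic_model(model, data_loader, epochs=10, lr=1e-3):
    """Train the initial model so that, given per-frame phonemes, notes and a
    timbre id, it reproduces the mel spectrum extracted from the singing data
    (the expected output)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    for epoch in range(epochs):
        for phonemes, notes, timbre_id, mel_target in data_loader:
            mel_pred = model(phonemes, notes, timbre_id)
            loss = criterion(mel_pred, mel_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```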
In summary, in the method provided by this embodiment, by training the acoustic model, the client can use the acoustic model to obtain the mel spectrum and obtain the human voice audio from the mel spectrum, thereby simplifying the user's audio editing steps.
Fig. 9 is a schematic diagram illustrating a structure of an audio producing apparatus, which may be implemented by software, hardware, or a combination of both, according to an exemplary embodiment. The audio producing apparatus may include:
a display module 901, configured to display an audio editing interface of a first audio, where the audio editing interface includes at least one lyric of the first audio and a lyric editing control, and the at least one lyric includes first lyrics;
an interaction module 902, configured to receive a lyric editing operation on the lyric editing control for the first lyric, where the lyric editing operation includes inputting a second lyric;
a generating module 903, configured to replace the first lyrics in the first audio with the second lyrics, and generate a second audio, where the second audio includes a human voice audio generated according to the second lyrics.
Optionally, the apparatus further comprises:
an obtaining module 904, configured to obtain a target tone, where the target tone is used to generate the human voice audio;
the generating module 903 is further configured to replace the first lyrics in the first audio with the second lyrics according to the target timbre, and generate the second audio.
Optionally, the generating module 903 is further configured to generate the human voice audio containing the second lyrics according to the target timbre, a phoneme of the second lyrics, and a note corresponding to the first lyrics in the first audio;
the obtaining module 904 is further configured to obtain a template audio of the first audio, where the template audio includes at least one of an accompaniment audio and a main melody audio;
the generating module 903 is further configured to generate the second audio according to the template audio and the human voice audio.
Optionally, the generating module 903 includes:
the model submodule 905 is configured to input the timbre identification of the target timbre, the phoneme of the second lyrics, and the musical note corresponding to the first lyrics in the first audio into an acoustic model to obtain a mel frequency spectrum;
a vocoder submodule 906 for invoking a vocoder to convert the mel spectrum into the human voice audio.
Optionally, the second audio comprises:
the audio duration is less than that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
the audio duration is equal to that of the first audio, the human voice audio segment of the second lyrics is generated according to the target timbre, and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
the audio duration is less than that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre;
or,
the audio duration is equal to that of the first audio, and the human voice audio of all the lyrics is generated according to the target timbre.
Optionally, the apparatus further comprises:
the obtaining module 904 is further configured to obtain training data, where the training data includes: at least one of a phoneme of the training lyrics, a note of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, a timbre identification of the training audio, and a mel frequency spectrum of the training audio;
a training module 907, configured to train an initial model according to the training data to obtain the acoustic model.
Optionally, the apparatus further comprises:
the display module 901 is further configured to display an audio playing interface of the second audio, where the audio playing interface includes a playing control;
the interaction module 902 is further configured to receive a play operation that triggers the play control;
a playing module 908, configured to play the second audio in response to receiving a playing operation that triggers the playing control.
Optionally, the apparatus further comprises:
the display module 901 is further configured to display a tone selection interface, where the tone selection interface includes at least one candidate tone and a selection control;
the interaction module 902 is further configured to receive a selection operation that triggers the selection control;
the obtaining module 904 is further configured to, in response to receiving a selection operation that triggers the selection control, determine the target tone color from the candidate tone colors according to the selection operation.
A playing module 908, configured to play the second audio.
It should be noted that: in the audio production apparatus provided in the foregoing embodiment, when the audio production method is implemented, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio production apparatus and the audio production method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
Fig. 10 shows a block diagram of a terminal 1000 according to an exemplary embodiment of the present invention. The terminal 1000 can be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 1000 can also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
In some embodiments, terminal 1000 can also optionally include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002 and peripheral interface 1003 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, touch screen display 1005, camera assembly 1006, audio circuitry 1007, positioning assembly 1008, and power supply 1009.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1004 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, disposed on the front panel of the terminal 1000; in other embodiments, there may be at least two display screens 1005, respectively disposed on different surfaces of the terminal 1000 or in a folded design; in still other embodiments, the display screen 1005 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 1000. The display screen 1005 may even be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The display screen 1005 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1001 for processing or to the radio frequency circuit 1004 for voice communication. For stereo collection or noise reduction, multiple microphones may be provided, each at a different location of the terminal 1000. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional membrane speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic location of the terminal 1000 to implement navigation or LBS (Location Based Service). The positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
In some embodiments, terminal 1000 can also include one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyro sensor 1012, pressure sensor 1013, fingerprint sensor 1014, optical sensor 1015, and proximity sensor 1016.
Acceleration sensor 1011 can detect acceleration magnitudes on three coordinate axes of a coordinate system established with terminal 1000. For example, the acceleration sensor 1011 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1001 may control the touch display screen 1005 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1011. The acceleration sensor 1011 may also be used for acquisition of motion data of a game or a user.
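Purely as an illustrative sketch of the orientation decision described above (the axis convention and the comparison rule below are assumptions for illustration, not taken from this disclosure):

```python
def choose_orientation(gx: float, gy: float, gz: float) -> str:
    """Pick a UI orientation from the gravity components (m/s^2) reported by
    an accelerometer such as acceleration sensor 1011.

    Assumption: the y axis runs along the long edge of the screen and the
    x axis along the short edge; the larger gravity component indicates
    which edge currently points "down".
    """
    if abs(gy) >= abs(gx):
        return "portrait"
    return "landscape"

# Example: phone held upright -> gravity mostly on the y axis.
print(choose_orientation(0.4, 9.7, 0.5))   # portrait
print(choose_orientation(9.6, 0.8, 0.3))   # landscape
```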
The gyro sensor 1012 may detect a body direction and a rotation angle of the terminal 1000, and the gyro sensor 1012 and the acceleration sensor 1011 may cooperate to acquire a 3D motion of the user on the terminal 1000. From the data collected by the gyro sensor 1012, the processor 1001 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 1013 may be disposed on a side frame of the terminal 1000 and/or on a lower layer of the touch display screen 1005. When the pressure sensor 1013 is disposed on a side frame of the terminal 1000, a grip signal of the user on the terminal 1000 can be detected, and the processor 1001 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is disposed at a lower layer of the touch display screen 1005, the processor 1001 controls an operability control on the UI according to a pressure operation of the user on the touch display screen 1005. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1014 is used to collect a fingerprint of the user, and the processor 1001 identifies the user according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying, and changing settings, etc. Fingerprint sensor 1014 can be disposed on the front, back, or side of terminal 1000. When a physical key or vendor Logo is provided on terminal 1000, fingerprint sensor 1014 can be integrated with the physical key or vendor Logo.
The optical sensor 1015 is used to collect the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display screen 1005 according to the intensity of the ambient light collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is turned down. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the intensity of the ambient light collected by the optical sensor 1015.
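As a minimal sketch of the brightness adjustment described above (the linear mapping and all parameter values are assumptions for illustration only; real devices typically use tuned, non-linear curves):

```python
def adjust_brightness(ambient_lux: float,
                      min_level: float = 0.1,
                      max_level: float = 1.0,
                      max_lux: float = 1000.0) -> float:
    """Map the ambient light intensity (lux) reported by optical sensor 1015
    to a display brightness level in [min_level, max_level]."""
    ratio = max(0.0, min(ambient_lux / max_lux, 1.0))
    return min_level + ratio * (max_level - min_level)

print(adjust_brightness(50))    # dim environment  -> low brightness
print(adjust_brightness(900))   # bright environment -> high brightness
```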
The proximity sensor 1016, also known as a distance sensor, is typically disposed on the front panel of the terminal 1000. The proximity sensor 1016 is used to collect the distance between the user and the front face of the terminal 1000. In one embodiment, when the proximity sensor 1016 detects that the distance between the user and the front face of the terminal 1000 gradually decreases, the processor 1001 controls the touch display screen 1005 to switch from a screen-on state to a screen-off state; when the proximity sensor 1016 detects that the distance between the user and the front face of the terminal 1000 gradually increases, the processor 1001 controls the touch display screen 1005 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in FIG. 10 is not intended to be limiting and that terminal 1000 can include more or fewer components than shown, or some components can be combined, or a different arrangement of components can be employed.
An embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, and the memory stores instructions that, when executed by the processor, implement the audio production method provided by the above embodiments.
Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the audio production method provided by the above embodiments.
Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the audio production method provided by the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an exemplary embodiment of the present invention and is not intended to limit the invention; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (14)
1. A method of audio production, the method comprising:
displaying an audio editing interface of a first audio, the audio editing interface comprising at least one lyric of the first audio and a lyric editing control, the at least one lyric comprising first lyrics;
receiving a lyric editing operation on the lyric editing control for the first lyrics, wherein the lyric editing operation comprises inputting second lyrics;
replacing the first lyrics in the first audio with the second lyrics to generate a second audio, wherein the second audio comprises human voice audio generated according to the second lyrics.
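Purely as an illustrative sketch of the editing flow in claim 1 (the data structure and the helper names `Song`, `synthesize_vocal` and `edit_lyric` are hypothetical; the synthesis itself is detailed in claims 3 and 4 below):

```python
from dataclasses import dataclass, replace as dc_replace

@dataclass
class Song:
    lyrics: list[str]      # one string per lyric line
    vocal: bytes           # rendered human voice audio (placeholder)
    accompaniment: bytes   # template audio, see claim 3

def synthesize_vocal(lyrics: list[str]) -> bytes:
    # Placeholder for the acoustic-model + vocoder pipeline of claim 4.
    return " / ".join(lyrics).encode("utf-8")

def edit_lyric(song: Song, line_index: int, second_lyrics: str) -> Song:
    """Replace the first lyrics (the line at line_index) with second_lyrics
    and regenerate the vocal track, yielding the 'second audio'."""
    new_lyrics = list(song.lyrics)
    new_lyrics[line_index] = second_lyrics
    return dc_replace(song, lyrics=new_lyrics, vocal=synthesize_vocal(new_lyrics))

first_audio = Song(["line one", "line two"], b"", b"accompaniment")
second_audio = edit_lyric(first_audio, 1, "brand new words")
print(second_audio.lyrics)   # ['line one', 'brand new words']
```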
2. The method of claim 1, wherein the method further comprises:
acquiring a target timbre, wherein the target timbre is used for generating the human voice audio;
the replacing the first lyrics in the first audio with the second lyrics to generate a second audio, comprising:
and replacing the first lyrics in the first audio with the second lyrics according to the target timbre to generate the second audio.
3. The method of claim 2, wherein the replacing the first lyrics in the first audio with the second lyrics according to the target timbre, generating the second audio, comprises:
generating the human voice audio containing the second lyrics according to the target timbre, the phoneme of the second lyrics and the notes corresponding to the first lyrics in the first audio;
acquiring a template audio of the first audio, wherein the template audio comprises at least one of an accompaniment audio and a main melody audio;
and generating the second audio according to the template audio and the human voice audio.
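A minimal sketch of the mixing step in claim 3, assuming both tracks are float32 waveforms in [-1, 1] at the same sample rate; the simple additive mix and the stand-in signals are illustrations only, not the claimed implementation:

```python
import numpy as np

def mix_tracks(vocal: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Combine the synthesized human voice audio with the template audio
    (accompaniment and/or main melody) into the second audio.
    Shorter tracks are zero-padded to the length of the longer one."""
    n = max(len(vocal), len(template))
    vocal = np.pad(vocal, (0, n - len(vocal)))
    template = np.pad(template, (0, n - len(template)))
    return np.clip(vocal + template, -1.0, 1.0).astype(np.float32)

sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
vocal = 0.3 * np.sin(2 * np.pi * 220 * t)            # stand-in human voice audio
accompaniment = 0.3 * np.sin(2 * np.pi * 440 * t)    # stand-in template audio
second_audio = mix_tracks(vocal, accompaniment)
```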
4. The method of claim 3, wherein generating the human voice audio including the second lyrics based on the target timbre, the phoneme of the second lyrics, and the note corresponding to the first lyrics in the first audio comprises:
inputting the timbre identification of the target timbre, the phoneme of the second lyrics and the notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a Mel frequency spectrum;
and calling a vocoder to convert the Mel frequency spectrum into the human voice audio.
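A hedged sketch of the claim 4 pipeline, with stand-in functions in place of a trained acoustic model and a real vocoder (the mel dimensions, hop size and placeholder outputs are assumptions made purely for illustration):

```python
import numpy as np

N_MELS, HOP = 80, 256

def acoustic_model(timbre_id: int, phonemes: list[str], notes: list[int]) -> np.ndarray:
    """Stand-in for the acoustic model of claim 4: maps the timbre
    identification, the phonemes of the second lyrics and the notes taken
    from the first audio to a mel spectrogram (n_mels x frames).
    The random output is a placeholder, not a real prediction."""
    frames = 20 * max(len(phonemes), len(notes))
    rng = np.random.default_rng(timbre_id)
    return rng.random((N_MELS, frames)).astype(np.float32)

def vocoder(mel: np.ndarray, sample_rate: int = 24000) -> np.ndarray:
    """Stand-in vocoder: converts a mel spectrogram into a waveform.
    A real system would use a trained neural vocoder."""
    return np.zeros(mel.shape[1] * HOP, dtype=np.float32)  # silent placeholder

mel = acoustic_model(timbre_id=3, phonemes=["n", "i", "h", "ao"], notes=[60, 62, 64, 65])
human_voice_audio = vocoder(mel)
print(mel.shape, human_voice_audio.shape)
```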
5. The method of any of claims 2 to 4, wherein the second audio comprises:
audio whose duration is less than that of the first audio, in which the human voice audio segment of the second lyrics is generated according to the target timbre and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
audio whose duration is equal to that of the first audio, in which the human voice audio segment of the second lyrics is generated according to the target timbre and the human voice audio segments of the lyrics other than the second lyrics use the audio of the original timbre of the first audio;
or,
audio whose duration is less than that of the first audio, in which the human voice audio of all the lyrics is generated according to the target timbre;
or,
audio whose duration is equal to that of the first audio, in which the human voice audio of all the lyrics is generated according to the target timbre.
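An illustrative sketch of how the claim 5 variants could be assembled from per-lyric vocal segments, assuming the segment boundaries are already known from the lyric timestamps (all names, shapes and lengths below are hypothetical):

```python
import numpy as np

def assemble_vocal(original_segments: list[np.ndarray],
                   regenerated_segments: dict[int, np.ndarray]) -> np.ndarray:
    """Build the human voice track of the second audio from per-lyric segments.

    Passing only the edited lyric in regenerated_segments keeps the
    original-timbre audio for the other lyrics (first two variants of claim 5);
    passing every index regenerates all lyrics with the target timbre
    (last two variants)."""
    out = [regenerated_segments.get(i, seg) for i, seg in enumerate(original_segments)]
    return np.concatenate(out)

# Original vocal split into three lyric segments (placeholders).
orig = [np.zeros(1000, dtype=np.float32) for _ in range(3)]
new_seg = np.ones(1200, dtype=np.float32)        # target-timbre segment for lyric 1
partial = assemble_vocal(orig, {1: new_seg})     # only lyric 1 replaced
full = assemble_vocal(orig, {0: new_seg, 1: new_seg, 2: new_seg})
print(partial.shape, full.shape)                 # durations may differ from the first audio
```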
6. The method of claim 4, wherein the method further comprises:
obtaining training data, the training data comprising: at least one of a phoneme of the training lyrics, a note of the training lyrics, phoneme position information of the training lyrics, note position information of the training lyrics, a timbre identification of the training audio, and a mel frequency spectrum of the training audio;
and training an initial model according to the training data to obtain the acoustic model.
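A minimal, hedged training sketch for claim 6 using PyTorch, with fabricated random tensors in place of real training data; the toy network is a placeholder for the initial model, not the actual architecture disclosed here:

```python
import torch
from torch import nn

class InitialModel(nn.Module):
    """Toy stand-in for the initial model of claim 6: maps a per-frame
    feature vector (phoneme, note, position and timbre information,
    assumed here to be already encoded as floats) to an 80-bin mel frame."""
    def __init__(self, feat_dim: int = 8, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_mels))

    def forward(self, x):
        return self.net(x)

# Random tensors purely to make the loop runnable; real training data would be
# the phoneme/note/position/timbre features and mel spectra of the training audio.
features = torch.randn(256, 8)
target_mels = torch.randn(256, 80)

model = InitialModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(features), target_mels)
    loss.backward()
    optimizer.step()
```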
7. The method of any of claims 1 to 4, further comprising:
displaying an audio playing interface of the second audio, wherein the audio playing interface comprises a playing control;
and in response to receiving a playing operation that triggers the playing control, playing the second audio.
8. The method of any one of claims 2 to 4, wherein the obtaining the target timbre comprises:
displaying a timbre selection interface, the timbre selection interface comprising at least one candidate timbre and a selection control;
in response to receiving a selection operation that triggers the selection control, determining the target timbre from the candidate timbres according to the selection operation;
after the replacing the first lyrics in the first audio with the second lyrics according to the target timbre and generating the second audio containing the second lyrics, the method further comprises:
and playing the second audio.
9. An audio producing apparatus, characterized in that the apparatus comprises:
the display module is used for displaying an audio editing interface of a first audio, the audio editing interface comprises at least one lyric of the first audio and a lyric editing control, and the at least one lyric comprises first lyrics;
the interaction module is used for receiving a lyric editing operation on the lyric editing control for the first lyrics, wherein the lyric editing operation comprises inputting second lyrics;
the generating module is used for replacing the first lyrics in the first audio with the second lyrics to generate a second audio, and the second audio comprises a human voice audio generated according to the second lyrics.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the acquisition module is used for acquiring a target timbre, wherein the target timbre is used for generating the human voice audio;
the generating module is further configured to replace the first lyrics in the first audio with the second lyrics according to the target timbre, and generate the second audio.
11. The apparatus of claim 10, wherein the generating module is further configured to generate the human voice audio including the second lyrics according to the target timbre, a phoneme of the second lyrics, a note corresponding to the first lyrics in the first audio;
the obtaining module is further configured to obtain a template audio of the first audio, where the template audio includes at least one of an accompaniment audio and a main melody audio;
the generating module is further configured to generate the second audio according to the template audio and the human voice audio.
12. The apparatus of claim 11, wherein the generating module comprises:
the model submodule is used for inputting the timbre identification of the target timbre, the phoneme of the second lyrics and the notes corresponding to the first lyrics in the first audio into an acoustic model to obtain a Mel frequency spectrum;
a vocoder submodule for invoking a vocoder to convert the Mel spectrum to the human voice audio.
13. A computer device, the computer device comprising: a processor and a memory, the memory having stored thereon instructions, wherein the instructions, when executed by the processor, implement the steps of any of the methods of claims 1-8.
14. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of any of the methods of claims 1-8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010753002.3A CN111899706B (en) | 2020-07-30 | 2020-07-30 | Audio production method, device, equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111899706A (en) | 2020-11-06 |
| CN111899706B CN111899706B (en) | 2024-08-23 |
Family
ID=73182697
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010753002.3A Active CN111899706B (en) | 2020-07-30 | 2020-07-30 | Audio production method, device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111899706B (en) |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2002182675A (en) * | 2000-12-11 | 2002-06-26 | Yamaha Corp | Speech synthesizer, vocal data former and singing apparatus |
| CN101789255A (en) * | 2009-12-04 | 2010-07-28 | 康佳集团股份有限公司 | Processing method for changing lyrics based on original mobile phone songs and mobile phone |
| CN106971749A (en) * | 2017-03-30 | 2017-07-21 | 联想(北京)有限公司 | Audio-frequency processing method and electronic equipment |
| CN108319712A (en) * | 2018-02-08 | 2018-07-24 | 广州酷狗计算机科技有限公司 | The method and apparatus for obtaining lyrics data |
| CN110189741A (en) * | 2018-07-05 | 2019-08-30 | 腾讯数码(天津)有限公司 | Audio synthesis method, apparatus, storage medium and computer equipment |
| CN111370011A (en) * | 2020-02-21 | 2020-07-03 | 联想(北京)有限公司 | Method, device, system and storage medium for replacing audio |
| CN111445897A (en) * | 2020-03-23 | 2020-07-24 | 北京字节跳动网络技术有限公司 | Song generation method and device, readable medium and electronic equipment |
Cited By (23)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112614477B (en) * | 2020-11-16 | 2023-09-12 | 北京百度网讯科技有限公司 | Method and device for synthesizing multimedia audio, electronic equipment and storage medium |
| CN112614477A (en) * | 2020-11-16 | 2021-04-06 | 北京百度网讯科技有限公司 | Multimedia audio synthesis method and device, electronic equipment and storage medium |
| CN112750421A (en) * | 2020-12-23 | 2021-05-04 | 出门问问(苏州)信息科技有限公司 | Singing voice synthesis method and device and readable storage medium |
| CN112750421B (en) * | 2020-12-23 | 2022-12-30 | 出门问问(苏州)信息科技有限公司 | Singing voice synthesis method and device and readable storage medium |
| CN113032620A (en) * | 2021-03-02 | 2021-06-25 | 百度时代网络技术(北京)有限公司 | Data processing method and device for audio data, electronic equipment and medium |
| CN113032620B (en) * | 2021-03-02 | 2024-05-07 | 百度时代网络技术(北京)有限公司 | Data processing method and device for audio data, electronic equipment and medium |
| CN112988018A (en) * | 2021-04-13 | 2021-06-18 | 杭州网易云音乐科技有限公司 | Multimedia file output method, device, equipment and computer readable storage medium |
| CN113590076A (en) * | 2021-07-12 | 2021-11-02 | 杭州网易云音乐科技有限公司 | Audio processing method and device |
| CN113590076B (en) * | 2021-07-12 | 2024-03-29 | 杭州网易云音乐科技有限公司 | Audio processing method and device |
| WO2023024501A1 (en) * | 2021-08-24 | 2023-03-02 | 北京百度网讯科技有限公司 | Audio data processing method and apparatus, and device and storage medium |
| CN113836344A (en) * | 2021-09-30 | 2021-12-24 | 广州艾美网络科技有限公司 | Personalized song file generation method and device and music singing equipment |
| CN113963674A (en) * | 2021-09-30 | 2022-01-21 | 北京百度网讯科技有限公司 | Method, device, electronic device and storage medium for work generation |
| WO2023131266A1 (en) * | 2022-01-10 | 2023-07-13 | 北京字跳网络技术有限公司 | Audio special effect editing method and apparatus, device, and storage medium |
| US12354578B2 (en) | 2022-01-10 | 2025-07-08 | Beijing Zitiao Network Technology Co., Ltd. | Method and apparatus for editing audio special effect, device and storage medium |
| WO2023207541A1 (en) * | 2022-04-29 | 2023-11-02 | 华为技术有限公司 | Speech processing method and related device |
| CN115083397A (en) * | 2022-05-31 | 2022-09-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Training method of lyric acoustic model, lyric recognition method, equipment and product |
| CN115083397B (en) * | 2022-05-31 | 2025-04-22 | 腾讯音乐娱乐科技(深圳)有限公司 | Lyrics acoustic model training method, lyrics recognition method, device and product |
| WO2024082802A1 (en) * | 2022-10-20 | 2024-04-25 | 抖音视界有限公司 | Audio processing method and apparatus and terminal device |
| WO2024099348A1 (en) * | 2022-11-09 | 2024-05-16 | 脸萌有限公司 | Method and apparatus for editing audio special effect, and device and storage medium |
| CN118870145A (en) * | 2024-07-03 | 2024-10-29 | 北京字跳网络技术有限公司 | Method, device, equipment and storage medium for generating media content |
| CN119580673A (en) * | 2024-12-06 | 2025-03-07 | 北京字跳网络技术有限公司 | AI song editing method, device, electronic device and storage medium |
| CN119580673B (en) * | 2024-12-06 | 2025-12-19 | 北京字跳网络技术有限公司 | AI song editing method and device, electronic equipment and storage medium |
| CN119653174A (en) * | 2024-12-11 | 2025-03-18 | 北京字跳网络技术有限公司 | A method, device, equipment, medium and program product for generating an audio file |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111899706B (en) | 2024-08-23 |
Similar Documents
| Publication | Title | |
|---|---|---|
| CN111899706B (en) | Audio production method, device, equipment and storage medium | |
| CN112735429B (en) | Method for determining lyric timestamp information and training method of acoustic model | |
| US12437739B2 (en) | Method and apparatus for determining volume adjustment ratio information, device, and storage medium | |
| CN110933330A (en) | Video dubbing method and device, computer equipment and computer-readable storage medium | |
| CN109192218B (en) | Method and apparatus for audio processing | |
| CN110992927B (en) | Audio generation method, device, computer readable storage medium and computing equipment | |
| CN112487940B (en) | Video classification method and device | |
| CN111524501A (en) | Voice playing method and device, computer equipment and computer readable storage medium | |
| CN109616090B (en) | Multi-track sequence generation method, device, equipment and storage medium | |
| CN110867194B (en) | Audio scoring method, device, equipment and storage medium | |
| CN111933098B (en) | Method, device and computer-readable storage medium for generating accompaniment music | |
| CN112435643A (en) | Method, device, equipment and storage medium for generating electronic style song audio | |
| CN113420177A (en) | Audio data processing method and device, computer equipment and storage medium | |
| CN109243479B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
| CN111081277B (en) | Audio evaluation method, device, equipment and storage medium | |
| CN111428079B (en) | Text content processing method, device, computer equipment and storage medium | |
| US20240339094A1 (en) | Audio synthesis method, and computer device and computer-readable storage medium | |
| CN113257222B (en) | Method, terminal and storage medium for synthesizing song audio | |
| CN113160781B (en) | Audio generation method, device, computer equipment and storage medium | |
| CN112992107B (en) | Method, terminal and storage medium for training acoustic conversion model | |
| CN112380380B (en) | Method, device, equipment and computer readable storage medium for displaying lyrics | |
| CN113204673A (en) | Audio processing method, device, terminal and computer readable storage medium | |
| CN115862586B (en) | Method and device for training timbre feature extraction model and audio synthesis | |
| CN111028823B (en) | Audio generation method, device, computer readable storage medium and computing equipment | |
| CN111091807B (en) | Speech synthesis method, device, computer equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||