
CN116665696A - Piano playing video generation method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN116665696A
Authority
CN
China
Prior art keywords
video
piano
audio
sequence
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310638047.XA
Other languages
Chinese (zh)
Inventor
亢祖衡
彭俊清
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310638047.XA
Publication of CN116665696A
Legal status: Pending

Landscapes

  • Television Signal Processing For Recording (AREA)

Abstract

The invention relates to the field of speech analysis, and in particular to a piano playing video generation method, device, computer equipment and storage medium. The method includes: obtaining audio stream data; inputting the audio stream data into an audio encoder for encoding to obtain an audio code; transcoding the audio code with a piano video transcoding model to obtain a piano video codebook sequence; decoding the piano video codebook sequence with a piano video codebook decoder to obtain piano video stream data, where the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio code; and merging the piano video stream data with the audio stream data to obtain a piano playing video. The invention converts audio stream data into video stream data showing human hands playing the piano, and finally generates a video that contains both the audio and the hands playing that audio on the piano, which improves the video effect and quality and the user experience.

Description

Piano playing video generation method, device, computer equipment and storage medium

Technical Field

The invention relates to the field of speech analysis, and in particular to a piano playing video generation method, device, computer equipment and storage medium.

Background

With the development of technology and the entertainment industry, entertainment has shifted from plain text reading and picture display to multimedia such as audio and video. With the rise of short videos in particular, demand for music videos keeps growing. Music is usually obtained through music software, but what is obtained is often only the audio of the piano being played, without a complete piano playing video showing human hands playing the piano. Moreover, even where a corresponding piano playing video showing human hands has been generated for a piece of piano music, existing piano playing videos are usually generated by using speech to drive the motion of a virtual human, so the generated video is not smooth enough and is of poor quality, which harms the user experience.

Summary of the Invention

In view of the above technical problem, it is necessary to provide a piano playing video generation method, device, computer equipment and storage medium, to solve the problem of poor video effect in existing piano playing video generation techniques.

A piano playing video generation method, comprising:

obtaining audio stream data;

inputting the audio stream data into an audio encoder for encoding to obtain an audio code;

transcoding the audio code with a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio code;

decoding the piano video codebook sequence with a piano video codebook decoder to obtain piano video stream data, where the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio code;

merging the piano video stream data with the audio stream data to obtain a piano playing video.

A piano playing video generation device, comprising:

an audio stream data module, configured to obtain audio stream data;

an audio encoding module, configured to input the audio stream data into an audio encoder for encoding to obtain an audio code;

a piano video codebook sequence module, configured to transcode the audio code with a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio code;

a piano video stream data module, configured to decode the piano video codebook sequence with a piano video codebook decoder to obtain piano video stream data, where the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio code;

a piano playing video module, configured to merge the piano video stream data with the audio stream data to obtain a piano playing video.

A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the above piano playing video generation method when executing the computer-readable instructions.

One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the above piano playing video generation method.

The above piano playing video generation method, device, computer equipment and storage medium obtain audio stream data; input the audio stream data into an audio encoder for encoding to obtain an audio code; transcode the audio code with a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio code; decode the piano video codebook sequence with a piano video codebook decoder to obtain piano video stream data; and merge the piano video stream data with the audio stream data to obtain a piano playing video. The invention encodes the audio data and transcodes the result into a piano video codebook sequence, then decodes that sequence into piano video stream data (a video stream of human hands playing, on a piano, the music corresponding to the audio code). The piano video stream data and the audio stream data are then merged to obtain the piano playing video. The audio stream data is thereby converted into a piano playing video that contains both the audio and the hands playing that audio on the piano. A piano playing video generated from the audio stream data is smoother, which improves the video effect and quality and the user experience.

Description of the Drawings

To explain the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an application environment of the piano playing video generation method in an embodiment of the invention;

Fig. 2 is a schematic flowchart of the piano playing video generation method in an embodiment of the invention;

Fig. 3 is a schematic structural diagram of the piano playing video generation device in an embodiment of the invention;

Fig. 4 is a schematic diagram of the computer device in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

The piano playing video generation method provided in this embodiment can be applied in the application environment shown in Fig. 1, in which a client communicates with a server. Clients include, but are not limited to, personal computers, notebook computers, smartphones, tablet computers and portable wearable devices. The server can be implemented as a standalone server or as a cluster of multiple servers.

In one embodiment, as shown in Fig. 2, a piano playing video generation method is provided. Taking the method as applied to the server in Fig. 1 as an example, it includes the following steps:

S10. Obtain audio stream data.

It can be understood that audio stream data refers to audio data, for example music data obtained from a music platform.

S20. Input the audio stream data into an audio encoder for encoding to obtain an audio code.

It can be understood that the audio encoder is an encoder that encodes audio data. The audio code is the encoded form of the audio data and may be a code composed of several characters. Specifically, before the audio stream data is input into the audio encoder, it may first be converted into Mel spectrum data; the Mel spectrum features of the audio stream data are obtained from the Mel spectrum data and then input into the audio encoder for encoding.
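The Mel-spectrum preprocessing mentioned above relies on a mapping between frequency in hertz and the Mel scale. The sketch below shows one common convention for that mapping (the HTK-style formula) and how the centre frequencies of a Mel filter bank could be placed. The patent does not state which Mel formula, band count or frequency range is used, so the function names and the values 8 bands and 0 to 8000 Hz are assumptions for illustration only.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert hertz to mels (HTK-style formula, an assumed convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
centers = mel_center_frequencies(8, 0.0, 8000.0)
```

Because the bands are evenly spaced in mels, they are narrow at low frequencies and wide at high frequencies, which is why Mel features are a common input representation for audio encoders.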

S30. Transcode the audio code with a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio code.

It can be understood that the piano video transcoding model is a trained neural network model that converts an audio code into a piano video codebook sequence of human hands playing the piano. The piano video transcoding model may be a GPT-2 model (a language model), which is a kind of Transformer (self-attention network) model. Transcoding is the process of converting the audio code into the piano video codebook sequence, that is, converting an audio code into a video code. The piano video codebook sequence is the video code obtained by transcoding the audio code; it is the code sequence of a video of human hands playing, on a piano, the music corresponding to the audio code.

S40. Decode the piano video codebook sequence with a piano video codebook decoder to obtain piano video stream data; the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio code.

It can be understood that the piano video codebook decoder is a trained neural network model that decodes the piano video codebook sequence into piano video stream data. For example, the piano video codebook decoder may be a VQGAN model (Vector Quantised Generative Adversarial Network, an image generation model), which generates images from its input data. Decoding is the process of converting the piano video codebook sequence into piano video stream data. Piano video stream data is data comprising a number of video frames; here it is a video stream of human hands playing, on a piano, the music corresponding to the audio code.

S50. Merge the piano video stream data with the audio stream data to obtain a piano playing video.

It can be understood that, once the video stream of human hands playing the music corresponding to the audio code has been obtained, that video stream is merged with the audio stream data to obtain a video that contains both the audio and the hands playing that audio on the piano, that is, the piano playing video.

In steps S10 to S50, audio stream data is obtained; the audio stream data is input into an audio encoder for encoding to obtain an audio code; the audio code is transcoded with a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio code; the piano video codebook sequence is decoded with a piano video codebook decoder to obtain piano video stream data; and the piano video stream data is merged with the audio stream data to obtain a piano playing video. This embodiment encodes the audio data and transcodes the result into a piano video codebook sequence, then decodes that sequence into piano video stream data (a video stream of human hands playing, on a piano, the music corresponding to the audio code). The piano video stream data and the audio stream data are then merged to obtain the piano playing video. The audio stream data is thereby converted into a piano playing video that contains both the audio and the hands playing that audio on the piano. A piano playing video generated from the audio stream data is smoother, which improves the video effect and quality and the user experience.
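The data flow of steps S10 to S50 can be sketched as a function that chains three components. This is only a structural illustration: `PianoPlayingVideo`, `generate_piano_playing_video` and the three callables are hypothetical names, and the lambdas are toy stand-ins for the trained audio encoder, transcoding model and codebook decoder, which the patent describes only at the level of their inputs and outputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PianoPlayingVideo:
    frames: List[str]       # decoded video frames (string placeholders here)
    audio: Sequence[float]  # the original audio stream, carried through unchanged


def generate_piano_playing_video(
    audio: Sequence[float],                                # S10: audio stream data
    encode_audio: Callable[[Sequence[float]], List[int]],
    transcode: Callable[[List[int]], List[int]],
    decode_video: Callable[[List[int]], List[str]],
) -> PianoPlayingVideo:
    audio_code = encode_audio(audio)      # S20: audio stream -> audio code
    video_codes = transcode(audio_code)   # S30: audio code -> video codebook sequence
    frames = decode_video(video_codes)    # S40: codebook sequence -> video frames
    return PianoPlayingVideo(frames=frames, audio=audio)  # S50: merge video and audio


# Toy stand-ins so the sketch runs end to end; real components are trained models.
result = generate_piano_playing_video(
    audio=[0.1, 0.5, 0.9],
    encode_audio=lambda samples: [round(s * 10) for s in samples],
    transcode=lambda codes: [c + 100 for c in codes],
    decode_video=lambda codes: [f"frame<{c}>" for c in codes],
)
```

Passing the components in as callables keeps the sketch independent of any particular model library; the original audio is merged back unchanged in the last step, as S50 describes.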

Optionally, before step S30, that is, before transcoding the audio code with the piano video transcoding model to obtain the piano video codebook sequence corresponding to the audio code, the method includes:

S301. Obtain a piano playing video sample;

S302. Encode the piano playing video sample to obtain a first video codebook sequence and an audio sample code;

S303. Input the audio sample code into an initial piano video transcoding model to obtain a target video codebook sequence;

S304. Determine a video loss value from the first video codebook sequence and the target video codebook sequence;

S305. While the video loss value has not reached a preset convergence condition, iteratively update the initial parameters of the initial piano video transcoding model; once the video loss value reaches the preset convergence condition, take the converged initial piano video transcoding model as the piano video transcoding model.

It can be understood that a piano playing video sample is a collected video, used to train the piano video transcoding model, that contains both audio and human hands playing that audio on a piano. Encoding the piano playing video sample means encoding it into a first video codebook sequence and an audio sample code. Specifically, the piano playing video sample is first split into a first piano video stream sample and an audio stream sample; the two are then encoded separately to obtain the first video codebook sequence and the audio sample code. The initial piano video transcoding model is trained on the first video codebook sequence and the audio sample code, finally yielding a piano video transcoding model that can convert an audio sample code into a video codebook sequence. The initial piano video transcoding model may be a GPT-2 model (a language model), which is a kind of Transformer (self-attention network) model and predicts an output from its input data. The initial model has weak transcoding ability and must be updated iteratively until a model that satisfies the convergence condition and has strong transcoding ability is obtained. The target video codebook sequence is the video codebook sequence obtained when the initial piano video transcoding model transcodes the audio sample code. The video loss value is the loss between the first video codebook sequence and the target video codebook sequence. The smaller the video loss value, the closer the target video codebook sequence is to the first video codebook sequence, and the stronger the transcoding ability of the initial piano video transcoding model. Preferably, the video loss value is obtained with a loss function based on cross entropy.
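A cross-entropy loss over codebook indices can be sketched as follows. The sketch assumes that the model outputs, for each position of the sequence, a probability distribution over the codebook entries, and that the first video codebook sequence is given as the index of the correct entry at each position. The patent only says the loss is "based on cross entropy", so this representation and the function name `cross_entropy` are assumptions.

```python
import math
from typing import Sequence


def cross_entropy(
    pred_probs: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> float:
    """Mean negative log-probability assigned to the reference codebook indices."""
    if len(pred_probs) != len(target_ids):
        raise ValueError("predictions and targets must have the same length")
    eps = 1e-12  # guards against log(0) when a target gets zero probability
    total = 0.0
    for probs, target in zip(pred_probs, target_ids):
        total += -math.log(max(probs[target], eps))
    return total / len(target_ids)


# A model that is certain and correct has zero loss ...
perfect = cross_entropy([[1.0, 0.0], [0.0, 1.0]], [0, 1])
# ... while a model that guesses uniformly over 4 entries has loss log(4).
uniform = cross_entropy([[0.25] * 4, [0.25] * 4], [2, 0])
```

The loss falls as the predicted sequence moves closer to the reference sequence, which matches the statement above that a smaller video loss value indicates stronger transcoding ability.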

In this embodiment, the initial piano video transcoding model is trained on piano playing video samples, so that the resulting piano video transcoding model has strong transcoding ability and higher conversion accuracy.
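The "update until the loss reaches a preset convergence condition" pattern of step S305 can be illustrated with a deliberately tiny example. Here the model is a single number trained by plain gradient descent on a quadratic loss, and the convergence condition is taken to be "loss at or below a threshold". The optimiser, the threshold and all names are assumptions for illustration; the patent does not specify them.

```python
from typing import Callable, Tuple


def train_until_converged(
    loss_and_grad: Callable[[float], Tuple[float, float]],
    param: float,
    learning_rate: float = 0.1,
    loss_threshold: float = 1e-6,  # assumed form of the "preset convergence condition"
    max_steps: int = 10_000,
) -> Tuple[float, float, int]:
    """Update the parameter until the loss meets the threshold or steps run out."""
    for step in range(max_steps):
        loss, grad = loss_and_grad(param)
        if loss <= loss_threshold:
            return param, loss, step   # converged: keep this parameter
        param -= learning_rate * grad  # otherwise take one gradient step
    loss, _ = loss_and_grad(param)
    return param, loss, max_steps


def quadratic(p: float) -> Tuple[float, float]:
    """Toy loss (p - 3)^2 and its gradient; stands in for the real video loss."""
    return (p - 3.0) ** 2, 2.0 * (p - 3.0)


param, loss, steps = train_until_converged(quadratic, param=0.0)
```

In a real setting the parameter would be the full weight set of the transcoding model and the loss would be the video loss value; only the control flow carries over.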

Optionally, in step S303, inputting the audio sample code into the initial piano video transcoding model to obtain the target video codebook sequence includes:

S3031. Input the audio sample code, as first input data, into the initial piano video transcoding model to obtain a first-frame video codebook sequence;

S3032. Concatenate the audio sample code and the first-frame video codebook sequence as second input data, and input it into the initial piano video transcoding model to obtain a second-frame video codebook sequence;

S3033. Concatenate the audio sample code, the first-frame video codebook sequence and the second-frame video codebook sequence as third input data, and input it into the initial piano video transcoding model to obtain a third-frame video codebook sequence;

S3034. When the whole audio sample code has been converted into video codebook sequences, obtain the target video codebook sequence.

It can be understood that the audio sample code may be a code composed of several characters, and so may a video codebook sequence. After the whole audio sample code is input into the initial piano video transcoding model as first input data, the model first predicts the code sequence of one video frame; the resulting video codebook sequence is called the first-frame video codebook sequence. The predicted first-frame video codebook sequence is then concatenated with the audio sample code as second input data to predict the video codebook sequence of the next video frame, called the second-frame video codebook sequence. Likewise, the audio sample code, the first-frame video codebook sequence and the second-frame video codebook sequence are concatenated as third input data and input into the initial piano video transcoding model to obtain the third-frame video codebook sequence. This continues until all characters of the audio sample code have been converted into video codebook sequences, which gives the target video codebook sequence. For example, when the first input data is (a1, a2, a3, …, <BOS>), the corresponding output data may be (a2, a3, a4, …, <BOS>, v1), where (a1, a2, a3, …) is the audio sample code, <BOS> is the separator, and v1 is the first-frame video codebook sequence. The second input data is then (a1, a2, a3, …, <BOS>, v1), and the corresponding output data may be (a3, a4, a5, …, <BOS>, v1, v2), where v2 is the second-frame video codebook sequence. The third input data is (a1, a2, a3, …, <BOS>, v1, v2), and the corresponding output data may be (a4, a5, a6, …, <BOS>, v1, v2, v3), where v3 is the third-frame video codebook sequence. When the whole audio sample code has been converted into video codebook sequences, the corresponding output data may be (v1, v2, v3, …), which is the target video codebook sequence.
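The frame-by-frame procedure of S3031 to S3034 is an autoregressive loop: each new frame is predicted from the full audio code plus every frame produced so far. A minimal sketch follows. `predict_next_frame` is a hypothetical stand-in for the transcoding model, each frame is represented by a single token for brevity, and the fixed `num_frames` argument replaces the stopping rule "all audio codes converted", whose exact form the patent does not spell out.

```python
from typing import Callable, List, Sequence

BOS = "<BOS>"  # separator between the audio code and the video codes


def generate_video_codes(
    audio_codes: Sequence[str],
    predict_next_frame: Callable[[List[str]], str],
    num_frames: int,
) -> List[str]:
    """Predict the video codebook sequence one frame at a time."""
    prefix = list(audio_codes) + [BOS]
    video_codes: List[str] = []
    for _ in range(num_frames):
        # The model input is the whole audio code plus all earlier frames.
        next_code = predict_next_frame(prefix + video_codes)
        video_codes.append(next_code)
    return video_codes


def toy_predictor(tokens: List[str]) -> str:
    """Hypothetical model stand-in: labels a frame by how many frames precede it."""
    frames_so_far = len(tokens) - tokens.index(BOS) - 1
    return f"v{frames_so_far + 1}"


codes = generate_video_codes(["a1", "a2", "a3"], toy_predictor, num_frames=3)
```

With the toy predictor the successive inputs are (a1, a2, a3, <BOS>), (a1, a2, a3, <BOS>, v1) and (a1, a2, a3, <BOS>, v1, v2), mirroring the first, second and third input data in the example above.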

In this embodiment, the first-frame video codebook sequence is predicted from the whole audio code, and each subsequent frame's video codebook sequence is predicted from the whole audio code together with the video codebook sequences already obtained. The video codebook sequence of every video frame therefore draws on the whole audio code and on the video codebook sequences of all preceding frames, which improves the accuracy of the target video codebook sequence.

Optionally, in step S302, encoding the piano playing video sample to obtain the first video codebook sequence and the audio sample code includes:

S3021. Split the piano playing video sample into a first piano video stream sample and an audio stream sample;

S3022. Input the first piano video stream sample into a codebook encoder to obtain the first video codebook sequence;

S3023. Input the audio stream sample into the audio encoder to obtain the audio sample code.

It can be understood that stream splitting is the process of separating the piano playing video sample into a first piano video stream sample and an audio stream sample using video stream-splitting techniques. The first piano video stream sample is the video stream separated from the piano playing video sample and contains a number of video frames of human hands playing the piano. The audio stream sample is the audio stream separated from the piano playing video sample. The codebook encoder is an encoder that encodes a video stream; it encodes the first piano video stream sample into the first video codebook sequence. The audio encoder is an encoder that encodes audio data; it encodes the audio stream sample into the audio sample code.
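One possible way to perform the stream splitting is with the FFmpeg command-line tool. The patent does not name any tool, so using FFmpeg here is an assumption, as are the file names and the helper name `build_split_commands`. The sketch only builds the two command lines and does not run them: `-an` drops the audio so that only the video stream is copied, and `-vn` drops the video so that only the audio is written, here as 16-bit PCM.

```python
from typing import List, Tuple


def build_split_commands(
    sample_path: str,
    video_out: str,
    audio_out: str,
) -> Tuple[List[str], List[str]]:
    """Build FFmpeg commands that separate a sample into video-only and audio-only files."""
    # Keep the video stream as is, discard audio.
    video_cmd = ["ffmpeg", "-i", sample_path, "-an", "-c:v", "copy", video_out]
    # Discard video, write the audio as 16-bit PCM (suitable for a .wav file).
    audio_cmd = ["ffmpeg", "-i", sample_path, "-vn", "-c:a", "pcm_s16le", audio_out]
    return video_cmd, audio_cmd


# Hypothetical file names, for illustration only.
video_cmd, audio_cmd = build_split_commands(
    "sample.mp4", "video_only.mp4", "audio_only.wav"
)
```

On a machine with FFmpeg installed, each list could be passed to `subprocess.run(cmd, check=True)` to produce the two files.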

In this embodiment, the piano playing video sample is split into streams, and the video stream and the audio stream are encoded separately, so that the first video codebook sequence and the audio sample code are obtained quickly and accurately, which improves the efficiency of model training.

Optionally, before step S40, that is, before decoding the piano video codebook sequence with the piano video codebook decoder to obtain the piano video stream data, the method includes:

S401. Obtain a second piano video stream sample;

S402. Extract a first skeletal key point video stream from the second piano video stream sample with a hand model;

S403. Encode the second piano video stream sample with a codebook encoder to obtain a second video codebook sequence;

S404. Quantize the second video codebook sequence to obtain a video quantized codebook sequence;

S405. Input the video quantized codebook sequence into an initial piano video codebook decoder for decoding to obtain a target piano video stream and a second skeletal key point video stream;

S406. Determine a total loss value from the second piano video stream sample, the target piano video stream, the first skeletal key point video stream, the second skeletal key point video stream, the second video codebook sequence and the video quantized codebook sequence;

S407. While the total loss value has not reached a preset convergence condition, iteratively update the initial parameters of the initial piano video codebook decoder; once the total loss value reaches the preset convergence condition, take the converged initial piano video codebook decoder as the piano video codebook decoder.

Understandably, the second piano video stream sample is a video stream segmented from a piano playing video sample and contains a number of video frames of human hands playing the piano. The hand model is a trained neural network model that identifies human hands in its input data. The hand model extracts the skeletal key points of the hand from each video frame of the second piano video stream sample; the key points from all frames are combined into a video and output as the first skeletal key point video stream. The second video codebook sequence is the video codebook sequence obtained by encoding the second piano video stream sample with the codebook encoder. Quantization processing is the process of using a quantization function to look up, in the codebook library, the video codebook sequence corresponding to the second video codebook sequence, where the codebook library is an existing library containing a number of piano video codes. The video quantized codebook sequence is the result of that lookup. The initial piano video codebook decoder may be a VQGAN model (Vector Quantised Generative Adversarial Network, an image generation model), which generates images from its input data. The initial piano video codebook decoder decodes the video quantized codebook sequence into a video stream composed of a number of images, namely the target piano video stream. Likewise, the second skeletal key point video stream is obtained from the target piano video stream in the same way that the first skeletal key point video stream is obtained from the sample. The total loss value of the initial piano video codebook decoder is then determined from the second piano video stream sample, the target piano video stream, the first skeletal key point video stream, the second skeletal key point video stream, the second video codebook sequence and the video quantized codebook sequence, and the initial parameters of the initial piano video codebook decoder are iteratively updated according to that total loss value until the decoder satisfies the convergence condition, yielding the piano video codebook decoder.
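The quantization step described above can be illustrated with a minimal sketch. It assumes that the quantization function is a nearest-neighbour lookup under Euclidean distance, which is the usual choice in VQ-style models but is not stated in the source; the names `quantize` and `lookup` and the toy codebook values are hypothetical.

```python
import math
from typing import List, Sequence


def quantize(encoded: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each encoded vector to the index of its nearest codebook entry."""
    indices = []
    for vec in encoded:
        # Euclidean distance from this vector to every codebook entry.
        distances = [math.dist(vec, entry) for entry in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def lookup(indices: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each index with the codebook vector it refers to."""
    return [list(codebook[i]) for i in indices]


# Toy data (assumed values, for illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
encoded = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]

indices = quantize(encoded, codebook)    # nearest entry per vector
quantized = lookup(indices, codebook)    # the quantized sequence
```

In this sketch the encoder output plays the role of the second video codebook sequence, and `quantized` plays the role of the video quantized codebook sequence that is passed to the decoder.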

In this embodiment, the initial piano video codebook decoder is trained on the second piano video stream sample, so that the resulting piano video codebook decoder has a stronger decoding capability and decodes more accurately.

Optionally, step S406, that is, determining the total loss value according to the second piano video stream sample, the target piano video stream, the first skeletal key point video stream, the second skeletal key point video stream, the second video codebook sequence and the video quantized codebook sequence, includes:

S4061. Determine a first loss value according to the second piano video stream sample and the target piano video stream;

S4062. Determine a second loss value according to the first skeletal key point video stream and the second skeletal key point video stream;

S4063. Determine a third loss value according to the second video codebook sequence and the video quantized codebook sequence;

S4064. Determine the total loss value according to the first loss value, the second loss value and the third loss value.

Understandably, the first loss value is the loss between the second piano video stream sample and the target piano video stream, and can be computed by a discriminator. Likewise, the second loss value is the loss between the first skeletal key point video stream and the second skeletal key point video stream (step S4062), and the third loss value is the loss between the second video codebook sequence and the video quantized codebook sequence. The second and third loss values may be computed from the Euclidean distance. The sum of the first, second and third loss values may then be taken as the total loss value.
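As a minimal sketch of how the three terms combine: the code below assumes that each Euclidean-distance loss is the mean distance over paired vectors (the source does not say how per-frame distances are aggregated), and it treats the discriminator-based first loss as a number supplied by the caller. The function names and all values are hypothetical.

```python
import math
from typing import Sequence

Vectors = Sequence[Sequence[float]]


def euclidean_loss(a: Vectors, b: Vectors) -> float:
    """Mean Euclidean distance between paired vectors of two sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def total_loss(first_loss: float,
               keypoints_sample: Vectors, keypoints_generated: Vectors,
               codes: Vectors, quantized_codes: Vectors) -> float:
    """Sum of the first, second and third loss values (S4061 to S4064)."""
    second_loss = euclidean_loss(keypoints_sample, keypoints_generated)
    third_loss = euclidean_loss(codes, quantized_codes)
    return first_loss + second_loss + third_loss


# Toy inputs (assumed values): a discriminator loss of 0.5, two key points,
# and one code vector with its quantized counterpart.
loss = total_loss(
    first_loss=0.5,
    keypoints_sample=[(0.0, 0.0), (1.0, 1.0)],
    keypoints_generated=[(3.0, 4.0), (1.0, 1.0)],
    codes=[(0.0, 0.0)],
    quantized_codes=[(0.0, 1.0)],
)
```

Here the second loss is (5 + 0) / 2 = 2.5 and the third loss is 1.0, so the total is 0.5 + 2.5 + 1.0 = 4.0. An unweighted sum matches the text; weighting the terms would be a separate design choice that the source does not mention.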

In this embodiment, the total loss value accounts for the loss between the second piano video stream sample and the target piano video stream, between the first and second skeletal key point video streams, and between the second video codebook sequence and the video quantized codebook sequence. During training, the initial piano video codebook decoder therefore learns from the hand skeletal key points, the video quantized codebook sequence and the piano video stream together, so that the hands in the generated piano playing video are clearer and the playing footage is smoother, improving the user experience.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process is determined by its function and internal logic, and the numbering does not limit the implementation of the embodiments of the present invention in any way.

In one embodiment, a piano playing video generation apparatus is provided, which corresponds one-to-one to the piano playing video generation method of the above embodiments. As shown in FIG. 3, the apparatus includes an audio stream data module 10, an audio encoding module 20, a piano video codebook sequence module 30, a piano video stream data module 40 and a piano playing video module 50. The functional modules are described in detail as follows:

The audio stream data module 10 is configured to acquire audio stream data;

The audio encoding module 20 is configured to input the audio stream data into an audio encoder for encoding to obtain audio codes;

The piano video codebook sequence module 30 is configured to transcode the audio codes through a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio codes;

The piano video stream data module 40 is configured to decode the piano video codebook sequence through a piano video codebook decoder to obtain piano video stream data, where the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio codes;

The piano playing video module 50 is configured to merge the piano video stream data and the audio stream data to obtain a piano playing video.
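The data flow through modules 10 to 50 can be sketched as a simple chain of callables. This is an illustration only: the three stand-in functions are toy placeholders for the trained audio encoder, transcoding model and codebook decoder, and every name in the block is hypothetical.

```python
from typing import Callable, Dict, List


def generate_piano_playing_video(
    audio_stream: List[int],
    audio_encoder: Callable[[List[int]], List[int]],
    transcoder: Callable[[List[int]], List[int]],
    codebook_decoder: Callable[[List[int]], List[str]],
) -> Dict[str, list]:
    """Chain the modules: encode audio, transcode, decode, then merge."""
    audio_codes = audio_encoder(audio_stream)      # audio encoding module 20
    video_codes = transcoder(audio_codes)          # codebook sequence module 30
    video_frames = codebook_decoder(video_codes)   # video stream data module 40
    # Piano playing video module 50: pair the frames with the original audio.
    return {"video": video_frames, "audio": audio_stream}


# Toy stand-ins (assumptions; the real components are trained models).
def toy_encoder(samples: List[int]) -> List[int]:
    return [s // 2 for s in samples]


def toy_transcoder(codes: List[int]) -> List[int]:
    return [c % 4 for c in codes]


def toy_decoder(codes: List[int]) -> List[str]:
    return [f"frame-{c}" for c in codes]


result = generate_piano_playing_video(
    [3, 7, 12], toy_encoder, toy_transcoder, toy_decoder
)
```

Passing the components in as callables keeps the sketch independent of any particular model library; in practice the merge step would mux the two streams into one container file.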

Optionally, the piano playing video generation apparatus further includes:

A piano playing video sample module, configured to acquire a piano playing video sample;

An encoding processing module, configured to encode the piano playing video sample to obtain a first video codebook sequence and audio sample codes;

A target video codebook sequence module, configured to input the audio sample codes into an initial piano video transcoding model to obtain a target video codebook sequence;

A video loss value module, configured to determine a video loss value according to the first video codebook sequence and the target video codebook sequence;

A piano video transcoding model module, configured to iteratively update the initial parameters of the initial piano video transcoding model while the video loss value has not reached a preset convergence condition and, once the video loss value reaches the preset convergence condition, take the converged initial piano video transcoding model as the piano video transcoding model.
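The "update until the loss reaches the convergence condition" pattern used by this module can be sketched on a toy problem. The source does not specify the convergence condition or the optimizer, so the sketch assumes a loss threshold `tol` and plain gradient descent on a single scalar parameter; all names and values are hypothetical.

```python
from typing import Callable, Tuple


def train_until_converged(
    param: float,
    loss_fn: Callable[[float], float],
    grad_fn: Callable[[float], float],
    lr: float = 0.1,
    tol: float = 1e-6,
    max_steps: int = 10_000,
) -> Tuple[float, float, int]:
    """Update `param` until the loss is at or below `tol` (assumed condition)."""
    for step in range(max_steps):
        loss = loss_fn(param)
        if loss <= tol:
            # Convergence condition reached: keep the current parameters.
            return param, loss, step
        # Not converged yet: take one gradient descent step.
        param -= lr * grad_fn(param)
    return param, loss_fn(param), max_steps


# Toy objective (assumed): squared error against a target value of 3.0.
final_param, final_loss, steps = train_until_converged(
    param=0.0,
    loss_fn=lambda p: (p - 3.0) ** 2,
    grad_fn=lambda p: 2.0 * (p - 3.0),
)
```

The `max_steps` cap is a safeguard added for the sketch so that the loop always terminates; the source describes only the convergence test itself.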

Optionally, the target video codebook sequence module includes:

A first-frame video codebook sequence unit, configured to input the audio sample codes, as first input data, into the initial piano video transcoding model to obtain a first-frame video codebook sequence;

A second-frame video codebook sequence unit, configured to concatenate the audio sample codes with the first-frame video codebook sequence as second input data and input it into the initial piano video transcoding model to obtain a second-frame video codebook sequence;

A third-frame video codebook sequence unit, configured to concatenate the audio sample codes, the first-frame video codebook sequence and the second-frame video codebook sequence as third input data and input it into the initial piano video transcoding model to obtain a third-frame video codebook sequence;

A target video codebook sequence unit, configured to obtain the target video codebook sequence once all of the audio sample codes have been converted into video codebook sequences.
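The frame-by-frame scheme handled by these units is an autoregressive loop: each frame is predicted from the audio sample codes concatenated with every frame produced so far. The sketch below assumes a fixed number of frames as the stopping rule (the source stops when all audio sample codes have been converted) and uses a toy function in place of the transcoding model; all names are hypothetical.

```python
from typing import Callable, List


def generate_codebook_sequence(
    audio_codes: List[int],
    model: Callable[[List[int]], List[int]],
    num_frames: int,
) -> List[List[int]]:
    """Generate per-frame codebook sequences one frame at a time."""
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # Input = audio codes followed by the codes of all earlier frames.
        previous = [code for frame in frames for code in frame]
        model_input = list(audio_codes) + previous
        frames.append(model(model_input))
    return frames


def toy_model(model_input: List[int]) -> List[int]:
    """Stand-in for the transcoding model: one code per frame (assumption)."""
    return [sum(model_input) % 7]


target_sequence = generate_codebook_sequence([1, 2, 3], toy_model, num_frames=3)
```

With the toy model, the first frame depends only on the audio codes, the second on the audio codes plus the first frame, and the third on the audio codes plus both earlier frames, which mirrors the first, second and third input data described above.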

Optionally, the encoding processing module includes:

A stream splitting unit, configured to split the piano playing video sample into a first piano video stream sample and an audio stream sample;

A first video codebook sequence unit, configured to input the first piano video stream sample into a codebook encoder to obtain the first video codebook sequence;

An audio sample encoding unit, configured to input the audio stream sample into the audio encoder to obtain the audio sample codes.

Optionally, the piano playing video generation apparatus further includes:

A second piano video stream sample module, configured to acquire a second piano video stream sample;

A first skeletal key point video stream module, configured to extract a first skeletal key point video stream from the second piano video stream sample through a hand model;

A second video codebook sequence module, configured to encode the second piano video stream sample through a codebook encoder to obtain a second video codebook sequence;

A video quantized codebook sequence module, configured to quantize the second video codebook sequence to obtain a video quantized codebook sequence;

A decoding processing module, configured to input the video quantized codebook sequence into an initial piano video codebook decoder for decoding to obtain a target piano video stream and a second skeletal key point video stream;

A total loss value module, configured to determine a total loss value according to the second piano video stream sample, the target piano video stream, the first skeletal key point video stream, the second skeletal key point video stream, the second video codebook sequence and the video quantized codebook sequence;

A piano video codebook decoder module, configured to iteratively update the initial parameters of the initial piano video codebook decoder while the total loss value has not reached a preset convergence condition and, once the total loss value reaches the preset convergence condition, take the converged initial piano video codebook decoder as the piano video codebook decoder.

Optionally, the total loss value module includes:

A first loss value unit, configured to determine a first loss value according to the second piano video stream sample and the target piano video stream;

A second loss value unit, configured to determine a second loss value according to the first skeletal key point video stream and the second skeletal key point video stream;

A third loss value unit, configured to determine a third loss value according to the second video codebook sequence and the video quantized codebook sequence;

A total loss value unit, configured to determine the total loss value according to the first loss value, the second loss value and the third loss value.

For the specific limitations of the piano playing video generation apparatus, refer to the limitations of the piano playing video generation method above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the corresponding operations.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor provides computing and control capabilities. The memory includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The database stores the data involved in the piano playing video generation method. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a piano playing video generation method. The readable storage medium provided in this embodiment includes non-volatile and volatile readable storage media.

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:

Acquiring audio stream data;

Inputting the audio stream data into an audio encoder for encoding to obtain audio codes;

Transcoding the audio codes through a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio codes;

Decoding the piano video codebook sequence through a piano video codebook decoder to obtain piano video stream data, where the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio codes;

Merging the piano video stream data and the audio stream data to obtain a piano playing video.

In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile and volatile readable storage media. When the computer-readable instructions stored on the readable storage media are executed by one or more processors, the following steps are implemented:

Acquiring audio stream data;

Inputting the audio stream data into an audio encoder for encoding to obtain audio codes;

Transcoding the audio codes through a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio codes;

Decoding the piano video codebook sequence through a piano video codebook decoder to obtain piano video stream data, where the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio codes;

Merging the piano video stream data and the audio stream data to obtain a piano playing video.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile or volatile readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is given only as an example. In practice, the functions may be assigned to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.

The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in those embodiments may still be modified, or some of their technical features replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and all fall within the protection scope of the present invention.

Claims (10)

1. A piano playing video generation method, characterized by comprising:
acquiring audio stream data;
inputting the audio stream data into an audio encoder for encoding to obtain audio codes;
transcoding the audio codes through a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio codes;
decoding the piano video codebook sequence through a piano video codebook decoder to obtain piano video stream data, wherein the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio codes; and
merging the piano video stream data and the audio stream data to obtain a piano playing video.
2. The piano playing video generation method of claim 1, wherein, before the transcoding of the audio codes through the piano video transcoding model to obtain the piano video codebook sequence corresponding to the audio codes, the method further comprises:
acquiring a piano playing video sample;
encoding the piano playing video sample to obtain a first video codebook sequence and audio sample codes;
inputting the audio sample codes into an initial piano video transcoding model to obtain a target video codebook sequence;
determining a video loss value according to the first video codebook sequence and the target video codebook sequence; and
while the video loss value has not reached a preset convergence condition, iteratively updating the initial parameters of the initial piano video transcoding model and, once the video loss value reaches the preset convergence condition, taking the converged initial piano video transcoding model as the piano video transcoding model.
3. The piano playing video generation method of claim 2, wherein the inputting of the audio sample codes into the initial piano video transcoding model to obtain the target video codebook sequence comprises:
inputting the audio sample codes, as first input data, into the initial piano video transcoding model to obtain a first-frame video codebook sequence;
concatenating the audio sample codes with the first-frame video codebook sequence to obtain second input data, and inputting the second input data into the initial piano video transcoding model to obtain a second-frame video codebook sequence;
concatenating the audio sample codes, the first-frame video codebook sequence and the second-frame video codebook sequence to obtain third input data, and inputting the third input data into the initial piano video transcoding model to obtain a third-frame video codebook sequence; and
obtaining the target video codebook sequence once all of the audio sample codes have been converted into video codebook sequences.
4. The piano playing video generation method of claim 2, wherein the encoding of the piano playing video sample to obtain the first video codebook sequence and the audio sample codes comprises:
splitting the piano playing video sample to obtain a first piano video stream sample and an audio stream sample;
inputting the first piano video stream sample into a codebook encoder to obtain the first video codebook sequence; and
inputting the audio stream sample into the audio encoder to obtain the audio sample codes.
5. The piano playing video generation method of claim 1, wherein, before the decoding of the piano video codebook sequence through the piano video codebook decoder, the method further comprises:
acquiring a second piano video stream sample;
extracting a first skeletal key point video stream from the second piano video stream sample through a hand model;
encoding the second piano video stream sample through a codebook encoder to obtain a second video codebook sequence;
quantizing the second video codebook sequence to obtain a video quantized codebook sequence;
inputting the video quantized codebook sequence into an initial piano video codebook decoder for decoding to obtain a target piano video stream and a second skeletal key point video stream;
determining a total loss value according to the second piano video stream sample, the target piano video stream, the first skeletal key point video stream, the second skeletal key point video stream, the second video codebook sequence and the video quantized codebook sequence; and
while the total loss value has not reached a preset convergence condition, iteratively updating the initial parameters of the initial piano video codebook decoder and, once the total loss value reaches the preset convergence condition, taking the converged initial piano video codebook decoder as the piano video codebook decoder.
6. The piano playing video generation method of claim 5, wherein the determining of the total loss value according to the second piano video stream sample, the target piano video stream, the first skeletal key point video stream, the second skeletal key point video stream, the second video codebook sequence and the video quantized codebook sequence comprises:
determining a first loss value according to the second piano video stream sample and the target piano video stream;
determining a second loss value according to the first skeletal key point video stream and the second skeletal key point video stream;
determining a third loss value according to the second video codebook sequence and the video quantized codebook sequence; and
determining the total loss value according to the first loss value, the second loss value and the third loss value.
7. A piano playing video generation apparatus, characterized by comprising:
an audio stream data module, configured to acquire audio stream data;
an audio encoding module, configured to input the audio stream data into an audio encoder for encoding to obtain audio codes;
a piano video codebook sequence module, configured to transcode the audio codes through a piano video transcoding model to obtain a piano video codebook sequence corresponding to the audio codes;
a piano video stream data module, configured to decode the piano video codebook sequence through a piano video codebook decoder to obtain piano video stream data, wherein the piano video stream data is a video stream of human hands playing, on a piano, the music corresponding to the audio codes; and
a piano playing video module, configured to merge the piano video stream data and the audio stream data to obtain a piano playing video.
8. The piano playing video generation apparatus of claim 7, further comprising, upstream of the piano video codebook sequence module:
a piano playing video sample module, configured to acquire a piano playing video sample;
an encoding processing module, configured to encode the piano playing video sample to obtain a first video codebook sequence and audio sample codes;
a target video codebook sequence module, configured to input the audio sample codes into an initial piano video transcoding model to obtain a target video codebook sequence;
a video loss value module, configured to determine a video loss value according to the first video codebook sequence and the target video codebook sequence; and
a piano video transcoding model module, configured to iteratively update the initial parameters of the initial piano video transcoding model while the video loss value has not reached a preset convergence condition and, once the video loss value reaches the preset convergence condition, take the converged initial piano video transcoding model as the piano video transcoding model.
9. A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the piano playing video generation method of any one of claims 1 to 6.
10. One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the piano playing video generation method of any one of claims 1 to 6.
CN202310638047.XA 2023-05-31 2023-05-31 Piano playing video generation method, device, computer equipment and storage medium Pending CN116665696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310638047.XA CN116665696A (en) 2023-05-31 2023-05-31 Piano playing video generation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116665696A true CN116665696A (en) 2023-08-29

Family

ID=87721933

Country Status (1)

Country Link
CN (1) CN116665696A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119091905A (en) * 2024-07-29 2024-12-06 广州大学 A method, device and medium for generating hand movements for playing a musical instrument

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339865A (en) * 2020-02-17 2020-06-26 杭州慧川智能科技有限公司 Method for synthesizing video MV (music video) by music based on self-supervision learning
US20220375190A1 (en) * 2020-08-25 2022-11-24 Deepbrain Ai Inc. Device and method for generating speech video
CN113936243A (en) * 2021-12-16 2022-01-14 之江实验室 A discrete representation of video behavior recognition system and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAYES T, ZHANG S, YIN X, ET AL.: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and Generation", European Conference on Computer Vision. Cham: Springer Nature Switzerland, 31 December 2022 (2022-12-31), pages 431 - 449 *
SONGWEI GE: "Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer", European Conference on Computer Vision. Cham: Springer Nature Switzerland, 24 September 2022 (2022-09-24), pages 1 - 30 *
YAN W, ZHANG Y, ABBEEL P, ET AL.: "VideoGPT: Video Generation using VQ-VAE and Transformers", arXiv preprint arXiv:2104.10157, 31 December 2021 (2021-12-31) *

Similar Documents

Publication Publication Date Title
JP6928041B2 (en) Methods and equipment for processing video
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN114360493A (en) Speech synthesis method, apparatus, medium, computer device and program product
CN113539273B (en) Voice recognition method and device, computer equipment and storage medium
CN111241853B (en) Session translation method, device, storage medium and terminal equipment
US20240257819A1 (en) Voice audio compression using neural networks
TW201236444A (en) Video transmission and sharing over ultra-low bitrate wireless communication channel
CN111816197B (en) Audio encoding method, device, electronic equipment and storage medium
CN113870827B (en) Training method, device, equipment and medium for speech synthesis model
CN117809621B (en) A speech synthesis method, device, electronic device and storage medium
WO2021028236A1 (en) Systems and methods for sound conversion
CN119252268A (en) Audio decoding, encoding method, device, electronic device and storage medium
CN114842857B (en) Voice processing method, device, system, equipment and storage medium
CN117577121B (en) Audio encoding and decoding method and device, storage medium and equipment based on diffusion model
CN116665696A (en) Piano playing video generation method, device, computer equipment and storage medium
WO2021166129A1 (en) Speech recognition device, control method, and program
CN116129876A (en) Speech conversion model training method and device, and speech generation method and device
CN115482832B (en) Virtual face generation method and device, computer equipment and readable storage medium
CN114283791A (en) Speech recognition method based on high-dimensional acoustic features and model training method
Valin et al. DRED: Deep REDundancy Coding of Speech Using a Rate-Distortion-Optimized Variational Autoencoder
WO2024160281A1 (en) Audio encoding and decoding method and apparatus, and electronic device
US20220417291A1 (en) Systems and Methods for Performing Video Communication Using Text-Based Compression
KR20250169167A (en) Audio generation using non-autoregressive decoding
CN116434763A (en) Autoregressive audio generation method, device, equipment and storage medium based on audio quantization
US12374318B1 (en) System and method for style extraction in speech synthesis using neural networks and stored augmentations to simulate degraded speech characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination