CN109936766B - An end-to-end audio generation method for water scenes - Google Patents
An end-to-end audio generation method for water scenes

- Publication number: CN109936766B
- Application number: CN201910091367.1A
- Authority: CN (China)
- Prior art keywords: audio, video, sequence, model, envelope
- Prior art date: 2019-01-30
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention belongs to the technical field of audio processing and specifically relates to an end-to-end method for generating water-scene audio, comprising the following steps: Step 1, select various water-scene videos and preprocess them; Step 2, train a generator model on the preprocessed data; Step 3, preprocess a silent video, load it into the trained generator model, and output the audio corresponding to the silent video; Step 4, generate an envelope from the audio sequence, load it into the trained timbre-enhancer model, and output the timbre-enhanced audio. The invention realizes end-to-end automatic generation of outdoor water-scene sound, solving the time-consuming and laborious problem of dubbing scenes; at the same time, using the trained model to generate water-scene audio improves generation speed and audio-video synchronization, thereby improving work efficiency.
Description
Technical Field
The invention belongs to the technical field of audio processing and specifically relates to an end-to-end method for generating audio for water scenes.
Background Art
With the continuous development of computer graphics, people place ever higher demands on the sound quality of video and animation. Water scenes, especially outdoor water scenes, are common in film, television, and games, so it is necessary to develop a method that can automatically generate the corresponding scene sound from an outdoor water-scene video. At present, most approaches use physics-based methods to generate the sound of water scenes.
Physics-based sound generation for water scenes rests mainly on the theory that the formation and resonance of air bubbles are the dominant source of water sound. Zheng et al. proposed a water-flow sound generation method based on harmonic bubbles; by modeling sound propagation, it produced a variety of flowing-water sounds, including tap water, but the results required tedious manual adjustment. Subsequently, Langlois et al. proposed a sound generation method based on two-phase incompressible fluid simulation to improve bubble-generated fluid sound: instead of a random bubble model, bubbles are produced according to the state of the fluid, making both the bubbles and the final sound more realistic. However, these methods are limited to small-scale water flows, and as the sound results improve, the algorithmic complexity keeps rising, which makes them inapplicable to sound synthesis for outdoor water scenes.
Deep-learning sound generation methods generate the corresponding sound directly from video. Owens et al., in Visually Indicated Sounds, proposed a neural network combining a convolutional neural network (CNN) with long short-term memory units (LSTM): given the image features of a spacetime image built from each frame's grayscale image and those of its neighboring frames, the network outputs the sound cochleagram corresponding to the video, and the final result is assembled by retrieving the best-matching sound samples from a sound library. Chen et al., in Deep Cross-Modal Audio-Visual Generation, designed two conversion modes with GAN networks: converting the log-amplitude mel-spectrogram (LMS) of an input instrument sound into the corresponding instrument image, and converting an instrument image into the corresponding LMS, then retrieving the instrument sound matching the LMS. The deep networks of both algorithms output image-like spectrograms and do not directly generate the raw sound signal. Zhou et al., in Visual to Sound: Generating Natural Sound for Videos in the Wild, made a tentative attempt to generate sound for natural-scene videos with the SampleRNN model, extracting features of video images or optical-flow images as the RNN input so as to directly generate the corresponding sound signal; however, some problems remain in audio-video synchronization.
Summary of the Invention
The object of the present invention is to address the deficiencies of the prior art by providing an end-to-end method for generating water-scene audio that realizes automatic, end-to-end generation of outdoor water-scene sound, solving the time-consuming and laborious problem of dubbing scenes; at the same time, using the trained model to generate water-scene audio improves generation speed and synchronization, thereby improving work efficiency.
To achieve the above object, the present invention adopts the following technical solution:
An end-to-end method for generating water-scene audio, comprising the following steps:

Step 1, select various water-scene videos and preprocess them;

Step 2, train a generator model on the preprocessed data;

Step 3, preprocess a silent video, load it into the trained generator model, and output the audio corresponding to the silent video;

Step 4, generate an envelope from the audio sequence, load it into the trained timbre-enhancer model, and output the timbre-enhanced audio.
It should also be noted that, in the generation method of the present invention: in Step 1, selecting various water-scene videos for training helps optimize the model and reduce error; moreover, because there is a large dimensional gap between the video's image information and the sound, preprocessing brings the image information and the sound to the same dimension. In Step 2, training the generator model on the preprocessed data allows fluid sound synchronized with the outdoor water-scene video to be synthesized automatically, without a professional Foley artist to synthesize synchronized water-scene sound and without manually designing different algorithms for different scene characteristics, saving labor and material resources while improving the accuracy of the generator model to meet users' needs. A discriminator must also be provided to evaluate the quality of the generator's output and feed the evaluation back to the generator model; through repeated feedback and adjustment, the generator model is trained effectively, which improves its accuracy and allows silent video to be dubbed synchronously. In Step 3, the silent video has no sound, so the trained generator model must generate the corresponding audio data from each second's silent-video information vector, thereby adding sound to the silent video. In Step 4, because the audio data output by the generator model may not match the actual water scene (for example, a waterfall scene), the timbre must be enhanced to meet the needs of the actual water scene; at the same time, to further improve automation, a trained timbre-enhancer model is used to enhance the timbre, realizing end-to-end automatic generation of outdoor water-scene sound. The trained timbre-enhancer model obtains the enhanced audio directly from the sound's envelope, dispensing with intermediate physical methods (for example, the image method, comparison method, synthesis method, control-variable method, and transformation method), which greatly improves processing speed and reduces the user's waiting time.
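To make the end-to-end flow concrete, the following is a minimal Python sketch of the inference side of the method (Steps 3 and 4). The callables `extract_features`, `generator`, `timbre_enhancer`, and `envelope`, as well as the chosen envelope interval, are hypothetical stand-ins for the trained components described above, not names taken from the patent.

```python
import numpy as np

def generate_water_audio(silent_video_path, extract_features, generator,
                         timbre_enhancer, envelope, sr_audio=44100):
    """Steps 3-4: silent video -> generated raw audio -> timbre-enhanced audio."""
    # Step 3: preprocess the silent video into one feature vector per second,
    # each of the same dimension as one second of audio, and run the trained
    # generator on every per-second vector.
    per_second_vectors = extract_features(silent_video_path)  # shape (T, sr_audio)
    raw_audio = np.concatenate([generator(v) for v in per_second_vectors])

    # Step 4: compute the amplitude envelope of the generated audio and feed
    # it to the trained timbre enhancer to obtain the final audio.
    env = envelope(raw_audio, l_step=441)  # interval of 441 samples is assumed
    return timbre_enhancer(env)
```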
As an improvement of the end-to-end water-scene audio generation method of the present invention, in Step 1, the preprocessing method comprises the following steps:

A1. Extract features from the video frames to obtain the video's information;

A2. Convert each second of video information into a vector of the same dimension as the audio.
As an improvement of the end-to-end water-scene audio generation method of the present invention, in Step 2, the training method of the generator model comprises the following steps:

B1. Input the video-information vector and output an audio signal through the generator model;

B2. Evaluate the audio signal; if it does not correspond, feed the result back to the generator model and readjust until the corresponding audio signal is output; if it corresponds, continue with the training of the next video-information item.
As an improvement of the end-to-end water-scene audio generation method of the present invention, in Step 4, the training method of the timbre-enhancer model comprises the following steps:

C1. Input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;

C2. Evaluate the audio sequence; if it is not the target sequence, feed the result back to the timbre-enhancer model and readjust until the target audio sequence is output; if it is the target sequence, continue with the next timbre-enhancement training item.
As an improvement of the end-to-end water-scene audio generation method of the present invention, in Step 4, the envelope generation method comprises the following steps:

D1. Input an audio sequence G_V and an envelope sampling interval L_step;

D2. Take the maximum absolute value within each sampling interval L_step of the audio sequence G_V as the envelope point p_i for that interval;

D3. Concatenate the envelope points p_i of all sampling intervals into an array E_p, and linearly interpolate it into a sequence E(1:len) of the same length as G_V; this sequence is the envelope corresponding to the audio sequence G_V:

E(1:len) = interp(p_1 ⊕ p_2 ⊕ ... ⊕ p_k)

where p_i ∈ G_V, interp() denotes linear interpolation, and ⊕ denotes the concatenation operation.
As an improvement of the end-to-end water-scene audio generation method of the present invention, in step A2, the video-information conversion formula is:

G(y_1, ..., y_m) → x_1, ..., x_n,  x ∈ {audio}, y ∈ {video}

where y_1, ..., y_m denote the color-channel information of the video frames, each channel being a matrix of numbers between 0 and 255; G(y_1, ..., y_m) denotes the values of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, ..., x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
As an improvement of the end-to-end water-scene audio generation method of the present invention, in Step 2, the loss function used to output the audio signal is defined with λ = 100, where X denotes the ground-truth sound value, V the video-frame information, G the result produced by the generator, D the evaluation result, and E the mean.
As an improvement of the end-to-end water-scene audio generation method of the present invention, in Step 2, the loss function used to evaluate the audio signal is defined in terms of the video-frame information V, the generator output G, the evaluation result D, and the mean E.
As an improvement of the end-to-end water-scene audio generation method of the present invention, the water-scene audio generation method is based on a GAN network, the GAN network comprising a generator, a discriminator, and a timbre enhancer.
As an improvement of the end-to-end water-scene audio generation method of the present invention, the water-scene audio generation method is based on a GAN network; in Step 1, the vector V_t produced from the preprocessed video frames can be expressed in the following form:

V_t = v_{t,1} ⊕ v_{t,1+q} ⊕ ... ⊕ v_{t,1+(p-1)q},  p = Floor(SR_audio / 4096)

where ⊕ denotes the concatenation operation, v_{t,q} denotes the feature extracted from the q-th frame of the t-th second, and Floor denotes rounding down.

The sound generation task can further be expressed in the following form:

G(V_1, V_2, ..., V_Δt) → X_1, X_2, ..., X_Δt

where X_t = {x_{t,1}, x_{t,2}, ..., x_{t,SR_audio}}, t ∈ {1, 2, ..., Δt}.
The beneficial effects of the present invention are as follows. The present invention comprises the following steps: Step 1, select various water-scene videos and preprocess them; Step 2, train a generator model on the preprocessed data; Step 3, preprocess a silent video, load it into the trained generator model, and output the audio corresponding to the silent video; Step 4, generate an envelope from the audio sequence, load it into the trained timbre-enhancer model, and output the timbre-enhanced audio. For the reasons explained above (training on varied water-scene videos optimizes the model and reduces error; preprocessing brings image information and sound to the same dimension; discriminator feedback trains the generator effectively so that silent video can be dubbed synchronously; and envelope-based timbre enhancement dispenses with intermediate physical methods such as the image, comparison, synthesis, control-variable, and transformation methods), the present invention realizes end-to-end automatic generation of outdoor water-scene sound, solving the time-consuming and laborious problem of dubbing scenes; at the same time, using the trained model to generate water-scene audio improves generation speed and synchronization, thereby improving work efficiency.
Description of the Drawings

Fig. 1 is a schematic flowchart of the present invention;

Fig. 2 is a schematic working diagram of the present invention;

Fig. 3 is a waveform diagram of a water scene and its corresponding audio signal in the present invention;

Fig. 4 is a spectrum comparison diagram before and after timbre enhancement in the present invention.
Detailed Description

Certain terms are used throughout the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may call the same component by different names. This specification and the claims do not distinguish components by differences in name but by differences in function. "Comprising", as used throughout the specification and claims, is an open-ended term and should therefore be interpreted as "including but not limited to". "Approximately" means that, within an acceptable error range, a person skilled in the art can solve the technical problem and substantially achieve the technical effect.

In the description of the present invention, it should be understood that terms indicating orientation or positional relationships, such as "upper", "lower", "front", "rear", "left", "right", and "horizontal", are based on the orientations or positional relationships shown in the drawings; they serve only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore must not be construed as limiting the invention.

In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" should be understood broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium; or an internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.

The present invention is described in further detail below with reference to Figs. 1 to 4, which is not intended to limit the invention.
Example 1
An end-to-end method for generating water-scene audio, comprising the following steps:

Step 1, select various water-scene videos and preprocess them;

Step 2, train a generator model on the preprocessed data;

Step 3, preprocess a silent video, load it into the trained generator model, and output the audio corresponding to the silent video;

Step 4, generate an envelope from the audio sequence, load it into the trained timbre-enhancer model, and output the timbre-enhanced audio.
The considerations set out in the Summary of the Invention above (training on varied water-scene videos in Step 1, discriminator feedback in Step 2, per-second audio generation for the silent video in Step 3, and envelope-based timbre enhancement in Step 4) apply equally to this embodiment.
Preferably, in Step 1, the preprocessing method comprises the following steps:

A1. Extract features from the video frames to obtain the video's information;

A2. Convert each second of video information into a vector of the same dimension as the audio.

In the above preprocessing method, in step A1, a complete water-scene video occupies considerable memory, which hinders obtaining the video's information and is computationally expensive; extracting features from the video frames reduces the amount of computation while still obtaining the video's information, improving processing speed. In step A2, the large dimensional gap between the video's image information and the sound not only entails heavy computation but also increases the generator model's error and degrades how well the water-scene sound matches the video.
Preferably, in Step 2, the training method of the generator model comprises the following steps:

B1. Input the video-information vector and output an audio signal through the generator model;

B2. Evaluate the audio signal; if it does not correspond, feed the result back to the generator model and readjust until the corresponding audio signal is output; if it corresponds, continue with the training of the next video-information item.

In the above training method, in step B2, the initial generator model is untrained, and the output audio signal does not necessarily correspond one-to-one to the video-information vector; training on various water-scene videos and feeding the results back to the generator model in real time helps optimize the model and reduce the output error.
Preferably, in Step 4, the training method of the timbre-enhancer model comprises the following steps:

C1. Input the envelope of the target audio and output an audio sequence through the timbre-enhancer model;

C2. Evaluate the audio sequence; if it is not the target sequence, feed the result back to the timbre-enhancer model and readjust until the target audio sequence is output; if it is the target sequence, continue with the next timbre-enhancement training item.

In the above training method, in step C2, the initial timbre-enhancer model is untrained, and the output audio sequence does not necessarily correspond to the envelope of the target audio; training on the envelopes of various audio and feeding the results back to the timbre-enhancer model in real time helps optimize the model and reduce the output error.
Preferably, in Step 4, the envelope generation method comprises the following steps (a code sketch of the procedure follows these steps):

D1. Input an audio sequence G_V and an envelope sampling interval L_step;

D2. Take the maximum absolute value within each sampling interval L_step of the audio sequence G_V as the envelope point p_i for that interval;

D3. Concatenate the envelope points p_i of all sampling intervals into an array E_p, and linearly interpolate it into a sequence E(1:len) of the same length as G_V; this sequence is the envelope corresponding to the audio sequence G_V:

E(1:len) = interp(p_1 ⊕ p_2 ⊕ ... ⊕ p_k)

where p_i ∈ G_V, interp() denotes linear interpolation, and ⊕ denotes the concatenation operation.
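As promised above, here is a minimal NumPy sketch of steps D1 to D3. Anchoring each envelope point at the start of its interval before interpolating is an assumption, since the text does not state where within the interval each point sits.

```python
import numpy as np

def make_envelope(g_v: np.ndarray, l_step: int) -> np.ndarray:
    """Steps D1-D3: envelope of audio sequence g_v with sampling interval l_step."""
    length = len(g_v)
    # D2: one envelope point per interval, the maximum of |samples| within it
    points = np.array([np.abs(g_v[i:i + l_step]).max()
                       for i in range(0, length, l_step)])
    # D3: linearly interpolate the concatenated points into a sequence E(1:len)
    # of the same length as g_v
    xp = np.arange(len(points)) * l_step  # assumed sample index of each point
    return np.interp(np.arange(length), xp, points)
```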
Preferably, in step A2, the video-information conversion formula is:

G(y_1, ..., y_m) → x_1, ..., x_n,  x ∈ {audio}, y ∈ {video}

where y_1, ..., y_m denote the color-channel information of the video frames, each channel being a matrix of numbers between 0 and 255; G(y_1, ..., y_m) denotes the values of the audio signal generated from the video frames (ranging from -1 to 1); and x_1, ..., x_n denote the values of the audio signal corresponding to the video (ranging from -1 to 1).
Preferably, in Step 2, the loss function used to output the audio signal is defined with λ = 100, where X denotes the ground-truth sound value, V the video-frame information, G the result produced by the generator, D the evaluation result, and E the mean.

Preferably, in Step 2, the loss function used to evaluate the audio signal is defined in terms of the video-frame information V, the generator output G, the evaluation result D, and the mean E. An assumed form of both losses is sketched below.
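The exact loss formulas are not reproduced in this text, so the sketch below assumes a standard pix2pix-style conditional-GAN objective, which is consistent with the symbols defined above (ground truth X, video frames V, generator G, discriminator D, mean E) and with the stated λ = 100; the patent's actual equations may differ.

```python
import torch
import torch.nn.functional as F

LAMBDA = 100.0  # weight of the reconstruction term, as stated above

def generator_loss(d_fake_logits, g_out, x):
    """Adversarial term (push D(V, G(V)) toward 'real') plus lambda * E[|X - G(V)|]."""
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    return adv + LAMBDA * F.l1_loss(g_out, x)

def discriminator_loss(d_real_logits, d_fake_logits):
    """Score real pairs (V, X) toward 1 and generated pairs (V, G(V)) toward 0."""
    real = F.binary_cross_entropy_with_logits(
        d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake
```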
Example 2
Differing from Example 1: in the video preprocessing of this example, input videos usually have different image sizes; to reduce computation and unify management, the input images are scaled to 256×256×3, and the 30 images of 256×256×3 in each second are then encoded, frame by frame, into 1×4096×1 feature vectors corresponding to the audio scale. Specifically, for each video frame y_i, its feature vector v_i under the VGG19 network is extracted, with dimension 1×4096×1. Let SR_video and SR_audio be the sampling rates of the video and the audio, which in the present invention are 30 and 44100. For the t-th second of video, the corresponding preprocessed vector V_t can be expressed in the following form:

V_t = v_{t,1} ⊕ v_{t,1+q} ⊕ ... ⊕ v_{t,1+(p-1)q},  p = Floor(SR_audio / 4096)

where ⊕ denotes the concatenation operation, v_{t,q} denotes the VGG19 feature extracted from the q-th frame of the t-th second, and Floor denotes rounding down. Thus, in the present invention, p = 10 and q = 3. For the final length left missing by rounding during concatenation, the present invention pads the gaps uniformly with zeros. The original video-to-audio conversion can then be expressed as:

G(V_1, V_2, ..., V_Δt) → X_1, X_2, ..., X_Δt

where X_t = {x_{t,1}, x_{t,2}, ..., x_{t,SR_audio}}, t ∈ {1, 2, ..., Δt}. V_t and X_t have the same dimension at this point. A sketch of this preprocessing is given below.
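The following PyTorch sketch illustrates this preprocessing. Taking every third frame starting from the first, and appending the 3140 padding zeros at the end rather than spreading them uniformly, are simplifying assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

SR_AUDIO, SR_VIDEO, FEAT_DIM = 44100, 30, 4096
P = SR_AUDIO // FEAT_DIM  # Floor(44100 / 4096) = 10 features per second
Q = SR_VIDEO // P         # assumed spacing of 3 frames between features

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).eval()
# keep VGG19 up to its first 4096-dimensional fully connected layer
feat_net = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                               vgg.classifier[0], vgg.classifier[1])

prep = T.Compose([T.ToPILImage(), T.Resize((256, 256)), T.ToTensor(),
                  T.Normalize(mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def second_to_vector(frames):
    """frames: list of 30 HxWx3 uint8 arrays (one second of video) -> V_t."""
    feats = [feat_net(prep(frames[k * Q]).unsqueeze(0)).squeeze(0)
             for k in range(P)]  # ten 4096-dimensional VGG19 features
    v_t = torch.cat(feats)       # 40960 concatenated values
    return torch.nn.functional.pad(v_t, (0, SR_AUDIO - v_t.numel()))  # 44100
```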
Example 3
Differing from Example 1: the water-scene audio generation method of this example is based on a GAN network comprising a generator, a discriminator, and a timbre enhancer. The network in the present invention adjusts its inputs and outputs according to the requirements of sound generation, so that the receptive field of each convolutional layer in the original image network (receptive field: in a convolutional neural network (CNN), the region of the input layer that determines one element of a given layer's output) is no longer applicable. Image networks usually use convolutional layers with a 3×3 receptive field; corresponding to the 44100-dimensional input and output of the present invention, the receptive fields of the generator's and discriminator's convolutional layers are changed, and larger receptive fields are used to perform the corresponding convolution operations. In addition, during convolution the two-dimensional filters used for images are discarded; to suit the characteristics of the sound dimension, the present invention uses one-dimensional filters for convolution. To remove frequency information that is not needed in some sound results, a filter is added at the end of the generator to filter out part of the frequency information in the result, keeping the length of the output sequence unchanged during filtering. The specific structures of the generator and the discriminator are given in Tables 1 and 2.
Table 1
Table 2
Here, since the ReLU, LeakyReLU, and BatchNorm layers that follow the convolutional (Conv1D) and transposed-convolutional (Trans Conv1D) layers involve neither convolution kernels nor changes to the output size, they are not listed in the tables. Stride denotes the step size during convolution or deconvolution. The three parameters in the "kernel size" column are the size of the receptive field, the number of input channels of the layer, and the number of output channels of the layer. The three parameters in the "output shape" column are the batch size, the input dimension, and the number of channels of the layer. To ensure the correspondence between the convolution and deconvolution processes, the present invention uses varying receptive fields and strides so that the conversion between a layer's input and output involves neither discarding nor adding dimensions. An illustrative sketch of such a generator is given below.
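Tables 1 and 2 are not reproduced in this text, so the exact layer configuration is unavailable; the sketch below is only an illustrative 1-D encoder-decoder that follows the stated design principles: one-dimensional filters, large receptive fields, strides chosen so that encoding and decoding mirror each other without discarding or adding dimensions, and a final length-preserving filtering layer. Every kernel size, stride, and channel count here is an assumption.

```python
import torch.nn as nn

class WaterSoundGenerator(nn.Module):
    """Illustrative 1-D generator; all layer sizes are assumptions."""
    def __init__(self):
        super().__init__()
        # 1-D convolutions with large receptive fields; the decoder mirrors
        # the encoder exactly (lengths 44100 -> 8820 -> 1764 -> 8820 -> 44100)
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=25, stride=5, padding=10),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=25, stride=5, padding=10),
            nn.BatchNorm1d(128),
            nn.LeakyReLU(0.2))
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(128, 64, kernel_size=25, stride=5, padding=10),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=25, stride=5, padding=10),
            nn.Tanh())
        # final filtering layer: removes part of the frequency content while
        # keeping the length of the output sequence unchanged
        self.filter = nn.Conv1d(1, 1, kernel_size=101, padding=50)

    def forward(self, x):  # x: (batch, 1, 44100), one second of video features
        return self.filter(self.decoder(self.encoder(x)))
```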
Based on the disclosure and teaching of the above specification, those skilled in the art can make changes and modifications to the above embodiments. Therefore, the present invention is not limited to the specific embodiments described above, and any obvious improvement, replacement, or variation made by those skilled in the art on the basis of the present invention falls within the protection scope of the present invention. In addition, although certain specific terms are used in this specification, they are used only for convenience of description and do not limit the present invention in any way.
Claims (7)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910091367.1A | 2019-01-30 | 2019-01-30 | An end-to-end audio generation method for water scenes |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109936766A CN109936766A (en) | 2019-06-25 |
| CN109936766B true CN109936766B (en) | 2021-04-13 |
Family
ID=66985371
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910091367.1A (Active, CN109936766B) | An end-to-end audio generation method for water scenes | 2019-01-30 | 2019-01-30 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109936766B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11039043B1 (en) * | 2020-01-16 | 2021-06-15 | International Business Machines Corporation | Generating synchronized sound from videos |
| CN111435591B (en) * | 2020-01-17 | 2023-06-20 | 珠海市杰理科技股份有限公司 | Voice synthesis method and system, audio processing chip and electronic equipment |
| CN113223493B (en) * | 2020-01-20 | 2024-09-20 | Tcl科技集团股份有限公司 | Voice nursing method, device, system and storage medium |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8017858B2 (en) * | 2004-12-30 | 2011-09-13 | Steve Mann | Acoustic, hyperacoustic, or electrically amplified hydraulophones or multimedia interfaces |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5831518A (en) * | 1995-06-16 | 1998-11-03 | Sony Corporation | Sound producing method and sound producing apparatus |
| CN101299241A (en) * | 2008-01-14 | 2008-11-05 | 浙江大学 | Method for detecting multi-mode video semantic conception based on tensor representation |
| CN102222506A (en) * | 2010-04-15 | 2011-10-19 | 迪尔公司 | Context-based sound generation |
| CN103117057A (en) * | 2012-12-27 | 2013-05-22 | 安徽科大讯飞信息科技股份有限公司 | Application method of special human voice synthesis technique in mobile phone cartoon dubbing |
| WO2018039433A1 (en) * | 2016-08-24 | 2018-03-01 | Delos Living Llc | Systems, methods and articles for enhancing wellness associated with habitable environments |
Non-Patent Citations (3)

| Title |
|---|
| Efficient Sound Synthesis for Natural Scenes; Wang Kai et al.; IEEE; 2017-03-22 * |
| Sound Synthesis from Real-Time Video Images; Roger B. Dannenberg et al.; ResearchGate; 2003-01-31 * |
| Research and System Implementation of an Ocean Wave Simulation Algorithm Based on Wave Spectra; Liu Jie; China Master's Theses Full-text Database, Information Science and Technology; 2005-11-15 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109936766A (en) | 2019-06-25 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |