
WO2025118634A1 - Video generation method, electronic device, and computer readable storage medium - Google Patents

Video generation method, electronic device, and computer readable storage medium

Info

Publication number
WO2025118634A1
WO2025118634A1 (PCT/CN2024/107881, CN2024107881W)
Authority
WO
WIPO (PCT)
Prior art keywords
video
model
reward
target
video generation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/107881
Other languages
French (fr)
Chinese (zh)
Inventor
袁杭杰
张士伟
张迎亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Publication of WO2025118634A1
Legal status: Pending (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7335 Graphical querying, e.g. query-by-region, query-by-sketch, query-by-trajectory, GUIs for designating a person/face/object as a query predicate
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/74 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842 Selection of displayed objects or displayed text elements
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the present disclosure relates to the fields of computer technology and video processing technology, and in particular to a video generation method, an electronic device, and a computer-readable storage medium.
  • Video generation models, as video generation tools, can generate realistic video content based on a given input.
  • Video generation models usually use data from the Internet for model training. Since most of the data on the Internet is of uneven quality, the videos generated by the trained video generation models are of poor quality and do not meet user expectations.
  • the embodiments of the present disclosure provide a video generation method, an electronic device, and a computer-readable storage medium, to at least solve the technical problem in the related art that training a video generation model on network data results in generated videos of poor quality that do not meet user expectations.
  • a video generation method comprising: obtaining a target text, wherein the target text is used to describe the content of a video to be generated; performing video generation processing on the target text using a target video generation model to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model using a fine-tuning method.
  • a video generation method which provides a graphical user interface through a terminal device, and the content displayed by the graphical user interface at least partially includes a video generation scene, including: in response to a first touch operation applied to the graphical user interface, inputting a target text, wherein the target text is used to describe the video content to be generated; in response to a second touch operation applied to the graphical user interface, performing video generation processing on the target text using a target video generation model to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model using a fine-tuning method; and displaying the target video in the graphical user interface.
  • a video generation method including: obtaining a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes: a target text, which is used to describe the video content to be generated; in response to the video generation dialogue request, returning a video generation dialogue reply, wherein the information carried in the video generation dialogue reply includes: a target video, which is obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with a preset reward model using a fine-tuning method; and displaying the video generation dialogue reply in a graphical user interface.
  • an electronic device including: a memory storing an executable program; and a processor for running the program, wherein any one of the above-mentioned video generation methods is executed when the program is running.
  • a computer-readable storage medium including a stored executable program, wherein when the executable program runs, the device where the computer-readable storage medium is located is controlled to execute any one of the above-mentioned video generation methods.
  • a target text for describing the video content to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with an image reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video.
  • FIG1 is a schematic diagram of an application scenario of a video generation method according to Embodiment 1 of the present disclosure
  • FIG2 is a flow chart of a video generation method according to Embodiment 1 of the present disclosure.
  • FIG3 is a schematic diagram of a fine-tuning process according to Embodiment 1 of the present disclosure.
  • FIG4 is a schematic diagram of a video generation method according to Embodiment 1 of the present disclosure.
  • FIG5 is a flow chart of a video generation method according to Embodiment 2 of the present disclosure.
  • FIG6 is a flow chart of a video generation method according to Embodiment 3 of the present disclosure.
  • FIG7 is a schematic structural diagram of a video generating device according to Embodiment 4 of the present disclosure.
  • FIG8 is a schematic structural diagram of another video generating device according to Embodiment 4 of the present disclosure.
  • FIG9 is a schematic structural diagram of another video generating device according to Embodiment 4 of the present disclosure.
  • FIG10 is a structural block diagram of a computer terminal according to Embodiment 5 of the present disclosure.
  • Video diffusion models: A deep learning-based generative model used to generate or modify video content. The model generates new video frame sequences by simulating the distribution of video data. Video diffusion models are usually trained with large amounts of data to learn how to generate realistic videos.
  • Human preference refers to the subjective preferences or choices of human users when reviewing or evaluating content. In the context of AI-generated content, human preference usually refers to the user's preference for content quality, style, accuracy, etc.
  • Human preference model: A machine learning model that aims to capture and imitate human preference judgments. The model learns by analyzing human evaluations of content in order to produce results that are more in line with user preferences in subsequent generation processes. The model usually requires a large amount of manually annotated data for training.
  • Alignment: In the context of AI-generated content, alignment usually refers to the process of adjusting the generative model so that the content produced by the generative model better meets specific standards or goals, such as user preferences, requirements of specific tasks, etc.
  • Reward model: In machine learning, a reward model is used to evaluate the effect of an action or output, and is usually used in reinforcement learning or reward learning. In the context of video generation in the disclosed embodiments, a reward model can be used to evaluate the quality of the generated video to guide the model to produce higher quality output.
  • Reward score: An indicator used to quantify the quality or conformity of the model output. In reward fine-tuning, it specifically refers to the evaluation score given to the generated video by the reward model (such as the human preference model). The score reflects the consistency between the generated content and human preferences, target standards, or expected goals, and is usually used to guide and optimize the training process of the model so that it generates video content that better matches user preferences or is of higher quality.
  • U-Net: A deep learning neural network structure that is widely used in the field of computer vision, especially in image segmentation tasks.
  • the U-Net structure consists of an encoder and a decoder, and its structure resembles a U shape, hence the name.
  • the encoder is responsible for feature extraction and dimensionality reduction of the input image, while the decoder is responsible for restoring the encoded feature map to the original image size and performing pixel-level classification or segmentation.
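  • As a minimal illustration of the encoder-decoder structure just described (a sketch only; the pre-trained U-type network referred to in this disclosure contains far more layers, attention blocks, and temporal modules), the following PyTorch snippet shows the characteristic down-sampling path, up-sampling path, and skip connection:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-shaped network: encoder (down), bottleneck, decoder (up), skip connection."""
    def __init__(self, in_ch=3, base_ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base_ch, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)   # feature extraction + downscaling
        self.mid = nn.Sequential(nn.Conv2d(base_ch * 2, base_ch * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)       # restore the original spatial size
        self.dec1 = nn.Sequential(nn.Conv2d(base_ch * 2, base_ch, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base_ch, in_ch, 1)                               # pixel-level output

    def forward(self, x):
        e1 = self.enc1(x)                    # encoder features
        b = self.mid(self.down(e1))          # compressed representation
        d1 = self.up(b)                      # decoder restores resolution
        d1 = torch.cat([d1, e1], dim=1)      # skip connection: the "U" shape
        return self.out(self.dec1(d1))

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)                   # torch.Size([1, 3, 64, 64])
```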
  • Model fine-tuning refers to adjusting the parameters of a model based on a pre-trained model through a small amount of data or domain-related data to adapt it to a specific task or dataset.
  • fine-tuning is performed on a model that has been trained on a large dataset.
  • This model is usually a deep learning model that performs well on general tasks, or a natural language processing model pre-trained on a large text corpus.
  • Resampling: A commonly used data processing method used to adjust the size, distribution, or time interval of data samples. In statistics and machine learning, resampling is often used to solve problems such as sample imbalance, missing data, or inconsistent data collection frequency.
  • resampling the video refers to adjusting the sampling rate of the original video to change the playback speed of the video or adapt to different playback devices. Resampling can be to increase the sampling rate to improve video quality, or to reduce the sampling rate to reduce the file size or to adapt to specific playback requirements. Resampling usually causes changes in the image quality and smoothness of the video.
  • DDIM sampling: DDIM (Denoising Diffusion Implicit Models) sampling is a sampling method for the generative (reverse) process of diffusion models. Instead of the fully stochastic step-by-step denoising used in standard diffusion sampling, DDIM follows a deterministic trajectory defined by the same trained noise-prediction network, which allows a sample of comparable quality to be generated in far fewer steps. It can also start from a partially noised input rather than from pure noise, which is how it is used for video resampling in the embodiments below.
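  • For reference, one deterministic DDIM update step can be sketched as follows, assuming the standard epsilon-prediction parameterization of diffusion models (the variable names are illustrative and not taken from this disclosure); iterating such steps from a partially noised latent back to step 0 is what the video resampling described in the embodiments below performs:

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for an epsilon-prediction diffusion model."""
    a_t = torch.as_tensor(alpha_bar_t)
    a_prev = torch.as_tensor(alpha_bar_prev)
    # Clean sample implied by the current noise estimate.
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    # Deterministic move toward the previous, less noisy step.
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps_pred
```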
  • Time-decay reward (TAR): A technique used in reinforcement learning to handle situations where the value of future rewards decreases over time.
  • a discount factor is often used to measure the importance of future rewards, but in some cases, the value of future rewards decreases over time, such as in some tasks where earlier rewards may be more important than later rewards.
  • TAR reflects the effect of time by giving higher weights to earlier rewards. This can be achieved by introducing a time decay function when calculating the reward, such as an exponential decay function or a polynomial decay function.
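  • The decay function is not fixed by this description; the exponential form below is a minimal sketch of one possible choice, with an illustrative decay rate:

```python
def time_decay_weights(num_rewards: int, decay_rate: float = 0.9) -> list[float]:
    """Exponentially decaying weights: earlier rewards receive higher weight."""
    return [decay_rate ** t for t in range(num_rewards)]

rewards = [0.8, 0.6, 0.9, 0.7]                                # example per-step rewards
weights = time_decay_weights(len(rewards))                    # [1.0, 0.9, 0.81, 0.729]
weighted_total = sum(w * r for w, r in zip(weights, rewards))
print(weights, weighted_total)
```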
  • Sparse sampling: A data sampling method used to select a portion of samples from a large data set for analysis or processing. In sparse sampling, only a small portion of the data set is selected to represent the entire data set, so as to reduce computational cost and time. Sparse sampling can be achieved by random sampling, stratified sampling, or other sampling methods, so that the selected samples can represent the characteristics of the entire dataset.
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique. Instead of updating all of the weights of a pre-trained model, LoRA freezes the original weights and injects small trainable low-rank matrices into selected layers, so that only a small number of additional parameters need to be learned during fine-tuning. By choosing the rank and the layers to which the low-rank updates are applied appropriately, efficient and effective fine-tuning can be achieved in different application scenarios while largely preserving the behavior of the pre-trained model.
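  • A minimal sketch of the low-rank update used by LoRA is shown below (illustrative only; the rank, scaling, and the choice of which layers of the video generation model are wrapped are design decisions not specified here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update x @ A @ B."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                             # only A and B are updated during fine-tuning
```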
  • the related art of training video generation models based on network data has the following defects.
  • Defect 1: Since most of the data on the Internet is of varying quality, the quality of the videos generated by the trained video generation model is poor, which does not meet user expectations.
  • Defect 2: The video diffusion models in the related art cannot fully consider human aesthetic preferences and content relevance.
  • a video generation method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in an order different from that shown here.
  • FIG1 shows a hardware structure block diagram of a computer terminal (or mobile device) for implementing a video generation method.
  • the computer terminal 10 may include one or more processors 102 (shown in the figure as 102a, 102b, ..., 102n; the processor 102 may include but is not limited to a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions.
  • the computer terminal 10 may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the BUS bus), a network interface, a power supply and/or a camera.
  • the structure shown in FIG1 is only for illustration and does not limit the structure of the above-mentioned electronic device.
  • the computer terminal 10 may also include more or fewer components than those shown in FIG1 , or have a configuration different from that shown in FIG1 .
  • the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits".
  • the data processing circuits may be embodied in whole or in part as software, hardware, firmware, or any other combination thereof.
  • the data processing circuit may be a single independent processing module, or may be incorporated in whole or in part into any of the other components in the computer terminal 10 (or mobile device).
  • the data processing circuit acts as a processor control (e.g., selection of a variable resistor terminal path connected to an interface).
  • the memory 104 can be used to store software programs and modules of application software, such as the program instructions and modules corresponding to the video generation method in the embodiments of the present disclosure.
  • the memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory remotely arranged relative to the processor 102, and these remote memories may be connected to the computer terminal 10 via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the transmission device 106 is used to receive or send data via a network.
  • the specific example of the above network may include a wireless network provided by a communication provider of the computer terminal 10.
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 can be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.
  • the display may be, for example, a touch screen liquid crystal display (LCD) that enables a user to interact with a user interface of the computer terminal 10 (or mobile device).
  • FIG2 is a flow chart of a video generation method according to Embodiment 1 of the present disclosure. As shown in FIG2, the method may include the following steps:
  • Step S21 obtaining a target text, wherein the target text is used to describe the video content to be generated;
  • Step S22 using a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner.
  • the target text can be understood as the text input into the video generation model, such as the text input into the target video generation model in the embodiment of the present disclosure, that is, as the input of the target video generation model.
  • the target text is used to describe the video content to be generated, for example, to describe the video content that the user expects to generate.
  • the target text may be, for example, a short description such as "Cornus flowers are flying in the air". It is understandable that the target text may be described in natural language, such as Chinese, English, Japanese, etc., which is not limited here.
  • the initial video generation model can be understood as an initial model for generating videos based on text.
  • the initial video generation model can be a pre-trained U-type network (Pre-trained UNet), that is, a U-type network model pre-trained on a large-scale dataset.
  • the present disclosure uses a pre-trained U-type network model as the initial video generation model to make the video generated by the final target video generation model more accurate.
  • the preset reward model can be used to evaluate the quality of the generated video, thereby guiding the video generation model to produce higher quality output. For example, the preset reward model can evaluate and score the generated video, thereby quantifying the output quality or conformity of the video generation model through a reward score.
  • the preset reward model in the embodiment of the present disclosure adopts an image-based human preference model, namely an image reward model.
  • the image reward model can evaluate the quality of the generated video based on human aesthetic preferences and content relevance, so that the video generated by the final target video generation model is more in line with user expectations.
  • in the embodiment of the present disclosure, the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning, that is, a model obtained by aligning the pre-trained U-type network with the image reward model through model fine-tuning, so that video content that meets user preferences and expectations can be accurately generated by the target video generation model.
  • by obtaining a target text for describing the video content to be generated, and then performing video generation processing on the target text based on a target video generation model obtained by aligning the initial video generation model with a preset reward model in a fine-tuning manner, that is, aligning the pre-trained U-type network with the image reward model through fine-tuning to obtain a fine-tuned target video generation model, and using the fine-tuned target video generation model to generate a video based on the target text, the target video is obtained.
  • the above-mentioned video generation method provided by the embodiments of the present disclosure can be applied to, but is not limited to, application scenarios involving video generation in the fields of e-commerce services, educational services, legal services, medical services, conference services, social network services, financial product services, logistics services, and navigation services.
  • For example, the scenario of generating product display content in e-commerce services, the scenario of generating learning content videos in educational services, the scenario of generating case-related videos in legal services, etc., which are not limited here.
  • a target text for describing the content of a video to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with an image reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video.
  • the target video generation model is a video diffusion model, and the preset reward model is an image reward model, where the image reward model is used to perform preference learning on the video diffusion model.
  • the target video generation model may be a video diffusion model, an autoregressive model, etc.
  • the preset reward model may be an image reward model for performing preference learning on the video diffusion model.
  • the video generation method further includes the following method steps:
  • Step S23 using training samples to perform sampling processing on the initial video generation model to generate sampled videos, wherein the training samples include: a plurality of video-text pairs, and the plurality of video-text pairs each include: a training video and a training text, and the training text is used to describe the video content of the training video;
  • Step S24 using a preset reward model to calculate rewards for the sampled video to obtain a target reward result
  • Step S25 adjusting the model parameters of the initial video generation model based on the target reward result to generate a target video generation model.
  • the training samples can be understood as training samples to be used for model fine-tuning, for example, a portion of the video-text pairs selected from the pre-training data set.
  • the training samples can be selected according to any rules, which is not limited in the embodiments of the present disclosure.
  • the training samples include a plurality of video-text pairs, each of which includes a training video and a corresponding training text, wherein the training text is text content used to describe the video content of the training video.
  • the sampled video can be understood as a video obtained by sampling the training samples using the initial video generation model, that is, a video generated by sampling the training video and training text using the pre-trained U-shaped network model.
  • the target reward result is the reward score of the sampled video obtained by calculating the reward for the sampled video according to the preset reward model.
  • the target reward result is used to reflect whether the sampled video is consistent with human aesthetic preferences and text content, and can be denoted as R.
  • some video texts can be selected from the pre-training data set as training samples, and then the initial video generation model is sampled using the training samples, that is, the training videos and training texts in the training samples are sampled using the pre-trained U-shaped network model to generate a sampled video.
  • a preset reward model is used to calculate rewards for the sampled video, that is, an image reward model is used to calculate rewards for the generated sampled video to obtain the target reward result corresponding to the sampled video.
  • the model parameters of the initial video generation model are adjusted based on the target reward result, that is, the initial video generation model is fine-tuned based on the target reward result to obtain the target video generation model.
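  • The flow of steps S23 to S25 can be summarized with the sketch below. The callables add_noise, ddim_resample, and reward_model are hypothetical placeholders for the noising, DDIM resampling, and image-reward scoring operations described in the following paragraphs; they are passed in as arguments and are not the API of any particular library:

```python
def reward_finetune(model, reward_model, add_noise, ddim_resample, video_text_pairs, optimizer):
    """Sketch of steps S23-S25: sample with the model, score with the reward model, update the parameters."""
    for video, text in video_text_pairs:
        noisy = add_noise(video)                         # step S231: noise the training video
        sampled = ddim_resample(model, noisy, text)      # step S232: resample into a generated (sampled) video
        reward = reward_model(sampled, text)             # step S24: target reward result R
        loss = -reward.mean()                            # maximizing the reward = minimizing its negative
        optimizer.zero_grad()
        loss.backward()                                  # step S25: adjust the model parameters
        optimizer.step()
    return model
```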
  • step S23 the initial video generation model is sampled using the training sample to generate a sampled video, including the following method steps:
  • Step S231 performing noise processing on the training video to obtain a noisy video
  • Step S232 using the noisy video and the training text to perform video resampling on the initial video generation model to generate a sampled video.
  • the training video in the training sample can be noised to obtain a noisy video, and then the obtained noisy video and the training text in the training sample are used to resample the initial video generation model to generate a sampled video.
  • a diffusion process with noise may be performed on the training video to obtain a noisy video, and then the initial video generation model is sampled by DDIM using the obtained noisy video and the training text in the training sample to generate a sampled video.
  • the resampling of videos through DDIM sampling in the present disclosure can more effectively utilize data information, improve the generation and generalization capabilities of the model, and help the model better understand the structure and characteristics of the data.
  • step S231 performing noise processing on the training video to obtain a noisy video includes the following method steps:
  • Step S2311 obtaining the number of noise adding steps and the noise level corresponding to the training video, wherein the number of noise adding steps is used to determine the number of steps to be noise added to the training video through a preset noise adding function, and the noise level is used to determine the degree of damage to the training video;
  • Step S2312 Noise the training video based on the number of noise addition steps and the noise level to obtain a noisy video.
  • the preset noise adding function can be expressed as d(τ, D), which is used to calculate how many steps the training video should be noised to, according to the value of the noise level τ and the number of noise adding steps D.
  • the output result is usually between 1 and 1000.
  • the number of noise adding steps D is used to determine, via the preset noise adding function d(), the number of steps by which the training video is to be noised, and is usually set to 20.
  • the noise level τ is used to determine the degree of damage to the training video, and its value range is between 0 and 1.
  • the number of noise addition steps D represents the number of steps used in the diffusion model generation process.
  • the preset noise addition function d() can determine the degree of video noise addition according to the noise level τ and the number of noise addition steps D, so as to generate a video with a certain degree of noise in the diffusion model.
  • the number of noise adding steps D and the noise level τ corresponding to the training video can be obtained, and then the training video can be noised based on the number of noise adding steps D and the noise level τ, thereby generating a noisy video with a certain degree of noise.
  • the embodiment of the present disclosure adopts DDIM sampling.
  • since the noising starts from the training video rather than from pure noise, the subsequent sampling only requires approximately a τ proportion of the calculation amount of the complete generation process, so the calculation amount is small, which can effectively save computing resources.
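  • The noising just described can be sketched as follows, assuming a standard variance-preserving forward diffusion; the linear mapping used for d(τ, D), the beta schedule, and the latent video shape are assumptions made only for illustration (the description above states only that d() outputs a step index, typically between 1 and 1000):

```python
import torch

def noise_step_index(tau: float, total_diffusion_steps: int = 1000) -> int:
    """d(tau, D): map the noise level tau in (0, 1] to a forward-diffusion step index."""
    return max(1, int(tau * total_diffusion_steps))

def add_noise(z0: torch.Tensor, tau: float, alphas_cumprod: torch.Tensor):
    """Corrupt the clean latent z0 up to the step given by d(tau, D)."""
    t = noise_step_index(tau, len(alphas_cumprod))
    a_bar = alphas_cumprod[t - 1]
    noise = torch.randn_like(z0)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise, t

betas = torch.linspace(1e-4, 0.02, 1000)                 # illustrative linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
z0 = torch.randn(1, 4, 16, 32, 32)                       # latent video: (batch, channels, frames, height, width)
noisy_video, step = add_noise(z0, tau=0.6, alphas_cumprod=alphas_cumprod)
print(step, noisy_video.shape)                           # 600 torch.Size([1, 4, 16, 32, 32])
```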
  • a preset reward model is used to calculate a reward for the sampled video to obtain a target reward result, including the following method steps:
  • Step S241 using a preset reward model to calculate rewards for the sampled video to obtain an initial reward result
  • Step S242 using a time decay reward method to adjust the initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is the default reward weight corresponding to the video frame sequence contained in the sampled video.
  • the preset reward model can be used to calculate the reward for the sampled video, that is, an image-based human preference model is used, that is, an image reward model is used to calculate the reward, thereby obtaining an initial reward result, that is, the original output result of the preset reward model.
  • the generated sample video and the text input when generating the sample video can be input into the image reward model, so as to obtain a score value based on the output of the image reward model, that is, to obtain an initial reward result.
  • the score value output by the image reward model can be between 0 and 1, and a larger score indicates a better video quality, which is not limited here.
  • the time-decay reward (TAR) method can be used to adjust the initial reward weight corresponding to the initial reward result, that is, adjust the default reward weight corresponding to the video frame sequence contained in the sampled video to generate the target reward result.
  • a preset reward model is used to calculate a reward for a sampled video to obtain an initial reward result, including the following method steps:
  • Step S2411 performing video segment sampling on the video frame sequence to obtain segment sampling results
  • Step S2412 using a preset reward model to calculate rewards for the segmented sampling results to obtain an initial reward result.
  • segmented video rewards may be used to perform segmental sampling on video frame sequences, that is, sparse sampling is performed on the video, the video is split into several segments, and multiple continuous video frames are grouped to obtain segmented sampling results, and then the preset reward model is used to calculate rewards for the segmented sampling results to obtain initial reward results.
  • the 16-frame video can be divided into 4 groups with 4 frames in each group, and the preset reward model is used to calculate the rewards for the 4 groups to obtain the initial reward results, so as to improve the training effect of the model.
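  • A sketch of this segmented (sparse) sampling for a 16-frame video is shown below; the interface of the image reward model and the averaging of frame scores within each group are assumptions made for illustration:

```python
import torch

def segment_rewards(frames: torch.Tensor, text: str, image_reward_model, num_groups: int = 4) -> torch.Tensor:
    """Split a decoded frame sequence into groups and score each group with an image reward model.

    frames: tensor of shape (num_frames, C, H, W) in RGB space (x_0^g after decoding).
    image_reward_model: callable returning a scalar score tensor for one frame and the text (hypothetical interface).
    """
    groups = frames.chunk(num_groups, dim=0)             # e.g. 16 frames -> 4 groups of 4 consecutive frames
    group_scores = [
        torch.stack([image_reward_model(frame, text) for frame in group]).mean()
        for group in groups
    ]
    return torch.stack(group_scores)                     # one initial reward per group
```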
  • step S2411 performing video segment sampling on the video frame sequence to obtain segment sampling results includes the following method steps:
  • Step S24111 obtaining a feature space representation of a video frame sequence
  • Step S24112 performing video segment sampling on the feature space representation to obtain a color space representation of the segmented video frame
  • Step S24113 determine the segmented sampling result based on the color space representation.
  • the feature space representation z_0 of the video frame sequence can be obtained, and then the feature space representation z_0 can be subjected to video segmentation sampling to obtain the color (RGB) space representation x_0^g of the segmented video frames, and finally the segmentation sampling result can be determined based on the color space representation x_0^g.
  • step S242 the initial reward weight corresponding to the initial reward result is adjusted by using a time decay reward method to generate a target reward result, including the following method steps:
  • Step S2421 using a time decay reward method to differentially adjust the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle of the video frame sequence, and the second video frame is located at the edge of the video frame sequence;
  • Step S2422 generating a target reward result based on the initial reward result and the target reward weight.
  • the time decay reward method can be used to perform differentiated adjustment on the initial reward weight corresponding to the initial reward result.
  • the present disclosure can use a time decay reward method to adjust the current reward weight of the first video frame located in the middle of the video frame sequence to be higher than the current reward weight of the second video frame located at the edge of the video frame sequence. For example, taking a video with 16 frames as an example, the reward weight of each video frame is adjusted by the time decay reward: the coefficients of the middle video frames among the 16 frames are adjusted to 1, and the coefficients of the video frames toward the edges are adjusted to smaller values, thereby generating the target reward result.
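  • A sketch of this differentiated weight adjustment is given below; the exponential falloff and the exact edge coefficients are assumptions, since the description above only states that middle frames receive a coefficient of 1 and edge frames receive smaller coefficients:

```python
import torch

def center_decay_weights(num_frames: int, decay: float = 0.8) -> torch.Tensor:
    """Weights close to 1.0 for the middle of the frame sequence, decaying toward both edges."""
    center = (num_frames - 1) / 2.0
    dist = torch.abs(torch.arange(num_frames, dtype=torch.float32) - center)
    weights = torch.tensor(decay) ** dist
    return weights / weights.max()                   # normalize so the middle frame(s) get a coefficient of 1

def target_reward(initial_rewards: torch.Tensor) -> torch.Tensor:
    """Combine per-frame (or per-group) initial rewards into the target reward result R."""
    w = center_decay_weights(len(initial_rewards))
    return (w * initial_rewards).sum() / w.sum()

print(center_decay_weights(16))                      # coefficients near 1 in the middle, smaller toward the edges
```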
  • a graphical user interface is provided by a terminal device, and the content displayed by the graphical user interface at least partially includes a video generation scene.
  • the video generation method further includes:
  • Step S26 in response to the first touch operation on the graphical user interface, inputting a target text, wherein the target text is used to describe the video content to be generated;
  • Step S27 in response to the second touch operation on the graphical user interface, using the target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner;
  • Step S28 displaying the target video in the graphical user interface.
  • At least a video generation scene is displayed in the graphical user interface, and by performing control operations, the user can input the target text in the video generation scene, use the target video generation model to perform video generation processing on the target text to obtain the target video, and so on.
  • the above video generation scene can be, but is not limited to, application scenes involving video generation in the fields of e-commerce, education, medical treatment, conferences, social networks, financial products, logistics, and navigation.
  • the graphical user interface also includes a first control (or a first touch area).
  • when a first touch operation acting on the first control or the first touch area is detected, a target text input by the user can be obtained.
  • the target text can be input by the user from a text box in the graphical user interface through the first touch operation.
  • the first touch operation can be a point selection, a box selection, a check, a conditional screening, etc., which are not limited here.
  • the above-mentioned graphical user interface also includes a second control (or a second touch area); when a second touch operation acting on the second control or the second touch area is detected, the target video generation model can be used to perform video generation processing on the target text input by the user to obtain the target video.
  • the above second touch operation can be operations such as point selection, box selection, check selection, conditional screening, etc., which are not limited here.
  • the target video can be displayed in a graphical user interface to provide feedback to the user.
  • the first touch operation and the second touch operation can both be operations in which a user touches the display screen of the terminal device with a finger.
  • the touch operation can include single-point touch and multi-point touch, wherein the touch operation of each touch point can include click, long press, heavy press, swipe, etc.
  • the first touch operation and the second touch operation can also be touch operations implemented by input devices such as a mouse and a keyboard, which are not limited here.
  • Figure 3 is a schematic diagram of the fine-tuning process according to Example 1 of the present disclosure, wherein z is the representation of the sampled video in the feature space.
  • c is the text corresponding to the sampled video z. It can be understood that z and c together constitute a corresponding video-text pair.
  • the text data c is "Cornus flowers are flying in the air" and the video data z is the video corresponding to the text data c as an example.
  • z_{d(τ,D)} represents the noisy video, wherein d() is a function for calculating how many steps the video should be noised, D is the number of noise adding steps, and τ is the noise level.
  • z_0 is the representation of the generated video in the feature space
  • x_0^g is the representation of the generated video in the RGB space after sampling
  • R is the calculated reward score.
  • some video-text pairs are first selected from the pre-training data set for fine-tuning.
  • a diffusion process with noise is implemented on the video data z in the video-text pairs, wherein the noise level is set to τ and the number of noise adding steps is D, that is, the video data z is subjected to noise processing through the diffusion model to obtain a noisy video z_{d(τ,D)}.
  • the noisy video z_{d(τ,D)} and the corresponding text data c are resampled using DDIM sampling to generate a sampled video z_0.
  • the present disclosure adopts an image-based human preference model as a reward model, that is, an image reward model is used for reward calculation, and when calculating the reward score, in order to improve the efficiency of the learning process, a segmented sampling and decoding method is used to segment the sampled video z_0, and the color space representation x_0^g of the segmented video frames and the segmented sampling result are obtained. Then, the reward of the segmented sampling result is calculated through the image reward model and the corresponding text data c, and the reward weight of each video frame is adjusted through the time decay reward (TAR) to achieve effective and efficient fine-tuning, so as to obtain the reward score R, that is, the target reward result.
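  • Putting the segmented rewards and the time decay weights together, the target reward result can be written as follows (the notation is assumed for illustration: r(·,·) denotes the image reward model, x_{0,i}^g the i-th decoded segment, c the text, w_i the time-decay weight of segment i, and γ the decay factor):

```latex
R = \sum_{i=1}^{N} w_i \, r\!\left(x_{0,i}^{g}, c\right),
\qquad w_i = \gamma^{\lvert i - i_{\mathrm{center}} \rvert}, \quad 0 < \gamma \le 1
```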
  • finally, based on the reward score R, the gradient can be back-propagated to the noisy video z_{d(τ,D)} to adjust the network parameters and minimize the error, thereby optimizing the model, which will not be described in detail here.
  • FIG4 is a schematic diagram of a video generation method according to Embodiment 1 of the present disclosure.
  • in the related art, the generated video does not meet the user's expectations.
  • the method of the present disclosure, through fine-tuning, aligns the adopted video generation model with the image-based human preference model on the basis of the text and the noisy video, that is, aligns the pre-trained U-shaped network model with trainable model parameters to the image reward model, and adopts an efficient fine-tuning method in the fine-tuning process.
  • the LoRA technology improves the performance of the model by making lightweight, low-rank adjustments to a small number of model parameters.
  • the target video generation model obtained by the present disclosure is used to perform video generation processing on the input text, so as to obtain the target video that meets the user's expectations.
  • the method of the embodiment of the present disclosure can use the image reward model to learn human preferences for the video diffusion model without a video reward model (such a model has not been trained in the relevant technology), so that after the fine-tuning method designed by the present disclosure, the output results of the video diffusion model are more in line with user expectations and more popular with humans.
  • Beneficial effect (1) In order to align the video diffusion model with human preferences, the present disclosure proposes to align the video diffusion model with the human preference model of images, so that the video diffusion model obtained after alignment can fully consider human aesthetic preferences and content relevance, output video content that better meets user expectations, improve the personalization level of content, and open up new application prospects in the field of video generation;
  • Beneficial effect (2) The present disclosure proposes segmented video rewards and time decay rewards in video reward fine-tuning learning to enable the model to be fine-tuned effectively and efficiently.
  • the user information involved in the present disclosure (including but not limited to user device information, user personal information, etc.) and the data involved (including but not limited to data used for analysis, stored data, displayed data, etc.) are all information and data authorized by the user or fully authorized by all parties.
  • the method according to the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course, it can also be implemented by hardware.
  • the technical solution of the present disclosure, or the part thereof that contributes over the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present disclosure.
  • FIG5 is a flow chart of a video generation method according to Example 2 of the present disclosure. As shown in FIG5 , the method includes:
  • Step S51 obtaining a video generation call request through a first application programming interface, wherein the request data carried in the video generation call request includes: a target text, the target text is used to describe the video content to be generated, the video generation call request is used to request to call a target video generation model to perform video generation processing on the target text to obtain a target video, and the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner;
  • Step S52 Return a video generation call response through the second application programming interface, wherein the response data carried in the video generation call response includes: the target video.
  • Both the first application programming interface and the second application programming interface can be understood as application programming interfaces (Application Programming Interface, API).
  • the first application programming interface and the second application programming interface in the embodiment of the present disclosure may be the same application programming interface or different application programming interfaces, which is not limited here.
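  • The two application programming interfaces are not specified beyond carrying the target text in the request and the target video in the response, so the following is a purely hypothetical sketch of how such a call pair might look over HTTP; the endpoint path, field names, and the use of the requests library are assumptions and not part of this description:

```python
import requests

# Hypothetical endpoint standing in for the first/second application programming interfaces.
GENERATE_URL = "https://example.com/api/v1/video-generation"

def request_video(target_text: str) -> bytes:
    """Send a video generation call request and read the video generation call response."""
    # First API: the request data carries the target text describing the video to be generated.
    resp = requests.post(GENERATE_URL, json={"target_text": target_text}, timeout=300)
    resp.raise_for_status()
    # Second API: the response data carries the target video (assumed here to be returned as a URL).
    video_url = resp.json()["target_video_url"]
    return requests.get(video_url, timeout=300).content

# Example usage against a hypothetical service:
# video_bytes = request_video("Cornus flowers are flying in the air")
```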
  • the target text can be understood as the text input into the video generation model, such as the text input into the target video generation model in the embodiment of the present disclosure, that is, as the input of the target video generation model.
  • the target text is used to describe the video content to be generated, for example, to describe the video content that the user expects to generate.
  • the target text may be, for example, a short description such as "Cornus flowers are flying in the air". It is understandable that the target text may be described in natural language, such as Chinese, English, Japanese, etc., which is not limited here.
  • the initial video generation model can be understood as an initial model for generating videos based on text.
  • the initial video generation model can be a pre-trained U-type network (Pre-trained UNet), that is, a U-type network model pre-trained on a large-scale dataset.
  • the present disclosure uses a pre-trained U-type network model as the initial video generation model to make the video generated by the target video generation model more accurate.
  • the preset reward model may be a model used to evaluate the quality of generated videos, thereby guiding the video generation model to produce higher quality outputs.
  • the preset reward model may evaluate and score the generated videos, thereby quantifying the output quality or conformity of the video generation model through reward scores.
  • the preset reward model in the embodiment of the present disclosure adopts an image-based human preference model, namely an image reward model.
  • the image reward model can evaluate the quality of the generated video based on human aesthetic preferences and content relevance, so that the video generated by the final target video generation model is more in line with user expectations.
  • in the embodiment of the present disclosure, the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning, that is, a model obtained by aligning the pre-trained U-type network with the image reward model through model fine-tuning, so that video content that meets user preferences and expectations can be accurately generated by the target video generation model.
  • a video generation call request can be obtained by calling a first application programming interface, and the video generation call request carries a target text for describing the content of the video to be generated.
  • a target video generation model is used to perform video generation processing on the target text, so that a target video can be obtained, and then a video generation call response including the target video can be returned by calling a second application programming interface.
  • the generated target video can be made more consistent with human aesthetic preferences and the content of the target text, and the video quality of the generated target video can be improved, that is, a video that better meets the user's expectations can be obtained, thereby improving the personalization level of the generated video content, and opening up new application prospects in the field of video generation.
  • the above-mentioned video generation method provided by the embodiments of the present disclosure can be applied to, but is not limited to, application scenarios involving video generation in the fields of e-commerce services, educational services, legal services, medical services, conference services, social network services, financial product services, logistics services, and navigation services.
  • For example, the scenario of generating product display content in e-commerce services, the scenario of generating learning content videos in educational services, the scenario of generating case-related videos in legal services, etc., which are not limited here.
  • a video generation call request can be obtained by calling a first application programming interface, and the video generation call request carries a target text for describing the video content to be generated.
  • a target video generation model is used to perform video generation processing on the target text, so that a target video can be obtained.
  • a video generation call response including the target video can be returned, thereby achieving the purpose of generating a target video that meets user expectations. This achieves the technical effect of making the generated target video more consistent with human aesthetic preferences and the content of the target text, improving the video quality of the generated target video, and making the generated target video more popular with humans, thereby solving the technical problem in the related art that training a video generation model based on network data results in generated videos of poor quality that do not meet user expectations.
  • FIG6 is a flow chart of a video generation method according to Example 3 of the present disclosure. As shown in FIG6 , the method includes:
  • Step S61 obtaining a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes: a target text, the target text is used to describe the video content to be generated;
  • Step S62 in response to the video generation dialogue request, returning a video generation dialogue reply, wherein the information carried in the video generation dialogue reply includes: a target video, which is obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning;
  • Step S63 displaying the video-generated dialogue response in the graphical user interface.
  • the video generation dialogue request can be understood as a dialogue request (request) initiated by a user to a computer or a robot, and the video generation dialogue request carries a target text for describing the video content to be generated.
  • the target text can be understood as the text input into the video generation model, such as the text input into the target video generation model in the embodiment of the present disclosure, that is, as the input of the target video generation model.
  • the target text is used to describe the video content to be generated, for example, to describe the video content that the user expects to generate.
  • the target text may be, for example, a short description such as "Cornus flowers are flying in the air". It is understandable that the target text may be described in natural language, such as Chinese, English, Japanese, etc., which is not limited here.
  • a video generation dialogue reply in response to the acquired video generation dialogue request, may be returned, wherein the video generation dialogue reply carries a target video obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with a preset reward model using a fine-tuning method.
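  • The request and reply carriers of steps S61 to S63 can be sketched with plain data classes as below; the class and field names are illustrative, since the description only requires that the request carry the target text and the reply carry the target video:

```python
from dataclasses import dataclass

@dataclass
class VideoGenerationDialogueRequest:
    target_text: str        # describes the video content to be generated

@dataclass
class VideoGenerationDialogueReply:
    target_video: bytes     # video produced by the fine-tuned target video generation model

def handle_dialogue(request: VideoGenerationDialogueRequest, generate_video) -> VideoGenerationDialogueReply:
    """Step S62: respond to the dialogue request with a dialogue reply carrying the target video."""
    return VideoGenerationDialogueReply(target_video=generate_video(request.target_text))
```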
  • the initial video generation model can be understood as an initial model for generating videos based on text.
  • the initial video generation model can be a pre-trained U-type network (Pre-trained UNet), that is, a U-type network model pre-trained on a large-scale dataset.
  • the present disclosure uses a pre-trained U-type network model as the initial video generation model to make the video generated by the target video generation model more accurate.
  • the preset reward model may be a model used to evaluate the quality of generated videos, thereby guiding the video generation model to produce higher quality outputs.
  • the preset reward model may evaluate and score the generated videos, thereby quantifying the output quality or conformity of the video generation model through reward scores.
  • the preset reward model in the embodiment of the present disclosure adopts an image-based human preference model, namely an image reward model.
  • the image reward model can evaluate the quality of the generated video based on human aesthetic preferences and content relevance, so that the video generated by the final target video generation model is more in line with user expectations.
  • the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning in the embodiment of the present disclosure, that is, a model obtained by aligning the pre-trained U-shaped network with the image reward model through model fine-tuning, so that video content that conforms to user preferences and meets user expectations can be accurately generated according to the target video generation model.
  • the video-generated dialogue response can be displayed in the graphical user interface.
  • the above-mentioned video generation method provided by the embodiments of the present disclosure can be applied to, but is not limited to, application scenarios involving video generation in the fields of e-commerce services, educational services, legal services, medical services, conference services, social network services, financial product services, logistics services, and navigation services.
  • for example, the scenario of generating product display videos in e-commerce services, the scenario of generating learning-content videos in educational services, and the scenario of generating case-related videos in legal services, which are not limited here.
  • in this embodiment, a video generation dialogue request carrying a target text that describes the content of the video to be generated is obtained. In response to the request, video generation processing is performed on the target text with a target video generation model obtained by aligning the initial video generation model with a preset reward model in a fine-tuning manner, that is, the pre-trained U-shaped network is aligned with the image reward model through fine-tuning, and the fine-tuned model generates a video from the target text to obtain the target video. A video generation dialogue response carrying the target video is then returned, and finally the response is displayed in the graphical user interface. This achieves the purpose of generating a target video that meets the user's expectations, makes the generated target video more consistent with human aesthetic preferences and with the content of the target text, improves the video quality of the generated target video, and makes it more appealing to human viewers, thereby solving the technical problem in the related art that training a video generation model on network data yields videos of poor quality that do not meet user expectations.
  • FIG7 is a structural schematic diagram of a video generation device according to Embodiment 4 of the present disclosure. As shown in FIG7 , the device includes:
  • An acquisition module 701 is configured to acquire a target text, wherein the target text is used to describe the video content to be generated;
  • the processing module 702 is configured to use a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model using a fine-tuning method.
  • a training module, which is configured to: perform sampling processing on the initial video generation model using training samples to generate a sampled video, wherein the training samples include multiple video-text pairs, each of which includes a training video and a training text, the training text being used to describe the video content of the training video; perform reward calculation on the sampled video using a preset reward model to obtain a target reward result; and adjust the model parameters of the initial video generation model based on the target reward result to generate the target video generation model.
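  • A compact sketch of one such training step is given below; `model.sample` and `reward_model` are hypothetical stand-ins for the disclosed components, and the sampling pass is assumed to keep gradients (for example, by back-propagating through the last denoising steps), so this is an illustration rather than a definitive implementation.

```python
import torch

def reward_finetune_step(model, reward_model, optimizer, video, text, frame_weights):
    """One alignment step: sample a video from the model, score it with the preset
    reward model, and update the model parameters to increase the target reward."""
    sampled_frames = model.sample(video=video, text=text)   # sampled video, shape (T, C, H, W)
    frame_scores = reward_model(sampled_frames, text)       # per-frame rewards, shape (T,)
    target_reward = (frame_scores * frame_weights).sum()    # time-decay weighted aggregation
    loss = -target_reward                                    # maximizing reward = minimizing its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(target_reward.detach())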
  • the training module is further configured to: perform noise processing on the training video to obtain a noisy video; and perform video resampling on the initial video generation model using the noisy video and the training text to generate a sampled video.
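  • One way to realize the resampling step is a deterministic (DDIM-style) denoising loop that starts from the noised training video and is conditioned on the training text; in the sketch below, `unet`, `text_emb`, and the `alpha_bar` schedule are assumed stand-ins for the disclosed components.

```python
import torch

def resample_video(unet, noisy_latent, text_emb, timesteps, alpha_bar):
    """Deterministic resampling: start from the noised training video and denoise it
    step by step, conditioned on the training text embedding."""
    x = noisy_latent
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):     # timesteps in decreasing order
        eps = unet(x, t, text_emb)                           # predicted noise at step t
        x0_pred = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x = alpha_bar[t_prev].sqrt() * x0_pred + (1 - alpha_bar[t_prev]).sqrt() * eps
    return x                                                 # sampled video (latent)
```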
  • the above-mentioned training module is also configured to: obtain the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps determines, through a preset noise addition function, how many noising steps are applied to the training video, and the noise level determines the degree to which the training video is corrupted; and perform noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain a noisy video.
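  • The noise addition itself can follow the usual diffusion forward process, where the number of noise addition steps selects an entry of a preset schedule and the noise level scales how strongly the video is damaged; the sketch below assumes a precomputed `alpha_bar` schedule.

```python
import torch

def add_noise(video, num_steps, alpha_bar, noise_level=1.0):
    """Forward noising: corrupt the training video according to the number of noise
    addition steps (which indexes the preset schedule) and a noise level that scales
    the injected noise."""
    eps = torch.randn_like(video) * noise_level
    return alpha_bar[num_steps].sqrt() * video + (1 - alpha_bar[num_steps]).sqrt() * eps
```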
  • the above-mentioned training module is also configured to: use a preset reward model to calculate the reward for the sampled video to obtain an initial reward result; use a time-decayed reward method to adjust the initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is the default reward weight corresponding to the video frame sequence contained in the sampled video.
  • the training module is further configured to: perform video segment sampling on the video frame sequence to obtain segment sampling results; and perform reward calculation on the segment sampling results using a preset reward model to obtain an initial reward result.
  • the above training module is also configured to: obtain a feature space representation of the video frame sequence; perform video segment sampling on the feature space representation to obtain a color space representation of the sampled video frames; and determine the segment sampling result based on the color space representation.
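  • A minimal sketch of segment sampling in the feature (latent) space followed by decoding only the selected frames back to color (RGB) space; the `decoder` is an assumed VAE-style component, and evenly spaced indices are one possible sampling choice.

```python
import torch

def sample_segments(latent_frames, decoder, num_segments=4):
    """Sparse segment sampling: select a few evenly spaced frames from the feature-space
    representation and decode only those frames into a color-space representation,
    which keeps the subsequent reward computation cheap."""
    num_frames = latent_frames.shape[0]
    idx = torch.linspace(0, num_frames - 1, steps=num_segments).round().long()
    rgb_frames = decoder(latent_frames[idx])     # decoded color-space frames (assumed decoder)
    return idx, rgb_frames
```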
  • the above-mentioned training module is also configured to: use a time-decayed reward method to differentially adjust the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight indicates that the current reward weight of a first video frame in the video frame sequence is higher than that of a second video frame, the first video frame being located in the middle of the video frame sequence and the second video frame being located at the edge of the video frame sequence; and generate the target reward result based on the initial reward result and the target reward weight.
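  • The differentiated, time-decayed weighting can be any function that peaks at the middle of the frame sequence and decays toward its edges; an exponential decay in the distance from the center frame, as sketched below, is one possible choice and is an assumption rather than a form mandated by the disclosure.

```python
import torch

def time_decay_weights(num_frames, decay=0.2):
    """Reward weights that peak at the middle frame and decay toward the edge frames,
    normalized to sum to one."""
    center = (num_frames - 1) / 2.0
    dist = (torch.arange(num_frames, dtype=torch.float32) - center).abs()
    weights = torch.exp(-decay * dist)
    return weights / weights.sum()

# e.g. time_decay_weights(5) gives approximately [0.17, 0.21, 0.25, 0.21, 0.17]:
# the middle frame receives the highest weight, the edge frames the lowest.
```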
  • the target video generation model is a video diffusion model
  • the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.
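  • Because the preset reward model is an image-level human-preference model, it can be applied frame by frame to the sampled (decoded) frames and then aggregated into a video-level reward; the `image_reward(text, frame)` scorer below is an assumed generic interface, not a specific library call, and is expected to return a scalar tensor per frame.

```python
import torch

def video_reward(image_reward, text, frames, weights):
    """Score each sampled frame with an image-based human-preference model and
    combine the per-frame scores into a single video-level reward."""
    scores = torch.stack([image_reward(text, frame) for frame in frames])  # (num_frames,)
    return (scores * weights).sum()
```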
  • a target text for describing the content of a video to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video.
  • the acquisition module 701 and the processing module 702 correspond to step S21 and step S22 in Example 1.
  • the examples and application scenarios implemented by the two modules and the corresponding steps are the same, but are not limited to the contents disclosed in the above-mentioned Example 1.
  • the above-mentioned modules or units may be hardware components or software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n); the above-mentioned modules may also be run, as part of the device, in the computer terminal 10 provided in Example 1.
  • FIG8 is a structural schematic diagram of another video generation device according to embodiment 4 of the present disclosure. As shown in FIG8 , the device includes:
  • the acquisition module 801 is configured to acquire a video generation call request through a first application programming interface, wherein the request data carried in the video generation call request includes: a target text, the target text is used to describe the video content to be generated, the video generation call request is used to request to call a target video generation model to perform video generation processing on the target text to obtain a target video, and the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner;
  • the return module 802 is configured to return a video generation call response through a second application programming interface, wherein the response data carried in the video generation call response includes: a target video.
  • a video generation call request can be obtained by calling a first application programming interface, and the video generation call request carries a target text for describing the video content to be generated.
  • a target video generation model is used to perform video generation processing on the target text, so that a target video can be obtained.
  • a video generation call response including the target video can then be returned. This achieves the purpose of generating a target video that meets user expectations, makes the generated target video more consistent with human aesthetic preferences and with the content of the target text, improves the video quality of the generated target video, and makes it more appealing to human viewers, thereby solving the technical problem in the related art that training a video generation model on network data yields videos of poor quality that do not meet user expectations.
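  • A minimal sketch of the two application programming interfaces follows; the JSON field names and the `generate_video` callable are illustrative assumptions about how the request and response data could be shaped.

```python
def handle_video_generation_call(call_request: dict, generate_video) -> dict:
    """Receive a video generation call request via the first API and build the call
    response returned via the second API."""
    target_text = call_request["request_data"]["target_text"]
    target_video = generate_video(target_text)   # target video generation model (assumed callable)
    return {"response_data": {"target_video": target_video}}

# usage sketch:
# response = handle_video_generation_call(
#     {"request_data": {"target_text": "Dogwood blossoms are blowing in the wind"}},
#     target_model.generate,   # hypothetical generation entry point
# )
```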
  • the acquisition module 801 and the return module 802 correspond to step S51 and step S52 in Example 2, and the examples and application scenarios implemented by the two modules and the corresponding steps are the same, but are not limited to the contents disclosed in the above-mentioned Example 1.
  • the above-mentioned modules or units can be hardware components or software components stored in a memory (e.g., memory 104) and processed by one or more processors (e.g., processors 102a, 102b, ..., 102n), and the above-mentioned modules can also be run in the computer terminal 10 provided in Example 1 as part of the device.
  • FIG9 is a structural schematic diagram of another video generation device according to embodiment 4 of the present disclosure. As shown in FIG9 , the device includes:
  • the acquisition module 901 is configured to acquire a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes: a target text, the target text is used to describe the video content to be generated;
  • the first response module 902 is configured to respond to the video generation dialogue request and return a video generation dialogue reply, wherein the information carried in the video generation dialogue reply includes a target video, the target video is obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner;
  • the display module 903 is configured to display the video generation dialogue reply in the graphical user interface.
  • in this device embodiment, a video generation dialogue request carrying a target text that describes the content of the video to be generated is obtained; in response to the request, video generation processing is performed on the target text with the target video generation model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner, that is, the pre-trained U-shaped network aligned with the image reward model through fine-tuning, so as to obtain the target video; a video generation dialogue reply carrying the target video is returned, and the reply is displayed in the graphical user interface. This achieves the purpose of generating a target video that meets the user's expectations, makes the generated target video more consistent with human aesthetic preferences and with the content of the target text, improves the video quality of the generated target video, and makes it more appealing to human viewers.
  • the acquisition module 901, the first response module 902 and the display module 903 correspond to steps S61 to S63 in Example 3, and the three modules and the corresponding steps implement the same examples and application scenarios, but are not limited to the contents disclosed in the above-mentioned Example 1.
  • the above-mentioned modules or units may be hardware components or software components stored in a memory (e.g., memory 104) and processed by one or more processors (e.g., processors 102a, 102b, ..., 102n), and the above-mentioned modules may also be part of the device and may be run in the computer terminal 10 provided in Example 1.
  • the embodiment of the present disclosure may provide a computer terminal, which may be any computer terminal device in a computer terminal group.
  • the computer terminal may also be replaced by a terminal device such as a mobile terminal.
  • the computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
  • the above-mentioned computer terminal can execute the program code of the following steps in the video generation method: obtaining a target text, wherein the target text is used to describe the video content to be generated; using a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with a preset reward model using a fine-tuning method.
  • FIG10 is a structural block diagram of a computer terminal according to Embodiment 5 of the present disclosure.
  • the computer terminal A may include: one or more (only one is shown in the figure) processors 1002, a memory 1004, a storage controller, and a peripheral interface, wherein the peripheral interface is connected to a radio frequency module, an audio module, and a display.
  • the memory can be configured to store software programs and modules, such as program instructions/modules corresponding to the video generation method and device in the embodiment of the present disclosure, and the processor executes various functional applications and data processing by running the stored software programs and modules, that is, realizing the above-mentioned video generation method.
  • the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory may further include a memory remotely arranged relative to the processor, and these remote memories may be connected to the computer terminal A via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the processor can call the information and application stored in the memory through the transmission device to execute the following steps: obtain the target text, wherein the target text is used to describe the video content to be generated; use the target video generation model to perform video generation processing on the target text to obtain the target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model using a fine-tuning method.
  • the processor may also execute the program code of the following steps: using training samples to perform sampling processing on the initial video generation model to generate a sampled video, wherein the training samples include: multiple video-text pairs, and the multiple video-text pairs each include: a training video and a training text, and the training text is used to describe the video content of the training video; using a preset reward model to calculate the reward for the sampled video to obtain a target reward result; adjusting the model parameters of the initial video generation model based on the target reward result to generate a target video generation model.
  • the processor may also execute the program code of the following steps: performing noise processing on the training video to obtain a noisy video; and performing video resampling on the initial video generation model using the noisy video and the training text to generate a sampled video.
  • the processor may also execute the following program code: obtaining the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps is used to determine the number of steps to be noised for the training video through a preset noise addition function, and the noise level is used to determine the degree of damage to the training video; performing noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain a noisy video.
  • the processor may also execute the program code of the following steps: using a preset reward model to calculate rewards for the sampled video to obtain an initial reward result; using a time-decayed reward method to adjust an initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is a default reward weight corresponding to a video frame sequence contained in the sampled video.
  • the processor may also execute the program code of the following steps: performing video segment sampling on the video frame sequence to obtain segment sampling results; performing reward calculation on the segment sampling results using a preset reward model to obtain an initial reward result.
  • the processor may further execute the program code of the following steps: obtaining a feature space representation of a video frame sequence; performing video segment sampling on the feature space representation to obtain a color space representation of the sampled video frames; and determining the segment sampling result based on the color space representation.
  • the processor may also execute the program code of the following steps: using a time-decayed reward method to perform differentiated adjustments on the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle of the video frame sequence, and the second video frame is located at the edge of the video frame sequence; generating the target reward result based on the initial reward result and the target reward weight.
  • the target video generation model is a video diffusion model
  • the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.
  • a target text for describing the content of a video to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video.
  • the structure shown in FIG. 10 is for illustration only, and the computer terminal A may also be a terminal device such as a smart phone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, and the like.
  • FIG. 10 does not limit the structure of the above-mentioned electronic device.
  • the computer terminal A may also include more or fewer components (such as a network interface, a display device, etc.) than those shown in FIG. 10, or have a configuration different from that shown in FIG. 10.
  • a person of ordinary skill in the art may understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing the relevant hardware of the terminal device; the program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
  • the embodiment of the present disclosure further provides a computer-readable storage medium.
  • the computer-readable storage medium can be used to store the program code executed by the video generation method provided in the first embodiment.
  • the computer-readable storage medium may be located in any one of the computer terminals in a computer terminal group in a computer network, or in any one of the mobile terminals in a mobile terminal group.
  • the computer-readable storage medium is configured to store program code for executing the following steps: obtaining a target text, wherein the target text is used to describe the content of the video to be generated; and using a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model using a fine-tuning method.
  • the computer-readable storage medium is configured to store program codes for executing the following steps: sampling the initial video generation model using training samples to generate a sampled video, wherein the training samples include: multiple video-text pairs, and the multiple video-text pairs each include: a training video and a training text, and the training text is used to describe the video content of the training video; using a preset reward model to calculate rewards for the sampled video to obtain a target reward result; adjusting the model parameters of the initial video generation model based on the target reward result to generate a target video generation model.
  • the computer-readable storage medium is configured to store program codes for executing the following steps: performing noise processing on the training video to obtain a noisy video; and performing video resampling on the initial video generation model using the noisy video and the training text to generate a sampled video.
  • the computer-readable storage medium is configured to store program code for executing the following steps: obtaining the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps is used to determine the number of steps to be noised for the training video through a preset noise addition function, and the noise level is used to determine the degree of damage to the training video; performing noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain a noisy video.
  • the computer-readable storage medium is configured to store program codes for executing the following steps: using a preset reward model to calculate rewards for the sampled video to obtain an initial reward result; using a time-decayed reward method to adjust an initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is a default reward weight corresponding to a video frame sequence contained in the sampled video.
  • the computer-readable storage medium is configured to store program codes for executing the following steps: performing video segment sampling on a video frame sequence to obtain segment sampling results; and performing reward calculation on the segment sampling results using a preset reward model to obtain an initial reward result.
  • the computer-readable storage medium is configured to store program code for executing the following steps: obtaining a feature space representation of a video frame sequence; performing video segmentation sampling on the feature space representation to obtain a color space representation of the segmented video frame; and determining the segmentation sampling result based on the color space representation.
  • the computer-readable storage medium is configured to store program code for executing the following steps: using a time-decayed reward method to differentially adjust the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle position of the video frame sequence, and the second video frame is located at the edge position of the video frame sequence; generating the target reward result based on the initial reward result and the target reward weight.
  • the target video generation model is a video diffusion model
  • the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.
  • the disclosed technical content can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division.
  • multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between units or modules may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on this understanding, the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method described in each embodiment of the present disclosure.
  • the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Provided are a video generation method, an electronic device, and a computer readable storage medium, relating to the technical fields of computers and video processing. The video generation method comprises: acquiring a target text, wherein the target text is used for describing video content to be generated (S21); and using a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by performing model alignment on an initial video generation model and a preset reward model in a fine-tuning manner (S22). The present invention solves the technical problems in the related art that a video generated by a video generation model obtained by training based on network data has poor quality and does not meet user expectations.

Description

视频生成方法、电子设备及计算机可读存储介质 Video generation method, electronic device and computer readable storage medium

技术领域 Technical Field

本公开涉及计算机技术、视频处理技术领域,具体而言,涉及一种视频生成方法、电子设备及计算机可读存储介质。The present disclosure relates to the fields of computer technology and video processing technology, and in particular to a video generation method, an electronic device, and a computer-readable storage medium.

背景技术Background Art

随着视频内容在互联网上的普及,高质量和个性化的视频生成需求日益增加,视频生成模型作为一种视频生成工具,能够基于给定的输入生成逼真的视频内容。With the popularity of video content on the Internet, the demand for high-quality and personalized video generation is increasing. Video generation models, as a video generation tool, can generate realistic video content based on given input.

目前,视频生成模型通常采用网络上的数据进行模型训练,由于网络上的数据大多为质量参差不齐的数据,因此导致训练得到的视频生成模型所生成的视频质量较差,与用户期望不符。At present, video generation models usually use data on the Internet for model training. Since most of the data on the Internet are of uneven quality, the videos generated by the trained video generation models are of poor quality and do not meet user expectations.

针对上述的问题,目前尚未提出有效的解决方案。To address the above-mentioned problems, no effective solution has been proposed yet.

发明内容Summary of the invention

本公开实施例提供了一种视频生成方法、电子设备及计算机可读存储介质,以至少解决相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。The embodiments of the present disclosure provide a video generation method, an electronic device, and a computer-readable storage medium to at least solve the technical problem in the related art that a video generation model is trained based on network data, resulting in that the video quality generated by the trained video generation model is poor and does not meet user expectations.

根据本公开实施例的一个方面,提供了一种视频生成方法,包括:获取目标文本,其中,目标文本用于描述待生成的视频内容;采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。According to one aspect of an embodiment of the present disclosure, there is provided a video generation method, comprising: obtaining a target text, wherein the target text is used to describe the content of a video to be generated; performing video generation processing on the target text using a target video generation model to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model using a fine-tuning method.

根据本公开实施例的另一方面,还提供了一种视频生成方法,通过终端设备提供一图形用户界面,图形用户界面所显示的内容至少部分地包含一视频生成场景,包括:响应作用于图形用户界面的第一触控操作,输入目标文本,其中,目标文本用于描述待生成的视频内容;响应作用于图形用户界面的第二触控操作,采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型;在图形用户界面内展示目标视频。According to another aspect of an embodiment of the present disclosure, a video generation method is also provided, which provides a graphical user interface through a terminal device, and the content displayed by the graphical user interface at least partially includes a video generation scene, including: in response to a first touch operation applied to the graphical user interface, inputting a target text, wherein the target text is used to describe the video content to be generated; in response to a second touch operation applied to the graphical user interface, performing video generation processing on the target text using a target video generation model to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model using a fine-tuning method; and displaying the target video in the graphical user interface.

根据本公开实施例的另一方面,还提供了一种视频生成方法,包括:获取当前输入的视频生成对话请求,其中,视频生成对话请求中携带的信息包括:目标文本,目标文本用于描述待生成的视频内容;响应于视频生成对话请求,返回视频生成对话回复,其中,视频生成对话回复中携带的信息包括:目标视频,目标视频采用目标视频生成模型对目标文本进行视频生成处理后得到,目标视频生成模型为采用微调方式对 初始视频生成模型与预设奖励模型进行模型对齐后得到的模型;在图形用户界面内展示视频生成对话回复。According to another aspect of the embodiment of the present disclosure, a video generation method is also provided, including: obtaining a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes: a target text, which is used to describe the video content to be generated; in response to the video generation dialogue request, returning a video generation dialogue reply, wherein the information carried in the video generation dialogue reply includes: a target video, which is obtained by performing video generation processing on the target text using a target video generation model, wherein the target video generation model is a fine-tuned method for The model obtained after model alignment between the initial video generation model and the preset reward model; video generation dialogue response is displayed in the graphical user interface.

根据本公开实施例的另一方面,还提供了一种电子设备,包括:存储器,存储有可执行程序;处理器,用于运行程序,其中,程序运行时执行任意一项上述的视频生成方法。According to another aspect of an embodiment of the present disclosure, an electronic device is further provided, including: a memory storing an executable program; and a processor for running the program, wherein any one of the above-mentioned video generation methods is executed when the program is running.

根据本公开实施例的另一方面,还提供了一种计算机可读存储介质,计算机可读存储介质包括存储的可执行程序,其中,在可执行程序运行时控制计算机可读存储介质所在设备执行任意一项上述的视频生成方法。According to another aspect of an embodiment of the present disclosure, a computer-readable storage medium is further provided, the computer-readable storage medium including a stored executable program, wherein when the executable program runs, the device where the computer-readable storage medium is located is controlled to execute any one of the above-mentioned video generation methods.

在本公开实施例中,通过获取用于描述待生成的视频内容的目标文本,然后基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。In the disclosed embodiment, a target text for describing the video content to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video. This achieves the purpose of generating a target video that meets user expectations, thereby achieving a technical effect of making the generated target video more consistent with human aesthetic preferences and the target text content, improving the video quality of the generated target video, and making the generated target video more popular with humans. This further solves the technical problem in the related art of training a video generation model based on network data, resulting in a poor quality video generated by the trained video generation model that does not meet user expectations.

容易注意到的是,上面的通用描述和后面的详细描述仅仅是为了对本公开进行举例和解释,并不构成对本公开的限定。It is easily noted that the above general description and the following detailed description are merely for exemplifying and explaining the present disclosure, and do not constitute a limitation of the present disclosure.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

此处所说明的附图用来提供对本公开的进一步理解,构成本公开的一部分,本公开的示意性实施例及其说明用于解释本公开,并不构成对本公开的不当限定。在附图中:The drawings described herein are used to provide a further understanding of the present disclosure and constitute a part of the present disclosure. The illustrative embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation on the present disclosure. In the drawings:

图1是根据本公开实施例1的一种视频生成方法的应用场景示意图;FIG1 is a schematic diagram of an application scenario of a video generation method according to Embodiment 1 of the present disclosure;

图2是根据本公开实施例1的一种视频生成方法的流程图;FIG2 is a flow chart of a video generation method according to Embodiment 1 of the present disclosure;

图3是根据本公开实施例1的微调流程示意图;FIG3 is a schematic diagram of a fine-tuning process according to Embodiment 1 of the present disclosure;

图4是根据本公开实施例1的一种视频生成方法的示意图;FIG4 is a schematic diagram of a video generation method according to Embodiment 1 of the present disclosure;

图5是根据本公开实施例2的一种视频生成方法的流程图;FIG5 is a flow chart of a video generation method according to Embodiment 2 of the present disclosure;

图6是根据本公开实施例3的一种视频生成方法的流程图;FIG6 is a flow chart of a video generation method according to Embodiment 3 of the present disclosure;

图7是根据本公开实施例4的一种视频生成装置的结构示意图;FIG7 is a schematic structural diagram of a video generating device according to Embodiment 4 of the present disclosure;

图8是根据本公开实施例4的另一种视频生成装置的结构示意图;FIG8 is a schematic structural diagram of another video generating device according to Embodiment 4 of the present disclosure;

图9是根据本公开实施例4的再一种视频生成装置的结构示意图;FIG9 is a schematic structural diagram of another video generating device according to Embodiment 4 of the present disclosure;

图10是根据本公开实施例5的一种计算机终端的结构框图。 FIG10 is a structural block diagram of a computer terminal according to Embodiment 5 of the present disclosure.

具体实施方式DETAILED DESCRIPTION

为了使本技术领域的人员更好地理解本公开方案,下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分的实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本公开保护的范围。In order to enable those skilled in the art to better understand the scheme of the present disclosure, the technical scheme in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only part of the embodiments of the present disclosure, not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by ordinary technicians in the field without creative work should fall within the scope of protection of the present disclosure.

需要说明的是,本公开的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本公开的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。It should be noted that the terms "first", "second", etc. in the specification and claims of the present disclosure and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data used in this way can be interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions, for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are clearly listed, but may include other steps or units that are not clearly listed or inherent to these processes, methods, products, or devices.

首先,在对本公开实施例进行描述的过程中出现的部分名词或术语适用于如下解释:First, some nouns or terms that appear in the process of describing the embodiments of the present disclosure are subject to the following explanations:

视频扩散模型(video diffusion models):一种基于深度学习的生成模型,用于生成或修改视频内容。该模型通过模拟视频数据的分布来生成新的视频帧序列。视频扩散模型通常利用大量数据进行训练,以学习如何生成逼真的视频。Video diffusion models: A deep learning-based generative model used to generate or modify video content. The model generates new video frame sequences by simulating the distribution of video data. Video diffusion models are usually trained with large amounts of data to learn how to generate realistic videos.

人类偏好(human preference):指人类用户在审视或评估内容时的主观喜好或选择倾向。在人工智能生成内容的背景下,人类偏好通常指用户对于内容的质量、风格、准确性等方面的偏好。Human preference: refers to the subjective preferences or choices of human users when reviewing or evaluating content. In the context of AI-generated content, human preference usually refers to the user's preference for content quality, style, accuracy, etc.

人类偏好模型(human preference model):一种机器学习模型,旨在捕捉和模仿人类的偏好判断。该模型通过分析人类对内容的评价来学习,以便在后续生成过程中产生更符合用户偏好的结果。该模型通常需要大量的人工标注的数据来进行训练。Human preference model: A machine learning model that aims to capture and imitate human preference judgments. The model learns by analyzing human evaluations of content in order to produce results that are more in line with user preferences in subsequent generation processes. The model usually requires a large amount of manually annotated data for training.

对齐(alignment):在人工智能生成内容的背景下,对齐通常指调整生成模型的过程,使生成模型产出的内容更符合特定的标准或目标,例如用户的偏好、特定任务的要求等。Alignment: In the context of AI-generated content, alignment usually refers to the process of adjusting the generative model so that the content produced by the generative model better meets specific standards or goals, such as user preferences, requirements of specific tasks, etc.

奖励模型(reward model):在机器学习中,奖励模型用于评估某个动作或输出的效果,通常在强化学习/奖励学习中使用。在本公开实施例的视频生成的背景下,奖励模型可以用来评估生成视频的质量,以指导模型产生更高质量的输出。Reward model: In machine learning, a reward model is used to evaluate the effect of an action or output, and is usually used in reinforcement learning/reward learning. In the context of video generation in the disclosed embodiments, a reward model can be used to evaluate the quality of the generated video to guide the model to produce higher quality output.

奖励分数(reward score):奖励分数是用来量化模型输出质量或符合度的指标。在本公开实施例的视频生成模型的应用中,奖励微调特指根据奖励模型(如人类偏好模型)对生成视频的评价得分,该分数反映了生成内容与人类偏好、目标标准或预期目标的一致性程度,通常用于指导和优化模型的训练过程,以生成更符合用户偏好或 更高质量的视频内容。Reward score: The reward score is an indicator used to quantify the quality or conformity of the model output. In the application of the video generation model of the embodiment of the present disclosure, reward fine-tuning specifically refers to the evaluation score of the generated video based on the reward model (such as the human preference model). The score reflects the consistency between the generated content and human preferences, target standards or expected goals, and is usually used to guide and optimize the training process of the model to generate more in line with user preferences or Higher quality video content.

U型网络(UNet):一种深度学习神经网络结构,被广泛应用于计算机视觉领域,特别是在图像分割任务中。U型网络结构由编码器和解码器组成,其结构类似于U字形,因此得名。编码器负责将输入图像进行特征提取和降维,而解码器则负责将编码后的特征图恢复到原始图像大小,并进行像素级别的分类或分割。U-Net: A deep learning neural network structure that is widely used in the field of computer vision, especially in image segmentation tasks. The U-Net structure consists of an encoder and a decoder, and its structure resembles a U shape, hence the name. The encoder is responsible for feature extraction and dimensionality reduction of the input image, while the decoder is responsible for restoring the encoded feature map to the original image size and performing pixel-level classification or segmentation.

模型微调(fine-tuning):是指在一个预训练的模型基础上,通过少量的数据或者领域相关的数据对模型的参数进行调整,以适应特定的任务或者数据集。通常,微调是在一个已经在大规模数据集上进行了训练的模型上进行的,这个模型通常是一个在通用任务上表现良好的深度学习模型,或者在大规模文本语料库上预训练的自然语言处理模型。Model fine-tuning: refers to adjusting the parameters of a model based on a pre-trained model through a small amount of data or domain-related data to adapt it to a specific task or dataset. Usually, fine-tuning is performed on a model that has been trained on a large dataset. This model is usually a deep learning model that performs well on general tasks, or a natural language processing model pre-trained on a large text corpus.

重采样:一种常用的数据处理方法,用于调整数据样本的大小、分布或时间间隔。在统计学和机器学习中,重采样通常用于解决样本不平衡、数据缺失或数据采集频率不一致等问题。本公开实施例中,对视频进行重采样是指将原始视频的采样率进行调整,以改变视频的播放速度或者适应不同的播放设备。重采样可以是增加采样率以提高视频质量,也可以是降低采样率以减小文件大小或适应特定的播放需求。重采样通常会导致视频的画质和流畅度发生变化。Resampling: A commonly used data processing method used to adjust the size, distribution or time interval of data samples. In statistics and machine learning, resampling is often used to solve problems such as sample imbalance, missing data or inconsistent data collection frequency. In the disclosed embodiment, resampling the video refers to adjusting the sampling rate of the original video to change the playback speed of the video or adapt to different playback devices. Resampling can be to increase the sampling rate to improve video quality, or to reduce the sampling rate to reduce the file size or to adapt to specific playback requirements. Resampling usually causes changes in the image quality and smoothness of the video.

模仿学习的判别式降维(Discriminative Dimensionality Reduction for Imitation Learning,DDIM)采样:即DDIM采样(DDIM sampling),是一种用于生成过程的方法,旨在从数据中提取出重要的信息并进行学习。该方法主要应用于模仿学习(imitation learning)领域,用于构建一个模型,使其能够模仿人类的行为。DDIM采样的主要思想是通过区分重要维度和非重要维度来降低数据的维度,其通过对数据进行判别式降维,找到对模型训练任务最有用的维度,然后利用这些维度进行模型训练和生成。具体来说,DDIM采样首先通过特征选择方法或者特征提取方法,找到对模型任务最有用的特征。然后,它通过学习一个判别式的降维模型,将数据映射到这些重要特征所在的维度上。最后,利用这些重要的维度进行模型训练和生成。Discriminative Dimensionality Reduction for Imitation Learning (DDIM) sampling: DDIM sampling is a method for the generative process that aims to extract important information from the data and learn it. This method is mainly used in the field of imitation learning to build a model that can imitate human behavior. The main idea of DDIM sampling is to reduce the dimensionality of the data by distinguishing between important and unimportant dimensions. It discriminates the dimensions of the data to find the most useful dimensions for the model training task, and then uses these dimensions for model training and generation. Specifically, DDIM sampling first finds the most useful features for the model task through feature selection methods or feature extraction methods. Then, it maps the data to the dimensions where these important features are located by learning a discriminative dimensionality reduction model. Finally, these important dimensions are used for model training and generation.

时间衰减奖励(Time-Decay Reward,TAR):一种在强化学习中用于处理未来奖励的价值随时间递减的情况的技术。在强化学习中,通常会用到折扣因子来衡量未来奖励的重要性,但是有些情况下,未来奖励的价值会随着时间的推移而递减,例如在某些任务中,较早获得的奖励可能比后来的奖励更加重要。TAR通过给予较早的奖励更高的权重,以反映出时间的影响。可以通过在计算奖励时引入时间衰减函数来实现,例如指数衰减函数或者多项式衰减函数。Time-Decay Reward (TAR): A technique used in reinforcement learning to handle situations where the value of future rewards decreases over time. In reinforcement learning, a discount factor is often used to measure the importance of future rewards, but in some cases, the value of future rewards decreases over time, such as in some tasks where earlier rewards may be more important than later rewards. TAR reflects the effect of time by giving higher weights to earlier rewards. This can be achieved by introducing a time decay function when calculating the reward, such as an exponential decay function or a polynomial decay function.

稀疏采样:一种数据采样方法,用于在大型数据集中选择部分样本进行分析或处理。在稀疏采样中,只选择数据集中的一小部分样本来代表整个数据集,以减少计算成本和时间。稀疏采样可以通过随机抽样、分层抽样或其他抽样方法来实现,确保所 选择的样本能够代表整个数据集的特征。Sparse sampling: A data sampling method used to select a portion of samples in a large data set for analysis or processing. In sparse sampling, only a small portion of the data set is selected to represent the entire data set to reduce computational cost and time. Sparse sampling can be achieved by random sampling, stratified sampling, or other sampling methods to ensure that all samples are The selected samples can represent the characteristics of the entire dataset.

低功耗广域网(Low Power Wide Area Network,LoRA):一种低功耗广域网技术,可以实现在长距离范围内的低功耗通信。LoRA技术使用了一种称为扩频频谱的调制技术,这种技术能够在低功率下实现长距离的通信。LoRA技术能够通过对LoRA设备的参数进行微调,以实现更加高效和可靠的通信,也即实现高效微调。这些参数包括发送功率、数据速率、接收灵敏度等。通过合理地调整这些参数,可以在不同的应用场景下实现更好的性能。Low Power Wide Area Network (LoRA): A low power wide area network technology that enables low power communication over long distances. LoRA technology uses a modulation technique called spread spectrum, which enables long-distance communication at low power. LoRA technology can achieve more efficient and reliable communication by fine-tuning the parameters of LoRA devices, that is, achieving efficient fine-tuning. These parameters include transmit power, data rate, receive sensitivity, etc. By adjusting these parameters reasonably, better performance can be achieved in different application scenarios.

相关技术中基于网络数据训练视频生成模型存在如下缺陷。The related art of training video generation models based on network data has the following defects.

缺陷1:由于网络上的数据大多为质量参差不齐的数据,因此导致训练得到的视频生成模型所生成的视频质量较差,与用户期望不符;Defect 1: Since most of the data on the Internet are of varying quality, the quality of the videos generated by the trained video generation model is poor, which does not meet user expectations.

缺陷2:相关技术中的视频扩散模型无法充分考虑人类的审美偏好和内容相关性。Defect 2: The video diffusion model in related technologies cannot fully consider human aesthetic preferences and content relevance.

针对上述缺陷,在本公开之前尚未提出有效的解决方案。With respect to the above-mentioned defects, no effective solution has been proposed before the present disclosure.

实施例1Example 1

根据本公开实施例,提供了一种视频生成方法,需要说明的是,在附图的流程图示出的步骤可以在诸如一组计算机可执行指令的计算机系统中执行,并且,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。According to an embodiment of the present disclosure, a video generation method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in an order different from that shown here.

本公开实施例一所提供的方法实施例可以在移动终端、计算机终端或者类似的运算装置中执行。图1示出了一种用于实现视频生成方法的计算机终端(或移动设备)的硬件结构框图。如图1所示,计算机终端10(或移动设备)可以包括一个或多个(图中采用102a,102b,……,102n来示出)处理器102(处理器102可以包括但不限于微处理器MCU或可编程逻辑器件FPGA等的处理装置)、用于存储数据的存储器104、以及用于通信功能的传输装置106。除此以外,还可以包括:显示器、输入/输出接口(I/O接口)、通用串行总线(USB)端口(可以作为BUS总线的端口中的一个端口被包括)、网络接口、电源和/或相机。本领域普通技术人员可以理解,图1所示的结构仅为示意,其并不对上述电子装置的结构造成限定。例如,计算机终端10还可包括比图1中所示更多或者更少的组件,或者具有与图1所示不同的配置。The method embodiment provided in the first embodiment of the present disclosure can be executed in a mobile terminal, a computer terminal or a similar computing device. FIG1 shows a hardware structure block diagram of a computer terminal (or mobile device) for implementing a video generation method. As shown in FIG1 , the computer terminal 10 (or mobile device) may include one or more (102a, 102b, ..., 102n are used in the figure to show) processors 102 (the processor 102 may include but is not limited to a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the BUS bus), a network interface, a power supply and/or a camera. It can be understood by those skilled in the art that the structure shown in FIG1 is only for illustration and does not limit the structure of the above-mentioned electronic device. For example, the computer terminal 10 may also include more or fewer components than those shown in FIG1 , or have a configuration different from that shown in FIG1 .

应当注意到的是上述一个或多个处理器102和/或其他数据处理电路在本文中通常可以被称为“数据处理电路”。该数据处理电路可以全部或部分的体现为软件、硬件、固件或其他任意组合。此外,数据处理电路可为单个独立的处理模块,或全部或部分的结合到计算机终端10(或移动设备)中的其他元件中的任意一个内。如本公开实施例中所涉及到的,该数据处理电路作为一种处理器控制(例如与接口连接的可变电阻终端路径的选择)。It should be noted that the one or more processors 102 and/or other data processing circuits described above may generally be referred to herein as "data processing circuits". The data processing circuits may be embodied in whole or in part as software, hardware, firmware, or any other combination thereof. In addition, the data processing circuit may be a single independent processing module, or may be incorporated in whole or in part into any of the other components in the computer terminal 10 (or mobile device). As involved in the embodiments of the present disclosure, the data processing circuit acts as a processor control (e.g., selection of a variable resistor terminal path connected to an interface).

存储器104可用于存储应用软件的软件程序以及模块,如本公开实施例中的视频 生成方法对应的程序指令/数据存储装置,处理器102通过运行存储在存储器104内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的视频生成方法。存储器104可包括高速随机存储器,还可包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器104可进一步包括相对于处理器102远程设置的存储器,这些远程存储器可以通过网络连接至计算机终端10。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 104 can be used to store software programs and modules of application software, such as the video The program instructions/data storage device corresponding to the generation method, the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, the above-mentioned video generation method is realized. The memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include a memory remotely arranged relative to the processor 102, and these remote memories may be connected to the computer terminal 10 via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

传输装置106用于经由一个网络接收或者发送数据。上述的网络具体实例可包括计算机终端10的通信供应商提供的无线网络。在一个实例中,传输装置106包括一个网络适配器(Network Interface Controller,NIC),其可通过基站与其他网络设备相连从而可与互联网进行通讯。在一个实例中,传输装置106可以为射频(Radio Frequency,RF)模块,其用于通过无线方式与互联网进行通讯。The transmission device 106 is used to receive or send data via a network. The specific example of the above network may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 can be a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.

显示器可以例如触摸屏式的液晶显示器(LCD),该液晶显示器可使得用户能够与计算机终端10(或移动设备)的用户界面进行交互。The display may be, for example, a touch screen liquid crystal display (LCD) that enables a user to interact with a user interface of the computer terminal 10 (or mobile device).

在上述运行环境下,本公开提供了如图2所示的视频生成方法。图2是根据本公开实施例1的一种视频生成方法的流程图。如图2所示,该方法可以包括如下步骤:In the above operating environment, the present disclosure provides a video generation method as shown in FIG2. FIG2 is a flow chart of a video generation method according to Embodiment 1 of the present disclosure. As shown in FIG2, the method may include the following steps:

步骤S21,获取目标文本,其中,目标文本用于描述待生成的视频内容;Step S21, obtaining a target text, wherein the target text is used to describe the video content to be generated;

步骤S22,采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。Step S22, using a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner.

目标文本可以理解为输入至视频生成模型的文本,例如输入至本公开实施例中的目标视频生成模型的文本,即作为目标视频生成模型的输入。目标文本用于描述待生成的视频内容,例如用于描述用户所期望生成的视频内容。The target text can be understood as the text input into the video generation model, such as the text input into the target video generation model in the embodiment of the present disclosure, that is, as the input of the target video generation model. The target text is used to describe the video content to be generated, for example, to describe the video content that the user expects to generate.

示例性地，若用户所期望生成的视频内容为山茱萸花在风中飘荡（Dogwood blossoms are blowing in the wind）的视频，则目标文本可以为“山茱萸花在风中飘荡”，或者“Dogwood blossoms are blowing in the wind”等。可以理解的是，目标文本可以采用自然语言文字进行描述，例如汉语、英语、日语等，此处不予限制。For example, if the video content that the user expects to generate is a video of dogwood blossoms blowing in the wind, the target text may be "山茱萸花在风中飘荡" or "Dogwood blossoms are blowing in the wind", etc. It can be understood that the target text may be described in natural language, such as Chinese, English, Japanese, etc., which is not limited here.

初始视频生成模型可以理解为一种基于文本生成视频的初始模型,示例性地,初始视频生成模型可以为预训练的U型网络(Pre-trained UNet),即为在大规模数据集上进行了预训练的U型网络模型。考虑到在大规模数据集上进行预训练,能够使模型学习到丰富的图像特征和语义信息,从而提高模型在特定任务上的泛化能力和准确性,因此本公开采用预训练的U型网络模型作为初始视频生成模型能够使得最终得到的目标视频生成模型生成的视频更加准确。The initial video generation model can be understood as an initial model for generating videos based on text. For example, the initial video generation model can be a pre-trained U-type network (Pre-trained UNet), that is, a U-type network model pre-trained on a large-scale dataset. Considering that pre-training on a large-scale dataset can enable the model to learn rich image features and semantic information, thereby improving the generalization ability and accuracy of the model on specific tasks, the present disclosure uses a pre-trained U-type network model as the initial video generation model to make the video generated by the final target video generation model more accurate.

预设奖励模型可以为用于评估生成视频的质量，从而指导视频生成模型产生更高质量的输出的模型，例如预设奖励模型可以通过对生成视频进行评价得分，从而通过奖励分数来量化视频生成模型的输出质量或符合度。The preset reward model may be a model used to evaluate the quality of the generated video, thereby guiding the video generation model to produce higher-quality output. For example, the preset reward model may evaluate and score the generated video, thereby quantifying the output quality or conformity of the video generation model through the reward score.

需要注意的是,考虑到最终得到的目标视频生成模型应充分考虑人类的审美偏好和内容相关性,才能生成满足用户期望的视频,因此本公开实施例中的预设奖励模型采用基于图像的人类偏好模型,即图像奖励模型,该图像奖励模型能够基于人类的审美偏好和内容相关性评估生成视频的质量,从而使得最终得到的目标视频生成模型生成的视频更加符合用户期望。It should be noted that, considering that the final target video generation model should fully consider human aesthetic preferences and content relevance in order to generate videos that meet user expectations, the preset reward model in the embodiment of the present disclosure adopts an image-based human preference model, namely an image reward model. The image reward model can evaluate the quality of the generated video based on human aesthetic preferences and content relevance, so that the video generated by the final target video generation model is more in line with user expectations.

目标视频生成模型为本公开实施例提出的采用微调(fine-tuning)方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型,也即通过采用模型微调的方式对预训练的U型网络与图像奖励模型进行模型对齐后得到的模型,从而能够根据目标视频生成模型准确生成符合用户偏好且满足用户期望的视频内容。The target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning the embodiment of the present disclosure, that is, a model obtained by aligning the pre-trained U-type network with the image reward model by fine-tuning the model, so that video content that meets user preferences and expectations can be accurately generated according to the target video generation model.
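The following is a minimal sketch, in Python, of the inference interface described above. The class and method names (TargetVideoGenerationModel, generate, GeneratedVideo) are hypothetical placeholders rather than names used by the disclosure, and the sampling internals are deliberately elided.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GeneratedVideo:
    prompt: str                                          # target text that conditioned generation
    frames: List[object] = field(default_factory=list)   # generated RGB frames

class TargetVideoGenerationModel:
    """Stand-in for the reward-aligned (fine-tuned) video diffusion model."""

    def __init__(self, pretrained_unet, aligned_weights=None):
        self.unet = pretrained_unet              # pre-trained UNet backbone
        self.aligned_weights = aligned_weights   # parameters learned during reward alignment

    def generate(self, target_text: str, num_frames: int = 16) -> GeneratedVideo:
        # A real implementation would run the diffusion sampling loop conditioned
        # on target_text; only the interface corresponding to steps S21/S22 is shown.
        frames = [None] * num_frames
        return GeneratedVideo(prompt=target_text, frames=frames)

# Steps S21/S22 in code form:
model = TargetVideoGenerationModel(pretrained_unet=None)
target_video = model.generate("Dogwood blossoms are blowing in the wind")
```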

本公开实施例中,通过获取用于描述待生成的视频内容的目标文本,然后基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,能够使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,也即得到了更加符合用户期望的视频,进而能够提高生成得到的视频内容的个性化水平,还能在视频生成领域开辟新的应用前景。In the disclosed embodiment, by obtaining a target text for describing the video content to be generated, and then performing video generation processing on the target text based on a target video generation model obtained by aligning the initial video generation model with a preset reward model in a fine-tuning manner, that is, aligning the pre-trained U-type network with the image reward model through fine-tuning to obtain a fine-tuned target video generation model, and using the fine-tuned target video generation model to generate a video based on the target text, thereby obtaining a target video. This can make the generated target video more consistent with human aesthetic preferences and the target text content, improve the video quality of the generated target video, that is, obtain a video that better meets user expectations, thereby improving the personalization level of the generated video content, and opening up new application prospects in the field of video generation.

本公开实施例提供的上述视频生成方法可以但不限于应用于电商服务、教育服务、法律服务、医疗服务、会议服务、社交网络服务、金融产品服务、物流服务和导航服务等领域中涉及视频生成的应用场景中,例如:电商服务中的生成商品展示内容的场景、教育服务中的生成学习内容视频的场景、法律服务中生成案件相关视频的场景等,此处不予限制。The above-mentioned video generation method provided by the embodiments of the present disclosure can be applied to, but is not limited to, application scenarios involving video generation in the fields of e-commerce services, educational services, legal services, medical services, conference services, social network services, financial product services, logistics services, and navigation services. For example: the scenario of generating product display content in e-commerce services, the scenario of generating learning content videos in educational services, the scenario of generating case-related videos in legal services, etc., which are not limited here.

采用本公开实施例,通过获取用于描述待生成的视频内容的目标文本,然后基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。According to the disclosed embodiment, a target text for describing the content of a video to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video. This achieves the purpose of generating a target video that meets user expectations, thereby achieving a technical effect of making the generated target video more consistent with human aesthetic preferences and the content of the target text, improving the video quality of the generated target video, and making the generated target video more popular with humans. This further solves the technical problem in the related art of training a video generation model based on network data, resulting in a poor quality video generated by the trained video generation model that does not meet user expectations.

在一种可选的实施例中，目标视频生成模型为视频扩散模型，预设奖励模型为图像奖励模型，其中，图像奖励模型用于对视频扩散模型进行偏好学习。In an optional embodiment, the target video generation model is a video diffusion model, and the preset reward model is an image reward model, where the image reward model is used to perform preference learning on the video diffusion model.

本公开实施例中,目标视频生成模型可以为视频扩散模型(video diffusion models)、自回归模型等,预设奖励模型可以为用于对视频扩散模型进行偏好学习的图像奖励模型。In the disclosed embodiment, the target video generation model may be a video diffusion model, an autoregressive model, etc., and the preset reward model may be an image reward model for performing preference learning on the video diffusion model.

在一种可选的实施例中,该视频生成方法还包括如下方法步骤:In an optional embodiment, the video generation method further includes the following method steps:

步骤S23,采用训练样本对初始视频生成模型进行采样处理,生成采样视频,其中,训练样本包括:多个视频文本对,多个视频文本对均包括:训练视频与训练文本,训练文本用于描述训练视频的视频内容;Step S23, using training samples to perform sampling processing on the initial video generation model to generate sampled videos, wherein the training samples include: a plurality of video-text pairs, and the plurality of video-text pairs each include: a training video and a training text, and the training text is used to describe the video content of the training video;

步骤S24,采用预设奖励模型对采样视频进行奖励计算,得到目标奖励结果;Step S24, using a preset reward model to calculate rewards for the sampled video to obtain a target reward result;

步骤S25,基于目标奖励结果对初始视频生成模型的模型参数进行调节,生成目标视频生成模型。Step S25, adjusting the model parameters of the initial video generation model based on the target reward result to generate a target video generation model.

训练样本可以理解为待进行模型微调使用的训练样本,示例性地,可以为从预训练数据集中选取出的部分视频文本,训练样本可以根据任意规则进行选取,本公开实施例不予限制。The training samples can be understood as training samples to be used for model fine-tuning. For example, they can be part of the video text selected from the pre-training data set. The training samples can be selected according to any rules, which is not limited in the embodiments of the present disclosure.

训练样本包括多个视频文本对,每个视频文本对都包括训练视频与对应的训练文本,其中,训练文本为用于描述训练视频的视频内容的文本内容。The training samples include a plurality of video-text pairs, each of which includes a training video and a corresponding training text, wherein the training text is text content used to describe the video content of the training video.

采样视频可以理解为利用初始视频生成模型对训练样本进行采样处理从而得到的视频,也即为利用预训练的U型网络模型对训练视频和训练文本进行采样处理,从而生成得到的视频。The sampled video can be understood as a video obtained by sampling the training samples using the initial video generation model, that is, a video generated by sampling the training video and training text using the pre-trained U-shaped network model.

目标奖励结果为根据预设奖励模型对采样视频进行奖励计算,得到的采样视频的奖励分值,该目标奖励结果用于反映采样视频与人类的审美偏好和文本内容是否相符,可以记为R。The target reward result is the reward score of the sampled video obtained by calculating the reward for the sampled video according to the preset reward model. The target reward result is used to reflect whether the sampled video is consistent with human aesthetic preferences and text content, and can be denoted as R.

本公开实施例中,在微调阶段开始之前,可以先从预训练数据集中选取出的部分视频文本作为训练样本,然后采用训练样本对初始视频生成模型进行采样处理,即利用预训练的U型网络模型对训练样本中的训练视频和训练文本进行采样处理,从而生成采样视频。再采用预设奖励模型对采样视频进行奖励计算,也即采用图像奖励模型对生成的采样视频进行奖励计算,得到该采样视频对应的目标奖励结果。最后基于目标奖励结果对初始视频生成模型的模型参数进行调节,也即基于目标奖励结果对初始视频生成模型进行模型微调,从而得到目标视频生成模型。In the disclosed embodiment, before the fine-tuning phase begins, some video texts can be selected from the pre-training data set as training samples, and then the initial video generation model is sampled using the training samples, that is, the training videos and training texts in the training samples are sampled using the pre-trained U-shaped network model to generate a sampled video. Then, a preset reward model is used to calculate rewards for the sampled video, that is, an image reward model is used to calculate rewards for the generated sampled video to obtain the target reward result corresponding to the sampled video. Finally, the model parameters of the initial video generation model are adjusted based on the target reward result, that is, the initial video generation model is fine-tuned based on the target reward result to obtain the target video generation model.
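To make the flow of steps S23–S25 concrete, below is a compressed sketch of the fine-tuning loop in a PyTorch style. Every name (TinyUNet, add_noise, sample_video_ddim, image_reward) is a toy stand-in introduced for illustration only; the real method uses the pre-trained UNet, the diffusion noising process, DDIM sampling, and an image-based human preference model in their place.

```python
import torch

# Toy stand-ins so the sketch runs end to end; all are hypothetical placeholders.
class TinyUNet(torch.nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, z, text_emb):
        return self.net(z + text_emb)

def add_noise(z, tau):
    # Corrupt the training video latent; tau in [0, 1] controls the damage level.
    return (1 - tau) * z + tau * torch.randn_like(z)

def sample_video_ddim(unet, z_noisy, text_emb, num_steps=4):
    # Extremely simplified stand-in for DDIM denoising from the noised latent.
    z = z_noisy
    for _ in range(num_steps):
        z = z - 0.1 * unet(z, text_emb)
    return z

def image_reward(frames, text_emb):
    # Placeholder reward; the disclosure uses an image-based human preference model.
    return -(frames - text_emb).pow(2).mean()

def fine_tune(unet, pairs, tau=0.6, lr=1e-4):
    opt = torch.optim.AdamW(unet.parameters(), lr=lr)
    for video_latent, text_emb in pairs:              # step S23: sample per video-text pair
        noisy = add_noise(video_latent, tau)
        sampled = sample_video_ddim(unet, noisy, text_emb)
        reward = image_reward(sampled, text_emb)      # step S24: reward calculation
        loss = -reward                                # maximizing reward
        opt.zero_grad()
        loss.backward()
        opt.step()                                    # step S25: adjust model parameters
    return unet

unet = TinyUNet()
pairs = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(2)]
fine_tune(unet, pairs)
```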

在一种可选的实施例中,在步骤S23中,采用训练样本对初始视频生成模型进行采样处理,生成采样视频,包括如下方法步骤:In an optional embodiment, in step S23, the initial video generation model is sampled using the training sample to generate a sampled video, including the following method steps:

步骤S231,对训练视频进行加噪处理,得到加噪视频;Step S231, performing noise processing on the training video to obtain a noisy video;

步骤S232,采用加噪视频与训练文本对初始视频生成模型进行视频重采样,生成采样视频。 Step S232, using the noisy video and the training text to perform video resampling on the initial video generation model to generate a sampled video.

本公开实施例中,在采用训练样本对初始视频生成模型进行采样处理时,可以对训练样本中的训练视频进行加噪处理,从而得到加噪视频,然后采用得到的加噪视频与训练样本中的训练文本对初始视频生成模型进行视频重采样,从而生成采样视频。In the embodiment of the present disclosure, when the training sample is used to sample the initial video generation model, the training video in the training sample can be denoised to obtain a noisy video, and then the obtained noisy video and the training text in the training sample are used to resample the initial video generation model to generate a sampled video.

示例性地,可以对训练视频实施带有噪声的扩散过程,从而得到加噪视频。然后对采用得到的加噪视频与训练样本中的训练文本对初始视频生成模型进行DDIM采样,从而生成采样视频。Exemplarily, a diffusion process with noise may be performed on the training video to obtain a noisy video, and then the initial video generation model is sampled by DDIM using the obtained noisy video and the training text in the training sample to generate a sampled video.
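For reference, the deterministic DDIM update that such resampling relies on can be written as below. The tensor shapes and the way the noise prediction is obtained are assumptions for illustration; only the update rule itself (standard DDIM with eta = 0) is fixed.

```python
import torch

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    z_t:            current noisy latent
    eps_pred:       noise predicted by the UNet at the current step
    alpha_bar_t:    cumulative signal coefficient at the current step
    alpha_bar_prev: cumulative signal coefficient at the previous (less noisy) step
    """
    # Estimate the clean latent implied by the predicted noise.
    z0_hat = (z_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    # Move toward the previous step along the deterministic DDIM trajectory.
    return alpha_bar_prev.sqrt() * z0_hat + (1 - alpha_bar_prev).sqrt() * eps_pred

# Example with dummy tensors; (batch, channels, frames, height, width) is an assumed layout.
z_t = torch.randn(1, 4, 16, 32, 32)
eps = torch.randn_like(z_t)
z_prev = ddim_step(z_t, eps, torch.tensor(0.5), torch.tensor(0.7))
```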

可以看出,相比于传统的生成过程方法,本公开通过DDIM采样对视频进行重采样能够更有效地利用数据信息,提高模型的生成能力和泛化能力,能够帮助模型更好地理解数据的结构和特征,从而更好地模仿人类的行为。It can be seen that compared with the traditional generation process method, the resampling of videos through DDIM sampling in the present invention can more effectively utilize data information, improve the generation and generalization capabilities of the model, and help the model better understand the structure and characteristics of the data, thereby better imitating human behavior.

在一种可选的实施例中,在步骤S231中,对训练视频进行加噪处理,得到加噪视频,包括如下方法步骤:In an optional embodiment, in step S231, performing noise processing on the training video to obtain a noisy video includes the following method steps:

步骤S2311,获取训练视频对应的加噪步数与噪声等级,其中,加噪步数用于通过预设加噪函数确定训练视频待加噪的步数,噪声等级用于确定对训练视频的破坏程度;Step S2311, obtaining the number of noise adding steps and the noise level corresponding to the training video, wherein the number of noise adding steps is used to determine the number of steps to be noise added to the training video through a preset noise adding function, and the noise level is used to determine the degree of damage to the training video;

步骤S2312,基于加噪步数与噪声等级对训练视频进行加噪处理,得到加噪视频。Step S2312: Noise the training video based on the number of noise addition steps and the noise level to obtain a noisy video.

预设加噪函数可以表示为d(τ,D),用于根据噪声等级τ和加噪步数D的值计算出训练视频应该加噪到多少步,输出结果通常在1到1000之间。The preset noise adding function can be expressed as d(τ, D), which is used to calculate how many steps the training video should be noised to according to the value of the noise level τ and the number of noise adding steps D. The output result is usually between 1 and 1000.

加噪步数D用于通过预设加噪函数d()确定训练视频待加噪的步数,通常设置为20。The number of denoising steps D is used to determine the number of steps to be denoised for the training video by using a preset denoising function d(), and is usually set to 20.

噪声等级τ用于确定对训练视频的破坏程度,取值范围在0到1之间。The noise level τ is used to determine the degree of damage to the training video, and its value range is between 0 and 1.

示例性地,若本公开实施例选用的视频生成模型为扩散模型,则加噪步数D表示扩散模型生成过程中使用的步数。预设加噪函数d()能够根据噪声等级τ和加噪步数D来确定视频加噪的程度,以便在扩散模型中生成具有一定程度噪声的视频。Exemplarily, if the video generation model selected in the embodiment of the present disclosure is a diffusion model, the number of noise addition steps D represents the number of steps used in the diffusion model generation process. The preset noise addition function d() can determine the degree of video noise addition according to the noise level τ and the number of noise addition steps D, so as to generate a video with a certain degree of noise in the diffusion model.

本公开实施例中,对训练视频进行加噪处理时,可以通过获取训练视频对应的加噪步数d与噪声等级τ,然后基于加噪步数d与噪声等级τ对训练视频进行加噪处理,从而生成具有一定程度噪声的加噪视频。In the disclosed embodiment, when performing noise processing on a training video, the number of noise adding steps d and the noise level τ corresponding to the training video can be obtained, and then the training video can be noise processed based on the number of noise adding steps d and the noise level τ, thereby generating a noisy video with a certain degree of noise.
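One way to read the function d(τ, D) is: pick the DDIM step index τ·D and map it back onto the underlying diffusion timestep scale (1 to 1000). The mapping below is an illustrative assumption consistent with the ranges given in the text (τ in [0, 1], D typically 20, output between 1 and 1000), not the exact formula of the disclosure.

```python
def d(tau: float, D: int = 20, T: int = 1000) -> int:
    """Return the diffusion timestep (1..T) up to which the training video is noised.

    tau: noise level in [0, 1], controlling how strongly the video is corrupted
    D:   number of DDIM sampling steps (the text uses D = 20)
    T:   length of the underlying diffusion schedule (assumed to be 1000 here)
    """
    k = round(tau * D)                      # DDIM step index to start denoising from
    t = max(1, min(T, round(k * T / D)))    # map the index onto the 1..1000 timestep scale
    return t

# Under this assumption, tau = 0.6 with D = 20 noises the video up to step 600 of 1000,
# so DDIM resampling only has to run the last 60% of the generation trajectory.
print(d(0.6))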

可以看出,相比于相关技术中从文本开始进行加噪处理,本公开实施例采用DDIM采样,从视频开始进行加噪处理的方式仅需完整生成流程计算量的τ比例,因此计算量较小,能够有效节约计算资源。It can be seen that compared with the related art of starting the noise addition process from the text, the embodiment of the present disclosure adopts DDIM sampling. The method of starting the noise addition process from the video only requires a τ proportion of the calculation amount of the complete generation process, so the calculation amount is small, which can effectively save computing resources.

在一种可选的实施例中,在步骤S24中,采用预设奖励模型对采样视频进行奖励计算,得到目标奖励结果,包括如下方法步骤:In an optional embodiment, in step S24, a preset reward model is used to calculate a reward for the sampled video to obtain a target reward result, including the following method steps:

步骤S241,采用预设奖励模型对采样视频进行奖励计算,得到初始奖励结果;Step S241, using a preset reward model to calculate rewards for the sampled video to obtain an initial reward result;

步骤S242，采用时间衰减奖励方式调节初始奖励结果对应的初始奖励权重，生成目标奖励结果，其中，初始奖励权重为采样视频包含的视频帧序列对应的默认奖励权重。Step S242, using a time-decay reward method to adjust the initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is the default reward weight corresponding to the video frame sequence contained in the sampled video.

本公开实施例中,生成采样视频后,在采用预设奖励模型对采样视频进行奖励计算时,可以采用预设奖励模型对采样视频进行奖励计算,即采用基于图像的人类偏好模型,也即采用图像奖励模型进行奖励计算,从而得到初始奖励结果,即该预设奖励模型的原始输出结果。In the disclosed embodiment, after the sampled video is generated, when the preset reward model is used to calculate the reward for the sampled video, the preset reward model can be used to calculate the reward for the sampled video, that is, an image-based human preference model is used, that is, an image reward model is used to calculate the reward, thereby obtaining an initial reward result, that is, the original output result of the preset reward model.

示例性地,可以将生成的采样视频以及生成采样视频时输入的文本输入至图像奖励模型,从而基于图像奖励模型输出得到分数值,即得到初始奖励结果。图像奖励模型输出的分数值可以在0到1之间,分数越大表示视频质量越好,此处不予限制。For example, the generated sample video and the text input when generating the sample video can be input into the image reward model, so as to obtain a score value based on the output of the image reward model, that is, to obtain an initial reward result. The score value output by the image reward model can be between 0 and 1, and a larger score indicates a better video quality, which is not limited here.
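A minimal sketch of this scoring step is given below, assuming the image reward model is exposed as a callable that scores a single frame against the text. The `score_image` callable and the frame-averaging rule are assumptions for illustration; the disclosure only requires an image-based human preference model that returns a score in [0, 1].

```python
from typing import Callable, Sequence

def video_reward(frames: Sequence[object], text: str,
                 score_image: Callable[[object, str], float]) -> float:
    """Average per-frame preference scores to get an initial video reward in [0, 1].

    `score_image` stands in for an image-based human preference (image reward) model.
    """
    if not frames:
        return 0.0
    return sum(score_image(frame, text) for frame in frames) / len(frames)

# Example with a dummy scorer that always returns 0.5:
r = video_reward(frames=[object(), object()],
                 text="Dogwood blossoms are blowing in the wind",
                 score_image=lambda frame, text: 0.5)
```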

此外,为了提高模型的训练效果,在得到初始奖励结果后,还可以采用时间衰减奖励(Time-Decay Reward,TAR)方式调节初始奖励结果对应的初始奖励权重,即调节采样视频包含的视频帧序列对应的默认奖励权重,从而生成目标奖励结果。In addition, in order to improve the training effect of the model, after obtaining the initial reward result, the time-decay reward (TAR) method can be used to adjust the initial reward weight corresponding to the initial reward result, that is, adjust the default reward weight corresponding to the video frame sequence contained in the sampled video to generate the target reward result.

在一种可选的实施例中,在步骤S241中,采用预设奖励模型对采样视频进行奖励计算,得到初始奖励结果,包括如下方法步骤:In an optional embodiment, in step S241, a preset reward model is used to calculate a reward for a sampled video to obtain an initial reward result, including the following method steps:

步骤S2411,对视频帧序列进行视频分段采样,得到分段采样结果;Step S2411, performing video segment sampling on the video frame sequence to obtain segment sampling results;

步骤S2412,采用预设奖励模型对分段采样结果进行奖励计算,得到初始奖励结果。Step S2412, using a preset reward model to calculate rewards for the segmented sampling results to obtain an initial reward result.

本公开实施例中,在采用预设奖励模型对采样视频进行奖励计算时,为了提高学习过程的效率,可以采用分段视频奖励,对视频帧序列进行视频分段采样(segmental sampling),也即对视频进行稀疏采样,将视频拆分成几个片段,对多个连续视频帧进行分组,从而得到分段采样结果,然后采用预设奖励模型对分段采样结果进行奖励计算,得到初始奖励结果。In the disclosed embodiments, when a preset reward model is used to calculate rewards for sampled videos, in order to improve the efficiency of the learning process, segmented video rewards may be used to perform segmental sampling on video frame sequences, that is, sparse sampling is performed on the video, the video is split into several segments, and multiple continuous video frames are grouped to obtain segmented sampling results, and then the preset reward model is used to calculate rewards for the segmented sampling results to obtain initial reward results.

示例性地,以采样视频为16帧的视频为例,可以将16帧的视频以每4帧为一组,从而分成4组,采用预设奖励模型对分成的4组进行奖励计算,得到初始奖励结果,以提高模型的训练效果。For example, taking a sampled video with 16 frames as an example, the 16-frame video can be divided into 4 groups with 4 frames in each group, and the preset reward model is used to calculate the rewards for the 4 groups to obtain the initial reward results, so as to improve the training effect of the model.
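The grouping itself is straightforward; a sketch follows. Scoring one representative frame per segment is an assumption made here to keep the example short, since the text only specifies sparse, segment-wise sampling of consecutive frames.

```python
from typing import List, Sequence

def segment_frames(frames: Sequence[object], group_size: int = 4) -> List[List[object]]:
    """Split a frame sequence into consecutive groups (e.g. 16 frames -> 4 groups of 4)."""
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]

def segmental_reward(frames, text, score_image, group_size=4):
    """Score one representative frame per segment instead of every frame."""
    segments = segment_frames(frames, group_size)
    reps = [seg[len(seg) // 2] for seg in segments]   # middle frame of each segment (assumed)
    return sum(score_image(f, text) for f in reps) / len(reps)

groups = segment_frames(list(range(16)))   # -> [[0..3], [4..7], [8..11], [12..15]]
```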

在一种可选的实施例中,在步骤S2411中,对视频帧序列进行视频分段采样,得到分段采样结果,包括如下方法步骤:In an optional embodiment, in step S2411, performing video segment sampling on the video frame sequence to obtain segment sampling results includes the following method steps:

步骤S24111,获取视频帧序列的特征空间表示;Step S24111, obtaining a feature space representation of a video frame sequence;

步骤S24112,对特征空间表示进行视频分段采样,得到分段后视频帧的颜色空间表示;Step S24112, performing video segment sampling on the feature space representation to obtain a color space representation of the segmented video frame;

步骤S24113,基于颜色空间表示确定分段采样结果。Step S24113, determine the segmented sampling result based on the color space representation.

本公开实施例中,在对视频帧序列进行视频分段采样时,可以通过获取视频帧序列的特征空间表示z_0,然后对特征空间表示z_0进行视频分段采样,从而得到分段后视频帧的颜色(RGB)空间表示x_0^g,最终基于颜色空间表示x_0^g确定分段采样结果。 In the disclosed embodiment, when performing video segmentation sampling on a video frame sequence, the feature space representation z_0 of the video frame sequence can be obtained, and then the feature space representation z_0 can be subjected to video segmentation sampling to obtain the color (RGB) space representation x_0^g of the segmented video frame, and finally the segmentation sampling result can be determined based on the color space representation x_0^g.

在一种可选的实施例中,在步骤S242中,采用时间衰减奖励方式调节初始奖励结果对应的初始奖励权重,生成目标奖励结果,包括如下方法步骤:In an optional embodiment, in step S242, the initial reward weight corresponding to the initial reward result is adjusted by using a time decay reward method to generate a target reward result, including the following method steps:

步骤S2421,采用时间衰减奖励方式对初始奖励结果对应的初始奖励权重进行差异化调节,得到目标奖励权重,其中,目标奖励权重用于表示视频帧序列中第一视频帧的当前奖励权重高于第二视频帧的当前奖励权重,第一视频帧位于视频帧序列的中间位置,第二视频帧位于视频帧序列的边缘位置;Step S2421, using a time decay reward method to differentially adjust the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle of the video frame sequence, and the second video frame is located at the edge of the video frame sequence;

步骤S2422,基于初始奖励结果与目标奖励权重生成目标奖励结果。Step S2422, generating a target reward result based on the initial reward result and the target reward weight.

本公开实施例中,得到初始奖励结果后,在采用时间衰减奖励方式调节初始奖励结果对应的初始奖励权重时,可以采用时间衰减奖励方式对初始奖励结果对应的初始奖励权重进行差异化调节,通过将视频的中间帧的权重调整至较高,而边缘帧的权重调整至较低,实现有效且高效的微调,从而得到目标奖励权重。In the disclosed embodiment, after obtaining the initial reward result, when the time decay reward method is used to adjust the initial reward weight corresponding to the initial reward result, the time decay reward method can be used to perform differentiated adjustment on the initial reward weight corresponding to the initial reward result. By adjusting the weight of the middle frame of the video to a higher value and the weight of the edge frame to a lower value, effective and efficient fine-tuning can be achieved to obtain the target reward weight.

即本公开可以采用时间衰减奖励方式，将视频帧序列中位于视频帧序列的中间位置的第一视频帧的当前奖励权重，调整至高于位于视频帧序列的边缘位置的第二视频帧的当前奖励权重。示例性地，以采样视频为16帧的视频为例，通过时间衰减奖励调整各视频帧的奖励权重，将16帧中最中间视频帧的系数调整为1，最旁边视频帧的系数调整为更小的值，从而生成目标奖励结果。That is, the present disclosure may use a time-decay reward method to adjust the current reward weight of the first video frame located in the middle of the video frame sequence to be higher than the current reward weight of the second video frame located at the edge of the video frame sequence. For example, taking a sampled video of 16 frames as an example, the reward weight of each video frame is adjusted by the time-decay reward: the coefficient of the middle-most frame among the 16 frames is adjusted to 1, and the coefficient of the outermost frames is adjusted to a smaller value, thereby generating the target reward result.
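A simple weighting scheme of this shape is sketched below. The linear decay and the value of `min_weight` are illustrative assumptions; the text only fixes the centre-frame coefficient at 1 and makes the edge-frame coefficients smaller.

```python
def time_decay_weights(num_frames: int = 16, min_weight: float = 0.5):
    """Per-frame reward weights that peak at the centre of the clip and decay toward the edges."""
    centre = max((num_frames - 1) / 2.0, 1e-8)
    weights = []
    for i in range(num_frames):
        dist = abs(i - centre) / centre            # 0 at the centre, 1 at the edges
        weights.append(1.0 - (1.0 - min_weight) * dist)
    return weights

def weighted_reward(per_frame_rewards):
    # Combine per-frame rewards into the target reward result using the decayed weights.
    w = time_decay_weights(len(per_frame_rewards))
    return sum(wi * ri for wi, ri in zip(w, per_frame_rewards)) / sum(w)

print(time_decay_weights(16))   # centre frames weighted close to 1, edge frames lower
```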

在一种可选的实施例中,通过终端设备提供一图形用户界面,图形用户界面所显示的内容至少部分地包含一视频生成场景,该视频生成方法还包括:In an optional embodiment, a graphical user interface is provided by a terminal device, and the content displayed by the graphical user interface at least partially includes a video generation scene. The video generation method further includes:

步骤S26,响应作用于图形用户界面的第一触控操作,输入目标文本,其中,目标文本用于描述待生成的视频内容;Step S26, in response to the first touch operation on the graphical user interface, inputting a target text, wherein the target text is used to describe the video content to be generated;

步骤S27,响应作用于图形用户界面的第二触控操作,采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型;Step S27, in response to the second touch operation on the graphical user interface, using the target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner;

步骤S28,在图形用户界面内展示目标视频。Step S28, displaying the target video in the graphical user interface.

本公开实施例中的图形用户界面中至少显示有视频生成场景，用户能够通过执行控制操作在该视频生成场景中输入目标文本，采用目标视频生成模型对目标文本进行视频生成处理，得到目标视频等步骤。可以理解的是，上述视频生成场景可以但不限于电商、教育、医疗、会议、社交网络、金融产品、物流和导航等领域中涉及视频生成的应用场景。In the embodiment of the present disclosure, at least a video generation scene is displayed in the graphical user interface, and the user can, by performing control operations, input the target text in the video generation scene and use the target video generation model to perform video generation processing on the target text to obtain the target video, among other steps. It can be understood that the above video generation scene can be, but is not limited to, application scenarios involving video generation in fields such as e-commerce, education, medical care, conferences, social networks, financial products, logistics, and navigation.

上述图形用户界面还包括第一控件(或第一触控区域),当检测到作用于第一控件(或第一触控区域)的第一触控操作时,可以获取用户输入的目标文本。上述目标文本可以是用户通过第一触控操作从图形用户界面中的文本框进行输入的。上述第一触控操作可以是点选、框选、勾选、条件筛选等操作,此处不予限制。The graphical user interface also includes a first control (or a first touch area). When a first touch operation acting on the first control (or the first touch area) is detected, a target text input by the user can be obtained. The target text can be input by the user from a text box in the graphical user interface through the first touch operation. The first touch operation can be a point selection, a box selection, a check, a conditional screening, etc., which are not limited here.

上述图形用户界面还包括第二控件（或第二触控区域），当检测到作用于第二控件（或第二触控区域）的第二触控操作时，能够采用目标视频生成模型对用户输入的目标文本进行视频生成处理，得到目标视频。上述第二触控操作可以是点选、框选、勾选、条件筛选等操作，此处不予限制。The above graphical user interface further includes a second control (or a second touch area). When a second touch operation acting on the second control (or the second touch area) is detected, the target video generation model can be used to perform video generation processing on the target text input by the user to obtain the target video. The above second touch operation can be operations such as point selection, box selection, checking, conditional screening, etc., which are not limited here.

得到目标视频后,可以在图形用户界面内展示该目标视频以反馈给用户。After the target video is obtained, the target video can be displayed in a graphical user interface to provide feedback to the user.

需要说明的是,上述第一触控操作和第二触控操作均可以是用户用手指接触上述终端设备的显示屏并触控该终端设备的操作。该触控操作可以包括单点触控、多点触控,其中,每个触控点的触控操作可以包括点击、长按、重按、划动等。上述第一触控操作和第二触控操作还可以是通过鼠标、键盘等输入设备实现的触控操作,此处不予限制。It should be noted that the first touch operation and the second touch operation can both be operations in which a user touches the display screen of the terminal device with a finger and touches the terminal device. The touch operation can include single-point touch and multi-point touch, wherein the touch operation of each touch point can include click, long press, heavy press, swipe, etc. The first touch operation and the second touch operation can also be touch operations implemented by input devices such as a mouse and a keyboard, which are not limited here.

图3是根据本公开实施例1的微调流程示意图,其中,z为采样的视频在特征空间中的表示。c为采样的视频z对应的文本,可以理解的是,z与c共同构成对应的视频文本对。图3中以文本数据c为“山茱萸花在空中飞舞”,视频数据z为与文本数据c相应的视频为例。z_d(τ·D)表示加噪视频,其中,d()为用于计算出视频应该加噪到多少步的函数,D为加噪步数,τ为噪声等级。z_0为生成的视频在特征空间的表示,x_0^g为采样后生成视频在RGB空间的表示,R为计算得到的奖励分数。Figure 3 is a schematic diagram of the fine-tuning process according to Example 1 of the present disclosure, wherein z is the representation of the sampled video in the feature space. c is the text corresponding to the sampled video z. It can be understood that z and c together constitute a corresponding video-text pair. In Figure 3, the text data c is "Cornus flowers are flying in the air" and the video data z is the video corresponding to the text data c as an example. z_d(τ·D) represents the noisy video, wherein d() is a function for calculating how many steps the video should be noisy, D is the number of noisy steps, and τ is the noise level. z_0 is the representation of the generated video in the feature space, x_0^g is the representation of the generated video in the RGB space after sampling, and R is the calculated reward score.

如图3所示,在微调阶段开始之前,首先从预训练数据集中选择出部分视频文本对进行微调使用。选择出部分视频文本对后,对视频文本对中的视频数据z实施带有噪声的扩散过程,其中,设置噪声等级设定为τ,加噪步数为D,也即通过扩散模型对视频数据z进行加噪处理,得到加噪视频z_d(τ·D)。然后,基于这种加噪处理,对加噪视频z_d(τ·D)和对应的文本数据c进行采用DDIM采样进行重采样,从而生成采样视频z_0。生成采样视频z_0后,本公开采用基于图像的人类偏好模型作为奖励模型,即采用图像奖励模型进行奖励计算,并且在计算奖励分数时,为了提高学习过程的效率,采用分段采样和解码的方式,对采样视频z_0进行分段采样,得到分段后视频帧的颜色空间表示x_0^g以及分段采样结果。然后通过图像奖励模型和对应的文本数据c对分段采样结果进行奖励计算,并通过时间衰减奖励(TAR)调整各视频帧的奖励权重以实现有效且高效的微调,从而得到奖励分数R,即目标奖励结果。As shown in FIG3 , before the fine-tuning phase begins, some video-text pairs are first selected from the pre-training data set for fine-tuning. After selecting some video-text pairs, a diffusion process with noise is implemented on the video data z in the video-text pairs, wherein the noise level is set to τ and the number of noise steps is D, that is, the video data z is subjected to noise processing through the diffusion model to obtain a noisy video z_d (τ·D). Then, based on this noise processing, the noisy video z_d (τ·D) and the corresponding text data c are resampled using DDIM sampling to generate a sampled video z_0. After the sampled video z_0 is generated, the present disclosure adopts an image-based human preference model as a reward model, that is, an image reward model is used for reward calculation, and when calculating the reward score, in order to improve the efficiency of the learning process, a segmented sampling and decoding method is used to segment the sampled video z_0, and the color space representation x_0^g of the segmented video frame and the segmented sampling result are obtained. Then, the reward of the segmented sampling result is calculated through the image reward model and the corresponding text data c, and the reward weight of each video frame is adjusted through the time decay reward (TAR) to achieve effective and efficient fine-tuning, so as to obtain the reward score R, that is, the target reward result.

可以理解的是，在得到奖励分数R后，基于奖励分数R可以通过梯度的反向传播得到加噪视频z_d(1)，以调整网络参数来最小化误差，从而优化模型，此处不过多赘述。It can be understood that after the reward score R is obtained, the noisy video z_d(1) can be obtained through gradient back-propagation based on the reward score R, so as to adjust the network parameters to minimize the error and thereby optimize the model, which will not be described in detail here.

图4是根据本公开实施例1的一种视频生成方法的示意图，如图4所示，相比于相关技术的方法，即通过将文本输入至模型参数固定的预训练的U型网络中得到生成视频，导致生成视频不符合用户期望。本公开的方法，通过微调的方式，基于文本和加噪视频将本公开采用的视频生成模型与图像的人类偏好模型对齐，即将模型参数可训练的预训练的U型网络模型与图像奖励模型对齐，并在微调过程中，采用高效微调技术LoRA，通过对模型的参数进行微小的调整来改进模型的性能，此外还能够基于图像奖励模型进行梯度处理优化模型，从而得到目标视频生成模型。最终采用本公开得到的目标视频生成模型对输入的文本进行视频生成处理，从而能够得到符合用户期望的目标视频。FIG. 4 is a schematic diagram of a video generation method according to Embodiment 1 of the present disclosure. As shown in FIG. 4, in the method of the related art, a video is generated by inputting text into a pre-trained U-shaped network with fixed model parameters, so that the generated video does not meet user expectations. In the method of the present disclosure, the video generation model adopted by the present disclosure is aligned with the human preference model for images based on the text and the noisy video by means of fine-tuning, that is, the pre-trained U-shaped network model with trainable model parameters is aligned with the image reward model. During the fine-tuning process, the efficient fine-tuning technique LoRA is adopted, which improves the performance of the model by making small adjustments to the model parameters; in addition, the model can be optimized through gradient processing based on the image reward model, thereby obtaining the target video generation model. Finally, the target video generation model obtained by the present disclosure is used to perform video generation processing on the input text, so that a target video that meets user expectations can be obtained.
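As context for the LoRA step mentioned above, a commonly used low-rank adapter layer can be sketched as follows. The rank, scaling, and initialization below are illustrative choices, not values specified by the disclosure; the point is only that the pre-trained weights stay frozen while a small number of added parameters are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (a common LoRA form)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # keep the pre-trained weights fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output of the frozen layer plus the low-rank correction B @ A applied to x.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))                   # only A and B receive gradients
```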

可以看出,本公开实施例的方法能够可以在没有视频奖励模型(相关技术中还没有训练出这种模型)的基础上,使用图像奖励模型来对视频扩散模型进行人类偏好的学习,使得视频扩散模型在经过本公开设计的微调方法以后,输出的结果更加符合用户期望,更加受到人类的喜爱。It can be seen that the method of the embodiment of the present disclosure can use the image reward model to learn human preferences for the video diffusion model without a video reward model (such a model has not been trained in the relevant technology), so that after the fine-tuning method designed by the present disclosure, the output results of the video diffusion model are more in line with user expectations and more popular with humans.

容易理解的是,本公开提供的视频生成方法的有益效果包括以下几点。It is easy to understand that the beneficial effects of the video generation method provided by the present disclosure include the following points.

有益效果(1),为了将视频扩散模型与人类偏好对齐,本公开提出将视频扩散模型与图像的人类偏好模型对齐,从而使对齐后得到的视频扩散模型能够充分考虑人类的审美偏好和内容相关性,输出更符合用户期望的视频内容,能提高内容的个性化水平,还能在视频生成领域开辟新的应用前景;Beneficial effect (1): In order to align the video diffusion model with human preferences, the present disclosure proposes to align the video diffusion model with the human preference model of images, so that the video diffusion model obtained after alignment can fully consider human aesthetic preferences and content relevance, output video content that better meets user expectations, improve the personalization level of content, and open up new application prospects in the field of video generation;

有益效果(2),本公开提出了在视频奖励微调学习中进行分段视频奖励以及时间衰减奖励,来使得模型可以进行有效得微调。Beneficial effect (2): The present disclosure proposes segmented video rewards and time decay rewards in video reward fine-tuning learning to enable the model to be effectively fine-tuned.

需要说明的是,本公开所涉及的用户信息(包括但不限于用户设备信息、用户个人信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等),均为经用户授权或者经过各方充分授权的信息和数据,并且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准,并提供有相应的操作入口,供用户选择授权或者拒绝。It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of relevant countries and regions, and provide corresponding operation entrances for users to choose to authorize or refuse.

另外,还需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本公开所必须的。In addition, it should be noted that, for the aforementioned method embodiments, for the sake of simplicity, they are all described as a series of action combinations, but those skilled in the art should be aware that the present disclosure is not limited by the order of the actions described, because according to the present disclosure, certain steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present disclosure.

通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本公开各个实施例所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that the method according to the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on such an understanding, the technical solution of the present disclosure, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present disclosure.

实施例2Example 2

在如实施例1中的运行环境下,本公开提供了如图5所示的一种视频生成方法,图5是根据本公开实施例2的一种视频生成方法的流程图,如图5所示,该方法包括: In the operating environment as in Example 1, the present disclosure provides a video generation method as shown in FIG5 . FIG5 is a flow chart of a video generation method according to Example 2 of the present disclosure. As shown in FIG5 , the method includes:

步骤S51,通过第一应用程序编程接口获取视频生成调用请求,其中,视频生成调用请求中携带的请求数据包括:目标文本,目标文本用于描述待生成的视频内容,视频生成调用请求用于请求调用目标视频生成模型对目标文本进行视频生成处理以得到目标视频,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型;Step S51, obtaining a video generation call request through a first application programming interface, wherein the request data carried in the video generation call request includes: a target text, the target text is used to describe the video content to be generated, the video generation call request is used to request to call a target video generation model to perform video generation processing on the target text to obtain a target video, and the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner;

步骤S52,通过第二应用程序编程接口返回视频生成调用响应,其中,视频生成调用响应中携带的响应数据包括:目标视频。Step S52: Return a video generation call response through the second application programming interface, wherein the response data carried in the video generation call response includes: the target video.
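As an illustration of steps S51 and S52, the request/response exchange could be represented as below. The payload field names, the serialization helper, and the handler signature are hypothetical; the disclosure does not prescribe a particular API schema or framework.

```python
from dataclasses import dataclass

@dataclass
class VideoGenerationRequest:        # request data received via the first API (step S51)
    target_text: str

@dataclass
class VideoGenerationResponse:       # response data returned via the second API (step S52)
    target_video: bytes

def encode_video(frames) -> bytes:
    # Placeholder for serializing generated frames into a video container.
    return b""

def handle_video_generation(req: VideoGenerationRequest, model) -> VideoGenerationResponse:
    # The fine-tuned (reward-aligned) target model performs the actual generation.
    frames = model.generate(req.target_text)
    return VideoGenerationResponse(target_video=encode_video(frames))

class _DummyModel:
    def generate(self, text):
        return [f"frame conditioned on: {text}"]

resp = handle_video_generation(
    VideoGenerationRequest("Dogwood blossoms are blowing in the wind"), _DummyModel())
```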

第一应用程序编程接口和第二应用程序编程接口均可以理解为应用程序编程接口(Application Programming Interface,API),通过调用第一应用程序编程接口能够获取携带有目标文本的视频生成调用请求,通过调用第二应用程序编程接口能够返回视频生成调用响应。Both the first application programming interface and the second application programming interface can be understood as application programming interfaces (Application Programming Interface, API). By calling the first application programming interface, a video generation call request carrying the target text can be obtained, and by calling the second application programming interface, a video generation call response can be returned.

本公开实施例中的第一应用程序编程接口和第二应用程序编程接口可以为同一应用程序编程接口,也可以为不同应用程序编程接口,此处不予限制。The first application programming interface and the second application programming interface in the embodiment of the present disclosure may be the same application programming interface or different application programming interfaces, which is not limited here.

目标文本可以理解为输入至视频生成模型的文本,例如输入至本公开实施例中的目标视频生成模型的文本,即作为目标视频生成模型的输入。目标文本用于描述待生成的视频内容,例如用于描述用户所期望生成的视频内容。The target text can be understood as the text input into the video generation model, such as the text input into the target video generation model in the embodiment of the present disclosure, that is, as the input of the target video generation model. The target text is used to describe the video content to be generated, for example, to describe the video content that the user expects to generate.

示例性地，若用户所期望生成的视频内容为山茱萸花在风中飘荡（Dogwood blossoms are blowing in the wind）的视频，则目标文本可以为“山茱萸花在风中飘荡”，或者“Dogwood blossoms are blowing in the wind”等。可以理解的是，目标文本可以采用自然语言文字进行描述，例如汉语、英语、日语等，此处不予限制。For example, if the video content that the user expects to generate is a video of dogwood blossoms blowing in the wind, the target text may be "山茱萸花在风中飘荡" or "Dogwood blossoms are blowing in the wind", etc. It can be understood that the target text may be described in natural language, such as Chinese, English, Japanese, etc., which is not limited here.

初始视频生成模型可以理解为一种基于文本生成视频的初始模型,示例性地,初始视频生成模型可以为预训练的U型网络(Pre-trained UNet),即为在大规模数据集上进行了预训练的U型网络模型。考虑到在大规模数据集上进行预训练,能够使模型学习到丰富的图像特征和语义信息,从而提高模型在特定任务上的泛化能力和准确性,因此本公开采用预训练的U型网络模型作为初始视频生成模型能够使得最终得到的目标视频生成模型生成的视频更加准确。The initial video generation model can be understood as an initial model for generating videos based on text. For example, the initial video generation model can be a pre-trained U-type network (Pre-trained UNet), that is, a U-type network model pre-trained on a large-scale dataset. Considering that pre-training on a large-scale dataset can enable the model to learn rich image features and semantic information, thereby improving the generalization ability and accuracy of the model on specific tasks, the present disclosure uses a pre-trained U-type network model as the initial video generation model to make the video generated by the target video generation model more accurate.

预设奖励模型可以为用于评估生成视频的质量,从而指导视频生成模型产生更高质量的输出的模型,例如预设奖励模型可以通过对生成视频进行评价得分,从而通过奖励分数来量化视频生成模型的输出质量或符合度。The preset reward model may be a model used to evaluate the quality of generated videos, thereby guiding the video generation model to produce higher quality outputs. For example, the preset reward model may evaluate and score the generated videos, thereby quantifying the output quality or conformity of the video generation model through reward scores.

需要注意的是,考虑到最终得到的目标视频生成模型应充分考虑人类的审美偏好和内容相关性,才能生成满足用户期望的视频,因此本公开实施例中的预设奖励模型采用基于图像的人类偏好模型,即图像奖励模型,该图像奖励模型能够基于人类的审美偏好和内容相关性评估生成视频的质量,从而使得最终得到的目标视频生成模型生成的视频更加符合用户期望。 It should be noted that, considering that the final target video generation model should fully consider human aesthetic preferences and content relevance in order to generate videos that meet user expectations, the preset reward model in the embodiment of the present disclosure adopts an image-based human preference model, namely an image reward model. The image reward model can evaluate the quality of the generated video based on human aesthetic preferences and content relevance, so that the video generated by the final target video generation model is more in line with user expectations.

目标视频生成模型为本公开实施例提出的采用微调(fine-tuning)方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型,也即通过采用模型微调的方式对预训练的U型网络与图像奖励模型进行模型对齐后得到的模型,从而能够根据目标视频生成模型准确生成符合用户偏好且满足用户期望的视频内容。The target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning the embodiment of the present disclosure, that is, a model obtained by aligning the pre-trained U-type network with the image reward model by fine-tuning the model, so that video content that meets user preferences and expectations can be accurately generated according to the target video generation model.

本公开实施例中,通过调用第一应用程序编程接口能够获取视频生成调用请求,该视频生成调用请求中携带了用于描述待生成的视频内容的目标文本,根据获取到的视频生成调用请求采用目标视频生成模型对目标文本进行视频生成处理,从而能够得到目标视频,然后通过调用第二应用程序编程接口能够返回包括该目标视频的视频生成调用响应。根据该方法能够使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,也即得到了更加符合用户期望的视频,进而能够提高生成得到的视频内容的个性化水平,还能在视频生成领域开辟新的应用前景。In the disclosed embodiment, a video generation call request can be obtained by calling a first application programming interface, and the video generation call request carries a target text for describing the content of the video to be generated. According to the obtained video generation call request, a target video generation model is used to perform video generation processing on the target text, so that a target video can be obtained, and then a video generation call response including the target video can be returned by calling a second application programming interface. According to this method, the generated target video can be made more consistent with human aesthetic preferences and the content of the target text, and the video quality of the generated target video can be improved, that is, a video that better meets the user's expectations can be obtained, thereby improving the personalization level of the generated video content, and opening up new application prospects in the field of video generation.

本公开实施例提供的上述视频生成方法可以但不限于应用于电商服务、教育服务、法律服务、医疗服务、会议服务、社交网络服务、金融产品服务、物流服务和导航服务等领域中涉及视频生成的应用场景中,例如:电商服务中的生成商品展示内容的场景、教育服务中的生成学习内容视频的场景、法律服务中生成案件相关视频的场景等,此处不予限制。The above-mentioned video generation method provided by the embodiments of the present disclosure can be applied to, but is not limited to, application scenarios involving video generation in the fields of e-commerce services, educational services, legal services, medical services, conference services, social network services, financial product services, logistics services, and navigation services. For example: the scenario of generating product display content in e-commerce services, the scenario of generating learning content videos in educational services, the scenario of generating case-related videos in legal services, etc., which are not limited here.

采用本公开实施例,通过调用第一应用程序编程接口能够获取视频生成调用请求,该视频生成调用请求中携带了用于描述待生成的视频内容的目标文本,根据获取到的视频生成调用请求采用目标视频生成模型对目标文本进行视频生成处理,从而能够得到目标视频,然后通过调用第二应用程序编程接口能够返回包括该目标视频的视频生成调用响应,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。By adopting the embodiment of the present disclosure, a video generation call request can be obtained by calling a first application programming interface, and the video generation call request carries a target text for describing the video content to be generated. According to the obtained video generation call request, a target video generation model is used to perform video generation processing on the target text, so that a target video can be obtained. Then, by calling a second application programming interface, a video generation call response including the target video can be returned, thereby achieving the purpose of generating a target video that meets user expectations, thereby achieving the technical effect of making the generated target video more consistent with human aesthetic preferences and target text content, improving the video quality of the generated target video, and making the generated target video more popular with humans, thereby solving the technical problem in the related technology of training a video generation model based on network data, resulting in the video generated by the trained video generation model having poor quality and not meeting user expectations.

需要说明的是,本实施例的优选实施方式可以参见实施例1中的相关描述,此处不再赘述。It should be noted that the preferred implementation of this embodiment can refer to the relevant description in Example 1, which will not be repeated here.

实施例3Example 3

在如实施例1中的运行环境下,本公开提供了如图6所示的一种视频生成方法,图6是根据本公开实施例3的一种视频生成方法的流程图,如图6所示,该方法包括:In the operating environment as in Example 1, the present disclosure provides a video generation method as shown in FIG6 . FIG6 is a flow chart of a video generation method according to Example 3 of the present disclosure. As shown in FIG6 , the method includes:

步骤S61,获取当前输入的视频生成对话请求,其中,视频生成对话请求中携带的信息包括:目标文本,目标文本用于描述待生成的视频内容;Step S61, obtaining a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes: a target text, the target text is used to describe the video content to be generated;

步骤S62，响应于视频生成对话请求，返回视频生成对话回复，其中，视频生成对话回复中携带的信息包括：目标视频，目标视频采用目标视频生成模型对目标文本进行视频生成处理后得到，目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型；Step S62, in response to the video generation dialogue request, returning a video generation dialogue reply, wherein the information carried in the video generation dialogue reply includes: a target video, the target video is obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner;

步骤S63,在图形用户界面内展示视频生成对话回复。Step S63, displaying the video-generated dialogue response in the graphical user interface.

视频生成对话请求可以理解为用户向计算机或机器人发起的对话请求(request),该视频生成对话请求中携带有用于描述待生成的视频内容的目标文本。The video generation dialogue request can be understood as a dialogue request (request) initiated by a user to a computer or a robot, and the video generation dialogue request carries a target text for describing the video content to be generated.

目标文本可以理解为输入至视频生成模型的文本,例如输入至本公开实施例中的目标视频生成模型的文本,即作为目标视频生成模型的输入。目标文本用于描述待生成的视频内容,例如用于描述用户所期望生成的视频内容。The target text can be understood as the text input into the video generation model, such as the text input into the target video generation model in the embodiment of the present disclosure, that is, as the input of the target video generation model. The target text is used to describe the video content to be generated, for example, to describe the video content that the user expects to generate.

示例性地，若用户所期望生成的视频内容为山茱萸花在风中飘荡（Dogwood blossoms are blowing in the wind）的视频，则目标文本可以为“山茱萸花在风中飘荡”，或者“Dogwood blossoms are blowing in the wind”等。可以理解的是，目标文本可以采用自然语言文字进行描述，例如汉语、英语、日语等，此处不予限制。For example, if the video content that the user expects to generate is a video of dogwood blossoms blowing in the wind, the target text may be "山茱萸花在风中飘荡" or "Dogwood blossoms are blowing in the wind", etc. It can be understood that the target text may be described in natural language, such as Chinese, English, Japanese, etc., which is not limited here.

本公开实施例中,响应于获取到的视频生成对话请求,可以返回视频生成对话回复(response),其中,视频生成对话回复中携带有采用目标视频生成模型对目标文本进行视频生成处理后得到的目标视频,且目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。In the disclosed embodiment, in response to the acquired video generation dialogue request, a video generation dialogue reply (response) may be returned, wherein the video generation dialogue reply carries a target video obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with a preset reward model using a fine-tuning method.

初始视频生成模型可以理解为一种基于文本生成视频的初始模型,示例性地,初始视频生成模型可以为预训练的U型网络(Pre-trained UNet),即为在大规模数据集上进行了预训练的U型网络模型。考虑到在大规模数据集上进行预训练,能够使模型学习到丰富的图像特征和语义信息,从而提高模型在特定任务上的泛化能力和准确性,因此本公开采用预训练的U型网络模型作为初始视频生成模型能够使得最终得到的目标视频生成模型生成的视频更加准确。The initial video generation model can be understood as an initial model for generating videos based on text. For example, the initial video generation model can be a pre-trained U-type network (Pre-trained UNet), that is, a U-type network model pre-trained on a large-scale dataset. Considering that pre-training on a large-scale dataset can enable the model to learn rich image features and semantic information, thereby improving the generalization ability and accuracy of the model on specific tasks, the present disclosure uses a pre-trained U-type network model as the initial video generation model to make the video generated by the target video generation model more accurate.

预设奖励模型可以为用于评估生成视频的质量,从而指导视频生成模型产生更高质量的输出的模型,例如预设奖励模型可以通过对生成视频进行评价得分,从而通过奖励分数来量化视频生成模型的输出质量或符合度。The preset reward model may be a model used to evaluate the quality of generated videos, thereby guiding the video generation model to produce higher quality outputs. For example, the preset reward model may evaluate and score the generated videos, thereby quantifying the output quality or conformity of the video generation model through reward scores.

需要注意的是,考虑到最终得到的目标视频生成模型应充分考虑人类的审美偏好和内容相关性,才能生成满足用户期望的视频,因此本公开实施例中的预设奖励模型采用基于图像的人类偏好模型,即图像奖励模型,该图像奖励模型能够基于人类的审美偏好和内容相关性评估生成视频的质量,从而使得最终得到的目标视频生成模型生成的视频更加符合用户期望。It should be noted that, considering that the final target video generation model should fully consider human aesthetic preferences and content relevance in order to generate videos that meet user expectations, the preset reward model in the embodiment of the present disclosure adopts an image-based human preference model, namely an image reward model. The image reward model can evaluate the quality of the generated video based on human aesthetic preferences and content relevance, so that the video generated by the final target video generation model is more in line with user expectations.

目标视频生成模型为本公开实施例提出的采用微调(fine-tuning)方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型,也即通过采用模型微调的方式对预训练的U型网络与图像奖励模型进行模型对齐后得到的模型,从而能够根据目标 视频生成模型准确生成符合用户偏好且满足用户期望的视频内容。The target video generation model is a model obtained by aligning the initial video generation model with the preset reward model by fine-tuning the embodiment of the present disclosure, that is, a model obtained by aligning the pre-trained U-shaped network with the image reward model by fine-tuning the model, so as to be able to generate the target video according to the target. The video generation model accurately generates video content that conforms to user preferences and meets user expectations.

返回视频生成对话回复后,可以在图形用户界面内展示视频生成对话回复。After returning to the video-generated dialogue response, the video-generated dialogue response can be displayed in the graphical user interface.

本公开实施例提供的上述视频生成方法可以但不限于应用于电商服务、教育服务、法律服务、医疗服务、会议服务、社交网络服务、金融产品服务、物流服务和导航服务等领域中涉及视频生成的应用场景中,例如:电商服务中的生成商品展示内容的场景、教育服务中的生成学习内容视频的场景、法律服务中生成案件相关视频的场景等,此处不予限制。The above-mentioned video generation method provided by the embodiments of the present disclosure can be applied to, but is not limited to, application scenarios involving video generation in the fields of e-commerce services, educational services, legal services, medical services, conference services, social network services, financial product services, logistics services, and navigation services. For example: the scenario of generating product display content in e-commerce services, the scenario of generating learning content videos in educational services, the scenario of generating case-related videos in legal services, etc., which are not limited here.

采用本公开实施例,通过获取携带有用于描述待生成的视频内容的目标文本的视频生成对话请求,然后响应于该视频生成对话请求,基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,即得到携带该目标视频的视频生成对话回复,最后在图形用户界面内展示该视频生成对话回复,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。According to the disclosed embodiment, a video generation dialogue request carrying a target text for describing the content of a video to be generated is obtained, and then in response to the video generation dialogue request, a video generation process is performed on the target text based on a target video generation model obtained by aligning the initial video generation model with a preset reward model in a fine-tuning manner, that is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video, that is, a video generation dialogue response carrying the target video is obtained, and finally the video generation dialogue response is displayed in a graphical user interface, thereby achieving the purpose of generating a target video that meets the user's expectations, thereby achieving the technical effect of making the generated target video more consistent with human aesthetic preferences and the content of the target text, improving the video quality of the generated target video, and making the generated target video more popular with humans, thereby solving the technical problem in the related art of training a video generation model based on network data, resulting in a poor quality of the video generated by the trained video generation model that does not meet the user's expectations.

需要说明的是,本实施例的优选实施方式可以参见实施例1中的相关描述,此处不再赘述。It should be noted that the preferred implementation of this embodiment can refer to the relevant description in Example 1, which will not be repeated here.

实施例4Example 4

根据本公开实施例,还提供了一种用于实施上述视频生成方法的装置实施例。图7是根据本公开实施例4的一种视频生成装置的结构示意图,如图7所示,该装置包括:According to an embodiment of the present disclosure, a device embodiment for implementing the above-mentioned video generation method is also provided. FIG7 is a structural schematic diagram of a video generation device according to Embodiment 4 of the present disclosure. As shown in FIG7 , the device includes:

获取模块701,被设置为获取目标文本,其中,目标文本用于描述待生成的视频内容;An acquisition module 701 is configured to acquire a target text, wherein the target text is used to describe the video content to be generated;

处理模块702,被设置为采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。The processing module 702 is configured to use a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model using a fine-tuning method.

可选地,还包括:训练模块,被设置为采用训练样本对初始视频生成模型进行采样处理,生成采样视频,其中,训练样本包括:多个视频文本对,多个视频文本对均包括:训练视频与训练文本,训练文本用于描述训练视频的视频内容;采用预设奖励模型对采样视频进行奖励计算,得到目标奖励结果;基于目标奖励结果对初始视频生成模型的模型参数进行调节,生成目标视频生成模型。 Optionally, it also includes: a training module, which is configured to use training samples to sample the initial video generation model to generate a sampled video, wherein the training samples include: multiple video-text pairs, and the multiple video-text pairs all include: training videos and training texts, and the training texts are used to describe the video content of the training videos; using a preset reward model to calculate rewards for the sampled videos to obtain target reward results; based on the target reward results, adjusting the model parameters of the initial video generation model to generate a target video generation model.
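As a rough illustration of the three-step training procedure summarized above (sample, reward, adjust), a hedged Python/PyTorch sketch follows. The sampling call, the differentiable reward model, and the use of the negative mean reward as the loss are assumptions made for the example; the disclosure only fixes the overall structure.

```python
# Hedged sketch of reward-aligned fine-tuning; not the disclosed implementation.
import torch

def fine_tune_step(initial_model, reward_model, optimizer, batch):
    videos, texts = batch["video"], batch["text"]   # video-text pairs from the training samples

    # 1) Sampling: let the current model (re-)generate a sampled video
    sampled = initial_model.sample(videos, texts)

    # 2) Reward: score the sampled video with the preset reward model (higher = preferred)
    reward = reward_model(sampled, texts)

    # 3) Adjustment: push model parameters toward high-reward samples
    loss = -reward.mean()                           # maximize the target reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```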

可选地,上述训练模块还被设置为:对训练视频进行加噪处理,得到加噪视频;采用加噪视频与训练文本对初始视频生成模型进行视频重采样,生成采样视频。Optionally, the training module is further configured to: perform noise processing on the training video to obtain a noisy video; and perform video resampling on the initial video generation model using the noisy video and the training text to generate a sampled video.

可选地,上述训练模块还被设置为:获取训练视频对应的加噪步数与噪声等级,其中,加噪步数用于通过预设加噪函数确定训练视频待加噪的步数,噪声等级用于确定对训练视频的破坏程度;基于加噪步数与噪声等级对训练视频进行加噪处理,得到加噪视频。Optionally, the above-mentioned training module is also configured to: obtain the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps is used to determine the number of steps to be noised for the training video through a preset noise addition function, and the noise level is used to determine the degree of damage to the training video; and perform noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain a noisy video.
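The noising step can be pictured with a standard diffusion-style forward process, as in the sketch below; the specific preset noising function and schedule are not stated in the disclosure, so the DDPM-style formula here is an assumption.

```python
# Assumed DDPM-style forward noising; the disclosed preset noising function may differ.
import torch

def add_noise(video_latents: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """video_latents: [B, F, C, H, W]; t: per-sample noising step index."""
    noise = torch.randn_like(video_latents)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)             # noise level at step t
    noisy = a_bar.sqrt() * video_latents + (1.0 - a_bar).sqrt() * noise
    return noisy, noise

# A larger step index t gives a smaller cumulative alpha, i.e. a stronger
# "degree of destruction" of the original training video.
```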

可选地,上述训练模块还被设置为:采用预设奖励模型对采样视频进行奖励计算,得到初始奖励结果;采用时间衰减奖励方式调节初始奖励结果对应的初始奖励权重,生成目标奖励结果,其中,初始奖励权重为采样视频包含的视频帧序列对应的默认奖励权重。Optionally, the above-mentioned training module is also configured to: use a preset reward model to calculate the reward for the sampled video to obtain an initial reward result; use a time-decayed reward method to adjust the initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is the default reward weight corresponding to the video frame sequence contained in the sampled video.

可选地,上述训练模块还被设置为:对视频帧序列进行视频分段采样,得到分段采样结果;采用预设奖励模型对分段采样结果进行奖励计算,得到初始奖励结果。Optionally, the training module is further configured to: perform video segment sampling on the video frame sequence to obtain segment sampling results; and perform reward calculation on the segment sampling results using a preset reward model to obtain an initial reward result.

可选地,上述训练模块还被设置为:获取视频帧序列的特征空间表示;对特征空间表示进行视频分段采样,得到分段后视频帧的颜色空间表示;基于颜色空间表示确定分段采样结果。Optionally, the above training module is also configured to: obtain a feature space representation of a video frame sequence; perform video segmentation sampling on the feature space representation to obtain a color space representation of the segmented video frame; and determine the segmentation sampling result based on the color space representation.
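One possible reading of this segment-sampling step is sketched below: take the latent (feature-space) representation of the frame sequence, pick a contiguous segment, and decode only that segment to the color space for scoring. The `vae` decoder and the random choice of segment start are assumptions.

```python
# Hedged sketch: segment sampling in feature space, then decoding to RGB (color space).
import torch

def segment_sample(latents: torch.Tensor, vae, segment_len: int = 8):
    """latents: [B, F, C, H, W] feature-space representation of the frame sequence."""
    B, F = latents.shape[:2]
    seg = min(segment_len, F)
    start = torch.randint(0, F - seg + 1, (1,)).item()
    segment = latents[:, start:start + seg]                    # feature-space segment
    decoded = vae.decode(segment.flatten(0, 1))                # color-space (RGB) frames
    return decoded.unflatten(0, (B, seg)), start
```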

可选地,上述训练模块还被设置为:采用时间衰减奖励方式对初始奖励结果对应的初始奖励权重进行差异化调节,得到目标奖励权重,其中,目标奖励权重用于表示视频帧序列中第一视频帧的当前奖励权重高于第二视频帧的当前奖励权重,第一视频帧位于视频帧序列的中间位置,第二视频帧位于视频帧序列的边缘位置;基于初始奖励结果与目标奖励权重生成目标奖励结果。Optionally, the above-mentioned training module is also configured to: use a time-decayed reward method to differentially adjust the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle position of the video frame sequence, and the second video frame is located at the edge position of the video frame sequence; generate the target reward result based on the initial reward result and the target reward weight.
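One simple way to realize the differentiated weighting described above, with middle frames weighted more heavily than edge frames, is a bell-shaped weight over frame position, as in this sketch. The Gaussian profile is an assumption; the disclosure only fixes the ordering of the weights.

```python
# Hedged sketch of a time-decayed reward weighting peaking at the middle of the sequence.
import torch

def time_decay_weights(num_frames: int, sigma: float = 0.25) -> torch.Tensor:
    pos = torch.linspace(-1.0, 1.0, num_frames)         # -1 = first frame, 1 = last frame
    w = torch.exp(-(pos ** 2) / (2 * sigma ** 2))       # peak at the middle frame
    return w / w.sum()                                   # normalize the weights

def target_reward(frame_rewards: torch.Tensor) -> torch.Tensor:
    """frame_rewards: [B, F] per-frame initial rewards from the preset reward model."""
    w = time_decay_weights(frame_rewards.shape[1]).to(frame_rewards)
    return (frame_rewards * w).sum(dim=1)                # weighted target reward per video
```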

可选地,目标视频生成模型为视频扩散模型,预设奖励模型为图像奖励模型,其中,图像奖励模型用于对视频扩散模型进行偏好学习。Optionally, the target video generation model is a video diffusion model, and the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.
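When the preset reward model is an image (per-frame) reward model, one plausible way to score a generated video is to apply it frame by frame and aggregate the scores, as sketched below. The `image_reward` callable stands in for any text-image preference or aesthetic scorer and is an assumption; no specific scorer is named in the disclosure.

```python
# Hedged sketch: per-frame scoring with an image reward model, averaged over frames.
import torch

def video_reward_from_image_model(frames_rgb: torch.Tensor, prompt: str, image_reward):
    """frames_rgb: [B, F, 3, H, W]; returns one reward per video in the batch."""
    B, F = frames_rgb.shape[:2]
    scores = torch.stack(
        [image_reward(frames_rgb[:, i], prompt) for i in range(F)], dim=1
    )                                                    # [B, F] per-frame scores
    return scores.mean(dim=1)                            # simple average aggregation
```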

采用本公开实施例,通过获取用于描述待生成的视频内容的目标文本,然后基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。According to the disclosed embodiment, a target text for describing the content of a video to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video. This achieves the purpose of generating a target video that meets user expectations, thereby achieving a technical effect of making the generated target video more consistent with human aesthetic preferences and the content of the target text, improving the video quality of the generated target video, and making the generated target video more popular with humans. This further solves the technical problem in the related art of training a video generation model based on network data, resulting in a poor quality video generated by the trained video generation model that does not meet user expectations.

此处需要说明的是，上述获取模块701和处理模块702对应于实施例1中的步骤S21和步骤S22，两个模块与对应的步骤所实现的实例和应用场景相同，但不限于上述实施例1所公开的内容。需要说明的是，上述模块或单元可以是存储在存储器(例如，存储器104)中并由一个或多个处理器(例如，处理器102a，102b，……，102n)处理的硬件组件或软件组件，上述模块也可以作为装置的一部分运行在实施例1提供的计算机终端10中。It should be noted that the acquisition module 701 and the processing module 702 correspond to step S21 and step S22 in Example 1; the examples and application scenarios implemented by the two modules and the corresponding steps are the same, but are not limited to the contents disclosed in Example 1. It should also be noted that the above modules or units may be hardware components or software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and the above modules may also run, as part of the apparatus, in the computer terminal 10 provided in Example 1.

根据本公开实施例,还提供了另一种用于实施上述视频生成方法的装置实施例。图8是根据本公开实施例4的另一种视频生成装置的结构示意图,如图8所示,该装置包括:According to an embodiment of the present disclosure, another device embodiment for implementing the above-mentioned video generation method is also provided. FIG8 is a structural schematic diagram of another video generation device according to embodiment 4 of the present disclosure. As shown in FIG8 , the device includes:

获取模块801,被设置为通过第一应用程序编程接口获取视频生成调用请求,其中,视频生成调用请求中携带的请求数据包括:目标文本,目标文本用于描述待生成的视频内容,视频生成调用请求用于请求调用目标视频生成模型对目标文本进行视频生成处理以得到目标视频,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型;The acquisition module 801 is configured to acquire a video generation call request through a first application programming interface, wherein the request data carried in the video generation call request includes: a target text, the target text is used to describe the video content to be generated, the video generation call request is used to request to call a target video generation model to perform video generation processing on the target text to obtain a target video, and the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner;

返回模块802,被设置为通过第二应用程序编程接口返回视频生成调用响应,其中,视频生成调用响应中携带的响应数据包括:目标视频。The return module 802 is configured to return a video generation call response through a second application programming interface, wherein the response data carried in the video generation call response includes: a target video.
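For clarity, the request and response shapes exchanged over the two programming interfaces might look like the sketch below; the field names and the `encode_to_mp4` helper are hypothetical and are not part of the disclosure.

```python
# Illustrative call request / call response shapes; all names are assumptions.
from dataclasses import dataclass

@dataclass
class VideoGenerationCallRequest:      # carried over the first programming interface
    target_text: str                   # describes the video content to be generated

@dataclass
class VideoGenerationCallResponse:     # carried over the second programming interface
    target_video: bytes                # e.g. an encoded video file or a handle to it

def handle_call(req: VideoGenerationCallRequest, model, encode_to_mp4) -> VideoGenerationCallResponse:
    # Invoke the reward-aligned target video generation model on the target text
    frames = model.generate(prompt=req.target_text)
    # encode_to_mp4 is an assumed helper that serializes the frames into a container
    return VideoGenerationCallResponse(target_video=encode_to_mp4(frames))
```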

采用本公开实施例,通过调用第一应用程序编程接口能够获取视频生成调用请求,该视频生成调用请求中携带了用于描述待生成的视频内容的目标文本,根据获取到的视频生成调用请求采用目标视频生成模型对目标文本进行视频生成处理,从而能够得到目标视频,然后通过调用第二应用程序编程接口能够返回包括该目标视频的视频生成调用响应,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。By adopting the embodiment of the present disclosure, a video generation call request can be obtained by calling a first application programming interface, and the video generation call request carries a target text for describing the video content to be generated. According to the obtained video generation call request, a target video generation model is used to perform video generation processing on the target text, so that a target video can be obtained. Then, by calling a second application programming interface, a video generation call response including the target video can be returned, thereby achieving the purpose of generating a target video that meets user expectations, thereby achieving the technical effect of making the generated target video more consistent with human aesthetic preferences and target text content, improving the video quality of the generated target video, and making the generated target video more popular with humans, thereby solving the technical problem in the related technology of training a video generation model based on network data, resulting in the video generated by the trained video generation model having poor quality and not meeting user expectations.

此处需要说明的是,上述获取模块801和返回模块802对应于实施例2中的步骤S51和步骤S52,两个模块与对应的步骤所实现的实例和应用场景相同,但不限于上述实施例1所公开的内容。需要说明的是,上述模块或单元可以是存储在存储器(例如,存储器104)中并由一个或多个处理器(例如,处理器102a,102b,……,102n)处理的硬件组件或软件组件,上述模块也可以作为装置的一部分可以运行在实施例1提供的计算机终端10中。It should be noted that the acquisition module 801 and the return module 802 correspond to step S51 and step S52 in Example 2, and the examples and application scenarios implemented by the two modules and the corresponding steps are the same, but are not limited to the contents disclosed in the above-mentioned Example 1. It should be noted that the above-mentioned modules or units can be hardware components or software components stored in a memory (e.g., memory 104) and processed by one or more processors (e.g., processors 102a, 102b, ..., 102n), and the above-mentioned modules can also be run in the computer terminal 10 provided in Example 1 as part of the device.

根据本公开实施例,还提供了再一种用于实施上述视频生成方法的装置实施例。图9是根据本公开实施例4的再一种视频生成装置的结构示意图,如图9所示,该装置包括:According to an embodiment of the present disclosure, another device embodiment for implementing the above-mentioned video generation method is also provided. FIG9 is a structural schematic diagram of another video generation device according to embodiment 4 of the present disclosure. As shown in FIG9 , the device includes:

获取模块901,被设置为获取当前输入的视频生成对话请求,其中,视频生成对话请求中携带的信息包括:目标文本,目标文本用于描述待生成的视频内容;The acquisition module 901 is configured to acquire a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes: a target text, the target text is used to describe the video content to be generated;

第一响应模块902，被设置为响应于视频生成对话请求，返回视频生成对话回复，其中，视频生成对话回复中携带的信息包括：目标视频，目标视频采用目标视频生成模型对目标文本进行视频生成处理后得到，目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型；The first response module 902 is configured to return a video generation dialogue reply in response to the video generation dialogue request, wherein the information carried in the video generation dialogue reply includes: a target video, the target video is obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model in a fine-tuning manner;

展示模块903,被设置为在图形用户界面内展示视频生成对话回复。The display module 903 is configured to display the video-generated dialogue response in the graphical user interface.

采用本公开实施例,通过获取携带有用于描述待生成的视频内容的目标文本的视频生成对话请求,然后响应于该视频生成对话请求,基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,即得到携带该目标视频的视频生成对话回复,最后在图形用户界面内展示该视频生成对话回复,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。According to the disclosed embodiment, a video generation dialogue request carrying a target text for describing the content of a video to be generated is obtained, and then in response to the video generation dialogue request, a video generation process is performed on the target text based on a target video generation model obtained by aligning the initial video generation model with a preset reward model in a fine-tuning manner, that is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video, that is, a video generation dialogue response carrying the target video is obtained, and finally the video generation dialogue response is displayed in a graphical user interface, thereby achieving the purpose of generating a target video that meets the user's expectations, thereby achieving the technical effect of making the generated target video more consistent with human aesthetic preferences and the content of the target text, improving the video quality of the generated target video, and making the generated target video more popular with humans, thereby solving the technical problem in the related art of training a video generation model based on network data, resulting in a poor quality of the video generated by the trained video generation model that does not meet the user's expectations.

此处需要说明的是,上述获取模块901、第一响应模块902和展示模块903对应于实施例3中的步骤S61至步骤S63,三个模块与对应的步骤所实现的实例和应用场景相同,但不限于上述实施例1所公开的内容。需要说明的是,上述模块或单元可以是存储在存储器(例如,存储器104)中并由一个或多个处理器(例如,处理器102a,102b,……,102n)处理的硬件组件或软件组件,上述模块也可以作为装置的一部分可以运行在实施例1提供的计算机终端10中。It should be noted that the acquisition module 901, the first response module 902 and the display module 903 correspond to steps S61 to S63 in Example 3, and the three modules and the corresponding steps implement the same examples and application scenarios, but are not limited to the contents disclosed in the above-mentioned Example 1. It should be noted that the above-mentioned modules or units may be hardware components or software components stored in a memory (e.g., memory 104) and processed by one or more processors (e.g., processors 102a, 102b, ..., 102n), and the above-mentioned modules may also be part of the device and may be run in the computer terminal 10 provided in Example 1.

需要说明的是,本公开上述实施例中涉及到的优选实施方案与实施例1提供的方案以及应用场景、实施过程相同,但不仅限于实施例1所提供的方案。It should be noted that the preferred implementation scheme involved in the above embodiments of the present disclosure is the same as the scheme provided in Example 1, as well as the application scenario and implementation process, but is not limited to the scheme provided in Example 1.

实施例5Example 5

本公开的实施例可以提供一种计算机终端,该计算机终端可以是计算机终端群中的任意一个计算机终端设备。可选地,在本实施例中,上述计算机终端也可以替换为移动终端等终端设备。The embodiment of the present disclosure may provide a computer terminal, which may be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced by a terminal device such as a mobile terminal.

可选地,在本实施例中,上述计算机终端可以位于计算机网络的多个网络设备中的至少一个网络设备。Optionally, in this embodiment, the computer terminal may be located in at least one network device among a plurality of network devices of the computer network.

在本实施例中,上述计算机终端可以执行视频生成方法中以下步骤的程序代码:获取目标文本,其中,目标文本用于描述待生成的视频内容;采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。In this embodiment, the above-mentioned computer terminal can execute the program code of the following steps in the video generation method: obtaining a target text, wherein the target text is used to describe the video content to be generated; using a target video generation model to perform video generation processing on the target text to obtain a target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with a preset reward model using a fine-tuning method.

可选地，图10是根据本公开实施例5的一种计算机终端的结构框图。如图10所示，该计算机终端A可以包括：一个或多个(图中仅示出一个)处理器1002、存储器1004、存储控制器、以及外设接口，其中，外设接口与射频模块、音频模块和显示器连接。Optionally, FIG. 10 is a structural block diagram of a computer terminal according to Embodiment 5 of the present disclosure. As shown in FIG. 10, the computer terminal A may include: one or more (only one is shown in the figure) processors 1002, a memory 1004, a storage controller, and a peripheral interface, wherein the peripheral interface is connected to a radio frequency module, an audio module, and a display.

其中,存储器可被设置为存储软件程序以及模块,如本公开实施例中的视频生成方法和装置对应的程序指令/模块,处理器通过运行存储在内的软件程序以及模块,从而执行各种功能应用以及数据处理,即实现上述的视频生成方法。存储器可包括高速随机存储器,还可以包括非易失性存储器,如一个或者多个磁性存储装置、闪存、或者其他非易失性固态存储器。在一些实例中,存储器可进一步包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至计算机终端A。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。Among them, the memory can be configured to store software programs and modules, such as program instructions/modules corresponding to the video generation method and device in the embodiment of the present disclosure, and the processor executes various functional applications and data processing by running the stored software programs and modules, that is, realizing the above-mentioned video generation method. The memory may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include a memory remotely arranged relative to the processor, and these remote memories may be connected to the computer terminal A via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

处理器可以通过传输装置调用存储器存储的信息及应用程序,以执行下述步骤:获取目标文本,其中,目标文本用于描述待生成的视频内容;采用目标视频生成模型对目标文本进行视频生成处理,得到目标视频,其中,目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。The processor can call the information and application stored in the memory through the transmission device to execute the following steps: obtain the target text, wherein the target text is used to describe the video content to be generated; use the target video generation model to perform video generation processing on the target text to obtain the target video, wherein the target video generation model is a model obtained by aligning the initial video generation model with the preset reward model using a fine-tuning method.

可选地,上述处理器还可以执行如下步骤的程序代码:采用训练样本对初始视频生成模型进行采样处理,生成采样视频,其中,训练样本包括:多个视频文本对,多个视频文本对均包括:训练视频与训练文本,训练文本用于描述训练视频的视频内容;采用预设奖励模型对采样视频进行奖励计算,得到目标奖励结果;基于目标奖励结果对初始视频生成模型的模型参数进行调节,生成目标视频生成模型。Optionally, the processor may also execute the program code of the following steps: using training samples to perform sampling processing on the initial video generation model to generate a sampled video, wherein the training samples include: multiple video-text pairs, and the multiple video-text pairs each include: a training video and a training text, and the training text is used to describe the video content of the training video; using a preset reward model to calculate the reward for the sampled video to obtain a target reward result; adjusting the model parameters of the initial video generation model based on the target reward result to generate a target video generation model.

可选地,上述处理器还可以执行如下步骤的程序代码:对训练视频进行加噪处理,得到加噪视频;采用加噪视频与训练文本对初始视频生成模型进行视频重采样,生成采样视频。Optionally, the processor may also execute the program code of the following steps: performing noise processing on the training video to obtain a noisy video; and performing video resampling on the initial video generation model using the noisy video and the training text to generate a sampled video.

可选地,上述处理器还可以执行如下步骤的程序代码:获取训练视频对应的加噪步数与噪声等级,其中,加噪步数用于通过预设加噪函数确定训练视频待加噪的步数,噪声等级用于确定对训练视频的破坏程度;基于加噪步数与噪声等级对训练视频进行加噪处理,得到加噪视频。Optionally, the processor may also execute the following program code: obtaining the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps is used to determine the number of steps to be noised for the training video through a preset noise addition function, and the noise level is used to determine the degree of damage to the training video; performing noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain a noisy video.

可选地,上述处理器还可以执行如下步骤的程序代码:采用预设奖励模型对采样视频进行奖励计算,得到初始奖励结果;采用时间衰减奖励方式调节初始奖励结果对应的初始奖励权重,生成目标奖励结果,其中,初始奖励权重为采样视频包含的视频帧序列对应的默认奖励权重。Optionally, the processor may also execute the program code of the following steps: using a preset reward model to calculate rewards for the sampled video to obtain an initial reward result; using a time-decayed reward method to adjust an initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is a default reward weight corresponding to a video frame sequence contained in the sampled video.

可选地,上述处理器还可以执行如下步骤的程序代码:对视频帧序列进行视频分段采样,得到分段采样结果;采用预设奖励模型对分段采样结果进行奖励计算,得到初始奖励结果。Optionally, the processor may also execute the program code of the following steps: performing video segment sampling on the video frame sequence to obtain segment sampling results; performing reward calculation on the segment sampling results using a preset reward model to obtain an initial reward result.

可选地，上述处理器还可以执行如下步骤的程序代码：获取视频帧序列的特征空间表示；对特征空间表示进行视频分段采样，得到分段后视频帧的颜色空间表示；基于颜色空间表示确定分段采样结果。Optionally, the processor may also execute the program code of the following steps: obtaining a feature space representation of the video frame sequence; performing video segment sampling on the feature space representation to obtain a color space representation of the segmented video frames; and determining the segment sampling result based on the color space representation.

可选地,上述处理器还可以执行如下步骤的程序代码:采用时间衰减奖励方式对初始奖励结果对应的初始奖励权重进行差异化调节,得到目标奖励权重,其中,目标奖励权重用于表示视频帧序列中第一视频帧的当前奖励权重高于第二视频帧的当前奖励权重,第一视频帧位于视频帧序列的中间位置,第二视频帧位于视频帧序列的边缘位置;基于初始奖励结果与目标奖励权重生成目标奖励结果。Optionally, the processor may also execute the program code of the following steps: using a time-decayed reward method to perform differentiated adjustments on the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle of the video frame sequence, and the second video frame is located at the edge of the video frame sequence; generating the target reward result based on the initial reward result and the target reward weight.

可选地,目标视频生成模型为视频扩散模型,预设奖励模型为图像奖励模型,其中,图像奖励模型用于对视频扩散模型进行偏好学习。Optionally, the target video generation model is a video diffusion model, and the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.

采用本公开实施例,通过获取用于描述待生成的视频内容的目标文本,然后基于采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的目标视频生成模型对目标文本进行视频生成处理,即通过微调将预训练的U型网络与图片奖励模型对齐以得到微调后的目标视频生成模型,并利用微调后的目标视频生成模型基于目标文本生成视频,从而得到目标视频,由此达到了生成符合用户期望的目标视频的目的,从而实现了使生成的目标视频与人类的审美偏好和目标文本内容更加契合,提高了生成得到的目标视频的视频质量,使生成的目标视频更受人类喜爱的技术效果,进而解决了相关技术中基于网络数据训练视频生成模型,导致训练得到的视频生成模型所生成的视频质量较差,不符合用户期望的技术问题。According to the disclosed embodiment, a target text for describing the content of a video to be generated is obtained, and then a video generation process is performed on the target text based on a target video generation model obtained by aligning an initial video generation model with a preset reward model by fine-tuning. That is, a pre-trained U-shaped network is aligned with a picture reward model by fine-tuning to obtain a fine-tuned target video generation model, and a video is generated based on the target text using the fine-tuned target video generation model, thereby obtaining a target video. This achieves the purpose of generating a target video that meets user expectations, thereby achieving a technical effect of making the generated target video more consistent with human aesthetic preferences and the content of the target text, improving the video quality of the generated target video, and making the generated target video more popular with humans. This further solves the technical problem in the related art of training a video generation model based on network data, resulting in a poor quality video generated by the trained video generation model that does not meet user expectations.

本领域普通技术人员可以理解，图10所示的结构仅为示意，计算机终端A也可以是智能手机(如Android手机、iOS手机等)、平板电脑、掌上电脑以及移动互联网设备(Mobile Internet Devices，MID)、PAD等终端设备。图10并不对上述电子装置的结构造成限定。例如，计算机终端A还可包括比图10中所示更多或者更少的组件(如网络接口、显示装置等)，或者具有与图10所示不同的配置。It can be understood by those of ordinary skill in the art that the structure shown in FIG. 10 is only illustrative, and the computer terminal A may also be a terminal device such as a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID), or a PAD. FIG. 10 does not limit the structure of the above-mentioned electronic device. For example, the computer terminal A may further include more or fewer components (such as a network interface, a display device, etc.) than those shown in FIG. 10, or have a configuration different from that shown in FIG. 10.

本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令终端设备相关的硬件来完成，该程序可以存储于一计算机可读存储介质中，存储介质可以包括：闪存盘、只读存储器(Read-Only Memory，ROM)、随机存取存储器(Random Access Memory，RAM)、磁盘或光盘等。A person of ordinary skill in the art may understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing hardware related to the terminal device, and the program may be stored in a computer-readable storage medium; the storage medium may include: a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

实施例6Example 6

本公开的实施例还提供了一种计算机可读存储介质。可选地,在本实施例中,上述计算机可读存储介质可以用于保存上述实施例一所提供的视频生成方法所执行的程序代码。The embodiment of the present disclosure further provides a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium can be used to store the program code executed by the video generation method provided in the first embodiment.

可选地,在本实施例中,上述计算机可读存储介质可以位于计算机网络中计算机终端群中的任意一个计算机终端中,或者位于移动终端群中的任意一个移动终端中。Optionally, in this embodiment, the computer-readable storage medium may be located in any one of the computer terminals in a computer terminal group in a computer network, or in any one of the mobile terminals in a mobile terminal group.

可选地，在本实施例中，计算机可读存储介质被设置为存储用于执行以下步骤的程序代码：获取目标文本，其中，目标文本用于描述待生成的视频内容；采用目标视频生成模型对目标文本进行视频生成处理，得到目标视频，其中，目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。Optionally, in this embodiment, the computer-readable storage medium is configured to store program code for performing the following steps: obtaining a target text, wherein the target text is used to describe the video content to be generated; and performing video generation processing on the target text using a target video generation model to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:采用训练样本对初始视频生成模型进行采样处理,生成采样视频,其中,训练样本包括:多个视频文本对,多个视频文本对均包括:训练视频与训练文本,训练文本用于描述训练视频的视频内容;采用预设奖励模型对采样视频进行奖励计算,得到目标奖励结果;基于目标奖励结果对初始视频生成模型的模型参数进行调节,生成目标视频生成模型。Optionally, in this embodiment, the computer-readable storage medium is configured to store program codes for executing the following steps: sampling the initial video generation model using training samples to generate a sampled video, wherein the training samples include: multiple video-text pairs, and the multiple video-text pairs each include: a training video and a training text, and the training text is used to describe the video content of the training video; using a preset reward model to calculate rewards for the sampled video to obtain a target reward result; adjusting the model parameters of the initial video generation model based on the target reward result to generate a target video generation model.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:对训练视频进行加噪处理,得到加噪视频;采用加噪视频与训练文本对初始视频生成模型进行视频重采样,生成采样视频。Optionally, in this embodiment, the computer-readable storage medium is configured to store program codes for executing the following steps: performing noise processing on the training video to obtain a noisy video; and performing video resampling on the initial video generation model using the noisy video and the training text to generate a sampled video.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:获取训练视频对应的加噪步数与噪声等级,其中,加噪步数用于通过预设加噪函数确定训练视频待加噪的步数,噪声等级用于确定对训练视频的破坏程度;基于加噪步数与噪声等级对训练视频进行加噪处理,得到加噪视频。Optionally, in this embodiment, the computer-readable storage medium is configured to store program code for executing the following steps: obtaining the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps is used to determine the number of steps to be noised for the training video through a preset noise addition function, and the noise level is used to determine the degree of damage to the training video; performing noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain a noisy video.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:采用预设奖励模型对采样视频进行奖励计算,得到初始奖励结果;采用时间衰减奖励方式调节初始奖励结果对应的初始奖励权重,生成目标奖励结果,其中,初始奖励权重为采样视频包含的视频帧序列对应的默认奖励权重。Optionally, in this embodiment, the computer-readable storage medium is configured to store program codes for executing the following steps: using a preset reward model to calculate rewards for the sampled video to obtain an initial reward result; using a time-decayed reward method to adjust an initial reward weight corresponding to the initial reward result to generate a target reward result, wherein the initial reward weight is a default reward weight corresponding to a video frame sequence contained in the sampled video.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:对视频帧序列进行视频分段采样,得到分段采样结果;采用预设奖励模型对分段采样结果进行奖励计算,得到初始奖励结果。Optionally, in this embodiment, the computer-readable storage medium is configured to store program codes for executing the following steps: performing video segment sampling on a video frame sequence to obtain segment sampling results; and performing reward calculation on the segment sampling results using a preset reward model to obtain an initial reward result.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:获取视频帧序列的特征空间表示;对特征空间表示进行视频分段采样,得到分段后视频帧的颜色空间表示;基于颜色空间表示确定分段采样结果。Optionally, in this embodiment, the computer-readable storage medium is configured to store program code for executing the following steps: obtaining a feature space representation of a video frame sequence; performing video segmentation sampling on the feature space representation to obtain a color space representation of the segmented video frame; and determining the segmentation sampling result based on the color space representation.

可选地,在本实施例中,计算机可读存储介质被设置为存储用于执行以下步骤的程序代码:采用时间衰减奖励方式对初始奖励结果对应的初始奖励权重进行差异化调节,得到目标奖励权重,其中,目标奖励权重用于表示视频帧序列中第一视频帧的当前奖励权重高于第二视频帧的当前奖励权重,第一视频帧位于视频帧序列的中间位置,第二视频帧位于视频帧序列的边缘位置;基于初始奖励结果与目标奖励权重生成目标奖励结果。Optionally, in this embodiment, the computer-readable storage medium is configured to store program code for executing the following steps: using a time-decayed reward method to differentially adjust the initial reward weight corresponding to the initial reward result to obtain a target reward weight, wherein the target reward weight is used to indicate that the current reward weight of the first video frame in the video frame sequence is higher than the current reward weight of the second video frame, the first video frame is located in the middle position of the video frame sequence, and the second video frame is located at the edge position of the video frame sequence; generating the target reward result based on the initial reward result and the target reward weight.

可选地,目标视频生成模型为视频扩散模型,预设奖励模型为图像奖励模型,其中,图像奖励模型用于对视频扩散模型进行偏好学习。 Optionally, the target video generation model is a video diffusion model, and the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.

上述本公开实施例序号仅仅为了描述,不代表实施例的优劣。The serial numbers of the above-mentioned embodiments of the present disclosure are only for description and do not represent the advantages or disadvantages of the embodiments.

在本公开的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above embodiments of the present disclosure, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference can be made to the relevant descriptions of other embodiments.

在本公开所提供的几个实施例中,应该理解到,所揭露的技术内容,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。In the several embodiments provided in the present disclosure, it should be understood that the disclosed technical content can be implemented in other ways. Among them, the device embodiments described above are only schematic. For example, the division of the units is only a logical function division. There may be other division methods in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of units or modules, which can be electrical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium, including several instructions for a computer device (which can be a personal computer, server or network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present disclosure. The aforementioned storage medium includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.

以上所述仅是本公开的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本公开原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本公开的保护范围。 The above is only a preferred embodiment of the present disclosure. It should be pointed out that for ordinary technicians in this technical field, several improvements and modifications can be made without departing from the principle of the present disclosure. These improvements and modifications should also be regarded as the scope of protection of the present disclosure.

Claims (13)

1. 一种视频生成方法，包括：获取目标文本，其中，所述目标文本用于描述待生成的视频内容；采用目标视频生成模型对所述目标文本进行视频生成处理，得到目标视频，其中，所述目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型。A video generation method, comprising: obtaining a target text, wherein the target text is used to describe the video content to be generated; and performing video generation processing on the target text using a target video generation model to obtain a target video, wherein the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner.

2. 根据权利要求1所述的视频生成方法，其中，所述视频生成方法还包括：采用训练样本对所述初始视频生成模型进行采样处理，生成采样视频，其中，所述训练样本包括：多个视频文本对，所述多个视频文本对均包括：训练视频与训练文本，所述训练文本用于描述所述训练视频的视频内容；采用所述预设奖励模型对所述采样视频进行奖励计算，得到目标奖励结果；基于所述目标奖励结果对所述初始视频生成模型的模型参数进行调节，生成所述目标视频生成模型。The video generation method according to claim 1, wherein the video generation method further comprises: sampling the initial video generation model using training samples to generate a sampled video, wherein the training samples include a plurality of video-text pairs, each of the plurality of video-text pairs includes a training video and a training text, and the training text is used to describe the video content of the training video; performing reward calculation on the sampled video using the preset reward model to obtain a target reward result; and adjusting model parameters of the initial video generation model based on the target reward result to generate the target video generation model.

3. 根据权利要求2所述的视频生成方法，其中，采用所述训练样本对所述初始视频生成模型进行采样处理，生成所述采样视频包括：对所述训练视频进行加噪处理，得到加噪视频；采用所述加噪视频与所述训练文本对所述初始视频生成模型进行视频重采样，生成所述采样视频。The video generation method according to claim 2, wherein sampling the initial video generation model using the training samples to generate the sampled video comprises: performing noise addition processing on the training video to obtain a noisy video; and performing video resampling on the initial video generation model using the noisy video and the training text to generate the sampled video.

4. 根据权利要求3所述的视频生成方法，其中，对所述训练视频进行加噪处理，得到所述加噪视频包括：获取所述训练视频对应的加噪步数与噪声等级，其中，所述加噪步数用于通过预设加噪函数确定所述训练视频待加噪的步数，所述噪声等级用于确定对所述训练视频的破坏程度；基于所述加噪步数与所述噪声等级对所述训练视频进行加噪处理，得到所述加噪视频。The video generation method according to claim 3, wherein performing noise addition processing on the training video to obtain the noisy video comprises: obtaining the number of noise addition steps and the noise level corresponding to the training video, wherein the number of noise addition steps is used to determine, through a preset noise addition function, the number of steps for which the training video is to be noised, and the noise level is used to determine the degree of destruction of the training video; and performing noise addition processing on the training video based on the number of noise addition steps and the noise level to obtain the noisy video.

5. 根据权利要求2所述的视频生成方法，其中，采用所述预设奖励模型对所述采样视频进行奖励计算，得到所述目标奖励结果包括：采用所述预设奖励模型对所述采样视频进行奖励计算，得到初始奖励结果；采用时间衰减奖励方式调节所述初始奖励结果对应的初始奖励权重，生成所述目标奖励结果，其中，所述初始奖励权重为所述采样视频包含的视频帧序列对应的默认奖励权重。The video generation method according to claim 2, wherein performing reward calculation on the sampled video using the preset reward model to obtain the target reward result comprises: performing reward calculation on the sampled video using the preset reward model to obtain an initial reward result; and adjusting an initial reward weight corresponding to the initial reward result in a time-decayed reward manner to generate the target reward result, wherein the initial reward weight is a default reward weight corresponding to the video frame sequence included in the sampled video.

6. 根据权利要求5所述的视频生成方法，其中，采用所述预设奖励模型对所述采样视频进行奖励计算，得到所述初始奖励结果包括：对所述视频帧序列进行视频分段采样，得到分段采样结果；采用所述预设奖励模型对所述分段采样结果进行奖励计算，得到所述初始奖励结果。The video generation method according to claim 5, wherein performing reward calculation on the sampled video using the preset reward model to obtain the initial reward result comprises: performing video segment sampling on the video frame sequence to obtain a segment sampling result; and performing reward calculation on the segment sampling result using the preset reward model to obtain the initial reward result.

7. 根据权利要求6所述的视频生成方法，其中，对所述视频帧序列进行视频分段采样，得到所述分段采样结果包括：获取所述视频帧序列的特征空间表示；对所述特征空间表示进行视频分段采样，得到分段后视频帧的颜色空间表示；基于所述颜色空间表示确定所述分段采样结果。The video generation method according to claim 6, wherein performing video segment sampling on the video frame sequence to obtain the segment sampling result comprises: obtaining a feature space representation of the video frame sequence; performing video segment sampling on the feature space representation to obtain a color space representation of the segmented video frames; and determining the segment sampling result based on the color space representation.

8. 根据权利要求7所述的视频生成方法，其中，采用所述时间衰减奖励方式调节所述初始奖励结果对应的初始奖励权重，生成所述目标奖励结果包括：采用所述时间衰减奖励方式对所述初始奖励结果对应的初始奖励权重进行差异化调节，得到目标奖励权重，其中，所述目标奖励权重用于表示所述视频帧序列中第一视频帧的当前奖励权重高于第二视频帧的当前奖励权重，所述第一视频帧位于所述视频帧序列的中间位置，所述第二视频帧位于所述视频帧序列的边缘位置；基于所述初始奖励结果与所述目标奖励权重生成所述目标奖励结果。The video generation method according to claim 7, wherein adjusting the initial reward weight corresponding to the initial reward result in the time-decayed reward manner to generate the target reward result comprises: differentially adjusting the initial reward weight corresponding to the initial reward result in the time-decayed reward manner to obtain a target reward weight, wherein the target reward weight is used to indicate that a current reward weight of a first video frame in the video frame sequence is higher than a current reward weight of a second video frame, the first video frame is located in the middle of the video frame sequence, and the second video frame is located at the edge of the video frame sequence; and generating the target reward result based on the initial reward result and the target reward weight.

9. 根据权利要求1所述的视频生成方法，其中，所述目标视频生成模型为视频扩散模型，所述预设奖励模型为图像奖励模型，其中，所述图像奖励模型用于对所述视频扩散模型进行偏好学习。The video generation method according to claim 1, wherein the target video generation model is a video diffusion model, and the preset reward model is an image reward model, wherein the image reward model is used to perform preference learning on the video diffusion model.

10. 一种视频生成方法，包括：通过第一应用程序编程接口获取视频生成调用请求，其中，所述视频生成调用请求中携带的请求数据包括：目标文本，所述目标文本用于描述待生成的视频内容，所述视频生成调用请求用于请求调用目标视频生成模型对所述目标文本进行视频生成处理以得到目标视频，所述目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型；通过第二应用程序编程接口返回视频生成调用响应，其中，所述视频生成调用响应中携带的响应数据包括：所述目标视频。A video generation method, comprising: obtaining a video generation call request through a first application programming interface, wherein the request data carried in the video generation call request includes a target text, the target text is used to describe the video content to be generated, the video generation call request is used to request invocation of a target video generation model to perform video generation processing on the target text to obtain a target video, and the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner; and returning a video generation call response through a second application programming interface, wherein the response data carried in the video generation call response includes the target video.

11. 一种视频生成方法，包括：获取当前输入的视频生成对话请求，其中，所述视频生成对话请求中携带的信息包括：目标文本，所述目标文本用于描述待生成的视频内容；响应于所述视频生成对话请求，返回视频生成对话回复，其中，所述视频生成对话回复中携带的信息包括：目标视频，所述目标视频采用目标视频生成模型对所述目标文本进行视频生成处理后得到，所述目标视频生成模型为采用微调方式对初始视频生成模型与预设奖励模型进行模型对齐后得到的模型；在图形用户界面内展示所述视频生成对话回复。A video generation method, comprising: obtaining a currently input video generation dialogue request, wherein the information carried in the video generation dialogue request includes a target text, and the target text is used to describe the video content to be generated; returning a video generation dialogue reply in response to the video generation dialogue request, wherein the information carried in the video generation dialogue reply includes a target video, the target video is obtained by performing video generation processing on the target text using a target video generation model, and the target video generation model is a model obtained by aligning an initial video generation model with a preset reward model in a fine-tuning manner; and displaying the video generation dialogue reply in a graphical user interface.

12. 一种电子设备，包括：存储器，存储有可执行程序；处理器，用于运行所述程序，其中，所述程序运行时执行权利要求1至11中任意一项所述的方法。An electronic device, comprising: a memory storing an executable program; and a processor configured to run the program, wherein the program, when running, performs the method according to any one of claims 1 to 11.

13. 一种计算机可读存储介质，所述计算机可读存储介质包括存储的可执行程序，其中，在所述可执行程序运行时控制所述计算机可读存储介质所在设备执行权利要求1至11中任意一项所述的方法。A computer-readable storage medium, comprising a stored executable program, wherein when the executable program runs, a device where the computer-readable storage medium is located is controlled to perform the method according to any one of claims 1 to 11.
PCT/CN2024/107881 2023-12-05 2024-07-26 Video generation method, electronic device, and computer readable storage medium Pending WO2025118634A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202311659000.8 2023-12-05
CN202311659000.8A CN117668297A (en) 2023-12-05 2023-12-05 Video generation method, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2025118634A1 true WO2025118634A1 (en) 2025-06-12

Family

ID=90065659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/107881 Pending WO2025118634A1 (en) 2023-12-05 2024-07-26 Video generation method, electronic device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN117668297A (en)
WO (1) WO2025118634A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120494017A (en) * 2025-07-17 2025-08-15 北京达佳互联信息技术有限公司 Training method, training device, training equipment and training storage medium for text feature generation model

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117668297A (en) * 2023-12-05 2024-03-08 浙江阿里巴巴机器人有限公司 Video generation method, electronic device and computer-readable storage medium
CN120935423A (en) * 2024-05-09 2025-11-11 阿里巴巴(中国)有限公司 Video generation methods, electronic devices and computer-readable storage media
CN118354164B (en) * 2024-06-17 2024-10-29 阿里巴巴(中国)有限公司 Video generation method, electronic device and computer readable storage medium
CN118695051B (en) * 2024-08-26 2024-11-22 腾讯科技(深圳)有限公司 Data generation method, device, product, equipment and medium
CN119094814B (en) * 2024-08-29 2025-09-09 腾讯科技(深圳)有限公司 Data processing method and device
CN119416748A (en) * 2024-09-13 2025-02-11 北京百度网讯科技有限公司 Method, device, electronic device and storage medium for generating review information based on large model
CN119364133B (en) * 2024-12-19 2025-03-18 苏州元脑智能科技有限公司 Video data generation method, electronic device, storage medium and program product
CN119851174B (en) * 2024-12-24 2025-10-28 清华大学 Multi-dimensional fine granularity rewarding method and device for image and video generation
CN120373391A (en) * 2025-04-15 2025-07-25 上海幻电信息科技有限公司 Model optimization method and device
CN120953453A (en) * 2025-10-14 2025-11-14 阿里巴巴(中国)有限公司 Model training methods, video generation methods, electronic devices and storage media

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230118966A1 (en) * 2022-12-16 2023-04-20 Lemon Inc. Generation of story videos corresponding to user input using generative models
CN116894880A (en) * 2023-07-11 2023-10-17 北京百度网讯科技有限公司 A training method, model, device and electronic equipment for Vincentian graph model
DE202023101550U1 (en) * 2023-03-28 2023-10-25 Google LLC Generating videos using generative neural network sequences
CN116958969A (en) * 2023-07-12 2023-10-27 腾讯科技(深圳)有限公司 Picture generation method and device, storage medium and electronic equipment
CN117078782A (en) * 2023-08-12 2023-11-17 虚空漫步(苏州)科技有限公司 Novel text video generation method and system based on generation type artificial intelligence
CN117095083A (en) * 2023-10-17 2023-11-21 华南理工大学 A text-image generation method, system, device and storage medium
CN117668297A (en) * 2023-12-05 2024-03-08 浙江阿里巴巴机器人有限公司 Video generation method, electronic device and computer-readable storage medium

Also Published As

Publication number Publication date
CN117668297A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
WO2025118634A1 (en) Video generation method, electronic device, and computer readable storage medium
CN111783647B (en) Training method of face fusion model, face fusion method, device and equipment
CN114519143B (en) Training method of course recommendation model, course recommendation method and device
US10600171B2 (en) Image-blending via alignment or photometric adjustments computed by a neural network
US20180260668A1 (en) Harmonizing composite images using deep learning
CN111199540A (en) Image quality evaluation method, device, electronic device and storage medium
US20240046538A1 (en) Method for generating face shape adjustment image, model training method, apparatus and device
CN118714417A (en) Video generation method, system, electronic device and storage medium
CN118052907A (en) Text map generation method and related device
CN113626129A (en) Method, device and electronic device for determining page color
CN117332068A (en) Human-computer interaction methods, devices, electronic equipment and storage media
CN112102304B (en) Image processing method, device, computer equipment and computer readable storage medium
CN116955868A (en) Training and application method, device and equipment of webpage template evaluation model
CN114449355B (en) Live interaction method, device, equipment and storage medium
CN115866332A (en) A processing method, device and processing equipment for a video frame interpolation model
CN114511444A (en) Image set grid collage method and device
US20250157106A1 (en) Style tailoring latent diffusion models for human expression
CN119152063A (en) Image generation method and device, electronic equipment and nonvolatile storage medium
CN119478155A (en) Digital human generation method, device, computer equipment and storage medium
WO2025086840A1 (en) Image generation method and apparatus, electronic device, and storage medium
CN119168093A (en) Question and answer information processing method, model training method, device, electronic device and medium
CN119251841A (en) Image semantic extraction method, device, computer equipment and storage medium
CN116775179A (en) Virtual object configuration method, electronic device and computer readable storage medium
CN117788352A (en) Image processing method, system and electronic equipment
US20250054405A1 (en) System for Providing Step-by-Step Explanations of Pedagogical Exercises Using Machine-Learned Models

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24899284

Country of ref document: EP

Kind code of ref document: A1