CN117336567A - Video generation method, device, equipment and storage medium
- Publication number: CN117336567A
- Application number: CN202311030837.6A
- Authority: CN (China)
- Prior art keywords: text, video, original, dubbing, sample
- Legal status: Pending (assumed; not a legal conclusion)
Abstract
The application discloses a video generation method, apparatus, device and storage medium. The video generation method includes: acquiring an original manuscript and an original illustration of the original manuscript; acquiring reference data for dubbing, obtained by analyzing the original manuscript and the original illustration, where the reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text; performing speech synthesis based on the reference data to obtain a video dubbing; and generating a target video based at least on the original illustration and the video dubbing. With this scheme, video generation efficiency can be improved and video generation cost reduced.
Description
Technical Field
The present disclosure relates to the field of computer data processing technologies, and in particular, to a video generating method, apparatus, device, and storage medium.
Background
With the popularization of computer technology, video has become an important medium of transmission in daily life. In existing video production, after the manuscript required for a video is edited, dedicated personnel must record the corresponding video, which consumes considerable human resources; the cost of recording is therefore high, manual recording takes a long time, and if an error occurs the video has to be re-recorded or re-produced. Existing video generation methods are therefore inefficient and costly.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a video generation method, apparatus, device and storage medium that can improve video generation efficiency and reduce video generation cost.
In order to solve the above technical problem, a first aspect of the present application provides a video generation method, including: acquiring an original manuscript and an original illustration of the original manuscript; acquiring reference data for dubbing, obtained by analyzing the original manuscript and the original illustration, where the reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text; performing speech synthesis based on the reference data to obtain a video dubbing; and generating a target video based at least on the original illustration and the video dubbing.
In order to solve the above technical problem, a second aspect of the present application provides a video generation apparatus, which includes an acquisition module, an analysis module, a synthesis module and a generation module. The acquisition module is used to acquire the original manuscript and the original illustration of the original manuscript; the analysis module is used to acquire reference data for dubbing, obtained by analyzing the original manuscript and the original illustration, where the reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text; the synthesis module is used to perform speech synthesis based on the reference data to obtain a video dubbing; and the generation module is used to generate a target video based at least on the original illustration and the video dubbing.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the video generating method of the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer readable storage medium storing program instructions executable by a processor for implementing the video generating method of the first aspect.
In the above scheme, after the original manuscript and its original illustration are obtained, reference data for dubbing, obtained by analyzing the original manuscript and the original illustration, is further acquired; the reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text; speech synthesis is performed based on the reference data to obtain a video dubbing; and a target video is generated based at least on the original illustration and the video dubbing. In this way, the user only needs to provide the original manuscript of the target video to be generated and the original illustration of that manuscript, and the target video can be analyzed and generated automatically. Compared with manual video production, this improves video production efficiency and reduces video production cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
Fig. 1 is a schematic flow chart of a first embodiment of a video generating method provided in the present application;
fig. 2 is a schematic flow chart of a second embodiment of a video generating method provided in the present application;
fig. 3 is a schematic flow chart of a third embodiment of a video generating method provided in the present application;
FIG. 4 is a schematic diagram of one embodiment of a video generation network provided herein;
fig. 5 is a schematic flow chart of a fourth embodiment of a video generating method provided in the present application;
FIG. 6 is a schematic diagram of one embodiment of a preview page provided herein;
FIG. 7 is a schematic diagram of a video generating apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a frame of an embodiment of an electronic device provided herein;
FIG. 9 is a schematic diagram of a framework of one embodiment of a computer-readable storage medium provided herein.
Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. "several" means at least one. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Referring to fig. 1, fig. 1 is a flowchart of a first embodiment of a video generation method provided in the present application. The method may be performed by a terminal device, which may be, for example, a computer, a mobile phone or a tablet, and is not particularly limited in this embodiment. It should be noted that, provided substantially the same result is achieved, the method is not limited to the flow sequence shown in fig. 1. As shown in fig. 1, the method comprises the following steps:
S11: Obtain the original manuscript and the original illustration of the original manuscript.
In this embodiment, the manuscript type of the original manuscript may be a work report, a business manuscript, a new media manuscript, a personal lecture manuscript, a hosting script, etc. The work report may be a personal work report or a unit work report and is mainly used for summarizing matters related to the work of the user or of an organization. Business manuscripts may include business letters, business contracts, business plans, market research reports, sales proposals, business briefs, business lectures, business articles, business blogs, and the like. For example, a business letter may be a business invitation, a business promotion, a business thank-you letter, a business letter of apology, and so on. A business contract is used to clarify the rights and obligations of the business partners. A business plan describes the business model, market prospects and funding requirements of a project, so as to introduce a new business idea or project to investors or financial institutions. A market research report explains market conditions and the competitive environment and provides a basis for enterprise decision making. A sales proposal introduces a product or service to potential customers and explains its advantages and features to facilitate sales. A business brief reports work progress, results, problems and plans to superiors or subordinate staff. A business lecture is used on occasions such as business meetings, business presentations and business activities. Business articles are mainly published on business media or platforms to build a brand or professional image for an enterprise. Business blogs mainly contain information or insights about business management, typically published by individuals or enterprises on social media or websites. The new media manuscript may be a news manuscript, an official-account operation copy, etc. It should be noted that the above types of original manuscript are merely exemplary; in other embodiments the original manuscript may be of other document types, which this embodiment does not specifically limit.
The original illustration of the original manuscript may be a picture that vividly expresses the content of the original manuscript. For example, if the original manuscript mentions a place, the original illustration may be a picture of that place; if the original manuscript mentions a product, the original illustration may be a picture of that product. The original illustration may be a picture contained in the original manuscript, a picture input by the user, or an image generated by a model. It should be noted that the above forms of the original illustration are merely exemplary; in other embodiments the original illustration may take other forms, which this embodiment does not limit, but it should be clear that the original illustration corresponds to the content of the original manuscript.
In an implementation scenario, the original manuscript and the original illustration may be uploaded by the user, i.e., the user uploads them on a human-computer interaction interface of the video generation device; they may also be acquired from other devices in response to an acquisition instruction from the user. It will be appreciated that the original manuscript may have no corresponding original illustration, in which case a blank image may be used as the original illustration.
S12: Acquire reference data for dubbing, obtained by analyzing the original manuscript and the original illustration.
In this embodiment, the reference data may include a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text. The copy text may be the original manuscript itself, or a key text obtained by summarizing the original manuscript. The first text may include the emotion information contained in the copy text, such as happiness, anger, surprise or sadness. The second text may include the pronunciation timbre to be used for the copy text, such as a boy, a girl, a middle-aged man, a middle-aged woman, an elderly man or an elderly woman.
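As an illustrative aid only (not part of the claimed method), the reference data described above can be modeled as a simple record holding the three texts; the field names below are assumptions chosen for readability.

from dataclasses import dataclass

@dataclass
class ReferenceData:
    """Dubbing reference data for one target paragraph (field names are illustrative)."""
    copy_text: str      # the text to be broadcast (the original manuscript or its summary)
    emotion_text: str   # first text: emotion information, e.g. "happy", "angry", "surprised", "sad"
    timbre_text: str    # second text: pronunciation timbre, e.g. "girl", "middle-aged man"

# Example record for a single paragraph
example = ReferenceData(
    copy_text="The new product launch attracted a record audience.",
    emotion_text="happy",
    timbre_text="young woman",
)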
The step of obtaining the reference data may include: extracting text features of a target paragraph in the original manuscript, and extracting image features of the original illustration attached to the target paragraph; processing the text features and the image features based on a pre-trained multi-modal encoder to obtain multi-modal features fusing the text information and the image information; and decoding the multi-modal features to obtain the reference data of the target paragraph. The text features include character features of each character in the target paragraph, and the image features include sub-block features of each sub-block of the original illustration. Specifically, the original manuscript may be segmented into at least one target paragraph; each target paragraph is input into a feature extraction network, which produces the character features of each character in the paragraph, and these character features may be represented by word embedding vectors. The original illustration corresponding to the target paragraph is divided into several sub-blocks, and the feature extraction network extracts the feature of each sub-block to obtain the image features of the original illustration attached to the target paragraph. It will be appreciated that the feature extraction network may be any network capable of feature extraction; for example, a Transformer may be used. By processing each target paragraph and its attached original illustration in this way, the reference data of each target paragraph is obtained, and the reference data of all target paragraphs can be combined into the reference data of the original manuscript.
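A minimal PyTorch-style sketch of the fusion step described above, assuming the character features and the illustration sub-block (patch) features have already been projected to a shared dimension; the module names and dimensions are assumptions, not the network disclosed in this application.

import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Fuses character features of a paragraph with patch features of its illustration."""
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.type_embed = nn.Embedding(2, dim)  # 0 = text character, 1 = image patch

    def forward(self, char_feats, patch_feats):
        # char_feats: (B, N_chars, dim), patch_feats: (B, N_patches, dim)
        text = char_feats + self.type_embed(
            torch.zeros(char_feats.shape[:2], dtype=torch.long, device=char_feats.device))
        img = patch_feats + self.type_embed(
            torch.ones(patch_feats.shape[:2], dtype=torch.long, device=patch_feats.device))
        tokens = torch.cat([text, img], dim=1)   # concatenate both modalities
        return self.encoder(tokens)              # multi-modal features fusing text and image info

# fused = MultiModalEncoder()(char_feats, patch_feats); a semantic decoder would then
# generate the copy text, the emotion text and the timbre text from `fused`.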
In this embodiment, the original manuscript is segmented; compared with feeding the entire original manuscript and all of its original illustrations into the multi-modal encoder at once, feeding each target paragraph and the original illustration attached to it separately allows the multi-modal encoder to obtain more accurate reference data for each paragraph.
In an implementation scenario, the reference data may be obtained by analyzing the target paragraph and the original illustration attached to it with an image-text understanding network. The image-text understanding network may include a semantic decoder and a pre-trained multi-modal encoder: the multi-modal encoder processes the text features and the image features to obtain multi-modal features fusing the text information and the image information, and the semantic decoder decodes the multi-modal features to obtain the reference data of the target paragraph. The image-text understanding network needs to be trained before use; during training, the parameters of the multi-modal encoder are fixed and the parameters of the semantic decoder are updated.
The training process of the image-text understanding network includes: acquiring a sample manuscript, its sample original illustrations and the corresponding sample published video; determining, based on the sample published video, sample reference data of a sample target paragraph in the sample manuscript, and processing the sample target paragraph and the sample original illustration attached to it with the image-text understanding network to obtain prediction reference data of the sample target paragraph; and adjusting the network parameters of the semantic decoder based on the difference between the sample reference data and the prediction reference data.
In a specific embodiment, the sample manuscript is a news manuscript. News manuscripts and pictures of different categories such as finance, entertainment, politics and society can be obtained from public news websites such as Sina and NetEase to serve as sample manuscripts and sample original illustrations, and broadcast videos corresponding to these news manuscripts can be obtained from a short-video platform as sample published videos. The sample manuscript is divided into at least one sample target paragraph, and the sample published video segment corresponding to each sample target paragraph is determined from the sample published video. For example, if the sample manuscript yields a single sample target paragraph, the video corresponding to that paragraph is the whole sample published video of the manuscript; if the sample manuscript is divided into two sample target paragraphs and the total duration of the sample published video is 40s, the video corresponding to the first sample target paragraph may be the first 10s and the video corresponding to the second sample target paragraph may be the 10s-40s segment. Further, based on the sample published video segment corresponding to the sample target paragraph, the broadcast text, a first reference text representing the anchor's emotion and a second reference text representing the anchor's pronunciation timbre in that video are determined and used as the sample reference data of the sample target paragraph.
After the sample manuscript has been divided into at least one sample target paragraph, each sample target paragraph and the sample original illustration attached to it are input into the image-text understanding network: the multi-modal encoder encodes them to obtain sample multi-modal features fusing the sample text information and the sample image information, and the semantic decoder decodes these features to obtain the prediction reference data of the sample target paragraph. The prediction reference data include a predicted broadcast text, a first predicted text representing the emotion contained in the predicted broadcast text, and a second predicted text representing the pronunciation timbre to be used for the predicted broadcast text. The network parameters of the semantic decoder are adjusted based on the difference between the sample reference data and the prediction reference data; specifically, a total loss may be determined from this difference and the parameters adjusted accordingly. For example, a first difference between the broadcast text and the predicted broadcast text, a second difference between the first reference text and the first predicted text, and a third difference between the second reference text and the second predicted text may be computed to obtain a first, second and third loss respectively; the three losses may then be weighted or directly summed to obtain the total loss, which is used to adjust the network parameters of the semantic decoder. It will be appreciated that the difference between the sample reference data and the prediction reference data may also be computed directly to obtain the total loss. The way the loss is calculated is not specifically limited; for example, a cosine-similarity or mean-squared-error calculation may be used.
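A sketch of the weighted total loss described above, assuming each predicted text is compared with its sample reference text token by token via cross-entropy; the per-text weights and the choice of cross-entropy are assumptions made for illustration.

import torch.nn.functional as F

def decoder_total_loss(pred_logits, target_ids, weights=(1.0, 1.0, 1.0)):
    """pred_logits / target_ids: 3-tuples for (broadcast text, emotion text, timbre text).

    Each logits tensor is (B, T, vocab), each target tensor is (B, T).
    Returns the weighted sum of the three per-text cross-entropy losses.
    """
    losses = [
        F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids.reshape(-1))
        for logits, ids in zip(pred_logits, target_ids)
    ]
    return sum(w * l for w, l in zip(weights, losses))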
S13: Perform speech synthesis based on the reference data to obtain the video dubbing.
In one implementation scenario, after the reference data is obtained, it may be input into a speech synthesis model, and the video dubbing is synthesized with this model. The speech synthesis model may be an existing NaturalSpeech 2 model, which represents speech with continuous vectors instead of discrete labels, so it generates more complete speech segments and avoids the emotionless, word-by-word robotic reading effect. It will be appreciated that in other implementation scenarios other speech synthesis models may be used, such as a statistical parametric speech synthesis (SPSS) model. It should also be clear that the video dubbing does not have to be obtained with a model; other approaches may be used.
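The application does not fix a particular speech-synthesis interface, so the wrapper below is purely hypothetical: `synthesize` and its parameters stand in for whatever TTS model (for example a NaturalSpeech 2 style model) is actually used.

from typing import Protocol

class SpeechSynthesizer(Protocol):
    """Hypothetical interface; a concrete TTS model would provide it."""
    def synthesize(self, text: str, emotion: str, timbre: str) -> bytes: ...

def make_video_dubbing(tts: SpeechSynthesizer, reference_data_per_paragraph) -> list[bytes]:
    """Synthesize one audio clip per target paragraph and return them in order."""
    clips = []
    for ref in reference_data_per_paragraph:          # each `ref` is a ReferenceData record
        audio = tts.synthesize(ref.copy_text, ref.emotion_text, ref.timbre_text)
        clips.append(audio)
    return clips                                      # concatenating the clips yields the video dubbing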
S14: Generate a target video based at least on the original illustration and the video dubbing.
In an implementation scenario, the target video may be generated based only on the original illustration and the video dubbing; that is, the picture displayed in the target video contains only the original illustration.
In another implementation scenario, the target video may be generated based on the original illustration, the video dubbing and a broadcast avatar; that is, the picture displayed in the target video contains the original illustration and the broadcast avatar. The broadcast avatar may be obtained by analyzing the first text containing the emotion information and the second text representing at least the pronunciation timbre to be used for the copy text, or it may be a broadcast avatar selected by the user.
In other implementation scenarios, the target video may also be generated based on the original illustration, a newly added illustration and the video dubbing; that is, the picture displayed in the target video contains the original illustration and the newly added illustration. The newly added illustration is obtained by analyzing the copy text and the original illustration.
It will be appreciated that in other implementation scenarios the target video may also be generated based on the original illustration, the newly added illustration, the broadcast avatar and the video dubbing.
The original illustration may be displayed in the target video in picture-in-picture form, for example in the upper-left, upper-right, lower-left or lower-right region of the target video picture. In another implementation scenario, the original illustration may also be displayed full screen in the target video picture. Similarly, the newly added illustration may be displayed in the target video picture in the same ways as the original illustration, which is not repeated here. The application does not specifically limit how the original illustration and the newly added illustration are displayed in the target video.
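A minimal sketch of the picture-in-picture layout described above using Pillow; the corner, margin and scale values are assumptions, not parameters taken from this application.

from PIL import Image

def overlay_picture_in_picture(frame_path, illustration_path, out_path, scale=0.3, margin=20):
    """Paste the illustration into the top-right corner of a video frame."""
    frame = Image.open(frame_path).convert("RGB")
    pic = Image.open(illustration_path).convert("RGB")
    w = int(frame.width * scale)
    h = int(pic.height * w / pic.width)                   # keep the illustration's aspect ratio
    pic = pic.resize((w, h))
    frame.paste(pic, (frame.width - w - margin, margin))  # top-right region
    frame.save(out_path)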
In the above scheme, after the original manuscript and its original illustration are obtained, reference data for dubbing, obtained by analyzing the original manuscript and the original illustration, is further acquired; the reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text; speech synthesis is performed based on the reference data to obtain the video dubbing; and a target video is generated based at least on the original illustration and the video dubbing. In this way, the user only needs to provide the original manuscript of the target video to be generated and the original illustration of that manuscript, and the target video can be analyzed and generated automatically. Compared with manual recording, this improves video generation efficiency and reduces video generation cost.
Referring to fig. 2, fig. 2 is a flowchart of a second embodiment of a video generating method provided in the present application, where the method includes:
S21: Obtain the original manuscript and the original illustration of the original manuscript.
S22: Acquire reference data for dubbing, obtained by analyzing the original manuscript and the original illustration.
The reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text. For the specific implementation of steps S21 and S22, refer to steps S11 and S12 of the first embodiment of the video generation method provided in the present application, which are not repeated here.
In an implementation scenario, in addition to representing the pronunciation timbre, the second text may represent information such as the dressing and clothing of the anchor who broadcasts the copy text.
S23: Acquire the broadcast avatar obtained by analyzing the first text and the second text.
In an implementation scenario, a first descriptive text characterizing the anchor's details may be predicted based on the first text and the second text, and the broadcast avatar is generated based on the first descriptive text. Specifically, the first descriptive text may be predicted from the first text and the second text by a prompt decoder: the first text and the second text are input into the prompt decoder, and the prompt decoder outputs the first descriptive text. The first descriptive text may characterize the anchor's gender, age group, appearance and so on. For example, if the emotion information expressed by the first text is happiness and the second text indicates a girl's pronunciation timbre, the first descriptive text may contain information such as a girl's appearance, a happy expression, and the girl's dressing and clothing.
After the first descriptive text is obtained, a pre-trained text-to-image network can be used to generate the broadcast avatar from the first descriptive text; the broadcast avatar is the appearance of the anchor. In one implementation, the pre-trained text-to-image network may be a diffusion model. It will be appreciated that in other implementation scenarios the pre-trained text-to-image network may also be another model capable of generating images from text, which is not specifically limited here.
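A sketch of the "descriptive text to broadcast avatar" step using an off-the-shelf diffusion pipeline as a stand-in for the pre-trained text-to-image network; the model name is an example, not the network used in this application.

import torch
from diffusers import StableDiffusionPipeline

def generate_broadcast_avatar(first_description_text: str):
    """Generate an anchor avatar image from the first descriptive text (illustrative only)."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    # e.g. first_description_text = "a cheerful young female news anchor, formal dress, studio background"
    return pipe(first_description_text).images[0]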
In an implementation scenario, the original manuscript is segmented; after the first text and the second text corresponding to each target paragraph are obtained, the prompt decoder directly decodes the first text and the second text of the target paragraph to obtain the first descriptive text of that paragraph, from which the broadcast avatar of the paragraph is generated. In this case, the broadcast avatars of different target paragraphs may differ. This is applicable when the original manuscript is a dialogue manuscript.
In another implementation scenario, if the original manuscript is not a dialogue manuscript, for example a news manuscript with only one anchor, the broadcast avatars obtained from the first and second texts of the individual target paragraphs might differ from one another. Therefore, after the first and second texts of all target paragraphs are obtained, they are input together into the prompt decoder to obtain a single first descriptive text for the whole original manuscript, from which the broadcast avatar is generated.
S24: Perform speech synthesis based on the reference data to obtain the video dubbing.
In one implementation, the reference data may be input into a speech synthesis model, and the video dubbing is synthesized with this model. The speech synthesis model may be an existing NaturalSpeech 2 model, which represents speech with continuous vectors instead of discrete labels, so it generates more complete speech segments and avoids the emotionless, word-by-word robotic reading effect.
In this embodiment, step S23 may be performed before step S24. It should be understood that in other embodiments step S24 may be performed before step S23, or the two steps may be performed simultaneously.
S25: Generate the target video based on the original illustration, the video dubbing and the broadcast avatar.
In an implementation scenario, the broadcast action of the broadcast avatar in the target video is driven by the video dubbing. Specifically, the broadcast avatar in the target video is the anchor who broadcasts the copy text, and the lip and head driving of the anchor can be realized with a SadTalker model. SadTalker is an open-source model that automatically synthesizes a talking-character animation from a picture and an audio file: given a picture and an audio file, it makes the face in the picture perform the corresponding actions, such as opening the mouth, blinking and moving the head, according to the audio. SadTalker generates the 3D motion coefficients (head pose, expression) of a 3DMM from the audio and implicitly modulates a novel 3D-aware face rendering to generate the talking-head motion video. Specifically, the SadTalker model includes an ExpNet model (expression capture model), which learns accurate facial expressions from audio by extracting motion coefficients and 3D-rendered facial motions, and a PoseVAE (pose generation model), which synthesizes head movements of different styles.
The original illustration may be displayed picture-in-picture or full screen, and the broadcast avatar may likewise be displayed picture-in-picture or full screen. It can be understood that the broadcast avatar may also be displayed on top of the original illustration; that is, by picture overlay, the original illustration serves as the background so that the broadcast avatar sits on the original illustration. The display positions of the original illustration and the broadcast avatar in the target video can be set as needed and are not specifically limited here.
In the above manner, the broadcast avatar is generated from the first text and the second text, and the target video is then generated from the original illustration, the video dubbing and the broadcast avatar, so that the target video vividly presents the original illustration and the copy text and further increases viewership.
Referring to fig. 3, fig. 3 is a flowchart of a third embodiment of a video generating method provided in the present application, where the method includes:
S31: Obtain the original manuscript and the original illustration of the original manuscript.
S32: Acquire reference data for dubbing, obtained by analyzing the original manuscript and the original illustration.
The reference data may include a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text. For the specific implementation of steps S31 and S32, refer to steps S11 and S12 of the first embodiment of the video generation method provided in the present application, which are not repeated here.
S33: Acquire a newly added illustration obtained by analyzing the copy text and the original illustration.
In one implementation scenario, the copy text corresponding to the original manuscript includes the copy text obtained by analyzing each target paragraph and the original illustration attached to it. The relevance between each character in the copy text of the target paragraph and the original illustration attached to the target paragraph is obtained; the characters of the paragraph's copy text are screened based on their relevance to obtain a reference copy; a second descriptive text characterizing the details of the illustration is predicted from the target entities in the reference copy; and, based on the second descriptive text, a newly added illustration is generated for the moment when the copy text of the target paragraph is broadcast in the target video.
Specifically, attention processing is performed on each character in the copy text of the target paragraph and the original illustration attached to the paragraph, yielding the relevance between each character and the original illustration; characters with lower relevance are then selected to form the reference copy. In one implementation scenario, the characters may be sorted by relevance in ascending order and a preset number of the top-ranked characters selected to obtain the reference copy. In another implementation scenario, a relevance threshold may be set and the characters whose relevance is below the threshold used as the reference copy. After the reference copy is obtained, an entity recognition algorithm (for example NER, Named Entity Recognition) can be used to extract the target entities in the reference copy; the target entities are input into the prompt decoder to obtain the second descriptive text characterizing the illustration details, the second descriptive text is input into the pre-trained text-to-image network, and the network outputs the newly added illustration to be shown when the copy text of the target paragraph is broadcast in the target video.
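A sketch of the screening step described above, assuming per-character attention scores against the illustration are already available and using spaCy's named-entity recognizer as an example entity extractor; the threshold and model name are assumptions.

import spacy

def build_second_description_inputs(chars, relevance, threshold=0.2):
    """Keep characters weakly related to the original illustration, then extract entities from them.

    chars:     list of characters of the paragraph's copy text
    relevance: attention-based relevance of each character to the original illustration (same length)
    """
    reference_copy = "".join(c for c, r in zip(chars, relevance) if r < threshold)
    nlp = spacy.load("en_core_web_sm")                # example NER model
    entities = [ent.text for ent in nlp(reference_copy).ents]
    # The target entities would then be fed to the prompt decoder to predict the
    # second descriptive text, which drives the text-to-image network.
    return entities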
In the above manner, the copy text of each target paragraph and its original illustration are processed to obtain the newly added illustration of that paragraph. In another implementation scenario, the original manuscript and its original illustration can also be processed directly to obtain a newly added illustration for the whole manuscript: attention processing is performed on each character of the original manuscript and the original illustration to obtain the relevance between each character of the manuscript's copy text and the original illustration, and the reference copy is obtained from these relevance values. The target entities in the reference copy are identified and input into the prompt decoder to obtain the second descriptive text characterizing the illustration details, the second descriptive text is input into the pre-trained text-to-image network, and the network outputs the newly added illustration to be shown when the copy text of the original manuscript is broadcast in the target video.
S34: Perform speech synthesis based on the reference data to obtain the video dubbing.
For the specific implementation of step S34, refer to step S13 of the first embodiment of the video generation method provided in the present application, which is not repeated here.
S35: Generate the target video based on the original illustration, the newly added illustration and the video dubbing.
In an implementation scenario, the original illustration and the newly added illustration may be displayed at different moments of the target video; they are different pictures and may be displayed in the target video in any manner, i.e., the way the original illustration and the newly added illustration are displayed is not specifically limited.
In this embodiment, adding an illustration to the original manuscript allows its content to be expressed more vividly. Further, the characters of the target paragraph's copy text are screened by their relevance to the original illustration attached to the paragraph to obtain the reference copy, a second descriptive text characterizing the illustration details is predicted from the target entities in the reference copy, and, based on the second descriptive text, the newly added illustration shown when the paragraph's copy text is broadcast in the target video is generated. The newly added illustration is therefore associated with the copy text of the target paragraph, correctly reflects its content, and avoids being overly similar to the original illustration.
Referring to fig. 4, fig. 4 is a schematic diagram of an embodiment of a video generation network provided in the present application.
The application also provides a video generation network comprising an image-text understanding network and an image generation network. In the above embodiments, the target video may be generated from the original illustration and the video dubbing, or from the original illustration, the video dubbing and image data predicted from the reference data, where the image data includes at least one of an anchor avatar and a newly added illustration. The reference data is obtained by analyzing the target paragraph and the original illustration attached to it with the image-text understanding network, which includes a semantic decoder and a pre-trained multi-modal encoder: the multi-modal encoder processes the text features of the target paragraph in the original manuscript and the image features of the original illustration attached to it to obtain multi-modal features fusing the text information and the image information, and the semantic decoder decodes these multi-modal features to obtain the reference data of the target paragraph. The image data is predicted from the reference data by the image generation network, which includes a prompt decoder and a pre-trained text-to-image network: the prompt decoder generates descriptive text characterizing the details of the desired image from the reference data, and the pre-trained text-to-image network generates the image data of the desired image from the descriptive text.
The image-text understanding network and the image generation network need to be trained before use; during training, the parameters of the multi-modal encoder and of the text-to-image network remain unchanged. The training steps of the image-text understanding network are as described above and are not repeated here. The training of the image generation network includes: acquiring a sample manuscript, its sample original illustrations and the corresponding sample published video; analyzing the sample published video to obtain sample reference data and sample descriptive text, where the sample descriptive text includes at least one of a detail description of the anchor avatar in the sample published video and a detail description of a sample illustration that appears in the sample published video in addition to the sample original illustration; processing the sample reference data with the prompt decoder to obtain a predicted descriptive text; and adjusting the network parameters of the prompt decoder based on the difference between the sample descriptive text and the predicted descriptive text. The parameters of the pre-trained text-to-image network are fixed during training.
Specifically, the sample manuscript is a news manuscript. News manuscripts and pictures of different categories such as finance, entertainment, politics and society can be obtained from public news websites such as Sina and NetEase to serve as sample manuscripts and sample original illustrations, and broadcast videos corresponding to these news manuscripts can be obtained from a short-video platform as sample published videos. The sample manuscript is divided into at least one sample target paragraph, and the sample published video segment corresponding to each sample target paragraph is determined from the sample published video. For example, if the sample manuscript yields a single sample target paragraph, the video corresponding to that paragraph is the whole sample published video of the manuscript; if the sample manuscript is divided into two sample target paragraphs and the total duration of the sample published video is 40s, the video corresponding to the first sample target paragraph may be the first 10s and the video corresponding to the second sample target paragraph may be the 10s-40s segment. Further, based on the sample published video segment corresponding to the sample target paragraph, the broadcast text, a first reference text representing the anchor's emotion and a second reference text representing the anchor's pronunciation timbre in that video are determined as the sample reference data of the sample target paragraph, and at least one of the detail description of the anchor avatar in the sample published video and the detail description of the sample illustration newly added relative to the sample original illustration is determined to obtain the sample descriptive text.
The sample reference data is input into the prompt decoder, which processes it to obtain the predicted descriptive text; a loss is computed from the difference between the sample descriptive text and the predicted descriptive text, and the network parameters of the prompt decoder are adjusted based on this loss. The loss may be computed using, for example, a cosine-similarity or mean-squared-error calculation, which is not specifically limited here.
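A sketch of the cosine-similarity loss option mentioned above, assuming both the sample descriptive text and the predicted descriptive text have been embedded into fixed-length vectors; the embedding step itself is outside this sketch.

import torch.nn.functional as F

def description_loss(pred_embedding, target_embedding, use_cosine=True):
    """Loss between predicted and sample description-text embeddings, both of shape (B, D)."""
    if use_cosine:
        # 1 - cosine similarity, averaged over the batch
        return (1 - F.cosine_similarity(pred_embedding, target_embedding, dim=-1)).mean()
    return F.mse_loss(pred_embedding, target_embedding)   # mean-squared-error alternative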
Referring to fig. 5 and fig. 6 in combination, fig. 5 is a flowchart illustrating a fourth embodiment of a video generating method provided in the present application, and fig. 6 is a schematic diagram illustrating an embodiment of a preview page provided in the present application, where the method includes:
S51: Obtain the original manuscript and the original illustration of the original manuscript.
S52: Acquire reference data for dubbing, obtained by analyzing the original manuscript and the original illustration.
S53: Perform speech synthesis based on the reference data to obtain the video dubbing.
S54: Generate a target video based at least on the original illustration and the video dubbing.
For the specific implementation of steps S51 to S54, refer to steps S11 to S14 of the first embodiment of the video generation method provided in the present application, which are not repeated here.
S55: Display a preview page.
In an implementation scenario, the preview page is divided into several regions. A first region of the preview page is used to preview the target video; as shown in fig. 6, the first region may be placed in the upper-right corner of the preview page, and it will be appreciated that in other implementation scenarios it may be placed elsewhere, such as the upper-left corner, which is not specifically limited here. A second region of the preview page is provided with at least a first control and a second control. The first control displays avatar information of the anchor avatar, which may be the anchor's name, such as "Xiao Yan" shown in fig. 6; in other embodiments, the avatar information may be other information characterizing the anchor avatar, such as indicating that the anchor avatar is dignified, cute, and so on. The second control displays attribute information of the virtual character to which the video dubbing belongs, which may be the name of that character, such as "A Minqi" shown in fig. 6; it will be understood that in other embodiments the attribute information may also be timbre information or other information, which the application does not specifically limit.
In an implementation scenario, the user may click the first control on the preview page, and the device displays a first avatar library associated with the first control in response to the user's instruction for the first control; in response to the user's selection of any avatar in the first avatar library, the anchor avatar in the target video is replaced with the selected avatar. In this way the user can customize the anchor avatar in the target video.
In an implementation scenario, the user may click the second control on the preview page, and the device displays a second avatar library associated with the second control in response to the user's instruction for the second control; in response to the selection of any virtual character in the second library, the dubbing attributes of the video dubbing are adjusted to those of the selected character. In this way the dubbing attributes of the whole copy text can be adjusted.
In another implementation scenario, as shown in fig. 6, the preview page may further include a third region, which displays the copy text and a dubbing mark (not shown in the figure) for each sentence of the copy text; the dubbing mark characterizes at least one dubbing attribute among timbre and emotion. It will be appreciated that the dubbing mark of a sentence may be placed after that sentence or elsewhere, which is not specifically limited here. The user may select any sentence of the copy text as the target sentence and click its dubbing mark; the device displays several dubbing options in response to the user's selection of the dubbing mark, the user selects the desired dubbing option for the target sentence, and the device adjusts the dubbing attributes of the audio frames corresponding to the sentence of the selected dubbing mark based on the chosen option. For example, if the dubbing attribute of the audio frames corresponding to the target sentence is a male timbre and the user selects a female timbre through the sentence's dubbing mark, the dubbing attribute of those audio frames is changed from the male timbre to the female timbre. In this way the dubbing attributes of part of the sentences in the copy text can be adjusted.
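As an illustration of how per-sentence dubbing marks might be tracked behind the preview page, the record below keeps a timbre and an emotion per sentence and lets a single sentence be overridden; all names are assumptions, not structures disclosed in this application.

from dataclasses import dataclass

@dataclass
class DubbingMark:
    """Per-sentence dubbing attributes shown next to each sentence on the preview page."""
    sentence: str
    timbre: str    # e.g. "male voice", "female voice"
    emotion: str   # e.g. "neutral", "happy"

def apply_dubbing_option(marks: list[DubbingMark], index: int, timbre=None, emotion=None):
    """Update the dubbing attributes of one target sentence; its audio frames are then re-synthesized."""
    if timbre is not None:
        marks[index].timbre = timbre
    if emotion is not None:
        marks[index].emotion = emotion
    return marks

# Example: switch the third sentence from a male to a female timbre.
# marks = apply_dubbing_option(marks, 2, timbre="female voice")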
It will be appreciated that in other embodiments the user may also select part of the sentences in the copy text and click the first control to display the first avatar library associated with it; in response to the user's selection of any avatar in the first avatar library, the anchor avatar corresponding to those sentences is replaced with the selected avatar. In this way the anchor avatar corresponding to part of the sentences of the copy text can be adjusted.
In the above manner, the user can preview the target video on the preview page and customize the anchor avatar and the dubbing attributes of the video dubbing during playback of the target video, so as to meet the needs of different users and increase the video's viewership.
Referring to fig. 7, fig. 7 is a schematic frame diagram of an embodiment of a video generating apparatus provided in the present application.
The video generation apparatus 70 comprises an acquisition module 71, an analysis module 72, a synthesis module 73 and a generation module 74. The acquisition module 71 is used to acquire the original manuscript and the original illustration of the original manuscript; the analysis module 72 is used to acquire reference data for dubbing, obtained by analyzing the original manuscript and the original illustration, where the reference data includes a copy text, a first text representing emotion information contained in the copy text, and a second text representing at least the pronunciation timbre to be used for the copy text; the synthesis module 73 is used to perform speech synthesis based on the reference data to obtain the video dubbing; and the generation module 74 is used to generate a target video based at least on the original illustration and the video dubbing.
In this way, the user only needs to provide the original manuscript of the target video to be generated and the original illustration of that manuscript, and the target video can be analyzed and generated automatically. Compared with manual recording, this improves video generation efficiency and reduces video generation cost.
In an implementation scenario, the analysis module 72 further extracts text features of the target paragraph in the original manuscript and image features of the original illustration attached to the target paragraph, where the text features include character features of each character in the target paragraph and the image features include sub-block features of each sub-block of the original illustration; processes the text features and the image features with the pre-trained multi-modal encoder to obtain multi-modal features fusing the text information and the image information; and decodes the multi-modal features to obtain the reference data of the target paragraph.
In an implementation scenario, the video generation apparatus 70 may further include a first training module (not shown in the figure) used to perform the training of the image-text understanding network, which includes a semantic decoder and a pre-trained multi-modal encoder, the semantic decoder being used to decode the multi-modal features and the image-text understanding network being used to analyze the target paragraph and the original illustration attached to it to obtain the reference data. Specifically, the first training module acquires a sample manuscript, its sample original illustrations and the corresponding sample published video; determines, based on the sample published video, the sample reference data of a sample target paragraph in the sample manuscript, and processes the sample target paragraph and the sample original illustration attached to it with the image-text understanding network to obtain the prediction reference data of the sample target paragraph; and adjusts the network parameters of the semantic decoder based on the difference between the sample reference data and the prediction reference data, the parameters of the pre-trained multi-modal encoder being fixed during training.
In an implementation scenario, the acquisition module 71 may be further configured to acquire a broadcasting image analyzed in response to the first text and the second text, and the generation module 74 may generate the target video based on the original configuration diagram, the video dubbing and the broadcasting image, where the broadcasting action of the broadcasting image in the target video is driven by the video dubbing.
In an implementation scenario, the acquisition module 71 may predict, based on the first text and the second text, a first description text characterizing the anchor details, and generate the broadcasting image based on the first description text.
In this way, the first text and the second text are used to generate the broadcasting image, and the original configuration diagram, the video dubbing and the broadcasting image are then used to generate the target video, so that the target video can vividly present the original configuration diagram and the text file, which further increases user readership.
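A non-limiting sketch of how the dubbing could drive the broadcasting image's on-screen motion is shown below; `lip_sync_model` and `compose_frame` are hypothetical placeholders for whatever audio-driven animation and compositing components an implementation uses.

```python
# Sketch only: the dubbing audio, split into one chunk per video frame, drives the
# broadcasting image; animation and compositing callables are hypothetical.
def render_broadcast_frames(anchor_image, audio_chunks, figures, lip_sync_model, compose_frame):
    frames = []
    for i, chunk in enumerate(audio_chunks):
        # The dubbing chunk drives mouth shape and head motion for this frame.
        animated_anchor = lip_sync_model.animate(anchor_image, chunk)
        # Show the configuration diagram whose share of the timeline covers this frame.
        figure = figures[min(i * len(figures) // len(audio_chunks), len(figures) - 1)]
        frames.append(compose_frame(background=figure, foreground=animated_anchor))
    return frames
```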
In an implementation scenario, the acquisition module 71 may also be configured to acquire a new configuration diagram analyzed in response to the text file and the original configuration diagram, and the generation module 74 may generate the target video based on the original configuration diagram, the new configuration diagram and the video dubbing.
In an implementation scenario, the acquisition module 71 may obtain the relevance between each character in the text file of the target paragraph and the original configuration diagram attached to the target paragraph; screen the characters in the text file of the target paragraph based on their relevance to obtain a reference text; predict, based on the target entity in the reference text, a second description text characterizing the details of the configuration diagram; and generate, based on the second description text, a new configuration diagram used when the text file of the target paragraph is broadcast in the target video.
In this way, adding a configuration diagram to the original manuscript allows its content to be expressed more vividly. Further, the characters in the text file of the target paragraph are screened according to their relevance to the original configuration diagram attached to the target paragraph to obtain a reference text, a second description text characterizing the details of the configuration diagram is predicted based on the target entity in the reference text, and a new configuration diagram used when the text file of the target paragraph is broadcast in the target video is generated based on the second description text, so that the new configuration diagram is associated with the text file of the target paragraph, correctly reflects the content of the text file, and avoids being overly similar to the original configuration diagram.
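A minimal sketch of this screening step is shown below, assuming cosine similarity between character embeddings and a figure embedding as the relevance measure and a fixed threshold; both choices, like the prompt wording, are assumptions made for illustration.

```python
# Sketch only: relevance measure, threshold and prompt wording are assumptions.
import numpy as np

def build_reference_text(chars, char_embeddings, figure_embedding, threshold=0.3):
    fig = figure_embedding / np.linalg.norm(figure_embedding)
    kept = []
    for ch, emb in zip(chars, char_embeddings):
        relevance = float(emb @ fig / np.linalg.norm(emb))  # cosine similarity
        if relevance >= threshold:  # screen characters by relevance to the original figure
            kept.append(ch)
    return "".join(kept)

def build_new_figure_prompt(target_entities):
    # Second description text characterizing the details of the new configuration diagram.
    return "An illustration showing " + ", ".join(target_entities)
```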
In an implementation scenario, the target video is generated based on the original configuration diagram, the video dubbing and image data predicted from the reference data. The image data includes at least one of an anchor image and a newly added configuration diagram, and is predicted from the reference data by an image generation network. The image generation network includes a prompt decoder and a pre-trained text-to-image network: the prompt decoder is configured to generate, based on the reference data, a description text characterizing the details of the desired image, and the pre-trained text-to-image network is configured to generate, based on the description text, image data related to the desired image.
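The two-stage structure described above may be sketched as follows; the `decode` and `generate` calls are stand-ins for whatever prompt decoder and pre-trained text-to-image model (for example, a diffusion model) an implementation adopts.

```python
# Sketch only: a thin wrapper around a trainable prompt decoder and a frozen
# pre-trained text-to-image generator; method names are hypothetical.
class ImageGenerationNetwork:
    def __init__(self, prompt_decoder, text_to_image):
        self.prompt_decoder = prompt_decoder  # trainable
        self.text_to_image = text_to_image    # pre-trained, parameters fixed

    def generate(self, reference_data):
        # Description text characterizing the desired image (anchor image or new figure).
        description = self.prompt_decoder.decode(reference_data)
        # Image data related to the desired image, generated from the description text.
        return self.text_to_image.generate(description)
```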
In an implementation scenario, the video generating apparatus 70 may further include a second training module (not shown in the figure) configured to perform the training step of the image generation network. Specifically, the second training module acquires a sample manuscript, a sample original configuration diagram thereof, and a corresponding sample release video; analyzes the sample release video to obtain sample reference data and a sample description text, where the sample description text includes at least one of a detail description text of the anchor image in the sample release video and a detail description text of a sample configuration diagram added in the sample release video compared with the sample original configuration diagram; processes the sample reference data based on the prompt decoder to obtain a prediction description text; and adjusts network parameters of the prompt decoder based on the difference between the sample description text and the prediction description text, where the parameters of the pre-trained text-to-image network are fixed during training.
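A hedged training-step sketch follows, mirroring the semantic-decoder example above; the loss on description-text tokens is an assumption, and the optimizer is assumed to cover only the prompt decoder so that the pre-trained text-to-image network stays fixed.

```python
# Sketch only: `optimizer` is assumed to be built over prompt_decoder.parameters(),
# leaving the pre-trained text-to-image network untouched.
def prompt_decoder_step(prompt_decoder, optimizer, sample_reference_data,
                        sample_description_tokens, loss_fn):
    optimizer.zero_grad()
    # Prediction description text (as token logits) from the sample reference data.
    logits = prompt_decoder(sample_reference_data)
    # Difference between the sample description text and the prediction description text.
    loss = loss_fn(logits, sample_description_tokens)
    loss.backward()
    optimizer.step()
    return loss.item()
```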
In an implementation scenario, the video generating apparatus 70 may further include a display module (not shown in the figure) configured to display a preview page, where a first area of the preview page is used for previewing the target video, a second area of the preview page is provided with at least a first control and a second control, the first control displays image information of the anchor image, and the second control displays attribute information of the virtual avatar to which the video dubbing belongs.
In an implementation scenario, a third area of the preview page displays the text file and a dubbing mark for each sentence in the text file, where the dubbing mark represents at least one dubbing attribute among timbre and emotion.
In an implementation scenario, the video generating apparatus 70 may further include a dubbing attribute adjustment module (not shown in the figure) configured to display a plurality of dubbing options in response to a selection instruction for a dubbing mark, and to adjust, based on the selected dubbing option, the dubbing attribute of the audio frames corresponding to the sentence in which the selected dubbing mark is located.
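For illustration only, the sentence-level dubbing marks and the adjustment of the corresponding audio frames could be modeled as below; the field names and the `resynthesize` callback are hypothetical.

```python
# Sketch only: a possible data model for dubbing marks and the per-sentence adjustment.
from dataclasses import dataclass

@dataclass
class DubbingMark:
    sentence_index: int
    start_frame: int  # first audio frame of the sentence
    end_frame: int    # last audio frame of the sentence
    timbre: str
    emotion: str

def apply_dubbing_option(marks, audio_frames, selected_mark_index, attribute, value, resynthesize):
    mark = marks[selected_mark_index]
    setattr(mark, attribute, value)  # e.g. attribute = "timbre" or "emotion"
    # Only the audio frames of the sentence containing the selected mark are regenerated.
    audio_frames[mark.start_frame:mark.end_frame + 1] = resynthesize(mark)
    return audio_frames
```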
In an implementation scenario, the display module may be further configured to display a first avatar library associated with the first control, and to replace the anchor image in the target video with the selected avatar in response to a selection instruction for any avatar in the first avatar library.
In an implementation scenario, the display module may be further configured to display a second avatar library associated with the second control, and to adjust, in response to a selection instruction for any virtual avatar in the second avatar library, the dubbing attribute of the video dubbing to the dubbing attribute of the selected virtual avatar.
In this way, the user can preview the target video on the preview page and customize the anchor image and the dubbing attributes of the video dubbing used during playback of the target video, so as to meet the needs of different users and increase the viewership of the video.
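A minimal sketch of the two preview-page selection handlers is given below; the libraries, the `rerender` and `resynthesize` callbacks, and the voice attribute fields are all hypothetical.

```python
# Sketch only: selection handlers for the first and second controls on the preview page.
def on_anchor_selected(target_video, first_avatar_library, selected_id, rerender):
    new_avatar = first_avatar_library[selected_id]
    # Replace the anchor image in the target video with the selected avatar.
    return rerender(target_video, anchor_image=new_avatar)

def on_voice_selected(video_dubbing, second_avatar_library, selected_id, resynthesize):
    voice = second_avatar_library[selected_id]
    # Adopt the selected virtual avatar's dubbing attributes for the video dubbing.
    return resynthesize(video_dubbing, timbre=voice.timbre, emotion=voice.emotion)
```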
It should be noted that the apparatus of this embodiment may perform the steps of the above method; for details of the related content, reference is made to the above method section, which is not repeated herein.
Referring to fig. 8, fig. 8 is a schematic frame diagram of an embodiment of an electronic device provided in the present application. In this embodiment, the electronic device 80 includes a memory 81 and a processor 82.
The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor 82 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 in the electronic device 80 is used to store the program instructions required by the processor 82 for execution.
The processor 82 is configured to execute program instructions to implement the video generation method of the present application.
Referring to fig. 9, fig. 9 is a schematic frame diagram of an embodiment of a computer-readable storage medium provided in the present application. The computer-readable storage medium 90 of the embodiment of the present application stores program instructions 91, and the program instructions 91, when executed, implement the video generation method provided in the present application. The program instructions 91 may form a program file stored in the above-mentioned computer-readable storage medium 90 in the form of a software product, so that a computer device (which may be a personal computer, a server, or a network device, etc.) performs all or part of the steps of the methods of the embodiments of the present application. The aforementioned computer-readable storage medium 90 includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
After the original manuscript and the original configuration diagram of the original manuscript are acquired, reference data for dubbing, obtained by analysis in response to the original manuscript and the original configuration diagram, is further acquired; the reference data comprises a text file, a first text representing emotion information contained in the text file and a second text representing at least a pronunciation tone to be adopted by the text file; speech synthesis is performed based on the reference data to obtain video dubbing; and a target video is generated based at least on the original configuration diagram and the video dubbing. In this way, the user only needs to provide the original manuscript of the target video to be generated and the original configuration diagram of the original manuscript, and the target video can be analyzed and generated automatically. Compared with manual recording, this improves video generation efficiency and reduces video generation cost.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments focuses on the differences between them; for parts that are the same as or similar to one another, reference may be made to the other embodiments, and the description is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and system may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing description is only of embodiments of the present application and is not intended to limit the scope of protection of the present application; any equivalent structure or equivalent process made on the basis of the description of the present application, whether used directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present application.
Claims (16)
1. A video generation method, comprising:
acquiring an original manuscript and an original configuration diagram of the original manuscript;
acquiring reference data which is obtained by analysis in response to the original manuscript and the original configuration diagram and is used for dubbing; the reference data comprises a text file, a first text representing emotion information contained in the text file and a second text representing at least a pronunciation tone to be adopted by the text file;
performing voice synthesis based on the reference data to obtain video dubbing;
and generating a target video at least based on the original configuration diagram and the video dubbing.
2. The method of claim 1, wherein the step of obtaining the reference data comprises:
extracting text features of a target paragraph in the original manuscript, and extracting image features of the original configuration diagram attached to the target paragraph; wherein the text features comprise character features of each character in the target paragraph and the image features comprise sub-block features of each sub-block in the original configuration diagram;
processing the text features and the image features based on a pre-trained multi-modal encoder to obtain multi-modal features fused with text information and image information;
and decoding based on the multi-mode characteristics to obtain the reference data of the target paragraph.
3. The method according to claim 2, wherein the reference data is obtained by analyzing the target paragraph and the original configuration diagram attached to the target paragraph based on an image-text understanding network, the image-text understanding network comprises a semantic decoder and the pre-trained multi-modal encoder, the semantic decoder is configured to decode the multi-modal features, and the training step of the image-text understanding network comprises:
acquiring a sample manuscript, a sample original configuration diagram thereof and a corresponding sample release video;
determining sample reference data of a sample target paragraph in the sample manuscript based on the sample release video, and processing the sample target paragraph and a sample original configuration diagram attached to the sample target paragraph based on the image-text understanding network to obtain prediction reference data of the sample target paragraph;
adjusting network parameters of the semantic decoder based on differences between the sample reference data and the prediction reference data; wherein the pre-trained multi-modal encoder is fixed in parameters during the training process.
4. A method according to any one of claims 1 to 3, wherein after said obtaining reference data for dubbing that is analyzed in response to the original manuscript and the original configuration diagram, and before said generating a target video based at least on the original configuration diagram and the video dubbing, the method further comprises:
acquiring a broadcasting image analyzed in response to the first text and the second text;
the generating a target video based at least on the original configuration diagram and the video dubbing comprises:
generating the target video based on the original configuration diagram, the video dubbing and the broadcasting image; wherein the broadcasting action of the broadcasting image in the target video is driven by the video dubbing.
5. The method of claim 4, wherein the acquiring of the broadcasting image comprises:
predicting to obtain a first description text representing anchor details based on the first text and the second text;
and generating the broadcasting image based on the first description text.
6. A method according to any one of claims 1 to 3, wherein after said obtaining reference data for dubbing that is analyzed in response to the original manuscript and the original configuration diagram, and before said generating a target video based at least on the original configuration diagram and the video dubbing, the method further comprises:
acquiring a new configuration diagram which is obtained by analysis in response to the text file and the original configuration diagram;
the generating a target video based at least on the original configuration diagram and the video dubbing comprises:
generating the target video based on the original configuration diagram, the new configuration diagram and the video dubbing.
7. The method of claim 6, wherein the text file corresponding to the original manuscript comprises: a text file obtained by analyzing the target paragraph in the original manuscript and the original configuration diagram attached to the target paragraph; and the step of obtaining the new configuration diagram comprises:
obtaining the relevance between each character in the text file of the target paragraph and the original configuration diagram attached to the target paragraph;
screening the characters in the text file of the target paragraph based on the relevance of each character to obtain a reference text;
predicting, based on the target entity in the reference text, a second description text characterizing details of the configuration diagram;
and generating, based on the second description text, a new configuration diagram used when the text file of the target paragraph is broadcast in the target video.
8. A method according to any one of claims 1 to 3, wherein the target video is generated based on the original configuration diagram, the video dubbing and image data predicted from the reference data; the image data comprises at least one of an anchor image and a newly added configuration diagram, and the image data is predicted from the reference data based on an image generation network, the image generation network comprising a prompt decoder and a pre-trained text-to-image network, the prompt decoder being configured to generate, based on the reference data, a description text characterizing details of a desired image, and the pre-trained text-to-image network being configured to generate, based on the description text, image data related to the desired image.
9. The method of claim 8, wherein the training step of the image generation network comprises:
acquiring a sample manuscript, a sample original configuration diagram thereof and a corresponding sample release video;
based on the sample release video, analyzing to obtain sample reference data and a sample description text; wherein the sample description text includes at least one of: a detail description text of the anchor image in the sample release video, and a detail description text of a sample configuration diagram added in the sample release video compared with the sample original configuration diagram;
processing the sample reference data based on the prompt decoder to obtain a prediction description text;
adjusting network parameters of the prompt decoder based on a difference between the sample description text and the prediction description text; wherein the parameters of the pre-trained text-to-image network are fixed during the training process.
10. A method according to any one of claims 1 to 3, wherein the target video contains an anchor image, and wherein after said generating the target video based at least on the original configuration diagram and the video dubbing, the method further comprises:
displaying a preview page;
wherein a first area of the preview page is used for previewing the target video, a second area of the preview page is provided with at least a first control and a second control, the first control displays image information of the anchor image, and the second control displays attribute information of the virtual avatar to which the video dubbing belongs.
11. The method of claim 10, wherein a third area of the preview page displays the text file and a dubbing mark for each sentence in the text file, and the dubbing mark represents at least one dubbing attribute among timbre and emotion; the method further comprises:
in response to a selection instruction for the dubbing mark, displaying a plurality of dubbing options, and adjusting, based on the selected dubbing option, the dubbing attribute of the audio frames corresponding to the sentence in which the selected dubbing mark is located.
12. The method according to claim 10, wherein the method further comprises:
displaying a first avatar library associated with the first control;
and replacing the anchor image in the target video with the selected avatar in response to a selection instruction for any avatar in the first avatar library.
13. The method according to claim 10, wherein the method further comprises:
displaying a second avatar library associated with the second control;
and in response to a selection instruction for any virtual avatar in the second avatar library, adjusting the dubbing attribute of the video dubbing to the dubbing attribute of the selected virtual avatar.
14. A video generating apparatus, comprising:
the acquisition module is used for acquiring an original manuscript and an original configuration diagram of the original manuscript;
the analysis module is used for acquiring reference data which is obtained by analysis in response to the original manuscript and the original configuration diagram and is used for dubbing; the reference data comprises a text file, a first text representing emotion information contained in the text file and a second text representing at least a pronunciation tone to be adopted by the text file;
the synthesis module is used for carrying out voice synthesis based on the reference data to obtain the video dubbing;
and the generating module is used for generating a target video at least based on the original configuration diagram and the video dubbing.
15. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions for executing the program instructions to implement the video generation method of any of claims 1 to 13.
16. A computer readable storage medium, characterized in that program instructions executable by a processor for implementing the video generation method of any one of claims 1 to 13 are stored.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311030837.6A CN117336567A (en) | 2023-08-14 | 2023-08-14 | Video generation method, device, equipment and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311030837.6A CN117336567A (en) | 2023-08-14 | 2023-08-14 | Video generation method, device, equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117336567A true CN117336567A (en) | 2024-01-02 |
Family
ID=89292187
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311030837.6A Pending CN117336567A (en) | 2023-08-14 | 2023-08-14 | Video generation method, device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117336567A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118803301A (en) * | 2024-06-18 | 2024-10-18 | 暗物智能科技(广州)有限公司 | A multi-modal driven video generation method, device, computer equipment and readable storage medium |
| CN118866012A (en) * | 2024-09-20 | 2024-10-29 | 青岛海之晨工业装备有限公司 | A sound quality assessment method based on multi-dimensional fusion analysis |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Fanelli et al. | A 3-d audio-visual corpus of affective communication | |
| US20100085363A1 (en) | Photo Realistic Talking Head Creation, Content Creation, and Distribution System and Method | |
| CN113411517B (en) | Video template generation method and device, electronic equipment and storage medium | |
| US12277766B2 (en) | Information generation method and apparatus | |
| CN113395569B (en) | Video generation method and device | |
| Stoll et al. | Signsynth: Data-driven sign language video generation | |
| CN117336567A (en) | Video generation method, device, equipment and storage medium | |
| Bigioi et al. | Multilingual video dubbing—a technology review and current challenges | |
| CN118695044A (en) | Method, device, computer equipment, readable storage medium and program product for generating promotional video | |
| CN119299802A (en) | Video automatic generation system based on Internet | |
| CN118761812A (en) | A method and system for labeling advertising material data based on AI | |
| CN117478975A (en) | Video generation method, device, computer equipment and storage medium | |
| CN119364107A (en) | A method, device, equipment and medium for generating video based on copywriting | |
| Kadam et al. | A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation. | |
| CN110781346A (en) | News production method, system, device and storage medium based on virtual image | |
| CN118071901A (en) | Voice-driven expression generation method, device, equipment and storage medium | |
| Rafiei Oskooei et al. | Can One Model Fit All? An Exploration of Wav2Lip’s Lip-Syncing Generalizability Across Culturally Distinct Languages | |
| CN117609548A (en) | Video multi-modal target element extraction and video summary synthesis method and system based on pre-trained model | |
| CN114283782B (en) | Speech synthesis method and device, electronic device and storage medium | |
| CN107122393A (en) | Electron album generation method and device | |
| AU2009223616A1 (en) | Photo realistic talking head creation, content creation, and distribution system and method | |
| CN111160051B (en) | Data processing methods, devices, electronic equipment and storage media | |
| EP4345814A1 (en) | Video-generation system | |
| WO2025001722A1 (en) | Server, display device and digital human processing method | |
| Zhang et al. | Virbo: Multimodal multilingual avatar video generation in digital marketing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |