Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Watching film and television episodes has become part of people's daily recreation, but the content of an episode is often continuous and the number of episodes is large, so a user cannot learn all the content of the episode in a short time. Therefore, in the related art, a promotional video of the episode is produced by manual editing, so that a user can learn the content of the episode in a short time while being attracted to watch the complete episode.
However, this method requires manual editing of the episodes, which is inefficient and costly.
In view of the above, the method for generating a promotional video of an episode provided by the present disclosure automatically clips the videos of the target episode to generate the promotional video of the target episode, which improves production efficiency and saves labor cost.
The present disclosure provides a method, an apparatus, a device, and a medium for generating a promotional video of an episode, which are applied to the technical fields of computer vision and deep learning within the field of artificial intelligence, so as to improve the production efficiency of the promotional video of an episode and save labor cost.
It should be noted that, in the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the user's personal information all comply with relevant laws and regulations and do not violate public order and good morals.
FIG. 1 is a scene diagram in which embodiments of the present disclosure may be implemented. As shown in FIG. 1, the scene at least includes an electronic device 100, which may be a notebook computer, a smart phone, a desktop computer, a server, a tablet computer, or the like having processing capabilities. An application 101, a display screen 102, a memory 103, and a processor 104 are provided in the electronic device 100. The user may start and run the application 101 through the electronic device 100; after the application 101 is started, it receives at least one video of the target episode selected by the user, and the processor 104 processes the at least one video of the target episode based on the application 101 to obtain a promotional video of the target episode, which is displayed to the user through the display screen 102.
Optionally, the electronic device 100 provides the function of generating a promotional video of the target episode through the cooperation of the above components. Communication between these components is typically achieved by means of a bus system or a specific interface protocol within the electronic device. The application 101 is the interface through which a user interacts with the electronic device 100; it displays the content of promotional videos, receives user input (e.g., search, click, etc.), and invokes the functionality of other components. The application 101 is typically installed on the operating system of the electronic device 100 and interacts with the underlying hardware and software through system APIs (application programming interfaces). The display screen 102 is used to display the content of the promotional video and other visual information. The display screen 102 and the processor 104 are typically connected through video output interfaces (e.g., HDMI, DisplayPort, LVDS, etc.) capable of transmitting high-quality video signals, to ensure clear display of the promotional video content. The memory 103 is used to store video files, application data, and the various files of the operating system. The memory 103 may be internal memory (e.g., RAM, ROM, flash memory, etc.) or external memory (e.g., an SD card, a hard disk, etc.). The memory 103 and the processor 104 communicate through a memory bus or a specific storage interface (such as SATA, PCIe, etc.) to implement fast reading and writing of data. The processor 104 is the core component of the electronic device 100 and is responsible for executing program instructions, processing data, and controlling the other components. The processor 104 communicates with the various components through an internal bus system, including data exchange with the memory 103, video signal transmission to the display screen 102, instruction execution for the application 101, and the like.
Optionally, referring to FIG. 2, FIG. 2 is a schematic diagram of an application interface of an application program provided in the present disclosure. The display screen 102 of the electronic device 100 may display an application interface of the application 101, where the application 101 may provide an input control for the user to determine the videos of the target episode. The disclosure does not limit how the videos of the target episode are determined; for example, the user may upload the videos of the target episode through the input control, or the stored videos corresponding to an episode range may be extracted from the memory 103 based on the episode range of the target episode input by the user through the input control. Further, at least one video of the target episode is obtained through the application interface of the application.
It will be appreciated that the scene diagram shown in FIG. 1 is merely an exemplary illustration. In practical applications, the scene may further include other devices, for example, a network device, an audio device, etc., which may be adjusted according to practical requirements, and the disclosure is not limited thereto. The embodiments of the disclosure likewise do not limit the actual forms of the devices included in the application scene or the interaction modes between the devices, which may be set according to actual requirements in specific applications of the solution.
The technical solutions of the present disclosure, and how they solve the above technical problems, are described in detail below with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present disclosure are described below with reference to the accompanying drawings.
FIG. 3 is a flowchart of a method for generating a promotional video of an episode according to a first embodiment of the present disclosure. The execution body of this embodiment may be the foregoing electronic device, and the method may include the following steps:
S301, segmenting the videos in the target episode to obtain shot segments of each video.
Illustratively, an episode characterizes a dramatic program played on a television channel or an online video platform, and the disclosure does not limit the content or form of the episode; for example, the episode may be a television series or a short series. The target episode is the episode for which a promotional video is to be generated, and the target episode includes at least one video; for example, the target episode may be short episode A, which has 20 episode videos in total.
It should be noted that the number of videos included in the target episode is not limited in this disclosure; the target episode may include all videos of the episode, for example, all 20 videos of short episode A, or only a portion of them, for example, videos 1-5 or videos 10-18 of short episode A, depending on the user setting.
The shot segments represent segments of different shots in the video, with shot boundaries between each shot segment. It should be noted that the duration of each shot segment is not limited in this disclosure, and is specifically related to the content of the video.
After the electronic device receives the videos of the target episode, it may perform segmentation processing on each video to obtain the shot segments of the video. For example, the video may be input into a preset shot boundary detection model to perform slicing processing on the video, so as to obtain a plurality of shot segments of the video. The type of shot boundary detection model is not limited in this disclosure, as long as the shot boundary detection function can be realized; for example, a TransNet shot transition detection model may be used.
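As a non-limiting illustration of this slicing step, the following minimal sketch uses a simple color-histogram difference detector in place of a learned model such as TransNet; OpenCV availability and the 0.5 correlation threshold are assumptions, not values fixed by the disclosure.

```python
# A minimal sketch of the S301 slicing step: a color-histogram difference
# detector standing in for a learned shot boundary model such as TransNet.
import cv2

def detect_shot_boundaries(video_path: str, threshold: float = 0.5) -> list[float]:
    """Return timestamps (seconds) where a new shot likely starts."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive frames suggests a hard cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(frame_idx / fps)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return boundaries
```

Consecutive boundaries then delimit the shot segments of the video.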
S302, determining an initial shot segment, an ending shot segment and an intermediate shot segment from shot segments of each video.
Illustratively, the initial shot segment characterizes the first shot segment of the promotional video of the target episode, the ending shot segment characterizes the last shot segment of the promotional video, and the middle shot segment characterizes a shot segment between the initial shot segment and the ending shot segment, i.e., the timing of the middle shot segment lies between the two. The initial shot segment is a shot segment with a highlight scenario in the video, and the ending shot segment is a shot segment with a suspense scenario in the video. It should be noted that a highlight scenario represents an exciting plot with points of interest, which is more attractive to the audience than other plots. A suspense scenario characterizes a plot with doubt and uncertainty, which makes the audience curious about and expectant of subsequent plot developments and hooks their viewing interest.
In some possible implementations, the electronic device may sequentially determine, starting from the shot segments of the first video of all videos of the target episode, whether each shot segment has a highlight scenario, and if so, determine the first shot segment having a highlight scenario as the initial shot segment. If no shot segment in any video of the target episode has a highlight scenario, a prompt message is output to the user indicating that no highlight scenario exists in the videos of the current target episode.
Similarly, after determining the initial shot segment, the electronic device may sequentially determine, starting from the first shot segment having a highlight scenario in all videos, whether each subsequent shot segment has a suspense scenario. If so, one shot segment having a suspense scenario may be randomly determined as the ending shot segment; or the ending shot segment may be determined according to the duration from the initial shot segment to a shot segment having a suspense scenario; or the shot segment having a suspense scenario adjacent to the initial shot segment may be determined as the ending shot segment. This is not limited herein and may be set according to actual requirements. After the initial shot segment and the ending shot segment are determined, the shot segments whose timing lies between them are determined as the middle shot segments.
In one possible implementation, regarding how to determine whether a shot segment has a highlight scenario, the shot segment may be input into a preset video classification model, which outputs whether the shot segment is a shot segment with a highlight scenario. The type of video classification model is not limited by the disclosure; it may be, for example, a video masked autoencoder (VideoMAE) model. In another possible implementation, the electronic device may obtain text recognition information of the shot segment, fill the line information of the shot segment into a preset prompt template to obtain a prompt, input the prompt into a preset large language model, and output whether the shot segment has a highlight scenario. The text recognition information characterizes the line (dialogue) information of the shot segment, and can be obtained by recognizing the shot segment with Optical Character Recognition (OCR) technology.
How to determine whether a shot segment has a suspense scenario is similar to the determination of the highlight scenario, and is not described herein again.
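A minimal sketch of the prompt-template path follows, assuming a generic LLM client; `llm_client.complete` and the template wording are hypothetical placeholders rather than a specific vendor API, and the suspense check would follow the same pattern with a different template.

```python
# A minimal sketch of classifying a shot via OCR lines + prompt template + LLM.
HIGHLIGHT_TEMPLATE = (
    "The following are the subtitle lines of one shot from a drama:\n{lines}\n"
    "Does this shot contain a highlight (exciting, attention-grabbing) plot? "
    "Answer yes or no."
)

def has_highlight(ocr_lines: list[str], llm_client) -> bool:
    prompt = HIGHLIGHT_TEMPLATE.format(lines="\n".join(ocr_lines))
    answer = llm_client.complete(prompt)  # hypothetical LLM call
    return answer.strip().lower().startswith("yes")
```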
In some possible implementations, the electronic device may preprocess the shot segments of each video to remove invalid shot segments and obtain the valid shot segments of each video, where the invalid shot segments represent segments irrelevant to the scenario of the target episode, for example, shot segments such as the head, the tail, and advertisements. The means of the foregoing implementations can then be applied to the valid shot segments of each video to determine the initial shot segment, the ending shot segment, and the middle shot segments. For example, the category of each shot segment may be identified based on a preset shot classification model so as to determine the invalid shot segments; the type of shot classification model is not limited in the disclosure and may be, for example, a convolutional neural network model. In this way, invalid shot segments in the videos can be deleted, so that the generated promotional video keeps only the shot segments related to the scenario of the target episode, improving the content effectiveness of the promotional video.
S303, generating a promotional video of the target episode according to the initial shot segment, the ending shot segment and the middle shot segments.
In one possible implementation, after obtaining the initial shot segment, the ending shot segment, and the middle shot segments, the electronic device may splice the shot segments according to their time sequence to obtain the promotional video of the target episode.
In another possible implementation, the electronic device may splice the shot segments according to the time sequence to obtain an initial promotional video, and then beautify the initial promotional video to obtain the promotional video of the target episode; the splicing step is sketched below. The beautifying operation may include, for example, modifying the filter of the initial promotional video, adding background music, adding the name of the target episode and collection information, generating text information for the promotional video, and the like, which is not limited in this disclosure.
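As a non-limiting illustration of the splicing, the following minimal sketch assumes the moviepy 1.x library; the file paths and timestamps are placeholders.

```python
# A minimal sketch of S303 splicing: shot segments are given as
# (source_path, start_s, end_s) triples in playback order.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice_promo(segments: list[tuple[str, float, float]], out_path: str) -> None:
    clips = [VideoFileClip(path).subclip(start, end)
             for path, start, end in segments]
    promo = concatenate_videoclips(clips)  # initial, middle, ending in order
    promo.write_videofile(out_path)

# e.g. splice_promo([("ep1.mp4", 12.0, 35.5), ("ep1.mp4", 80.0, 95.0),
#                    ("ep2.mp4", 10.0, 22.0)], "promo.mp4")
```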
The promotional video obtained through this step not only has an initial shot segment with a highlight scenario but also an ending shot segment with a suspense scenario, and the middle shot segments between them give the video logical continuity. The audience can thus learn the content of the target episode from the coherent plot while being attracted by the highlight and suspense scenarios to watch the complete target episode.
According to the method for generating a promotional video of an episode provided in this embodiment, the electronic device obtains the shot segments of each video after segmenting each video of the target episode; then, based on the shot segments of all videos, it determines the initial shot segment with a highlight scenario, the ending shot segment with a suspense scenario, and the middle shot segments between them; and finally, it automatically generates the promotional video of the target episode from the initial shot segment, the ending shot segment, and the middle shot segments. In this way, the electronic device can automatically generate the promotional video based on the videos of the target episode without manual editing, improving the production efficiency of the promotional video and saving labor cost.
FIG. 4 is a flowchart of a method for generating a promotional video of an episode according to a second embodiment of the present disclosure. The execution body of this embodiment may be the foregoing electronic device. On the basis of the foregoing embodiment, this embodiment further describes how to determine the initial shot segment, the ending shot segment, and the middle shot segments from the shot segments of each video, and may include the following steps:
S401, segmenting the videos in the target episode to obtain shot segments of each video.
It should be noted that this step is similar to the aforementioned step S301, and will not be described here again.
S402, identifying the shot category of each shot segment.
Illustratively, the shot category characterizes the category of content included in the shot segment, and the disclosure does not limit the content of the shot categories. For example, the shot category is any one of: a shot category characterizing that the shot segment includes an episode cover, a shot category characterizing that the shot segment includes a tail, a shot category characterizing that the shot segment includes an overlapping scenario (content repeated from the previous video), and a shot category characterizing that the shot segment includes an advertisement.
Through the setting of shot categories, each shot segment can be classified by its shot category, and invalid shot segments can be removed accurately.
In some possible implementations, the electronic device may preset a shot classification model, which may be pre-trained based on shot segments with labels, where the labels include the above shot categories. The electronic device can then input each shot segment into the shot classification model and output the shot category of the shot segment.
In some possible implementations, the electronic device may set a different recognition mode for each shot category, so that the shot category of each shot segment can be accurately determined, improving the recognition accuracy of shot categories and preventing shot segments from being deleted by mistake due to wrongly recognized categories.
Illustratively, the following several approaches may be included:
Method 1: extract the first frame picture of the first shot segment from the shot segments of the video, and if the first frame picture is determined to be an episode cover, determine that the shot category of the first shot segment is the shot category including the episode cover.
The electronic device may input the first frame picture together with a prompt into a preset vision-language model (VLM), which outputs whether the first frame picture is the cover of the episode; the prompt may be, for example, "please determine whether the picture is a cover picture of the episode", and the disclosure does not limit the content of the prompt. Alternatively, the electronic device may store the cover picture of the target episode in advance, determine the feature vector of the cover picture of the target episode and the feature vector of the first frame picture respectively, and then determine the similarity of the two feature vectors; if the similarity is greater than or equal to a preset threshold, the first frame picture is determined to be an episode cover, and if the similarity is less than the preset threshold, it is not. The feature vectors may be determined, for example, by inputting the pictures into a preset computation model, which may be, for example, a Contrastive Language-Image Pre-Training (CLIP) model, or another model; the disclosure is not limited herein.
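A minimal sketch of the feature-similarity path follows, assuming the Hugging Face transformers CLIP implementation; the checkpoint name and the 0.9 threshold are illustrative choices, not values fixed by the disclosure.

```python
# A minimal sketch of cover matching via CLIP image features.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def is_episode_cover(first_frame: Image.Image, stored_cover: Image.Image,
                     threshold: float = 0.9) -> bool:
    inputs = processor(images=[first_frame, stored_cover], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)  # one feature vector per image
    sim = torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()
    return sim >= threshold  # at/above threshold: first frame is the cover
```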
By this method, the electronic device can accurately identify whether a shot segment is of the shot category including the episode cover, and only needs to perform identification on the first shot segment of each video, which reduces the number of shot segments to be identified and improves identification efficiency.
Method 2: obtain speech recognition information of the shot segments of the video; according to the speech recognition information, sequentially determine, starting from the last shot segment of the video, whether speech recognition information exists in each shot segment until a shot segment with speech recognition information is found, and determine the shot category of the shot segments of the video whose timing is after that shot segment as the shot category including the tail.
The speech recognition information characterizes the text information of the speech included in a shot segment. It should be appreciated that speech may be present in a shot segment, which can be recognized and converted to text by Automatic Speech Recognition (ASR) technology. Note that the speech information in a shot segment does not include music information. Starting from the last shot segment of the video, the electronic device may determine, for each shot segment, whether speech recognition information exists in it; if not, it checks the preceding shot segment, until a shot segment with speech recognition information is determined, and then determines the shot categories of the shot segments of the video after that shot segment as the shot category including the tail. For example, if the video includes shot segments 1-10, where shot segments 8-10 have no speech recognition information and shot segment 7 does, the shot category of shot segments 8-10 is determined to be the shot category including the tail.
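A minimal sketch of this backward scan follows, assuming a hypothetical `asr(segment)` helper that returns the recognized speech text of a shot segment (an empty string when only music or silence is present).

```python
# A minimal sketch of Method 2: scan from the last shot backwards and tag
# everything after the last shot that contains speech as the tail.
def tag_tail_segments(shot_segments: list, asr) -> set[int]:
    """Return indices of trailing segments classified as the tail."""
    tail = set()
    for i in range(len(shot_segments) - 1, -1, -1):  # start at the last shot
        if asr(shot_segments[i]).strip():
            break  # first shot with speech: everything after it is the tail
        tail.add(i)
    return tail

# e.g. with speech only up to shot index 6 of 10 shots, indices 7-9 are tagged.
```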
By this method, the electronic device can accurately identify whether shot segments are of the shot category including the tail without performing identification on all shot segments of each video, which reduces the number of shot segments to be identified and improves identification efficiency.
Method 3: for each video of the target episode, obtain second text recognition information of the shot segments of the video and third text recognition information of the shot segments of the previous video; for each shot segment of the video, if the similarity between the second text recognition information of the shot segment and the third text recognition information of a shot segment of the previous video is determined to be greater than or equal to a preset threshold, determine that the shot category of the shot segment is the shot category including the overlapping scenario.
The text recognition information characterizes the line information in a shot segment; the manner of obtaining it can refer to the foregoing embodiment and is not repeated here. For example, taking the third video as an example: obtain the second text recognition information of the shot segments of the third video and the third text recognition information of the shot segments of the second video; for each shot segment of the third video, determine the similarity between its second text recognition information and the third text recognition information of each shot segment of the second video, and if the similarity is determined to be greater than or equal to a preset threshold, determine that the shot category of that shot segment of the third video is the shot category including the overlapping scenario.
The electronic device may preset m shot segments at the tail of the second video and n shot segments at the head of the third video, and judge the similarity between the second text recognition information of the n head shot segments of the third video and the third text recognition information of the m tail shot segments of the second video, so as to determine whether the shot category of a shot segment of the third video is the shot category including the overlapping scenario. This reduces the number of shot segments to be identified and improves identification efficiency; m and n are integers greater than or equal to 1.
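A minimal sketch of this comparison follows, assuming a hypothetical `ocr_lines(segment)` helper that returns a shot segment's subtitle text; difflib's ratio stands in for whichever text-similarity measure is actually used.

```python
# A minimal sketch of Method 3: mark head shots of this episode that repeat
# the previous episode's tail (recap / overlapping scenario).
from difflib import SequenceMatcher

def tag_overlapping_segments(head_shots: list, tail_shots: list, ocr_lines,
                             threshold: float = 0.8) -> set[int]:
    overlapping = set()
    tail_texts = [ocr_lines(s) for s in tail_shots]  # previous episode's tail
    for i, shot in enumerate(head_shots):            # this episode's head
        text = ocr_lines(shot)
        if any(SequenceMatcher(None, text, t).ratio() >= threshold
               for t in tail_texts):
            overlapping.add(i)
    return overlapping
```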
By this method, the electronic device can accurately identify whether a shot segment is of the shot category including the overlapping scenario.
Method 4: extract any frame picture from a shot segment of the video, and if that frame is determined to be an advertisement picture, determine the shot category of the shot segment as the shot category including the advertisement.
For example, the electronic device may input the frame picture together with a prompt into a preset vision-language model (VLM), which outputs whether the frame is an advertisement picture; the prompt may be, for example, "please determine whether the picture is an advertisement picture", and the disclosure does not limit the content of the prompt.
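A minimal sketch of this check follows, assuming a generic vision-language model client; `vlm_client.ask` is a hypothetical interface, not a specific library API.

```python
# A minimal sketch of Method 4: ad detection via a VLM prompt on one frame.
AD_PROMPT = ("Please determine whether this picture is an advertisement "
             "picture. Answer yes or no.")

def is_ad_shot(frame, vlm_client) -> bool:
    answer = vlm_client.ask(image=frame, prompt=AD_PROMPT)  # hypothetical call
    return answer.strip().lower().startswith("yes")
```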
In this way, the electronic device can accurately recognize whether a shot segment is of the shot category including the advertisement.
S403, determining an initial shot segment, an ending shot segment and a middle shot segment from the shot segments of each video according to the shot categories of the shot segments.
Illustratively, the electronic device deletes, from each video, the shot segments whose shot categories are any of the above shot categories according to the shot categories of the shot segments, to obtain the valid shot segments of each video, where the valid shot segments characterize shot segments that are related to the scenario and contain no repeated scenario; the initial shot segment, the ending shot segment, and the middle shot segments are then determined from the valid shot segments.
S404, splicing the initial shot segment, the ending shot segment and the middle shot segments to obtain the promotional video of the target episode.
For example, the electronic device may obtain the initial shot segment, the ending shot segment, and the middle shot segments, and then splice the three according to their time sequence to obtain the promotional video of the target episode.
By this method, the electronic device can automatically generate a promotional video with logically coherent scenario content based on the time sequence, improving the generation efficiency of the promotional video.
According to the method for generating a promotional video of an episode provided in this embodiment, after segmenting the videos of the target episode to obtain the shot segments of each video, the electronic device identifies the shot category of each shot segment, deletes the invalid shot segments in the videos according to the shot categories to obtain the valid shot segments, determines the initial shot segment, the ending shot segment, and the middle shot segments from the valid shot segments of all videos, and finally automatically generates the promotional video of the target episode from them. In this way, the electronic device can delete the invalid shot segments in the videos based on the shot categories, ensuring that the generated promotional video keeps only shot segments that are related to the scenario and contain no repeated scenario, which improves the production efficiency of the promotional video, saves labor cost, and improves the effectiveness of the scenario content in the promotional video.
FIG. 5 is a flowchart of a method for generating a promotional video of an episode according to a third embodiment of the present disclosure. The execution body of this embodiment may be the foregoing electronic device. On the basis of the foregoing embodiments, this embodiment further describes how to determine the initial shot segment, the ending shot segment, and the middle shot segments from the shot segments of each video according to the shot categories of the shot segments, and may include the following steps:
S501, segmenting the videos in the target episode to obtain shot segments of each video.
It should be noted that this step is similar to the aforementioned step S301, and will not be described here again.
S502, identifying the shot category of each shot segment.
It should be noted that this step is similar to the aforementioned step S401, and will not be described here again.
S503, identifying the human voice in the audio corresponding to the video to obtain the voice segments corresponding to the video.
Illustratively, each video has corresponding audio, which characterizes all sound information in the video, including speech information and background music. A voice segment characterizes a segment of speech information in the video and has a start and end time; e.g., voice segment 1 spans 0-5 s and voice segment 2 spans 5-7 s.
In one possible implementation, the electronic device may input the audio corresponding to the video into a preset speech recognition model, which outputs each voice segment in the audio. The disclosure does not limit the type of the model; it may be, for example, a Voice Activity Detection (VAD) model.
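A minimal sketch of the voice-segment extraction follows, assuming the py-webrtcvad package as one possible VAD implementation; the input is assumed to be 16-bit mono PCM at 16 kHz, with the 30 ms frame size following webrtcvad's constraints.

```python
# A minimal sketch of S503: extract (start_s, end_s) spans of human speech.
import webrtcvad

def voice_segments(pcm: bytes, sample_rate: int = 16000) -> list[tuple[float, float]]:
    vad = webrtcvad.Vad(2)                            # aggressiveness 0..3
    frame_ms = 30
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    spans, start = [], None
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        t = i / (sample_rate * 2)                     # byte offset -> seconds
        if vad.is_speech(pcm[i:i + frame_bytes], sample_rate):
            start = t if start is None else start     # open a speech span
        elif start is not None:
            spans.append((start, t))                  # close the span
            start = None
    if start is not None:
        spans.append((start, len(pcm) / (sample_rate * 2)))
    return spans
```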
S504, determining the shot segment to be processed of the video from all the shot segments of the video according to the shot category of the shot segment in the video and the voice segment corresponding to the video.
Illustratively, the shot segments to be processed characterize shot segments that are aligned in time with the voice segments of the video. It should be understood that when the electronic device performs shot segmentation on the video, the continuity of sound is not considered, so the same sentence of speech may be split across shot segments; if such a shot segment is deleted, the speech becomes discontinuous. Therefore, the shot segments corresponding to the above shot categories need to be adjusted according to the voice segments to ensure the continuity of the speech.
For example, for the shot segments of each video, the shot segment corresponding to the shot category may be determined, and then the voice segment corresponding to that shot segment; starting from that voice segment, the first shot segment that is aligned in time with a later voice segment is found, and that shot segment and the shot segments after it are determined as the shot segments to be processed.
Specifically, the electronic device may determine, according to the shot categories of the shot segments in the video, a first voice segment from the voice segments corresponding to the video, where the first voice segment is the voice segment that covers, in time, the shot segment corresponding to the shot category; then determine the shot segments to be deleted in the video according to the first voice segment, where the shot segments to be deleted are shot segments that are not aligned in time with the other voice segments located after the first voice segment; and finally delete, from all the shot segments of the video, the shot segments corresponding to the shot category together with the shot segments to be deleted, to obtain the shot segments to be processed of the video.
For example, referring to FIG. 6, FIG. 6 is a schematic diagram of the correspondence between shot segments and voice segments provided in the present disclosure. Take as an example a video that includes shot segment 1 (0-5 s), shot segment 2 (5-15 s), shot segment 3 (15-30 s), shot segment 4 (30-60 s), and shot segment 5 (60-70 s), with corresponding voice segments: voice segment 1 (1-6 s), voice segment 2 (7-17 s), and voice segment 3 (30-69 s). The shot category of shot segment 1 is the shot category including the episode cover.
Referring to FIG. 6, the electronic device may determine, according to the shot category of a shot segment, the first voice segment that covers the shot segment in time. Taking shot segment 1 as an example, the first voice segment is voice segment 1. Since shot segment 1 needs to be deleted in subsequent processing, it can be seen from voice segment 1 that deleting only shot segment 1 would leave the speech information incomplete; therefore, shot segment 2, which also corresponds to voice segment 1, needs to be deleted. However, as can be seen from FIG. 6, shot segment 2 also lies within the time range of voice segment 2, so deleting shot segment 2 alone would still leave the speech information incomplete. Hence, after the first voice segment (voice segment 1), the shot segments that are not aligned in time with the other voice segments after the first voice segment (shot segment 2 and shot segment 3) are determined as shot segments to be deleted, while shot segment 4 is aligned in time with voice segment 3 and therefore does not need to be deleted.
It should be noted that the above only takes the shot category including the episode cover as an example to illustrate how to determine the shot segments to be deleted; shot segments of the other shot categories present in the video are processed similarly, and details are not repeated here.
Through this method, the electronic device can determine the shot segments to be deleted for each video, and then delete the shot segment corresponding to the shot category (shot segment 1) together with the shot segments to be deleted (shot segment 2 and shot segment 3) to obtain the shot segments to be processed. The shot segments to be processed are aligned in time with the voice segments, which avoids incomplete speech information in the shot segments to be processed and improves the quality of the promotional video; a sketch of this chained deletion follows below.
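The following minimal sketch implements the chained deletion illustrated in FIG. 6; the (start, end) span layout, the overlap test, and the function names are illustrative assumptions rather than the disclosure's implementation.

```python
# A minimal sketch of the FIG. 6 logic: drop invalid-category shots plus any
# later shot whose speech would otherwise be cut mid-sentence.
def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def shots_to_keep(shots, voices, invalid: set[int]) -> list[int]:
    """Return indices of shots kept as the shots to be processed."""
    deleted = set(invalid)
    for i in sorted(invalid):
        j = i + 1
        while j < len(shots):
            # Shot j must also go if it shares a voice segment with any
            # already-deleted shot, i.e. its speech continues from a cut shot.
            shared = any(overlaps(v, shots[j]) and
                         any(overlaps(v, shots[d]) for d in deleted)
                         for v in voices)
            if not shared:
                break
            deleted.add(j)
            j += 1
    return [k for k in range(len(shots)) if k not in deleted]

# FIG. 6 example: the cover shot plus chained shots 2-3 are dropped, 4-5 kept.
shots = [(0, 5), (5, 15), (15, 30), (30, 60), (60, 70)]
voices = [(1, 6), (7, 17), (30, 69)]
assert shots_to_keep(shots, voices, invalid={0}) == [3, 4]
```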
S505, according to the shot segments to be processed of each video, determining an initial shot segment, an ending shot segment and an intermediate shot segment.
In this step, the shot segments to be processed are obtained by processing the shot segments of each video, and the initial shot segment, the ending shot segment, and the middle shot segments are then determined from the shot segments to be processed of each video. This reduces the number of shot segments to be identified, ensures that the generated promotional video includes only valid shot segments, and improves the content effectiveness of the promotional video.
For example, the initial shot segment, the end shot segment, and the middle shot segment may be determined from the shot segments to be processed of each video, respectively. It should be noted that this step is similar to the manner of the foregoing embodiment, and reference is made to the foregoing embodiment for specific description.
(1) For the initial shot segment: starting from the shot segments to be processed of the first video, determine, based on a preset video classification model, whether a shot segment to be processed with a highlight scenario exists, and if so, determine the first shot segment to be processed with a highlight scenario as the initial shot segment.
By this method, the electronic device can determine the initial shot segment from the shot segments to be processed without identifying the invalid shot segments of the video (the head, advertisements, repeated scenario, and the like), which reduces the number of shot segments to be identified and improves identification efficiency.
(2) For the ending shot segment: obtain first text recognition information of the shot segments to be processed of each video, where the first text recognition information characterizes the line information in the shot segments to be processed; starting from the shot segments to be processed located after the initial shot segment, determine, according to the text recognition information of each shot segment to be processed, whether it is a shot segment with a suspense scenario; and if so, determine the ending shot segment according to the duration from the initial shot segment to the shot segment to be processed with the suspense scenario.
For example, if the duration from the initial shot segment to the shot segment with a suspense scenario adjacent to it is less than a preset duration, for example, 3 minutes, the electronic device may continue to search the shot segments to be processed for a shot segment with a suspense scenario that satisfies the preset duration, until the ending shot segment is determined.
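A minimal sketch of this duration-based selection follows, assuming shots within a single video carry (start, end) spans in seconds and that `has_suspense` wraps the suspense classifier described above (a hypothetical helper).

```python
# A minimal sketch of ending-shot selection: pick the first suspense shot
# at least min_duration_s after the start of the initial shot.
def pick_ending_shot(shots, initial_idx: int, has_suspense,
                     min_duration_s: float = 180.0) -> int | None:
    promo_start = shots[initial_idx][0]
    for j in range(initial_idx + 1, len(shots)):
        if has_suspense(shots[j]) and shots[j][1] - promo_start >= min_duration_s:
            return j
    return None  # no suitable ending shot found; handle as a fallback case
```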
By this method, the electronic device can determine the ending shot segment from the shot segments to be processed without identifying the invalid shot segments of the video (the tail, advertisements, and the like), which reduces the number of shot segments to be identified and improves identification efficiency.
(3) For the middle shot segments: determine the shot segments whose timing lies between the initial shot segment and the ending shot segment as the middle shot segments.
S506, splicing the initial shot segment, the ending shot segment and the middle shot segments to obtain the promotional video of the target episode.
It should be noted that this step is similar to the aforementioned step S404, and will not be described here again.
According to the method for generating a promotional video of an episode provided in this embodiment, after identifying the shot categories of the shot segments of the videos, the electronic device can determine the shot segments to be deleted according to the voice segments corresponding to the videos and the shot categories, delete the shot segments corresponding to the shot categories and the shot segments to be deleted from the shot segments of the videos to obtain the shot segments to be processed, then determine the initial shot segment, the ending shot segment, and the middle shot segments from the shot segments to be processed, and splice them to obtain the promotional video of the target episode. In this way, the electronic device can ensure the integrity of the speech information in the shot segments to be processed based on the voice segments, improving the quality of the promotional video and, in turn, the user experience.
FIG. 7 is a block diagram of a device for generating a promotional video of an episode provided in an embodiment of the present disclosure, and the device 700 includes the following units:
The segmentation unit 701 is configured to perform segmentation processing on a video in a target episode to obtain a shot segment of the video, where the target episode includes at least one video;
A determining unit 702, configured to determine an initial shot segment, an end shot segment, and an intermediate shot segment from shot segments of each video, where a timing sequence of the intermediate shot segment is located between the initial shot segment and the end shot segment;
A generating unit 703, configured to generate a promotion video of the target episode according to the initial shot segment, the ending shot segment, and the middle shot segment.
In some possible embodiments, the determining unit 702 includes:
an identification module, used for identifying the shot category of each shot segment;
and a determining module, used for determining the initial shot segment, the ending shot segment and the middle shot segment from the shot segments of the videos according to the shot categories of the shot segments.
In some possible embodiments, the shot category is any one of: a shot category characterizing that the shot segment includes an episode cover, a shot category characterizing that the shot segment includes a tail, a shot category characterizing that the shot segment includes an overlapping scenario, and a shot category characterizing that the shot segment includes an advertisement.
In some possible embodiments, the video has audio corresponding to the video, and the determining module comprises:
an identification sub-module, used for identifying the human voice in the audio corresponding to the video to obtain the voice segments corresponding to the video;
a first determining sub-module, used for determining the shot segments to be processed of the video from all the shot segments of the video according to the shot categories of the shot segments in the video and the voice segments corresponding to the video, where the shot segments to be processed represent shot segments aligned in time with the voice segments of the video;
and a second determining sub-module, used for determining the initial shot segment, the ending shot segment and the middle shot segment according to the shot segments to be processed of the videos.
In some possible embodiments, the first determining submodule is specifically configured to:
determining a first voice segment from the voice segments corresponding to the video according to the shot categories of the shot segments in the video, where the first voice segment represents the voice segment that covers, in time, the shot segment corresponding to the shot category;
determining the shot segments to be deleted in the video according to the first voice segment, where the shot segments to be deleted represent shot segments that are not aligned in time with the other voice segments located after the first voice segment;
and deleting, from the shot segments of the video, the shot segments corresponding to the shot category and the shot segments to be deleted, to obtain the shot segments to be processed of the video.
In some possible embodiments, the second determining submodule is specifically configured to:
starting from the shot segments to be processed of the first video, sequentially determining, based on a preset video classification model, whether each shot segment to be processed has a highlight scenario;
if yes, determining the first shot segment to be processed with a highlight scenario as the initial shot segment.
In some possible embodiments, the second determining submodule is specifically configured to:
acquiring first text recognition information of the shot segments to be processed of each video, where the first text recognition information characterizes the line information in the shot segments to be processed;
determining, starting from the shot segments to be processed of each video located after the initial shot segment, whether each shot segment to be processed is a shot segment with a suspense scenario according to its text recognition information;
if yes, determining the ending shot segment according to the duration from the initial shot segment to the shot segment to be processed with the suspense scenario.
In some possible embodiments, the identification module includes:
a first extraction sub-module, used for extracting the first frame picture of the first shot segment from the shot segments of the video;
and a third determining sub-module, used for determining, when the first frame picture is determined to be an episode cover, that the shot category of the first shot segment is the shot category including the episode cover.
In some possible embodiments, the identification module includes:
a first acquisition sub-module, used for acquiring speech recognition information of the shot segments of the video, where the speech recognition information characterizes the text information of the speech included in the shot segments;
and a fourth determining sub-module, configured to sequentially determine, according to the speech recognition information and starting from the last shot segment of the video, whether speech recognition information exists in each shot segment of the video until a shot segment with speech recognition information is determined, and to determine the shot category of the shot segments of the video whose timing is located after the shot segment with speech recognition information as the shot category including the tail.
In some possible embodiments, the identification module includes:
a second acquisition sub-module, used for acquiring, for each video of the target episode, second text recognition information of the shot segments of the video and third text recognition information of the shot segments of the previous video, where the text recognition information characterizes the line information in the shot segments;
and a fifth determining sub-module, configured to determine, for each shot segment of the video, that the shot category of the shot segment is the shot category including the overlapping scenario if it is determined that the similarity between the second text recognition information of the shot segment and the third text recognition information of a shot segment of the previous video is greater than or equal to a preset threshold.
In some possible embodiments, the identification module includes:
a second extraction sub-module, used for extracting any frame picture from the shot segments of the video;
and a sixth determining sub-module, configured to determine, when the frame picture is determined to be an advertisement picture, that the shot category of the shot segment is the shot category including the advertisement.
In some possible embodiments, the generating unit includes:
a splicing module, used for splicing the initial shot segment, the ending shot segment and the middle shot segment to obtain the promotional video of the target episode.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, etc.; an output unit 807, such as various types of displays, speakers, etc.; a storage unit 808, such as a magnetic disk, an optical disk, etc.; and a communication unit 809, such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the method for generating a promotional video of an episode. For example, in some embodiments, the method for generating a promotional video of an episode may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method for generating a promotional video of an episode described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method for generating a promotional video of an episode in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.