Disclosure of Invention
The disclosure provides a video text removal method, an apparatus, an electronic device, and a storage medium, so as to at least solve the problem of a poor text removal effect in the related art. The technical solution of the present disclosure is as follows:
According to a first aspect of an embodiment of the present disclosure, there is provided a video text removal method, including:
responding to a text clearing request for a target video, acquiring a text clearing region corresponding to the text clearing request;
determining, based on the text clearing region, a target text region image matching the text clearing region from a plurality of initial text region images of the target video;
extracting a text mask region from the target text region image, wherein the text mask region is used for representing text region information corresponding to the text contained in the target text region image;
and performing text clearing processing on the target video based on the text mask region.
In an exemplary embodiment, the target video comprises a plurality of video frames, and before the target text region image matching the text clearing region is determined from the plurality of initial text region images of the target video, the method further comprises: extracting target video frames from the plurality of video frames at a preset video frame interval, and extracting an image region carrying text from each target video frame as an initial text region image.
In an exemplary embodiment, the determining a target text region image matching the text clearing region from a plurality of initial text region images of the target video includes: acquiring a plurality of initial text region images corresponding to the target video frames and the image region corresponding to each initial text region image; acquiring a region overlap ratio between each image region and the text clearing region; and taking an initial text region image as the target text region image when its region overlap ratio is greater than a preset overlap ratio threshold.
In an exemplary embodiment, the extracting a text mask region from the target text region image includes: obtaining a first current video frame from the plurality of video frames; obtaining, when the first current video frame belongs to the target video frames, the target text region image corresponding to the first current video frame; and inputting the target text region image corresponding to the first current video frame into a pre-trained text mask detection network, and obtaining, through the text mask detection network, the text mask region corresponding to the first current video frame, wherein the text mask detection network is trained according to a sample text mask region of a sample text region image and a sample background region of the sample text region image.
In an exemplary embodiment, after the first current video frame is acquired from the plurality of video frames, the method further includes: acquiring, when the first current video frame does not belong to the target video frames, an adjacent target video frame of the first current video frame; acquiring the target text region image corresponding to the adjacent target video frame and the text mask region corresponding to the adjacent target video frame; acquiring a first image matching that text mask region in the adjacent target video frame and a second image matching that text mask region in the first current video frame; acquiring a difference value between the first image and the second image; and taking, when the difference value is smaller than a preset difference threshold, the text mask region corresponding to the adjacent target video frame as the text mask region corresponding to the first current video frame.
In an exemplary embodiment, the target video includes a plurality of video frames, and the performing text clearing processing on the target video based on the text mask region includes: obtaining, from the plurality of video frames, a second current video frame and the text mask region corresponding to the second current video frame; obtaining, when the second current video frame is not the first frame of the plurality of video frames, a previous video frame of the second current video frame from the plurality of video frames; and performing text clearing processing on the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame, to obtain a target cleared video frame corresponding to the second current video frame.
In an exemplary embodiment, the performing text clearing processing on the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame to obtain the target cleared video frame corresponding to the second current video frame includes: obtaining an initial cleared video frame corresponding to the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame; obtaining the target cleared video frame corresponding to the previous video frame and the text mask region corresponding to the previous video frame; and inputting the target cleared video frame corresponding to the previous video frame, the text mask region corresponding to the previous video frame, the initial cleared video frame corresponding to the second current video frame, and the text mask region corresponding to the second current video frame into a pre-trained anti-flicker suppression network, and obtaining, through the anti-flicker suppression network, the target cleared video frame corresponding to the second current video frame.
According to a second aspect of an embodiment of the present disclosure, there is provided a video text removal apparatus, including:
a clear area acquisition unit configured to acquire, in response to a text clearing request for a target video, a text clearing area corresponding to the text clearing request;
a target image acquisition unit configured to determine, based on the text clearing area, a target text region image matching the text clearing area from a plurality of initial text region images of the target video;
a text mask extraction unit configured to extract a text mask region from the target text region image, the text mask region being used for characterizing text region information corresponding to the text contained in the target text region image; and
a text clearing processing unit configured to perform text clearing processing on the target video based on the text mask region.
In an exemplary embodiment, the target video includes a plurality of video frames, the target image acquisition unit is further configured to extract a target video frame from the plurality of video frames at a preset video frame interval, and extract an image area carrying text from the target video frame as an initial text area image.
In an exemplary embodiment, the target image obtaining unit is further configured to obtain a plurality of initial text region images corresponding to the target video frame and image regions corresponding to the initial text region images, obtain a region overlap ratio between each image region and the text removal region, and take the initial text region image as the target text region image if the region overlap ratio is greater than a preset overlap ratio threshold.
In an exemplary embodiment, the text mask extracting unit is further configured to: obtain a first current video frame from the plurality of video frames; obtain, when the first current video frame belongs to the target video frames, the target text region image corresponding to the first current video frame; and input the target text region image corresponding to the first current video frame into a pre-trained text mask detection network and obtain, through the text mask detection network, the text mask region corresponding to the first current video frame, wherein the text mask detection network is obtained by training according to a sample text mask region of a sample text region image and a sample background region of the sample text region image.
In an exemplary embodiment, the text mask extracting unit is further configured to: acquire, when the first current video frame does not belong to the target video frames, an adjacent target video frame of the first current video frame; acquire the target text region image corresponding to the adjacent target video frame and the text mask region corresponding to the adjacent target video frame; acquire a first image matching that text mask region in the adjacent target video frame and a second image matching that text mask region in the first current video frame; acquire a difference value between the first image and the second image; and take, when the difference value is smaller than a preset difference threshold, the text mask region corresponding to the adjacent target video frame as the text mask region corresponding to the first current video frame.
In an exemplary embodiment, the target video includes a plurality of video frames, and the text removal processing unit is further configured to: obtain, from the plurality of video frames, a second current video frame and the text mask region corresponding to the second current video frame; obtain, when the second current video frame is not the first frame of the plurality of video frames, a previous video frame of the second current video frame; and perform text removal processing on the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame, to obtain the target removed video frame corresponding to the second current video frame.
In an exemplary embodiment, the text removal processing unit is further configured to: obtain an initial removed video frame corresponding to the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame; obtain the target removed video frame corresponding to the previous video frame and the text mask region corresponding to the previous video frame; and input the target removed video frame corresponding to the previous video frame, the text mask region corresponding to the previous video frame, the initial removed video frame corresponding to the second current video frame, and the text mask region corresponding to the second current video frame into a pre-trained anti-flicker suppression network to obtain, through the anti-flicker suppression network, the target removed video frame corresponding to the second current video frame.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the video text removal method according to any one of the embodiments of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the video text removal method according to any one of the embodiments of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the video text removal method according to any one of the embodiments of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
In response to a text clearing request for a target video, a text clearing region corresponding to the text clearing request is obtained; based on the text clearing region, a target text region image matching the text clearing region is determined from a plurality of initial text region images of the target video; a text mask region is extracted from the target text region image, the text mask region being used for representing text region information corresponding to the text contained in the target text region image; and text clearing processing is performed on the target video based on the text mask region. In this way, after the target text region image is obtained, the text mask region characterizing the text region information of the text contained in the image is further extracted from it, and the text clearing processing of the target video is realized based on this text mask region, so that the clearing acts on the text itself rather than on the entire region image, thereby alleviating the problem of a poor text clearing effect in the related art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be further noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Fig. 1 is a flowchart illustrating a video text removal method according to an exemplary embodiment, and as shown in fig. 1, the video text removal method may be used in a terminal, including the following steps.
In step S101, in response to a text removal request for a target video, a text removal area corresponding to the text removal request is acquired.
The target video refers to a video on which text clearing processing needs to be performed; the text clearing request refers to a request, triggered by a user through a terminal, for clearing part of the text presented in the target video; and the text clearing region is the video image region selected by the user for the text clearing processing. In this embodiment, when a user needs to clear part of the text in a certain video, a text clearing request for the video may be triggered through the user's terminal, and the text region to be cleared may be selected; the terminal may respond to the request, take the video to be subjected to text clearing processing as the target video, and take the region selected by the user as the text clearing region.
For example, when a user needs to clear the subtitles in a video, a text clearing request for the video can be initiated through the user's terminal, and the region presenting the subtitles in the video can be selected as the text region to be cleared.
In step S102, a target text region image corresponding to the text removal region is determined from a plurality of initial text region images of the target video based on the text removal region.
In this embodiment, the target video may carry a plurality of region images containing text; for example, a target video may simultaneously display video subtitles and video title text, so that the target video may carry a plurality of initial text region images at the same time. The target text region image refers to the initial text region image, among the plurality of initial text region images, that matches the text clearing region selected by the user. Specifically, after determining the target video, the terminal can identify the region images carrying text from the target video by using a text recognition technology as the initial text region images, and can further screen out, from the identified initial text region images, the one matching the text clearing region selected by the user as the target text region image.
For example, the plurality of initial text region images identified from the target video by the text recognition technology may include a subtitle region image presenting the video subtitles and a title region image presenting the video title. If the user needs to clear the video subtitles, the subtitle region of the video may be selected as the text clearing region, and the terminal may then take the subtitle region image as the target text region image.
In step S103, a text mask region is extracted from the target text region image, wherein the text mask region is used for representing text region information corresponding to the text contained in the target text region image;
in step S104, a text clearing process is performed on the target video based on the text mask region.
The text mask region refers to the text region corresponding to the text contained in the target text region image. In this embodiment, the extracted target text region image includes, besides the text image itself, part of the background region image; the text mask region is the image region corresponding to the text image itself, and the image in this region contains only the text. After the terminal determines the target text region image, the text mask region contained in it may be further extracted from the target text region image. Finally, after the text mask region is obtained, the terminal may perform clearing processing on the text in the target video by using the text mask region; for example, the text mask region may be filled by using the image near the text mask region, so as to clear the text represented by the text mask region.
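As a simple illustration of filling the text mask region from nearby pixels, the sketch below uses classical OpenCV inpainting; this is only one possible stand-in for the filling step, not the neural-network-based clearing described in the embodiments below:

```python
import cv2
import numpy as np

def fill_text_region(frame: np.ndarray, text_mask: np.ndarray) -> np.ndarray:
    """Fill the text mask region using surrounding image content.

    frame: 8-bit BGR video frame; text_mask: 8-bit single-channel mask that is
    non-zero exactly where the text pixels to be cleared are located.
    Classical inpainting is used here purely for illustration.
    """
    return cv2.inpaint(frame, text_mask, 3, cv2.INPAINT_TELEA)
```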
According to the video text clearing method, in response to a text clearing request for a target video, a text clearing region corresponding to the text clearing request is obtained; based on the text clearing region, a target text region image matching the text clearing region is determined from a plurality of initial text region images of the target video; a text mask region is extracted from the target text region image, the text mask region being used for representing text region information corresponding to the text contained in the target text region image; and text clearing processing is performed on the target video based on the text mask region. In this way, after the target text region image is obtained, the text mask region characterizing the text region information of the text contained in the image is further extracted, and the text clearing processing of the target video is realized based on this text mask region, so that the clearing acts on the text itself rather than on the entire region image, improving the text clearing effect.
In an exemplary embodiment, the target video includes a plurality of video frames, and before step S102, the method may further include extracting the target video frames from the plurality of video frames according to a preset video frame interval, and extracting an image area carrying text from the target video frames as an initial text area image.
In this embodiment, the target video may be composed of a plurality of video frames, and the target video frames refer to the video frames extracted at a preset video frame interval. The video frame interval may be preset by the user; for example, the user may set that one target video frame is obtained every 3 frames, and after obtaining the plurality of video frames of the target video, the terminal may sample the video frames of the target video every 3 frames to obtain the corresponding target video frames. The terminal may then input the obtained target video frames into a pre-trained text recognition network, and the text recognition network outputs the image regions containing text in the target video frames as the initial text region images. Compared with inputting all video frames of the target video into the text recognition network to obtain the initial text region images, extracting target video frames and inputting only those frames into the network improves the running speed of the model and thereby the text clearing efficiency.
For example, the target video may include video frame 0, video frame 1, video frame 2, video frame 3, video frame 4, video frame 5, video frame 6, and video frame 7. If the preset video frame interval specifies extracting one target video frame every 3 frames, the terminal may input video frame 0, video frame 3, and video frame 6 as the target video frames into the text recognition network to obtain the corresponding initial text region images, without inputting all the video frames into the text recognition network, so that the running speed of the model and thus the text clearing efficiency can be improved.
In this embodiment, target video frames may be extracted from the plurality of video frames at a preset video frame interval, and the initial text region images need to be extracted only from these target video frames, which improves the model running speed and thus the text clearing efficiency.
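A minimal sketch of the frame-interval sampling, assuming a hypothetical `detect_text_regions` helper standing in for the pre-trained text recognition network, might look like this:

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

BBox = Tuple[int, int, int, int]  # (x, y, w, h) of a detected text region

def extract_initial_text_regions(
    frames: List[np.ndarray],
    interval: int,
    detect_text_regions: Callable[[np.ndarray], List[BBox]],
) -> Dict[int, List[BBox]]:
    """Sample one target frame every `interval` frames and detect its text regions.

    Only the sampled target frames are passed to the recognition network,
    which is what improves the running speed described above.
    """
    regions_by_frame: Dict[int, List[BBox]] = {}
    for idx in range(0, len(frames), interval):  # interval=3 -> frames 0, 3, 6, ...
        regions_by_frame[idx] = detect_text_regions(frames[idx])
    return regions_by_frame
```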
Further, as shown in fig. 2, step S102 may further include:
in step S201, a plurality of initial text region images corresponding to the target video frame and image regions corresponding to the respective initial text region images are acquired.
The image region corresponding to an initial text region image refers to the region of that initial text region image within the target video frame. In this embodiment, after obtaining the initial text region images of each target video frame, the terminal may further determine the region in which each initial text region image is located in the corresponding target video frame as the image region corresponding to that initial text region image.
In step S202, the region overlap ratio between each image region and the text removal region is obtained;
in step S203, when the region overlap ratio is greater than a preset overlap ratio threshold, the initial text region image is taken as the target text region image.
The region overlap ratio refers to the overlap ratio between the image region corresponding to the initial text region image and the text removal region, and the region overlap ratio can be obtained by the ratio of the overlap area between the image region corresponding to each initial text region image and the text removal region to the image region area corresponding to each initial text region image. In this embodiment, after obtaining the image area corresponding to each initial text area image, the terminal may compare the image area corresponding to each initial text area image with the text removal area to obtain the area overlap ratio between the image area corresponding to each initial text area image and the text removal area, and then the terminal may compare the obtained area overlap ratio with a preset overlap ratio threshold, and only if the area overlap ratio is greater than the preset overlap ratio threshold, the initial text area image corresponding to the image area may be used as the target text area image.
For example, the initial text region images included in the target video frame may include text region image A, text region image B, and text region image C, corresponding to image region A, image region B, and image region C, respectively. The terminal may then calculate the region overlap ratio between image region A and the text clearing region selected by the user, the region overlap ratio between image region B and that text clearing region, and the region overlap ratio between image region C and that text clearing region, and compare each region overlap ratio with the preset overlap ratio threshold. If only the region overlap ratio between image region C and the text clearing region selected by the user is greater than the overlap ratio threshold, the terminal may take text region image C as the target text region image.
In this embodiment, after obtaining a plurality of initial text region images of the target video frame, the terminal may further screen out the target text region image based on the region overlapping ratio between the image region corresponding to each initial text region image and the text clearing region, so as to improve the accuracy of screening the target text region image and further improve the accuracy of text clearing processing.
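The overlap-ratio screening of steps S201 to S203 can be sketched as follows; the rectangle representation and the example threshold value are assumptions for illustration:

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x, y, w, h)

def region_overlap_ratio(region: BBox, removal_region: BBox) -> float:
    """Overlap area between an image region and the text clearing region,
    divided by the image region's own area, as described above."""
    x1, y1, w1, h1 = region
    x2, y2, w2, h2 = removal_region
    inter_w = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    inter_h = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return (inter_w * inter_h) / float(w1 * h1)

def select_target_regions(regions: List[BBox], removal_region: BBox,
                          threshold: float = 0.5) -> List[BBox]:
    """Keep only the initial text regions whose overlap ratio exceeds the
    preset threshold (0.5 is an arbitrary example value)."""
    return [r for r in regions if region_overlap_ratio(r, removal_region) > threshold]
```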
In an exemplary embodiment, as shown in fig. 3, step S103 may further include:
in step S301, a first current video frame is acquired from a plurality of video frames.
The first current video frame may be any one of a plurality of video frames, and in this embodiment, the terminal may select any one of the plurality of video frames of the target video as the first current video frame.
In step S302, in the case that the first current video frame belongs to the target video frame, a target text region image corresponding to the first current video frame is acquired.
Since the terminal extracts the corresponding initial text region images from the target video frames and further screens out the target text region image corresponding to each target video frame, if the first current video frame selected in step S301 is one of the target video frames extracted from the plurality of video frames at the video frame interval, that is, when the first current video frame belongs to the target video frames, the terminal may directly obtain the target text region image corresponding to the first current video frame.
In step S303, the target text region image corresponding to the first current video frame is input into a pre-trained text mask detection network, and the text mask region corresponding to the first current video frame is obtained through the text mask detection network, wherein the text mask detection network is trained according to a sample text mask region of a sample text region image and a sample background region of the sample text region image.
The text mask detection network is a neural network for detecting the text masks carried in images, and it can be trained on sample images carrying text, namely sample text region images. For example, the user may collect sample text region images for training the text mask detection network in advance and label them, so as to determine the text mask region contained in each sample text region image, namely the sample text mask region, and the background region contained in it, namely the sample background region. The terminal can then train a neural network model that classifies the text mask region and the background region using the sample text region images together with their corresponding sample text mask regions and sample background regions, thereby obtaining the text mask detection network for detecting text mask regions.
After the training of the text mask detection network is completed, the target text region image corresponding to the first current video frame can be further input into the trained text mask detection network, so that the text mask region corresponding to the first current video frame can be output by the text mask detection network.
In this embodiment, if the first current video frame belongs to the target video frames, the target text region image corresponding to the first current video frame may be input into the pre-trained text mask detection network, and the text mask detection network outputs the corresponding text mask region. Because the text mask detection network is trained on the sample text mask regions and sample background regions of sample text region images, it can output an accurate text mask region, which further improves the accuracy of the text mask region obtained for the first current video frame.
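A sketch of the inference step, assuming a hypothetical pre-trained `mask_net` that outputs per-pixel text logits (the actual network architecture is not specified by this embodiment):

```python
import torch

def detect_text_mask(region_image: torch.Tensor, mask_net: torch.nn.Module,
                     prob_threshold: float = 0.5) -> torch.Tensor:
    """Run the text mask detection network on a target text region image
    (C, H, W) and binarize its text/background prediction.

    `mask_net` and the 0.5 threshold are illustrative assumptions; the network
    is trained to separate sample text mask pixels from sample background
    pixels, as described above.
    """
    with torch.no_grad():
        logits = mask_net(region_image.unsqueeze(0))  # assumed (1, 1, H, W) text logits
        probs = torch.sigmoid(logits)[0, 0]
    return (probs > prob_threshold).to(torch.uint8)   # 1 = text mask, 0 = background
```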
In addition, as shown in fig. 4, after step S301, it may further include:
In step S401, in the case where the first current video frame does not belong to the target video frames, an adjacent target video frame of the first current video frame is acquired.
The adjacent target video frame refers to a target video frame adjacent to the first current video frame: if the first current video frame does not belong to the target video frames, the terminal may acquire the target video frames adjacent to it as its adjacent target video frames. For example, the target video may include video frame 0, video frame 1, video frame 2, video frame 3, video frame 4, video frame 5, video frame 6, and video frame 7, where video frame 0, video frame 3, and video frame 6 are the target video frames. If the first current video frame is video frame 1 or video frame 2, since it does not belong to the target video frames, the terminal may take the target video frames adjacent to it as its adjacent target video frames, that is, video frame 0 and video frame 3 are the adjacent target video frames of video frame 1 or video frame 2; and if the first current video frame is video frame 4 or video frame 5, the terminal may take video frame 3 and video frame 6 as its adjacent target video frames.
In step S402, a target text region image corresponding to an adjacent target video frame and a text mask region corresponding to the adjacent target video frame are acquired.
After obtaining the adjacent target video frames corresponding to each first current video frame, the terminal may obtain an initial text region image corresponding to each adjacent target video frame, and may further screen out a corresponding target text region image of each adjacent target video frame, and then the terminal may further obtain a text mask region corresponding to each target text region image, for example, may input each target text region image into a pre-trained text mask detection network, and output, through the text mask detection network, a text mask region corresponding to each target text region image, thereby obtaining a text mask region corresponding to each adjacent target video frame.
In step S403, a first image in the adjacent target video frame, which is matched with the text mask area corresponding to the adjacent target video frame, and a second image in the first current video frame, which is matched with the text mask area corresponding to the adjacent target video frame, are obtained, and a difference value between the first image and the second image is obtained;
in step S404, if the difference value is smaller than the preset difference value threshold, the text mask area corresponding to the adjacent target video frame is used as the text mask area corresponding to the first current video frame.
In this embodiment, the first image refers to the image displayed in the text mask region of the adjacent target video frame, that is, the text image to be removed in that target video frame; the second image is the image displayed at the same text mask region position in the first current video frame. If the difference between the first image and the second image is small, that is, if the difference value between them is smaller than the preset difference threshold, the similarity between the two images is high, meaning that the text shown in the adjacent target video frame is also displayed in the first current video frame; therefore, the text mask region corresponding to the adjacent target video frame may be used as the text mask region corresponding to the first current video frame.
Taking video frame 1 as the first current video frame, its adjacent target video frames are video frame 0 and video frame 3. Suppose the first image corresponding to video frame 0 is image A, the first image corresponding to video frame 3 is image B, and the second image corresponding to video frame 1 is image C. The terminal may then compute the difference value between image A and image C, and the difference value between image B and image C. If the difference value between image A and image C is smaller than the preset difference threshold, the image displayed in image C at the text mask region is the same as image A, that is, the text corresponding to image A is also displayed in video frame 1, so the terminal may take the text mask region corresponding to video frame 0 as the text mask region corresponding to video frame 1; similarly, if the difference value between image B and image C is smaller than the preset difference threshold, the terminal may take the text mask region corresponding to video frame 3 as the text mask region corresponding to video frame 1.
In this embodiment, if the first current video frame does not belong to the target video frames, the difference value between the image displayed in the text mask region of the adjacent target video frame and the image displayed at the same position in the first current video frame may be compared against the preset difference threshold; if the difference value is smaller than the threshold, the text mask region of the adjacent target video frame may be used as the text mask region of the first current video frame. The corresponding text mask region is thus obtained without performing detection on a target text region image of the first current video frame, further improving the efficiency of video text clearing.
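Steps S401 to S404 might be sketched as follows; the mean absolute difference is one plausible choice for the difference value, not necessarily the one used by the embodiment:

```python
from typing import Optional

import numpy as np

def propagate_mask(adjacent_frame: np.ndarray, current_frame: np.ndarray,
                   mask: np.ndarray, diff_threshold: float) -> Optional[np.ndarray]:
    """Reuse the adjacent target frame's text mask for the current frame when
    the masked image contents barely differ.

    `diff_threshold` plays the role of the preset difference threshold.
    Returns the reusable mask, or None when the mask cannot be propagated.
    """
    first_image = adjacent_frame[mask > 0].astype(np.float32)  # pixels of the first image
    second_image = current_frame[mask > 0].astype(np.float32)  # same positions in the current frame
    diff_value = float(np.abs(first_image - second_image).mean())
    return mask if diff_value < diff_threshold else None
```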
In an exemplary embodiment, the target video contains a plurality of video frames, and as shown in fig. 5, step S104 may further include:
In step S501, a second current video frame and a text mask area corresponding to the second current video frame are obtained from the plurality of video frames.
In this embodiment, after obtaining the text mask region corresponding to each video frame of the target video, the terminal may select any one of the plurality of video frames as the second current video frame and take the text mask region corresponding to that video frame as the text mask region corresponding to the second current video frame.
In step S502, in the case where the second current video frame is not the first frame of the plurality of video frames, a previous video frame of the second current video frame is acquired from the plurality of video frames;
In step S503, according to the previous video frame, the second current video frame, and the text mask area corresponding to the second current video frame, text clearing processing is performed on the second current video frame, so as to obtain a target clear video frame corresponding to the second current video frame.
The previous video frame refers to the video frame immediately before the second current video frame, and the target cleared video frame is the video frame displayed after the text clearing processing is performed on a video frame of the target video. If the second current video frame is not the first frame of the plurality of video frames, then besides using the second current video frame and its corresponding text mask region, this embodiment further introduces the previous video frame of the second current video frame. The text clearing may be realized through a neural network model for image clearing: the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame are used as the model input, and the neural network model performs the text clearing processing on the second current video frame. Introducing the previous video frame can alleviate the temporal inconsistency of the results caused by single-frame text clearing, thereby further improving the text clearing effect.
In this embodiment, in the process of performing text clearing processing on the second current video frame to obtain the target cleared video frame, the previous video frame of the second current video frame is further introduced as input, so that the temporal inconsistency of the results caused by single-frame text clearing can be alleviated, further improving the text clearing effect.
Further, as shown in fig. 6, step S503 may further include:
in step S601, an initial clear video frame corresponding to the second current video frame is obtained according to the previous video frame, the second current video frame, and the text mask area corresponding to the second current video frame.
In this embodiment, obtaining the target cleared video frame corresponding to the second current video frame may be implemented by a neural network model for text clearing, and the model may be composed of two parts: a text clearing network for clearing text, and a flicker suppression network for further suppressing the flicker problem caused by a lack of temporal continuity. The initial cleared video frame refers to the output produced by the text clearing network for the second current video frame. Specifically, the terminal may input the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame into a pre-trained text clearing network, and the text clearing network outputs the initial cleared video frame corresponding to the second current video frame.
In step S602, a target clear video frame corresponding to a previous video frame and a text mask area corresponding to the previous video frame are obtained;
in step S603, the target clear video frame corresponding to the previous video frame, the text mask area corresponding to the previous video frame, the initial clear video frame corresponding to the second current video frame, and the text mask area corresponding to the second current video frame are input into the pre-trained anti-flicker suppression network, and the target clear video frame corresponding to the second current video frame is obtained through the anti-flicker suppression network.
The target cleared video frame is the final output of the flicker suppression network, namely the final text clearing result for the video frame. In this embodiment, after obtaining the initial cleared video frame corresponding to the second current video frame, the terminal may further obtain the target cleared video frame corresponding to the previous video frame of the second current video frame and the text mask region corresponding to the previous video frame, and may input the target cleared video frame of the previous video frame, the text mask region of the previous video frame, the initial cleared video frame of the second current video frame, and the text mask region of the second current video frame as the model input into a pre-trained anti-flicker suppression network. The anti-flicker suppression network is formed by an encoder (Encoder), a gated recurrent unit (GRU), and a decoder (Decoder), where the gated recurrent unit carries a hidden (implicit) value, and the target cleared video frame corresponding to the second current video frame is obtained through the anti-flicker suppression network.
Taking the video frame 2 as the second current video frame as an example, the video frame 1 may be taken as the previous video frame of the second current video frame, the terminal may input the video frame 1, the video frame 2 and the text mask area corresponding to the video frame 2 to the text clearing network for clearing text, the initial clearing video frame corresponding to the video frame 2 may be obtained through the text clearing network, and the target clearing video frame corresponding to the video frame 1, the text mask area corresponding to the video frame 1, the initial clearing video frame corresponding to the video frame 2 and the text mask area corresponding to the video frame 2 may be further input to the pre-trained anti-flicker suppression network, so that the target clearing video frame corresponding to the video frame 2 may be obtained by using the anti-flicker suppression network.
In this embodiment, when the text removal processing is performed on the second current video frame, an anti-flicker suppression network is further introduced, so that the flicker problem caused by the lack of temporal continuity can be further suppressed, and the text removal effect is further improved.
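Putting steps S601 to S603 together, the per-frame flow can be sketched as below, where `clearing_net` and `antiflicker_net` are hypothetical pre-trained stand-ins for the two networks described above:

```python
import torch

def clear_text_in_frame(prev_frame: torch.Tensor, cur_frame: torch.Tensor,
                        prev_mask: torch.Tensor, cur_mask: torch.Tensor,
                        prev_target_cleared: torch.Tensor,
                        clearing_net, antiflicker_net, hidden):
    """Two-stage text clearing for one frame.

    Stage 1 (text clearing network): previous frame + current frame + current
    mask -> initial cleared video frame.
    Stage 2 (anti-flicker suppression network): previous frame's target
    cleared result and mask + current initial result and mask -> target
    cleared video frame, with the GRU hidden state carried across frames.
    """
    initial_cleared = clearing_net(prev_frame, cur_frame, cur_mask)
    target_cleared, hidden = antiflicker_net(
        prev_target_cleared, prev_mask, initial_cleared, cur_mask, hidden)
    return target_cleared, hidden
```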
In an exemplary embodiment, there is also provided a subtitle cleaning method for cleaning video subtitles, as shown in fig. 7, the method may specifically include the steps of:
(1) Subtitle region detection
First, if the subtitles need to be cleared, the specific range of the subtitle region must be determined. This step is speed-critical: for a video, the time consumed in detecting subtitles is linearly related to the number of frames on which the subtitle detection algorithm is invoked. Therefore, a frame-skip detection mode is designed for subtitle region detection, i.e., for every buffer of n video frames, only the subtitle region in the n-th video frame is detected.
When detecting subtitles, a general algorithm from the field of text recognition is used to detect the rectangular subtitle regions, namely:
B_i = D(I_i),

where D(·) denotes the subtitle detection network, I_i is the i-th video frame, and B_i is the set of rectangular subtitle regions detected in the i-th frame.
(2) Subtitle mask detection network
For each detected rectangular subtitle region, the specific mask of the subtitle needs to be obtained accurately as the input of the subsequent clearing module. The subtitle and the background can be classified as two different types of scenes by using a classification network, namely:
M_i^k = E(G(I_i, B_i^k)),

where M_i^k is the subtitle mask corresponding to the k-th rectangular subtitle region of the i-th frame, E(·) is the designed subtitle mask recognition network, and G(·) crops an image patch from the frame I_i and the corresponding rectangular region B_i^k as the input to the subtitle mask recognition network.
(3) Subtitle tracking
Meanwhile, since subtitle detection is not performed frame by frame, a subtitle tracking model is also designed. For a scene in which the same subtitle persists, the pixel values in the detected subtitle mask region remain unchanged while the background region may change. Therefore, given a threshold σ, for an adjacent frame I_{i±1} of I_i, if:
Diff(I_i ⊙ M_i^k, I_{i±1} ⊙ M_i^k) < σ,

then the adjacent frame I_{i±1} and I_i are identified as sharing the same continuous subtitle, where Diff(·,·) is the inter-frame difference computed within the mask region.
Therefore, by propagating the mask of the n-th frame toward the head and tail frames within the buffer of n video frames, the start and end positions of the same subtitle within those n video frames can be obtained.
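A sketch of this bidirectional propagation within the buffer, again using mean absolute difference as a stand-in for Diff():

```python
from typing import Dict, List

import numpy as np

def track_subtitle_masks(frames: List[np.ndarray], anchor_idx: int,
                         anchor_mask: np.ndarray, sigma: float) -> Dict[int, np.ndarray]:
    """Propagate the anchor (n-th) frame's subtitle mask toward the head and
    tail of the buffer, stopping as soon as adjacent frames differ by more
    than sigma inside the mask; the frames that received the mask delimit the
    start and end positions of the same subtitle."""
    masks = {anchor_idx: anchor_mask}
    for step in (-1, 1):                       # toward the head, then toward the tail
        i = anchor_idx + step
        while 0 <= i < len(frames):
            neighbor = frames[i - step]        # the adjacent frame closer to the anchor
            diff = float(np.abs(frames[i][anchor_mask > 0].astype(np.float32)
                                - neighbor[anchor_mask > 0].astype(np.float32)).mean())
            if diff >= sigma:                  # subtitle changed; stop propagating
                break
            masks[i] = anchor_mask
            i += step
    return masks
```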
(4) Subtitle clearing
In order to make the clearing capability of the network fast and stable, the designed clearing algorithm is based on a two-stage network structure. For light weight, structures such as the convolution layers and attention mechanisms in the network are greatly reduced while a similar effect is maintained. Moreover, unlike conventional image clearing algorithms, in order to mitigate the temporal inconsistency of the results caused by single-frame clearing, the input of the previous frame is also added to ensure the consistency of the output results:
Ô_i = F(I_{i-1}, I_i, M_i),

where Ô_i is the output result after the text in frame i is erased and F(·) is the designed clearing network.
After that, in order to further suppress the flicker problem caused by the lack of temporal continuity, a lightweight, high-speed anti-flicker suppression network consisting of an encoder (Encoder), a gated recurrent unit (GRU), and a decoder (Decoder) is added before the final result is obtained, i.e.:
(O_i, H_i) = C(O_{i-1}, M_{i-1}, Ô_i, M_i, H_{i-1}),

where O_i is the final stable output result, H is the implicit (hidden) value required by the gated recurrent unit, and C(·) is the designed anti-flicker suppression network.
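As an architectural sketch only (channel widths, kernel sizes, and the plain conv stacks are assumptions, not the patented design), an Encoder-GRU-Decoder anti-flicker module matching the interface of C(·) above could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell; its hidden state plays the role of H."""
    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h=None):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n

class AntiFlickerNet(nn.Module):
    """Encoder-GRU-Decoder sketch of C(O_{i-1}, M_{i-1}, Ô_i, M_i, H_{i-1})."""
    def __init__(self, feat: int = 32):
        super().__init__()
        # Inputs: previous final output (3) + its mask (1)
        #       + current initial output (3) + its mask (1) = 8 channels.
        self.encoder = nn.Sequential(nn.Conv2d(8, feat, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.gru = ConvGRUCell(feat)
        self.decoder = nn.Sequential(nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(feat, 3, 3, padding=1))

    def forward(self, prev_out, prev_mask, cur_init, cur_mask, h=None):
        x = self.encoder(torch.cat([prev_out, prev_mask, cur_init, cur_mask], dim=1))
        h = self.gru(x, h)
        return self.decoder(h), h  # final stable output O_i and new hidden state H_i
```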
Compared with existing subtitle clearing algorithms, this embodiment can detect the subtitle range accurately and rapidly, accurately recognize subtitles in different languages and different fonts, and clear the subtitles stably and with high quality.
It should be understood that, although the steps in the flowcharts of fig. 1 to 7 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least a portion of the steps in fig. 1 to 7 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are also not necessarily performed sequentially, but may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
It should be understood that the same or similar parts of the method embodiments described above in this specification may refer to each other; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts, reference may be made to the descriptions of the other method embodiments.
Fig. 8 is a block diagram illustrating a video text clearing apparatus, according to an exemplary embodiment. Referring to fig. 8, the apparatus includes a clear area acquisition unit 801, a target image acquisition unit 802, a text mask extraction unit 803, and a text clearing processing unit 804.
A clear area acquisition unit 801 configured to perform acquisition of a text clear area corresponding to a text clear request in response to the text clear request to the target video;
a target image acquisition unit 802 configured to perform determination of a target text region image adapted to the text removal region from a plurality of initial text region images of the target video based on the text removal region;
A text mask extraction unit 803 configured to perform extraction of a text mask region from the target text region image, the text mask region being used to characterize text region information corresponding to text contained in the target text region image;
The text removal processing unit 804 is configured to perform text removal processing on the target video based on the text mask region.
In an exemplary embodiment, the target video comprises a plurality of video frames, the target image acquisition unit 802 is further configured to extract the target video frames from the plurality of video frames at preset video frame intervals, and extract an image area carrying text from the target video frames as an initial text area image.
In an exemplary embodiment, the target image obtaining unit 802 is further configured to perform obtaining a plurality of initial text region images corresponding to the target video frame and image regions corresponding to the respective initial text region images, obtain a region overlap ratio between the respective image regions and the text removal region, and take the initial text region image as the target text region image if the region overlap ratio is greater than a preset overlap ratio threshold.
In an exemplary embodiment, the text mask extracting unit 803 is further configured to: obtain a first current video frame from the plurality of video frames; obtain, when the first current video frame belongs to the target video frames, the target text region image corresponding to the first current video frame; and input the target text region image corresponding to the first current video frame into a pre-trained text mask detection network and obtain, through the text mask detection network, the text mask region corresponding to the first current video frame, wherein the text mask detection network is obtained by training according to a sample text mask region of a sample text region image and a sample background region of the sample text region image.
In an exemplary embodiment, the text mask extracting unit 803 is further configured to: acquire, when the first current video frame does not belong to the target video frames, an adjacent target video frame of the first current video frame; acquire the target text region image corresponding to the adjacent target video frame and the text mask region corresponding to the adjacent target video frame; acquire a first image matching that text mask region in the adjacent target video frame and a second image matching that text mask region in the first current video frame; acquire a difference value between the first image and the second image; and take, when the difference value is smaller than a preset difference threshold, the text mask region corresponding to the adjacent target video frame as the text mask region corresponding to the first current video frame.
In an exemplary embodiment, the target video includes a plurality of video frames, and the text removal processing unit 804 is further configured to: obtain, from the plurality of video frames, a second current video frame and the text mask region corresponding to the second current video frame; obtain, when the second current video frame is not the first frame of the plurality of video frames, a previous video frame of the second current video frame; and perform text removal processing on the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame, to obtain the target removed video frame corresponding to the second current video frame.
In an exemplary embodiment, the text removal processing unit 804 is further configured to: obtain an initial removed video frame corresponding to the second current video frame according to the previous video frame, the second current video frame, and the text mask region corresponding to the second current video frame; obtain the target removed video frame corresponding to the previous video frame and the text mask region corresponding to the previous video frame; and input the target removed video frame corresponding to the previous video frame, the text mask region corresponding to the previous video frame, the initial removed video frame corresponding to the second current video frame, and the text mask region corresponding to the second current video frame into a pre-trained anti-flicker suppression network to obtain, through the anti-flicker suppression network, the target removed video frame corresponding to the second current video frame.
The specific manner in which the respective units perform their operations in the apparatus of the above embodiment has been described in detail in the embodiments of the method, and will not be described in detail here.
Fig. 9 is a block diagram illustrating an electronic device 900 for video text clearing, according to an example embodiment. For example, electronic device 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, and the like.
Referring to FIG. 9, an electronic device 900 can include one or more of a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operation of the electronic device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 902 can include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operations at the electronic device 900. Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, video, and so forth. The memory 904 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
The power supply component 906 provides power to the various components of the electronic device 900. Power supply components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 900.
The multimedia component 908 comprises a screen providing an output interface between the electronic device 900 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. When the electronic device 900 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor assembly 914 includes one or more sensors for providing status assessment of various aspects of the electronic device 900. For example, the sensor assembly 914 may detect an on/off state of the electronic device 900, a relative positioning of the components, such as a display and keypad of the electronic device 900, the sensor assembly 914 may also detect a change in position of the electronic device 900 or a component of the electronic device 900, the presence or absence of a user's contact with the electronic device 900, an orientation or acceleration/deceleration of the device 900, and a change in temperature of the electronic device 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the electronic device 900 and other devices, either wired or wireless. The electronic device 900 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory 904 including instructions executable by the processor 920 of the electronic device 900 to perform the above-described method. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising instructions executable by the processor 920 of the electronic device 900 to perform the above-described method.
It should be noted that the descriptions of the foregoing apparatus, the electronic device, the computer readable storage medium, the computer program product, and the like according to the method embodiments may further include other implementations, and the specific implementation may refer to the descriptions of the related method embodiments and are not described herein in detail.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.