Disclosure of Invention
The present disclosure provides an image processing method, an image processing apparatus, an electronic device, and a storage medium, so as to at least solve the technical problem in the related art that image style conversion is inefficient because it requires a large amount of computation and takes a long time. The technical solution of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an image processing method, including:
acquiring an image to be processed;
segmenting an object in the image to be processed to obtain a segmented object;
and performing image style conversion processing on the segmented object according to a selected image style through an image conversion model to obtain a processed target object, wherein the image conversion model is obtained through iterative training in which the number of channels of the depth separable convolution layer in the super network is first increased, and the added channels that do not satisfy a preset condition are then deleted.
Optionally, before the image-style conversion processing is performed on the segmented object according to the selected image style by the image conversion model, the method further includes:
generating a hyper-network based on a depth separable convolutional layer of a convolutional neural network, and determining parameters of the hyper-network;
and training a convolutional neural network model according to the determined parameters of the hyper-network to obtain the image conversion model.
Optionally, the generating a super network based on the depth separable convolutional layer of the convolutional neural network and determining parameters of the super network includes:
expanding the number C of input channels of the depth separable convolutional layer of the convolutional neural network to an integral multiple of 2 to obtain 2xNxC channels, wherein N is the number of convolution kernels;
splitting the expanded depth separable convolutional layer into convolution units i, and assigning a parameter g_i to each convolution unit i;
performing convolution processing on each expanded input channel and a corresponding convolution kernel respectively to obtain corresponding convolution results;
multiplying each convolution result by the parameter g_i of the corresponding convolution unit i to obtain the convolution parameter of the corresponding channel;
the convolution parameters of all channels are combined together as the parameters of the super network.
Optionally, the convolutional neural network model includes: a first generation network having a depth separable convolutional layer, a first discrimination network, and a second generation network;
the training of the convolutional neural network model according to the determined parameters of the hyper-network to obtain the image conversion model comprises the following steps:
selecting a first type of common image training sample and a second type of style image training sample;
inputting the common image training samples of the first type into the first generation network for generation processing, and outputting images of a second type;
respectively inputting the images of the second type into a first discrimination network to obtain the probability that the images of the second type belong to the style image training samples of the second type, wherein the probability is called cross entropy loss;
respectively inputting the images of the second type into a second generation network to obtain image loss, namely consistency loss, when the images of the second type pass through the second generation network;
calculating the sum of the consistency loss and the cross entropy loss, or performing weighted summation of the consistency loss and the cross entropy loss to obtain the loss sum;
calculating a feature map on a gradient domain according to the loss sum;
updating parameters of the hyper-network according to the feature map;
and performing iterative training on the channel corresponding to the updated parameters of the hyper-network until an image conversion model is generated.
Optionally, the iteratively training the channels corresponding to the updated parameters of the hyper-network until an image conversion model is generated includes:
when the parameter of the hyper-network is smaller than a set parameter threshold value, deleting a channel corresponding to the parameter smaller than the set parameter threshold value;
when the parameter of the hyper-network is not smaller than a set parameter threshold value, reserving a channel corresponding to the parameter not smaller than the set parameter threshold value;
and performing the next iterative training in the same manner until an image conversion model is generated.
According to a second aspect of the embodiments of the present disclosure, there is provided an image processing apparatus including:
an acquisition module configured to perform acquiring an image to be processed;
the segmentation module is configured to segment the object in the image to be processed to obtain a segmented object;
and a style conversion processing module configured to perform image style conversion processing on the segmented object according to a selected image style through an image conversion model to obtain a processed target object, wherein the image conversion model is obtained through iterative training in which the number of channels of the depth separable convolutional layer in the hyper-network is first increased, and the added channels that do not satisfy a preset condition are then deleted.
Optionally, the apparatus further comprises:
a determining module configured to perform generating a hyper network based on a depth separable convolutional layer of a convolutional neural network before the style conversion processing module performs image style conversion processing on the segmented object according to a selected image style through an image conversion model, and determining parameters of the hyper network;
a training module configured to perform training of a convolutional neural network model according to the determined parameters of the hyper-network, resulting in the image transformation model.
Optionally, the determining module includes:
a channel expansion module configured to perform expansion of the number C of input channels of the depth separable convolutional layer of the convolutional neural network to an integer multiple of 2, resulting in 2xNxC channels, where N is the number of convolution kernels;
a splitting module configured to perform splitting of the expanded depth separable convolutional layer into convolution units i, and assign a parameter g_i to each convolution unit i;
the convolution module is configured to perform convolution processing on each expanded input channel and a corresponding convolution kernel respectively to obtain a corresponding convolution result;
the product module is configured to multiply each convolution result by the parameter g_i of the corresponding convolution unit i to obtain a convolution parameter of the corresponding channel;
a combining module configured to perform combining the convolution parameters of all channels together as a parameter of the super-network.
Optionally, the convolutional neural network model includes: a first generation network having a depth separable convolutional layer, a first discrimination network, and a second generation network;
the training module comprises:
the selecting module is configured to select a first type of common image training sample and a second type of style image training sample;
a first image processing module configured to perform input of the common image training samples of the first type into the first generation network for generation processing, and output images of a second type;
the first loss determining module is configured to input the images of the second type into a first discrimination network respectively for judgment, so as to obtain a probability that the images of the second type belong to a style image training sample of the second type, wherein the probability is called cross entropy loss;
a second loss determining module configured to perform input of the images of the second type to a second generation network respectively, to obtain an image loss when the images of the second type pass through the second generation network, which is called a consistency loss;
the calculation module is configured to calculate the sum of the consistency loss and the cross entropy loss, or perform weighted summation of the consistency loss and the cross entropy loss to obtain the loss sum;
a feature map calculation module configured to perform a feature map calculation on a gradient domain according to the sum of the losses;
an update module configured to perform updating of parameters of the hyper-network according to the feature map;
and the iteration module is configured to perform iterative training on the channels corresponding to the updated parameters of the hyper-network until an image conversion model is generated.
Optionally, the iteration module includes:
the channel deleting module is configured to delete the channel corresponding to the parameter smaller than the set parameter threshold when the parameter of the hyper-network is smaller than the set parameter threshold;
a channel reservation module configured to execute, in response to a parameter of the super network not being less than a set parameter threshold, reserving a channel corresponding to the parameter not being less than the set parameter threshold;
and an iterative training module configured to repeatedly perform the next iterative training until the image conversion model is generated.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image processing method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, wherein instructions of the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform the image processing method as described above.
The technical solutions provided by the embodiments of the present disclosure have at least the following beneficial effects:
in the disclosure, an image to be processed is segmented to obtain a segmented object, and image style conversion processing is performed on the segmented object according to the selected image style through an image conversion model to obtain a processed target object, wherein the image conversion model is obtained through iterative training in which the number of channels of the depth separable convolution layer in the super network is first increased, and the added channels that do not satisfy a preset condition are then deleted. By increasing and then decreasing the number of channels of the depth separable convolution layer in the super network, the expression capability of the network is adjusted while the visual result remains acceptable, the amount of computation inside the model is reduced, the efficiency of image style conversion is improved, and the satisfaction of the user with image conversion is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
To facilitate an understanding of the disclosure, the following terms are explained before the disclosure is introduced.
The present disclosure splits and modifies the backbone structure of an existing model, turning its channels into multiple selectable channels from which the model can be chosen. Specifically, a conventional convolutional neural network (CNN) generation model is based on the conventional CONV2D operation and is computationally expensive. The present disclosure proposes replacing a traditional Convolutional Layer in an existing convolutional neural network generation model with a Depth Separable Convolutional Layer (DSCL), where the replacement may be a complete replacement, or some specific convolutional layers in the convolutional neural network may be replaced according to actual requirements.
Fig. 1 is a flowchart illustrating an image processing method according to an exemplary embodiment, as illustrated in fig. 1, the image processing method including the steps of:
in step 101, an image to be processed is acquired.
In step 102, segmenting an object in the image to be processed to obtain a segmented object;
in step 103, performing image style conversion processing on the segmented object according to a selected image style through an image conversion model to obtain a processed target object, wherein the image conversion model is obtained through iterative training in which the number of channels of the depth separable convolution layer in the super network is first increased, and the added channels that do not satisfy a preset condition are then deleted.
The image processing method disclosed by the present disclosure may be applied to a terminal, a server, and the like, and is not limited herein, and the terminal implementation device may be an electronic device such as a smart phone, a notebook computer, a tablet computer, and the like.
The following describes in detail specific implementation steps of an image processing method provided in an embodiment of the present disclosure with reference to fig. 1.
First, step 101 is executed to acquire an image to be processed.
In this step, the acquired image to be processed is an image that needs to be subjected to style conversion, and the image may be an image selected from an album, or an image that has just been shot, and the like. The present embodiment is not limited.
Next, step 102 is executed to segment the object in the image to be processed to obtain a segmented object.
In this step, face recognition detection is performed on an image to be processed, and if a face image or an avatar is detected, the detected face image or avatar is segmented, which may be generally segmented by a currently set size, for example, the face or avatar is segmented into 512 × 512 image blocks. The process of recognizing the image by the face recognition algorithm is well known to those skilled in the art, and will not be described herein.
In this disclosure, the image may be divided in a plurality of dividing manners, for example, an average dividing manner or a fixed dividing manner, and of course the image may also be divided by using image dividing software, which is not limited in this embodiment. In the average dividing manner, the image is split evenly according to a number of rows and columns, that is, the image is divided into several equal parts by rows and columns; in the fixed dividing manner, the image is divided into blocks of a designated pixel size. After the division is completed, image blocks in the required file format can be obtained.
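By way of a non-limiting illustration, the following Python sketch shows one possible implementation of the segmentation in step 102, assuming OpenCV's bundled Haar cascade face detector; the helper name segment_face, the cascade file, and the 512x512 block size are only examples and not part of the disclosure:

import cv2

def segment_face(image_path, block_size=512):
    # Load the image to be processed (the path is only an example).
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Detect faces with OpenCV's bundled Haar cascade detector.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    objects = []
    for (x, y, w, h) in faces:
        # Cut out the detected face region and scale it to the set size,
        # e.g. a 512x512 image block, as the segmented object.
        crop = image[y:y + h, x:x + w]
        block = cv2.resize(crop, (block_size, block_size))
        objects.append(((x, y, w, h), block))
    return objects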
Finally, step 103 is executed: image style conversion processing is performed on the segmented object according to the selected image style through the image conversion model to obtain a processed target object, wherein the image conversion model is obtained through iterative training in which the number of channels of the depth separable convolutional layer in the super network is first increased, and the added channels that do not satisfy a preset condition are then deleted.
The segmented object is input into the image conversion model, the image conversion model performs image style conversion processing according to the selected image style, and a processed target object of the same size is output, wherein the target object has the selected image style and belongs to a different type of image from the original object. The image conversion model is obtained by performing iterative training on the depth separable convolutional layer of the super network, that is, the model is obtained through iterative training in which the number of channels of the depth separable convolutional layer of the super network is first increased, and the added channels that do not satisfy a preset condition are then deleted. Failing to satisfy the preset condition may mean failing to satisfy the parameter threshold of the super network, or failing to satisfy the channel parameter threshold of a channel of the depth separable convolution layer.
For example, if the input object is a 512-by-512 human face image and the selected image style is a cartoon face, the face image is input into the image conversion model for image style conversion processing, and a 512-by-512 cartoon-face-style image is output.
It should be noted that the image transformation model in this embodiment is a model trained in advance, and the training process is described in detail below.
Further, the method may further include replacing the corresponding object in the image to be processed with the target object obtained after the processing to obtain a style image.
In this step, the target object obtained after the style conversion processing in step 103 replaces the corresponding object in the image to be processed, so as to obtain a complete style image. That is to say, the obtained complete style image is inherently related to the original image to be processed but belongs to a different type of style image; for example, a natural human face is processed by the image conversion model to generate a doll face or a cartoon face, or a sketch-style cartoon is processed by the image conversion model into an oil-painting style, and so on.
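As a sketch of step 103 together with the replacement described here, the following Python code (assuming a trained PyTorch image conversion model conversion_model and the hypothetical segment_face helper above) converts each segmented object and writes the processed target object back into the image to be processed:

import cv2
import torch

def stylize_and_replace(image, conversion_model, objects):
    # 'objects' are (bounding box, 512x512 block) pairs produced by segmentation.
    conversion_model.eval()
    for (x, y, w, h), block in objects:
        # HWC uint8 array -> NCHW float tensor in [0, 1].
        tensor = torch.from_numpy(block).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            styled = conversion_model(tensor)      # same size, selected image style
        styled = (styled.squeeze(0).permute(1, 2, 0).clamp(0, 1) * 255).byte().numpy()
        # Replace the corresponding object in the image to be processed with the
        # processed target object to obtain the complete style image.
        image[y:y + h, x:x + w] = cv2.resize(styled, (w, h))
    return image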
According to the method, an obtained image to be processed is segmented to obtain a segmented object, and the segmented object is subjected to image style conversion processing according to a selected image style through an image conversion model to obtain a processed target object, wherein the image conversion model is obtained through iterative training in which the number of channels of the depth separable convolutional layer in the hyper-network is first increased, and the added channels that do not satisfy a preset condition are then deleted; further, the corresponding object in the image to be processed can be replaced with the target object to obtain a complete style image. That is, the present disclosure increases and then decreases the number of channels of the depth separable convolutional layer in the super network in advance, which adjusts the expression capability of the network, reduces the amount of computation of the internal processing of the model while the visual result remains acceptable, and improves the efficiency of image style conversion and the satisfaction of the user with image conversion.
Optionally, in another embodiment, on the basis of the foregoing embodiment, before performing image-style conversion processing on the segmented object according to a selected image style by using an image conversion model, the method may further include:
1) generating a hyper-network based on a depth separable convolutional layer of a convolutional neural network, and determining parameters of the hyper-network;
the method specifically comprises the following steps: expanding the number C of input channels of a depth separable convolution layer of the super-convolution neural network into 2 integral multiples to obtain 2xNxC channels, wherein N is the number of convolution kernels; splitting the expanded depth-separable convolutional layer into convolutional units (each convolutional unit is composed of a filter) i, and assigning a parameter g _ i to each convolutional unit i; performing convolution processing on each expanded input channel and a corresponding convolution kernel respectively to obtain corresponding convolution results; multiplying each convolution result by the parameter g _ i of the corresponding convolution unit i to obtain the convolution parameter of the corresponding channel, wherein each convolution parameter is a tensor; the convolution parameters of all channels are combined together as the parameters of the super network. That is, the above process determines the parameters of the hyper-network as initial parameters.
It should be noted that, in the present disclosure, a channel of the depth separable convolutional layer has two different meanings. The first is for a sample image (an image used as a training sample), where the channel refers to a color channel of the sample image; the second is the dimensionality of the output space, for example, the number of output channels in a convolution operation, or the number of convolution kernels in each convolutional layer.
That is, this step defines a super network (supernet) structure based on the depth separable convolutional layer of the convolutional neural network. Specifically, the number of channels of a Depthwise Separable Convolutional Layer of the CNN network is first increased to an integral multiple of 2, that is, the numbers of input and output channels are multiplied by an integral multiple of 2. Assuming that the original channel number of this layer is C, it is amplified to 2xNxC, where N is the number of convolution kernels, and each NxC group is represented by, for example, (3x3 kernel, 5x5 kernel) or (3x3 kernel, 7x7 kernel). This makes it possible to create a super network with sufficient expressive power.
Next, each depthwise separable convolutional layer is split into a plurality of small units (i.e., convolution units), denoted by i, and each convolution unit i is assigned a parameter g_i. The splitting is shown schematically in fig. 2, where it is assumed that there are three convolution units, i.e., the parameters assigned to the convolution units are g_1, g_2, and g_3; in practical application, however, the disclosure is not limited thereto. Whereas the parameters of the small units of a conventional depthwise separable convolution layer are all 1, i.e., g_i is 1, in the present disclosure each of the split convolution units i is assigned its own parameter g_i, where i may take the values 1, 2, 3, and so on. In the depthwise convolution, each channel is independently convolved with a 3x3 kernel, that is, C_out_i = C_in_i * (3x3 kernel), and a positive parameter g_i is then multiplied by C_out_i; in other words, each convolution result is multiplied by the assigned parameter g_i of the corresponding convolution unit to obtain the convolution parameter of the corresponding channel, where each convolution parameter is a tensor. A tensor consists of a set of primitive values forming an array of arbitrary dimensions; the order of the tensor is its number of dimensions, and its shape is an integer tuple specifying the length of each dimension of the array. Finally, the convolution parameters of all the channels are combined together as the output of the parameters of the super network. With the above definition of the present disclosure, not only can the parameters of the hyper-network themselves be learned, but the parameter g_i corresponding to each kernel can also be learned.
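One possible realization of such a layer is sketched below in PyTorch; the class name, the use of two kernel sizes (3x3 and 5x5), the 1x1 expansion and combination convolutions, and the relu used to keep g_i positive are assumptions of this sketch rather than requirements of the disclosure:

import torch
import torch.nn as nn

class GatedDepthwiseSeparableLayer(nn.Module):
    """Depth separable convolution layer of the super network with per-unit gates g_i."""

    def __init__(self, in_channels, kernel_sizes=(3, 5)):
        super().__init__()
        n = len(kernel_sizes)                       # N: number of convolution kernel types
        self.expand = nn.Conv2d(in_channels, 2 * n * in_channels, kernel_size=1)
        # One convolution unit per kernel size; each unit convolves its channels depthwise.
        self.units = nn.ModuleList([
            nn.Conv2d(2 * in_channels, 2 * in_channels, k, padding=k // 2,
                      groups=2 * in_channels, bias=False)
            for k in kernel_sizes
        ])
        # Learnable parameter g_i assigned to each convolution unit i
        # (initialized to 1, matching a conventional separable convolution).
        self.g = nn.Parameter(torch.ones(n))
        self.pointwise = nn.Conv2d(2 * n * in_channels, in_channels, kernel_size=1)

    def forward(self, x):
        expanded = self.expand(x)                   # C -> 2 x N x C channels
        chunks = torch.chunk(expanded, len(self.units), dim=1)
        outputs = []
        for i, (unit, chunk) in enumerate(zip(self.units, chunks)):
            # Convolve each expanded channel with its kernel, then multiply by g_i
            # (relu keeps the gate positive in this sketch).
            outputs.append(unit(chunk) * torch.relu(self.g[i]))
        # Combine the gated results of all channels as the layer output.
        return self.pointwise(torch.cat(outputs, dim=1))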
2) And training a convolutional neural network model according to the determined hyper-network parameters to obtain the image conversion model.
In this step, the convolutional neural network model may include a first generation network having a depth separable convolutional layer, a first discrimination network, and a second generation network, and the training of the convolutional neural network model according to the determined parameters of the hyper-network to obtain the image conversion model includes the following steps:
selecting a first type of common image training sample and a second type of style image training sample; inputting the common image training sample of the first type into the first generation network for generation processing, and outputting an image of a second type; then inputting the image of the second type into the first discrimination network to obtain the probability that the image of the second type belongs to the style image training sample of the second type, wherein the probability is called the cross entropy loss; inputting the image of the second type into the second generation network to obtain the image loss when the image of the second type passes through the second generation network, which is called the consistency loss; calculating the sum of the consistency loss and the cross entropy loss, or performing weighted summation of the consistency loss and the cross entropy loss to obtain the loss sum; finally, calculating a feature map on the gradient domain according to the loss sum, updating the parameters of the hyper-network according to the feature map, and performing iterative training on the channels corresponding to the updated parameters of the hyper-network until an image conversion model is generated.
That is, the common image training sample of the first type is selected and input into the first generation network for generation processing, and an image of a second type is output; the image of the second type is input into the first discrimination network and the second generation network respectively; the first discrimination network performs similarity judgment in the image domain and outputs the probability that the generated image of the second type belongs to the style image training sample of the second type, and this probability is called the cross entropy loss. The second generation network performs style conversion processing and outputs a style image converted back from the image of the second type, and the image loss incurred when the image of the second type passes through the second generation network is called the consistency loss. Then, the sum of the consistency loss and the cross entropy loss is calculated, or a weighted summation of the consistency loss and the cross entropy loss is performed to obtain the loss sum; a feature map on the gradient domain is calculated according to the loss sum, the parameters of the hyper-network are updated according to the feature map, and iterative training is performed on the channels corresponding to the updated parameters of the hyper-network until an image conversion model is generated.
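A minimal PyTorch sketch of one training step in this first-type to second-type direction is given below; the network handles g_a2b, g_b2a and d_a2b, the sigmoid-probability output assumed for the discriminator, the L1-style consistency loss, and the weights alpha and beta are assumptions of the sketch, not a definitive implementation of the disclosure:

import torch
import torch.nn.functional as F

def train_step_a2b(image_a, g_a2b, g_b2a, d_a2b, optimizer, alpha=1.0, beta=1.0):
    # First generation network converts the first-type image into a second-type image.
    image_a2b = g_a2b(image_a)

    # First discrimination network outputs the probability that image_a2b belongs to
    # the second-type style images; its loss is the cross entropy loss.
    prob_b = d_a2b(image_a2b)
    cross_entropy_loss = F.binary_cross_entropy(prob_b, torch.ones_like(prob_b))

    # Second generation network converts back; an L1-type difference between the
    # reconstruction and the original image serves as the consistency loss.
    image_a2b2a = g_b2a(image_a2b)
    consistency_loss = torch.abs(image_a - image_a2b2a).mean()

    # Sum (or weighted sum) of the two losses; the gradients update the super network
    # parameters, including the gates g_i of the depth separable convolution layers.
    loss = alpha * consistency_loss + beta * cross_entropy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()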
Similarly, the style image training sample of the second type is input into the second generation network for generation processing, and a style image of a first type is output; the style image of the first type is input into the second discrimination network and the first generation network respectively; the second discrimination network judges the image type similarity and outputs the probability that the generated style image of the first type belongs to the first type of style image; the first generation network performs style conversion processing and outputs an image converted back from the image of the first type. The sum of the cross entropy loss obtained when the generated image of the first type is judged and the consistency loss obtained when the image of the first type passes through the first generation network is calculated, or a weighted summation of the consistency loss and the cross entropy loss is performed to obtain the loss sum; a feature map on the gradient domain is calculated according to the loss sum; and the determined parameters of the hyper-network are updated according to the feature map, and iterative training is performed on the channels corresponding to the updated parameters of the hyper-network until an image conversion model is generated.
The iteratively training the channels corresponding to the updated parameters of the hyper-network until an image conversion model is generated includes: when a parameter of the hyper-network is smaller than a set parameter threshold, deleting the channel corresponding to the parameter smaller than the set parameter threshold; or, when a parameter of the hyper-network is not smaller than the set parameter threshold, reserving the channel corresponding to the parameter not smaller than the set parameter threshold; and performing the next iterative training in the same manner until an image conversion model is generated.
In the present disclosure, the parameters of the super network may include the convolution parameters of all channels of the depth separable convolution layer, and of course may also include the weights and the previously defined parameter g_i of each convolution unit; the present embodiment is not limited in this respect.
The set parameter threshold, i.e., λ (lambda), may take a different value in each round of training; its initial value is preset, it may also be a value set in subsequent iterative training, and of course it may also be chosen as an intermediate value of the parameters corresponding to all channels. For example, whether the convolution parameter of a channel is smaller than lambda is judged, and the channels whose convolution parameters are smaller than lambda are deleted, so that unimportant branches are pruned; the next forward pass of the data is then performed, the parameters of the super network are updated, and the next iteration of training is carried out.
That is, whether an updated parameter of the hyper-network is smaller than the set parameter threshold is judged; if so, the channel corresponding to the parameter smaller than the set parameter threshold is deleted; if not, the channel corresponding to the parameter not less than the set parameter threshold is reserved, and the next iterative training is performed until the image conversion model is generated.
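The pruning rule can be sketched as follows in Python, assuming the gate parameters g_i of the hypothetical layer shown earlier; the threshold value 0.05 is only an example of lambda:

def select_channels(gates, lam):
    """Split channel indices into kept and deleted sets by the threshold lambda."""
    kept = [i for i, g_i in enumerate(gates) if g_i >= lam]
    deleted = [i for i, g_i in enumerate(gates) if g_i < lam]
    return kept, deleted

# Hypothetical usage inside the iterative training loop:
# gates = layer.g.detach().tolist()            # gate parameters g_i of one layer
# kept, deleted = select_channels(gates, lam=0.05)
# The convolution units listed in 'deleted' are removed from the layer, those in
# 'kept' are retained, and the next round of training is performed until the
# image conversion model is generated.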
In the present disclosure, the convolutional neural network model includes at least four networks, namely two generation networks and two discrimination networks. The first type includes a generation network G_a2b (Generator_a2b_super), referred to in the present disclosure as the first generation network, and a discrimination network D_a2b (Discriminator_a2b_super), referred to in the present disclosure as the first discrimination network; the second type includes a generation network G_b2a (Generator_b2a_super), referred to herein as the second generation network, and a discrimination network D_b2a (Discriminator_b2a_super), referred to herein as the second discrimination network.
Here, G_a2b is a network that converts an image of a first type into an image of a second type; D_a2b is a discrimination network for judging the probability that the image of the second type obtained by the conversion of G_a2b belongs to the second type; G_b2a is a network that converts an image of the second type into an image of the first type; and D_b2a is a discrimination network for judging the probability that the image of the first type obtained by the conversion of G_b2a belongs to the first type. That is, during training, the input consists of single-frame images of the two types, each paired with an image of the other type that is inherently related to the original image. It should be noted that the first generation network is also referred to as the first generator, the first discrimination network as the first discriminator, the second generation network as the second generator, and the second discrimination network as the second discriminator. The specific training process is shown in fig. 3 and fig. 4; fig. 3 is a schematic diagram of training for converting an image of the first type into an image of the second type according to the present disclosure, and fig. 4 is a schematic diagram of training for converting an image of the second type into an image of the first type according to the present disclosure.
As shown in fig. 3, first, an image of each type is selected; here, an image of type A (domain A) is selected, referred to as image A (Input image_A).
Then, Input image_A is input into the first generation network (G_a2b, Generator_a2b_super), the first generation network performs conversion processing on image A and outputs an image A2B (image_A2B); image_A2B is then input into the first discriminator (D_a2b, Discriminator_a2b_super) and the second generator (G_b2a, Generator_b2a_super) respectively, and the first discriminator judges whether image_A2B belongs to a picture of type B (domain B) and outputs the judged probability. The first discriminator is a two-class classification network whose output is the probability that the image belongs to domain B, and when the judgment is made, a corresponding cross entropy loss is generated, which is the loss of the two-class classification problem.
Meanwhile, G_b2a converts the input image_A2B into an image A2B2A (image_A2B2A) according to the selected style, the intention being that image_A and image_A2B2A are expected to be very similar although they are produced through different styles. At this time a loss, i.e., a consistency loss, is generated; generally, this loss is an L1 norm or an L2 norm, where the L1 norm is the loss over the matrix image_A - image_A2B2A, i.e., the sum of the absolute values of its elements; for example, the L1 norm of (1, -1) is 1 + 1 = 2. The total loss of the network is then the sum of the two losses, or a weighted summation of the consistency loss and the cross entropy loss is performed to obtain the loss sum; a feature map on the gradient domain is then calculated according to the loss sum, and the determined parameters of the super network are updated according to the feature map. The method for calculating the loss sum and the feature map on the gradient domain from the loss sum is well known to those skilled in the art and is not described again here.
Similarly, as shown in fig. 4, an image of type B (domain B) is selected, referred to as image B (Input image_B); image_B is input into the second generator (G_b2a), the second generator performs conversion processing on image B and outputs an image B2A (image_B2A); image_B2A is then input into the second discriminator (D_b2a) and the first generator (G_a2b) respectively, and the second discriminator judges the probability that image_B2A belongs to a picture of type A (domain A) and outputs the judged probability. The second discriminator is also a two-class classification network whose output is the probability that the image belongs to domain A, and when the judgment is made, a cross entropy loss corresponding to the two-class classification problem is generated.
At the same time, G_a2b converts the input image_B2A into an image B2A2B (image_B2A2B) according to the selected style, the goal being that image_B and image_B2A2B are expected to be very similar although they are produced through different styles.
That is, the convolutional neural network model involves two types of generative models and two types of discriminant models. A generative model learns the joint probability distribution P(X, Y) of samples and labels from the observed data; the trained model can generate new data that follows the distribution of the samples and can be used for both supervised and unsupervised learning. A discriminant model treats the problem as a binary classification problem and finds the decision boundary between the target and the background; regardless of how the object is described, as long as the model knows how the object differs from the background, it can determine, for an input image, on which side of the boundary the image lies and hence which type, that is, which style, it belongs to.
In the method, the object in the image to be processed is first segmented, and the segmented object is then input into the image style conversion model trained according to the parameters of the hyper-network and converted according to the selected image style, so as to generate a target object with the selected style. In this way, the amount of computation inside the model is reduced while the visual result remains acceptable, the efficiency of image style conversion is improved, and the satisfaction of the user is also improved.
It is noted that, for simplicity of explanation, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present disclosure is not limited by the order of acts described, as some steps may, in accordance with the present disclosure, occur in other orders and/or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required in order to implement the disclosure.
Fig. 5 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment. Referring to fig. 5, the apparatus includes an acquisition module 501, a segmentation module 502, and a style conversion processing module 503, wherein,
the acquiring module 501 is configured to perform acquiring an image to be processed;
the segmentation module 502 is configured to perform segmentation on an object in the image to be processed to obtain a segmented object;
the style conversion processing module 503 is configured to perform image style conversion processing on the segmented object according to a selected image style through an image conversion model to obtain a processed target object, where the image conversion model is obtained through iterative training in a manner that the number of channels of the depth separable convolution layer in the super-network is increased first, and then the channels which do not satisfy a preset condition after the increase are deleted.
Optionally, in another embodiment, on the basis of the above embodiment, the apparatus further includes a determining module 601 and a training module 602, whose structure is schematically shown in fig. 6, wherein,
the determining module 601 is configured to perform generating a hyper network based on the depth separable convolutional layer of the convolutional neural network before the style conversion processing module 503 performs the image style conversion processing on the segmented object according to the selected image style through the image conversion model, and determining parameters of the hyper network;
the training module 602 is configured to perform training of a convolutional neural network model according to the determined parameters of the hyper-network, resulting in the image transformation model.
Optionally, in another embodiment, on the basis of the foregoing embodiment, the determining module 601 includes: a channel expansion module 701, a splitting module 702, a convolution module 703, a product module 704 and a combining module 705, which are schematically shown in fig. 7, wherein,
the channel expansion module 701 is configured to perform expansion of the number C of input channels of the depth separable convolutional layer of the convolutional neural network to an integer multiple of 2, resulting in 2xNxC channels, where N is the number of convolution kernels;
the splitting module 702 is configured to perform splitting of the expanded depth separable convolutional layer into convolution units i, and assign a parameter g_i to each convolution unit i;
the convolution module 703 is configured to perform convolution processing on each expanded input channel with a corresponding convolution kernel, respectively, to obtain a corresponding convolution result;
the product module 704 is configured to perform multiplication of each convolution result by the parameter g_i of the corresponding convolution unit i to obtain the convolution parameter of the corresponding channel;
the combining module 705 is configured to perform the combination of the convolution parameters of all channels together as parameters of the super network.
Optionally, in another embodiment, on the basis of the above embodiment, the convolutional neural network model includes: a first generation network having a depth separable convolutional layer, a second generation network, and a first discrimination network;
the training module 602 includes: a selecting module 801, a first image processing module 802, a first loss determining module 803, a second loss determining module 804, a calculating module 805, a feature map calculating module 806, an updating module 807 and an iteration module 808, which are schematically shown in fig. 8, wherein,
the selecting module 801 is configured to perform selecting a first type of common image training sample and a second type of stylistic image training sample;
the first image processing module 802 is configured to perform input of the first type of ordinary image training samples into the first generation network for generation processing, and output a second type of images;
the first loss determination 803 is configured to input the images of the second type into a first discriminant network respectively for judgment, so as to obtain a probability that the images of the second type belong to a second-type style image training sample, where the probability is referred to as cross entropy loss;
the second loss determining module 804 is configured to perform input of the images of the second type to a second generation network respectively, and obtain an image loss when the images of the second type pass through the second generation network, which is called a consistency loss;
the calculating module 805 is configured to perform calculation on the sum of the loss of the consistency loss and the loss of the cross entropy loss, or perform weighted summation on the loss of the consistency loss and the loss of the cross entropy loss to obtain the sum of the losses;
the feature map calculation 806 configured to perform a feature map calculation on a gradient domain based on the sum of the losses;
the updating module 807 configured to perform updating parameters of the hyper-network according to the profile;
the iteration module 808 is configured to perform iterative training on the channels corresponding to the updated parameters of the hyper-network until an image transformation model is generated.
Optionally, in another embodiment, on the basis of the above embodiment, the iteration module includes: a channel deletion module and a channel reservation module, wherein,
the channel deleting module is configured to delete a channel corresponding to a parameter smaller than a set parameter threshold when the parameter of the hyper-network is smaller than the set parameter threshold;
the channel reservation module is configured to execute the step of reserving a channel corresponding to the parameter not less than the set parameter threshold in response to the parameter not less than the set parameter threshold of the hyper-network;
and an iterative training module configured to repeatedly perform the next iterative training until the image conversion model is generated.
The present disclosure also provides an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image processing method as described above.
The present disclosure also provides a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the image processing method as described above.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and reference may be made to part of the description of the embodiment of the method for the relevant points, and the detailed description will not be made here.
In an exemplary embodiment, a storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an electronic device to perform the above method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 9 is a block diagram illustrating an electronic device 900 in accordance with an example embodiment. For example, the electronic device 900 may be a mobile terminal or a server, and in the embodiment of the present disclosure, the electronic device is taken as a mobile terminal for example. For example, the electronic device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, electronic device 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operation of the electronic device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on the electronic device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the electronic device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 900.
The multimedia components 908 include a screen that provides an output interface between the electronic device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status evaluations of various aspects of the electronic device 900. For example, sensor assembly 914 may detect an open/closed state of device 900, the relative positioning of components, such as a display and keypad of electronic device 900, sensor assembly 914 may also detect a change in the position of electronic device 900 or a component of electronic device 900, the presence or absence of user contact with electronic device 900, orientation or acceleration/deceleration of electronic device 900, and a change in the temperature of electronic device 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the electronic device 900 and other devices. The electronic device 900 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the image processing methods shown above.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the electronic device 900 to perform the image processing method illustrated above, is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product, wherein the instructions of the computer program product, when executed by the processor 920 of the electronic device 900, cause the electronic device 900 to perform the image processing method illustrated above.
Fig. 10 is a block diagram illustrating an apparatus 1000 for image processing according to an example embodiment. For example, the apparatus 1000 may be provided as a server. Referring to fig. 10, the apparatus 1000 includes a processing component 1022 that further includes one or more processors and memory resources, represented by memory 1032, for storing instructions, such as application programs, that are executable by the processing component 1022. The application programs stored in memory 1032 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1022 is configured to execute instructions to perform the image processing methods described above.
The device 1000 may also include a power supply component 1026 configured to perform power management for the device 1000, a wired or wireless network interface 1050 configured to connect the device 1000 to a network, and an input/output (I/O) interface 1058. The device 1000 may operate based on an operating system stored in the memory 1032, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.