WO2025218588A1 - Image editing method and apparatus, device, and storage medium - Google Patents

Image editing method and apparatus, device, and storage medium

Info

Publication number
WO2025218588A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
edited
noise
sample
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2025/088422
Other languages
French (fr)
Chinese (zh)
Inventor
王熊辉
任玉羲
吴捷
王诗吟
王一同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Publication of WO2025218588A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 — 2D [Two Dimensional] image generation
    • G06T11/60 — Editing figures and text; Combining figures or text
    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods
    • G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 — Image enhancement or restoration
    • G06T5/60 — Image enhancement or restoration using machine learning, e.g. neural networks
    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 — Image enhancement or restoration
    • G06T5/70 — Denoising; Smoothing

Definitions

  • the embodiments of the present disclosure relate to an image editing method, apparatus, device, and storage medium.
  • the present disclosure provides an image editing method, apparatus, device, and storage medium, which can accurately edit a specified area, improve image-text matching, and improve image generation quality.
  • an embodiment of the present disclosure provides an image editing method, comprising:
  • the image editing model performs noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word, and outputs a target image according to the noise prediction result and the image generation result.
  • an embodiment of the present disclosure further provides an image editing device, the device comprising:
  • An acquisition module configured to acquire a region to be edited in an original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe an expected effect of image editing;
  • a noise adding module configured to determine a target mask image according to the area to be edited, and add preset noise to the area to be edited based on the target mask image by an image editing model to obtain a local noise image
  • An image generation module is used to perform noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word through the image editing model, and output a target image according to the noise prediction result and the image generation result.
  • an embodiment of the present disclosure further provides an electronic device, the electronic device comprising:
  • one or more processors;
  • a storage device for storing one or more programs
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the image editing method as described in any embodiment of the present disclosure.
  • an embodiment of the present disclosure further provides a storage medium comprising computer-executable instructions, which, when executed by a computer processor, are used to execute the image editing method as described in any embodiment of the present disclosure.
  • FIG1 is a schematic flow chart of an image editing method provided by an embodiment of the present disclosure.
  • FIG2 is a schematic diagram of an editing interface provided by an embodiment of the present disclosure.
  • FIG3 is a flow chart of another image editing method provided by an embodiment of the present disclosure.
  • FIG4 is a schematic diagram of training an image editing model provided by an embodiment of the present disclosure.
  • FIG5 is a schematic structural diagram of an image editing device provided by an embodiment of the present disclosure.
  • FIG6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the term “including” and its variations are open-ended, i.e., “including but not limited to.”
  • the term “based on” means “based, at least in part, on.”
  • the term “one embodiment” means “at least one embodiment,” the term “another embodiment” means “at least one additional embodiment,” and the term “some embodiments” means “at least some embodiments.”
  • Other terms are defined in the following description.
  • a prompt message is sent to the user to clearly inform the user that the operation requested will require the acquisition and use of the user's personal information. This allows the user to independently choose whether to provide personal information to the electronic device, application, server, storage medium, or other software or hardware that performs the operations of the disclosed technical solution based on the prompt message.
  • in response to receiving a user's active request, the prompt information may be sent to the user in the form of a pop-up window, in which the prompt information may be presented in text form.
  • the pop-up window may also contain a selection control for the user to select "agree” or “disagree” to provide personal information to the electronic device.
  • FIG1 is a flow chart of an image editing method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is applicable to the situation of local editing.
  • the user inputs an area selection operation for the original image to determine the area to be edited in the original image, and inputs a target prompt word for the area to be edited.
  • the image editing model generates a target object in the area to be edited based on the original image, the area to be edited, and the target prompt word, obtains a target image with the target object, and outputs it.
  • the target object represents an object accurately generated in the area to be edited based on the target prompt word.
  • the method can be executed by an image editing device, which can be implemented in the form of software and/or hardware.
  • it can be implemented by an electronic device, which can be a mobile terminal, a PC, or a server, etc.
  • the method includes:
  • S110 Acquire a region to be edited in the original image and a target prompt word corresponding to the region to be edited.
  • the original image represents a picture or video frame of a video to be edited.
  • the original image may include a picture or video frame stored in an electronic device.
  • the original image may include a picture or video frame downloaded from a server.
  • the original image may also include a picture or video frame taken by a user.
  • when image editing operations are required for a local area, the original image may be displayed through an editing interface.
  • the editing interface may be an interactive interface that is triggered and displayed by a local editing control in the client.
  • the client may include an application, a mini-program, or a web client, etc.
  • the local editing control is used to trigger the display of the editing interface.
  • the area to be edited represents the image area where image editing operations are to be performed.
  • the area to be edited can be determined by performing an area selection operation on the original image in the editing interface.
  • Area selection operations include smearing, framing, or sliding operations.
  • a smearing operation on the original image in the editing interface is obtained, and the area to be edited is determined based on the smeared area.
  • a framing operation on the original image in the editing interface is obtained, and the area to be edited is determined based on the framed area.
  • a sliding operation on the original image in the editing interface is obtained, and the area to be edited is determined based on the sliding trajectory.
  • the target prompt word is text information used to describe the expected effect of image editing.
  • the target prompt word represents a descriptive text of the target object to be generated in the area to be edited.
  • the target prompt word may be "a car", indicating that the expected effect of image editing is to replace the object in the area to be edited with a car.
  • an original image is displayed in an editing interface; an area selection operation for the original image is obtained, and an area to be edited in the original image is determined based on the area selection operation; a text input operation for the area to be edited is obtained, and a target prompt word corresponding to the area to be edited is determined based on the text input operation.
  • FIG2 is a schematic diagram of an editing interface provided by an embodiment of the present disclosure.
  • the editing interface 200 includes an editing area 210 and a result display area 220, etc.
  • the editing area 210 includes an image loading control 230 and a prompt word input control 240.
  • the original image 250 is obtained, and the original image 250 is displayed at a position corresponding to the image loading control 230.
  • the smeared area in the original image 250 is determined as the area to be edited.
  • the target prompt word corresponding to the area to be edited is obtained.
  • the prompt word input control 240 includes a text box control.
  • the target mask is used to mask the non-to-be-edited area in the original image, prompting the image editing model to perform image editing on the to-be-edited area. For example, an original mask with the same size as the original image is generated, the target object in the to-be-edited area of the original image is identified, and the to-be-edited area in the original mask is determined based on the coordinates of the target object.
  • the to-be-edited area in the original mask is filled with white, and the non-to-be-edited area is filled with black to obtain the target mask.
  • the target object in the to-be-edited area of the original image can be dilated to expand the boundary of the target object, and then the target mask is determined based on the dilated target object.
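The mask-building and dilation steps above can be sketched in plain NumPy. This is an illustration, not the disclosure's implementation: `make_target_mask`, its arguments, and the square structuring element used for dilation are all assumptions for the example.

```python
import numpy as np

def make_target_mask(height, width, region_coords, dilate=1):
    """Build a binary target mask: 1 (white) inside the area to be
    edited, 0 (black) elsewhere, then dilate to expand the boundary."""
    mask = np.zeros((height, width), dtype=np.uint8)
    rows, cols = zip(*region_coords)
    mask[list(rows), list(cols)] = 1
    # Naive square dilation: a pixel becomes 1 if any neighbour within
    # `dilate` steps (Chebyshev distance) is 1.
    if dilate > 0:
        padded = np.pad(mask, dilate)
        out = np.zeros_like(mask)
        for dy in range(-dilate, dilate + 1):
            for dx in range(-dilate, dilate + 1):
                out |= padded[dilate + dy : dilate + dy + height,
                              dilate + dx : dilate + dx + width]
        mask = out
    return mask
```

In practice a library routine such as a morphological dilation with an elliptical structuring element would likely be used instead of this hand-rolled loop.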
  • the image editing model may be a diffusion model trained based on a sample atlas, a sample mask atlas, and a descriptive text set.
  • the sample mask atlas is determined based on a sample dilated image corresponding to a sample image in the sample atlas.
  • the sample dilated image represents an image after a dilation operation is performed on a target region corresponding to the sample image.
  • the descriptive text set is determined based on the image content of the target region.
  • the target region may represent a semantically significant target object segmented in the sample image using a panoptic segmentation model.
  • the diffusion model includes an encoder, a noise prediction network, and a decoder.
  • the encoder input includes the original image, which is used to compress the original image into a low-dimensional space to obtain a latent feature map.
  • the decoder is used to restore the low-dimensional image, after completing the image editing task, to the original image size, obtaining the target image.
  • the noise prediction network is used to add noise to the to-be-edited region in the latent feature map, subject to the constraints of the target mask map, while preserving the latent features in the non-to-be-edited regions, to obtain a local noise image.
  • the noise prediction network can be a Unet network, comprising a convolutional layer, a downsampling layer, a downsampling layer based on a multi-head attention mechanism, an intermediate network, an upsampling layer based on a multi-head attention mechanism, and an upsampling layer.
  • the downsampling layer includes multiple residual modules.
  • the downsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one downsampling layer based on the multi-head attention mechanism.
  • different downsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules.
  • the intermediate network includes residual modules and Transformer modules.
  • the upsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one upsampling layer based on the multi-head attention mechanism. If there are multiple upsampling layers based on the multi-head attention mechanism, different upsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules.
  • the upsampling layer includes multiple residual modules.
  • the preset noise can be noise whose distribution properties in the noise feature map meet preset requirements.
  • the preset noise can be random noise that meets Gaussian distribution, etc.
  • the noise feature map can be a noise image with the same size as the original image.
  • the noise feature area can be determined by superimposing the target mask map and the noise feature map.
  • the noise feature area represents a specific area in the noise feature map that corresponds to the area to be edited. Since the area to be edited in the target mask map is white and the remaining areas are black, superimposing the target mask map on the noise feature map can cover the non-to-edit area in the noise feature map, thereby determining the noise feature area corresponding to the area to be edited.
  • the local noise image represents the image obtained by adding a preset noise to the region to be edited in the original image. Since only the region to be edited exhibits noise after the noise is added, the noisy image is considered the local noise image.
  • the original image is compressed into a low-dimensional latent space using an encoder to obtain a latent feature map of the original image.
  • the preset noise is then added to the region to be edited in the latent feature map to obtain the local noise image.
  • a target mask map is determined according to the area to be edited, and preset noise is added to the area to be edited based on the target mask map through an image editing model to obtain a local noise image, including: determining the target mask map according to the foreground content in the area to be edited; generating a potential feature map of the original image through the image editing model, and determining the area to be edited in the potential feature map by combining the target mask map and the potential feature map; performing a set number of noise adding operations based on the area to be edited in the potential feature map to obtain a local noise image.
  • the foreground content of the smeared area in the original image is identified to obtain the target object.
  • a target mask image is generated based on the target object in the original image.
  • the original image is compressed by the encoder in the image editing model to obtain a potential feature map of the original image. Since the area to be edited in the target mask image is white and the remaining areas are black, superimposing the target mask image on the potential feature map can cover the non-to-be-edited area in the potential feature map, thereby determining the area to be edited in the potential feature map.
  • the image editing model performs a set number of noise addition operations based on the area to be edited in the potential feature map to obtain a local noise image.
  • the latent feature map is represented as z0.
  • the image editing model determines the region to be edited within z0 under the constraints of the target mask map. Random noise is then added to the region to be edited within z0 to obtain a local noise map z1.
  • the image editing model then adds random noise to the region to be edited within the local noise map z1 to obtain a local noise map z2.
  • the noise addition steps are iteratively performed until a local noise map zT is obtained.
  • performing a set number of noise adding operations based on the area to be edited in the potential feature map to obtain a local noise image includes: obtaining a noise feature map, wherein the noise feature map represents a preset noise; determining a noise feature area in combination with the target mask map and the noise feature map; performing a set number of noise adding operations based on the area to be edited and the noise feature area to obtain a local noise image.
  • the latent feature map is represented as z0
  • the noise feature map is represented as S. z0, S, and the target mask map are input into the Unet network of the image editing model.
  • the Unet network determines the region to be edited in z0 under the constraints of the target mask map.
  • the Unet network also determines the noise feature region under the constraints of the target mask map.
  • the noise features in the noise feature region are superimposed on the pixel features in the region to be edited in z0 to obtain the local noise map z1.
  • random noise is added to the region to be edited in the local noise map z1 through the Unet network to obtain the local noise map z2.
  • the noise addition step is iterated until the local noise map zT is obtained.
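The iterative masked noising z0 → z1 → … → zT described above can be sketched as follows. This is a toy illustration: `masked_noising`, its signature, and the unscheduled unit-variance Gaussian noise are assumptions, whereas a real diffusion model would scale noise according to its timestep schedule.

```python
import numpy as np

def masked_noising(z0, mask, steps, seed=0):
    """Add Gaussian noise only where mask == 1 (the area to be edited),
    leaving latent features outside the region untouched."""
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(steps):  # z0 -> z1 -> ... -> zT
        noise = rng.standard_normal(z.shape)
        z = np.where(mask == 1, z + noise, z)  # illustrative, unscheduled
    return z
```

The key property matching the description is that latents in the non-to-be-edited region pass through all T steps unchanged.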
  • the image editing model determines predicted noise based on image features in the to-be-edited area of the local noise image; the image editing model generates text features based on the target prompt word, and determines the correlation between the text features and the local noise image; a local edited image is generated based on the correlation and the local noise image, and a denoising operation is performed on the local edited image based on the predicted noise, and a target image is output.
  • the image editing model includes a text mapping module for mapping the target prompt word into a text vector as a text feature.
  • after denoising and image generation processing, the local noise image is input into the decoder of the image editing model, which decompresses it to restore the original image size and obtain the target image.
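The final composition, in which only the edited region changes while everything else is copied from the original, can be sketched as a mask-guided blend. The `compose_target` helper is an assumption for illustration; in the disclosure this behaviour follows implicitly from the mask constraint during denoising and decoding.

```python
import numpy as np

def compose_target(original, generated, mask):
    """Keep original pixels where mask == 0; take generated content
    where mask == 1 (the area to be edited)."""
    if original.ndim == 3 and mask.ndim == 2:
        mask = mask[..., None]  # broadcast over colour channels
    return np.where(mask == 1, generated, original)
```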
  • the method further includes: obtaining mask setting information, the mask setting information including a blur level; determining predicted noise based on image features in the to-be-edited region of the local noise image using the image editing model; generating text features based on the target prompt word using the image editing model, and determining the correlation between the text features and the local noise image; determining the degree to which the generated object fills the to-be-edited region based on the blur level; generating a local edited image based on the correlation, the filling level, and the local noise image, wherein the to-be-edited region in the local edited image includes the generated object; and performing a denoising operation on the local edited image based on the predicted noise to output a target image.
  • the fill level represents the distribution of the generated object within the area to be edited. At different blur levels, the generated object fills the area to varying degrees. With a high fill level, the generated object is positioned close to the edge of the area to be edited. With a low fill level, the generated object is further away from the edge of the area to be edited.
  • the editing interface also includes a mask setting control for inputting mask setting information. If a trigger operation for the mask setting control is detected, the mask setting information is obtained and parsed to obtain a blur level. After generating text features based on the target prompt word using the image editing model and determining the correlation between the text features and the local noise image, the degree to which the generated object fills the to-be-edited area is determined based on the blur level. A local edited image is generated based on the correlation, the filling level, and the local noise image, wherein the to-be-edited area in the local edited image includes the generated object. A denoising operation is then performed on the local edited image based on the predicted noise, and a target image is output.
  • the technical solution of the embodiment of the present disclosure accurately describes the expected editing effect of the area to be edited by the target prompt word for the area to be edited, determines the target mask map based on the area to be edited, adds preset noise to the area to be edited based on the target mask map through the image editing model, and obtains a local noise image. Then, the area to be edited of the local noise image is denoised and processed by the image editing model based on the target prompt word to generate a target object that meets the expected editing effect in the area to be edited, thereby achieving precise editing of the designated area and improving the image-text matching and image generation quality.
  • the embodiment of the present disclosure solves the problem that the image editing method in the related art cannot accurately modify and edit local areas, and improves the consistency of image and text and the image generation quality.
  • Figure 3 is a flow chart of another image editing method provided by an embodiment of the present disclosure.
  • the embodiment of the present disclosure is applicable to the situation of training an image editing model. Based on the above embodiments, the embodiment of the present disclosure additionally defines the training process of the image editing model.
  • the method includes:
  • the sample atlas is a collection of sample images.
  • the sample images may include countable instance objects, such as people, cars, and animals.
  • the sample images may also include regions without fixed shapes, such as the sky, grass, snow, and trees.
  • the target region represents an instance target and/or a region without a fixed shape in the sample image that carries semantic information.
  • the dilation operation is a morphological operation in image processing that is used to enlarge the target region and roughen its boundaries.
  • a panoptic segmentation model is used to segment a sample image, obtaining instance targets and/or shapeless regions within the sample image.
  • a dilation operation is performed on the instance targets and/or shapeless regions within the sample image to obtain a dilated sample image.
  • a sample mask atlas set is determined based on the sample dilated images corresponding to each sample image in the sample atlas.
  • a reference mask image of the sample image is generated based on a target region in the sample dilated image; Gaussian blur processing is performed on the reference mask image to obtain at least two sample mask images corresponding to the sample image; and a sample mask atlas set is determined based on the at least two sample mask images corresponding to the sample images in the sample atlas, wherein the at least two sample mask images are associated with the description text.
  • a reference mask of the sample image is generated based on the target region in the sample dilated image.
  • the to-be-edited region of the reference mask of the sample image is filled with white, while the non-to-be-edited region is filled with black.
  • the reference mask includes instance masks and semantic masks.
  • Gaussian kernels of different sizes are used to perform Gaussian blur processing on the reference mask image to obtain at least two sample mask images with different degrees of roughness corresponding to the sample image.
  • a sample mask atlas is formed based on at least two sample mask images corresponding to each sample image in the sample atlas to meet image editing needs of different precisions.
  • the image editing model is trained by at least two sample mask images with different degrees of roughness, which can support fine instance target boundaries and rough rectangles as mask images to input the image editing model, and output a target image with a high degree of image-text matching.
  • the image editing model is trained by at least two sample mask images with different degrees of roughness, which can support fine instance target boundaries and rough rectangles as mask images to input the image editing model, and the generated object in the output target image fills the entire area to be edited.
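The multi-roughness mask generation can be sketched as below, using a box-filter stand-in for the Gaussian blur followed by re-binarisation. The `blur_mask` helper, the box-filter approximation, and the "any coverage counts" threshold are assumptions for illustration; the disclosure uses Gaussian kernels of different sizes.

```python
import numpy as np

def blur_mask(mask, kernel_size):
    """Blur a binary mask with a (2k+1)x(2k+1) box filter, then
    re-binarise: larger kernels give coarser (rougher) masks."""
    m = mask.astype(float)
    k = kernel_size
    h, w = m.shape
    padded = np.pad(m, k, mode="edge")
    out = np.zeros_like(m)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out += padded[k + dy : k + dy + h, k + dx : k + dx + w]
    out /= (2 * k + 1) ** 2
    # Any blurred coverage counts as "to be edited", so the mask grows
    # with kernel size, mimicking progressively rougher sample masks.
    return (out > 0).astype(np.uint8)
```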
  • a pre-set image description generation model is used to identify and understand the image content of the target area and output a description text corresponding to the target area.
  • a description text set is formed based on the description text of the target area of each sample image in the sample atlas.
  • a correlation is established between the description text of the sample image and the sample mask image to obtain an image-text data pair.
  • S340 Train a preset editing model based on the sample atlas, the sample mask atlas, and the description text set to obtain an image editing model.
  • the preset editing model may include an encoder, a noise prediction network, and a decoder.
  • the noise prediction network may be a Unet network, comprising a convolutional layer, a downsampling layer, a downsampling layer based on a multi-head attention mechanism, an intermediate network, an upsampling layer based on a multi-head attention mechanism, and an upsampling layer.
  • through the multi-head attention mechanism, text and images are associated.
  • the downsampling layer includes multiple residual modules.
  • the downsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one downsampling layer based on the multi-head attention mechanism.
  • different downsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules.
  • the intermediate network includes residual modules and Transformer modules.
  • the upsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one upsampling layer based on the multi-head attention mechanism. If there are multiple upsampling layers based on the multi-head attention mechanism, different upsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules.
  • the upsampling layer consists of multiple residual modules.
  • Figure 4 is a schematic diagram of training an image editing model provided by an embodiment of the present disclosure.
  • the encoder 410 input includes a sample image 420, which is used to compress the sample image 420 into a low-dimensional space to obtain a latent feature map 430.
  • the decoder 440 is used to restore the low-dimensional image 450 that has completed the image editing task to the size of the sample image, obtaining an edited result image 460.
  • the noise prediction network 470 is used to iteratively perform T noise addition operations on the to-be-edited region 490 in the latent feature map 430 under the constraints of the sample mask map 480, while maintaining the latent features of the non-to-be-edited regions unchanged, to obtain a local noise image 4100.
  • Low-dimensional image 450 is input to decoder 440, which decompresses low-dimensional image 450 to obtain an edited image 460 of the same size as sample image 420.
  • a loss value is calculated between the generated object in the to-be-edited region of edited image 460 and the real instance object or the region with no fixed shape.
  • the model parameters are updated based on the loss value during backpropagation, ultimately yielding a trained image editing model.
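The loss restricted to the edited region can be sketched as a masked MSE. This is an assumption: the disclosure only states that a loss value is computed between the generated object in the to-be-edited region and the real target, without specifying the objective.

```python
import numpy as np

def masked_mse_loss(edited, reference, mask):
    """Mean squared error between the generated content and the
    ground truth, computed only where mask == 1 (area to be edited)."""
    diff = (edited - reference) ** 2
    denom = max(mask.sum(), 1)  # avoid division by zero on empty masks
    return float((diff * mask).sum() / denom)
```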
  • the method further includes: determining the blur level of the sample mask image based on the size of the Gaussian kernel used in the Gaussian blur processing, and marking the corresponding sample mask image according to the blur level. Since Gaussian blur processing is performed on the reference mask image using Gaussian kernels of different sizes, the resulting sample mask images have different degrees of roughness.
  • the sample mask images can be marked with a blur level based on the size of the Gaussian kernel used in the Gaussian blur processing. For example, a fine sample mask image corresponds to a blur level of 0, and as the roughness of the sample mask image increases, the blur level corresponding to the sample mask image increases.
  • the method of training a preset editing model based on the sample atlas, the sample mask atlas, and the descriptive text set to obtain an image editing model includes: training the preset editing model based on the sample atlas, the sample mask atlas, the blur levels corresponding to the sample mask images, and the descriptive text set to obtain the image editing model.
  • the blur levels are mapped to blur level vectors, and the blur level vectors are injected into the noise prediction network so that the model learns the correspondence between different blur levels and sample mask images of different roughness levels, thereby controlling the degree to which the generated object fills the area to be edited.
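Mapping a discrete blur level to a vector for injection into the noise prediction network might look like the sinusoidal encoding below. The encoding choice and the `dim` and `num_levels` parameters are assumptions; the disclosure says only that blur levels are mapped to blur level vectors.

```python
import numpy as np

def blur_level_embedding(level, dim=8, num_levels=4):
    """Encode a discrete blur level as a sinusoidal vector that can be
    added to (or concatenated with) features in the UNet."""
    pos = level / max(num_levels - 1, 1)          # normalise to [0, 1]
    freqs = 2.0 ** np.arange(dim // 2)            # geometric frequencies
    angles = np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```

This mirrors how diffusion models commonly inject timestep information, which is why a sinusoidal scheme is a plausible stand-in here.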
  • the technical solution of the disclosed embodiments adds noise to and denoises only the area to be edited, increasing local correlation and improving the matching degree between the edited image and the description text. This improves the generation stability of the model and avoids the generation errors that may occur when the sample image is noised and denoised as a whole: because the description text describes the entire image, whole-image processing can cause the edited image to fall short of expectations.
  • the latent features of the sample image are used as input in the noising and denoising operations corresponding to all time steps of model training. This reduces the reconstruction difficulty of the model in image editing tasks, reduces the color difference of the generated objects, and improves the consistency of the model.
  • Figure 5 is a structural diagram of an image editing device provided by an embodiment of the present disclosure.
  • the device can be implemented in the form of software and/or hardware.
  • it can be implemented by an electronic device, which can be a mobile terminal, a PC or a server.
  • the apparatus includes: an acquisition module 510 , a noise adding module 520 and an image generating module 530 .
  • An acquisition module 510 is configured to acquire a region to be edited in an original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe an expected effect of image editing;
  • a noise adding module 520 for determining a target mask image according to the region to be edited, and adding preset noise to the region to be edited based on the target mask image by an image editing model to obtain a local noise image
  • the image generation module 530 is configured to perform noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word through the image editing model, and output a target image according to the noise prediction results and the image generation results.
  • the image editing model is trained by:
  • for a sample image in the sample atlas, determining a target area in the sample image, performing a dilation operation on the target area to obtain a sample dilated image, and determining a sample mask atlas according to the sample dilated images corresponding to the sample atlas;
  • a preset editing model is trained based on the sample atlas, sample mask atlas and description text set to obtain an image editing model.
  • determining a sample mask atlas according to the sample dilation image corresponding to the sample atlas includes:
  • a sample mask atlas is determined according to at least two sample mask images corresponding to the sample images in the sample atlas.
  • the method further includes:
  • the method of training a preset editing model based on the sample atlas, the sample mask atlas, and the description text set to obtain an image editing model includes:
  • a preset editing model is trained according to the sample atlas, the sample mask atlas, the blur levels corresponding to the sample mask images, and the description text set to obtain an image editing model.
  • the acquisition module 510 is specifically configured to:
  • a text input operation for the area to be edited is acquired, and a target prompt word corresponding to the area to be edited is determined according to the text input operation.
  • the noise adding module 520 is specifically configured to:
  • a noise adding operation is performed a set number of times based on the area to be edited in the latent feature map to obtain a local noise image.
  • performing a set number of noise addition operations based on the area to be edited in the latent feature map to obtain the local noise image includes:
  • determining a noise feature map, wherein the noise feature map represents the preset noise;
  • a noise adding operation is performed a set number of times based on the area to be edited and the noise feature map to obtain the local noise image.
  • the image generation module 530 is specifically configured to:
  • a local edited image is generated according to the correlation and the local noise image, a denoising operation is performed on the local edited image based on the predicted noise, and a target image is output.
  • it also includes:
  • a level setting module configured to obtain mask setting information, wherein the mask setting information includes a blur level
  • generating a local edited image according to the correlation and the local noise image includes:
  • a local edited image is generated according to the correlation, the filling degree, and the local noise image, wherein the area to be edited in the local edited image includes the generated object.
  • the image editing device provided by the embodiments of the present disclosure can execute the image editing method provided by any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects of the execution method.
  • FIG6 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
  • a schematic diagram of the structure of an electronic device 600 suitable for implementing an embodiment of the present disclosure is shown below.
  • the terminal device in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG6 is merely an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603.
  • Various programs and data required for the operation of electronic device 600 are also stored in RAM 603.
  • Processing device 601, ROM 602, and RAM 603 are connected to each other via a bus 604.
  • An input/output (I/O) interface 605 is also connected to bus 604.
  • the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609.
  • the communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data.
  • although FIG. 6 shows the electronic device 600 with various devices, it should be understood that not all of the devices shown are required to be implemented or present; more or fewer devices may alternatively be implemented or present.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program includes program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from the network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602.
  • when the computer program is executed by the processing device 601, the above-mentioned functions defined in the method of the embodiment of the present disclosure are performed.
  • the electronic device provided by the embodiment of the present disclosure and the image editing method provided by the above embodiment belong to the same inventive concept.
  • An embodiment of the present disclosure provides a computer storage medium having a computer program stored thereon.
  • the program is executed by a processor, the image editing method provided by the above embodiment is implemented.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or component, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, device, or component.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.
  • the client and server can communicate using any currently known or later developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with any form or medium of digital data communication (e.g., a communication network).
  • Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
  • the computer-readable medium carries one or more programs.
  • when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire a region to be edited in an original image and a target prompt word corresponding to the region to be edited; determine a target mask image according to the region to be edited, and add preset noise to the region to be edited based on the target mask image through an image editing model to obtain a local noise image; and perform, through the image editing model and based on the target prompt word, noise prediction processing and image generation processing on the region to be edited in the local noise image, and output a target image according to the noise prediction result and the image generation result.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, or a combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each box in the flowchart or block diagram can represent a module, program segment, or a part of code, and the module, program segment, or a part of code contains one or more executable instructions for realizing the specified logical function.
  • the functions marked in the box can also occur in a different order than that marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved.
  • each box in the block diagram and/or flowchart, and the combination of the boxes in the block diagram and/or flowchart can be implemented with a dedicated hardware-based system that performs the specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in this disclosure may be implemented in software or hardware, wherein the name of a unit does not necessarily limit the unit itself.
  • exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
  • a machine-readable medium can be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, device or equipment.
  • a machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or equipment, or any suitable combination of the foregoing.
  • a more specific example of a machine-readable storage medium can include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
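The mask-constrained local noise addition described in the training bullets above can be sketched as follows. This is a minimal NumPy illustration, not the actual model: the noise schedule, array shapes, and function name are all assumptions. Its only purpose is to show that preset noise is blended into the latent feature map exclusively inside the to-be-edited region, while latent features outside the region are left unchanged.

```python
import numpy as np

def add_local_noise(latent, mask, num_steps, rng=None):
    """Iteratively add noise only inside the masked region.

    latent: (H, W, C) latent feature map of the sample image.
    mask:   (H, W) binary map, 1 inside the to-be-edited region.
    Returns a "local noise image" of the same shape as `latent`.
    """
    rng = np.random.default_rng(rng)
    m = mask[..., None]                      # broadcast mask over channels
    noisy = latent.copy()
    for t in range(num_steps):               # T iterative noise additions
        noise = rng.standard_normal(latent.shape)
        alpha = 1.0 - (t + 1) / num_steps    # hypothetical linear schedule
        blended = np.sqrt(alpha) * noisy + np.sqrt(1.0 - alpha) * noise
        # Inside the mask: progressively noised; outside: original latents.
        noisy = m * blended + (1.0 - m) * latent
    return noisy
```

Because the unmasked latents are reinserted at every step, the regions outside the to-be-edited area are bit-identical to the input, which mirrors the constraint that the non-edited latent features remain unchanged throughout the T noise addition operations.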

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present disclosure provide an image editing method and apparatus, a device, and a storage medium. The image editing method comprises: acquiring an area to be edited in an original image and a target prompt word corresponding to said area, wherein the target prompt word is text information used for describing an expected effect of image editing; determining a target mask image on the basis of said area, and adding preset noise to said area on the basis of the target mask image by means of an image editing model to obtain a local noise image; and on the basis of the target prompt word and by means of the image editing model, performing noise prediction processing and image generation processing on an area to be edited in the local noise image, and outputting a target image on the basis of a noise prediction result and an image generation result. The embodiments of the present disclosure solve the problem that image editing methods in the related art cannot accurately modify and edit local areas, thereby improving image-text consistency and image generation quality.

Description

Image editing method, apparatus, device, and storage medium

This application claims priority to Chinese Patent Application No. 202410455156.2, filed on April 15, 2024, the entire disclosure of which is incorporated herein by reference as a part of this application.

Technical Field

Embodiments of the present disclosure relate to an image editing method, apparatus, device, and storage medium.

Background Art

With the development of computer technology, there is a growing demand for editing local areas of an original image according to a text description.

At present, in image editing tasks driven by text descriptions, semantic image editing processes the entire image indiscriminately and cannot accurately modify and edit a specified area, so the generated image deviates from the text description, which degrades image generation quality.

Summary of the Invention

The present disclosure provides an image editing method, apparatus, device, and storage medium that can accurately edit a specified area, improving image-text matching and image generation quality.

In a first aspect, an embodiment of the present disclosure provides an image editing method, comprising:

acquiring a region to be edited in an original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe an expected effect of image editing;

determining a target mask image according to the region to be edited, and adding preset noise to the region to be edited based on the target mask image through an image editing model to obtain a local noise image; and

performing, through the image editing model and based on the target prompt word, noise prediction processing and image generation processing on the region to be edited in the local noise image, and outputting a target image according to the noise prediction result and the image generation result.
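The three steps of the method above (acquire the region and prompt word, build a target mask and add preset noise locally, then predict noise and generate the output) can be sketched with purely hypothetical interfaces. The `model` methods below are illustrative placeholders, not the actual API of the disclosed image editing model; only `make_target_mask` is concrete.

```python
import numpy as np

def make_target_mask(shape, region):
    """Build a binary target mask from a user-selected box.

    region: (top, left, bottom, right) in pixel coordinates
            (a box selection is assumed for simplicity).
    """
    mask = np.zeros(shape, dtype=np.float32)
    t, l, b, r = region
    mask[t:b, l:r] = 1.0
    return mask

def edit_image(original, region, prompt, model, rng=None):
    """End-to-end flow of the claimed method (hypothetical model API)."""
    mask = make_target_mask(original.shape[:2], region)
    # Step 2: add preset noise only inside the masked region.
    local_noise_image = model.add_preset_noise(original, mask, rng=rng)
    # Step 3: prompt-conditioned noise prediction and image generation.
    predicted_noise = model.predict_noise(local_noise_image, mask, prompt)
    return model.generate(local_noise_image, predicted_noise, mask)
```

A smear or slide selection would produce an irregular mask instead of a box, but the downstream flow is identical: the mask gates every subsequent operation to the region to be edited.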

In a second aspect, an embodiment of the present disclosure further provides an image editing apparatus, comprising:

an acquisition module, configured to acquire a region to be edited in an original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe an expected effect of image editing;

a noise adding module, configured to determine a target mask image according to the region to be edited, and add preset noise to the region to be edited based on the target mask image through an image editing model to obtain a local noise image; and

an image generation module, configured to perform, through the image editing model and based on the target prompt word, noise prediction processing and image generation processing on the region to be edited in the local noise image, and output a target image according to the noise prediction result and the image generation result.

In a third aspect, an embodiment of the present disclosure further provides an electronic device, comprising:

one or more processors; and

a storage device, configured to store one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the image editing method described in any embodiment of the present disclosure.

In a fourth aspect, an embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the image editing method described in any embodiment of the present disclosure.

Brief Description of the Drawings

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

FIG. 1 is a schematic flowchart of an image editing method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an editing interface provided by an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of another image editing method provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of training an image editing model provided by an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of an image editing apparatus provided by an embodiment of the present disclosure; and

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.

It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. In addition, the method embodiments may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.

As used herein, the term "including" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.

It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are used for illustrative purposes only and are not intended to limit the scope of these messages or information.

It is understandable that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving a user's active request, prompt information is sent to the user to clearly inform the user that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose, according to the prompt information, whether to provide personal information to software or hardware such as an electronic device, application, server, or storage medium that performs the operations of the technical solution of the present disclosure.

As an optional but non-limiting implementation, in response to receiving a user's active request, the prompt information may be sent to the user in the form of a pop-up window, in which the prompt information may be presented in text. Furthermore, the pop-up window may also carry a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.

It is understandable that the above process of notifying the user and obtaining the user's authorization is merely illustrative and does not limit the implementation of the present disclosure; other methods that comply with relevant laws and regulations may also be applied to the implementation of the present disclosure.

It is understandable that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and relevant provisions.

It should be noted that the embodiments of the present disclosure may mention certain existing software, components, models, or other industry solutions. These should be regarded as exemplary, intended only to illustrate the feasibility of implementing the technical solution of the present disclosure, and do not imply that the applicant has used or necessarily will use such solutions.

FIG. 1 is a schematic flowchart of an image editing method provided by an embodiment of the present disclosure. The embodiment is applicable to local editing scenarios. For example, a user inputs a region selection operation on an original image to determine a region to be edited in the original image, and inputs a target prompt word for the region to be edited; an image editing model then generates a target object in the region to be edited based on the original image, the region to be edited, and the target prompt word, obtains a target image containing the target object, and outputs it. The target object represents the object accurately generated in the region to be edited based on the target prompt word. The method may be executed by an image editing apparatus, which may be implemented in the form of software and/or hardware and, optionally, by an electronic device, which may be a mobile terminal, a PC, a server, or the like.

As shown in FIG. 1, the method includes:

S110. Acquire a region to be edited in an original image and a target prompt word corresponding to the region to be edited.

In the embodiments of the present disclosure, the original image represents a picture, or a video frame of a video, on which an image editing operation is to be performed. The original image may include a picture or video frame stored in an electronic device, a picture or video frame downloaded from a server, or a picture or video frame captured by a user. If an image editing operation on a local area is required, the original image may be displayed through an editing interface. The editing interface may be an interactive interface in a client whose display is triggered by a local editing control. The client may include an application, a mini-program, a web client, or the like. The local editing control is used to trigger display of the editing interface.

The region to be edited represents the image area on which the image editing operation is to be performed. The region to be edited may be determined by a region selection operation on the original image in the editing interface, where the region selection operation includes a smearing operation, a box selection operation, a sliding operation, or the like. For example, a smearing operation on the original image in the editing interface is acquired, and the region to be edited is determined according to the smeared area. Alternatively, a box selection operation on the original image is acquired, and the region to be edited is determined according to the boxed area. Alternatively, a sliding operation on the original image is acquired, and the region to be edited is determined according to the sliding trajectory.
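Turning a smearing or sliding gesture into a region to be edited can be sketched as a simple rasterization. This is an illustrative NumPy snippet, not the disclosed implementation; the brush radius and point sampling are assumptions. Points sampled along the gesture trajectory are stamped into a binary mask with a circular brush.

```python
import numpy as np

def strokes_to_region(shape, points, radius=2):
    """Rasterize gesture points into a binary to-be-edited mask.

    shape:  (H, W) of the original image.
    points: [(row, col), ...] sampled along the smear/slide trajectory.
    radius: hypothetical brush radius in pixels.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    for r, c in points:
        # Mark every pixel within the brush circle around this point.
        mask[(rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2] = 1
    return mask
```

A box selection is the degenerate case (fill one rectangle), and the resulting mask serves directly as the target mask image used to constrain the local noise addition.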

其中,所述目标提示词为用于描述图像编辑的预期效果的文本信息。具体地,目标提示词表征对待编辑区域内预期生成的目标对象的描述文本。例如,目标提示词可以表征汽车,表示图像编辑的预期效果为将待编辑区域内的对象替换为汽车。The target prompt word is text information used to describe the expected effect of image editing. Specifically, the target prompt word represents a descriptive text of the target object to be generated in the area to be edited. For example, the target prompt word may represent a car, indicating that the expected effect of image editing is to replace the object in the area to be edited with a car.

示例性地,在编辑界面中展示原始图像;获取针对所述原始图像的区域选择操作,根据所述区域选择操作确定所述原始图像中的待编辑区域;获取针对所述待编辑区域的文本输入操作,根据所述文本输入操作确定所述待编辑区域对应的目标提示词。Exemplarily, an original image is displayed in an editing interface; an area selection operation for the original image is obtained, and an area to be edited in the original image is determined based on the area selection operation; a text input operation for the area to be edited is obtained, and a target prompt word corresponding to the area to be edited is determined based on the text input operation.

图2为本公开实施例所提供的一种编辑界面的示意图。如图2所示,编辑界面200包括编辑区域210和结果展示区域220等。编辑区域210包括图像加载控件230和提示词输入控件240。响应于图像加载控件230的触发事件,获取原始图像250,在图像加载控件230对应的位置展示原始图像250。响应于针对原始图像250的涂抹操作,确定原始图像250中的涂抹区域,作为待编辑区域。响应于针对提示词输入控件240的文本输入操作,获取待编辑区域对应的目标提示词。其中,提示词输入控件240包括文本框控件。FIG2 is a schematic diagram of an editing interface provided by an embodiment of the present disclosure. As shown in FIG2 , the editing interface 200 includes an editing area 210 and a result display area 220, etc. The editing area 210 includes an image loading control 230 and a prompt word input control 240. In response to a triggering event of the image loading control 230, the original image 250 is obtained, and the original image 250 is displayed at a position corresponding to the image loading control 230. In response to a smearing operation on the original image 250, the smeared area in the original image 250 is determined as the area to be edited. In response to a text input operation on the prompt word input control 240, the target prompt word corresponding to the area to be edited is obtained. The prompt word input control 240 includes a text box control.

S120、根据所述待编辑区域确定目标掩膜图,通过图像编辑模型基于所述目标掩膜图向所述待编辑区域添加预设噪声,得到局部噪声图像。S120 , determining a target mask image according to the area to be edited, and adding preset noise to the area to be edited based on the target mask image by an image editing model to obtain a local noise image.

其中,目标掩膜图用于遮住原始图像中的非待编辑区域,以提示图像编辑模型对待编辑区域进行图像编辑。例如,生成与原始图像尺寸一致的原始掩膜图,识别原始图像的待编辑区域中的目标对象,根据目标对象的坐标确定原始掩膜图中的待编辑区域,将原始掩膜图中待编辑区域填充为白色,非待编辑区域填充为黑色,得到目标掩膜图。可选地,还可以对原始图像的待编辑区域中的目标对象进行膨胀处理,以对目标对象的边界进行扩张,然后基于膨胀处理后的目标对象确定目标掩膜图。The target mask is used to mask the non-to-be-edited area in the original image, prompting the image editing model to perform image editing on the to-be-edited area. For example, an original mask with the same size as the original image is generated, the target object in the to-be-edited area of the original image is identified, and the to-be-edited area in the original mask is determined based on the coordinates of the target object. The to-be-edited area in the original mask is filled with white, and the non-to-be-edited area is filled with black to obtain the target mask. Optionally, the target object in the to-be-edited area of the original image can be dilated to expand the boundary of the target object, and then the target mask is determined based on the dilated target object.
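The mask construction and dilation described above can be sketched in NumPy. The mask coordinates, the 3x3 structuring element, and the iteration count below are illustrative assumptions, not values fixed by this disclosure.

```python
import numpy as np

def make_mask(h, w, y0, y1, x0, x1):
    """Build a mask the same size as the original image:
    255 (white) inside the to-be-edited region, 0 (black) elsewhere."""
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255
    return mask

def dilate(mask, iterations=1):
    """Simple binary dilation with a 3x3 structuring element,
    expanding the boundary of the target object."""
    m = mask > 0
    for _ in range(iterations):
        padded = np.pad(m, 1)
        grown = np.zeros_like(m)
        # a pixel becomes foreground if any pixel in its 3x3
        # neighbourhood is foreground
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                grown |= padded[1 + dy:1 + dy + mask.shape[0],
                                1 + dx:1 + dx + mask.shape[1]]
        m = grown
    return m.astype(np.uint8) * 255

mask = make_mask(8, 8, 3, 5, 3, 5)      # 2x2 white block
dilated = dilate(mask, iterations=1)    # boundary expanded to a 4x4 block
```

In practice a library routine such as OpenCV's morphological dilation would replace the hand-written loop; the sketch only illustrates the white/black fill convention and the boundary expansion.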

图像编辑模型可以为基于样本图集、样本掩膜图集和描述文本集训练的扩散模型。其中，样本掩膜图集基于所述样本图集中样本图像对应的样本膨胀图像确定。样本膨胀图像表征针对所述样本图像对应的目标区域执行膨胀操作后的图像。所述描述文本集基于所述目标区域的图像内容确定。其中，目标区域可以表征样本图像中通过全景分割模型分割的携带语义的目标对象。The image editing model may be a diffusion model trained based on a sample atlas, a sample mask atlas, and a descriptive text set. The sample mask atlas is determined based on a sample dilated image corresponding to a sample image in the sample atlas. The sample dilated image represents an image after a dilation operation is performed on a target region corresponding to the sample image. The descriptive text set is determined based on the image content of the target region. The target region may represent a target object carrying semantic information, segmented from the sample image using a panoptic segmentation model.


其中，扩散模型包括编码器、噪声预测网络和解码器。编码器的输入包括原始图像，用于将原始图像压缩至低维度空间，得到潜在特征图。解码器用于将完成图像编辑任务的低维度图像还原至原始图像的尺寸，得到目标图像。噪声预测网络用于在目标掩膜图的约束下，向潜在特征图中的待编辑区域添加噪声，对非待编辑区域保持潜在特征不变，得到局部噪声图像。以及，还在目标提示词的约束下，对局部噪声图像zT中的待编辑区域进行噪声预测处理和图像生成处理，得到时间步t=T对应的预测噪声以及图文相关性。然后，基于时间步t=T对应的图文相关性和局部噪声图像zT生成时间步t=T对应的局部编辑图像。从时间步t=T对应的局部编辑图像中减去时间步t=T对应的预测噪声，得到时间步t=T-1对应的局部噪声图像zT-1。将时间步t=T-1对应的局部噪声图像zT-1输入噪声预测网络以生成时间步t=T-2的局部噪声图像zT-2，以此类推，直至生成时间步t=0对应的局部噪声图像z0，即得到目标图像的低维度的特征图。The diffusion model includes an encoder, a noise prediction network, and a decoder. The encoder input includes the original image, which is used to compress the original image into a low-dimensional space to obtain a latent feature map. The decoder is used to restore the low-dimensional image, after completing the image editing task, to the original image size, obtaining the target image. The noise prediction network is used to add noise to the to-be-edited region in the latent feature map, subject to the constraints of the target mask map, while preserving the latent features in the non-to-be-edited regions, to obtain a local noise image. Furthermore, under the constraints of the target prompt word, the to-be-edited region in the local noise image zT is subjected to noise prediction and image generation, obtaining the predicted noise and image-text correlation corresponding to time step t=T. Then, based on the image-text correlation and the local noise image zT, a local edited image corresponding to time step t=T is generated. The predicted noise corresponding to time step t=T is subtracted from the local edited image to obtain the local noise image zT-1 corresponding to time step t=T-1. The local noise image zT-1 corresponding to time step t=T-1 is input into the noise prediction network to generate the local noise image zT-2 at time step t=T-2, and so on, until the local noise image z0 corresponding to time step t=0 is generated, that is, the low-dimensional feature map of the target image is obtained.

可选地,噪声预测网络可以为Unet网络,包括卷积层、下采样层、基于多头注意力机制的下采样层、中间网络、基于多头注意力机制的上采样层以及上采样层等。通过引入多头注意力机制,使得文本和图像相关联。其中,下采样层包括多个残差模块。基于多头注意力机制的下采样层可以包括残差模块和Transformer模块,且扩散模型可以包括至少一个基于多头注意力机制的下采样层。若具有多个基于多头注意力机制的下采样层时,不同的基于多头注意力机制的下采样层包括的Transformer模块的数量不同。中间网络包括残差模块和Transformer模块等。基于多头注意力机制的上采样层可以包括残差模块和Transformer模块,且扩散模型可以包括至少一个基于多头注意力机制的上采样层。若具有多个基于多头注意力机制的上采样层时,不同的基于多头注意力机制的上采样层包括的Transformer模块的数量不同。上采样层包括多个残差模块。Optionally, the noise prediction network can be a Unet network, comprising a convolutional layer, a downsampling layer, a downsampling layer based on a multi-head attention mechanism, an intermediate network, an upsampling layer based on a multi-head attention mechanism, and an upsampling layer. By introducing the multi-head attention mechanism, text and images are associated. The downsampling layer includes multiple residual modules. The downsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one downsampling layer based on the multi-head attention mechanism. If there are multiple downsampling layers based on the multi-head attention mechanism, different downsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules. The intermediate network includes residual modules and Transformer modules. The upsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one upsampling layer based on the multi-head attention mechanism. If there are multiple upsampling layers based on the multi-head attention mechanism, different upsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules. The upsampling layer includes multiple residual modules.

预设噪声可以为噪声特征图中分布属性满足预设要求的噪声。例如,预设噪声可以为满足高斯分布的随机噪声等。其中,噪声特征图可以为与原始图像的尺寸一致的噪声图像。可以通过叠加目标掩膜图和噪声特征图确定噪声特征区域。噪声特征区域表征噪声特征图中与待编辑区域对应的特定区域。由于目标掩膜图中待编辑区域为白色,其余区域为黑色,将目标掩膜图叠加至噪声特征图,可以遮盖住噪声特征图中的非待编辑区域,从而,确定待编辑区域对应的噪声特征区域。The preset noise can be noise whose distribution properties in the noise feature map meet preset requirements. For example, the preset noise can be random noise that meets Gaussian distribution, etc. Among them, the noise feature map can be a noise image with the same size as the original image. The noise feature area can be determined by superimposing the target mask map and the noise feature map. The noise feature area represents a specific area in the noise feature map that corresponds to the area to be edited. Since the area to be edited in the target mask map is white and the remaining areas are black, superimposing the target mask map on the noise feature map can cover the non-to-edit area in the noise feature map, thereby determining the noise feature area corresponding to the area to be edited.
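The overlay of the target mask on the noise feature map described above can be expressed as an element-wise product. Gaussian noise and the region coordinates below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 8, 8

noise_map = rng.standard_normal((h, w))   # preset noise: Gaussian random noise
mask = np.zeros((h, w))
mask[3:5, 3:5] = 1.0                      # white = to-be-edited region

# Overlaying the mask on the noise feature map covers the non-edited
# area, leaving only the noise feature region that corresponds to the
# area to be edited.
noise_region = noise_map * mask
```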

局部噪声图像表示向原始图像的待编辑区域添加预设噪声后的图像,由于添加噪声后的图像只有待编辑区域呈现噪声状态,因此,将加噪声后的图像作为局部噪声图像。可选地,为了减少运算量,通过编码器将原始图像压缩至低维度的潜在空间,得到原始图像的潜在特征图。向潜在特征图的待编辑区域添加预设噪声,得到局部噪声图像。The local noise image represents the image obtained by adding a preset noise to the region to be edited in the original image. Since only the region to be edited exhibits noise after the noise is added, the noisy image is considered the local noise image. Optionally, to reduce computational complexity, the original image is compressed into a low-dimensional latent space using an encoder to obtain a latent feature map of the original image. The preset noise is then added to the region to be edited in the latent feature map to obtain the local noise image.

示例性地,根据所述待编辑区域确定目标掩膜图,通过图像编辑模型基于所述目标掩膜图向所述待编辑区域添加预设噪声,得到局部噪声图像,包括:根据所述待编辑区域中的前景内容确定目标掩膜图;通过所述图像编辑模型生成所述原始图像的潜在特征图,结合所述目标掩膜图和潜在特征图确定所述潜在特征图中的待编辑区域;基于所述潜在特征图中的待编辑区域执行设定次数的噪声添加操作,得到局部噪声图像。Exemplarily, a target mask map is determined according to the area to be edited, and preset noise is added to the area to be edited based on the target mask map through an image editing model to obtain a local noise image, including: determining the target mask map according to the foreground content in the area to be edited; generating a potential feature map of the original image through the image editing model, and determining the area to be edited in the potential feature map by combining the target mask map and the potential feature map; performing a set number of noise adding operations based on the area to be edited in the potential feature map to obtain a local noise image.

例如,若对原始图像中的局部区域进行涂抹,对原始图像中涂抹区域进行前景内容识别,得到目标对象。基于原始图像中的目标对象生成目标掩膜图。通过图像编辑模型中的编码器对原始图像进行压缩处理,得到原始图像的潜在特征图。由于目标掩膜图中待编辑区域为白色,其余区域为黑色,将目标掩膜图叠加至潜在特征图,可以遮盖住潜在特征图中的非待编辑区域,从而,确定出潜在特征图中的待编辑区域。通过图像编辑模型基于潜在特征图中的待编辑区域执行设定次数的噪声添加操作,得到局部噪声图像。For example, if a local area in the original image is smeared, the foreground content of the smeared area in the original image is identified to obtain the target object. A target mask image is generated based on the target object in the original image. The original image is compressed by the encoder in the image editing model to obtain a potential feature map of the original image. Since the area to be edited in the target mask image is white and the remaining areas are black, superimposing the target mask image on the potential feature map can cover the non-to-be-edited area in the potential feature map, thereby determining the area to be edited in the potential feature map. The image editing model performs a set number of noise addition operations based on the area to be edited in the potential feature map to obtain a local noise image.

一些实施例中,潜在特征图表示为z0,通过图像编辑模型在目标掩膜图的约束下确定z0的待编辑区域,向z0的待编辑区域中添加随机噪声,得到局部噪声图z1。然后,通过图像编辑模型向局部噪声图z1的待编辑区域中添加随机噪声,得到局部噪声图z2。迭代执行噪声添加步骤,直至得到局部噪声图zT。In some embodiments, the latent feature map is represented as z0. The image editing model determines the region to be edited within z0 under the constraints of the target mask map. Random noise is then added to the region to be edited within z0 to obtain a local noise map z1. The image editing model then adds random noise to the region to be edited within the local noise map z1 to obtain a local noise map z2. The noise addition steps are iteratively performed until a local noise map zT is obtained.
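The iterative noise-addition steps z0, z1, ..., zT, confined to the to-be-edited region, can be sketched as follows. The fixed noise-mixing coefficient (beta) and the number of steps T=10 are illustrative assumptions; the disclosure does not fix a particular schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_step(z, mask, beta):
    """One noise-addition step: mix Gaussian noise into the
    to-be-edited region only; latents elsewhere stay unchanged."""
    eps = rng.standard_normal(z.shape)
    noisy = np.sqrt(1.0 - beta) * z + np.sqrt(beta) * eps
    return np.where(mask > 0, noisy, z)

z0 = rng.standard_normal((8, 8))          # latent feature map z0
mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1.0                      # constraint from the target mask

zt = z0
for t in range(1, 11):                    # produces z1, z2, ..., zT (T = 10)
    zt = add_noise_step(zt, mask, beta=0.02)
```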

可选地,所述基于所述潜在特征图中的待编辑区域执行设定次数的噪声添加操作,得到局部噪声图像,包括:获取噪声特征图,其中,所述噪声特征图表征预设噪声;结合所述目标掩膜图和噪声特征图确定噪声特征区域;基于所述待编辑区域和噪声特征区域执行设定次数的噪声添加操作,得到局部噪声图像。Optionally, performing a set number of noise adding operations based on the area to be edited in the potential feature map to obtain a local noise image includes: obtaining a noise feature map, wherein the noise feature map represents a preset noise; determining a noise feature area in combination with the target mask map and the noise feature map; performing a set number of noise adding operations based on the area to be edited and the noise feature area to obtain a local noise image.

例如,潜在特征图表示为z0,噪声特征图表示为S,将z0、S和目标掩膜图输入图像编辑模型的Unet网络,通过Unet网络在目标掩膜图的约束下确定z0的待编辑区域,通过Unet网络在目标掩膜图的约束下确定噪声特征区域,将噪声特征区域内的噪声特征叠加至z0的待编辑区域中的像素特征,得到局部噪声图z1。然后,采用相同的方式,通过Unet网络向局部噪声图z1的待编辑区域中添加随机噪声,得到局部噪声图z2。迭代执行噪声添加步骤,直至得到局部噪声图zT。For example, the latent feature map is represented as z0, and the noise feature map is represented as S. z0, S, and the target mask map are input into the Unet network of the image editing model. The Unet network determines the region to be edited in z0 under the constraints of the target mask map. The Unet network also determines the noise feature region under the constraints of the target mask map. The noise features in the noise feature region are superimposed on the pixel features in the region to be edited in z0 to obtain the local noise map z1. Then, using the same method, random noise is added to the region to be edited in the local noise map z1 through the Unet network to obtain the local noise map z2. The noise addition step is iterated until the local noise map zT is obtained.

S130、通过所述图像编辑模型基于所述目标提示词,对所述局部噪声图像的待编辑区域进行噪声预测处理和图像生成处理,根据噪声预测结果和图像生成结果输出目标图像。S130 , performing noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word through the image editing model, and outputting a target image according to the noise prediction result and the image generation result.

示例性地,通过所述图像编辑模型根据所述局部噪声图像的待编辑区域中的图像特征确定预测噪声;通过所述图像编辑模型基于所述目标提示词生成文本特征,确定所述文本特征与局部噪声图像的相关性;根据所述相关性和局部噪声图像生成局部编辑图像,基于所述预测噪声对局部编辑图像进行去噪操作,输出目标图像。Exemplarily, the image editing model determines predicted noise based on image features in the to-be-edited area of the local noise image; the image editing model generates text features based on the target prompt word, and determines the correlation between the text features and the local noise image; a local edited image is generated based on the correlation and the local noise image, and a denoising operation is performed on the local edited image based on the predicted noise, and a target image is output.

例如,图像编辑模型包括文本映射模块,用于将目标提示词映射为文本向量,作为文本特征。将文本特征、局部噪声图像ZT和时间步t=T输入图像编辑模型的Unet网络,通过Unet网络输出时间步t=T对应的预测噪声,并计算文本特征与局部噪声图像ZT的相关性,根据所述相关性调整局部噪声图像ZT的像素分布情况,得到时间步t=T对应的局部编辑图像。从时间步t=T对应的局部编辑图像中减去时间步t=T对应的预测噪声,得到时间步t=T-1的局部噪声图像。将文本特征、局部噪声图像ZT-1和时间步t=T-1输入图像编辑模型的Unet网络,采用相似的方法,可以得到时间步t=T-2的局部噪声图像ZT-2,以此类推,执行T轮去噪和图像生成处理,得到t=0的局部噪声图像Z0。将局部噪声图像Z0输入图像编辑模型的解码器,通过解码器对去噪和图像生成处理后的局部噪声图像Z0进行解压缩,以还原至原始图像的尺寸,得到目标图像。For example, the image editing model includes a text mapping module for mapping the target prompt word into a text vector as a text feature. The text features, the local noise image ZT, and the time step t=T are input into the Unet network of the image editing model. The Unet network outputs the predicted noise corresponding to time step t=T, calculates the correlation between the text features and the local noise image ZT, and adjusts the pixel distribution of the local noise image ZT based on the correlation to obtain the local edited image corresponding to time step t=T. The predicted noise corresponding to time step t=T is subtracted from the local edited image corresponding to time step t=T to obtain the local noise image at time step t=T-1. The text features, the local noise image ZT-1, and the time step t=T-1 are input into the Unet network of the image editing model. Using a similar method, the local noise image ZT-2 at time step t=T-2 can be obtained. Similarly, after performing T rounds of denoising and image generation, the local noise image Z0 at time step t=0 is obtained. The local noise image Z0 is input into the decoder of the image editing model, and the local noise image Z0 after denoising and image generation processing is decompressed by the decoder to restore it to the size of the original image to obtain the target image.
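The reverse loop above can be sketched as below. `predict_noise` is a hypothetical stand-in for the Unet noise predictor, and the latent size, text-embedding dimension, and T are assumptions for illustration; a real network would condition on the text features and a time-step embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10

def predict_noise(z, text_feat, t):
    """Hypothetical stand-in for the Unet noise predictor; it only
    mimics the interface (latent, text features, time step)."""
    return 0.1 * (t / T) * z

mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1.0
text_feat = rng.standard_normal(4)        # mapped prompt embedding (assumed dim)
z = rng.standard_normal((8, 8))           # fully noised latent zT
z_init = z.copy()

for t in range(T, 0, -1):
    eps_hat = predict_noise(z, text_feat, t)
    # subtract the predicted noise inside the to-be-edited region only;
    # the non-edited region keeps its latent features
    z = np.where(mask > 0, z - eps_hat, z)

# z now plays the role of z0 and would be decompressed by the decoder
# back to the original image size
```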

可选地,还包括:获取掩膜设置信息,所述掩膜设置信息包括模糊等级。通过所述图像编辑模型根据所述局部噪声图像的待编辑区域中的图像特征确定预测噪声;通过所述图像编辑模型基于所述目标提示词生成文本特征,确定所述文本特征与局部噪声图像的相关性;基于所述模糊等级确定生成对象对待编辑区域的填充程度;根据所述相关性、填充程度和局部噪声图像生成局部编辑图像,其中,所述局部编辑图像中的待编辑区域包括所述生成对象。基于所述预测噪声对局部编辑图像进行去噪操作,输出目标图像。Optionally, the method further includes: obtaining mask setting information, the mask setting information including a blur level; determining predicted noise based on image features in the to-be-edited region of the local noise image using the image editing model; generating text features based on the target prompt word using the image editing model, and determining the correlation between the text features and the local noise image; determining the degree to which the generated object fills the to-be-edited region based on the blur level; generating a local edited image based on the correlation, the filling level, and the local noise image, wherein the to-be-edited region in the local edited image includes the generated object; and performing a denoising operation on the local edited image based on the predicted noise to output a target image.

其中，填充程度表征生成对象在待编辑区域中的分布信息。不同模糊等级下，生成对象对待编辑区域的填充程度不同。对于填充程度较高的情况，生成对象紧贴于待编辑区域的边缘。对于填充程度较低的情况，生成对象与待编辑区域的边缘的距离增加。The fill level represents the distribution of the generated object within the area to be edited. At different blur levels, the generated object fills the area to varying degrees. With a high fill level, the generated object is positioned close to the edge of the area to be edited. With a low fill level, the generated object is further away from the edge of the area to be edited.

例如,编辑界面还包括掩膜设置控件,用于输入掩膜设置信息。若检测到针对掩膜设置控件的触发操作,获取掩膜设置信息,解析掩膜设置信息得到模糊等级。在通过所述图像编辑模型基于所述目标提示词生成文本特征,确定所述文本特征与局部噪声图像的相关性之后,基于模糊等级确定生成对象对待编辑区域的填充程度;根据所述相关性、填充程度和局部噪声图像生成局部编辑图像,其中,所述局部编辑图像中的待编辑区域包括所述生成对象。基于预测噪声对局部编辑图像进行去噪操作,输出目标图像。For example, the editing interface also includes a mask setting control for inputting mask setting information. If a trigger operation for the mask setting control is detected, the mask setting information is obtained and parsed to obtain a blur level. After generating text features based on the target prompt word using the image editing model and determining the correlation between the text features and the local noise image, the degree to which the generated object fills the to-be-edited area is determined based on the blur level. A local edited image is generated based on the correlation, the filling level, and the local noise image, wherein the to-be-edited area in the local edited image includes the generated object. A denoising operation is then performed on the local edited image based on the predicted noise, and a target image is output.

本公开实施例的技术方案,通过针对待编辑区域的目标提示词准确描述出待编辑区域的预期编辑效果,基于待编辑区域确定目标掩膜图,通过图像编辑模型基于目标掩膜图向待编辑区域添加预设噪声,得到局部噪声图像,然后,通过图像编辑模型基于目标提示词对局部噪声图像的待编辑区域进行去噪处理和文生图处理,以在待编辑区域生成符合预期编辑效果的目标对象,实现精准地编辑指定区域,提升图文匹配度和图像生成质量。由于仅对待编辑区域进行加噪处理和去噪处理,降低了图像处理难度和目标对象与原始图像的色差。本公开实施例解决了相关技术中的图像编辑方法无法精准地修改和编辑局部区域的问题,提升了图文一致性和图像生成质量。The technical solution of the embodiment of the present disclosure accurately describes the expected editing effect of the area to be edited by the target prompt word for the area to be edited, determines the target mask map based on the area to be edited, adds preset noise to the area to be edited based on the target mask map through the image editing model, and obtains a local noise image. Then, the area to be edited of the local noise image is denoised and processed by the image editing model based on the target prompt word to generate a target object that meets the expected editing effect in the area to be edited, thereby achieving precise editing of the designated area and improving the image-text matching and image generation quality. Since only the area to be edited is subjected to noise addition and denoising, the difficulty of image processing and the color difference between the target object and the original image are reduced. The embodiment of the present disclosure solves the problem that the image editing method in the related art cannot accurately modify and edit local areas, and improves the consistency of image and text and the image generation quality.

图3为本公开实施例所提供的另一种图像编辑方法的流程示意图,本公开实施例适用于训练图像编辑模型的情形,本公开实施例在上述各实施例的基础上,附加限定了图像编辑模型的训练过程。Figure 3 is a flow chart of another image editing method provided by an embodiment of the present disclosure. The embodiment of the present disclosure is applicable to the situation of training an image editing model. Based on the above embodiments, the embodiment of the present disclosure additionally defines the training process of the image editing model.

如图3所示,该方法包括:As shown in FIG3 , the method includes:

S310、获取样本图集。S310: Obtain a sample atlas.

其中,样本图集为样本图像的集合。样本图像可以包括可计数的实例目标,例如,人、车和动物等。样本图像还可以包括无固定形状的区域,例如,天空、草地、雪和树木等。The sample atlas is a collection of sample images. The sample images may include countable instance objects, such as people, cars, and animals. The sample images may also include regions without fixed shapes, such as the sky, grass, snow, and trees.

S320、对于所述样本图集中的样本图像,确定所述样本图像中的目标区域,对所述目标区域进行膨胀操作,得到样本膨胀图像,根据所述样本图集对应的所述样本膨胀图像确定样本掩膜图集。S320 . For a sample image in the sample atlas, determine a target area in the sample image, perform a dilation operation on the target area to obtain a sample dilated image, and determine a sample mask atlas based on the sample dilated image corresponding to the sample atlas.

其中,目标区域表征样本图像中携带语义信息的实例目标和/或无固定形状的区域。膨胀操作为一种图像处理中的形态学运算,用于使目标区域变大,边界变粗糙。The target region represents an instance target and/or a region without a fixed shape in the sample image that carries semantic information. The dilation operation is a morphological operation in image processing that is used to enlarge the target region and roughen its boundaries.

采用全景分割模型对样本图像进行分割处理，得出样本图像中的实例目标和/或无固定形状的区域。对样本图像中的实例目标和/或无固定形状的区域进行膨胀操作，得到样本膨胀图像。本公开实施例通过对样本图像进行随机膨胀，使得模型预测过程中实际输入的目标掩膜图能超过目标区域边界，彻底解决了模型的完全填充问题，使目标图像中待编辑区域的生成对象更加真实及自然，提升了图像生成质量。A panoptic segmentation model is used to segment a sample image, obtaining instance targets and/or regions without a fixed shape within the sample image. A dilation operation is performed on the instance targets and/or regions without a fixed shape within the sample image to obtain a dilated sample image. By randomly dilating the sample image, the disclosed embodiment allows the target mask input during the model prediction process to exceed the target region boundary, thoroughly resolving the model's complete-fill problem. This makes the generated objects in the target image's to-be-edited region more realistic and natural, thereby improving image generation quality.

根据样本图集中各样本图像对应的样本膨胀图像确定样本掩膜图集。示例性地,对于所述样本图集对应的样本膨胀图像,根据所述样本膨胀图像中的目标区域生成样本图像的参考掩膜图;对所述参考掩膜图进行高斯模糊处理,得到所述样本图像对应的至少两个样本掩膜图;根据所述样本图集中样本图像对应的至少两个样本掩膜图,确定样本掩膜图集,其中,所述至少两个样本掩膜图与描述文本相关联。A sample mask atlas set is determined based on the sample dilated images corresponding to each sample image in the sample atlas. Exemplarily, for the sample dilated images corresponding to the sample atlas, a reference mask image of the sample image is generated based on a target region in the sample dilated image; Gaussian blur processing is performed on the reference mask image to obtain at least two sample mask images corresponding to the sample image; and a sample mask atlas set is determined based on the at least two sample mask images corresponding to the sample images in the sample atlas, wherein the at least two sample mask images are associated with the description text.

本公开实施例中,基于样本膨胀图像中的目标区域生成样本图像的参考掩膜图,其中,样本图像的参考掩膜图的待编辑区域填充为白色,非待编辑区域填充为黑色。其中,参考掩膜图包括实例掩膜图和语义掩膜图等。In the disclosed embodiment, a reference mask of the sample image is generated based on the target region in the sample dilated image. The to-be-edited region of the reference mask of the sample image is filled with white, while the non-to-be-edited region is filled with black. The reference mask includes instance masks and semantic masks.

采用不同尺寸的高斯核对参考掩膜图进行高斯模糊处理,得到样本图像对应的粗糙程度不同的至少两个样本掩膜图。根据样本图集中各样本图像对应的至少两个样本掩膜图构成样本掩膜图集,以满足不同精度的图像编辑需求。对于目标区域表征实例目标的情况,通过粗糙程度不同的至少两个样本掩膜图训练图像编辑模型,可以支持精细的实例目标边界和粗糙的矩形作为掩膜图输入图像编辑模型,输出图文匹配度较高的目标图像。对于目标区域表征无固定形状的区域,通过粗糙程度不同的至少两个样本掩膜图训练图像编辑模型,可以支持精细的实例目标边界和粗糙的矩形作为掩膜图输入图像编辑模型,输出的目标图像中生成对象填满整个待编辑区域。Gaussian kernels of different sizes are used to perform Gaussian blur processing on the reference mask image to obtain at least two sample mask images with different degrees of roughness corresponding to the sample image. A sample mask atlas is formed based on at least two sample mask images corresponding to each sample image in the sample atlas to meet image editing needs of different precisions. For the case where the target area represents an instance target, the image editing model is trained by at least two sample mask images with different degrees of roughness, which can support fine instance target boundaries and rough rectangles as mask images to input the image editing model, and output a target image with a high degree of image-text matching. For the case where the target area represents an area without a fixed shape, the image editing model is trained by at least two sample mask images with different degrees of roughness, which can support fine instance target boundaries and rough rectangles as mask images to input the image editing model, and the generated object in the output target image fills the entire area to be edited.
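Producing sample masks of different roughness by blurring the reference mask with Gaussian kernels of different sizes can be sketched as follows. The kernel sizes and the sigma-to-size ratio below are illustrative assumptions; a library blur routine would normally be used instead of the hand-written separable convolution.

```python
import numpy as np

def gaussian_kernel1d(size, sigma):
    """Normalized 1-D Gaussian kernel of odd length `size`."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, size):
    """Separable Gaussian blur with edge padding; sigma is tied to the
    kernel size (an assumed convention)."""
    k = gaussian_kernel1d(size, sigma=size / 3.0)
    pad = size // 2
    out = np.pad(img.astype(float), pad, mode='edge')
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, out)
    return out[pad:-pad, pad:-pad]

ref = np.zeros((16, 16))
ref[6:10, 6:10] = 1.0                     # reference mask: white target region
# different kernel sizes yield sample masks of different roughness
sample_masks = [gaussian_blur(ref, size) for size in (3, 7, 11)]
```

A larger kernel spreads the white region further and softens its boundary, giving the coarser masks used alongside the fine ones during training.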

S330、对所述目标区域进行图像内容识别,得到所述目标区域对应的描述文本,根据所述样本图集对应的所述目标区域的描述文本确定描述文本集。S330 , performing image content recognition on the target area to obtain description text corresponding to the target area, and determining a description text set based on the description text of the target area corresponding to the sample atlas.

例如,通过预设图像描述生成模型对目标区域进行图像内容识别与理解,输出目标区域对应的描述文本。根据样本图集中各样本图像的目标区域的描述文本构成描述文本集。建立样本图像的描述文本与样本掩膜图的关联关系,得到图文数据对。For example, a pre-set image description generation model is used to identify and understand the image content of the target area and output a description text corresponding to the target area. A description text set is formed based on the description text of the target area of each sample image in the sample atlas. A correlation is established between the description text of the sample image and the sample mask image to obtain an image-text data pair.

S340、根据所述样本图集、样本掩膜图集和描述文本集训练预设编辑模型,得到图像编辑模型。S340: Train a preset editing model based on the sample atlas, the sample mask atlas, and the description text set to obtain an image editing model.

其中，预设编辑模型可以包括编码器、噪声预测网络和解码器。可选地，噪声预测网络可以为Unet网络，包括卷积层、下采样层、基于多头注意力机制的下采样层、中间网络、基于多头注意力机制的上采样层以及上采样层等。通过引入多头注意力机制，使得文本和图像相关联。其中，下采样层包括多个残差模块。基于多头注意力机制的下采样层可以包括残差模块和Transformer模块，且扩散模型可以包括至少一个基于多头注意力机制的下采样层。若具有多个基于多头注意力机制的下采样层时，不同的基于多头注意力机制的下采样层包括的Transformer模块的数量不同。中间网络包括残差模块和Transformer模块等。基于多头注意力机制的上采样层可以包括残差模块和Transformer模块，且扩散模型可以包括至少一个基于多头注意力机制的上采样层。若具有多个基于多头注意力机制的上采样层时，不同的基于多头注意力机制的上采样层包括的Transformer模块的数量不同。上采样层包括多个残差模块。The preset editing model may include an encoder, a noise prediction network, and a decoder. Optionally, the noise prediction network may be a Unet network, comprising a convolutional layer, a downsampling layer, a downsampling layer based on a multi-head attention mechanism, an intermediate network, an upsampling layer based on a multi-head attention mechanism, and an upsampling layer. By introducing a multi-head attention mechanism, text and images are associated. The downsampling layer includes multiple residual modules. The downsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one downsampling layer based on the multi-head attention mechanism. If there are multiple downsampling layers based on the multi-head attention mechanism, different downsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules. The intermediate network includes residual modules and Transformer modules. The upsampling layer based on the multi-head attention mechanism may include a residual module and a Transformer module, and the diffusion model may include at least one upsampling layer based on the multi-head attention mechanism. If there are multiple upsampling layers based on the multi-head attention mechanism, different upsampling layers based on the multi-head attention mechanism may include different numbers of Transformer modules. The upsampling layer consists of multiple residual modules.

图4为本公开实施例所提供的一种图像编辑模型的训练示意图。如图4所示，编码器410的输入包括样本图像420，用于将样本图像420压缩至低维度空间，得到潜在特征图430。解码器440用于将完成图像编辑任务的低维度图像450还原至样本图像的尺寸，得到编辑结果图像460。噪声预测网络470用于在样本掩膜图480的约束下，对潜在特征图430中的待编辑区域490迭代执行T次添加噪声的操作，对非待编辑区域保持潜在特征不变，得到局部噪声图像4100。以及，还在时间步和描述文本的约束下，对局部噪声图像zT中的待编辑区域进行噪声预测处理和图像生成处理，得到时间步t=T对应的预测噪声以及图文相关性。然后，基于时间步t=T对应的图文相关性和局部噪声图像zT生成时间步t=T对应的局部编辑图像。从时间步t=T对应的局部编辑图像中减去时间步t=T对应的预测噪声，得到时间步t=T-1对应的局部噪声图像zT-1。将时间步t=T-1对应的局部噪声图像zT-1输入噪声预测网络以生成时间步t=T-2的局部噪声图像zT-2，以此类推，直至生成时间步t=0对应的局部噪声图像z0，即得到编辑结果图像460的低维度图像450。Figure 4 is a schematic diagram of training an image editing model provided by an embodiment of the present disclosure. As shown in Figure 4, the encoder 410 input includes a sample image 420, which is used to compress the sample image 420 into a low-dimensional space to obtain a latent feature map 430. The decoder 440 is used to restore the low-dimensional image 450 that has completed the image editing task to the size of the sample image, obtaining an edited result image 460. The noise prediction network 470 is used to iteratively perform T noise addition operations on the to-be-edited region 490 in the latent feature map 430 under the constraints of the sample mask map 480, while maintaining the latent features of the non-to-be-edited regions unchanged, to obtain a local noise image 4100. Furthermore, under the constraints of the time step and the descriptive text, the to-be-edited region in the local noise image zT is subjected to noise prediction and image generation processing to obtain the predicted noise and image-text correlation corresponding to time step t=T. Then, based on the image-text correlation corresponding to time step t=T and the local noise image zT, a local edited image corresponding to time step t=T is generated. The predicted noise at time step t=T is subtracted from the local edited image at time step t=T to obtain the local noise image zT-1 at time step t=T-1. The local noise image zT-1 at time step t=T-1 is input into the noise prediction network to generate the local noise image zT-2 at time step t=T-2, and so on, until the local noise image z0 at time step t=0 is generated, thus obtaining the low-dimensional image 450 of the edited result image 460.

将低维度图像450输入解码器440，解码器440对低维度图像450进行解压缩处理，以得到图像尺寸与样本图像420相同的编辑结果图像460。计算编辑结果图像460中待编辑区域中的生成对象与真实的实例目标或无固定形状的区域的损失值，根据损失值在反向传播的过程中更新模型参数，最终得到训练完成的图像编辑模型。Low-dimensional image 450 is input to decoder 440, which decompresses low-dimensional image 450 to obtain an edited image 460 of the same size as sample image 420. A loss value is calculated between the generated object in the to-be-edited region of edited image 460 and the real instance object or the region with no fixed shape. The model parameters are updated based on the loss value during backpropagation, ultimately yielding a trained image editing model.
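The disclosure does not fix a particular loss function; a mean-squared error restricted to the to-be-edited region, sketched below, is one common choice and is used here purely as an illustrative assumption.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Training loss restricted to the to-be-edited region: compare the
    generated content against the ground truth only where the mask is set."""
    m = mask > 0
    return float(((pred - target) ** 2)[m].mean())

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 8))      # real instance target content
pred = target.copy()
pred[3:5, 3:5] += 0.5                     # generated region deviates by 0.5
mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1

loss = masked_mse(pred, target, mask)     # 0.5**2 averaged over masked pixels
```

The loss drives backpropagation; gradients outside the masked region are zero, matching the scheme in which only the to-be-edited region is generated.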

可选地,在对所述参考掩膜图进行高斯模糊处理,得到所述样本图像对应的至少两个样本掩膜图之后,还包括:根据高斯模糊处理所采用的高斯核的尺寸,确定所述样本掩膜图的模糊等级,根据所述模糊等级标记对应的样本掩膜图。由于采用不同大小的高斯核对参考掩膜图进行高斯模糊处理,得到的样本掩膜图的粗糙程度不同。可以基于高斯模糊处理所采用的高斯核的尺寸对样本掩膜图进行模糊等级标记。例如,精细的样本掩膜图对应的模糊等级为0,随着样本掩膜图的粗糙程度增加,样本掩膜图对应的模糊等级递增。Optionally, after performing Gaussian blur processing on the reference mask image to obtain at least two sample mask images corresponding to the sample image, the method further includes: determining the blur level of the sample mask image based on the size of the Gaussian kernel used in the Gaussian blur processing, and marking the corresponding sample mask image according to the blur level. Since Gaussian blur processing is performed on the reference mask image using Gaussian kernels of different sizes, the resulting sample mask images have different degrees of roughness. The sample mask images can be marked with a blur level based on the size of the Gaussian kernel used in the Gaussian blur processing. For example, a fine sample mask image corresponds to a blur level of 0, and as the roughness of the sample mask image increases, the blur level corresponding to the sample mask image increases.
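The multi-kernel blurring and blur-level labelling might be sketched as below. A dependency-free box blur stands in for Gaussian blur (in practice something like `cv2.GaussianBlur` would be used); the level-0-is-finest convention follows the paragraph above, and the kernel sizes are illustrative assumptions.

```python
import numpy as np

def blur_mask(mask, kernel_size):
    # Box blur as a stand-in for Gaussian blur with a (k, k) kernel;
    # larger kernels yield rougher (softer-edged) sample masks.
    pad = kernel_size // 2
    padded = np.pad(mask, pad, mode="edge")
    out = np.zeros(mask.shape, dtype=float)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + kernel_size, j:j + kernel_size].mean()
    return out

def labeled_sample_masks(reference_mask, kernel_sizes=(1, 5, 15, 31)):
    # Blur level 0 corresponds to the finest mask; the level increases
    # with the kernel size, i.e. with the roughness of the sample mask.
    return {level: blur_mask(reference_mask, k)
            for level, k in enumerate(sorted(kernel_sizes))}
```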

所述根据所述样本图集、样本掩膜图集和描述文本集训练预设编辑模型,得到图像编辑模型,包括:根据所述样本图集、样本掩膜图集、样本掩膜图对应的模糊等级和描述文本集训练预设编辑模型,得到图像编辑模型。具体地,将模糊等级映射为模糊等级向量,将模糊等级向量注入到噪声预测网络,以使模型学习到不同模糊等级与不同粗糙程度的样本掩膜图的对应关系,进而,控制生成对象对待编辑区域的填充程度。The method of training a preset editing model based on the sample atlas, the sample mask atlas, and the descriptive text set to obtain an image editing model includes: training the preset editing model based on the sample atlas, the sample mask atlas, the blur levels corresponding to the sample mask images, and the descriptive text set to obtain the image editing model. Specifically, the blur levels are mapped to blur level vectors, and the blur level vectors are injected into the noise prediction network so that the model learns the correspondence between different blur levels and sample mask images of different roughness levels, thereby controlling the degree to which the generated object fills the area to be edited.
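Mapping the blur level to a vector and injecting it into the noise prediction network might look like the following sketch. The embedding table and the additive injection point are illustrative assumptions only; the disclosure does not specify where in the network the vector is injected.

```python
import numpy as np

rng = np.random.default_rng(0)

class BlurLevelEmbedding:
    # A lookup table mapping each discrete blur level to a vector;
    # in a trained model these vectors would be learnable parameters.
    def __init__(self, num_levels, dim):
        self.table = rng.normal(size=(num_levels, dim))

    def __call__(self, level):
        return self.table[level]

def condition(features, blur_level, embedding):
    # Broadcast-add the blur-level vector onto per-position features,
    # a hypothetical injection point inside the noise prediction network.
    return features + embedding(blur_level)
```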

本公开实施例的技术方案,对待编辑区域进行加噪和去噪,增加局部相关性,提升了编辑结果图像与描述文本的匹配度,提高了模型的生成稳定性,避免对样本图像整体进行加噪和去噪过程中,由于描述文本为图像整体描述,容易出现生成错误,导致编辑结果图像无法满足预期。对于非待编辑区域,在模型训练的所有时间步对应的加噪和去噪操作中,采用样本图像的潜在特征作为输入,降低了模型在图像编辑任务下的重建难度,还降低了生成对象色差,提高模型一致性。The technical solution of the disclosed embodiments adds noise to, and denoises, only the area to be edited, which increases local correlation, improves the match between the edited result image and the description text, and improves the generation stability of the model. It avoids the generation errors that tend to occur when noise is added to and removed from the sample image as a whole, where the description text describes the entire image and the edited result image may therefore fail to meet expectations. For areas not to be edited, the latent features of the sample image are used as input in the noising and denoising operations at all time steps of model training, which reduces the reconstruction difficulty of the model in image editing tasks, reduces color differences in the generated objects, and improves model consistency.

图5为本公开实施例所提供的一种图像编辑装置的结构示意图,该装置可以通过软件和/或硬件的形式实现,可选的,通过电子设备来实现,该电子设备可以是移动终端、PC端或服务器等。Figure 5 is a structural diagram of an image editing device provided by an embodiment of the present disclosure. The device can be implemented in the form of software and/or hardware. Optionally, it can be implemented by an electronic device, which can be a mobile terminal, a PC or a server.

如图5所示,所述装置包括:获取模块510、噪声添加模块520以及图像生成模块530。As shown in FIG5 , the apparatus includes: an acquisition module 510 , a noise adding module 520 and an image generating module 530 .

获取模块510,用于获取原始图像中的待编辑区域和所述待编辑区域对应的目标提示词,其中,所述目标提示词为用于描述图像编辑的预期效果的文本信息;An acquisition module 510 is configured to acquire a region to be edited in an original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe an expected effect of image editing;

噪声添加模块520,用于根据所述待编辑区域确定目标掩膜图,通过图像编辑模型基于所述目标掩膜图向所述待编辑区域添加预设噪声,得到局部噪声图像;a noise adding module 520 for determining a target mask image according to the region to be edited, and adding preset noise to the region to be edited based on the target mask image by an image editing model to obtain a local noise image;

图像生成模块530,用于通过所述图像编辑模型基于所述目标提示词,对所述局部噪声图像的待编辑区域进行噪声预测处理和图像生成处理,根据噪声预测结果和图像生成结果输出目标图像。The image generation module 530 is configured to perform noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word through the image editing model, and output a target image according to the noise prediction results and the image generation results.

可选地,所述图像编辑模型的训练方式包括:Optionally, the image editing model is trained by:

获取样本图集;Get the sample atlas;

对于所述样本图集中的样本图像,确定所述样本图像中的目标区域,对所述目标区域进行膨胀操作,得到样本膨胀图像,根据所述样本图集对应的所述样本膨胀图像确定样本掩膜图集;For a sample image in the sample atlas, determining a target area in the sample image, performing a dilation operation on the target area to obtain a sample dilated image, and determining a sample mask atlas according to the sample dilated image corresponding to the sample atlas;
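The dilation of the target region can be sketched with a simple 4-neighbourhood binary dilation (a dependency-free stand-in for e.g. `cv2.dilate`); enlarging the region slightly gives the sample mask a margin around the instance. The iteration count is an illustrative assumption.

```python
import numpy as np

def dilate(mask, iterations=1):
    # 4-neighbourhood binary dilation: each pass grows the target
    # region by one pixel in the up/down/left/right directions.
    out = mask.astype(bool)
    for _ in range(iterations):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]
        grown[:-1, :] |= out[1:, :]
        grown[:, 1:] |= out[:, :-1]
        grown[:, :-1] |= out[:, 1:]
        out = grown
    return out.astype(mask.dtype)
```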

对所述目标区域进行图像内容识别,得到所述目标区域对应的描述文本,根据所述样本图集对应的所述目标区域的描述文本确定描述文本集;Performing image content recognition on the target area to obtain description text corresponding to the target area, and determining a description text set based on the description text of the target area corresponding to the sample atlas;

根据所述样本图集、样本掩膜图集和描述文本集训练预设编辑模型,得到图像编辑模型。A preset editing model is trained based on the sample atlas, sample mask atlas and description text set to obtain an image editing model.

进一步地,所述根据所述样本图集对应的所述样本膨胀图像确定样本掩膜图集,包括:Furthermore, determining a sample mask atlas according to the sample dilation image corresponding to the sample atlas includes:

对于所述样本图集对应的样本膨胀图像,根据所述样本膨胀图像中的目标区域生成样本图像的参考掩膜图;For the sample dilated image corresponding to the sample atlas, generating a reference mask image of the sample image according to the target area in the sample dilated image;

对所述参考掩膜图进行高斯模糊处理,得到所述样本图像对应的至少两个样本掩膜图;Performing Gaussian blur processing on the reference mask image to obtain at least two sample mask images corresponding to the sample image;

根据所述样本图集中样本图像对应的至少两个样本掩膜图,确定样本掩膜图集。A sample mask atlas is determined according to at least two sample mask images corresponding to the sample images in the sample atlas.

可选地,在对所述参考掩膜图进行高斯模糊处理,得到所述样本图像对应的至少两个样本掩膜图之后,还包括:Optionally, after performing Gaussian blur processing on the reference mask image to obtain at least two sample mask images corresponding to the sample image, the method further includes:

根据高斯模糊处理所采用的高斯核的尺寸,确定所述样本掩膜图的模糊等级,根据所述模糊等级标记对应的样本掩膜图;Determining a blur level of the sample mask image according to a size of a Gaussian kernel used in Gaussian blur processing, and marking a corresponding sample mask image according to the blur level;

所述根据所述样本图集、样本掩膜图集和描述文本集训练预设编辑模型,得到图像编辑模型,包括:The method of training a preset editing model based on the sample atlas, the sample mask atlas, and the description text set to obtain an image editing model includes:

根据所述样本图集、样本掩膜图集、样本掩膜图对应的模糊等级和描述文本集训练预设编辑模型,得到图像编辑模型。A preset editing model is trained according to the sample atlas, the sample mask atlas, the blur levels corresponding to the sample mask images, and the description text set to obtain an image editing model.

可选地,获取模块510具体用于:Optionally, the acquisition module 510 is specifically configured to:

在编辑界面中展示原始图像;Display the original image in the editing interface;

获取针对所述原始图像的区域选择操作,根据所述区域选择操作确定所述原始图像中的待编辑区域;Obtaining a region selection operation for the original image, and determining a region to be edited in the original image according to the region selection operation;

获取针对所述待编辑区域的文本输入操作,根据所述文本输入操作确定所述待编辑区域对应的目标提示词。A text input operation for the area to be edited is acquired, and a target prompt word corresponding to the area to be edited is determined according to the text input operation.

可选地,噪声添加模块520具体用于:Optionally, the noise adding module 520 is specifically configured to:

根据所述待编辑区域中的前景内容确定目标掩膜图;Determining a target mask image according to the foreground content in the area to be edited;

通过所述图像编辑模型生成所述原始图像的潜在特征图,结合所述目标掩膜图和潜在特征图确定所述潜在特征图中的待编辑区域;Generate a potential feature map of the original image by using the image editing model, and determine a region to be edited in the potential feature map by combining the target mask map and the potential feature map;

基于所述潜在特征图中的待编辑区域执行设定次数的噪声添加操作,得到局部噪声图像。A noise adding operation is performed a set number of times based on the area to be edited in the potential feature map to obtain a local noise image.

进一步地,所述基于所述潜在特征图中的待编辑区域执行设定次数的噪声添加操作,得到局部噪声图像,包括:Furthermore, performing a set number of noise addition operations based on the area to be edited in the potential feature map to obtain a local noise image includes:

获取噪声特征图,其中,所述噪声特征图表征预设噪声;Acquire a noise characteristic graph, wherein the noise characteristic graph represents a preset noise;

结合所述目标掩膜图和噪声特征图确定噪声特征区域;Determining a noise feature area by combining the target mask image and the noise feature image;

基于所述待编辑区域和噪声特征区域执行设定次数的噪声添加操作,得到局部噪声图像。A noise adding operation is performed a set number of times based on the area to be edited and the noise feature area to obtain a local noise image.

可选地,图像生成模块530具体用于:Optionally, the image generation module 530 is specifically configured to:

通过所述图像编辑模型根据所述局部噪声图像的待编辑区域中的图像特征确定预测噪声;Determining predicted noise according to image features in the to-be-edited area of the local noise image by the image editing model;

通过所述图像编辑模型基于所述目标提示词生成文本特征,确定所述文本特征与局部噪声图像的相关性;generating text features based on the target prompt word using the image editing model, and determining the correlation between the text features and the local noise image;

根据所述相关性和局部噪声图像生成局部编辑图像,基于所述预测噪声对局部编辑图像进行去噪操作,输出目标图像。A local edited image is generated according to the correlation and the local noise image, a denoising operation is performed on the local edited image based on the predicted noise, and a target image is output.
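Putting the three modules together, inference might be sketched as below. `predict_noise` is a stand-in for the image editing model's noise prediction network conditioned on the target prompt word; the schedule values, shapes, and fixed random seed are assumptions made to keep the sketch self-contained.

```python
import numpy as np

def edit_latent(latent, mask, predict_noise, alphas):
    # Inference-time local diffusion loop: noise the to-be-edited
    # region once, then iteratively subtract the predicted noise,
    # reinstating the non-edited latents at every step.
    rng = np.random.default_rng(0)
    z = np.where(mask > 0, rng.normal(size=latent.shape), latent)
    for t in reversed(range(len(alphas))):
        eps = predict_noise(z, t)  # prompt conditioning omitted here
        a = alphas[t]
        denoised = (z - np.sqrt(1.0 - a) * eps) / np.sqrt(a)
        z = np.where(mask > 0, denoised, latent)
    return z
```

The final latent would then be passed to the decoder to obtain the target image at the original resolution.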

可选地,还包括:Optionally, it also includes:

等级设置模块,用于获取掩膜设置信息,所述掩膜设置信息包括模糊等级;A level setting module, configured to obtain mask setting information, wherein the mask setting information includes a blur level;

进一步地,所述根据所述相关性和局部噪声图像生成局部编辑图像,包括:Furthermore, generating a local edited image according to the correlation and the local noise image includes:

基于所述模糊等级确定生成对象对待编辑区域的填充程度;Determining a filling degree of the area to be edited by the generated object based on the blur level;

根据所述相关性、填充程度和局部噪声图像生成局部编辑图像,其中,所述局部编辑图像中的待编辑区域包括所述生成对象。A local editing image is generated according to the correlation, the filling degree and the local noise image, wherein the area to be edited in the local editing image includes the generated object.

本公开实施例所提供的图像编辑装置可执行本公开任意实施例所提供的图像编辑方法,具备执行方法相应的功能模块和有益效果。The image editing device provided by the embodiments of the present disclosure can execute the image editing method provided by any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects of the execution method.

值得注意的是,上述装置所包括的各个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,各功能单元的具体名称也只是为了便于相互区分,并不用于限制本公开实施例的保护范围。It is worth noting that the various units and modules included in the above-mentioned device are only divided according to functional logic, but are not limited to the above-mentioned division, as long as the corresponding functions can be achieved; in addition, the specific names of the functional units are only for the convenience of distinguishing each other, and are not used to limit the protection scope of the embodiments of the present disclosure.

图6为本公开实施例所提供的一种电子设备的结构示意图。下面参考图6,其示出了适于用来实现本公开实施例的电子设备(例如图6中的终端设备或服务器)600的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图6示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。FIG6 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure. Referring to FIG6 , a schematic diagram of the structure of an electronic device (such as a terminal device or server in FIG6 ) 600 suitable for implementing an embodiment of the present disclosure is shown below. The terminal device in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc. The electronic device shown in FIG6 is merely an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.

如图6所示,电子设备600可以包括处理装置(例如中央处理器、图形处理器等)601,其可以根据存储在只读存储器(ROM)602中的程序或者从存储装置608加载到随机访问存储器(RAM)603中的程序而执行各种适当的动作和处理。在RAM 603中,还存储有电子设备600操作所需的各种程序和数据。处理装置601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出(I/O)接口605也连接至总线604。As shown in FIG6 , electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. Various programs and data required for the operation of electronic device 600 are also stored in RAM 603. Processing device 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

通常,以下装置可以连接至I/O接口605:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置606;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置607;包括例如磁带、硬盘等的存储装置608;以及通信装置609。通信装置609可以允许电子设备600与其他设备进行无线或有线通信以交换数据。虽然图6示出了具有各种装置的电子设备600,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。Typically, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 6 shows the electronic device 600 with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present.

特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置609从网络上被下载和安装,或者从存储装置608被安装,或者从ROM 602被安装。在该计算机程序被处理装置601执行时,执行本公开实施例的方法中限定的上述功能。In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program includes program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from the network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the method of the embodiment of the present disclosure are performed.

本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的,而并不是用于对这些消息或信息的范围进行限制。The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are only used for illustrative purposes and are not used to limit the scope of these messages or information.

本公开实施例提供的电子设备与上述实施例提供的图像编辑方法属于同一发明构思,未在本实施例中详尽描述的技术细节可参见上述实施例,并且本实施例与上述实施例具有相同的有益效果。The electronic device provided by the embodiment of the present disclosure and the image editing method provided by the above embodiment belong to the same inventive concept. For technical details not fully described in this embodiment, please refer to the above embodiment, and this embodiment has the same beneficial effects as the above embodiment.

本公开实施例提供了一种计算机存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例所提供的图像编辑方法。An embodiment of the present disclosure provides a computer storage medium having a computer program stored thereon. When the program is executed by a processor, the image editing method provided by the above embodiment is implemented.

需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or component, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, device, or component. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. 
Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.

在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。In some embodiments, the client and server can communicate using any currently known or later developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internet (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or later developed network.

上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。The computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.

上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device:

获取原始图像中的待编辑区域和所述待编辑区域对应的目标提示词,其中,所述目标提示词为用于描述图像编辑的预期效果的文本信息;Acquire a region to be edited in the original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe the expected effect of image editing;

根据所述待编辑区域确定目标掩膜图,通过图像编辑模型基于所述目标掩膜图向所述待编辑区域添加预设噪声,得到局部噪声图像;Determining a target mask image according to the area to be edited, and adding preset noise to the area to be edited based on the target mask image using an image editing model to obtain a local noise image;

通过所述图像编辑模型基于所述目标提示词,对所述局部噪声图像的待编辑区域进行噪声预测处理和图像生成处理,根据噪声预测结果和图像生成结果输出目标图像。The image editing model performs noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word, and outputs a target image according to the noise prediction result and the image generation result.

可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, or a combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may be executed entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).

附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the accompanying drawings illustrate the possible implementation architecture, functions and operations of the systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each box in the flowchart or block diagram can represent a module, program segment, or a part of code, and the module, program segment, or a part of code contains one or more executable instructions for realizing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the box can also occur in a different order than that marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagram and/or flowchart, and the combination of the boxes in the block diagram and/or flowchart, can be implemented with a dedicated hardware-based system that performs the specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.

描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在某种情况下并不构成对该单元本身的限定。The units involved in the embodiments described in this disclosure may be implemented in software or hardware, wherein the name of a unit does not necessarily limit the unit itself.

本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.

在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, device or equipment. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or equipment, or any suitable combination of the foregoing. A more specific example of a machine-readable storage medium can include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above-mentioned technical features, but also includes other technical solutions formed by any combination of the above-mentioned technical features or their equivalents without departing from the above-mentioned disclosed concepts. For example, a technical solution formed by replacing the above-mentioned features with (but not limited to) technical features with similar functions disclosed in this disclosure.

此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。In addition, although each operation is described in a specific order, this should not be understood as requiring these operations to be performed in the specific order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although some specific implementation details have been included in the above discussion, these should not be interpreted as limiting the scope of the present disclosure. Some features described in the context of a separate embodiment can also be implemented in a single embodiment in combination. On the contrary, the various features described in the context of a single embodiment can also be implemented in multiple embodiments individually or in any suitable sub-combination mode.

尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (12)

一种图像编辑方法,包括:An image editing method, comprising:
获取原始图像中的待编辑区域和所述待编辑区域对应的目标提示词,其中,所述目标提示词为用于描述图像编辑的预期效果的文本信息;Acquire a region to be edited in the original image and a target prompt word corresponding to the region to be edited, wherein the target prompt word is text information used to describe the expected effect of image editing;
根据所述待编辑区域确定目标掩膜图,通过图像编辑模型基于所述目标掩膜图向所述待编辑区域添加预设噪声,得到局部噪声图像;Determining a target mask image according to the area to be edited, and adding preset noise to the area to be edited based on the target mask image using an image editing model to obtain a local noise image;
通过所述图像编辑模型基于所述目标提示词,对所述局部噪声图像的待编辑区域进行噪声预测处理和图像生成处理,根据噪声预测结果和图像生成结果输出目标图像。The image editing model performs noise prediction processing and image generation processing on the to-be-edited area of the local noise image based on the target prompt word, and outputs a target image according to the noise prediction result and the image generation result.

根据权利要求1所述的方法,其中,所述图像编辑模型的训练方式包括:The method according to claim 1, wherein the training method of the image editing model comprises:
获取样本图集;Get the sample atlas;
对于所述样本图集中的样本图像,确定所述样本图像中的目标区域,对所述目标区域进行膨胀操作,得到样本膨胀图像,根据所述样本图集对应的所述样本膨胀图像确定样本掩膜图集;For a sample image in the sample atlas, determining a target area in the sample image, performing a dilation operation on the target area to obtain a sample dilated image, and determining a sample mask atlas according to the sample dilated image corresponding to the sample atlas;
对所述目标区域进行图像内容识别,得到所述目标区域对应的描述文本,根据所述样本图集对应的所述目标区域的描述文本确定描述文本集;Performing image content recognition on the target area to obtain description text corresponding to the target area, and determining a description text set based on the description text of the target area corresponding to the sample atlas;
根据所述样本图集、样本掩膜图集和描述文本集训练预设编辑模型,得到图像编辑模型。A preset editing model is trained based on the sample atlas, sample mask atlas and description text set to obtain an image editing model.
根据权利要求2所述的方法,其中,所述根据所述样本图集对应的所述样本膨胀图像确定样本掩膜图集,包括:The method according to claim 2, wherein determining a sample mask atlas based on the sample dilation image corresponding to the sample atlas comprises:
对于所述样本图集对应的样本膨胀图像,根据所述样本膨胀图像中的目标区域生成样本图像的参考掩膜图;For the sample dilated image corresponding to the sample atlas, generating a reference mask image of the sample image according to the target area in the sample dilated image;
对所述参考掩膜图进行高斯模糊处理,得到所述样本图像对应的至少两个样本掩膜图;Performing Gaussian blur processing on the reference mask image to obtain at least two sample mask images corresponding to the sample image;
根据所述样本图集中样本图像对应的至少两个样本掩膜图,确定样本掩膜图集,其中,所述至少两个样本掩膜图与描述文本相关联。A sample mask atlas is determined according to at least two sample mask images corresponding to the sample images in the sample atlas, wherein the at least two sample mask images are associated with description texts.

根据权利要求3所述的方法,其中,在对所述参考掩膜图进行高斯模糊处理,得到所述样本图像对应的至少两个样本掩膜图之后,所述方法还包括:The method according to claim 3, wherein, after performing Gaussian blur processing on the reference mask image to obtain at least two sample mask images corresponding to the sample image, the method further comprises:
根据高斯模糊处理所采用的高斯核的尺寸,确定所述样本掩膜图的模糊等级,根据所述模糊等级标记对应的样本掩膜图;Determining a blur level of the sample mask image according to a size of a Gaussian kernel used in Gaussian blur processing, and marking a corresponding sample mask image according to the blur level;
所述根据所述样本图集、样本掩膜图集和描述文本集训练预设编辑模型,得到图像编辑模型,包括:The method of training a preset editing model based on the sample atlas, the sample mask atlas, and the description text set to obtain an image editing model includes:
根据所述样本图集、样本掩膜图集、样本掩膜图对应的模糊等级和描述文本集训练预设编辑模型,得到图像编辑模型。A preset editing model is trained according to the sample atlas, the sample mask atlas, the blur levels corresponding to the sample mask images, and the description text set to obtain an image editing model.
5. The method according to any one of claims 1 to 4, wherein the acquiring a region to be edited in an original image and a target prompt corresponding to the region to be edited comprises:
displaying the original image in an editing interface;
acquiring a region selection operation on the original image, and determining the region to be edited in the original image according to the region selection operation; and
acquiring a text input operation for the region to be edited, and determining the target prompt corresponding to the region to be edited according to the text input operation.
6. The method according to any one of claims 1 to 5, wherein the determining a target mask image according to the region to be edited, and adding preset noise to the region to be edited based on the target mask image by means of the image editing model to obtain a local noise image comprises:
determining the target mask image according to foreground content in the region to be edited;
generating a latent feature map of the original image by means of the image editing model, and determining a region to be edited in the latent feature map by combining the target mask image and the latent feature map; and
performing a set number of noise addition operations on the region to be edited in the latent feature map, to obtain the local noise image.
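Claim 6 combines the pixel-space mask with a latent feature map of the original image. In latent-space editing pipelines the mask is typically downsampled to the latent resolution before it can select latent positions; a small sketch of that step, where the 8x latent scale is an assumption typical of such models and is not stated in the claims:

```python
import numpy as np

def mask_to_latent(mask, scale=8):
    """Downsample a pixel-space binary mask to latent resolution.

    Average-pools scale x scale blocks, then re-binarizes so that any
    partially covered latent cell counts as part of the edit region.
    """
    h, w = mask.shape
    pooled = mask.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    return (pooled > 0).astype(float)

mask = np.zeros((64, 64))
mask[16:32, 16:32] = 1.0
latent_mask = mask_to_latent(mask, scale=8)
# 64x64 pixel mask -> 8x8 latent mask covering the same region
```

Re-binarizing with `> 0` errs on the side of including border cells, so the latent edit region never undershoots the user's selection.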
7. The method according to claim 6, wherein the performing a set number of noise addition operations on the region to be edited in the latent feature map to obtain the local noise image comprises:
acquiring a noise feature map, wherein the noise feature map represents the preset noise;
determining a noise feature region by combining the target mask image and the noise feature map; and
performing the set number of noise addition operations based on the region to be edited and the noise feature region, to obtain the local noise image.
8. The method according to any one of claims 1 to 7, wherein the performing, by means of the image editing model and based on the target prompt, noise prediction processing and image generation processing on the region to be edited of the local noise image, and outputting a target image according to a noise prediction result and an image generation result comprises:
determining, by means of the image editing model, predicted noise according to image features in the region to be edited of the local noise image;
generating, by means of the image editing model, a text feature based on the target prompt, and determining a correlation between the text feature and the local noise image; and
generating a locally edited image according to the correlation and the local noise image, performing a denoising operation on the locally edited image based on the predicted noise, and outputting the target image.
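Claims 7 and 8 pair the masked forward noising with a mask-restricted denoising step driven by predicted noise. A toy numpy sketch of a single such step; the blend schedule is the same illustrative assumption as above, and the model's noise predictor is replaced by the true noise so the inversion can be verified exactly:

```python
import numpy as np

def denoise_step(noisy, predicted_noise, mask, alpha=0.5):
    """Invert x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps inside the mask.

    With a perfect noise prediction this recovers x_0 in the masked
    region and leaves the rest of the image untouched.
    """
    x0_estimate = (noisy - np.sqrt(1.0 - alpha) * predicted_noise) / np.sqrt(alpha)
    m = mask[..., None]
    return m * x0_estimate + (1.0 - m) * noisy

rng = np.random.default_rng(0)
clean = np.full((4, 4, 3), 0.5)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
eps = rng.standard_normal(clean.shape)
alpha = 0.5
m = mask[..., None]
# Forward-noise only the masked region, as in claims 1 and 7
noisy = m * (np.sqrt(alpha) * clean + np.sqrt(1 - alpha) * eps) + (1 - m) * clean
restored = denoise_step(noisy, eps, mask, alpha)
```

In the claimed method the predictor is the trained image editing model conditioned on the target prompt, so the recovered region contains newly generated content matching the prompt rather than the original pixels.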
9. The method according to claim 8, further comprising:
acquiring mask setting information, the mask setting information comprising a blur level;
wherein the generating a locally edited image according to the correlation and the local noise image comprises:
determining, based on the blur level, a degree to which a generated object fills the region to be edited; and
generating the locally edited image according to the correlation, the degree of filling, and the local noise image, wherein the region to be edited in the locally edited image comprises the generated object.
10. An image editing apparatus, comprising:
an acquisition module configured to acquire a region to be edited in an original image and a target prompt corresponding to the region to be edited, wherein the target prompt is text information describing an expected effect of the image editing;
a noise addition module configured to determine a target mask image according to the region to be edited, and add preset noise to the region to be edited based on the target mask image by means of an image editing model, to obtain a local noise image; and
an image generation module configured to perform, by means of the image editing model and based on the target prompt, noise prediction processing and image generation processing on the region to be edited of the local noise image, and output a target image according to a noise prediction result and an image generation result.
11. An electronic device, comprising:
one or more processors; and
a storage apparatus configured to store one or more programs, wherein
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image editing method according to any one of claims 1 to 9.
12. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the image editing method according to any one of claims 1 to 9.
PCT/CN2025/088422 2024-04-15 2025-04-11 Image editing method and apparatus, device, and storage medium Pending WO2025218588A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410455156.2A CN118247388A (en) 2024-04-15 2024-04-15 Image editing method, device, equipment and storage medium
CN202410455156.2 2024-04-15

Publications (1)

Publication Number Publication Date
WO2025218588A1

Family

ID=91555328

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2025/088422 Pending WO2025218588A1 (en) 2024-04-15 2025-04-11 Image editing method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN118247388A (en)
WO (1) WO2025218588A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118247388A (en) * 2024-04-15 2024-06-25 北京字跳网络技术有限公司 Image editing method, device, equipment and storage medium
CN119516038A (en) * 2024-10-08 2025-02-25 厦门大学 A text-guided image editing method, device, equipment and medium
CN119515697B (en) * 2024-11-11 2025-11-21 中移动信息技术有限公司 Image fusion method, device, apparatus, storage medium and program product
CN119379842B (en) * 2024-12-31 2025-05-30 中国计量大学 Image generation method, system and medium based on object semantics

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543075A (en) * 2023-03-31 2023-08-04 北京百度网讯科技有限公司 Image generation method, device, electronic device and storage medium
CN116958324A (en) * 2023-07-24 2023-10-27 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model
US20230377225A1 (en) * 2022-05-19 2023-11-23 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for editing an image and method and apparatus for training an image editing model, device and medium
CN117372574A (en) * 2023-10-23 2024-01-09 科大讯飞股份有限公司 Image editing method, device, equipment and readable storage medium
CN117541511A (en) * 2023-12-04 2024-02-09 北京字跳网络技术有限公司 An image processing method, device, electronic equipment and storage medium
CN117670658A (en) * 2023-12-12 2024-03-08 支付宝(杭州)信息技术有限公司 Image generation method, device, electronic equipment and storage medium
CN118247388A (en) * 2024-04-15 2024-06-25 北京字跳网络技术有限公司 Image editing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN118247388A (en) 2024-06-25

Similar Documents

Publication Publication Date Title
WO2025218588A1 (en) Image editing method and apparatus, device, and storage medium
US20230394671A1 (en) Image segmentation method and apparatus, and device, and storage medium
CN112561840B (en) Video clipping method and device, storage medium and electronic equipment
WO2022166872A1 (en) Special-effect display method and apparatus, and device and medium
CN111414879B (en) Face shielding degree identification method and device, electronic equipment and readable storage medium
CN111325704B (en) Image restoration method and device, electronic equipment and computer-readable storage medium
CN114549722B (en) Rendering method, device, equipment and storage medium of 3D material
CN110349107B (en) Image enhancement method, device, electronic equipment and storage medium
CN110298851B (en) Training method and device for human body segmentation neural network
CN112330788B (en) Image processing method, device, readable medium and electronic device
CN114913061A (en) Image processing method and device, storage medium and electronic equipment
WO2024240222A1 (en) Image stylization processing method and apparatus, device, storage medium and program product
CN110689478A (en) Image stylization processing method, device, electronic device and readable medium
CN115205305A (en) Instance segmentation model training method, instance segmentation method and device
CN114972020B (en) Image processing method, device, storage medium and electronic device
CN114419298A (en) Virtual object generation method, device, equipment and storage medium
CN117830516A (en) Light field reconstruction method, light field reconstruction device, electronic equipment, medium and product
CN114742934B (en) Image rendering method and device, readable medium and electronic equipment
CN115546487A (en) Image model training method, device, medium and electronic equipment
CN111680754B (en) Image classification method, device, electronic equipment and computer readable storage medium
CN111862342B (en) Augmented reality texture processing method and device, electronic equipment and storage medium
WO2024131652A1 (en) Special effect processing method and apparatus, and electronic device and storage medium
CN119251373A (en) Map generation method, device, medium, equipment and computer program product
CN112418233B (en) Image processing method and device, readable medium and electronic equipment
WO2024131503A1 (en) Special-effect image generation method and apparatus, and device and storage medium