
WO2025152653A1 - Method and apparatus for image processing, and device and storage medium - Google Patents

Method and apparatus for image processing, and device and storage medium

Info

Publication number
WO2025152653A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
style
features
target image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/138045
Other languages
French (fr)
Chinese (zh)
Inventor
齐天浩
邬彦泽
刘佳伟
方山城
刘玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Publication of WO2025152653A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/20Linear translation of whole images or parts thereof, e.g. panning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/40Filling a planar surface by adding surface attributes, e.g. colour or texture

Definitions

  • a device for image processing includes: a first image feature extraction module configured to obtain a first reference image feature of a first reference image; a style-related feature module configured to determine the style-related features of the first reference image based on the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation; and a first target image generation module configured to generate a first target image based on the style-related features of the first reference image and a first input text indicating the target image content, the first target image matching both the image style of the first reference image and the target image content.
  • an electronic device, in a third aspect of the present disclosure, includes at least one processing unit; and at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.
  • a computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor to implement the method of the first aspect.
  • a computer program product, which is tangibly stored in a computer storage medium and includes computer-executable instructions which, when executed by a device, cause the device to perform the method of the first aspect.
  • FIG5A shows a schematic diagram of a training task for style feature extraction according to some embodiments of the present disclosure
  • FIG7 shows a block diagram of an apparatus for image processing according to some embodiments of the present disclosure.
  • FIG8 shows a block diagram of a device capable of implementing various embodiments of the present disclosure.
  • in response to receiving an active request from the user, the prompt information may be sent to the user in the form of a pop-up window, in which the prompt information may be presented in text form.
  • the pop-up window may also carry a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.
  • executing a step “in response to A” does not mean executing the step immediately after “A” but may include one or more intermediate steps.
  • model can learn the association between the corresponding input and output from the training data, so that after the training is completed, the corresponding output can be generated for a given input.
  • the generation of the model can be based on machine learning technology. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs by using multiple layers of processing units.
  • model may also be referred to as “machine learning model”, “machine learning network” or “network”, which are used interchangeably in this article.
  • a model may also include different types of processing units or networks.
  • a target image matches an image style, which may refer to the target image visually having a style consistent with or similar to the image style.
  • a target image matches an image content, which may refer to the target image visually containing the same or similar content as the target content.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented.
  • an image processing system 120, also referred to as system 120, is deployed in an electronic device 110.
  • the image processing system 120 is configured to generate a target image 105 based on an input text 102 and a reference image 101.
  • the reference image 101 can be used to provide or indicate an image element that is desired to be included in the target image 105, such as an image style, image content, etc.
  • the input text can be used to indicate another image element that is desired to be included in the target image 105.
  • image element can indicate various explicit elements or implicit elements that constitute an image.
  • an image element can include an image style or image content. Examples of image styles include, but are not limited to, watercolor, crayons, sketches, comics, etc. Examples of image content can include, but are not limited to, at least a portion of the foreground of an image, at least a portion of the background of an image, etc.
  • the reference image features 202 can be used as inputs to the feature transformation model 230.
  • the feature transformation model 230 has other inputs, including a predetermined prompt 204 (also referred to as a first predetermined prompt) for style feature extraction and a query representation 203 (also referred to as a first query representation) for style feature extraction.
  • the predetermined prompt 204 can be any suitable text or prompt word.
  • the predetermined prompt 204 is shown as the text "style" for exemplary purposes only and is not intended to be limiting.
  • the first predetermined prompt is specific to style feature extraction and is independent of the reference image. That is, the first predetermined prompt does not change with the reference image.
  • the feature transformation model 230 can be implemented with a network of any suitable structure. As an example and without any limitation, the feature transformation model 230 can be a query transformer.
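As a concrete illustration, the cross-attention core of such a query-transformer-style feature transformation model might look like the sketch below, in which the embedding of the predetermined prompt and the learnable query representation jointly attend to the reference image features. The single-head, single-layer structure and all weight matrices are simplifying assumptions; the patent deliberately leaves the network structure open.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_style_features(image_feats, prompt_embeds, queries, W_q, W_k, W_v):
    """One cross-attention step of a query-transformer-like model (a sketch).

    image_feats:   (N, d) reference image features from the image encoder
    prompt_embeds: (P, d) embedding of the predetermined prompt (e.g. "style")
    queries:       (Q, d) learnable query representation
    """
    # Prompt tokens and learnable queries together form the attention queries.
    q = np.concatenate([prompt_embeds, queries], axis=0) @ W_q
    k = image_feats @ W_k
    v = image_feats @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    # Keep only the learnable-query positions as the style-related features.
    return out[prompt_embeds.shape[0]:]
```

In a trained model the query representation would be updated jointly with the transformer weights, which is consistent with the training description later in this document.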
  • the system 120 can generate a target image 210 based on the style-related features 205 and the input text 209 indicating the content of the target image.
  • the input text 209 can indicate the content that the target image is expected to contain in any suitable characters, such as but not limited to people, animals, objects, scenery, buildings, etc.
  • the target content indicated by the input text 209 is "panda", but this is only exemplary and not intended to be any limitation.
  • the target image 210 has the style of the reference image 201 (i.e., style A) and contains the content indicated by the input text 209 (i.e., panda).
  • Model 240, also referred to as a first model, may be used.
  • Model 240 may be implemented using any suitable network structure.
  • model 240 may be a diffusion model that may perform multiple denoising steps.
  • Model 240 may also be of other types, such as a generative adversarial network.
  • features with larger sizes include more details of the image and correspond to the finer processing layers. It is generally believed that such fine processing layers are mainly responsible for generating style-related elements of the image, such as color and structure.
  • features with smaller sizes include high-level semantic information of the image and correspond to the coarser processing layers. It is generally believed that such coarse processing layers are mainly responsible for the semantic generation of the image. Therefore, by injecting style-related features only into the fine processing layers and not into the coarse processing layers, the style and semantics of the reference image can be further decoupled. This embodiment advantageously enables the model 240 to focus more on the style of the reference image during image generation.
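The size-based injection rule described in these two items can be sketched as a simple layer-selection helper. The flat list of per-layer feature sizes and the threshold values are illustrative assumptions; the patent only states that style features go to layers above a first threshold size and semantic features to layers below a second threshold size.

```python
def layers_for_feature_injection(layer_sizes, style_threshold, semantic_threshold):
    """Decide which processing layers receive which injected features.

    layer_sizes: feature size handled by each processing layer, in layer order.
    Returns (style_layers, semantic_layers) as lists of layer indices:
    style-related features go to fine layers (size > style_threshold),
    semantic features go to coarse layers (size < semantic_threshold).
    """
    style_layers = [i for i, size in enumerate(layer_sizes) if size > style_threshold]
    semantic_layers = [i for i, size in enumerate(layer_sizes) if size < semantic_threshold]
    return style_layers, semantic_layers
```

For a U-Net-like size profile such as `[64, 32, 16, 8, 8, 16, 32, 64]` with both thresholds at 16, the style features would be injected into the outer (fine) layers and the semantic features into the inner (coarse) layers only.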
  • the reference image features 302 can be used as inputs to the feature transformation model 330.
  • the feature transformation model 330 has other inputs, including a predetermined prompt 304 (also referred to as a second predetermined prompt) for semantic feature extraction and a query representation 303 (also referred to as a second query representation) for semantic feature extraction.
  • the predetermined prompt 304 can be any suitable text or prompt word.
  • the predetermined prompt 304 is shown as the text "content" for exemplary purposes only and is not intended to be limiting.
  • the second predetermined prompt is specific to semantic feature extraction and is independent of the reference image. That is, the second predetermined prompt does not change with the reference image.
  • the feature transformation model 330 can be implemented with a network of any suitable structure. As an example and without any limitation, the feature transformation model 330 can be a query transformer.
  • the feature transformation model 330 can generate semantically relevant features 305 of the reference image 301 based on the reference image features 302, the second predetermined prompt 304, and the second query representation 303. In other words, under the prompt or guidance of the second predetermined prompt 304 and the second query representation 303, the feature transformation model 330 can extract features related to the image semantics from the reference image 301.
  • the system 120 can generate a target image 310 based on the semantically relevant features 305 and the input text 309 indicating the style of the target image.
  • the input text 309 can indicate the style that the target image is expected to have in any suitable characters.
  • the target style indicated by the input text 309 is "Style B", but this is only exemplary and not intended to be limiting.
  • the target image 310 has the content of the reference image 301 (i.e., a smiling face) and contains the style indicated by the input text 309 (i.e., Style B).
  • the system 120 may use any suitable algorithm or model to generate the target image 310.
  • the model 240 may be used. Specifically, the text encoder 350 may generate text features 307 of the input text 309. The text features 307 may be provided to the model 240 as text conditions. In addition, the model 240 may also receive semantically relevant features 305. The model 240 may generate target image features 308 based on the text features 307 and the semantically relevant features 305. The target image features 308 may then be used to generate the target image 310. For example, the system 120 may utilize an image decoder (not shown) to convert the target image features 308 into the target image 310.
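The generation flow of this paragraph reduces to a short pipeline. The three component interfaces below are assumptions: the patent names the text encoder, model 240, and the image decoder only functionally, without fixing their signatures.

```python
def generate_target_image(text_encoder, model, image_decoder,
                          input_text, injected_feats):
    """End-to-end flow sketched from the paragraph above.

    injected_feats stands for either the style-related features (FIG. 2 path)
    or the semantically relevant features (FIG. 3 path).
    """
    text_feats = text_encoder(input_text)              # text condition
    target_image_feats = model(text_feats, injected_feats)  # e.g. model 240
    return image_decoder(target_image_feats)           # final target image
```

With any concrete encoder, generative model, and decoder plugged in, the same function covers both the style-transfer and content-transfer directions described in FIGs. 2 and 3.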
  • model 240 includes 16 processing layers, which are numbered 0, 1, ..., 15 in sequence.
  • Semantically relevant features 305 may be provided to layers 4 to 8. The combination of semantically relevant features and text features will be described below with reference to Figure 4.
  • a training task for style feature extraction, also known as a style representation extraction (STRE) task, may be used.
  • FIG5A shows a schematic diagram of a training task 500A for style feature extraction according to some embodiments of the present disclosure.
  • the STRE task described above is a non-reconstruction task. Such a training task helps decouple the style and semantics of the reference image, and also ensures that the image information (in this example, the style information) during training does not become so strong that it "drowns out" the prompt information of the input text.
  • the upper half branch is a style branch.
  • the feature transformation model 230 can generate style-related features 533 of the sample image 531 based on the sample image features of the sample image 531 (e.g., the output of the image encoder 220), the first predetermined prompt 204, and the first query representation 203.
  • corresponding sample sets can be created.
  • a group of style words indicating different styles and a group of subject words indicating different contents can be determined.
  • multiple prompt information can be obtained, for example, multiple prompt words.
  • Each prompt information indicates the style and content to be generated.
  • the same prompt information can generate multiple sample images.
  • any pair of sample images among the multiple sample images generated by the same prompt information can be used for the training task of style feature extraction. For example, sample image 511 and sample image 520 in Figure 5A are generated by the same prompt information. Compared with image pairs using the same style words but different subject words, image pairs generated using the same prompt information can obtain better stylization effects.
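The sample-set construction above — prompts built by combining style words and subject words, with training pairs drawn from distinct images generated by the same prompt — can be sketched as follows. The prompt template wording is an assumption; only the style/subject combination scheme comes from the text.

```python
from itertools import combinations, product

def build_prompts(style_words, subject_words,
                  template="a {subject} in {style} style"):
    """Combine each style word with each subject word into one prompt."""
    return [template.format(subject=subj, style=sty)
            for sty, subj in product(style_words, subject_words)]

def same_prompt_pairs(images_by_prompt):
    """Training pairs of distinct sample images generated from the SAME
    prompt, which the text above prefers over pairs that merely share a
    style word but differ in subject word."""
    pairs = []
    for prompt, images in images_by_prompt.items():
        for img_a, img_b in combinations(images, 2):
            pairs.append((prompt, img_a, img_b))
    return pairs
```

Each pair then plays the roles of the first and second sample image in the STRE task: style features are extracted from one image and used as a condition when denoising the other.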
  • FIG6 shows a flowchart of a process 600 of image processing according to some embodiments of the present disclosure.
  • the process 600 may be implemented at the electronic device 110.
  • the electronic device 110 generates a first target image based on the style-related features of the first reference image and a first input text indicating the content of the target image.
  • the first target image matches the image style of the first reference image and the content of the target image.
  • generating a first target image includes: obtaining a first text feature of a first input text; generating a first target image feature using a first model based on style-related features and the first text feature, wherein the first model includes multiple processing layers, the multiple processing layers are respectively used to process features of corresponding sizes, and the style-related features are provided to a first number of processing layers having a size greater than a first threshold size; and generating a first target image based on the first target image feature.
  • generating a first target image feature using a first model includes: for a given processing layer in the first number of processing layers: converting the input image features of the given processing layer into query features; generating key features and value features by converting first text features and style-related features; and generating output image features of the given processing layer based on the query features, key features, and value features.
  • input image features are converted into query features using a first conversion unit
  • generating key features and value features includes: converting the first text feature into a first intermediate text feature using a second conversion unit; converting the first text feature into a second intermediate text feature using a third conversion unit; converting the style-related feature into a first intermediate style feature using a fourth conversion unit; converting the style-related feature into a second intermediate style feature using a fifth conversion unit; combining the first intermediate text feature and the first intermediate style feature into a key feature; and combining the second intermediate text feature and the second intermediate style feature into a value feature.
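The five conversion units and the key/value combination described in this item correspond to a cross-attention step along the following lines. This is a single-head, unbatched sketch with our own names for the projection matrices; the patent specifies only the functional roles of the conversion units.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_cross_attention(img_feats, text_feats, style_feats, W):
    """Cross-attention with separate text and style key/value projections.

    W is a dict of the five conversion units; the key names are ours.
    """
    q = img_feats @ W["q"]                 # first unit: query features
    k_text = text_feats @ W["k_text"]      # second unit: first intermediate text feature
    v_text = text_feats @ W["v_text"]      # third unit: second intermediate text feature
    k_style = style_feats @ W["k_style"]   # fourth unit: first intermediate style feature
    v_style = style_feats @ W["v_style"]   # fifth unit: second intermediate style feature
    k = np.concatenate([k_text, k_style], axis=0)  # combined key features
    v = np.concatenate([v_text, v_style], axis=0)  # combined value features
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v                        # output image features of the layer
```

Because the text projections and the style projections are separate units, they can be trained in different patterns, as the next two items state.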
  • the first transformation unit, the second transformation unit, and the third transformation unit are obtained through a first training pattern
  • the fourth transformation unit and the fifth transformation unit are obtained through a second training pattern different from the first training pattern
  • style-related features are determined using a feature transformation model
  • the first query representation is obtained by training the feature transformation model
  • the training of the feature transformation model includes: obtaining a first sample image and a second sample image having the same image style; determining style-related features of the first sample image using the feature transformation model based on first sample image features of the first sample image, a first predetermined prompt, and a first query representation; and updating the feature transformation model and the first query representation by noising and then denoising the second sample image, taking the style-related features of the first sample image and text indicating the image content of the second sample image as conditions.
  • the first sample image and the second sample image are generated based on the same prompt information, where the prompt information indicates the style and content of the image to be generated.
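A minimal sketch of the STRE training objective implied by these items, written as a standard diffusion noise-prediction loss on the second sample image, conditioned on the first image's style features and on text describing the second image's content. The noise schedule and the exact loss form are assumptions; the text says only that the second sample image is noised and denoised under these conditions.

```python
import numpy as np

def stre_training_loss(denoiser, second_image_latent, style_feats,
                       content_text_feats, rng):
    """One STRE step: predict the noise added to the SECOND image,
    conditioned on style features from the FIRST image plus content text.

    The loss would be backpropagated to update both the feature
    transformation model and the first query representation.
    """
    noise = rng.normal(size=second_image_latent.shape)
    t = rng.uniform(0.01, 0.99)            # illustrative timestep in (0, 1)
    noisy = np.sqrt(1.0 - t) * second_image_latent + np.sqrt(t) * noise
    predicted_noise = denoiser(noisy, t, style_feats, content_text_feats)
    return float(np.mean((predicted_noise - noise) ** 2))
```

Because the conditioning style features come from a different image than the one being denoised (sharing only its prompt), the objective cannot be satisfied by copying semantics, which is what makes this a non-reconstruction task.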
  • the training of the feature transformation model also includes: determining style-related features of the third sample image using the feature transformation model based on third sample image features of the third sample image, a first predetermined prompt and a first query representation; obtaining semantic-related features of the third sample image; and updating the feature transformation model and the first query representation by reconstructing the third sample image with the style-related features and the semantic-related features as conditions.
  • FIG. 7 shows a schematic structural block diagram of an apparatus 700 for image processing according to some embodiments of the present disclosure.
  • the apparatus 700 may be implemented as or included in the electronic device 110.
  • Each module/component in the apparatus 700 may be implemented by hardware, software, firmware or any combination thereof.
  • the first target image generation module 730 is further configured to: for a given processing layer in the first number of processing layers: convert the input image features of the given processing layer into query features; generate key features and value features by converting the first text features and style-related features; and generate output image features of the given processing layer based on the query features, key features, and value features.
  • the first transformation unit, the second transformation unit, and the third transformation unit are obtained through a first training pattern
  • the fourth transformation unit and the fifth transformation unit are obtained through a second training pattern different from the first training pattern
  • the second target image generation module is further configured to: obtain second text features of a second input text; generate second target image features using a first model based on semantically related features and the second text features, wherein the first model includes multiple processing layers, the multiple processing layers are respectively used to process features of corresponding sizes, and the semantically related features are provided to a second number of processing layers whose sizes are less than a second threshold size; and generate a second target image based on the second target image features.
  • style-related features are determined using a feature transformation model
  • the first query representation is obtained by training the feature transformation model
  • the training of the feature transformation model includes: obtaining a first sample image and a second sample image having the same image style; determining style-related features of the first sample image using the feature transformation model based on first sample image features of the first sample image, a first predetermined prompt, and a first query representation; and updating the feature transformation model and the first query representation by noising and then denoising the second sample image, taking the style-related features of the first sample image and text indicating the image content of the second sample image as conditions.
  • the first sample image and the second sample image are generated based on the same prompt information, where the prompt information indicates the style and content of the image to be generated.
  • FIG8 shows a block diagram of an electronic device 800 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 800 shown in FIG8 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The electronic device 800 shown in FIG8 may be used to implement the electronic device 110 of FIG1 .
  • the communication unit 840 implements communication with other electronic devices through a communication medium. Additionally, the functions of the components of the electronic device 800 can be implemented with a single computing cluster or multiple computing machines that can communicate through a communication connection. Therefore, the electronic device 800 can operate in a networked environment using a logical connection with one or more other servers, a network personal computer (PC), or another network node.
  • the input device 850 may be one or more input devices, such as a mouse, a keyboard, a tracking ball, etc.
  • the output device 860 may be one or more output devices, such as a display, a speaker, a printer, etc.
  • the electronic device 800 may also communicate with one or more external devices (not shown) through the communication unit 840 as needed, such as a storage device, a display device, etc., communicate with one or more devices that allow a user to interact with the electronic device 800, or communicate with any device that allows the electronic device 800 to communicate with one or more other electronic devices (e.g., a network card, a modem, etc.). Such communication may be performed via an input/output (I/O) interface (not shown).
  • These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing device, thereby producing a machine, so that when these instructions are executed by the processing unit of the computer or other programmable data processing device, a device that implements the functions/actions specified in one or more boxes in the flowchart and/or block diagram is generated.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause the computer, programmable data processing device, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions includes a manufactured product, which includes instructions for implementing various aspects of the functions/actions specified in one or more boxes in the flowchart and/or block diagram.
  • each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function.
  • the functions noted in the blocks may also occur in an order different from that noted in the accompanying drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

According to the embodiments of the present disclosure, provided are a method and apparatus for image processing, and a device and a storage medium. The method comprises: acquiring a first reference image feature of a first reference image; determining a style-related feature of the first reference image on the basis of the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation; and generating a first target image on the basis of the style-related feature of the first reference image and first input text indicating target image content, wherein the first target image matches both the image style of the first reference image and the target image content. In this way, a style feature of a reference image can be decoupled from a semantic feature thereof, thus facilitating the generation of an image that has the style of the reference image and also meets the semantics of input text.

Description

Method, apparatus, device and storage medium for image processing

This application claims priority to the Chinese invention patent application entitled "Method, device, apparatus and storage medium for image processing", application number 202410070352.8, filed on January 17, 2024, the entire contents of which are incorporated by reference into this application.

Technical Field

Example embodiments of the present disclosure generally relate to the field of computers, and more particularly, to methods, devices, apparatuses, and computer-readable storage media for image processing.

Background Art

In the field of computer vision (CV), various image generation techniques based on machine learning have developed significantly and are widely used. For example, in many application scenarios, such as social networking, games, and image editing, it is desirable to generate and use images with a specific style. Machine-learning-based image generation techniques can be used in such scenarios to improve the image generation effect.

Summary of the Invention

In a first aspect of the present disclosure, an image processing method is provided. The method includes: obtaining a first reference image feature of a first reference image; determining a style-related feature of the first reference image based on the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation; and generating a first target image based on the style-related feature of the first reference image and a first input text indicating the target image content, the first target image matching both the image style of the first reference image and the target image content.

In a second aspect of the present disclosure, a device for image processing is provided. The device includes: a first image feature extraction module configured to obtain a first reference image feature of a first reference image; a style-related feature module configured to determine a style-related feature of the first reference image based on the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation; and a first target image generation module configured to generate a first target image based on the style-related feature of the first reference image and a first input text indicating the target image content, the first target image matching both the image style of the first reference image and the target image content.

In a third aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit; and at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the method of the first aspect.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium, and the computer program can be executed by a processor to implement the method of the first aspect.

In a fifth aspect of the present disclosure, a computer program product is provided. The computer program product is tangibly stored in a computer storage medium and includes computer-executable instructions which, when executed by a device, cause the device to perform the method of the first aspect.

It should be understood that the content described in this Summary is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:

FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

FIG. 2 shows a schematic diagram of an example architecture for generating an image having a reference style and target content according to some embodiments of the present disclosure;

FIG. 3 shows a schematic diagram of an example architecture for generating an image having reference content and a target style according to some embodiments of the present disclosure;

FIG. 4 shows a schematic diagram of an application of a cross-attention mechanism according to some embodiments of the present disclosure;

FIG. 5A shows a schematic diagram of a training task for style feature extraction according to some embodiments of the present disclosure;

FIG. 5B shows a schematic diagram of a training task for semantic feature extraction according to some embodiments of the present disclosure;

FIG. 5C shows a schematic diagram of a training task for image reconstruction according to some embodiments of the present disclosure;

FIG. 6 shows a flowchart of a process of image processing according to some embodiments of the present disclosure;

FIG. 7 shows a block diagram of an apparatus for image processing according to some embodiments of the present disclosure; and

FIG. 8 shows a block diagram of a device capable of implementing various embodiments of the present disclosure.

DETAILED DESCRIPTION

It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the types, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require obtaining and using the user's personal information. The user can thus autonomously choose, based on the prompt information, whether to provide personal information to software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solutions of the present disclosure.

As an optional but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in the form of a pop-up window, and the prompt information may be presented in text in the pop-up window. In addition, the pop-up window may also carry a selection control allowing the user to choose to "agree" or "disagree" to providing personal information to the electronic device.

It should be understood that the above process of notifying the user and obtaining the user's authorization is merely illustrative and does not limit the implementations of the present disclosure; other methods that satisfy relevant laws and regulations may also be applied to the implementations of the present disclosure.

It should be understood that the data involved in the present technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and relevant provisions.

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.

It should be noted that the headings of any sections/subsections provided herein are not restrictive. Various embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, the embodiments described in any section/subsection may be combined in any manner with any other embodiments described in the same section/subsection and/or in different sections/subsections.

Herein, unless explicitly stated otherwise, performing a step "in response to A" does not mean that the step is performed immediately after "A"; one or more intermediate steps may be included.

In the description of the embodiments of the present disclosure, the term "include" and its variants should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The terms "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

As used herein, the term "model" refers to something that can learn an association between corresponding inputs and outputs from training data, so that, after training is completed, a corresponding output can be generated for a given input. A model may be generated based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs by using multiple layers of processing units. Herein, a "model" may also be referred to as a "machine learning model", a "machine learning network", or a "network", and these terms are used interchangeably herein. A model may in turn include different types of processing units or networks.

As used herein, a target image matching an image style may mean that the target image visually has a style that is consistent with or similar to that image style. A target image matching image content may mean that the target image visually contains content that is the same as or similar to that content.

Example Environment

FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the environment 100, an image processing system 120, also referred to as the system 120 for short, is deployed in an electronic device 110. The image processing system 120 is configured to generate a target image 105 based on an input text 102 and a reference image 101.

The reference image 101 can be used to provide or indicate an image element desired to be included in the target image 105, such as an image style or image content. The input text can be used to indicate another image element desired to be included in the target image 105. As used herein, the term "image element" may refer to various explicit or implicit elements that constitute an image. Illustratively, an image element may include an image style or image content. Examples of image styles include, but are not limited to, watercolor, crayon, sketch, comic, and so on. Examples of image content may include, but are not limited to, at least a portion of the foreground of an image, at least a portion of the background of an image, and so on.

In some embodiments, the image processing system 120 may generate an image having the same style as the reference image 101. Hereinafter, the image style of the reference image 101 is also referred to as the reference style. In such embodiments, the input text 102 may indicate target image content. The generated target image 105 may match the image style of the reference image 101 and the target image content. For example, the target image 105 may have the image style of the reference image 101 and contain the target image content.

Alternatively or additionally, in some embodiments, the image processing system 120 may generate an image having the same content as the reference image 101. In such embodiments, the input text 102 may indicate a target image style. The generated target image 105 may match the target image style and the image content of the reference image 101. For example, the target image 105 may contain the image content of the reference image 101 and have the style indicated by the input text 102.

In the environment 100, the electronic device 110 may be any type of device with computing capability, including a terminal device or a server device. The terminal device may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a media computer, a multimedia tablet, a personal communication system (PCS) device, a personal navigation device, a personal digital assistant (PDA), an audio/video player, a digital camera/camcorder, a positioning device, a television receiver, a radio broadcast receiver, an e-book device, a gaming device, or any combination of the foregoing, including accessories and peripherals of these devices or any combination thereof. The server device may include, for example, a computing system/server, such as a mainframe, an edge computing node, an electronic device in a cloud environment, and so on.

It should be understood that the structure and functionality of the environment 100 are described for exemplary purposes only, without implying any limitation on the scope of the present disclosure.

As mentioned above, it is desirable to generate an image having the reference style of a reference image. To this end, in some conventional solutions, all or some of the parameters of a base image generation model (such a model can generate an image from input text) may be fine-tuned to generate a stylized image. However, such a solution is time-consuming (for example, on the order of at least minutes) and requires manual involvement.

In other conventional solutions, a two-stage encoder, which includes a fixed backbone network and a head network to be trained, is employed to extract features of the reference image. The features obtained by the encoder may be further combined with a denoising network. However, in such conventional solutions, the training task is usually a reconstruction task, so what the encoder learns is entangled information of content and style. As a result, the semantics in the reference image may conflict with the semantics indicated by the input text, so that the generated image may contain the semantics of the reference image.

To this end, embodiments of the present disclosure provide an improved solution for image generation. In embodiments of the present disclosure, a predetermined prompt and a query representation specific to style feature extraction are introduced to extract style-related features from a reference image. Then, based on the extracted style-related features and input text indicating target image content, a target image matching the image style of the reference image and the target image content can be generated.

In embodiments of the present disclosure, feature extraction is performed using the predetermined prompt and the query representation specific to style feature extraction, so as to decouple the style features of the reference image from its semantic features. This can alleviate the inconsistency between the semantics in the reference image and the semantics of the input text. As a result, an image that both has the style of the reference image and conforms to the semantics of the input text can advantageously be generated.

Example Image Generation Architecture

In some embodiments, the input text 102 may indicate content that the target image is expected to have, also referred to as target image content or simply target content. In such embodiments, the reference image 101 indicates an image style. The target image 105 has the reference style and the target content. FIG. 2 shows a schematic diagram of an example architecture 200 for generating an image having the reference style and the target content according to some embodiments of the present disclosure. The architecture 200 may be implemented in the system 120. For the extraction of image features, the architecture 200 adopts a two-stage structure.

As shown in FIG. 2, the reference style of the reference image 201 is style A. The system 120 may obtain image features of the reference image 201, also referred to as reference image features 202. Illustratively, an image encoder 220 may generate the reference image features 202 based on the reference image 201. The image encoder 220 may adopt any suitable network structure, and embodiments of the present disclosure are not limited in this respect.

The reference image features 202 may be used as an input to a feature transformation model 230. In addition, the feature transformation model 230 has other inputs, including a predetermined prompt 204 for style feature extraction (also referred to as a first predetermined prompt) and a query representation 203 for style feature extraction (also referred to as a first query representation). The predetermined prompt 204 may be any suitable text or prompt words. Showing the predetermined prompt 204 as the word "style" in this example is merely illustrative and is not intended to be limiting. In embodiments of the present disclosure, the first predetermined prompt is specific to style feature extraction and is independent of the reference image. That is, the first predetermined prompt does not change with the reference image. The feature transformation model 230 may be implemented with a network of any suitable structure. As an example without any intended limitation, the feature transformation model 230 may be a query transformer (Query Transformer).

The query representation 203 is learnable. The query representation 203 may be obtained during training of the feature transformation model 230. For example, the query representation 203 may be initialized and then updated jointly during training of the feature transformation model 230, until it is fixed when training is completed. Such a query representation 203 may be used to instruct the feature transformation model 230 to extract features related to the image style. As shown in FIG. 2, the query representation 203 may be a vectorized representation of any suitable dimension. The query representation 203 is obtained for style feature extraction.

The feature transformation model 230 may generate style-related features 205 of the reference image 201 based on the reference image features 202, the first predetermined prompt 204, and the first query representation 203. In other words, under the prompting or guidance of the first predetermined prompt 204 and the first query representation 203, the feature transformation model 230 can extract style-related features from the reference image 201.
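The extraction step described above can be sketched as a single cross-attention pass in which the learnable query representation, conditioned on the predetermined prompt, attends to the reference image features. The following minimal numpy sketch is illustrative only: the function name, the way the prompt embedding is folded into the queries, and all dimensions are assumptions rather than the disclosed implementation, which may for example use a full query transformer with multiple layers and attention heads.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def extract_style_features(image_feats, style_queries, prompt_feats):
    """One cross-attention pass of a Query-Transformer-like extractor.

    image_feats:   (num_patches, d) reference image features from the image encoder
    style_queries: (num_queries, d) learnable first query representation
    prompt_feats:  (num_prompt_tokens, d) embedded first predetermined prompt ("style")
    Returns (num_queries, d) style-related features.
    """
    d = image_feats.shape[-1]
    # Condition the queries on the predetermined prompt; here we simply add
    # the mean prompt embedding to each query (an illustrative choice).
    q = style_queries + prompt_feats.mean(axis=0)
    # Cross-attention: queries attend to the image features.
    attn = softmax(q @ image_feats.T / np.sqrt(d))
    return attn @ image_feats

rng = np.random.default_rng(0)
style = extract_style_features(
    image_feats=rng.normal(size=(16, 8)),
    style_queries=rng.normal(size=(4, 8)),
    prompt_feats=rng.normal(size=(2, 8)),
)
print(style.shape)  # (4, 8)
```

Because the queries and the prompt are fixed after training and do not depend on the reference image, the same small set of query vectors always probes a new reference image for style information only.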

Next, the system 120 may generate a target image 210 based on the style-related features 205 and input text 209 indicating target image content. The input text 209 may indicate, in any suitable characters, the content that the target image is expected to contain, such as, but not limited to, persons, animals, objects, scenery, buildings, and so on. In this example, the target content indicated by the input text 209 is "panda", but this is merely illustrative and is not intended to be limiting. As shown in FIG. 2, the target image 210 has the style of the reference image 201 (that is, style A) and contains the content indicated by the input text 209 (that is, a panda).

The system 120 may employ any suitable algorithm or model to generate the target image. As shown in FIG. 2, in some embodiments, a model 240, also referred to as a first model, may be employed. The model 240 may be implemented with any suitable network structure. Illustratively, the model 240 may be a diffusion model, which can perform multiple denoising steps. The model 240 may also be of other types, such as a generative adversarial network.

Specifically, a text encoder 250 may generate text features 207 of the input text 209. The text features 207 may be provided to the model 240 as a text condition. In addition, the model 240 may also receive the style-related features 205. The model 240 may generate target image features 208 based on the text features 207 and the style-related features 205. The target image features 208 may then be used to generate the target image 210. For example, the system 120 may use an image decoder (not shown) to convert the target image features 208 into the target image 210. Embodiments of the present disclosure are not limited in how the target image features are converted into the target image. Furthermore, in each step performed with the model 240 (for example, a denoising step), the model 240 also receives, as an input, the image features output by the previous step.
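The iterative generation just described, where each step consumes the text condition, the style-related features, and the image features output by the previous step, can be sketched as a simple loop. In the sketch below, `toy_step` is a stand-in assumption for the actual denoising network; only the data flow between steps and conditions is meant to be illustrative.

```python
def generate_target_features(init_feats, text_feats, style_feats, denoise_step, num_steps=4):
    """Run `num_steps` denoising steps; each step receives the previous
    step's image features plus the two conditioning inputs."""
    feats = init_feats
    for step in range(num_steps):
        feats = denoise_step(feats, text_feats, style_feats, step)
    return feats

# Illustrative stand-in for the denoising network: nudge the current
# features toward a blend of the two conditions.
def toy_step(feats, text_feats, style_feats, step):
    return [0.5 * f + 0.25 * t + 0.25 * s
            for f, t, s in zip(feats, text_feats, style_feats)]

out = generate_target_features([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], toy_step)
print(out)  # [1.875, 2.8125]
```

The final features would then be handed to an image decoder, just as the target image features 208 are converted into the target image 210 above.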

In some embodiments, the model 240 may include multiple processing layers, which are respectively used to process features of corresponding sizes. In such embodiments, the style-related features 205 are provided to a first number of processing layers whose feature sizes are greater than a first threshold size. For example, the model 240 may be a denoising U-Net. Illustratively, as shown in FIG. 2, the processing layers close to the model input and output process features of larger sizes, while the processing layers close to the middle of the model process features of smaller sizes. Accordingly, the style-related features 205 are provided to the processing layers close to the model input and output. As an example without any intended limitation, assume that the model 240 includes 16 processing layers, numbered 0, 1, ..., 15 in sequence. The style-related features 205 may be provided to layers 0 to 3 and layers 9 to 15. The combination of style-related features and text features will be described below with reference to FIG. 4.

Features of larger sizes include more details of the image and correspond to fine processing layers. Such fine processing layers are generally considered to be mainly responsible for generating style-related elements of the image, such as color and structure. Features of smaller sizes include high-level semantic information of the image and correspond to coarser processing layers. Such coarser processing layers are generally considered to be mainly responsible for generating the semantics of the image. Therefore, by injecting the style-related features only into the fine processing layers and not into the coarser processing layers, the style and semantics of the reference image can be further decoupled. In such embodiments, the model 240 can advantageously focus more on the style of the reference image during image generation.

An example implementation of generating an image having a reference style and target content has been described above. An example implementation of generating an image having a target style and reference content is described below.

In some embodiments, the input text 102 may indicate a style that the target image is expected to have, also referred to as a target image style or simply a target style. In such embodiments, the reference image 101 indicates image content, also referred to as reference content. The target image 105 has the reference content and the target style. FIG. 3 shows a schematic diagram of an example architecture 300 for generating an image having the reference content and the target style according to some embodiments of the present disclosure. The architecture 300 may be implemented in the system 120. For the extraction of image features, the architecture 300 adopts a two-stage structure. Note that the text encoder 350 in FIG. 3 may be the same as or different from the text encoder 250 in FIG. 2, and the image encoder 320 in FIG. 3 may be the same as or different from the image encoder 220 in FIG. 2. Embodiments of the present disclosure are not limited in this respect.

As shown in FIG. 3, the reference style of the reference image 301 is style A, and the image content is a smiling face. The system 120 may obtain image features of the reference image 301, also referred to as reference image features 302. Illustratively, an image encoder 320 may generate the reference image features 302 based on the reference image 301. The image encoder 320 may adopt any suitable network structure, and embodiments of the present disclosure are not limited in this respect.

The reference image features 302 may be used as an input to a feature transformation model 330. In addition, the feature transformation model 330 has other inputs, including a predetermined prompt 304 for semantic feature extraction (also referred to as a second predetermined prompt) and a query representation 303 for semantic feature extraction (also referred to as a second query representation). The predetermined prompt 304 may be any suitable text or prompt words. Showing the predetermined prompt 304 as the word "content" in this example is merely illustrative and is not intended to be limiting. In embodiments of the present disclosure, the second predetermined prompt is specific to semantic feature extraction and is independent of the reference image. That is, the second predetermined prompt does not change with the reference image. The feature transformation model 330 may be implemented with a network of any suitable structure. As an example without any intended limitation, the feature transformation model 330 may be a query transformer (Query Transformer).

The query representation 303 is learnable. The query representation 303 may be obtained during training of the feature transformation model 330. For example, the query representation 303 may be initialized and then updated jointly during training of the feature transformation model 330, until it is fixed when training is completed. Such a query representation 303 may be used to instruct the feature transformation model 330 to extract features related to the image content, that is, semantics-related features. As shown in FIG. 3, the query representation 303 may be a vectorized representation of any suitable dimension. The query representation 303 is obtained for semantic feature extraction.

The feature transformation model 330 may generate semantics-related features 305 of the reference image 301 based on the reference image features 302, the second predetermined prompt 304, and the second query representation 303. In other words, under the prompting or guidance of the second predetermined prompt 304 and the second query representation 303, the feature transformation model 330 can extract semantics-related features from the reference image 301.

Next, the system 120 may generate a target image 310 based on the semantics-related features 305 and input text 309 indicating a target image style. The input text 309 may indicate, in any suitable characters, the style that the target image is expected to have. In this example, the target style indicated by the input text 309 is "style B", but this is merely illustrative and is not intended to be limiting. As shown in FIG. 3, the target image 310 contains the content of the reference image 301 (that is, a smiling face) and has the style indicated by the input text 309 (that is, style B).

Similar to what is described with reference to FIG. 2, the system 120 may employ any suitable algorithm or model to generate the target image 310. As shown in FIG. 3, in some embodiments, the model 240 may be employed. Specifically, a text encoder 350 may generate text features 307 of the input text 309. The text features 307 may be provided to the model 240 as a text condition. In addition, the model 240 may also receive the semantics-related features 305. The model 240 may generate target image features 308 based on the text features 307 and the semantics-related features 305. The target image features 308 may then be used to generate the target image 310. For example, the system 120 may use an image decoder (not shown) to convert the target image features 308 into the target image 310. Embodiments of the present disclosure are not limited in how the target image features are converted into the target image. Furthermore, in each step performed with the model 240 (for example, a denoising step), the model 240 also receives, as an input, the image features output by the previous step.

In some embodiments, the model 240 may include multiple processing layers, which are respectively used to process features of corresponding sizes. In such embodiments, the semantics-related features 305 are provided to a second number of processing layers whose feature sizes are smaller than a second threshold size. The second threshold size may be the same as or different from the first threshold size described above. For example, the model 240 may be a denoising U-Net. Illustratively, as shown in FIG. 3, the processing layers close to the model input and output process features of larger sizes, while the processing layers close to the middle of the model process features of smaller sizes. Accordingly, the semantics-related features 305 are provided to the processing layers close to the middle of the model. As an example without any intended limitation, assume that the model 240 includes 16 processing layers, numbered 0, 1, ..., 15 in sequence. The semantics-related features 305 may be provided to layers 4 to 8. The combination of semantics-related features and text features will be described below with reference to FIG. 4.
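Taking the illustrative 16-layer numbering used in these examples (style-related features to layers 0 to 3 and 9 to 15, semantics-related features to layers 4 to 8), the routing of the two kinds of image-related features can be sketched as a per-layer predicate. The names below are assumptions made for illustration.

```python
NUM_LAYERS = 16
STYLE_LAYERS = set(range(0, 4)) | set(range(9, 16))   # fine layers: larger feature sizes
CONTENT_LAYERS = set(range(4, 9))                     # coarse layers: smaller feature sizes

def condition_for_layer(layer_idx):
    """Return which image-related feature, if any, is injected at a layer."""
    if layer_idx in STYLE_LAYERS:
        return "style"
    if layer_idx in CONTENT_LAYERS:
        return "content"
    return None

routing = [condition_for_layer(i) for i in range(NUM_LAYERS)]
print(routing)
```

Because the two sets partition the 16 layers disjointly, each layer receives at most one of the two image-related conditions, which reflects how the style and semantics of the reference image are kept decoupled during generation.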

如参考图2所描述的,尺寸较大的特征包括图像的更多细节,并且对应于精细处理层。通常认为这样的精细处理层主要负责图像的颜色、结构等风格相关要素的生成。尺寸较小的特征包括图像的高级语义信息,并且对应于较粗处理层。通常认为这样的较粗处理层主要负责图像的语义生成。因此,通过将语义相关特征仅注入到较粗处理层而不注入到精细处理层,可以进一步解耦参考图像的风格和语义。在这种实施例中,可以使得模型240在图像生成中更专注于参考图像的内容。As described with reference to FIG2 , features with larger sizes include more details of the image and correspond to a fine processing layer. It is generally believed that such a fine processing layer is mainly responsible for the generation of style-related elements such as color and structure of the image. Features with smaller sizes include high-level semantic information of the image and correspond to a coarser processing layer. It is generally believed that such a coarser processing layer is mainly responsible for the semantic generation of the image. Therefore, by injecting semantically relevant features only into the coarser processing layer and not into the fine processing layer, the style and semantics of the reference image can be further decoupled. In this embodiment, the model 240 can be made to focus more on the content of the reference image in image generation.

参考图像的特征与文本特征的结合Combination of reference image features with text features

如上文所提及的,在一些实施例中,风格相关特征和语义相关特征可以被提供到模型240的某些处理层中。下文仅出于说明的目的,将风格相关特征和语义相关特征统称为图像相关特征。在这样的处理层中,图像相关特征和输入文本的文本特征可以作为条件用于目标图像生成。在一些实施例中,模型240可以是基于交叉注意力机制的。相应地,在处理层中,可以应用与图像相关特征和文本特征有关的交叉注意力。As mentioned above, in some embodiments, the style-related features and the semantically relevant features may be provided to certain processing layers of the model 240. Hereinafter, for illustration only, the style-related features and the semantically relevant features are collectively referred to as image-related features. In such processing layers, the image-related features and the text features of the input text may be used as conditions for target image generation. In some embodiments, the model 240 may be based on a cross-attention mechanism. Accordingly, in these processing layers, cross-attention involving the image-related features and the text features may be applied.

图4示出了根据本公开的一些实施例的交叉注意力机制应用的示意图。在图4中,在各个特征下方示出了相应的尺寸。如图4所示,在某个处理层中,将该处理层的输入图像特征411(其由Z表示,并且可以来自上一处理层或上一去噪步骤)转换成查询特征421。通过转换文本特征412(其由Ct表示)和图像相关特征413(其由Ci表示,可以是语义相关特征或风格相关特征),生成键(Key)特征431和值(Value)特征432。而后,可以基于查询特征421、键特征431和值特征432,生成该处理层的输出图像特征450。如图4所示,查询特征421和键特征431的点乘可以得到注意力图。通过注意力图和值特征432的矩阵乘可以得到输出图像特征450。FIG. 4 shows a schematic diagram of the application of the cross-attention mechanism according to some embodiments of the present disclosure. In FIG. 4, the corresponding size is shown below each feature. As shown in FIG. 4, in a given processing layer, the input image feature 411 of the processing layer (denoted Z, which may come from the previous processing layer or the previous denoising step) is converted into a query feature 421. By converting the text feature 412 (denoted C_t) and the image-related feature 413 (denoted C_i, which may be a semantically relevant feature or a style-related feature), a key feature 431 and a value feature 432 are generated. Then, based on the query feature 421, the key feature 431, and the value feature 432, the output image feature 450 of the processing layer can be generated. As shown in FIG. 4, the dot product of the query feature 421 and the key feature 431 yields an attention map, and matrix multiplication of the attention map and the value feature 432 yields the output image feature 450.

在一些实施例中,这种处理层的参数可以是通过不同训练模式得到的,不同训练模式例如可以包括预训练和微调。如图4所示,处理层可以包括第一转换单元401,用于将输入图像特征411转换成查询特征421。处理层还可以包括第二转换单元402、第三转换单元403、第四转换单元404和第五转换单元405。第二转换单元402用于将文本特征412转换为第一中间文本特征422,作为键特征的一部分。第三转换单元403用于将文本特征412转换为第二中间文本特征423,作为值特征的一部分。第四转换单元404用于将图像相关特征413转换为第一中间图像相关特征424,作为键特征的一部分。第五转换单元405用于将图像相关特征413转换为第二中间图像相关特征425,作为值特征的一部分。In some embodiments, the parameters of such a processing layer may be obtained through different training modes, and the different training modes may include, for example, pre-training and fine-tuning. As shown in FIG4 , the processing layer may include a first conversion unit 401 for converting an input image feature 411 into a query feature 421. The processing layer may also include a second conversion unit 402, a third conversion unit 403, a fourth conversion unit 404, and a fifth conversion unit 405. The second conversion unit 402 is used to convert a text feature 412 into a first intermediate text feature 422 as part of a key feature. The third conversion unit 403 is used to convert a text feature 412 into a second intermediate text feature 423 as part of a value feature. The fourth conversion unit 404 is used to convert an image-related feature 413 into a first intermediate image-related feature 424 as part of a key feature. The fifth conversion unit 405 is used to convert an image-related feature 413 into a second intermediate image-related feature 425 as part of a value feature.

第一中间文本特征422和第一中间图像相关特征424可以被组合成键特征431,例如,可以被拼接(concatenate)成键特征431。类似地,第二中间文本特征423和第二中间图像相关特征425可以被组合成值特征432,例如,可以被拼接成值特征432。The first intermediate text feature 422 and the first intermediate image-related feature 424 may be combined into a key feature 431, for example, may be concatenated into the key feature 431. Similarly, the second intermediate text feature 423 and the second intermediate image-related feature 425 may be combined into a value feature 432, for example, may be concatenated into the value feature 432.
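The cross-attention of FIG. 4 can be sketched in a few lines of NumPy. The token counts, dimensions, and random weight matrices below are illustrative assumptions; the point is that the text-derived and image-derived keys and values are concatenated before a single attention computation, as described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                            # attention dimension (illustrative)
Z  = rng.normal(size=(16, d))    # input image features 411 (16 tokens, assumed)
Ct = rng.normal(size=(6, d))     # text features 412
Ci = rng.normal(size=(4, d))     # image-related features 413 (style or semantic)

# Per-source projection weights: W_q for queries (unit 401), key/value
# projections for text (units 402/403) and for image features (units 404/405).
W_q, W_kt, W_vt, W_ki, W_vi = (rng.normal(size=(d, d)) for _ in range(5))

Q = Z @ W_q
K = np.concatenate([Ct @ W_kt, Ci @ W_ki], axis=0)  # key feature 431
V = np.concatenate([Ct @ W_vt, Ci @ W_vi], axis=0)  # value feature 432

attn = softmax(Q @ K.T / np.sqrt(d))  # attention map from Q·K^T
Z_out = attn @ V                      # output image features 450
```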

在一些实施例中,第一转换单元401、第二转换单元402和第三转换单元403是通过第一训练模式得到的,并且第四转换单元404和第五转换单元405是通过不同于所述第一训练模式的第二训练模式得到的。例如,第一转换单元401、第二转换单元402和第三转换单元403可以是通过预训练得到的,第四转换单元404和第五转换单元405可以是通过微调得到的。以此方式,可以对预训练的基础模型添加附加单元,从而可以微调附加单元而无需训练基础模型所包括的部分。这可以有效地降低训练成本。In some embodiments, the first conversion unit 401, the second conversion unit 402, and the third conversion unit 403 are obtained through a first training mode, and the fourth conversion unit 404 and the fifth conversion unit 405 are obtained through a second training mode different from the first training mode. For example, the first conversion unit 401, the second conversion unit 402, and the third conversion unit 403 may be obtained through pre-training, and the fourth conversion unit 404 and the fifth conversion unit 405 may be obtained through fine-tuning. In this way, additional units can be added to the pre-trained base model, so that the additional units can be fine-tuned without training the parts included in the base model. This can effectively reduce the training cost.
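The two training modes can be pictured as a simple parameter partition: only the added image projections (units 404 and 405) are fine-tuned, while the pre-trained projections (units 401 through 403) stay frozen. The naming convention used here is a hypothetical illustration, not the disclosed implementation.

```python
def split_parameters(layer_params):
    """Partition a cross-attention layer's parameters by training mode.

    Keys ending in '_image' denote the added units (404/405), which are
    fine-tuned; everything else belongs to the pre-trained base (401-403)
    and is frozen. The naming convention is an assumption for illustration.
    """
    frozen = {k: v for k, v in layer_params.items() if not k.endswith("_image")}
    trainable = {k: v for k, v in layer_params.items() if k.endswith("_image")}
    return frozen, trainable

params = {"to_q": 0, "to_k_text": 0, "to_v_text": 0,
          "to_k_image": 0, "to_v_image": 0}
frozen, trainable = split_parameters(params)
```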

在以上描述的实施例中,解耦了参考图像中的语义信息和风格信息。这可以视为一种双解耦表示提取(DDRE)。具体来说,分别给特征变换模型输入特定于风格特征提取的预定提示文本(例如,“风格”)和特定于语义特征提取的预定提示文本(例如,“内容”),使得特征变换模型能够获得与提示文本对齐的图像特征。也即,可以分别获得风格相关特征和语义相关特征。In the above-described embodiment, the semantic information and style information in the reference image are decoupled. This can be regarded as a double decoupled representation extraction (DDRE). Specifically, a predetermined prompt text specific to style feature extraction (e.g., "style") and a predetermined prompt text specific to semantic feature extraction (e.g., "content") are input to the feature transformation model respectively, so that the feature transformation model can obtain image features aligned with the prompt text. That is, style-related features and semantic-related features can be obtained respectively.
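A minimal sketch of DDRE follows, assuming a feature-transformation callable that accepts a prompt text and a query representation. The toy stand-in below only tags its inputs, to show that the two calls yield separate style-aligned and content-aligned features; all names are illustrative assumptions.

```python
def ddre(image_features, transform, style_query, content_query):
    """Run the feature transformation twice with decoupling prompts."""
    style_feat = transform(image_features, prompt="style", query=style_query)
    content_feat = transform(image_features, prompt="content", query=content_query)
    return style_feat, content_feat

# Toy stand-in for the feature transformation model: tags features
# with the prompt and query used, so the two outputs are distinguishable.
toy = lambda feats, prompt, query: (prompt, feats, query)
style_feat, content_feat = ddre([0.1, 0.2], toy, "q_style", "q_content")
```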

训练任务和样本Training tasks and samples

为了实现上述的双解耦表示提取,在训练中,可以执行各种合适的训练任务。In order to achieve the above-mentioned dual disentangled representation extraction, various suitable training tasks can be performed during training.

在一些实施例中,可以执行用于风格特征提取的训练任务,也称为风格表示提取(STyle Representation Extraction,STRE)任务。在STRE任务中,用成对的具有相同风格的不同样本图像来进行训练。图5A示出了根据本公开的一些实施例的用于风格特征提取的训练任务500A的示意图。In some embodiments, a training task for style feature extraction, also known as a style representation extraction (STRE) task, may be performed. In the STRE task, pairs of different sample images with the same style are used for training. FIG5A shows a schematic diagram of a training task 500A for style feature extraction according to some embodiments of the present disclosure.

在训练任务500A中,成对的样本图像511和样本图像520具有相同风格,例如风格A。样本图像511用作参考,样本图像520用作目标。在训练中,第一查询表示203被初始化。特征变换模型230可以基于样本图像511的第一样本图像特征(例如,图像编码器220的输出)、第一预定提示204和第一查询表示203,生成样本图像511的风格相关特征515。以样本图像511的风格相关特征515和指示样本图像520的图像内容的文本519作为条件,通过对样本图像520的加噪和去噪来更新特征变换模型230和初始化的第一查询表示203。当预定条件满足时,停止更新特征变换模型230和第一查询表示203。在一些实施例中,如果模型240包括可训练的转换单元(例如,参考图4所描述的),这样的转换单元也随着训练而更新。In the training task 500A, the paired sample image 511 and sample image 520 have the same style, for example, style A. The sample image 511 serves as the reference and the sample image 520 serves as the target. During training, the first query representation 203 is initialized. The feature transformation model 230 may generate style-related features 515 of the sample image 511 based on the first sample image features of the sample image 511 (e.g., the output of the image encoder 220), the first predetermined prompt 204, and the first query representation 203. With the style-related features 515 of the sample image 511 and the text 519 indicating the image content of the sample image 520 as conditions, the feature transformation model 230 and the initialized first query representation 203 are updated by adding noise to and then denoising the sample image 520. When a predetermined condition is met, the updating of the feature transformation model 230 and the first query representation 203 is stopped. In some embodiments, if the model 240 includes a trainable conversion unit (e.g., as described with reference to FIG. 4), such a conversion unit is also updated during training.

以上所描述的STRE任务是一种非重建任务。通过这样的训练任务,可以有利于参考图像的风格和语义解耦,还能保证训练过程中图像信息(在此示例中为风格信息)不会太强,以免“淹没”输入文本的提示信息。The STRE task described above is a non-reconstruction task. Such a training task facilitates decoupling the style and semantics of the reference image, and also ensures that the image information (in this example, the style information) does not become so strong during training that it "drowns out" the prompt information from the input text.
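One STRE-style training step might be sketched as follows, under a heavily simplified noising scheme (plain additive noise in place of a real diffusion schedule) and the standard noise-prediction objective. All function names and the toy denoiser are assumptions for illustration, not the disclosed implementation.

```python
import numpy as np

def stre_step(ref_style_feat, target_latent, caption_feat, denoiser, rng):
    """One simplified training step: noise the target, predict the noise
    conditioned on the reference's style feature and the target's caption."""
    t = int(rng.integers(1, 1000))                # random diffusion timestep
    noise = rng.normal(size=target_latent.shape)  # noise added to the target
    noised = target_latent + noise                # simplified noising (no schedule)
    pred = denoiser(noised, t, caption_feat, ref_style_feat)
    return float(np.mean((pred - noise) ** 2))    # MSE noise-prediction loss

rng = np.random.default_rng(0)
# Toy denoiser stand-in: ignores its conditions and predicts zeros.
toy_denoiser = lambda z, t, c_txt, c_img: np.zeros_like(z)
loss = stre_step(np.ones(4), np.zeros(4), None, toy_denoiser, rng)
```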

在一些实施例中,可以执行用于语义特征提取的训练任务,也称为语义表示提取(SEmantics Representation Extraction,SERE)任务。在SERE任务中,用成对的具有相同语义的不同样本图像来进行训练。图5B示出了根据本公开的一些实施例的用于语义特征提取的训练任务500B的示意图。In some embodiments, a training task for semantic feature extraction, also known as a semantic representation extraction (SERE) task, may be performed. In the SERE task, pairs of different sample images with the same semantics are used for training. FIG5B shows a schematic diagram of a training task 500B for semantic feature extraction according to some embodiments of the present disclosure.

在训练任务500B中,成对的样本图像521和样本图像530具有相同的内容,例如音符。样本图像521用作参考,样本图像530用作目标。在训练中,第二查询表示303被初始化。特征变换模型330可以基于样本图像521的样本图像特征(例如,图像编码器320的输出)、第二预定提示304和第二查询表示303,生成样本图像521的语义相关特征525。以样本图像521的语义相关特征525和指示样本图像530的风格的文本529作为条件,通过对样本图像530的加噪和去噪来更新特征变换模型330和初始化的第二查询表示303。当预定条件满足时,停止更新特征变换模型330和第二查询表示303。在一些实施例中,如果模型240包括可训练的转换单元(例如,参考图4所描述的),这样的转换单元也随着训练而更新。In the training task 500B, the paired sample image 521 and sample image 530 have the same content, such as musical notes. The sample image 521 serves as the reference and the sample image 530 serves as the target. During training, the second query representation 303 is initialized. The feature transformation model 330 may generate semantically relevant features 525 of the sample image 521 based on the sample image features of the sample image 521 (e.g., the output of the image encoder 320), the second predetermined prompt 304, and the second query representation 303. With the semantically relevant features 525 of the sample image 521 and the text 529 indicating the style of the sample image 530 as conditions, the feature transformation model 330 and the initialized second query representation 303 are updated by adding noise to and then denoising the sample image 530. When a predetermined condition is met, the updating of the feature transformation model 330 and the second query representation 303 is stopped. In some embodiments, if the model 240 includes a trainable conversion unit (e.g., as described with reference to FIG. 4), such a conversion unit is also updated during training.

以上所描述的SERE任务是一种非重建任务。通过这样的训练任务,可以有利于参考图像的风格和语义解耦,还能保证训练过程中图像信息(在此示例中为语义信息)不会太强,以免“淹没”输入文本的提示信息。The SERE task described above is a non-reconstruction task. Such a training task facilitates decoupling the style and semantics of the reference image, and also ensures that the image information (in this example, the semantic information) does not become so strong during training that it "drowns out" the prompt information from the input text.

在一些实施例中,为了避免非重建任务引起的图像信息遗漏,可以附加地执行重建任务。图5C示出了根据本公开的一些实施例的用于图像重建的训练任务500C的示意图。In some embodiments, in order to avoid image information omission caused by non-reconstruction tasks, a reconstruction task may be additionally performed. Fig. 5C shows a schematic diagram of a training task 500C for image reconstruction according to some embodiments of the present disclosure.

如图5C所示,上半分支为风格分支。在训练中,特征变换模型230可以基于样本图像531的样本图像特征(例如,图像编码器220的输出)、第一预定提示204和第一查询表示203,生成样本图像531的风格相关特征533。As shown in FIG. 5C, the upper branch is the style branch. During training, the feature transformation model 230 may generate style-related features 533 of the sample image 531 based on the sample image features of the sample image 531 (e.g., the output of the image encoder 220), the first predetermined prompt 204, and the first query representation 203.

在语义分支中,可以获取样本图像531的语义相关特征532。例如,语义相关特征532的获取可以利用特征变换模型330。如图所示,特征变换模型330可以基于样本图像531的样本图像特征(例如,图像编码器320的输出)、第二预定提示304和第二查询表示303,生成样本图像531的语义相关特征532。In the semantic branch, semantically relevant features 532 of the sample image 531 may be obtained. For example, the acquisition of the semantically relevant features 532 may utilize the feature transformation model 330. As shown in the figure, the feature transformation model 330 may generate the semantically relevant features 532 of the sample image 531 based on the sample image features of the sample image 531 (e.g., the output of the image encoder 320), the second predetermined hint 304, and the second query representation 303.

这样,可以以风格相关特征533和语义相关特征532作为条件,通过重建样本图像531来更新特征变换模型230和第一查询表示203。相应地,特征变换模型330和第二查询表示303也被更新。在一些实施例中,如果模型240包括可训练的转换单元(例如,参考图4所描述的),这样的转换单元也随着基于重建任务的训练而更新。In this way, the feature transformation model 230 and the first query representation 203 can be updated by reconstructing the sample image 531 with the style-related features 533 and the semantic-related features 532 as conditions. Accordingly, the feature transformation model 330 and the second query representation 303 are also updated. In some embodiments, if the model 240 includes a trainable transformation unit (e.g., as described with reference to FIG. 4 ), such a transformation unit is also updated with the training based on the reconstruction task.

如上文所描述的,在训练任务500A、500B和500C的执行中,特征变换模型230、特征变换模型330、第一查询表示203和第二查询表示303是可训练的,模型240是部分可训练的(如参考图4所描述的部分),其余的模型是冻结的。As described above, in the execution of the training tasks 500A, 500B, and 500C, the feature transformation model 230, the feature transformation model 330, the first query representation 203, and the second query representation 303 are trainable, the model 240 is partially trainable (i.e., the portion described with reference to FIG. 4), and the remaining models are frozen.

在一些实施例中,为了支持上述的训练任务,可以创建相应的样本集。示例性的,可以确定指示不同风格的一组风格词和指示不同内容的一组主体词。通过风格词和主体词的组合,可以得到多项提示信息,例如,多个提示词。每项提示信息指示待生成的风格和内容。例如,同一项提示信息可以生成多个样本图像。在一些实施例中,可以将同一项提示信息生成的多个样本图像中的任一对样本图像用于风格特征提取的训练任务。例如,图5A中的样本图像511和样本图像520是由同一提示信息生成的。与使用相同风格词但不同主体词的图像对相比,使用相同提示信息生成的图像对可以获得更好的风格化效果。In some embodiments, in order to support the above-mentioned training tasks, corresponding sample sets can be created. Exemplarily, a group of style words indicating different styles and a group of subject words indicating different contents can be determined. Through the combination of style words and subject words, multiple prompt information can be obtained, for example, multiple prompt words. Each prompt information indicates the style and content to be generated. For example, the same prompt information can generate multiple sample images. In some embodiments, any pair of sample images among the multiple sample images generated by the same prompt information can be used for the training task of style feature extraction. For example, sample image 511 and sample image 520 in Figure 5A are generated by the same prompt information. Compared with image pairs using the same style words but different subject words, image pairs generated using the same prompt information can obtain better stylization effects.
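The sample-set construction described above can be sketched as a cross product of style words and subject words; two images generated from the same prompt form a pair for the style task, while images sharing a subject word but differing in style word form a pair for the semantic task. The word lists below are invented examples.

```python
from itertools import product

style_words = ["watercolor", "pixel art", "ink wash"]     # illustrative styles
subject_words = ["a cat", "a lighthouse", "musical notes"]  # illustrative subjects

# Each prompt indicates both the style and the content to be generated.
prompts = [f"{subject} in {style} style"
           for style, subject in product(style_words, subject_words)]
```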

在一些实施例中,具有相同主体词、不同风格词的样本图像对可以用于语义特征提取的训练任务。例如,图5B中的样本图像521和样本图像530可以是基于相同主体词、不同风格词而生成的。In some embodiments, sample image pairs with the same subject word and different style words can be used for the training task of semantic feature extraction. For example, sample image 521 and sample image 530 in FIG5B can be generated based on the same subject word and different style words.

示例过程、装置和设备Example procedures, devices and equipment

图6示出了根据本公开的一些实施例的图像处理的过程600的流程图。过程600可以被实现在电子设备110处。FIG6 shows a flowchart of a process 600 of image processing according to some embodiments of the present disclosure. The process 600 may be implemented at the electronic device 110.

在框610,电子设备110获取第一参考图像的第一参考图像特征。In block 610 , the electronic device 110 acquires a first reference image feature of a first reference image.

在框620,电子设备110基于所述第一参考图像特征、用于风格特征提取的第一预定提示和第一查询表示,确定所述第一参考图像的风格相关特征。In block 620, the electronic device 110 determines style-related features of the first reference image based on the first reference image features, a first predetermined prompt for style feature extraction, and a first query representation.

在框630,电子设备110基于所述第一参考图像的所述风格相关特征和指示目标图像内容的第一输入文本,生成第一目标图像。所述第一目标图像与所述第一参考图像的图像风格和所述目标图像内容相匹配。In block 630, the electronic device 110 generates a first target image based on the style-related features of the first reference image and a first input text indicating the content of the target image. The first target image matches the image style of the first reference image and the content of the target image.

在一些实施例中,生成第一目标图像包括:获取第一输入文本的第一文本特征;基于风格相关特征和第一文本特征,利用第一模型生成第一目标图像特征,其中第一模型包括多个处理层,多个处理层分别用于处理相应尺寸的特征,并且风格相关特征被提供到尺寸大于第一阈值尺寸的第一数目的处理层;以及基于第一目标图像特征,生成第一目标图像。In some embodiments, generating a first target image includes: obtaining a first text feature of a first input text; generating a first target image feature using a first model based on style-related features and the first text feature, wherein the first model includes multiple processing layers, the multiple processing layers are respectively used to process features of corresponding sizes, and the style-related features are provided to a first number of processing layers having a size greater than a first threshold size; and generating a first target image based on the first target image feature.

在一些实施例中,利用第一模型生成第一目标图像特征包括:针对第一数目的处理层中的给定处理层:将给定处理层的输入图像特征转换为查询特征;通过转换第一文本特征和风格相关特征,生成键特征和值特征;基于查询特征、键特征和值特征,生成给定处理层的输出图像特征。In some embodiments, generating a first target image feature using a first model includes: for a given processing layer in the first number of processing layers: converting the input image features of the given processing layer into query features; generating key features and value features by converting first text features and style-related features; and generating output image features of the given processing layer based on the query features, key features, and value features.

在一些实施例中,输入图像特征是利用第一转换单元而被转换为查询特征的,并且生成键特征和值特征包括:利用第二转换单元将第一文本特征转换为第一中间文本特征;利用第三转换单元将第一文本特征转换为第二中间文本特征;利用第四转换单元将风格相关特征转换为第一中间风格特征;利用第五转换单元将风格相关特征转换为第二中间风格特征;将第一中间文本特征和第一中间风格特征组合成键特征;以及将第二中间文本特征和第二中间风格特征组合成值特征。In some embodiments, input image features are converted into query features using a first conversion unit, and generating key features and value features includes: converting the first text feature into a first intermediate text feature using a second conversion unit; converting the first text feature into a second intermediate text feature using a third conversion unit; converting the style-related feature into a first intermediate style feature using a fourth conversion unit; converting the style-related feature into a second intermediate style feature using a fifth conversion unit; combining the first intermediate text feature and the first intermediate style feature into a key feature; and combining the second intermediate text feature and the second intermediate style feature into a value feature.

在一些实施例中,第一转换单元、第二转换单元和第三转换单元是通过第一训练模式得到的,并且第四转换单元和第五转换单元是通过不同于第一训练模式的第二训练模式得到的。In some embodiments, the first transformation unit, the second transformation unit, and the third transformation unit are obtained through a first training pattern, and the fourth transformation unit and the fifth transformation unit are obtained through a second training pattern different from the first training pattern.

在一些实施例中,过程600还包括:获取第二参考图像的第二参考图像特征;基于第二参考图像特征、用于语义特征提取的第二预定提示和第二查询表示,确定第二参考图像的语义相关特征;以及基于语义相关特征和指示目标图像风格的第二输入文本,生成第二目标图像,第二目标图像与第二参考图像的图像内容和目标图像风格相匹配。In some embodiments, process 600 also includes: obtaining second reference image features of the second reference image; determining semantically relevant features of the second reference image based on the second reference image features, a second predetermined prompt for semantic feature extraction, and a second query representation; and generating a second target image based on the semantically relevant features and a second input text indicating the style of the target image, the second target image matching the image content and the target image style of the second reference image.

在一些实施例中,生成第二目标图像包括:获取第二输入文本的第二文本特征;基于语义相关特征和第二文本特征,根据第一模型生成第二目标图像特征,其中第一模型包括多个处理层,多个处理层分别用于处理相应尺寸的特征,并且语义相关特征被提供到尺寸小于第二阈值尺寸的第二数目的处理层;以及基于第二目标图像特征,生成第二目标图像。In some embodiments, generating a second target image includes: obtaining second text features of a second input text; generating second target image features according to a first model based on semantically related features and the second text features, wherein the first model includes multiple processing layers, the multiple processing layers are respectively used to process features of corresponding sizes, and the semantically related features are provided to a second number of processing layers whose sizes are less than a second threshold size; and generating a second target image based on the second target image features.

在一些实施例中,风格相关特征是利用特征变换模型确定的,第一查询表示是通过特征变换模型的训练得到的,并且特征变换模型的训练包括:获取具有相同图像风格的第一样本图像和第二样本图像;基于第一样本图像的第一样本图像特征、第一预定提示和第一查询表示,利用特征变换模型确定第一样本图像的风格相关特征;以第一样本图像的风格相关特征和指示第二样本图像的图像内容的文本作为条件,通过对第二样本图像的加噪和去噪来更新特征变换模型和第一查询表示。In some embodiments, the style-related features are determined using a feature transformation model, the first query representation is obtained through training of the feature transformation model, and the training of the feature transformation model includes: obtaining a first sample image and a second sample image having the same image style; determining style-related features of the first sample image using the feature transformation model based on first sample image features of the first sample image, the first predetermined prompt, and the first query representation; and updating the feature transformation model and the first query representation by adding noise to and then denoising the second sample image, with the style-related features of the first sample image and text indicating the image content of the second sample image as conditions.

在一些实施例中,第一样本图像和第二样本图像是基于相同的提示信息生成的,提示信息指示待生成图像的风格和内容。In some embodiments, the first sample image and the second sample image are generated based on the same prompt information, where the prompt information indicates the style and content of the image to be generated.

在一些实施例中,特征变换模型的训练还包括:基于第三样本图像的第三样本图像特征、第一预定提示和第一查询表示,利用特征变换模型确定第三样本图像的风格相关特征;获取第三样本图像的语义相关特征;以及以风格相关特征和语义相关特征作为条件,通过重建第三样本图像来更新特征变换模型和第一查询表示。In some embodiments, the training of the feature transformation model also includes: determining style-related features of the third sample image using the feature transformation model based on third sample image features of the third sample image, a first predetermined prompt and a first query representation; obtaining semantic-related features of the third sample image; and updating the feature transformation model and the first query representation by reconstructing the third sample image with the style-related features and the semantic-related features as conditions.

图7示出了根据本公开的某些实施例的用于图像处理的装置700的示意性结构框图。装置700可以被实现为或者被包括在电子设备110中。装置700中的各个模块/组件可以由硬件、软件、固件或者它们的任意组合来实现。7 shows a schematic structural block diagram of an apparatus 700 for image processing according to some embodiments of the present disclosure. The apparatus 700 may be implemented as or included in the electronic device 110. Each module/component in the apparatus 700 may be implemented by hardware, software, firmware or any combination thereof.

如图所示,装置700包括第一图像特征提取模块710,被配置为获取第一参考图像的第一参考图像特征。装置700还包括风格相关特征模块720,被配置为基于第一参考图像特征、用于风格特征提取的第一预定提示和第一查询表示,确定第一参考图像的风格相关特征。装置700还包括第一目标图像生成模块730,被配置为基于第一参考图像的风格相关特征和指示目标图像内容的第一输入文本,生成第一目标图像,第一目标图像与第一参考图像的图像风格和目标图像内容相匹配。As shown in the figure, the device 700 includes a first image feature extraction module 710, which is configured to obtain a first reference image feature of a first reference image. The device 700 also includes a style-related feature module 720, which is configured to determine the style-related features of the first reference image based on the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation. The device 700 also includes a first target image generation module 730, which is configured to generate a first target image based on the style-related features of the first reference image and a first input text indicating the content of the target image, and the first target image matches the image style and the target image content of the first reference image.

在一些实施例中,第一目标图像生成模块730进一步被配置为:获取第一输入文本的第一文本特征;基于风格相关特征和第一文本特征,利用第一模型生成第一目标图像特征,其中第一模型包括多个处理层,多个处理层分别用于处理相应尺寸的特征,并且风格相关特征被提供到尺寸大于第一阈值尺寸的第一数目的处理层;以及基于第一目标图像特征,生成第一目标图像。In some embodiments, the first target image generation module 730 is further configured to: obtain a first text feature of a first input text; generate a first target image feature using a first model based on the style-related feature and the first text feature, wherein the first model includes multiple processing layers, the multiple processing layers are respectively used to process features of corresponding sizes, and the style-related features are provided to a first number of processing layers whose sizes are greater than a first threshold size; and generate a first target image based on the first target image feature.

在一些实施例中,第一目标图像生成模块730进一步被配置为:针对第一数目的处理层中的给定处理层:将给定处理层的输入图像特征转换为查询特征;通过转换第一文本特征和风格相关特征,生成键特征和值特征;基于查询特征、键特征和值特征,生成给定处理层的输出图像特征。In some embodiments, the first target image generation module 730 is further configured to: for a given processing layer in the first number of processing layers: convert the input image features of the given processing layer into query features; generate key features and value features by converting the first text features and style-related features; and generate output image features of the given processing layer based on the query features, key features, and value features.

在一些实施例中,输入图像特征是利用第一转换单元而被转换为查询特征的,并且第一目标图像生成模块730进一步被配置为:利用第二转换单元将第一文本特征转换为第一中间文本特征;利用第三转换单元将第一文本特征转换为第二中间文本特征;利用第四转换单元将风格相关特征转换为第一中间风格特征;利用第五转换单元将风格相关特征转换为第二中间风格特征;将第一中间文本特征和第一中间风格特征组合成键特征;以及将第二中间文本特征和第二中间风格特征组合成值特征。In some embodiments, the input image feature is converted into a query feature using a first conversion unit, and the first target image generation module 730 is further configured to: convert the first text feature into a first intermediate text feature using a second conversion unit; convert the first text feature into a second intermediate text feature using a third conversion unit; convert the style-related feature into a first intermediate style feature using a fourth conversion unit; convert the style-related feature into a second intermediate style feature using a fifth conversion unit; combine the first intermediate text feature and the first intermediate style feature into a key feature; and combine the second intermediate text feature and the second intermediate style feature into a value feature.

在一些实施例中,第一转换单元、第二转换单元和第三转换单元是通过第一训练模式得到的,并且第四转换单元和第五转换单元是通过不同于第一训练模式的第二训练模式得到的。In some embodiments, the first transformation unit, the second transformation unit, and the third transformation unit are obtained through a first training pattern, and the fourth transformation unit and the fifth transformation unit are obtained through a second training pattern different from the first training pattern.

在一些实施例中,装置700还包括:第二图像特征提取模块,被配置为获取第二参考图像的第二参考图像特征;语义相关特征模块,被配置为基于第二参考图像特征、用于语义特征提取的第二预定提示和第二查询表示,确定第二参考图像的语义相关特征;以及第二目标图像生成模块,被配置为基于语义相关特征和指示目标图像风格的第二输入文本,生成第二目标图像,第二目标图像与第二参考图像的图像内容和目标图像风格相匹配。In some embodiments, the device 700 also includes: a second image feature extraction module, configured to obtain second reference image features of the second reference image; a semantic related feature module, configured to determine the semantic related features of the second reference image based on the second reference image features, a second predetermined prompt for semantic feature extraction and a second query representation; and a second target image generation module, configured to generate a second target image based on the semantic related features and a second input text indicating the style of the target image, the second target image matching the image content and the target image style of the second reference image.

在一些实施例中,第二目标图像生成模块进一步被配置为:获取第二输入文本的第二文本特征;基于语义相关特征和第二文本特征,根据第一模型生成第二目标图像特征,其中第一模型包括多个处理层,多个处理层分别用于处理相应尺寸的特征,并且语义相关特征被提供到尺寸小于第二阈值尺寸的第二数目的处理层;以及基于第二目标图像特征,生成第二目标图像。In some embodiments, the second target image generation module is further configured to: obtain second text features of a second input text; generate second target image features according to a first model based on semantically related features and the second text features, wherein the first model includes multiple processing layers, the multiple processing layers are respectively used to process features of corresponding sizes, and the semantically related features are provided to a second number of processing layers whose sizes are less than a second threshold size; and generate a second target image based on the second target image features.

在一些实施例中,风格相关特征是利用特征变换模型确定的,第一查询表示是通过特征变换模型的训练得到的,并且特征变换模型的训练包括:获取具有相同图像风格的第一样本图像和第二样本图像;基于第一样本图像的第一样本图像特征、第一预定提示和第一查询表示,利用特征变换模型确定第一样本图像的风格相关特征;以第一样本图像的风格相关特征和指示第二样本图像的图像内容的文本作为条件,通过对第二样本图像的加噪和去噪来更新特征变换模型和第一查询表示。In some embodiments, the style-related features are determined using a feature transformation model, the first query representation is obtained through training of the feature transformation model, and the training of the feature transformation model includes: obtaining a first sample image and a second sample image having the same image style; determining style-related features of the first sample image using the feature transformation model based on first sample image features of the first sample image, the first predetermined prompt, and the first query representation; and updating the feature transformation model and the first query representation by adding noise to and then denoising the second sample image, with the style-related features of the first sample image and text indicating the image content of the second sample image as conditions.

在一些实施例中,第一样本图像和第二样本图像是基于相同的提示信息生成的,提示信息指示待生成图像的风格和内容。In some embodiments, the first sample image and the second sample image are generated based on the same prompt information, where the prompt information indicates the style and content of the image to be generated.

在一些实施例中,特征变换模型的训练还包括:基于第三样本图像的第三样本图像特征、第一预定提示和第一查询表示,利用特征变换模型确定第三样本图像的风格相关特征;获取第三样本图像的语义相关特征;以及以风格相关特征和语义相关特征作为条件,通过重建第三样本图像来更新特征变换模型和第一查询表示。In some embodiments, the training of the feature transformation model also includes: determining style-related features of the third sample image using the feature transformation model based on third sample image features of the third sample image, a first predetermined prompt and a first query representation; obtaining semantic-related features of the third sample image; and updating the feature transformation model and the first query representation by reconstructing the third sample image with the style-related features and the semantic-related features as conditions.

图8示出了其中可以实施本公开的一个或多个实施例的电子设备800的框图。应当理解,图8所示出的电子设备800仅仅是示例性的,而不应当构成对本文所描述的实施例的功能和范围的任何限制。图8所示出的电子设备800可以用于实现图1的电子设备110。FIG8 shows a block diagram of an electronic device 800 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 800 shown in FIG8 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The electronic device 800 shown in FIG8 may be used to implement the electronic device 110 of FIG1 .

如图8所示,电子设备800是通用电子设备的形式。电子设备800的组件可以包括但不限于一个或多个处理器或处理单元810、存储器820、存储设备830、一个或多个通信单元840、一个或多个输入设备850以及一个或多个输出设备860。处理单元810可以是实际或虚拟处理器并且能够根据存储器820中存储的程序来执行各种处理。在多处理器系统中,多个处理单元并行执行计算机可执行指令,以提高电子设备800的并行处理能力。As shown in FIG8 , the electronic device 800 is in the form of a general electronic device. The components of the electronic device 800 may include, but are not limited to, one or more processors or processing units 810, a memory 820, a storage device 830, one or more communication units 840, one or more input devices 850, and one or more output devices 860. The processing unit 810 may be an actual or virtual processor and is capable of performing various processes according to a program stored in the memory 820. In a multi-processor system, multiple processing units execute computer executable instructions in parallel to improve the parallel processing capability of the electronic device 800.

电子设备800通常包括多个计算机存储介质。这样的介质可以是电子设备800可访问的任何可以获取的介质,包括但不限于易失性和非易失性介质、可拆卸和不可拆卸介质。存储器820可以是易失性存储器(例如寄存器、高速缓存、随机访问存储器(RAM))、非易失性存储器(例如,只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、闪存)或它们的某种组合。存储设备830可以是可拆卸或不可拆卸的介质,并且可以包括机器可读介质,诸如闪存驱动、磁盘或者任何其他介质,其可以能够用于存储信息和/或数据并且可以在电子设备800内被访问。The electronic device 800 typically includes multiple computer storage media. Such media may be any available media accessible to the electronic device 800, including but not limited to volatile and non-volatile media, removable and non-removable media. The memory 820 may be a volatile memory (e.g., a register, a cache, a random access memory (RAM)), a non-volatile memory (e.g., a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 830 may be a removable or non-removable medium, and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium that can be used to store information and/or data and that can be accessed within the electronic device 800.

电子设备800可以进一步包括另外的可拆卸/不可拆卸、易失性/非易失性存储介质。尽管未在图8中示出,可以提供用于从可拆卸、非易失性磁盘(例如“软盘”)进行读取或写入的磁盘驱动和用于从可拆卸、非易失性光盘进行读取或写入的光盘驱动。在这些情况中,每个驱动可以由一个或多个数据介质接口被连接至总线(未示出)。存储器820可以包括计算机程序产品825,其具有一个或多个程序模块,这些程序模块被配置为执行本公开的各种实施例的各种方法或动作。The electronic device 800 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 8 , a disk drive for reading or writing from a removable, non-volatile disk (e.g., a “floppy disk”) and an optical drive for reading or writing from a removable, non-volatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 820 may include a computer program product 825 having one or more program modules configured to perform various methods or actions of various embodiments of the present disclosure.

通信单元840实现通过通信介质与其他电子设备进行通信。附加地,电子设备800的组件的功能可以以单个计算集群或多个计算机器来实现,这些计算机器能够通过通信连接进行通信。因此,电子设备800可以使用与一个或多个其他服务器、网络个人计算机(PC)或者另一个网络节点的逻辑连接来在联网环境中进行操作。The communication unit 840 implements communication with other electronic devices through a communication medium. Additionally, the functions of the components of the electronic device 800 can be implemented with a single computing cluster or multiple computing machines that can communicate through a communication connection. Therefore, the electronic device 800 can operate in a networked environment using a logical connection with one or more other servers, a network personal computer (PC), or another network node.

输入设备850可以是一个或多个输入设备,例如鼠标、键盘、追踪球等。输出设备860可以是一个或多个输出设备,例如显示器、扬声器、打印机等。电子设备800还可以根据需要通过通信单元840与一个或多个外部设备(未示出)进行通信,外部设备诸如存储设备、显示设备等,与一个或多个使得用户与电子设备800交互的设备进行通信,或者与使得电子设备800与一个或多个其他电子设备通信的任何设备(例如,网卡、调制解调器等)进行通信。这样的通信可以经由输入/输出(I/O)接口(未示出)来执行。The input device 850 may be one or more input devices, such as a mouse, a keyboard, a trackball, etc. The output device 860 may be one or more output devices, such as a display, a speaker, a printer, etc. The electronic device 800 may also communicate through the communication unit 840, as needed, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the electronic device 800, or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 800 to communicate with one or more other electronic devices. Such communication may be performed via an input/output (I/O) interface (not shown).

根据本公开的示例性实现方式,提供了一种计算机可读存储介质,其上存储有计算机可执行指令,其中计算机可执行指令被处理器执行以实现上文描述的方法。根据本公开的示例性实现方式,还提供了一种计算机程序产品,计算机程序产品被有形地存储在非瞬态计算机可读介质上并且包括计算机可执行指令,而计算机可执行指令被处理器执行以实现上文描述的方法。According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the method described above. According to an exemplary implementation of the present disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.

这里参照根据本公开实现的方法、装置、设备和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。Various aspects of the present disclosure are described herein with reference to the flowcharts and/or block diagrams of the methods, devices, equipment, and computer program products implemented according to the present disclosure. It should be understood that each box in the flowchart and/or block diagram and the combination of each box in the flowchart and/or block diagram can be implemented by computer-readable program instructions.

这些计算机可读程序指令可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理单元,从而生产出一种机器,使得这些指令在通过计算机或其他可编程数据处理装置的处理单元执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing device, thereby producing a machine, so that when these instructions are executed by the processing unit of the computer or other programmable data processing device, a device that implements the functions/actions specified in one or more boxes in the flowchart and/or block diagram is generated. These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause the computer, programmable data processing device, and/or other equipment to work in a specific manner, so that the computer-readable medium storing the instructions includes a manufactured product, which includes instructions for implementing various aspects of the functions/actions specified in one or more boxes in the flowchart and/or block diagram.

可以把计算机可读程序指令加载到计算机、其他可编程数据处理装置、或其他设备上,使得在计算机、其他可编程数据处理装置或其他设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其他可编程数据处理装置、或其他设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, so that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more boxes in the flowchart and/or block diagram.

附图中的流程图和框图显示了根据本公开的多个实现的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the figures illustrate the possible architecture, functionality, and operation of systems, methods, and computer program products according to multiple implementations of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

以上已经描述了本公开的各实现,上述说明是示例性的,并非穷尽性的,并且也不限于所公开的各实现。在不偏离所说明的各实现的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实现的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其他普通技术人员能理解本文公开的各个实现方式。The above descriptions of various implementations of the present disclosure are exemplary, non-exhaustive, and not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The selection of terms used herein is intended to best explain the principles of the implementations, practical applications, or improvements to the technology in the market, or to enable other persons of ordinary skill in the art to understand the various implementations disclosed herein.
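As an illustrative aside, the decoupled key/value construction described in this disclosure — the input image features of a processing layer projected to a query by a first conversion unit, while the text feature and the style-related feature are each projected by their own key and value conversion units and then concatenated before attention — can be sketched as follows. All shapes, weight names, and the single-head formulation are assumptions for illustration, not details taken from the publication.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_cross_attention(img, text, style, W):
    """Image features form the query (first conversion unit); the text
    feature and the style-related feature are projected by separate
    key/value conversion units, and the two keys and the two values are
    concatenated along the token axis before standard attention."""
    Q = img @ W["q"]
    K = np.concatenate([text @ W["k_txt"], style @ W["k_sty"]], axis=0)
    V = np.concatenate([text @ W["v_txt"], style @ W["v_sty"]], axis=0)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V  # output image features of the given processing layer

rng = np.random.default_rng(2)
d = 16
W = {k: rng.normal(size=(d, d)) for k in ("q", "k_txt", "v_txt", "k_sty", "v_sty")}
img = rng.normal(size=(64, d))    # input image features of a given processing layer
text = rng.normal(size=(10, d))   # text feature of the input text
style = rng.normal(size=(8, d))   # style-related features of the reference image
out = decoupled_cross_attention(img, text, style, W)
print(out.shape)  # (64, 16)
```

Keeping separate conversion units for the text side and the style side means the two conditions can be trained in different modes (as the embodiments describe) while sharing one attention computation.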

Claims (14)

1. 一种图像处理方法,包括:获取第一参考图像的第一参考图像特征;基于所述第一参考图像特征、用于风格特征提取的第一预定提示和第一查询表示,确定所述第一参考图像的风格相关特征;以及基于所述第一参考图像的所述风格相关特征和指示目标图像内容的第一输入文本,生成第一目标图像,所述第一目标图像与所述第一参考图像的图像风格和所述目标图像内容相匹配。An image processing method, comprising: acquiring a first reference image feature of a first reference image; determining style-related features of the first reference image based on the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation; and generating a first target image based on the style-related features of the first reference image and first input text indicating target image content, the first target image matching the image style of the first reference image and the target image content.

2. 根据权利要求1所述的方法,其中生成所述第一目标图像包括:获取所述第一输入文本的第一文本特征;基于所述风格相关特征和所述第一文本特征,利用第一模型生成第一目标图像特征,其中所述第一模型包括多个处理层,所述多个处理层分别用于处理相应尺寸的特征,并且所述风格相关特征被提供到尺寸大于第一阈值尺寸的第一数目的处理层;以及基于所述第一目标图像特征,生成所述第一目标图像。The method of claim 1, wherein generating the first target image comprises: acquiring a first text feature of the first input text; generating a first target image feature using a first model based on the style-related features and the first text feature, wherein the first model comprises a plurality of processing layers each configured to process features of a corresponding size, and the style-related features are provided to a first number of processing layers whose size is greater than a first threshold size; and generating the first target image based on the first target image feature.
3. 根据权利要求2所述的方法,其中利用第一模型生成所述第一目标图像特征包括:针对所述第一数目的处理层中的给定处理层:将所述给定处理层的输入图像特征转换为查询特征;通过转换所述第一文本特征和所述风格相关特征,生成键特征和值特征;基于所述查询特征、所述键特征和所述值特征,生成所述给定处理层的输出图像特征。The method of claim 2, wherein generating the first target image feature using the first model comprises, for a given processing layer of the first number of processing layers: converting an input image feature of the given processing layer into a query feature; generating a key feature and a value feature by converting the first text feature and the style-related features; and generating an output image feature of the given processing layer based on the query feature, the key feature, and the value feature.

4. 根据权利要求3所述的方法,其中所述输入图像特征是利用第一转换单元而被转换为所述查询特征的,并且生成所述键特征和所述值特征包括:利用第二转换单元将所述第一文本特征转换为第一中间文本特征;利用第三转换单元将所述第一文本特征转换为第二中间文本特征;利用第四转换单元将所述风格相关特征转换为第一中间风格特征;利用第五转换单元将所述风格相关特征转换为第二中间风格特征;将所述第一中间文本特征和所述第一中间风格特征组合成所述键特征;以及将所述第二中间文本特征和所述第二中间风格特征组合成所述值特征。The method of claim 3, wherein the input image feature is converted into the query feature using a first conversion unit, and generating the key feature and the value feature comprises: converting the first text feature into a first intermediate text feature using a second conversion unit; converting the first text feature into a second intermediate text feature using a third conversion unit; converting the style-related features into a first intermediate style feature using a fourth conversion unit; converting the style-related features into a second intermediate style feature using a fifth conversion unit; combining the first intermediate text feature and the first intermediate style feature into the key feature; and combining the second intermediate text feature and the second intermediate style feature into the value feature.
5. 根据权利要求4所述的方法,其中所述第一转换单元、所述第二转换单元和所述第三转换单元是通过第一训练模式得到的,并且所述第四转换单元和所述第五转换单元是通过不同于所述第一训练模式的第二训练模式得到的。The method of claim 4, wherein the first conversion unit, the second conversion unit, and the third conversion unit are obtained through a first training mode, and the fourth conversion unit and the fifth conversion unit are obtained through a second training mode different from the first training mode.

6. 根据权利要求1所述的方法,还包括:获取第二参考图像的第二参考图像特征;基于所述第二参考图像特征、用于语义特征提取的第二预定提示和第二查询表示,确定所述第二参考图像的语义相关特征;以及基于所述语义相关特征和指示目标图像风格的第二输入文本,生成第二目标图像,所述第二目标图像与所述第二参考图像的图像内容和所述目标图像风格相匹配。The method of claim 1, further comprising: acquiring a second reference image feature of a second reference image; determining semantics-related features of the second reference image based on the second reference image feature, a second predetermined prompt for semantic feature extraction, and a second query representation; and generating a second target image based on the semantics-related features and second input text indicating a target image style, the second target image matching the image content of the second reference image and the target image style.
7. 根据权利要求6所述的方法,其中生成所述第二目标图像包括:获取所述第二输入文本的第二文本特征;基于所述语义相关特征和所述第二文本特征,根据第一模型生成第二目标图像特征,其中所述第一模型包括多个处理层,所述多个处理层分别用于处理相应尺寸的特征,并且所述语义相关特征被提供到尺寸小于第二阈值尺寸的第二数目的处理层;以及基于所述第二目标图像特征,生成所述第二目标图像。The method of claim 6, wherein generating the second target image comprises: acquiring a second text feature of the second input text; generating a second target image feature with the first model based on the semantics-related features and the second text feature, wherein the first model comprises a plurality of processing layers each configured to process features of a corresponding size, and the semantics-related features are provided to a second number of processing layers whose size is smaller than a second threshold size; and generating the second target image based on the second target image feature.

8. 根据权利要求1所述的方法,其中所述风格相关特征是利用特征变换模型确定的,所述第一查询表示是通过所述特征变换模型的训练得到的,并且所述特征变换模型的训练包括:获取具有相同图像风格的第一样本图像和第二样本图像;基于所述第一样本图像的第一样本图像特征、所述第一预定提示和所述第一查询表示,利用所述特征变换模型确定所述第一样本图像的风格相关特征;以所述第一样本图像的风格相关特征和指示所述第二样本图像的图像内容的文本作为条件,通过对所述第二样本图像的加噪和去噪来更新所述特征变换模型和所述第一查询表示。The method of claim 1, wherein the style-related features are determined using a feature transformation model, the first query representation is obtained through training of the feature transformation model, and the training of the feature transformation model comprises: acquiring a first sample image and a second sample image having the same image style; determining style-related features of the first sample image using the feature transformation model based on a first sample image feature of the first sample image, the first predetermined prompt, and the first query representation; and updating the feature transformation model and the first query representation by adding noise to and denoising the second sample image, conditioned on the style-related features of the first sample image and text indicating the image content of the second sample image.
9. 根据权利要求8所述的方法,其中所述第一样本图像和所述第二样本图像是基于相同的提示信息生成的,所述提示信息指示待生成图像的风格和内容。The method of claim 8, wherein the first sample image and the second sample image are generated based on the same prompt information, the prompt information indicating the style and content of an image to be generated.

10. 根据权利要求8所述的方法,其中所述特征变换模型的训练还包括:基于第三样本图像的第三样本图像特征、所述第一预定提示和所述第一查询表示,利用所述特征变换模型确定所述第三样本图像的风格相关特征;获取所述第三样本图像的语义相关特征;以及以所述风格相关特征和所述语义相关特征作为条件,通过重建所述第三样本图像来更新所述特征变换模型和所述第一查询表示。The method of claim 8, wherein the training of the feature transformation model further comprises: determining style-related features of a third sample image using the feature transformation model based on a third sample image feature of the third sample image, the first predetermined prompt, and the first query representation; acquiring semantics-related features of the third sample image; and updating the feature transformation model and the first query representation by reconstructing the third sample image, conditioned on the style-related features and the semantics-related features.
11. 一种用于图像处理的装置,包括:第一图像特征提取模块,被配置为获取第一参考图像的第一参考图像特征;风格相关特征模块,被配置为基于所述第一参考图像特征、用于风格特征提取的第一预定提示和第一查询表示,确定所述第一参考图像的风格相关特征;以及第一目标图像生成模块,被配置为基于所述第一参考图像的所述风格相关特征和指示目标图像内容的第一输入文本,生成第一目标图像,所述第一目标图像与所述第一参考图像的图像风格和所述目标图像内容相匹配。An apparatus for image processing, comprising: a first image feature extraction module configured to acquire a first reference image feature of a first reference image; a style-related feature module configured to determine style-related features of the first reference image based on the first reference image feature, a first predetermined prompt for style feature extraction, and a first query representation; and a first target image generation module configured to generate a first target image based on the style-related features of the first reference image and first input text indicating target image content, the first target image matching the image style of the first reference image and the target image content.

12. 一种电子设备,包括:至少一个处理单元;以及至少一个存储器,所述至少一个存储器被耦合到所述至少一个处理单元并且存储用于由所述至少一个处理单元执行的指令,所述指令在由所述至少一个处理单元执行时使所述电子设备执行根据权利要求1至10中任一项所述的方法。An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to any one of claims 1 to 10.

13. 一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序可由处理器执行以实现根据权利要求1至10中任一项所述的方法。A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement the method according to any one of claims 1 to 10.
14. 一种计算机程序产品,所述计算机程序产品被有形地存储在计算机存储介质中并且包括计算机可执行指令,计算机可执行指令在由设备执行时使设备执行根据权利要求1-10中任一项所述的方法。A computer program product tangibly stored in a computer storage medium and comprising computer-executable instructions which, when executed by a device, cause the device to perform the method according to any one of claims 1-10.
PCT/CN2024/138045 2024-01-17 2024-12-10 Method and apparatus for image processing, and device and storage medium Pending WO2025152653A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410070352.8A CN120339050A (en) 2024-01-17 2024-01-17 Method, device, apparatus and storage medium for image processing
CN202410070352.8 2024-01-17

Publications (1)

Publication Number Publication Date
WO2025152653A1 true WO2025152653A1 (en) 2025-07-24

Family

ID=96349783

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/138045 Pending WO2025152653A1 (en) 2024-01-17 2024-12-10 Method and apparatus for image processing, and device and storage medium

Country Status (2)

Country Link
CN (1) CN120339050A (en)
WO (1) WO2025152653A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393371A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Image processing method and device and electronic equipment
CN116012488A (en) * 2023-01-05 2023-04-25 网易(杭州)网络有限公司 Stylized image generation method, device, computer equipment and storage medium
CN116233491A (en) * 2023-05-04 2023-06-06 阿里巴巴达摩院(杭州)科技有限公司 Video generation method and server
CN116958323A (en) * 2023-07-05 2023-10-27 腾讯科技(深圳)有限公司 Image generation method, device, electronic equipment, storage medium and program product
US20230351566A1 (en) * 2022-04-27 2023-11-02 Adobe Inc. Exemplar-based object appearance transfer driven by correspondence
CN118447262A (en) * 2024-04-30 2024-08-06 上海人工智能创新中心 A style migration image processing method, device, electronic device and storage medium


Also Published As

Publication number Publication date
CN120339050A (en) 2025-07-18

Similar Documents

Publication Publication Date Title
CN115942005A (en) Method, device, equipment and storage medium for generating commentary video
CN117197292A (en) Method, apparatus, device and storage medium for generating image
WO2025157280A1 (en) Interaction method and apparatus, device and storage medium
CN116188912A (en) Training method, device, medium and equipment for image synthesis model of subject image
JP2025511721A (en) Pre-training service system and service providing method based on the pre-training service system
CN119003558B (en) Information processing method, apparatus, device and storage medium
WO2025152653A1 (en) Method and apparatus for image processing, and device and storage medium
CN119004403A (en) Information processing method, apparatus, device and storage medium
CN120181256A (en) Method, device, equipment and storage medium for model training
CN115641475A (en) Method, device, equipment and storage medium for extracting features
WO2023240583A1 (en) Cross-media corresponding knowledge generating method and apparatus
US20250372127A1 (en) Content generation
US20250370607A1 (en) Method and apparatus for effect editing, device and storage medium
US20250005811A1 (en) Method, device, and storage medium for content generation
WO2025261521A1 (en) Method and apparatus for generating media content, device, and storage medium
US20250292069A1 (en) Context-based initiation of generative machine learning actions
CN117668280A (en) Image processing method and device and electronic equipment
WO2025245813A1 (en) Video interaction method and apparatus, device and computer readable storage medium
CN120786143A (en) Method, device, equipment and storage medium for generating video
WO2025199827A1 (en) Method and apparatus for presenting text works, device, and storage medium
WO2025256646A1 (en) Information display method and apparatus, device, and storage medium
CN119173867A (en) Method, device, equipment and storage medium for querying
CN120179761A (en) Method, device, equipment and storage medium for dialogue interaction
CN120068806A (en) Method, apparatus, device and storage medium for text processing
WO2025200895A1 (en) Method and apparatus for image processing, device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24918261

Country of ref document: EP

Kind code of ref document: A1