CN118212687A - Human body posture image generation method, device, equipment and medium - Google Patents
Human body posture image generation method, device, equipment and medium
- Publication number
- CN118212687A (application CN202410266800.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- target person
- feature vector
- human body
- vector corresponding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Biodiversity & Conservation Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Processing Or Creating Images (AREA)
Abstract
The disclosure relates to the technical field of image processing, and provides a human body posture image generation method, a device, electronic equipment and a medium. The method comprises: acquiring a target person image and a preset human body posture image, and inputting the target person image into a target person feature extraction module; encoding the target person image through a picture encoder to obtain an image feature vector corresponding to the target person image; encoding the target person image through a texture encoder to obtain a multi-scale feature vector corresponding to the target person image; segmenting the target person image through an image segmentation model to obtain a segmented human mask map; determining a first fusion feature corresponding to the target person image according to the image feature vector, the multi-scale feature vector and the human mask map; extracting features from the preset human body posture image through a control network to obtain a feature vector corresponding to the preset human body posture image; and inputting the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into a diffusion model, which processes them to obtain a first human body posture image corresponding to the target person image.
Description
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to a human body posture image generation method, a device, electronic equipment and a medium.
Background
With the rapid development of artificial intelligence and computer vision, pose-guided portrait generation has received extensive attention from researchers and industry. The task aims to synthesize a realistic portrait that matches a given pose, which has potential application value in film production, virtual reality, fashion design and other fields. Currently, generative adversarial networks (GANs) are the dominant approach to pose-guided portrait generation. GANs learn to synthesize high-quality portrait images by having a generator and a discriminator compete throughout training. However, existing GAN-based methods still fall short in portrait fidelity and in preserving texture details: when synthesizing a portrait in a particular pose, they sometimes produce texture distortions such as blurred clothing texture or unnatural changes in skin texture.
Furthermore, existing GAN models often perform poorly in difficult cases such as complex body deformations or severe occlusion of body parts. The generated image may suffer from merged body parts, unnatural joint distortion, and loss of detail in occluded regions, which reduce its realism and usability. As research has progressed, diffusion models have been explored as an emerging class of deep generative models that produce images by modelling the reverse diffusion process of a data distribution. While diffusion models show strong potential for generating realistic images, they still face challenges in pose-guided portrait generation: the generated portrait is difficult to align precisely with the given pose and appearance, so the synthesized image may be inconsistent with the desired pose or appearance information.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a human body posture image generation method, apparatus, electronic device, and computer-readable storage medium, so as to solve the technical problem in the prior art that a generated portrait is difficult to align precisely with a given pose and appearance, resulting in a synthesized image that is inconsistent with the desired pose or appearance information.
In a first aspect of the embodiments of the present disclosure, a human body posture image generation method is provided, including: acquiring a target person image and a preset human body posture image, and inputting the target person image into a target person feature extraction module; encoding the target person image through a picture encoder to obtain an image feature vector corresponding to the target person image; encoding the target person image through a texture encoder to obtain a multi-scale feature vector corresponding to the target person image; segmenting the target person image through an image segmentation model to obtain a segmented human mask map; determining a first fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, and the human mask map; extracting features from the preset human body posture image through a control network to obtain a feature vector corresponding to the preset human body posture image; and inputting the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into a diffusion model, and processing them through the diffusion model to obtain a first human body posture image corresponding to the target person image.
In a second aspect of the embodiments of the present disclosure, a human body posture image generation apparatus is provided, including: an acquisition module, configured to acquire a target person image and a preset human body posture image and input the target person image into a target person feature extraction module; a first encoding module, configured to encode the target person image through a picture encoder to obtain an image feature vector corresponding to the target person image; a second encoding module, configured to encode the target person image through a texture encoder to obtain a multi-scale feature vector corresponding to the target person image; a segmentation module, configured to segment the target person image through an image segmentation model to obtain a segmented human mask map; a determining module, configured to determine a first fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, and the human mask map; a feature extraction module, configured to extract features from the preset human body posture image through a control network to obtain a feature vector corresponding to the preset human body posture image; and a human body posture generation module, configured to input the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into a diffusion model and process them through the diffusion model to obtain a first human body posture image corresponding to the target person image.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. The target person image is processed by a picture encoder and a texture encoder respectively, so that not only are the key features related to the pose extracted, but the texture information of the target person, such as skin texture and clothing details, is also preserved. This multi-scale feature extraction helps maintain realistic texture during pose transformation and improves the quality of the generated image. By combining a control network with an image segmentation model, the invention can generate human body images that are more accurately aligned to a given pose: the control network ensures that the features of the preset human body pose effectively influence the generated result, and the image segmentation model provides a human mask map so that human body boundaries and occluded regions are handled better when synthesizing the new pose. The introduction of the diffusion model provides a new processing path; by simulating the reverse diffusion process, the model smoothly handles complex pose transformations and the generation of detail in occluded parts, improving the adaptability and stability of the generated image in complex scenes. By using the fusion feature of the target person image together with the preset pose feature, the invention ensures that the generated person image both preserves the appearance characteristics of the target person and accurately matches the preset pose, improving the consistency between person and pose and enhancing the realism and credibility of the generated image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the invention may be applied;
Fig. 2 is a flowchart of a method for generating a human body posture image according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a target person feature extraction module according to an embodiment of the disclosure;
Fig. 4 is a schematic diagram of a human body posture image generation model provided by an embodiment of the present disclosure;
Fig. 5 is a block diagram of a human body posture image generating apparatus provided in an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the present invention may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The user can interact with the server 105 through the network 104 using the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or transmit data, or the like. The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices with display screens including, but not limited to, smartphones, tablet computers, portable computers, desktop computers, and the like.
The server 105 may be a server providing various services. For example, the server 105 may acquire a target person image and a preset human body posture image from the first terminal device 101 (or from the second terminal device 102 or the third terminal device 103), and input the target person image into the target person feature extraction module; encode the target person image through a picture encoder to obtain an image feature vector corresponding to the target person image; encode the target person image through a texture encoder to obtain a multi-scale feature vector corresponding to the target person image; segment the target person image through an image segmentation model to obtain a segmented human mask map; determine a first fusion feature corresponding to the target person image according to the image feature vector, the multi-scale feature vector and the human mask map; extract features from the preset human body posture image through a control network to obtain a feature vector corresponding to the preset human body posture image; and input the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into a diffusion model, which processes them to obtain a first human body posture image corresponding to the target person image.
In some embodiments, the human body posture image generation method provided by the embodiments of the present invention is generally performed by the server 105, and accordingly the human body posture image generation apparatus is generally provided in the server 105. In other embodiments, some terminal devices may have functionality similar to a server and perform the method themselves; the method provided by the embodiments of the present invention is therefore not limited to execution on the server side.
A human body posture image generating method and apparatus according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a flowchart of a human body posture image generating method according to an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be performed by any electronic device having computer processing capabilities, for example, the electronic device may be a server as shown in fig. 1.
As shown in fig. 2, the human body posture image generating method includes steps S210 to S270.
In step S210, a target person image and a preset human body pose image are acquired, and the target person image is input to the target person feature extraction module.
In step S220, the target person image is encoded by the picture encoder to obtain the image feature vector corresponding to the target person image.
In step S230, the texture encoder encodes the target person image to obtain a multi-scale feature vector corresponding to the target person image.
In step S240, the target person image is segmented by the image segmentation model to obtain a segmented human mask map.
In step S250, a first fusion feature corresponding to the target person image is determined from the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, and the human mask map.
In step S260, feature extraction is performed on the preset human body posture image through the control network, so as to obtain a feature vector corresponding to the preset human body posture image.
In step S270, the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image are input to the diffusion model, and the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image are processed through the diffusion model to obtain the first human body posture image corresponding to the target person image.
The method processes the target person image with a picture encoder and a texture encoder respectively, so that not only are the key features related to the pose extracted, but the texture information of the target person, such as skin texture and clothing details, is also preserved. This multi-scale feature extraction helps maintain realistic texture during pose transformation and improves the quality of the generated image. By combining a control network with an image segmentation model, the invention can generate human body images that are more accurately aligned to a given pose: the control network ensures that the features of the preset human body pose effectively influence the generated result, and the image segmentation model provides a human mask map so that human body boundaries and occluded regions are handled better when synthesizing the new pose. The introduction of the diffusion model provides a new processing path; by simulating the reverse diffusion process, the model smoothly handles complex pose transformations and the generation of detail in occluded parts, improving the adaptability and stability of the generated image in complex scenes. By using the fusion feature of the target person image together with the preset pose feature, the invention ensures that the generated person image both preserves the appearance characteristics of the target person and accurately matches the preset pose, improving the consistency between person and pose and enhancing the realism and credibility of the generated image.
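For orientation, the following sketch shows how steps S210 to S270 could be wired together in code. It is an illustrative outline only: the module names, the dictionary-of-callables interface, the tensor shapes and the number of denoising steps are assumptions made for the example and are not disclosed by this application.

```python
import torch

def generate_pose_image(person_img, pose_img, modules, latent_shape, num_steps=50):
    """Hypothetical wiring of steps S210-S270; every entry of `modules` is an assumed callable."""
    img_feat  = modules["picture_encoder"](person_img)        # S220: image feature vector
    tex_feat  = modules["texture_encoder"](person_img)        # S230: multi-scale feature vector
    mask      = modules["segmentation_model"](person_img)     # S240: segmented human mask map
    fusion    = modules["fusion_block"](img_feat, tex_feat, mask)  # S250: first fusion feature
    pose_feat = modules["control_net"](pose_img)               # S260: pose feature vector
    z = torch.randn(latent_shape)                              # preset noise map
    for t in reversed(range(num_steps)):                       # S270: iterative denoising
        z = modules["diffusion_model"](z, t, fusion, pose_feat)
    return modules["decoder"](z)                               # first human body posture image
```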
In some embodiments of the present disclosure, users are allowed to upload their own portrait directly, or an image of another person they wish to synthesize. This approach is intuitive and personalized: any specific portrait can be uploaded according to personal needs, and subsequent pose synthesis is performed on that basis. In addition, the invention may provide the user with a built-in or externally linked image database from which the user can select a suitable target person image. Images in the database include, but are not limited to, portraits of different genders, ages, styles and poses for the user to choose from freely. This option suits users who have no specific requirement for the person image or who want to pick samples quickly for experiments. The user may select the person pose they want to generate through a dedicated interactive interface, for example a graphical menu containing a series of preset human pose thumbnails that the user can browse, choosing a particular pose as the reference for generating the portrait. This provides a highly customized choice and lets the user precisely control the pose of the final generated image.
To enhance the user experience, the invention may also include a pose recommendation system based on the user's past behaviour, preferences, or popularity trends. The system can automatically recommend certain pose images, and the user may accept the recommendation or continue browsing more options. The recommendation system may use machine learning algorithms to refine its recommendations based on the user's selection history and other user data. When implementing the invention, the design of the user interface can be further optimized to provide a more intuitive and convenient workflow. Likewise, continuous improvement of the image database and of the recommendation algorithms is necessary to ensure that users obtain high-quality and highly relevant image resources. By offering multiple ways of acquiring images, the invention not only broadens its range of application but also improves user friendliness, helping pose-guided portrait generation to be used effectively in a wider range of practical scenarios.
Referring to fig. 3, the target person feature extraction module includes a picture encoder (e.g., a CLIP picture encoder), a texture encoder, a segmentation model, a text encoder (e.g., a CLIP text encoder), and a transformer block module.
Based on the foregoing embodiment, the target person image is denoted I_p-target. I_p-target is input to the CLIP picture encoder, which encodes the target person image to obtain an image feature vector corresponding to the target person image, for example E_CLIP-i. I_p-target is also input to the texture encoder, which encodes the target person image through its multiple layers; the outputs of different encoder layers form multi-scale image feature representations, which are concatenated into the multi-scale feature F_c, i.e. the multi-scale feature vector corresponding to the target person image. I_p-target is further input into the segmentation model, which processes the target person image to obtain the segmented human mask map. E_CLIP-i, F_c and the segmented human mask map are then input into the transformer block module for processing, giving the first fusion feature corresponding to the target person image.
Specifically, the target person image is first input to a CLIP picture encoder (Contrastive Language-Image Pretraining). CLIP is a deep learning model that learns to understand image content and semantic information through joint training on images and text. In the present invention, the CLIP picture encoder encodes the target person image: through this encoder, the image is converted into an image feature vector that captures the visual content and style of the picture and provides basic visual information for the subsequent generation task. The target person image is further input to a texture encoder. This encoder focuses on encoding the texture information of the image, paying attention to specific texture details such as the surface texture of clothing, hair and skin. Through the multi-layer network structure of the texture encoder, the target person image undergoes a series of encoding steps; the outputs of the different layers form a multi-scale representation of the image features, ranging from shallow texture features to deep structural features. These feature representations at different levels are concatenated into a comprehensive multi-scale feature vector, which enriches the feature information and improves the texture preservation of the image. Subsequently, the target person image is input to the segmentation model. The segmentation model is an image processing technique that distinguishes the person in the image from the background and generates a mask map of the human body. The resulting segmented human mask map defines the outline of the target person and the positions of the different body parts, which helps the human body region to be handled more accurately during subsequent synthesis. The image feature vector, the multi-scale feature vector and the segmented human mask map are then input into the transformer block module. The transformer block is a network architecture based on the self-attention mechanism, commonly used to process sequence data; here it processes the three features above to extract a richer feature vector that fuses texture and structural information, i.e. the first fusion feature. The first fusion feature combines the overall pose information, texture details and human body structure of the image, providing a solid basis for generating a person image that matches the preset pose. Through this series of feature extraction and fusion steps, the invention ensures that the target portrait maintains high-quality texture and structural consistency during pose conversion, so that the generated person image reaches a higher level of visual realism and detail richness. This approach ensures accurate presentation of the pose and texture features of the target person while improving the quality of the generated image.
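As a concrete illustration of the multi-scale texture encoding described above, the sketch below keeps the output of several convolutional stages and concatenates them into a single token sequence F_c. The layer count, channel widths and projection dimension are assumptions made for the example; the CLIP picture encoder and the segmentation model are left as opaque placeholders.

```python
import torch
import torch.nn as nn

class TextureEncoder(nn.Module):
    """Illustrative texture encoder: outputs from several conv stages are kept
    and concatenated into a multi-scale feature F_c (stage sizes are assumptions)."""
    def __init__(self, dim=64, out_dim=256):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3, dim, 3, stride=2, padding=1),
            nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1),
            nn.Conv2d(dim * 2, dim * 4, 3, stride=2, padding=1),
        ])
        self.proj = nn.ModuleList([nn.Linear(c, out_dim) for c in (dim, dim * 2, dim * 4)])

    def forward(self, x):
        tokens = []
        for stage, proj in zip(self.stages, self.proj):
            x = torch.relu(stage(x))
            t = x.flatten(2).transpose(1, 2)   # flatten each scale to tokens [B, H*W, C]
            tokens.append(proj(t))             # project every scale to a shared width
        return torch.cat(tokens, dim=1)        # F_c: concatenated multi-scale tokens

# Usage sketch (placeholder modules, hypothetical names):
# E_CLIP_i = clip_image_encoder(I_p_target)   # picture encoder output
# F_c      = TextureEncoder()(I_p_target)     # multi-scale feature vector
# Mask     = segmentation_model(I_p_target)   # segmented human mask map
```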
With continued reference to the transformer block in fig. 3, determining the first fusion feature vector corresponding to the target person image from the image feature vector, the multi-scale feature vector and the human mask map includes: performing self-attention on the image feature vector corresponding to the target person image to obtain a first feature vector; performing picture cross-attention between the multi-scale feature vector and the first feature vector to obtain a second feature vector; and performing spatial attention between the human mask map and the second feature vector to obtain the first fusion feature vector corresponding to the target person image. For example, the self-attention mechanism is first applied to the image feature vector. Self-attention allows the model to focus on specific parts of the image when processing it, capturing both local relationships and global dependencies. The resulting first feature vector enhances local detail and global context, providing rich visual information and finer feature expression for the generation process. Picture cross-attention is then applied between the first feature vector and the multi-scale feature vector. Cross-attention is an attention mechanism that relates different feature representations to each other, strengthening the model's understanding of the input data. It lets the model consider features from different network levels (the multi-scale features) and fuse them with the visual features obtained by the CLIP encoder, yielding the second feature vector. The last step is spatial attention using the segmented human mask map and the second feature vector. The spatial attention mechanism focuses on the spatial information of the image and is particularly useful for understanding the spatial structure of the target person and expressing the human body region more accurately. In this step, the pose features of the person are combined with their corresponding appearance features, ensuring that the body parts and key characteristics of the person are preserved and accurately represented when the new pose image is synthesized. Through these three steps, the final first fusion feature vector combines the texture features, structural information and spatial characteristics of the target person image, providing strong feature support for generating a high-quality person image consistent with the required pose. This markedly improves the quality and consistency of the synthesized portrait, optimizes the generation task, and provides a reliable feature foundation for more complex pose synthesis.
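A minimal PyTorch sketch of this three-stage fusion is given below, assuming all inputs have already been projected to a common token width. The use of nn.MultiheadAttention and the simple mask-weighting used for the spatial-attention stage are illustrative choices, not the exact construction claimed here.

```python
import torch
import torch.nn as nn

class FirstFusionBlock(nn.Module):
    """Sketch of the fusion: self-attention over the image feature tokens,
    picture cross-attention against the multi-scale texture tokens, then a
    simple mask-weighted spatial attention. Dimensions are assumptions."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, multiscale_tokens, mask_tokens):
        # self-attention -> first feature vector m1
        m1, _ = self.self_attn(img_tokens, img_tokens, img_tokens)
        # picture cross-attention (m1 as query, multi-scale features as key/value) -> m2
        m2, _ = self.cross_attn(m1, multiscale_tokens, multiscale_tokens)
        # spatial attention: weight tokens by the flattened human mask -> first fusion feature
        weights = torch.softmax(mask_tokens, dim=1)     # [B, N, 1]
        return m2 * weights

# img_tokens:        E_CLIP_i reshaped to [B, N, 256]
# multiscale_tokens: F_c, [B, M, 256]
# mask_tokens:       human mask map flattened/resized to [B, N, 1]
```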
In some embodiments, when performing the task of generating the target person pose image, text data may additionally be provided. The text data is mainly used to decorate the target person in the finally generated pose image, for example "a person wearing sunglasses" or "a person holding a large sword". In this embodiment, the method further includes: acquiring text data, wherein the text data comprises words that decorate the target person; encoding the text data through a text encoder to obtain a text feature vector corresponding to the text data; determining a second fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, the human mask map and the text feature vector corresponding to the text data; and inputting the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into the diffusion model, and processing them through the diffusion model to obtain a second human body posture image corresponding to the target person image.
Building on the foregoing embodiments, the user provides text data containing descriptive words or phrases for the decorative elements or attributes they wish to see in the finally generated pose image. The text data is passed to a text encoder, which converts the description into a machine-understandable form, i.e. a text feature vector: by learning the semantics of natural language expressions, the text encoder generates a high-dimensional vector reflecting the description. The second fusion feature corresponding to the target person image is then determined by combining the image feature vector, the multi-scale feature vector, the human mask map and the text feature vector produced by the text encoder. This second fusion feature contains not only visual information but also the user's textual description, which allows the system to add user-specified decorative attributes while keeping the person's pose consistent. The second fusion feature and the feature vector corresponding to the preset human body posture image are then input into the diffusion model, which processes these features to construct the final image progressively from a noise distribution by simulating a reverse Markov-chain process; this produces a realistic image while preserving and rendering the specific pose and decorative attributes desired by the user. After processing by the diffusion model, the second human body posture image corresponding to the target person image, carrying the user-defined decoration, is finally generated. By converting natural-language descriptions into image content, the invention opens a new image generation path that lets users customize and influence the details and style of the generated image through simple text input, providing strong visual customization and enhancing interactivity and user experience. By fusing visual and linguistic information, the invention markedly improves the diversity and personalization of the generated images and its ability to meet specific user requirements.
With continued reference to the transformer block in fig. 3, the implementation after adding text data is as follows. Self-attention calculation: the input is the image feature vector obtained in step (1), and the output is a first feature vector m1. Picture cross-attention calculation: m1 is taken as Q and the multi-scale feature vector obtained in step (2) as K and V, and cross-attention is computed to obtain a second feature vector m2. Picture-text cross-attention calculation: m2 is taken as Q and the text feature vector obtained in step (4) as K and V, and cross-attention is computed to obtain a third feature vector m3. Spatial cross-attention calculation: an initial attention matrix A for the spatial cross-attention module is created from the human mask map obtained in step (3). The attention matrix A is determined by the relation between the spatial positions of the human mask map and the text: with the number of image tokens derived from the human mask map denoted N_i and the number of text tokens denoted N_t, the attention matrix has shape A ∈ R^(N_i × N_t). Each column of A corresponds to one text token; if the picture region selected by the user corresponds to that text, the corresponding mask is flattened and used directly as the values of the column, otherwise the column is set to zero. The initialized attention matrix A is added into the cross-attention layer, and the modified cross-attention is:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k) + w·A)·V,
where Q is the image feature vector, K and V are the text feature vectors, d_k is the dimension of Q and K, and w is the weight that controls how strongly the user's input influences the attention. The weight w is computed from a user-specified intensity coefficient w' and the noise level σ of the diffusion process. This module is composed of 6 transformer block sub-modules connected in series, and the result is the second fusion feature, denoted E_F.
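The construction of the attention matrix A and the modified cross-attention can be sketched as follows. The `masks` mapping (text-token index to flattened region mask) is a hypothetical input format, and the weight w is passed in as a precomputed scalar because the exact formula combining w' and σ is not reproduced in the text above.

```python
import torch

def build_attention_matrix(masks, n_image_tokens, n_text_tokens):
    """Initial attention matrix A of shape [N_i, N_t]: columns whose text token has a
    user-selected region get the flattened mask values, all other columns stay zero.
    `masks` maps a text-token index to a 1-D flattened mask of length N_i (assumed)."""
    A = torch.zeros(n_image_tokens, n_text_tokens)
    for col, flat_mask in masks.items():
        A[:, col] = flat_mask
    return A

def modified_cross_attention(Q, K, V, A, w):
    """softmax(QK^T / sqrt(d_k) + w*A) V, with Q the image features and K, V the text features."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5 + w * A
    return torch.softmax(scores, dim=-1) @ V
```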
In some embodiments, determining the second fusion feature corresponding to the target person image based on the image feature vector, the multi-scale feature vector, the human mask map and the text feature vector includes: performing self-attention on the image feature vector corresponding to the target person image to obtain a first feature vector; performing picture cross-attention between the multi-scale feature vector and the first feature vector to obtain a second feature vector; performing picture-text cross-attention between the second feature vector and the text feature vector corresponding to the text data to obtain a third feature vector; and performing spatial attention between the human mask map and the third feature vector to obtain the second fusion feature vector corresponding to the target person image. For example, self-attention is first applied to the image feature vector. This step strengthens the representation of key parts of the image feature vector, capturing detailed information and internal dependencies of the target person image; in the resulting first feature vector, certain regions or attributes such as facial features and clothing textures are highlighted. Next, the first feature vector is combined with the multi-scale feature vector in a picture cross-attention step, which lets information from different levels (scales) and different visual details interact and be integrated. This cross-attention fuses local texture with global structure, which is necessary for generating images with complex texture and structural consistency, and produces a second feature vector containing further enhanced and refined visual features. Then, picture-text cross-attention is applied between the second feature vector and the text feature vector corresponding to the input text data. The key point of this step is that the text description can influence the visual content, establishing a link between the visual information of the image and the natural-language information of the text; it allows the model to combine the decorative attributes described by the user (e.g. clothing, accessories) with the visual features of the image, producing a third feature vector that is rich and consistent with the user's description. Finally, spatial attention is applied to the third feature vector using the human mask map. This step relies on the spatial attention mechanism to ensure that the feature vector corresponds accurately to specific body parts and the pose; through it the model can assign and adjust features spatially, so that the generated pose image visually matches the original human contour and pose. The result of this whole process is the second fusion feature vector, which includes not only the visual features of the target person but also the user's text description and the spatial information of the human mask. This enables the generated person to remain consistent in pose while possessing the decorative attributes desired by the user. The formation of the second fusion feature vector is therefore critical to generating a high-quality, personalized pose image that conforms to the user's description.
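Compared with the first fusion block sketched earlier, this text-conditioned variant only inserts one extra picture-text cross-attention stage between the picture cross-attention and the spatial attention, as in the following sketch (dimensions and the mask-weighting form are again assumptions).

```python
import torch.nn as nn

class SecondFusionBlock(nn.Module):
    """Text-conditioned variant of the fusion block: self-attention (m1),
    picture cross-attention (m2), picture-text cross-attention (m3),
    then mask-based spatial weighting. Sizes are illustrative assumptions."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pic_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, multiscale_tokens, text_tokens, mask_weights):
        m1, _ = self.self_attn(img_tokens, img_tokens, img_tokens)
        m2, _ = self.pic_cross(m1, multiscale_tokens, multiscale_tokens)
        m3, _ = self.txt_cross(m2, text_tokens, text_tokens)
        return m3 * mask_weights      # spatial attention via the human mask weights
```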
In some embodiments, processing the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain the first human body posture image corresponding to the target person image includes: acquiring a preset noise map and inputting the noise map into the diffusion model; and, based on the preset noise map, performing a preset number of denoising iterations on the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image to obtain the first human body posture image corresponding to the target person image. In this embodiment, the diffusion model may be a UNet, and the control network may be a pre-trained ControlNet.
Referring to fig. 4, fig. 4 shows the carrier to which the method of the present invention is applied, i.e. the human body posture image generation model. The human body posture image generation model comprises the above target person feature extraction module, a diffusion model, a control network and a decoder.
After the target person feature extraction module has processed the target person image, and when no text data is input, the first fusion feature corresponding to the target person image can be input directly into the diffusion model together with the noise map z_t. The result of processing the preset human body posture image (i.e. the human skeleton image) through the control network is also input into the diffusion model, and the diffusion model denoises on the basis of this input data to obtain the first human body posture image (i.e. the person image) corresponding to the target person image.
Specifically, the first step of the generation process is to acquire a preset noise map. The noise map is an image filled with random data that serves as the initial state of the generation process; like a blank canvas, it lets the diffusion model build up the desired image content step by step. The acquired noise map is input into the diffusion model, which is typically a deep neural network able to model the diffusion process, i.e. to generate an image with a specific structure and features gradually from a random noise state. During denoising, the preset noise map is processed iteratively: noise is removed step by step while the content carried by the fusion feature vector and the pose feature vector is introduced. The iteration is usually run a preset number of times to ensure image quality and stable generation. Each iteration makes a fine-grained adjustment to the noise image using the fusion feature, so the image becomes progressively clearer and its details are continuously refined, finally forming the first human body posture image corresponding to the target person image. In this embodiment, the diffusion model uses a UNet architecture. UNet is widely used for image segmentation and generation tasks and is particularly suited to handling local and global information in image data; its symmetric structure preserves fine image content during feature fusion. Iteratively denoising the noise map with the UNet model therefore effectively generates a vivid image that retains the texture and pose information of the target person. The ControlNet learns in advance how to capture the key features of the pose, ensuring pose accuracy and consistency during generation. With this deep learning network and the carefully designed denoising iteration, the method can generate high-quality target person images with accurate pose, vivid visual effect and rich content; it can handle complex image generation requirements and lets users customize the result to reflect individual aesthetics and specific scenes.
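A possible shape for this denoising loop is sketched below. The UNet, ControlNet and scheduler are treated as injected components; their call signatures (`context=`, `control=`, `scheduler.step`, `scheduler.timesteps`) are assumptions modelled on common diffusion toolkits rather than the concrete interfaces used in this application.

```python
import torch

@torch.no_grad()
def denoise(unet, control_net, fusion_feat, pose_img, z_T, scheduler, steps=50):
    """Illustrative denoising loop: a ControlNet-style network turns the skeleton
    image into conditioning features that guide a UNet at every step."""
    z = z_T                                           # preset noise map z_t
    for t in scheduler.timesteps(steps):
        control_feats = control_net(pose_img, t)      # pose feature vectors
        eps = unet(z, t, context=fusion_feat, control=control_feats)
        z = scheduler.step(eps, t, z)                 # one denoising update
    return z                                          # latent of the generated pose image
```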
In some embodiments, processing the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain the second human body posture image corresponding to the target person image includes: acquiring a preset noise map and inputting the noise map into the diffusion model; and, based on the preset noise map, performing a preset number of denoising iterations on the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image to obtain the second human body posture image corresponding to the target person image. For example, a preset noise map is first acquired before the generation process starts. The noise map consists of randomly generated noise points and serves as the starting point from which the diffusion model generates the image (it is typically a high-dimensional, randomly distributed data tensor matching the expected input size of the model). The noise map is provided as input to the diffusion model, a deep learning model designed for image generation that simulates the reverse diffusion process from a chaotic noise state to a clear image. In the subsequent denoising steps, the input noise data is iterated a preset number of times; these iterations gradually remove the noise and introduce the features of the target portrait, including the user's text description, the visual features of the target person and the feature vector of the preset human pose. In each iteration the diffusion model progressively learns and reproduces the key features of the target portrait, and the clarity and detail of the image increase throughout the process. Running a specified number of iterations gives precise control over generation quality, making the generated image realistic and consistent with the preset human pose and the text description. After the predetermined number of denoising iterations, the second human body posture image corresponding to the target person image is finally obtained. The method exploits the strong generative capacity of the diffusion model and its deep understanding of the features, effectively converting complex feature information (including the text description) into a high-quality customized image, which increases the flexibility of personalized portrait generation and enriches the user's participation in the creative process.
The training process of the target person feature extraction module is described below with reference to fig. 3 and fig. 4. Before the target person image is input into the target person feature extraction module, the method further includes: acquiring a plurality of training samples, wherein each training sample comprises a person image of a training object and training text data that decorates the object; encoding the person image of the training object through the picture encoder to obtain an image feature vector corresponding to the person image of the training object; encoding the person image of the training object through the texture encoder to obtain a multi-scale feature vector corresponding to the person image of the training object; segmenting the person image of the training object through the image segmentation model to obtain a segmented human mask map; encoding the training text data through the text encoder to obtain a text feature vector corresponding to the training text data; determining a fusion feature corresponding to the person image of the training object according to the image feature vector, the multi-scale feature vector, the segmented human mask map and the text feature vector corresponding to the training text data; and calculating a loss based on the fusion feature corresponding to the person image of the training object, and updating the parameters of the target person feature extraction module through the loss.
Specifically, a set of training samples is collected, each containing a person image of a training object and the corresponding training text data. The person images are diverse, covering various poses, expressions, backgrounds and so on, while the training text data contains related descriptive information such as apparel, actions or scenes. The person image of each training object is encoded with the picture encoder, converting the image into an image feature vector; this extracts visual content and style information for subsequent model training. The person image is also encoded with the texture encoder to produce multi-scale feature vectors that capture texture details at different levels, from coarse to fine, enriching the texture expression of the later generation process. The image segmentation model processes the training person image to obtain a human mask map: the segmentation model determines the boundary between the person and the background so that the contour and main structure of the person are well defined. The text data of the training samples is input into the text encoder and converted into text feature vectors, encoding the descriptive information into a numerical representation that the machine can understand. The resulting image feature vector, multi-scale feature vector, human mask map and text feature vector are combined to determine the fusion feature corresponding to the person image of the training object; this fusion feature contains the visual details of the person together with the text description and serves as the basis for training the network. The model loss is then computed from the fusion feature of the training object. In deep learning, the loss function measures the difference between the generated result and the ground truth; based on this loss, the parameters of the target person feature extraction module are adjusted and updated via back-propagation and an optimization algorithm (e.g. SGD or Adam). Optimization aims to reduce the loss and improve the model's ability to extract and discriminate features. This pre-training is critical: it lets the model accurately grasp and generate the target person image features required by the user in practice, including the customized attributes based on text descriptions, and the supervised training process ensures that the performance of the generation model is fully optimized, improving the quality and accuracy of the finally generated image. Referring to fig. 4, in this embodiment the fusion feature of the training object may be decoded by a decoder to reconstruct the target person image, i.e. the person image of the training object, which is then compared with the real training image to compute the loss.
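One training iteration of the target person feature extraction module might look like the following sketch, where the decoded reconstruction is compared with the real training image. The pixel MSE loss and the module interfaces are assumptions; the application only states that a loss is computed from the fusion feature and used to update the module's parameters.

```python
import torch.nn.functional as F

def training_step(batch, modules, optimizer):
    """Sketch of one update: encode the training image and text, fuse, decode a
    reconstruction, and back-propagate a reconstruction loss (loss choice assumed)."""
    person_img, text = batch["image"], batch["text"]
    img_feat = modules["picture_encoder"](person_img)
    tex_feat = modules["texture_encoder"](person_img)
    mask     = modules["segmentation_model"](person_img)
    txt_feat = modules["text_encoder"](text)
    fusion   = modules["fusion_block"](img_feat, tex_feat, txt_feat, mask)
    recon    = modules["decoder"](fusion)            # reconstructed person image
    loss = F.mse_loss(recon, person_img)             # compare with the real training image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```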
The following are device embodiments of the present disclosure that may be used to perform the method embodiments of the present disclosure. The human body posture image generating device described below and the human body posture image generating method described above correspond to each other and may be referred to jointly. For details not disclosed in the device embodiments of the present disclosure, please refer to the method embodiments of the present disclosure.
Fig. 5 is a schematic structural diagram of a human body posture image generating device provided in an embodiment of the present disclosure.
As shown in fig. 5, the human body posture image generating device 500 includes an acquisition module 510, a first encoding module 520, a second encoding module 530, a segmentation module 540, a determination module 550, a feature extraction module 560, and a human body posture generating module 570.
Specifically, the acquiring module 510 is configured to acquire a target person image and a preset human posture image, and input the target person image to the target person feature extracting module.
The first encoding module 520 is configured to encode the target person image by using a picture encoder to obtain an image feature vector corresponding to the target person image.
The second encoding module 530 is configured to encode the target person image by using a texture encoder to obtain a multi-scale feature vector corresponding to the target person image.
The segmentation module 540 is configured to perform segmentation processing on the target person image through the image segmentation model, so as to obtain a segmented human mask map.
The determining module 550 is configured to determine a first fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, and the human mask map.
The feature extraction module 560 is configured to perform feature extraction on a preset human body posture image through the control network, so as to obtain a feature vector corresponding to the preset human body posture image.
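The disclosure does not detail the internal structure of the control network. The sketch below shows one ControlNet-style reading of it, in which a small convolutional branch turns the preset human body posture image into multi-resolution control features, with zero-initialized output convolutions so that the control signal starts as a no-op. The class name, channel sizes, and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoseControlNet(nn.Module):
    """Hypothetical control network: maps a preset human pose image
    (e.g. a rendered skeleton) to multi-resolution control features."""
    def __init__(self, in_ch=3, dims=(64, 128, 256)):
        super().__init__()
        blocks, zero_convs, prev = [], [], in_ch
        for d in dims:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, d, 3, stride=2, padding=1), nn.SiLU()))
            zc = nn.Conv2d(d, d, 1)
            nn.init.zeros_(zc.weight)   # zero-initialized, ControlNet-style
            nn.init.zeros_(zc.bias)
            zero_convs.append(zc)
            prev = d
        self.blocks = nn.ModuleList(blocks)
        self.zero_convs = nn.ModuleList(zero_convs)

    def forward(self, pose_img):
        feats, h = [], pose_img
        for block, zc in zip(self.blocks, self.zero_convs):
            h = block(h)
            feats.append(zc(h))         # feature maps of the preset pose image
        return feats

pose = torch.randn(1, 3, 256, 256)      # placeholder preset pose image
control_feats = PoseControlNet()(pose)  # multi-resolution control features
```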
The human body posture generation module 570 is configured to input the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into the diffusion model, and process the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain the first human body posture image corresponding to the target person image.
The human body posture image generating device 500 may process the target person image with a picture encoder and a texture encoder respectively, thereby not only extracting key features related to the posture but also retaining texture information of the target person, such as skin texture and clothing details. This multi-scale feature extraction helps maintain realistic texture during pose transformation and improves the quality of the generated image. By combining a control network with an image segmentation model, the invention is able to generate human body images that are more accurately aligned to a given pose: the control network ensures that the features of the preset human body pose effectively influence the generated result, and the image segmentation model provides a human body mask map so that human body boundaries and occluded regions are handled better when synthesizing the human body pose. The introduction of the diffusion model provides a new processing path; by simulating the reverse diffusion process, the model smoothly handles complex human body pose transformations and the generation of details in occluded parts, improving the adaptability and stability of the generated image in complex scenes. By using the fusion features of the target person image together with the preset pose features, the invention ensures that the generated person image not only retains the appearance characteristics of the target person but also accurately matches the preset pose, thereby improving the consistency between person and pose and enhancing the realism and credibility of the generated image.
In some embodiments of the present disclosure, the determination module 550 is configured to: perform self-attention processing on the image feature vector corresponding to the target person image to obtain a first feature vector; perform picture cross-attention processing on the multi-scale feature vector corresponding to the target person image and the first feature vector to obtain a second feature vector; and perform spatial attention processing on the human mask image and the second feature vector to obtain a first fusion feature vector corresponding to the target person image.
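A minimal sketch of this three-stage fusion is given below. It assumes every feature has already been projected to a common dimension, treats the image feature vector as a length-one token sequence, and pools the human mask map into a single token for the spatial attention stage; these are reading assumptions, since the disclosure does not fix a concrete attention layout.

```python
import torch
import torch.nn as nn

class FirstFusion(nn.Module):
    """One plausible reading of the self-attention -> picture cross-attention
    -> spatial attention pipeline; dimensions and mask handling are assumed."""
    def __init__(self, dim=256, heads=4, mask_patches=16):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spat_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool      = nn.AdaptiveAvgPool2d(mask_patches)
        self.mask_proj = nn.Linear(mask_patches * mask_patches, dim)  # hypothetical mask tokenizer

    def forward(self, img_vec, multi_scale, mask):
        # img_vec:     (B, dim)       image feature vector
        # multi_scale: (B, S, dim)    multi-scale texture feature vectors
        # mask:        (B, 1, H, W)   human body mask map
        q = img_vec.unsqueeze(1)
        f1, _ = self.self_attn(q, q, q)                       # self-attention -> first feature vector
        f2, _ = self.img_cross(f1, multi_scale, multi_scale)  # picture cross-attention -> second feature vector
        mask_tok = self.mask_proj(self.pool(mask).flatten(1)).unsqueeze(1)
        fused, _ = self.spat_cross(f2, mask_tok, mask_tok)    # spatial attention with the mask
        return fused.squeeze(1)                               # first fusion feature vector

fusion = FirstFusion()
out = fusion(torch.randn(2, 256), torch.randn(2, 4, 256), torch.rand(2, 1, 128, 128))
```

In practice the image feature would more likely be a sequence of patch tokens rather than a single vector, which would make the self-attention stage non-trivial; the single-token form is kept here only for brevity.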
In some embodiments of the present disclosure, the human body posture image generating device 500 is further configured to: acquire text data, wherein the text data comprises words describing the target person; encode the text data through a text encoder to obtain a text feature vector corresponding to the text data; determine a second fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, the human mask image and the text feature vector corresponding to the text data; and input the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into the diffusion model, and process the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain a second human body posture image corresponding to the target person image.
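The disclosure does not name a particular text encoder. A CLIP text encoder is a common choice for this kind of conditioning and is shown below purely for illustration; the checkpoint name, the example prompt, and the output dimension are assumptions.

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer    = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

text = ["a person wearing a red coat, standing in the snow"]  # words describing the target person
tokens = tokenizer(text, padding=True, return_tensors="pt")
outputs = text_encoder(**tokens)

text_feature_vector = outputs.pooler_output      # pooled text feature vector, shape (1, 512) here
token_features      = outputs.last_hidden_state  # per-token features, usable for cross-attention
```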
In some embodiments of the present disclosure, the determining module 550 is further configured to: perform self-attention processing on the image feature vector corresponding to the target person image to obtain a first feature vector; perform picture cross-attention processing on the multi-scale feature vector corresponding to the target person image and the first feature vector to obtain a second feature vector; perform picture-text cross-attention processing on the second feature vector and the text feature vector corresponding to the text data to obtain a third feature vector; and perform spatial attention processing on the human mask image and the third feature vector to obtain a second fusion feature vector corresponding to the target person image.
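Building on the FirstFusion sketch above, the second fusion feature simply inserts a picture-text cross-attention stage between the picture cross-attention and the spatial attention. The text tokens are assumed to have already been projected to the same dimension as the image features (a CLIP text encoder, for instance, would need a linear projection from 512 to 256 under the dimensions assumed earlier); again this is one plausible reading, not the patent's exact architecture.

```python
import torch.nn as nn

class SecondFusion(FirstFusion):
    """Extends the FirstFusion sketch with a picture-text cross-attention stage."""
    def __init__(self, dim=256, heads=4, mask_patches=16):
        super().__init__(dim, heads, mask_patches)
        self.txt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_vec, multi_scale, mask, txt_tokens):
        # txt_tokens: (B, T, dim) text feature vectors, pre-projected to dim
        q = img_vec.unsqueeze(1)
        f1, _ = self.self_attn(q, q, q)                       # first feature vector
        f2, _ = self.img_cross(f1, multi_scale, multi_scale)  # second feature vector
        f3, _ = self.txt_cross(f2, txt_tokens, txt_tokens)    # picture-text cross-attention -> third feature vector
        mask_tok = self.mask_proj(self.pool(mask).flatten(1)).unsqueeze(1)
        fused, _ = self.spat_cross(f3, mask_tok, mask_tok)    # spatial attention with the mask
        return fused.squeeze(1)                               # second fusion feature vector
```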
In some embodiments of the present disclosure, the human body posture generation module 570 described above is configured to: acquire a preset noise map and input the noise map into the diffusion model; and, according to the preset noise map, denoise the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image for a preset number of iterations to obtain the first human body posture image corresponding to the target person image.
In some embodiments of the present disclosure, the human body posture generation module 570 described above is further configured to: acquire a preset noise map and input the noise map into the diffusion model; and, according to the preset noise map, denoise the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image for a preset number of iterations to obtain the second human body posture image corresponding to the target person image.
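A minimal sampling loop consistent with the two embodiments above might look as follows. The denoiser's signature (noisy image, timestep, fusion feature, control features) and the naive linear update rule are assumptions that stand in for a real diffusion backbone and scheduler such as DDPM or DDIM.

```python
import torch

@torch.no_grad()
def generate_pose_image(denoiser, fusion_feature, control_features,
                        steps=50, shape=(1, 3, 256, 256)):
    """Hypothetical sampling loop: `denoiser` is assumed to predict the noise
    residual given the noisy image, the timestep, the fusion feature of the
    target person, and the control features of the preset pose."""
    x = torch.randn(shape)                          # preset noise map
    for t in reversed(range(steps)):                # preset number of denoising iterations
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        predicted_noise = denoiser(x, t_batch, fusion_feature, control_features)
        x = x - predicted_noise / steps             # placeholder update, not a real scheduler
    return x                                        # generated human body posture image
```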
In some embodiments of the present disclosure, the human body posture image generating device 500 is further configured to, prior to inputting the target person image into the target person feature extraction module: acquire a plurality of training samples, wherein each training sample comprises a character image of a training object and training text data describing the training object; encode the character image of the training object through a picture encoder to obtain an image feature vector corresponding to the character image of the training object; encode the character image of the training object through a texture encoder to obtain a multi-scale feature vector corresponding to the character image of the training object; segment the character image of the training object through the image segmentation model to obtain a segmented human mask image; encode the training text data through a text encoder to obtain a text feature vector corresponding to the training text data; determine a fusion feature corresponding to the character image of the training object according to the image feature vector corresponding to the character image of the training object, the multi-scale feature vector corresponding to the character image of the training object, the segmented human mask image and the text feature vector corresponding to the training text data; and calculate a loss based on the fusion feature corresponding to the character image of the training object, and update the parameters in the target person feature extraction module through the loss.
Fig. 6 is a schematic diagram of an electronic device 6 provided by an embodiment of the present disclosure. As shown in fig. 6, the electronic device 6 of this embodiment includes: a processor 601, a memory 602, and a computer program 603 stored in the memory 602 and executable on the processor 601. The steps of the various method embodiments described above are implemented by the processor 601 when executing the computer program 603. Alternatively, the processor 601, when executing the computer program 603, implements the functions of the modules in the above-described device embodiments.
The electronic device 6 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 6 may include, but is not limited to, the processor 601 and the memory 602. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the electronic device 6 and does not limit the electronic device 6, which may include more or fewer components than shown, or different components.
The processor 601 may be a Central Processing Unit (CPU) or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
The memory 602 may be an internal storage unit of the electronic device 6, for example, a hard disk or memory of the electronic device 6. The memory 602 may also be an external storage device of the electronic device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the electronic device 6. The memory 602 may also include both internal and external storage units of the electronic device 6. The memory 602 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above-described embodiments by instructing related hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.
Claims (10)
1. A human body posture image generation method, characterized by comprising:
acquiring a target person image and a preset human body posture image, and inputting the target person image into a target person feature extraction module;
encoding the target person image through a picture encoder to obtain an image feature vector corresponding to the target person image;
encoding the target person image through a texture encoder to obtain a multi-scale feature vector corresponding to the target person image;
segmenting the target person image through an image segmentation model to obtain a segmented human mask image;
determining a first fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image and the human mask image;
extracting features of the preset human body posture image through a control network to obtain a feature vector corresponding to the preset human body posture image;
and inputting the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into a diffusion model, and processing the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain the first human body posture image corresponding to the target person image.
2. The method of claim 1, wherein the determining the first fusion feature corresponding to the target person image from the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, and the human mask image comprises:
performing self-attention processing on the image feature vector corresponding to the target person image to obtain a first feature vector;
performing picture cross attention processing on the multi-scale feature vector corresponding to the target person image and the first feature vector to obtain a second feature vector;
and performing spatial attention processing on the human mask image and the second feature vector to obtain a first fusion feature vector corresponding to the target person image.
3. The method according to claim 2, wherein the method further comprises:
acquiring text data, wherein the text data comprises words describing the target person;
encoding the text data through a text encoder to obtain a text feature vector corresponding to the text data;
determining a second fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, the human mask image and the text feature vector corresponding to the text data;
and inputting the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into the diffusion model, and processing the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain the second human body posture image corresponding to the target person image.
4. The method of claim 3, wherein the determining the second fusion feature corresponding to the target person image based on the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image, the human mask image, and the text feature vector corresponding to the text data comprises:
performing self-attention processing on the image feature vector corresponding to the target person image to obtain a first feature vector;
performing picture cross attention processing on the multi-scale feature vector corresponding to the target person image and the first feature vector to obtain a second feature vector;
performing picture-text cross attention processing on the second feature vector and the text feature vector corresponding to the text data to obtain a third feature vector;
and performing spatial attention processing on the human mask image and the third feature vector to obtain a second fusion feature vector corresponding to the target person image.
5. The method of claim 1, wherein the processing, by the diffusion model, of the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image to obtain the first human body posture image corresponding to the target person image comprises:
acquiring a preset noise map, and inputting the noise map into the diffusion model;
and according to the preset noise map, denoising the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image for a preset number of times to obtain a first human body posture image corresponding to the target person image.
6. The method of claim 3, wherein the processing, by the diffusion model, of the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image to obtain the second human body posture image corresponding to the target person image comprises:
acquiring a preset noise map, and inputting the noise map into the diffusion model;
and according to the preset noise map, denoising the second fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image for a preset number of times to obtain a second human body posture image corresponding to the target person image.
7. The method of claim 1, wherein prior to inputting the target person image into the target person feature extraction module, the method further comprises:
acquiring a plurality of training samples, wherein each training sample comprises a character image of a training object and training text data describing the training object;
encoding the character image of the training object through the picture encoder to obtain an image feature vector corresponding to the character image of the training object;
encoding the character image of the training object through the texture encoder to obtain a multi-scale feature vector corresponding to the character image of the training object;
segmenting the character image of the training object through the image segmentation model to obtain a segmented human mask image;
encoding the training text data through a text encoder to obtain a text feature vector corresponding to the training text data;
determining a fusion feature corresponding to the character image of the training object according to the image feature vector corresponding to the character image of the training object, the multi-scale feature vector corresponding to the character image of the training object, the segmented human mask image and the text feature vector corresponding to the training text data;
and calculating a loss based on the fusion feature corresponding to the character image of the training object, and updating parameters in the target person feature extraction module through the loss.
8. A human posture image generating apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a target character image and a preset human body posture image and inputting the target character image into the target character feature extraction module;
the first encoding module is used for encoding the target person image through a picture encoder to obtain an image feature vector corresponding to the target person image;
the second encoding module is used for encoding the target person image through a texture encoder to obtain a multi-scale feature vector corresponding to the target person image;
the segmentation module is used for carrying out segmentation processing on the target person image through an image segmentation model to obtain a segmented human mask image;
the determining module is used for determining a first fusion feature corresponding to the target person image according to the image feature vector corresponding to the target person image, the multi-scale feature vector corresponding to the target person image and the human mask image;
the feature extraction module is used for carrying out feature extraction on the preset human body posture image through a control network to obtain a feature vector corresponding to the preset human body posture image;
the human body posture generation module is used for inputting the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image into a diffusion model, and processing the first fusion feature corresponding to the target person image and the feature vector corresponding to the preset human body posture image through the diffusion model to obtain the first human body posture image corresponding to the target person image.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410266800.1A CN118212687A (en) | 2024-03-07 | 2024-03-07 | Human body posture image generation method, device, equipment and medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410266800.1A CN118212687A (en) | 2024-03-07 | 2024-03-07 | Human body posture image generation method, device, equipment and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118212687A true CN118212687A (en) | 2024-06-18 |
Family
ID=91448255
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410266800.1A Pending CN118212687A (en) | 2024-03-07 | 2024-03-07 | Human body posture image generation method, device, equipment and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118212687A (en) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118365510A (en) * | 2024-06-19 | 2024-07-19 | 阿里巴巴达摩院(杭州)科技有限公司 | Image processing method, training method of image processing model and image generating method |
| CN118379219A (en) * | 2024-06-19 | 2024-07-23 | 阿里巴巴达摩院(杭州)科技有限公司 | Model generation method and image generation method |
| CN118365510B (en) * | 2024-06-19 | 2024-09-13 | 阿里巴巴达摩院(杭州)科技有限公司 | Image processing method, training method of image processing model and image generating method |
| CN118379219B (en) * | 2024-06-19 | 2024-09-13 | 阿里巴巴达摩院(杭州)科技有限公司 | Model generation method and image generation method |
| CN119131174A (en) * | 2024-09-10 | 2024-12-13 | 杭州电子科技大学 | A pose-driven character image synthesis method based on attention mechanism |
| CN119131174B (en) * | 2024-09-10 | 2025-10-14 | 杭州电子科技大学 | A pose-driven character image synthesis method based on attention mechanism |
| CN119379564A (en) * | 2024-12-29 | 2025-01-28 | 深圳须弥云图空间科技有限公司 | Image restoration method, device, electronic device and computer-readable storage medium |
| CN119379564B (en) * | 2024-12-29 | 2025-04-22 | 深圳须弥云图空间科技有限公司 | Image restoration method, device, electronic equipment and computer readable storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113287118B (en) | System and method for facial reproduction | |
| CN118212687A (en) | Human body posture image generation method, device, equipment and medium | |
| US20240169701A1 (en) | Affordance-based reposing of an object in a scene | |
| CN114937115B (en) | Image processing method, face replacement model processing method, device and electronic equipment | |
| CN115187706B (en) | Lightweight method and system for face style migration, storage medium and electronic equipment | |
| CN113362263A (en) | Method, apparatus, medium, and program product for changing the image of a virtual idol | |
| CN112819933A (en) | Data processing method and device, electronic equipment and storage medium | |
| CN114266621B (en) | Image processing methods, image processing systems and electronic devices | |
| Zhang et al. | MMGInpainting: Multi-modality guided image inpainting based on diffusion models | |
| CN117351115A (en) | Training method of image generation model, image generation method, device and equipment | |
| CN117808854B (en) | Image generation method, model training method, device and electronic equipment | |
| CN118674839B (en) | Animation generation method, device, electronic equipment, storage medium and program product | |
| CN118172134A (en) | Virtual fitting method, virtual fitting device, electronic equipment and readable storage medium | |
| CN120051803A (en) | Text-driven image editing via image-specific fine tuning of diffusion models | |
| CN119169129B (en) | Posture-guided image synthesis method, device, electronic device and storage medium | |
| WO2024066549A1 (en) | Data processing method and related device | |
| CN117078816A (en) | A method, device, terminal device and storage medium for generating a virtual image | |
| CN118470270A (en) | Image processing method, device, equipment, medium and program product | |
| CN118097082B (en) | Virtual object image generation method, device, computer equipment and storage medium | |
| CN119540389B (en) | Image generation method, device, equipment and storage medium | |
| CN118864653B (en) | Image generation method and program product | |
| CN119399580B (en) | Training method for decorative element replacement model and method for replacing decorative elements of characters | |
| CN112764649B (en) | Virtual image generation method, device, equipment and storage medium | |
| Zheng et al. | PG‐VTON: Front‐And‐Back Garment Guided Panoramic Gaussian Virtual Try‐On With Diffusion Modeling | |
| CN114926568A (en) | Model training method, image generation method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |