WO2025112948A1 - Image generation method, automatic question answering method, and parameter generation model training method - Google Patents
- Publication number
- WO2025112948A1 (PCT/CN2024/125023)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- sample
- parameter
- generation
- parameters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- the embodiments of the present disclosure relate to the field of computer technology, and in particular to image generation, automatic question answering, and parameter generation model training methods.
- Text-to-image technology has gradually become a core technology in the field of AI-generated content (AIGC).
- Text-to-image technology can generate images from text descriptions and can transform and adjust them according to user requirements and input content, making it easier for users to create artworks with unique styles; it has been widely used in the field of digital art.
- an embodiment of the present disclosure provides an image generation method.
- One or more embodiments of the present disclosure also relate to an automatic question-answering method, a parameter generation model training method, an image generation device, an automatic question-answering device, a parameter generation model training device, a computing device, a computer-readable storage medium, and a computer program to solve the technical defects existing in the prior art.
- an image generating method comprising:
- acquiring an image description text; inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text; and inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text.
- a parameter generation model training method is provided, which is applied to a cloud-side device, including:
- sample set includes a plurality of sample image-text pairs
- the sample image-text pairs include a sample image and a sample description text
- the sample image-text pairs carry sample parameter information
- the plurality of sample image-text pairs and prediction prompt information are input into a pre-trained language model to obtain image prediction parameters corresponding to the sample image-text pairs; the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a parameter generation model that has completed training.
- an automatic question-answering method comprising:
- the image generation parameters and image description text are input into the image generation model to obtain the reply image corresponding to the image question answering request.
- a first acquisition module is configured to acquire image description text
- a first input module is configured to input the image description text and the generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on a plurality of sample image-text pairs and sample parameter information carried by the plurality of sample image-text pairs;
- the second input module is configured to input the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text.
- a parameter generation model training device is provided, which is applied to a cloud-side device, including:
- a third input module is configured to input a plurality of sample image-text pairs and prediction prompt information into the pre-trained language model to obtain image prediction parameters corresponding to the plurality of sample image-text pairs respectively;
- the adjustment module is configured to adjust the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model that has completed the training.
- an automatic question-answering device comprising:
- a fourth input module configured to input the image description text and the generation prompt information into a parameter generation model, and obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on a plurality of sample image-text pairs and sample parameter information carried by the plurality of sample image-text pairs;
- the fifth input module is configured to input the image generation parameters and the image description text into the image generation model to obtain a reply image corresponding to the image question and answer request.
- a computing device including:
- the memory is used to store computer-executable instructions
- the processor is used to execute the computer-executable instructions.
- a computer-readable storage medium which stores computer-executable instructions.
- when the instructions are executed by a processor, the steps of the method provided in the first aspect, the second aspect, or the third aspect are implemented.
- a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the method provided in the first aspect, the second aspect, or the third aspect above.
- An image generation method provided by an embodiment of the present disclosure obtains an image description text; inputs the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs; and inputs the image generation parameters and the image description text into the image generation model to obtain a target image corresponding to the image description text.
- the image generation parameters are obtained by semantically decomposing the visual elements of the image using the parameter generation model, and the precise image generation is further completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, thereby improving the interpretability and controllability of the image generation.
- FIG1 is an architecture diagram of an image generation system provided by an embodiment of the present disclosure.
- FIG2 is an architecture diagram of another image generation system provided by an embodiment of the present disclosure.
- FIG3 is a flow chart of an image generating method provided by an embodiment of the present disclosure.
- FIG4 is a process flow chart of an image generation method provided by an embodiment of the present disclosure.
- FIG5 is a flow chart of a parameter generation model training method provided by an embodiment of the present disclosure.
- FIG6 is a flow chart of another parameter generation model training method provided by an embodiment of the present disclosure.
- FIG7 is a flow chart of an automatic question-answering method provided by an embodiment of the present disclosure.
- FIG8 is a process flow chart of another image generation method provided by an embodiment of the present disclosure.
- FIG9 is a schematic diagram of a processing process of a parameter generation model provided by an embodiment of the present disclosure.
- FIG10 is a schematic diagram of an image generation interface provided by an embodiment of the present disclosure.
- FIG11 is a schematic diagram of the structure of an image generating device provided by an embodiment of the present disclosure.
- FIG12 is a schematic diagram of the structure of a parameter generation model training device provided by an embodiment of the present disclosure.
- FIG13 is a schematic diagram of the structure of an automatic question-answering device provided by an embodiment of the present disclosure.
- FIG. 14 is a structural block diagram of a computing device provided by an embodiment of the present disclosure.
- Although the terms first, second, etc. may be used to describe various information in one or more embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another.
- the first may also be referred to as the second, and similarly, the second may also be referred to as the first.
- word "if” as used herein may be interpreted as "at the time of” or "when” or "in response to determining”.
- The user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in the embodiments of the present disclosure are collected, used, and processed in compliance with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entrances are provided for users to choose to authorize or refuse.
- a large model refers to a deep learning model with large-scale model parameters, which usually contains hundreds of millions, tens of billions, hundreds of billions, trillions, or even more than ten trillion model parameters.
- A large model can also be called a foundation model.
- A large model is pre-trained on a large-scale unlabeled corpus to produce a pre-trained model with more than 100 million parameters.
- This model can adapt to a wide range of downstream tasks, and the model has good generalization ability, such as a large-scale language model (LLM, Large Language Model), a multi-modal pre-training model, etc.
- Large models can be widely used in natural language processing (NLP), computer vision, and other fields. Specifically, they can be applied to computer vision tasks such as visual question answering (VQA), image captioning (IC, Image Caption), and image generation, as well as natural language processing tasks such as text-based sentiment classification, text summary generation, and machine translation.
- The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, and intelligent design.
- A latent space diffusion model is an artificial intelligence model widely used in image editing and image generation; it encodes images into a latent space, where noise is gradually added and then removed to generate an image.
- CLIP (Contrastive Language-Image Pre-training) is a model for measuring the correlation between text and images, and is widely used for extracting text and image representations.
- Large-scale language models are trained on massive amounts of text data to predict and generate various language expressions, such as text, sentences, paragraphs, etc.
- LoRA (Low-Rank Adaptation) is a language model fine-tuning method that can achieve good results by training only a small number of parameters when adapting a large model to downstream tasks.
- Text-to-image technology outputs images based on the image description text input by the user, which has great value in many scenarios such as advertising recommendation, interactive entertainment, and art design.
- The common text-to-image architecture is the latent space diffusion model. Since the latent space diffusion model operates in a latent space, its internal working mechanism and decision-making process are difficult to understand and explain, which makes the generation process opaque. In addition, this type of method relies on a feature extraction module (a CLIP encoder) to understand the text and cannot control the semantics of the visual elements during the generation stage, which leads to a large gap between the final generated result and the image description text originally input by the user.
- The disclosed embodiments propose an image generation method combined with a large-scale language model, which uses the model's powerful semantic understanding and association abilities to complete the semantic decomposition and reorganization of the visual elements of an image, and inputs the parameterized visual elements into the image generation model to complete accurate image generation, providing stronger interpretability and controllability for the image generation process.
- In addition, the large-scale language model can understand and align complex and long texts, laying the technical foundation for richer user interaction methods.
- Extended applications such as convenient, fine-grained image editing and similar-structure image generation can also be realized.
- an embodiment of the present disclosure proposes an image generation scheme to obtain image description text; input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs; input the image generation parameters and the image description text into the image generation model to obtain a target image corresponding to the image description text.
- an image generation method is provided.
- the present disclosure also relates to an automatic question-answering method, a parameter generation model training method, an image generation device, an automatic question-answering device, a parameter generation model training device, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
- FIG. 1 shows an architecture diagram of an image generation system provided by an embodiment of the present disclosure.
- the image generation system may include a client 100 and a server 200;
- the client 100 is used to send image description text to the server 200;
- the server 200 is used to input the image description text and the generation prompt information into the parameter generation model to obtain the image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on a plurality of sample image-text pairs and the sample parameter information they carry; the image generation parameters and the image description text are input into the image generation model to obtain a target image corresponding to the image description text; and the target image is sent to the client 100;
- the client 100 is also used to receive the target image sent by the server 200.
- the visual elements of the image are semantically decomposed by utilizing a parameter generation model to obtain image generation parameters, and accurate image generation is further completed based on the image generation parameters, so that the target image can clearly express the image description text and image generation parameters, thereby improving the interpretability and controllability of image generation.
- FIG. 2 shows an architecture diagram of another image generation system provided by an embodiment of the present disclosure.
- the image generation system may include multiple clients 100 and a server 200, wherein the client 100 may include a terminal-side device, and the server 200 may include a cloud-side device.
- a communication connection may be established between multiple clients 100 through the server 200.
- the server 200 is used to provide image generation services between multiple clients 100. Multiple clients 100 may serve as a sender or a receiver, respectively, and realize communication through the server 200.
- the user can interact with the server 200 through the client 100 to receive data sent by other clients 100, or send data to other clients 100, etc.
- the user can publish a data stream to the server 200 through the client 100, and the server 200 generates a target image according to the data stream and pushes the target image to other clients that have established communication.
- the client 100 and the server 200 are connected via a network.
- the network provides a medium for a communication link between the client 100 and the server 200.
- the network may include various connection types, such as wired or wireless communication links or optical fiber cables, etc.
- the data transmitted by the client 100 may need to be encoded, transcoded, compressed, etc. before being released to the server 200.
- the client 100 may be a browser, an APP (Application), a web application such as an H5 (HTML5, the fifth version of HyperText Markup Language) application, a light application (also known as a mini-program or lightweight application), or a cloud application.
- the client 100 may be developed or obtained based on the software development kit (SDK) of the corresponding service provided by the server 200, such as a real-time communication (RTC) SDK.
- the electronic device may have a display screen and support information browsing, etc., such as a personal mobile terminal such as a mobile phone, a tablet computer, a personal computer, etc.
- applications may also be configured in the electronic device, such as human-computer dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc.
- the server 200 may include servers that provide various services, such as servers that provide communication services to multiple clients, servers that provide background training to support models used on clients, and servers that process data sent by clients. It should be noted that the server 200 can be implemented as a distributed server cluster consisting of multiple servers, or as a single server.
- the server can also be a server of a distributed system, or a server combined with a blockchain.
- the server can also be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
- FIG. 3 shows a flow chart of an image generation method provided by an embodiment of the present disclosure, which specifically includes the following steps:
- Step 302: Obtain image description text.
- image description text may be acquired, and a target image that meets actual needs of the user may be generated based on the image description text.
- the image description text represents the user's image generation needs.
- the image description text can be description text in different languages, such as English description text, Chinese description text, etc. Since the image generation process introduces a large-scale language model, the image description text can be a short text or a long text. For example, the image description text is "two black cats and a white dog on an orange sofa".
- The specific image description text can be selected according to actual conditions, and the embodiments of the present disclosure do not limit this.
- the image description text sent by the user through the client can be received.
- the image description text can be read from other data acquisition devices or databases.
- Step 304: Input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs.
- the image description text and generation prompt information may be further input into a parameter generation model to obtain image generation parameters corresponding to the image description text.
- the parameter generation model is obtained by training a pre-trained language model based on a sample set.
- the pre-trained language model can be a large-scale language model, such as a multimodal large model, or a language processing model trained using the first training set.
- Image generation parameters can be understood as parameterized image generation conditions. Image generation parameters include but are not limited to size parameters, position parameters, shape parameters, and the like.
- Generation prompt information can be understood as a parameter generation paradigm, also known as a generation condition paradigm. Generation prompt information is used to guide the parameter generation model to generate image generation parameters.
- Generation prompt information includes but is not limited to image generation condition elements such as shape, color, size, position, blur, and key points. For example, the generation prompt information can be "what are the elements in the image".
- the generation conditions are unified into parameters that can be described in natural language rather than image conditions, so as to facilitate semantic understanding of large-scale language models.
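The parameterized-condition paradigm described above can be illustrated with a minimal sketch. All names and the serialization format here are hypothetical, not the disclosed implementation: visual elements and their attributes are held as structured values that can be rendered as natural-language text a language model can consume directly.

```python
# Hypothetical sketch: image generation parameters as natural-language-
# describable values rather than image-space conditions.

def params_to_text(params):
    """Serialize parameterized visual elements into a text description."""
    lines = []
    for element, attrs in params.items():
        attr_text = ", ".join(f"{k}: {v}" for k, v in attrs.items())
        lines.append(f"{element} ({attr_text})")
    return "; ".join(lines)

# Example parameters for "two black cats and a white dog on an orange sofa".
params = {
    "black cat": {"count": 2, "position": "on the sofa", "size": "medium"},
    "white dog": {"count": 1, "position": "on the sofa", "size": "large"},
    "sofa": {"color": "orange", "position": "center"},
}
print(params_to_text(params))
```

Because every condition is plain text, the same representation works both as model output (the parameter generation model emits it) and as model input (the image generation model is conditioned on it).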
- Step 306: Input the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text.
- an image description text is obtained, the image description text and generation prompt information are input into a parameter generation model, and after the image generation parameters corresponding to the image description text are obtained, the image generation parameters and the image description text can be further input into the image generation model to obtain a target image corresponding to the image description text.
- the image generation model can be a latent space diffusion model, or an image generation model trained using the second training set.
- the image generation model is used to generate a final target image based on image generation parameters and image description text.
- the target image can be a black and white image or a color image (RGB Image).
- the visual elements of the image are semantically decomposed by utilizing a parameter generation model to obtain image generation parameters, and accurate image generation is further completed based on the image generation parameters, so that the target image can clearly express the image description text and image generation parameters, thereby improving the interpretability and controllability of image generation.
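The three steps above can be sketched as a two-stage pipeline. This is a hypothetical illustration with stubbed model calls, not the disclosed implementation; the function and model names are assumptions.

```python
# Sketch of the two-stage pipeline: a parameter generation model first
# decomposes the description into visual-element parameters, then the
# image generation model produces the target image conditioned on both.

def generate_image(description, param_model, image_model,
                   prompt="What are the elements in the image?"):
    # Stage 1: semantic decomposition into image generation parameters.
    image_params = param_model(description, prompt)
    # Stage 2: parameter-controlled image generation.
    return image_model(image_params, description)

# Stand-in models for illustration only.
fake_param_model = lambda text, prompt: {"cat": {"count": 2}}
fake_image_model = lambda params, text: f"image<{text}|{params}>"

print(generate_image("two black cats on an orange sofa",
                     fake_param_model, fake_image_model))
```

The explicit intermediate `image_params` is what makes the process inspectable: a user (or a downstream editing step) can read and modify the parameters before the second stage runs.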
- FIG 4 shows a processing flow chart of an image generation method provided by an embodiment of the present disclosure.
- the image description text and the image generation parameters can be input into the image generation model together.
- the image generation parameters are first encoded by the parameter encoding unit to obtain parameter encoding features.
- the encoding and decoding unit generates a target image corresponding to the image description text according to the parameter encoding features and the image description text.
- the encoding and decoding unit includes an encoding unit and a decoding unit. After obtaining the parameter encoding features, the parameter encoding features can be input into the encoding unit in the latent space for parameterized conditional control, and the residual is sent to the decoding unit to realize the final parameter fusion generation.
- the parameter encoding unit can directly encode the image generation parameters to obtain parameter encoding features. Furthermore, since the image generation parameters include parameters of different dimensions, the parameter encoding unit can also encode them into parameter encoding features of different dimensionalities in view of the diversity of the image generation parameter dimensions.
- the image generation parameters and the image description text are input into the image generation model; the parameter encoding unit encodes the image generation parameters to obtain parameter encoding features; and the encoding and decoding unit generates a target image corresponding to the image description text according to the parameter encoding features and the image description text, thereby applying parameterized control to ensure the accuracy of the target image.
- the parameter encoding unit includes a one-dimensional parameter encoding unit, a two-dimensional parameter encoding unit, and a feature aggregation unit; encoding the image generation parameters by the parameter encoding unit to obtain the parameter encoding features may include the following steps:
- a one-dimensional parameter encoding unit encoding the one-dimensional parameter in the image generation parameter to obtain a one-dimensional encoding feature
- the two-dimensional parameter encoding unit encodes the two-dimensional parameters in the image generation parameters to obtain a two-dimensional encoding feature
- the one-dimensional coding features and the two-dimensional coding features are aggregated by the feature aggregation unit to obtain parameter coding features.
- the image generation parameters may include parameters of different dimensions, such as one-dimensional parameters and two-dimensional parameters.
- the one-dimensional parameters include but are not limited to color parameters and blur parameters.
- the two-dimensional parameters include but are not limited to position parameters and shape parameters. Therefore, in order to better encode the image generation parameters, the embodiment of the present disclosure divides the parameter encoding unit to obtain a one-dimensional parameter encoding unit for encoding the one-dimensional parameters and a two-dimensional parameter encoding unit for encoding the two-dimensional parameters.
- a feature aggregation unit may be additionally added to the parameter encoding unit, so that the one-dimensional encoding features and the two-dimensional encoding features are aggregated through the feature aggregation unit to obtain parameter encoding features.
- the one-dimensional parameters in the image generation parameters are encoded by a one-dimensional parameter encoding unit to obtain a one-dimensional encoding feature;
- the two-dimensional parameters in the image generation parameters are encoded by a two-dimensional parameter encoding unit to obtain a two-dimensional encoding feature;
- the one-dimensional encoding feature and the two-dimensional encoding feature are aggregated by a feature aggregation unit to obtain a parameter encoding feature. Parameters of different dimensions are thus encoded separately, ensuring the accuracy of the parameter encoding features.
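The split encoder can be sketched as follows. This is a simplified, stdlib-only illustration; the real units are learned modules operating on latent tensors, and all shapes and encodings here are assumptions.

```python
# Hypothetical sketch of the split parameter encoder: one branch for
# one-dimensional parameters (e.g. color, blur), one for two-dimensional
# parameters (e.g. position, shape), with aggregation by concatenation.

def encode_1d(params):
    """Encode scalar parameters, e.g. {"blur": 0.1, "color_r": 0.8}."""
    return [float(v) for v in params.values()]

def encode_2d(params):
    """Encode grid-shaped parameters, e.g. {"position": [[x0, y0], [x1, y1]]}."""
    flat = []
    for grid in params.values():
        for row in grid:
            flat.extend(float(v) for v in row)
    return flat

def aggregate(feat_1d, feat_2d):
    # A real system would use a learned fusion; concatenation suffices here.
    return feat_1d + feat_2d

one_d = {"blur": 0.1, "color_r": 0.8}
two_d = {"position": [[0.2, 0.3], [0.6, 0.7]]}
feature = aggregate(encode_1d(one_d), encode_2d(two_d))
print(feature)  # [0.1, 0.8, 0.2, 0.3, 0.6, 0.7]
```

Keeping the branches separate lets each encoder match the structure of its inputs; only the aggregated feature is passed on to the encoding and decoding unit.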
- an image update parameter for the target image may be received, and the target image may be edited based on the image update parameter, or an image with a structure similar to the target image may be generated. That is, after inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text, the following steps may also be included:
- obtaining an image update parameter for the target image; determining an image editing region in the target image according to the image update parameter; masking the image editing region to obtain a mask generation sequence; and inputting the mask generation sequence and the target image into the image generation model to obtain the updated target image.
- the image update parameter is used to modify or adjust the target image to obtain an updated target image.
- The image update parameter may be an image generation condition independent of the image generation parameters. For example, if the image generation parameter is "the number of black cats is 2", the image update parameter may be "the number of black cats is 3"; alternatively, it may be relative, such as "the number of black cats plus one".
- The specific image update parameters for a target image can be selected based on actual conditions, and the embodiments of the present disclosure do not impose any limitations on this.
- an image update parameter sent by a user through a client may be received.
- Alternatively, the image update parameters for the target image may be obtained from another data acquisition device or database.
- the image editing region in the target image can be determined.
- the image editing region refers to the region in the target image that needs to be parameter updated.
- the image editing region in the attention map output by the attention mechanism can be determined, and the image editing region in the attention map can be masked to generate a mask generation sequence.
- image update parameters for the target image are obtained; according to the image update parameters, the image editing area in the target image is determined; the image editing area is masked to obtain a mask generation sequence; the mask generation sequence and the target image are input into the image generation model to obtain an updated target image.
- By masking the image editing area in the attention map based on the mask generation sequence, other areas whose parameters are not updated are unaffected, and identity-preserving image editing and similar-structure image generation are achieved.
- taking the image generation model as a diffusion model as an example, the above-mentioned inputting the mask generation sequence and the target image into the image generation model to obtain the updated target image may include the following steps:
- the diffusion result of the current diffusion step is generated according to the diffusion result of the completed diffusion step and the mask generation sequence, until the iteration stop condition is reached, and the updated target image is obtained.
- the diffusion generation model includes multiple diffusion steps in its processing.
- Each diffusion step can be regarded as a transformation of the image. These transformations can help the diffusion model better simulate the image degradation process and restore the noise-free image from the noisy image.
- the multiple diffusion steps of the diffusion model can be divided into two stages, namely the diffusion stage and the denoising stage.
- the diffusion stage the diffusion model can simulate the image degradation process by gradually adding noise to the input image; in the denoising stage, the diffusion model can find and eliminate noise from the noisy image to restore the original noise-free image.
- the next editable image generation can be performed based on the diffusion results of the previous diffusion steps.
- the diffusion result and the mask generation sequence can be multiplied to obtain the diffusion result of the current diffusion step, until the iteration stop condition is reached, thereby obtaining the updated target image.
- the diffusion result of the current diffusion step is generated according to the diffusion result of the completed diffusion step and the mask generation sequence, until the iteration stop condition is reached, and the updated target image is obtained, thereby realizing identity-preserving image editing and similar structure image generation.
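The masked update described above can be sketched as follows; this is an illustrative assumption, not the disclosed implementation. `denoise_step` is a stand-in for a real diffusion model's per-step denoiser, and the latent is simplified to a flat list of values.

```python
def denoise_step(latent):
    # Placeholder denoiser: nudges every value toward zero each step.
    return [v * 0.5 for v in latent]

def masked_diffusion_edit(target_latent, mask, num_steps=10, tol=1e-3):
    """Iteratively update only the masked (editable) positions.

    mask[i] == 1 marks the image editing area; positions outside the
    mask keep their original values, preserving identity there.
    """
    current = list(target_latent)
    for _ in range(num_steps):
        proposed = denoise_step(current)
        # Multiply by the mask generation sequence: edits apply only
        # inside the editing area; elsewhere the original is kept.
        updated = [m * p + (1 - m) * t
                   for m, p, t in zip(mask, proposed, target_latent)]
        # Iteration stop condition: change below tolerance.
        delta = max(abs(u - c) for u, c in zip(updated, current))
        current = updated
        if delta < tol:
            break
    return current

edited = masked_diffusion_edit([4.0, -2.0, 8.0], mask=[1, 0, 1], num_steps=20)
```

Here the unmasked middle position stays at its original value while the masked positions are driven by the denoiser, mirroring the identity-preserving behavior described above.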
- the following steps may also be included:
- the target image and the updated target image are sent to the client, so that the client displays the target image and the updated target image to the user.
- the target image and the updated target image can be sent to the client at the same time, and the user can select, from the target image and the updated target image displayed on the client, the image that meets the actual requirements.
- the target image and the updated target image can be directly sent to the client so that the client displays the target image and the updated target image to the user.
- the target image and the updated target image can be sent to the client according to the user's display demand information.
- the display demand information represents the user's demand for viewing the target image.
- the display demand information includes but is not limited to displaying only the target image and the updated target image, or displaying the image description information together with the target image and the updated target image.
- the display demand information is specifically set according to the actual needs of the user, and the embodiments of the present disclosure do not impose any restrictions on this.
- the target image and the updated target image are sent to the client so that the client displays them to the user, providing the user with a variety of choices; the user can intuitively compare the target image and the updated target image to determine whether the updated target image meets his or her needs.
- the following steps may also be included:
- the image selection information is used to identify the image selected by the user.
- the image selection information includes but is not limited to the image number and the image position, and is selected according to actual conditions, and the embodiment of the present disclosure does not impose any limitation on this.
- the user may send image selection information in a manner including but not limited to voice input, touch, and text input. Based on the image selection information, an image that meets the user's needs may be determined, and the image corresponding to the image selection information may be used as a real image to adjust parameters of the image generation model.
- image selection information sent by the user through the client is received, and the model parameters of the image generation model are adjusted based on the image selection information, so that the image generation model has the ability to interact with the user, so that the user can adjust the image generation result in an interactive manner, thereby improving user satisfaction.
- the following steps may also be included:
- the image generation parameters and the image description text are input into the image generation model.
- the target image can be sent to the client.
- the user can process the target image independently; if the user is not satisfied with the target image, the user can send adjustment sample data so that the model is trained again.
- in a first possible implementation, the model parameters of the image generation model can be adjusted according to the adjustment sample data; in a second possible implementation, the model parameters of the parameter generation model can be adjusted according to the adjustment sample data; in a third possible implementation, the model parameters of both the image generation model and the parameter generation model can be adjusted according to the adjustment sample data.
- the adjusted sample data can be obtained by the user adding labels to the target image (such as positive and negative sample labels), or by adjusting parameters of the target image (such as adjusting size parameters).
- a target image is sent to a client so that the client displays the target image to a user; adjustment sample data sent by the user through the client is received, and model parameters of the image generation model are adjusted according to the adjustment sample data, so that the image generation model has the ability to interact with the user, so that the user can adjust the image generation result in an interactive manner, thereby improving user satisfaction.
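The adjustment sample data described above (user-added labels or adjusted parameters of the target image) could be packaged on the client side roughly as follows; the field names and value conventions are hypothetical, chosen only for illustration.

```python
def build_adjustment_sample(target_image_id, label=None, size=None):
    """Package user feedback on a target image as adjustment sample data.

    `label` is an optional positive/negative sample label; `size` is an
    optional (width, height) adjustment. Field names are illustrative.
    """
    if label is None and size is None:
        raise ValueError("adjustment sample data needs a label or a parameter change")
    sample = {"image_id": target_image_id}
    if label is not None:
        if label not in ("positive", "negative"):
            raise ValueError("label must be 'positive' or 'negative'")
        sample["label"] = label
    if size is not None:
        sample["size"] = {"width": size[0], "height": size[1]}
    return sample

sample = build_adjustment_sample("img-001", label="negative", size=(512, 512))
```

A record like this, tied to the target image it was constructed from, is the kind of data the server could then use to adjust model parameters.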
- the following steps may also be included:
- sample set includes a plurality of sample image-text pairs
- the sample image-text pairs include a sample image and a sample description text
- the sample image-text pairs carry sample parameter information
- the model parameters of the pre-trained language model are adjusted to obtain a parameter generation model that has completed training.
- the training method of the parameter generation model is supervised training based on prompt learning, that is, each sample image-text pair in the sample set carries real sample parameter information, and the sample parameter information is the generation target of the parameter generation model, which is used to guide the training process of the parameter generation model.
- the prediction prompt information is used to prompt the prediction process of the pre-trained language model.
- the prediction prompt information is set according to the actual situation, and the present disclosure embodiment does not impose any restrictions on this.
- the prediction prompt information can be "You are an intelligent bounding box generation model. I will provide you with images and image description texts. Please generate the bounding box of each element in the image; the format of each bounding box should be (object name, center point horizontal coordinate, center point vertical coordinate, bounding box width, bounding box height)".
- the prediction prompt information can enable the pre-trained language model to output prediction results in a fixed data format.
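Because the prompt pins the model to a fixed output format, the predicted bounding boxes can be recovered with straightforward parsing. A minimal sketch, assuming the `(object name, cx, cy, w, h)` format stated above:

```python
import re

# Matches one fixed-format tuple: (object name, cx, cy, w, h).
BOX_PATTERN = re.compile(
    r"\(\s*([^,()]+?)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)"
)

def parse_bounding_boxes(model_output):
    """Parse every fixed-format bounding box tuple from the model's output text."""
    boxes = []
    for name, cx, cy, w, h in BOX_PATTERN.findall(model_output):
        boxes.append({"name": name, "cx": int(cx), "cy": int(cy),
                      "w": int(w), "h": int(h)})
    return boxes

output = "(black cat, 30, 171, 212, 286) (white dog, 24, 543, 231, 332)"
boxes = parse_bounding_boxes(output)
```

This illustrates why a fixed data format matters: downstream stages can consume the prediction without ambiguity.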
- the implementation method of "inputting multiple sample image-text pairs and prediction prompt information into the pre-trained language model to obtain image prediction parameters corresponding to multiple sample image-text pairs” can refer to the implementation method of "inputting image description text and generation prompt information into the parameter generation model to obtain image generation parameters corresponding to the image description text", and this disclosure will not repeat it.
- there are many ways to obtain the sample set, which can be selected according to actual conditions.
- the embodiments of the present disclosure do not limit this.
- a large number of sample image text pairs carrying sample parameter information input by the user can be received to form a sample set.
- the sample set can also be obtained from another data acquisition device or database, that is, a large number of sample image-text pairs carrying sample parameter information are read from the device or database to form a sample set.
- the sample set may include at least one of an original sample subset, a generative sample subset, and a constructed sample subset.
- a plurality of sample images carrying sample description texts may be obtained to form an original sample subset.
- the image understanding capability of a large model may be used to generate sample description texts for sample images to form a generative sample subset, and when forming the generative sample subset, a detection model and a segmentation model may be used collaboratively to extract a variety of sample parameter information.
- Parameter configuration conditions may be set based on prior knowledge, such as output rules, object rules, element number rules, etc., to construct pseudo data to form a constructed sample subset, thereby using the constructed sample subset to enable the parameter generation model to have a purposeful learning capability.
- FIG5 shows a flow chart of a parameter generation model training method provided by an embodiment of the present disclosure, wherein the original sample subset, the generative sample subset and the constructed sample subset are mixed to obtain a sample set, multiple sample image-text pairs and prediction prompt information in the sample set are input into a pre-trained language model, image prediction parameters corresponding to the multiple sample image-text pairs are obtained, model parameters of the pre-trained language model are adjusted according to the image prediction parameters and sample parameter information, and a parameter generation model that has been trained is obtained.
- the image generation parameters output by the parameter generation model can be made more reasonable.
- the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and sample parameter information to obtain a parameter generation model that has completed training.
- the parameter generation model finally obtained can be made more accurate.
- for the first sample image, the first sample image and the construction prompt information are input into the pre-trained language model to obtain a first sample description text of the first sample image;
- a sample set is constructed according to a plurality of sample images, sample description texts of the plurality of sample images, and sample parameter information.
- the construction prompt information is used to prompt the pre-trained language model to generate a sample description text of the sample image.
- the construction prompt information is set according to the actual situation, and the embodiment of the present disclosure does not make any limitation on this.
- the construction prompt information may be "Please describe the image".
- there are many ways to obtain the multiple sample images, which can be selected according to actual conditions, and the embodiments of the present disclosure do not limit this.
- multiple sample images input by a user can be received.
- multiple sample images can be read from other data acquisition devices or databases.
- the sample description text and sample parameter information of multiple sample images are used to build a sample set.
- the sample description text is generated through the image understanding ability of the pre-trained language model, ensuring the reasonableness of the sample description text.
- relevant keywords, sentence structure and other information can be obtained from the first sample description text through methods such as text mining and image analysis, and this information can be used as the first sample parameter information.
- a detection model and a segmentation model may be used collaboratively to extract multiple types of sample parameter information. That is, the above-mentioned generation of the first sample parameter information of the first sample image according to the first sample image and the first sample description text may include the following steps:
- the key area position information refers to the coordinates of the center point of the key area in the first sample image, and the key area can be understood as a bounding box.
- for example, if the first sample image shows a little girl and a dog running, the key areas in the first sample image are the area where the little girl is and the area where the dog is.
- the first sample parameter information includes but is not limited to the height and width of the key area.
- the first sample description text and the first sample image can be input into the detection segmentation model, and the key region position information of the first sample image can be obtained through region detection by the detection segmentation model.
- the key region position information and the first sample image can be input into the visual segmentation model (SAM, Segment Anything Model) to obtain the first sample parameter information of the first sample image.
- the first sample image is subjected to region detection according to the first sample description text to determine the key region position information of the first sample image; the first sample image is subjected to visual segmentation according to the key region position information to obtain the first sample parameter information of the first sample image.
- the accuracy of the first sample parameter information is improved through region detection and visual segmentation.
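The two-stage extraction above (region detection from the description text, then visual segmentation of each key region) can be sketched with simple stubs; both stub functions below are hypothetical stand-ins for a real detection model and a real segmentation model such as SAM.

```python
def detect_key_regions(image, description):
    # Stub detector: returns one center-point position per described object.
    # A real detector would locate each object in the image.
    return [{"name": word.strip(), "cx": 100 * i, "cy": 50 * i}
            for i, word in enumerate(description.split(","), start=1)]

def segment_region(image, region):
    # Stub segmenter: attaches width/height parameters to a detected region.
    # A real segmenter would derive these from the predicted mask.
    return {**region, "w": 64, "h": 64}

def extract_sample_parameters(image, description):
    """Region detection followed by visual segmentation, as in the steps above."""
    regions = detect_key_regions(image, description)
    return [segment_region(image, r) for r in regions]

params = extract_sample_parameters(image=None, description="girl, dog")
```

The detector supplies key region position information and the segmenter refines it into sample parameter information, matching the division of labor described above.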
- a constructed sample subset included in a sample set is taken as an example to illustrate a construction method of the constructed sample subset.
- a sample set is constructed according to a plurality of sample images, sample description texts of the plurality of sample images, and sample parameter information, including:
- a sample set is constructed according to the multiple sample images, the sample description texts of the multiple sample images and the adjusted sample parameter information.
- the parameter configuration conditions include but are not limited to brightness greater than a preset brightness threshold, clarity greater than a preset clarity threshold, etc., which are selected based on actual conditions, and the embodiments of the present disclosure do not impose any limitations on this.
- there are many ways to obtain the parameter configuration conditions, which are selected according to actual conditions, and the embodiments of the present disclosure do not limit this.
- the parameter configuration conditions input by the user can be received.
- the parameter configuration conditions can be read from other data acquisition devices or databases.
- it is determined whether the first sample parameter information meets the parameter configuration condition. If the first sample parameter information meets the parameter configuration condition, it is not adjusted. If the first sample parameter information does not meet the parameter configuration condition, it is adjusted to obtain the adjusted first sample parameter information.
- the parameter configuration condition is obtained; when the first sample parameter information does not meet the parameter configuration condition, the first sample parameter information is adjusted to obtain the adjusted first sample parameter information; a sample set is constructed according to multiple sample images, sample description texts of the multiple sample images and the adjusted sample parameter information. The reliability of the sample parameter information is monitored through the parameter configuration condition, thereby improving the accuracy of the sample parameter information.
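The check-and-adjust step above can be sketched as follows; the condition keys (`brightness`, `clarity`) and the clamp-to-threshold adjustment are illustrative assumptions, since the disclosure leaves the concrete conditions open.

```python
def apply_parameter_configuration(sample_params, conditions):
    """Check sample parameter information against parameter configuration
    conditions (e.g. a minimum brightness or clarity threshold) and adjust
    any value that does not meet its condition.
    """
    adjusted = dict(sample_params)
    for key, minimum in conditions.items():
        if adjusted.get(key, 0) < minimum:
            # Condition not met: adjust the parameter up to the threshold.
            adjusted[key] = minimum
    return adjusted

conditions = {"brightness": 0.4, "clarity": 0.5}
adjusted = apply_parameter_configuration({"brightness": 0.2, "clarity": 0.9},
                                         conditions)
```

Parameters that already satisfy their conditions pass through unchanged, so only unreliable sample parameter information is corrected.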
- the following steps may also be included:
- sample set includes a plurality of sample image-text pairs
- the sample image-text pairs include a sample image and a sample description text
- the sample image-text pairs carry sample parameter information
- the model parameters of the initial generation model are adjusted to obtain a trained image generation model.
- the training method of the image generation model is supervised training, that is, the sample description text in the sample set carries the real sample image, and the sample image is the generation target of the image generation model, which is used to guide the training process of the image generation model.
- the method of "obtaining a sample set” can refer to the method of obtaining a sample set in the training process of the parameter generation model mentioned above.
- the implementation method of "inputting multiple sample description texts and sample parameter information corresponding to the multiple sample description texts into the initial generation model to obtain predicted images corresponding to the multiple sample description texts” can refer to the implementation method of "inputting image generation parameters and image description texts into the image generation model to obtain target images corresponding to the image description texts” mentioned above, and the embodiments of the present disclosure will not be repeated here.
- the loss value can be calculated according to the predicted image and the sample image, and the model parameters of the initial generation model can be adjusted according to the loss value until a preset stopping condition is reached, thereby obtaining a trained image generation model.
- there are many functions for calculating loss values, such as the cross-entropy loss function, the L1-norm loss function, the maximum loss function, the mean square error loss function, and the logarithmic loss function. The specific selection is based on the actual situation, and the embodiments of the present disclosure do not impose any restrictions on this.
- the preset stop condition includes that the loss value is less than or equal to a preset threshold value. After the loss value is calculated based on the predicted image and the sample image, the loss value is compared with the preset threshold value.
- the loss value is greater than the preset threshold, it means that the difference between the predicted image and the sample image is large, and the initial generation model has poor prediction ability for the predicted image.
- the model parameters of the initial generation model can be adjusted, and the initial generation model can continue to be trained until the loss value is less than or equal to the preset threshold, indicating that the difference between the predicted image and the sample image is small, and the preset stopping condition is met, and a trained image generation model is obtained.
- the model parameters of the initial generation model are adjusted, and the initial generation model continues to be trained until the preset number of iterations is reached, at which point the iteration is stopped to obtain a trained image generation model, wherein the preset threshold and the preset number of iterations are selected according to actual conditions, and the embodiments of the present disclosure do not impose any limitations on this.
- the model parameters of the initial generation model are adjusted according to the predicted image and the sample image to obtain a trained image generation model.
- the model parameters of the initial generation model can be made more accurate.
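The two stopping conditions described above (loss at or below a preset threshold, or a preset number of iterations) can be illustrated with a toy scalar model; the model, data, and learning rate below are hypothetical and stand in for the image generation model's actual training.

```python
def train_until_converged(predict, params, samples, lr=0.1,
                          loss_threshold=1e-4, max_iterations=1000):
    """Toy supervised loop: stop when the mean-squared loss drops to the
    preset threshold, or after a preset number of iterations.
    """
    loss = float("inf")
    for _ in range(max_iterations):
        preds = [predict(params, x) for x, _ in samples]
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, samples)) / len(samples)
        if loss <= loss_threshold:
            break  # difference between prediction and target is small enough
        # Analytic gradient of the squared error for one scalar parameter.
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, samples)) / len(samples)
        params = params - lr * grad
    return params, loss

# Fit y = 2x with a single scalar weight.
predict = lambda w, x: w * x
w, final_loss = train_until_converged(predict, params=0.0,
                                      samples=[(1.0, 2.0), (2.0, 4.0)])
```

When the loss stays above the threshold, parameters keep being adjusted; once the predicted output is close enough to the target, training stops, mirroring the loop described above.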
- FIG. 6 shows a flow chart of another parameter generation model training method provided by an embodiment of the present disclosure.
- the parameter generation model training is applied to a cloud-side device and specifically includes the following steps:
- Step 602 Acquire a sample set, wherein the sample set includes a plurality of sample image-text pairs, the sample image-text pairs include a sample image and a sample description text, and the sample image-text pairs carry sample parameter information.
- Step 604 input the plurality of sample image-text pairs and prediction prompt information into the pre-trained language model to obtain image prediction parameters corresponding to the plurality of sample image-text pairs.
- Step 606 According to the image prediction parameters and sample parameter information, the model parameters of the pre-trained language model are adjusted to obtain a parameter generation model that has completed training.
- the implementation of steps 602 to 606 is detailed in the training method of the parameter generation model in the above-mentioned image generation method, and the embodiments of the present disclosure do not impose any limitation on this.
- the model parameters of the trained parameter generation model can be sent to the terminal device, so that the user can locally build the parameter generation model based on the model parameters to generate image generation parameters.
- the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a parameter generation model that has completed training; this adjustment can make the final parameter generation model more accurate.
- FIG. 7 shows a flow chart of an automatic question-answering method provided by an embodiment of the present disclosure, which specifically includes the following steps:
- Step 702 Receive an image question and answer request, wherein the image question and answer request carries an image description text.
- Step 704 Input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs.
- Step 706 Input the image generation parameters and the image description text into the image generation model to obtain a reply image corresponding to the image question and answer request.
- the implementation of steps 702 to 706 is detailed in the above steps 302 to 306, and the embodiments of the present disclosure do not impose any limitation on this.
- the visual elements of the image are semantically decomposed by utilizing a parameter generation model to obtain image generation parameters, and accurate image generation is further completed based on the image generation parameters, so that the target image can clearly express the image description text and image generation parameters, thereby improving the interpretability and controllability of the automatic question-answering process.
- FIG8 shows a flowchart of the processing process of another image generation method provided by an embodiment of the present disclosure.
- the powerful semantic understanding and associative ability of the parameter generation model is used to complete the semantic decomposition and reorganization of the visual elements of the image, and the parameters are input into the image generation model to complete accurate image generation.
- the image generation stage can be divided into two stages: the parameter generation model processing stage and the image generation model processing stage, wherein the parameter generation model processing stage includes fine-tuning the pre-trained language model, constructing the conditional generation sample set, and designing the generated prompt information;
- a parameter generation model is obtained by training a pre-trained language model based on multiple sample image-text pairs (composed of sample description texts and sample images) and sample parameter information carried by multiple sample image-text pairs; the image description text is obtained; the image description text and generation prompt information are input into the parameter generation model to obtain image generation parameters corresponding to the image description text; the image generation parameters and the image description text are input into the image generation model to obtain a target image corresponding to the image description text.
- a two-stage diffusion generation architecture is used to implement auxiliary generation based on a pre-trained language model.
- the emergence ability and appropriate fine-tuning of the pre-trained language model are used to output the image generation parameters required in the image generation model processing stage.
- image generation based on image generation parameter control is performed to output an image with precise semantics.
- the generated image can clearly express the generation prompt information or image generation parameters used in the generation process, the user can understand and verify the accuracy of the generation result.
- FIG. 9 shows a schematic diagram of a processing process of a parameter generation model provided by an embodiment of the present disclosure.
- the image description text is obtained as "two black cats and a white dog on an orange sofa"; the image description text is combined with generation prompt information 1, "The title of my image is 'two black cats and a white dog on an orange sofa'. What are the elements in the image?", and input into the parameter generation model to obtain image generation parameter 1 corresponding to the image description text;
- the image description text is combined with generation prompt information 2, "What are their bounding boxes?", and input into the parameter generation model to obtain image generation parameter 2 corresponding to the image description text, "[(black cat, [30, 171, 212, 286]), (black cat, [40, 200, 123, 412]), (white dog, [24, 543, 231, 332]), (sofa, [264, 173, 222, 221])]";
- the image description text is combined with generation prompt information 3, "What is the main color of each element?", and input into the parameter generation model to obtain image generation parameter 3 corresponding to the image description text, "[(black cat, [black, gray, dark gray]), ..., (white dog, [white, light gray, brown]), (sofa, [orange, brown, gray])]";
- finally, image generation parameter 1, image generation parameter 2 and image generation parameter 3 are aggregated to obtain the final image generation parameters.
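The final aggregation step, merging the per-prompt answers (elements, bounding boxes, main colors) into one parameter list, could look like the sketch below; the data layout is an assumption based on the example values shown above, and entries are paired by position because element names (e.g. two black cats) can repeat.

```python
def aggregate_generation_parameters(elements, boxes, colors):
    """Merge the three per-prompt results into one final image generation
    parameter list, pairing entries in order.
    """
    final = []
    for name, (box_name, box), (color_name, palette) in zip(elements, boxes, colors):
        # Entries from the three prompts are expected in the same order.
        assert name == box_name == color_name, "prompt answers out of order"
        final.append({"name": name, "box": box, "colors": palette})
    return final

elements = ["black cat", "white dog"]
boxes = [("black cat", [30, 171, 212, 286]), ("white dog", [24, 543, 231, 332])]
colors = [("black cat", ["black", "gray"]), ("white dog", ["white", "brown"])]
final_params = aggregate_generation_parameters(elements, boxes, colors)
```

Each entry of the aggregated result then carries the element name, its bounding box, and its main colors, ready to be fed to the image generation model.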
- FIG. 10 shows a schematic diagram of an interface of an image generation interface provided by an embodiment of the present disclosure.
- the image generation interface is divided into a request input interface and a result display interface.
- the request input interface includes a request input box, an “OK” control, and a “Cancel” control.
- the result display interface includes a result display box.
- the user inputs an image generation request through the request input box displayed on the client, wherein the image generation request carries the image description text, clicks the "OK" control, and the server receives the image description text sent by the client, inputs the image description text and the generation prompt information into the parameter generation model, and obtains the image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by multiple sample image-text pairs; the image generation parameters and the image description text are input into the image generation model, and the target image corresponding to the image description text is obtained, and the target image is sent to the client.
- the client displays the target image in the result display box.
- FIG11 shows a structural schematic diagram of an image generation device provided by an embodiment of the present disclosure.
- the device includes:
- a first acquisition module 1102 is configured to acquire image description text
- the first input module 1104 is configured to input the image description text and the generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on a plurality of sample image-text pairs and sample parameter information carried by the plurality of sample image-text pairs;
- the second input module 1106 is configured to input the image generation parameters and the image description text into the image generation model to obtain a target image corresponding to the image description text.
- the image generation model includes a parameter encoding unit and a coding unit; the second input module 1106 is further configured to input the image generation parameters and the image description text into the image generation model; encode the image generation parameters through the parameter encoding unit to obtain parameter encoding features; and generate, through the coding unit, the target image corresponding to the image description text according to the parameter encoding features and the image description text.
- the parameter coding unit includes a one-dimensional parameter coding unit, a two-dimensional parameter coding unit and a feature aggregation unit; the second input module 1106 is further configured to encode the one-dimensional parameters in the image generation parameters through the one-dimensional parameter coding unit to obtain one-dimensional coding features; encode the two-dimensional parameters in the image generation parameters through the two-dimensional parameter coding unit to obtain two-dimensional coding features; and aggregate the one-dimensional coding features and the two-dimensional coding features through the feature aggregation unit to obtain parameter coding features.
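The split between the one-dimensional parameter coding unit, the two-dimensional parameter coding unit, and the feature aggregation unit can be sketched as follows; the toy encoders below (scalar repetition and box averaging) are placeholders for learned encoders, and the feature dimension of 4 is arbitrary.

```python
def encode_one_dimensional(values, dim=4):
    # Toy 1-D encoder: expand each scalar parameter (e.g. an element
    # count) across the feature dimension.
    return [[float(v)] * dim for v in values]

def encode_two_dimensional(boxes, dim=4):
    # Toy 2-D encoder: collapse each (cx, cy, w, h) box to its mean,
    # then expand it across the feature dimension.
    return [[sum(box) / len(box)] * dim for box in boxes]

def aggregate_features(one_dim_feats, two_dim_feats):
    # Toy aggregation unit: concatenate the two feature sequences into
    # one sequence of parameter encoding features.
    return one_dim_feats + two_dim_feats

one_d = encode_one_dimensional([3])                    # e.g. element count
two_d = encode_two_dimensional([[30, 171, 212, 286]])  # e.g. one bounding box
parameter_encoding = aggregate_features(one_d, two_d)
```

The point of the structure, preserved even in this sketch, is that parameters of different dimensionality get their own encoders before a single aggregation step produces the parameter encoding features consumed downstream.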
- the device also includes: a third acquisition module, configured to obtain image update parameters for the target image; determine the image editing area in the target image based on the image update parameters; mask the image editing area to obtain a mask generation sequence; input the mask generation sequence and the target image into the image generation model to obtain an updated target image.
- the device further includes: a first sending module configured to send the target image and the updated target image to the client, so that the client displays the target image and the updated target image to the user.
- the device further includes: a second receiving module configured to receive image selection information sent by a user through a client, and adjust model parameters of the image generation model based on the image selection information.
- the device further includes: a third receiving module, configured to receive adjustment sample data sent by a user through a client, and adjust model parameters of the image generation model according to the adjustment sample data, wherein the adjustment sample data is constructed based on the target image.
- the device also includes: a parameter generation model training module, configured to obtain a sample set, wherein the sample set includes multiple sample image-text pairs, the sample image-text pairs include sample images and sample description texts, and the sample image-text pairs carry sample parameter information; input the multiple sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters corresponding to the multiple sample image-text pairs; adjust the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model that has completed training.
- the parameter generation model training module is further configured to obtain a plurality of sample images; input a first sample image and construction prompt information into a pre-trained language model to obtain a first sample description text of the first sample image; generate first sample parameter information of the first sample image based on the first sample image and the first sample description text; and construct a sample set based on the multiple sample images, the sample description texts of the multiple sample images, and the sample parameter information.
- the parameter generation model training module is further configured to perform region detection on the first sample image according to the first sample description text to determine key region position information of the first sample image; perform visual segmentation on the first sample image according to the key region position information to obtain first sample parameter information of the first sample image.
- the device also includes: a fourth acquisition module, configured to acquire parameter configuration conditions, and, when the first sample parameter information does not meet the parameter configuration conditions, adjust the first sample parameter information to obtain adjusted first sample parameter information; the parameter generation model training module is further configured to construct a sample set based on the multiple sample images, the sample description texts of the multiple sample images, and the adjusted sample parameter information.
- the device further comprises: an image generation model training module configured to obtain a sample set, wherein the sample set comprises a plurality of sample image-text pairs, each sample image-text pair comprises a sample image and a sample description text, and the sample image-text pairs carry sample parameter information; input multiple sample description texts and sample parameter information corresponding to the multiple sample description texts into an initial generation model to obtain predicted images corresponding to the multiple sample description texts; and adjust model parameters of the initial generation model according to the predicted images and the sample images to obtain a trained image generation model.
- the visual elements of the image are semantically decomposed using a parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, thereby improving the interpretability and controllability of image generation.
- the above is an illustrative solution of the image generating device of this embodiment. It should be noted that the technical solution of the image generating device and the technical solution of the above image generating method belong to the same concept; for details not described in the technical solution of the image generating device, reference may be made to the description of the technical solution of the above image generating method.
- the present disclosure also provides an embodiment of a parameter generation model training device
- FIG. 12 shows a schematic structural diagram of a parameter generation model training device provided by an embodiment of the present disclosure. As shown in FIG. 12, the device is applied to a cloud-side device and includes:
- the second acquisition module 1202 is configured to acquire a sample set, wherein the sample set includes a plurality of sample image-text pairs, the sample image-text pairs include a sample image and a sample description text, and the sample image-text pairs carry sample parameter information;
- the third input module 1204 is configured to input a plurality of sample image-text pairs and prediction prompt information into the pre-trained language model to obtain image prediction parameters corresponding to the plurality of sample image-text pairs respectively;
- the adjustment module 1206 is configured to adjust the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a parameter generation model that has completed training.
- by adjusting the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information, the finally obtained parameter generation model can be made more accurate.
- the above is an illustrative solution of the parameter generation model training device of this embodiment. It should be noted that the technical solution of the parameter generation model training device and the technical solution of the above parameter generation model training method belong to the same concept; for details not described in the technical solution of the parameter generation model training device, reference may be made to the description of the technical solution of the above parameter generation model training method.
- FIG. 13 shows a schematic structural diagram of an automatic question-answering device provided by an embodiment of the present disclosure.
- the device includes:
- the first receiving module 1302 is configured to receive an image question and answer request, wherein the image question and answer request carries an image description text;
- the fourth input module 1304 is configured to input the image description text and the generation prompt information into the parameter generation model to obtain the image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs;
- the fifth input module 1306 is configured to input the image generation parameters and the image description text into the image generation model to obtain a reply image corresponding to the image question and answer request.
- the visual elements of the image are semantically decomposed using a parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, thereby improving the interpretability and controllability of the automatic question-answering process.
- the components of the computing device 1400 include but are not limited to a memory 1410 and a processor 1420.
- the processor 1420 is connected to the memory 1410 via a bus 1430, and a database 1450 is used to store data.
- the computing device 1400 also includes an access device 1440 that enables the computing device 1400 to communicate via one or more networks 1460.
- examples of the network 1460 include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
- the access device 1440 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and the like.
- the above components of the computing device 1400 and other components not shown in FIG. 14 may also be connected to each other, for example, through a bus. It should be understood that the computing device structure block diagram shown in FIG. 14 is only for illustrative purposes and is not intended to limit the scope of the present disclosure. Those skilled in the art may add or replace other components as needed.
- Computing device 1400 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or a personal computer (PC).
- Computing device 1400 may also be a mobile or stationary server.
- the processor 1420 is used to execute the following computer-executable instructions, which, when executed by the processor, implement the steps of the above-mentioned image generation method or parameter generation model training method or automatic question-answering method.
- the above is an illustrative solution of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solutions of the above-mentioned image generation method, parameter generation model training method, and automatic question answering method belong to the same concept; for details not described in the technical solution of the computing device, reference may be made to the description of the technical solution of the above-mentioned image generation method, parameter generation model training method, or automatic question answering method.
- An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, can implement the steps of the above-mentioned image generation method or parameter generation model training method or automatic question-answering method.
- the above is an illustrative solution of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solutions of the above-mentioned image generation method, parameter generation model training method, and automatic question answering method belong to the same concept; for details not described in the technical solution of the storage medium, reference may be made to the description of the technical solution of the above-mentioned image generation method, parameter generation model training method, or automatic question answering method.
- An embodiment of the present disclosure further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above-mentioned image generation method or parameter generation model training method or automatic question-answering method.
- the above is an illustrative solution of the computer program of this embodiment. It should be noted that the technical solution of the computer program and the technical solutions of the above-mentioned image generation method, parameter generation model training method, and automatic question answering method belong to the same concept; for details not described in the technical solution of the computer program, reference may be made to the description of the technical solution of the above-mentioned image generation method, parameter generation model training method, or automatic question answering method.
- the computer instructions include computer program codes, which may be in source code form, object code form, executable files or some intermediate forms, etc.
- the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, USB flash drive, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electric carrier signal, telecommunication signal and software distribution medium, etc. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
Abstract
Description
The present disclosure claims priority to the Chinese patent application filed with the Chinese Patent Office on November 28, 2023, with application number 2023116226401 and entitled "Image Generation, Automatic Question Answering, and Parameter Generation Model Training Method", the entire contents of which are incorporated herein by reference.
The embodiments of the present disclosure relate to the field of computer technology, and in particular to image generation, automatic question answering, and parameter generation model training methods.
With the development of computer technology, text-to-image generation has gradually become a core technology in the field of AI-generated content (AIGC). Text-to-image technology can generate images from text descriptions and can transform and adjust them according to user requirements and input content, making it easier for users to create works of art with unique styles; it has been widely applied in the field of digital art.
Currently, a common text-to-image architecture is the hidden space diffusion model (HSDM, Hidden Space Diffusion Model). However, because the hidden space diffusion model operates in a hidden space, the generation process is difficult to understand, and the generated image can differ considerably from the text originally input by the user.
Summary of the invention
In view of this, an embodiment of the present disclosure provides an image generation method. One or more embodiments of the present disclosure also relate to an automatic question answering method, a parameter generation model training method, an image generation device, an automatic question answering device, a parameter generation model training device, a computing device, a computer-readable storage medium, and a computer program, to solve the technical defects existing in the prior art.
According to a first aspect of the embodiments of the present disclosure, an image generation method is provided, including:
acquiring an image description text;
inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe visual features of an image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs;
inputting the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
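The two-stage flow of the first aspect (description text → explicit generation parameters → target image) can be sketched as follows. This is only an illustrative stand-in: both model functions are stubs, and all names and parameter keys are assumptions, not implementations or APIs from the disclosure.

```python
# Hypothetical sketch of the first-aspect pipeline; both models are stubbed.
def parameter_generation_model(description: str, generation_prompt: str) -> dict:
    # Semantically decompose the description into explicit visual-feature
    # parameters (subject, layout, palette, ...). Stub values for illustration.
    return {"subject": description, "layout": "centered", "palette": "warm"}

def image_generation_model(params: dict, description: str) -> str:
    # Condition the generator on both the parameters and the original text.
    # A real generator would return pixels; here we return a string tag.
    return f"image(subject={params['subject']}, layout={params['layout']})"

def generate_target_image(description: str, generation_prompt: str) -> str:
    # Stage 1: text -> image generation parameters.
    params = parameter_generation_model(description, generation_prompt)
    # Stage 2: parameters + text -> target image.
    return image_generation_model(params, description)
```

Because the parameters are produced explicitly before generation, they can be inspected or edited by the user, which is the source of the interpretability and controllability claimed above.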
According to a second aspect of the embodiments of the present disclosure, a parameter generation model training method is provided, applied to a cloud-side device, including:
acquiring a sample set, wherein the sample set includes multiple sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and the sample image-text pairs carry sample parameter information;
inputting the multiple sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters respectively corresponding to the multiple sample image-text pairs;
adjusting model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
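The training loop of the second aspect can be sketched minimally as below. The pre-trained language model is replaced here by a toy one-parameter model so the "predict, compare with sample parameter information, adjust model parameters" cycle is concrete; every name is illustrative and the update rule is an assumption, not the method of the disclosure.

```python
# Toy stand-in for the pre-trained language model: it maps a sample
# image-text pair to a single numeric "image prediction parameter".
class ToyParameterModel:
    def __init__(self):
        self.bias = 0.0  # the lone trainable model parameter

    def predict(self, sample_image, sample_text, prediction_prompt):
        # Placeholder scoring: text length plus the learned bias.
        return 0.1 * len(sample_text) + self.bias

def train(model, sample_set, lr=0.1, epochs=50):
    # sample_set: iterable of (sample_image, sample_text, sample_parameter)
    for _ in range(epochs):
        for image, text, target in sample_set:
            # Compare the image prediction parameter with the carried
            # sample parameter information, then adjust the model parameter.
            error = model.predict(image, text, "predict parameters") - target
            model.bias -= lr * error  # gradient step on the squared error
    return model
```

A real implementation would fine-tune all weights of a language model with a supervised loss over structured parameter outputs; the loop structure, however, is the same.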
According to a third aspect of the embodiments of the present disclosure, an automatic question answering method is provided, including:
receiving an image question answering request, wherein the image question answering request carries an image description text;
inputting the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe visual features of an image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs;
inputting the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question answering request.
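The question-answering flow of the third aspect wraps the same two models behind a request handler; a sketch with stand-in models follows, where the request field name and both stub models are assumptions for illustration only.

```python
def answer_image_question(request, parameter_model, image_model):
    # The image QA request carries the image description text.
    text = request["image_description_text"]
    # Stage 1: obtain image generation parameters from the description.
    params = parameter_model(text, "generation prompt")
    # Stage 2: generate the reply image from the parameters and the text.
    return image_model(params, text)

# Illustrative stand-ins for the two trained models.
def stub_parameter_model(text, prompt):
    return {"subject": text}

def stub_image_model(params, text):
    return f"reply_image[{params['subject']}]"
```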
According to a fourth aspect of the embodiments of the present disclosure, an image generation device is provided, including:
a first acquisition module, configured to acquire an image description text;
a first input module, configured to input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe visual features of an image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs;
a second input module, configured to input the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
According to a fifth aspect of the embodiments of the present disclosure, a parameter generation model training device is provided, applied to a cloud-side device, including:
a second acquisition module, configured to acquire a sample set, wherein the sample set includes multiple sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and the sample image-text pairs carry sample parameter information;
a third input module, configured to input the multiple sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters respectively corresponding to the multiple sample image-text pairs;
an adjustment module, configured to adjust model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
According to a sixth aspect of the embodiments of the present disclosure, an automatic question answering device is provided, including:
a first receiving module, configured to receive an image question answering request, wherein the image question answering request carries an image description text;
a fourth input module, configured to input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe visual features of an image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs;
a fifth input module, configured to input the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question answering request.
According to a seventh aspect of the embodiments of the present disclosure, a computing device is provided, including:
a memory and a processor;
the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the method provided in the first, second, or third aspect above.
According to an eighth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing computer-executable instructions which, when executed by a processor, implement the steps of the method provided in the first, second, or third aspect above.
According to a ninth aspect of the embodiments of the present disclosure, a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the method provided in the first, second, or third aspect above.
An image generation method provided by an embodiment of the present disclosure acquires an image description text; inputs the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe visual features of an image, and the parameter generation model is trained based on multiple sample image-text pairs and sample parameter information carried by the multiple sample image-text pairs; and inputs the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text. By using the parameter generation model to semantically decompose the visual elements of the image, image generation parameters are obtained, and accurate image generation is then completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, thereby improving the interpretability and controllability of image generation.
FIG. 1 is an architecture diagram of an image generation system provided by an embodiment of the present disclosure;
FIG. 2 is an architecture diagram of another image generation system provided by an embodiment of the present disclosure;
FIG. 3 is a flowchart of an image generation method provided by an embodiment of the present disclosure;
FIG. 4 is a processing flowchart of an image generation method provided by an embodiment of the present disclosure;
FIG. 5 is a flowchart of a parameter generation model training method provided by an embodiment of the present disclosure;
FIG. 6 is a flowchart of another parameter generation model training method provided by an embodiment of the present disclosure;
FIG. 7 is a flowchart of an automatic question answering method provided by an embodiment of the present disclosure;
FIG. 8 is a processing flowchart of another image generation method provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the processing procedure of a parameter generation model provided by an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an image generation interface provided by an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of an image generation device provided by an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a parameter generation model training device provided by an embodiment of the present disclosure;
FIG. 13 is a schematic structural diagram of an automatic question answering device provided by an embodiment of the present disclosure;
FIG. 14 is a structural block diagram of a computing device provided by an embodiment of the present disclosure.
在下面的描述中阐述了很多具体细节以便于充分理解本公开。但是本公开能够以很多不同于在此描述的其它方式来实施,本领域技术人员可以在不违背本公开内涵的情况下做类似推广,因此本公开不受下面公开的具体实施的限制。Many specific details are described in the following description to facilitate a full understanding of the present disclosure. However, the present disclosure can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without violating the connotation of the present disclosure, so the present disclosure is not limited by the specific implementation disclosed below.
在本公开一个或多个实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开一个或多个实施例。在本公开一个或多个实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本公开一个或多个实施例中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terms used in one or more embodiments of the present disclosure are only for the purpose of describing specific embodiments, and are not intended to limit one or more embodiments of the present disclosure. The singular forms of "a", "said" and "the" used in one or more embodiments of the present disclosure and the appended claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" used in one or more embodiments of the present disclosure refers to and includes any or all possible combinations of one or more associated listed items.
应当理解,尽管在本公开一个或多个实施例中可能采用术语第一、第二等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开一个或多个实施例范围的情况下,第一也可以被称为第二,类似地,第二也可以被称为第一。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, etc. may be used to describe various information in one or more embodiments of the present disclosure, these information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, without departing from the scope of one or more embodiments of the present disclosure, the first may also be referred to as the second, and similarly, the second may also be referred to as the first. Depending on the context, the word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to determining".
此外,需要说明的是,本公开一个或多个实施例所涉及的用户信息(包括但不限于用户设备信息、用户个人信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等),均为经用户授权或者经过各方充分授权的信息和数据,并且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准,并提供有相应的操作入口,供用户选择授权或者拒绝。In addition, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of the present disclosure are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with relevant laws, regulations and standards of relevant countries and regions, and corresponding operation entrances are provided for users to choose to authorize or refuse.
In one or more embodiments of the present disclosure, a large model refers to a deep learning model with large-scale model parameters, usually containing hundreds of millions, tens of billions, hundreds of billions, trillions, or even more than ten trillion model parameters. A large model is also called a foundation model. It is pre-trained on a large-scale unlabeled corpus to produce a pre-trained model with more than a hundred million parameters. Such a model can adapt to a wide range of downstream tasks and has good generalization ability; examples include large language models (LLM, Large Language Model) and multi-modal pre-training models.
In practical applications, a large model can be applied to different tasks by fine-tuning the pre-trained model with only a small number of samples. Large models can be widely used in natural language processing (NLP, Natural Language Processing), computer vision and other fields; specifically, they can be applied to computer vision tasks such as visual question answering (VQA, Visual Question Answering), image captioning (IC, Image Caption) and image generation, as well as natural language processing tasks such as text-based sentiment classification, text summarization and machine translation. The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, etc.
First, the terms involved in one or more embodiments of the present disclosure are explained.
Latent space diffusion model: a latent space diffusion model refers to an artificial intelligence technique widely used in image editing and image generation, which encodes images into a latent space for noise addition and denoising.
Attention mechanism: the attention mechanism is a machine learning method that can assign different weights to different parts of the input data according to their importance.
CLIP: CLIP (Contrastive Language-Image Pre-training) is a model for measuring the correlation between text and images, and is widely used for text and image representation extraction.
Large language model: a large language model is trained on massive amounts of text data to predict and generate various forms of language expression, such as texts, sentences and paragraphs.
LoRA: Low-Rank Adaptation, a fine-tuning method for large language models that can achieve good results by training only a small number of parameters when adapting a large model to downstream tasks.
Text-to-image technology outputs an image based on an image description text input by the user, and has great value in many scenarios such as advertising recommendation, interactive entertainment and art design. A common text-to-image architecture is the latent space diffusion model. Since the latent space diffusion model operates in a latent space, it is difficult to understand and explain its internal working mechanism and decision-making process, which makes the generation process hard to interpret. Moreover, this type of method relies on a feature extraction module (a CLIP encoder) to understand the text, and cannot exert semantic control over visual elements in the generation stage, which leads to a large difference between the final generated result and the image description text originally input by the user.
In order to improve the accuracy of the image generation model, an embodiment of the present disclosure proposes an image generation method combined with a large language model, which uses the powerful semantic understanding and association capabilities of the large language model to semantically decompose and reorganize the visual elements of the image, parameterizes the visual elements and inputs them into the image generation model to complete accurate image generation, thereby providing stronger interpretability and controllability of the image generation process. Moreover, the large language model can align complex long texts, laying a technical foundation for more user interaction methods. Furthermore, by combining the large language model and the image generation model, extended applications such as convenient, fine-grained image editing and generation of images with similar structures can be realized.
Specifically, an embodiment of the present disclosure proposes an image generation scheme: obtain an image description text; input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and the sample parameter information carried by the multiple sample image-text pairs; and input the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
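The scheme above can be sketched end to end as follows. This is a minimal illustrative sketch, not the disclosed implementation: the function names (`generate_parameters`, `generate_image`), the fixed parameter dictionary, and the string stand-in for the diffusion model are all assumptions used to show the data flow between the two models.

```python
# Hypothetical sketch of the two-stage pipeline: an LLM-based parameter
# generation model produces parameterized generation conditions, which are
# then fed (together with the description) into an image generation model.

GENERATION_PROMPT = "What are the elements in the image?"  # generation prompt information

def generate_parameters(description: str, prompt: str) -> dict:
    """Stand-in for the parameter generation model (a fine-tuned LLM) that
    semantically decomposes the description into natural-language-describable
    parameters. A real system would query the LLM; here we return a fixed
    example for the sample description used in the disclosure."""
    return {
        "elements": [
            {"name": "black cat", "count": 2, "color": "black",
             "position": (0.30, 0.55), "size": 0.20},
            {"name": "white dog", "count": 1, "color": "white",
             "position": (0.65, 0.55), "size": 0.25},
            {"name": "sofa", "count": 1, "color": "orange",
             "position": (0.50, 0.70), "size": 0.80},
        ]
    }

def generate_image(params: dict, description: str) -> str:
    """Stand-in for the image generation model (e.g. a latent diffusion model
    conditioned on both the text and the encoded parameters)."""
    names = ", ".join(e["name"] for e in params["elements"])
    return f"image({description} | {names})"

description = "two black cats and a white dog on an orange sofa"
params = generate_parameters(description, GENERATION_PROMPT)
image = generate_image(params, description)
```

Because every generation condition is an explicit, inspectable parameter rather than an opaque latent, the intermediate `params` dictionary is where the interpretability and controllability claimed by the scheme come from.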
In the present disclosure, an image generation method is provided. The present disclosure also relates to an automatic question-answering method, a parameter generation model training method, an image generation apparatus, an automatic question-answering apparatus, a parameter generation model training apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to FIG. 1, FIG. 1 shows an architecture diagram of an image generation system provided by an embodiment of the present disclosure. The image generation system may include a client 100 and a server 200;
the client 100 is configured to send an image description text to the server 200;
the server 200 is configured to input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and the sample parameter information carried by the multiple sample image-text pairs; input the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text; and send the target image to the client 100;
the client 100 is further configured to receive the target image sent by the server 200.
By applying the solution of the embodiment of the present disclosure, the visual elements of the image are semantically decomposed by a parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, thereby improving the interpretability and controllability of image generation.
Referring to FIG. 2, FIG. 2 shows an architecture diagram of another image generation system provided by an embodiment of the present disclosure. The image generation system may include multiple clients 100 and a server 200, wherein a client 100 may be a terminal-side device and the server 200 may be a cloud-side device. Communication connections may be established between the multiple clients 100 through the server 200. In an image generation scenario, the server 200 provides an image generation service among the multiple clients 100; each client 100 may act as a sender or a receiver, and communication is realized through the server 200.
A user can interact with the server 200 through a client 100 to receive data sent by other clients 100, or to send data to other clients 100, etc. In an image generation scenario, a user may publish a data stream to the server 200 through a client 100; the server 200 generates a target image according to the data stream and pushes the target image to other clients with which communication has been established.
The client 100 and the server 200 are connected via a network. The network provides the medium for the communication link between the client 100 and the server 200, and may include various connection types, such as wired or wireless communication links or optical fiber cables. Data transmitted by the client 100 may need to be encoded, transcoded, compressed, etc. before being published to the server 200.
The client 100 may be a browser, an APP (application), a web application such as an H5 (HyperText Markup Language 5) application, a light application (also known as a mini-program, a lightweight application), a cloud application, etc. The client 100 may be developed based on a software development kit (SDK, Software Development Kit) of the corresponding service provided by the server 200, for example a real-time communication (RTC, Real Time Communication) SDK. The client 100 may be deployed in an electronic device, and may need to rely on the device, or on certain APPs in the device, to run. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer or a personal computer. Various other types of applications may also be installed on the electronic device, such as human-computer dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The server 200 may include servers that provide various services, for example a server that provides communication services for multiple clients, a server for background training that supports the models used on the clients, or a server that processes data sent by the clients. It should be noted that the server 200 may be implemented as a distributed server cluster composed of multiple servers, or as a single server. The server may also be a server of a distributed system, or a server combined with a blockchain. The server may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
It is worth noting that the image generation method provided in the embodiments of the present disclosure is generally executed by the server; however, in other embodiments of the present disclosure, the client may also have functions similar to those of the server and thereby execute the image generation method provided in the embodiments of the present disclosure. In still other embodiments, the image generation method provided in the embodiments of the present disclosure may also be executed jointly by the client and the server.
Referring to FIG. 3, FIG. 3 shows a flow chart of an image generation method provided by an embodiment of the present disclosure, which specifically includes the following steps:
Step 302: obtain an image description text.
In one or more embodiments of the present disclosure, when image generation starts, an image description text may be obtained, and a target image that meets the actual needs of the user may be generated based on the image description text.
Specifically, the image description text represents the user's image generation needs. The image description text may be in different languages, such as English or Chinese. Since a large language model is introduced into the image generation process, the image description text may be a short text or a long text. For example, the image description text may be "two black cats and a white dog on an orange sofa".
In practical applications, there are many ways to obtain the image description text, which may be selected according to the actual situation; the embodiments of the present disclosure impose no limitation on this. In one possible implementation of the present disclosure, the image description text sent by the user through the client may be received. In another possible implementation of the present disclosure, the image description text may be read from another data acquisition device or from a database.
Step 304: input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters are used to describe the visual features of the image, and the parameter generation model is trained based on multiple sample image-text pairs and the sample parameter information carried by the multiple sample image-text pairs.
In one or more embodiments of the present disclosure, after the image description text is obtained, the image description text and the generation prompt information may further be input into the parameter generation model to obtain the image generation parameters corresponding to the image description text.
Specifically, the parameter generation model is obtained by training a pre-trained language model on a sample set. The pre-trained language model may be a large language model, such as a multimodal large model, or a language processing model trained on a first training set. The image generation parameters can be understood as parameterized image generation conditions, and include but are not limited to size parameters, position parameters, shape parameters, etc. The generation prompt information can be understood as a parameter generation paradigm, also called a generation condition paradigm, and is used to guide the parameter generation model to generate the image generation parameters. The generation prompt information includes but is not limited to image generation condition elements such as shape, color, size, position, blur and key points; for example, the generation prompt information may be "what are the elements in the image".
It should be noted that in the embodiments of the present disclosure, the generation conditions are unified into parameters that can be described in natural language, rather than image conditions, so as to facilitate semantic understanding by the large language model.
Step 306: input the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
In one or more embodiments of the present disclosure, after the image description text is obtained and the image description text and the generation prompt information are input into the parameter generation model to obtain the corresponding image generation parameters, the image generation parameters and the image description text may further be input into the image generation model to obtain the target image corresponding to the image description text.
Specifically, the image generation model may be a latent space diffusion model, or an image generation model trained on a second training set. The image generation model is used to generate the final target image based on the image generation parameters and the image description text. The target image may be a black-and-white image or a color image (RGB image).
By applying the solution of the embodiment of the present disclosure, the visual elements of the image are semantically decomposed by the parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on the image generation parameters, so that the target image can clearly express the image description text and the image generation parameters, thereby improving the interpretability and controllability of image generation.
In an optional embodiment of the present disclosure, when the image generation parameters and the image description text are input into the image generation model, a side-branch structure may be used in the image generation model to process the image generation parameters; that is, on the basis of the encoding and decoding unit originally included in the image generation model, a parameter encoding unit is added as a side-branch structure, so that the image generation model includes a parameter encoding unit and an encoding and decoding unit. Inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text may include the following steps:
inputting the image generation parameters and the image description text into the image generation model, and encoding the image generation parameters by the parameter encoding unit to obtain parameter encoding features;
generating, by the encoding and decoding unit, the target image corresponding to the image description text according to the parameter encoding features and the image description text.
Referring to FIG. 4, FIG. 4 shows a processing flow chart of an image generation method provided by an embodiment of the present disclosure. As shown in FIG. 4, after the parameter generation model generates the image generation parameters, the image description text and the image generation parameters may be input into the image generation model together. In the image generation model, the image generation parameters are first encoded by the parameter encoding unit to obtain parameter encoding features; then the encoding and decoding unit generates the target image corresponding to the image description text according to the parameter encoding features and the image description text.
It should be noted that the encoding and decoding unit includes an encoding unit and a decoding unit. After the parameter encoding features are obtained, they may be input into the encoding unit in the latent space for parameterized condition control, and passed through a residual connection to the decoding unit to achieve the final parameter fusion generation.
In practical applications, the parameter encoding unit may directly encode the image generation parameters to obtain the parameter encoding features. Furthermore, since the image generation parameters include parameters of different dimensions, the parameter encoding unit may also encode the image generation parameters into parameter encoding features of different dimensions to accommodate this diversity.
By applying the solution of the embodiment of the present disclosure, the image generation parameters and the image description text are input into the image generation model; the parameter encoding unit encodes the image generation parameters to obtain parameter encoding features; and the encoding and decoding unit generates the target image corresponding to the image description text according to the parameter encoding features and the image description text. Incorporating the image generation parameters as a control in the target image generation process ensures the accuracy of the target image.
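A minimal numeric sketch of this side-branch conditioning follows. The wiring choices are illustrative assumptions, not the concrete architecture of FIG. 4: the conditioning injection is modeled as element-wise addition in the encoder, and the parameter fusion in the decoder as a scaled residual.

```python
# Toy side-branch conditioning: parameter features are injected into the
# encoder path and fused again into the decoder path via a residual.

def encode_parameters(params):
    """Parameter encoding unit: map generation parameters to a feature vector."""
    return [float(v) for v in params]

def encoder(text_features, param_features):
    """Encoding unit in the latent space: inject the parameter features as a
    conditioning signal (here, simple element-wise addition)."""
    return [t + p for t, p in zip(text_features, param_features)]

def decoder(latent, param_features):
    """Decoding unit: fuse the parameter features once more through a
    (scaled) residual connection before producing the output."""
    return [l + 0.5 * p for l, p in zip(latent, param_features)]

text_features = [0.1, 0.2, 0.3]                 # stand-in for encoded description text
param_features = encode_parameters([1, 2, 3])   # stand-in for encoded generation parameters
latent = encoder(text_features, param_features)
output = decoder(latent, param_features)
```

The point of the residual path is that the decoder sees the parameter features directly, not only through the encoder's latent, so the parameterized conditions still constrain the output at the fusion stage.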
In an optional embodiment of the present disclosure, the parameter encoding unit includes a one-dimensional parameter encoding unit, a two-dimensional parameter encoding unit and a feature aggregation unit; encoding the image generation parameters by the parameter encoding unit to obtain the parameter encoding features may include the following steps:
encoding, by the one-dimensional parameter encoding unit, the one-dimensional parameters in the image generation parameters to obtain one-dimensional encoding features;
encoding, by the two-dimensional parameter encoding unit, the two-dimensional parameters in the image generation parameters to obtain two-dimensional encoding features;
aggregating, by the feature aggregation unit, the one-dimensional encoding features and the two-dimensional encoding features to obtain the parameter encoding features.
Specifically, the image generation parameters may include parameters of different dimensions, such as one-dimensional parameters and two-dimensional parameters. The one-dimensional parameters include but are not limited to color parameters and blur parameters; the two-dimensional parameters include but are not limited to position parameters and shape parameters. Therefore, in order to better encode the image generation parameters, the embodiment of the present disclosure divides the parameter encoding unit into a one-dimensional parameter encoding unit for encoding the one-dimensional parameters and a two-dimensional parameter encoding unit for encoding the two-dimensional parameters.
Furthermore, since the image generation parameters characterize the generation conditions of the target image as a whole, a feature aggregation unit may additionally be added to the parameter encoding unit, so that the one-dimensional encoding features and the two-dimensional encoding features are aggregated by the feature aggregation unit to obtain the parameter encoding features.
By applying the solution of the embodiment of the present disclosure, the one-dimensional parameters in the image generation parameters are encoded by the one-dimensional parameter encoding unit to obtain one-dimensional encoding features; the two-dimensional parameters are encoded by the two-dimensional parameter encoding unit to obtain two-dimensional encoding features; and the one-dimensional and two-dimensional encoding features are aggregated by the feature aggregation unit to obtain the parameter encoding features. Encoding parameters of different dimensions separately ensures the accuracy of the parameter encoding features.
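The split encoder described above can be sketched as follows. The concrete encodings (normalizing scalars, flattening coordinate pairs) and the use of concatenation as the aggregation are illustrative assumptions; the disclosure does not fix these operations.

```python
# Toy split parameter encoder: one path for one-dimensional parameters
# (e.g. color, blur), one for two-dimensional parameters (e.g. position,
# shape), followed by a feature aggregation unit.

def encode_1d(params_1d):
    """One-dimensional parameter encoding unit: normalize scalar conditions."""
    return [p / 10.0 for p in params_1d]

def encode_2d(params_2d):
    """Two-dimensional parameter encoding unit: flatten (x, y) pairs into a
    feature vector."""
    return [coord for point in params_2d for coord in point]

def aggregate(feat_1d, feat_2d):
    """Feature aggregation unit: here, simple concatenation."""
    return feat_1d + feat_2d

# e.g. blur strength 5 and a color index 2, plus two object positions
features = aggregate(encode_1d([5, 2]),
                     encode_2d([(0.3, 0.55), (0.65, 0.55)]))
```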
In an optional embodiment of the present disclosure, after the image generation parameters and the image description text are input into the image generation model to obtain the target image corresponding to the image description text, image update parameters for the target image may further be received, and the target image may be edited based on the image update parameters, or an image with a structure similar to the target image may be generated. That is, after the image generation parameters and the image description text are input into the image generation model to obtain the target image corresponding to the image description text, the method may further include the following steps:
obtaining image update parameters for the target image;
determining an image editing region in the target image according to the image update parameters;
masking the image editing region to obtain a mask generation sequence;
inputting the mask generation sequence and the target image into the image generation model to obtain an updated target image.
Specifically, the image update parameters are used to modify or adjust the target image to obtain an updated target image. An image update parameter may be an image generation condition independent of the image generation parameters; for example, the image generation parameter is "the number of black cats is 2" and the image update parameter is "the number of black cats is 3", or the image generation parameter is "the number of black cats is 2" and the image update parameter is "the number of black cats plus one". There are many ways to obtain the image update parameters for the target image, which may be selected according to the actual situation; the embodiments of the present disclosure impose no limitation on this. In one possible implementation of the present disclosure, the image update parameters sent by the user through the client may be received. In another possible implementation of the present disclosure, the image update parameters for the target image may be read from another data acquisition device or from a database.
It should be noted that after the image update parameters for the target image are obtained and it is determined that the image update parameters have changed compared with the image generation parameters, the image editing region in the target image can be determined. The image editing region refers to the region in the target image whose parameters need to be updated. When determining the image editing region in the target image according to the image update parameters, the image editing region in the attention map output by the attention mechanism may be determined, and the image editing region in the attention map may be masked to generate the mask generation sequence.
By applying the solution of the embodiment of the present disclosure, image update parameters for the target image are obtained; the image editing region in the target image is determined according to the image update parameters; the image editing region is masked to obtain a mask generation sequence; and the mask generation sequence and the target image are input into the image generation model to obtain an updated target image. By regenerating only the image editing region in the attention map based on the mask generation sequence, regions whose parameters are not updated are left unaffected, realizing identity-preserving image editing and similar-structure image generation.
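The steps above can be sketched as follows. The one-dimensional "image", the parameter diff, and the element-to-pixel bookkeeping are simplifying assumptions standing in for the attention-map-based region localization:

```python
# Toy mask construction: compare old and new parameters, find which
# elements changed, and mark only their region for re-generation.

def editing_region(old_params, new_params):
    """Return the names of elements whose parameters changed."""
    return {k for k in new_params if old_params.get(k) != new_params[k]}

def build_mask(regions, element_to_pixels, size):
    """Mask generation sequence: 1 where re-generation is needed, 0 where
    the original content must be preserved."""
    mask = [0] * size
    for name in regions:
        for i in element_to_pixels[name]:
            mask[i] = 1
    return mask

old = {"black cat": 2, "white dog": 1}
new = {"black cat": 3, "white dog": 1}            # "the number of black cats plus one"
pixels = {"black cat": [0, 1], "white dog": [3]}  # stand-in for attention-map regions
mask = build_mask(editing_region(old, new), pixels, size=5)
```

Only the black-cat region is marked; the dog's region and the background stay at 0, which is what makes the edit identity-preserving.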
In an optional embodiment of the present disclosure, taking the case where the image generation model is a diffusion model as an example, inputting the mask generation sequence and the target image into the image generation model to obtain the updated target image may include the following step:
for the current diffusion step of the image generation model, generating the diffusion result of the current diffusion step according to the diffusion results of the completed diffusion steps and the mask generation sequence, until an iteration stop condition is reached, to obtain the updated target image.
Specifically, the processing of a diffusion model includes multiple diffusion steps, each of which can be regarded as a transformation of the image. These transformations help the diffusion model better simulate the process of image degradation and recover a noise-free image from a noisy one. In general, the diffusion steps of a diffusion model fall into two stages: a diffusion stage and a denoising stage. In the diffusion stage, the model simulates image degradation by gradually adding noise to the input image; in the denoising stage, the model identifies and removes the noise from the noisy image to recover the original noise-free image.
In the embodiments of the present disclosure, within the image generation model, an editing-style image generation pass can be performed based on the diffusion results of each diffusion step from the previous generation pass. During this editing-style generation, the diffusion result can be multiplied by the mask generation sequence to obtain the diffusion result of the current diffusion step, until the iteration stop condition is reached, thereby obtaining the updated target image.
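A minimal sketch of the per-step combination described above: inside the masked editing region the newly generated content is kept, while outside it the diffusion result of the previous generation pass is retained. The element-wise blend shown here is an assumption consistent with common masked-editing practice for diffusion models, not the only way the multiplication could be applied:

```python
import numpy as np

def masked_diffusion_step(prev_result: np.ndarray,
                          new_result: np.ndarray,
                          mask: np.ndarray) -> np.ndarray:
    """Keep the newly generated content where mask == 1 (the image
    editing region) and the previous pass's result everywhere else."""
    return mask * new_result + (1.0 - mask) * prev_result

prev = np.array([[1.0, 2.0], [3.0, 4.0]])   # diffusion result of the previous pass
new = np.array([[9.0, 9.0], [9.0, 9.0]])    # freshly generated content
mask = np.array([[1.0, 0.0], [0.0, 0.0]])   # edit only the top-left position
step = masked_diffusion_step(prev, new, mask)
```

Only the masked position changes; all other positions carry the previous result forward, which is what preserves identity outside the editing region.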
By applying the solution of this embodiment of the present disclosure, for the current diffusion step of the image generation model, the diffusion result of the current diffusion step is generated according to the diffusion results of the completed diffusion steps and the mask generation sequence, until the iteration stop condition is reached, to obtain the updated target image, thereby achieving identity-preserving image editing and similar-structure image generation.
In an optional embodiment of the present disclosure, after inputting the mask generation sequence and the target image into the image generation model to obtain the updated target image, the method may further include the following step:
sending the target image and the updated target image to a client, so that the client displays the target image and the updated target image to a user.
It should be noted that, after the updated target image is obtained, the target image and the updated target image can be sent to the client together, and the user can select, from the target image and the updated target image displayed on the client, the image that meets the actual needs.
In practical applications, there are many ways to send the target image and the updated target image to the client, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. In one possible implementation of the present disclosure, the target image and the updated target image can be sent directly to the client so that the client displays them to the user. In another possible implementation of the present disclosure, the target image and the updated target image can be sent to the client according to display demand information of the user, where the display demand information represents the user's demand for viewing the target image. The display demand information includes, but is not limited to, displaying only the target image and the updated target image, or displaying the image description information together with the target image and the updated target image. The display demand information is set according to the actual needs of the user, and the embodiments of the present disclosure impose no limitation in this regard.
By applying the solution of this embodiment of the present disclosure, the target image and the updated target image are sent to the client so that the client displays both to the user, providing the user with multiple options; moreover, the user can intuitively compare the target image with the updated target image to determine whether the updated target image meets his or her needs.
In an optional embodiment of the present disclosure, after sending the target image and the updated target image to the client, the method may further include the following step:
receiving image selection information sent by the user through the client, and adjusting model parameters of the image generation model based on the image selection information.
Specifically, the image selection information is used to identify the image selected by the user. The image selection information includes, but is not limited to, an image number and an image position, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard.
It should be noted that the user may send the image selection information by means including, but not limited to, voice input, touch input, and text input. Based on the image selection information, the image that meets the user's needs can be determined, and the image corresponding to the image selection information can further be used as a ground-truth image for adjusting the parameters of the image generation model.
By applying the solution of this embodiment of the present disclosure, the image selection information sent by the user through the client is received, and the model parameters of the image generation model are adjusted based on the image selection information, so that the image generation model gains the ability to interact with the user, enabling the user to adjust the image generation results interactively and improving user satisfaction.
In an optional embodiment of the present disclosure, after inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text, the method may further include the following step:
receiving adjustment sample data sent by the user through the client, and adjusting the model parameters of the image generation model according to the adjustment sample data, where the adjustment sample data is constructed based on the target image.
In practical applications, after the image generation parameters and the image description text are input into the image generation model and the target image corresponding to the image description text is obtained, the target image can be sent to the client. The user may then process the target image on his or her own; if the user is not satisfied with the target image, the user may also send adjustment sample data to retrain the model. Specifically, after the adjustment sample data sent by the user through the client is received, in a first possible implementation, the model parameters of the image generation model can be adjusted according to the adjustment sample data; in a second possible implementation, the model parameters of the parameter generation model can be adjusted according to the adjustment sample data; in a third possible implementation, the model parameters of both the image generation model and the parameter generation model can be adjusted according to the adjustment sample data. The adjustment sample data may be obtained by the user adding labels to the target image (such as positive and negative sample labels) or by adjusting parameters of the target image (such as size parameters); the embodiments of the present disclosure impose no limitation on how the adjustment sample data is generated.
It should be noted that "adjusting the model parameters of the image generation model according to the adjustment sample data" is implemented in the same way as the training of the image generation model, and is therefore not repeated in the embodiments of the present disclosure.
By applying the solution of this embodiment of the present disclosure, the target image is sent to the client so that the client displays the target image to the user; the adjustment sample data sent by the user through the client is received, and the model parameters of the image generation model are adjusted according to the adjustment sample data, so that the image generation model gains the ability to interact with the user, enabling the user to adjust the image generation results interactively and improving user satisfaction.
In an optional embodiment of the present disclosure, before inputting the image description text and the generation prompt information into the parameter generation model to obtain the image generation parameters corresponding to the image description text, the method may further include the following steps:
acquiring a sample set, where the sample set includes a plurality of sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and each sample image-text pair carries sample parameter information;
inputting the plurality of sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters corresponding to each of the sample image-text pairs; and
adjusting the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
Specifically, the parameter generation model is trained by supervised training based on prompt learning; that is, each sample image-text pair in the sample set carries real sample parameter information, and the sample parameter information serves as the generation target of the parameter generation model and is used to guide its training process.
The prediction prompt information is used to prompt the prediction process of the pre-trained language model, and is set according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. For example, the prediction prompt information may be: "You are an intelligent bounding-box generation model. I will provide you with an image and its description text; each bounding box should be in the format (object name, center-point x-coordinate, center-point y-coordinate, bounding-box width, bounding-box height)." The prediction prompt information enables the pre-trained language model to output prediction results in a fixed data format.
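Because the prompt above pins the model output to a fixed tuple format, the output can be parsed mechanically. The concrete serialization and the regular expression below are illustrative assumptions about one way that fixed format might appear as text:

```python
import re

# Matches tuples of the form (object name, cx, cy, w, h) with numeric fields.
BOX_RE = re.compile(
    r"\(([^,()]+),\s*([0-9.]+),\s*([0-9.]+),\s*([0-9.]+),\s*([0-9.]+)\)")

def parse_boxes(model_output: str):
    """Extract (object name, cx, cy, w, h) tuples from the model output."""
    return [(name.strip(), float(cx), float(cy), float(w), float(h))
            for name, cx, cy, w, h in BOX_RE.findall(model_output)]

boxes = parse_boxes(
    "(black cat, 0.30, 0.50, 0.20, 0.40) (black cat, 0.70, 0.50, 0.20, 0.40)")
```

A fixed output format of this kind is what lets the downstream image generation model consume the predicted parameters without free-text interpretation.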
It should be noted that the implementation of "inputting the plurality of sample image-text pairs and the prediction prompt information into the pre-trained language model to obtain the image prediction parameters corresponding to each of the sample image-text pairs" may refer to the implementation of "inputting the image description text and the generation prompt information into the parameter generation model to obtain the image generation parameters corresponding to the image description text" described above, and is not repeated in the present disclosure. When adjusting the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information, LoRA (Low-Rank Adaptation) may be adopted, for example with rank=8, so that the model is adjusted without forgetting the semantic understanding capability of the base model.
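The LoRA adjustment mentioned above freezes the pre-trained weight matrix W and learns only a low-rank update ΔW = A·B with rank r=8. A minimal numerical sketch of that idea follows; the dimensions and initialization scheme are chosen arbitrarily for illustration and are not taken from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8              # rank=8 as in the embodiment

W = rng.normal(size=(d_in, d_out))      # frozen pre-trained weight
A = rng.normal(size=(d_in, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d_out))                # zero init, so ΔW starts at 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W + x (A B): the frozen base projection plus a rank-8
    trainable correction. Only A and B are updated during training."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d_in))
y = lora_forward(x)
```

Because B is initialized to zero, the adapted model starts out exactly equal to the base model, which is one reason LoRA preserves the base model's existing capabilities.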
In practical applications, there are many ways to acquire the sample set, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. In one possible implementation of the present disclosure, a large number of user-input sample image-text pairs carrying sample parameter information can be received to form the sample set. In another possible implementation of the present disclosure, a large number of sample image-text pairs carrying sample parameter information can be read from another data acquisition device or from a database to form the sample set.
It is worth noting that the sample set may include at least one of an original sample subset, a generative sample subset, and a constructed sample subset. Specifically, a plurality of sample images carrying sample description texts can be obtained to form the original sample subset. The image understanding capability of a large model can be used to produce sample description texts for the sample images to form the generative sample subset, and when forming the generative sample subset, a detection-and-segmentation model can additionally be used in coordination to extract various kinds of sample parameter information. Parameter configuration conditions, such as output rules, object rules, and element-count rules, can be set based on prior knowledge to construct pseudo data forming the constructed sample subset, so that the constructed sample subset gives the parameter generation model a targeted learning capability.
Referring to FIG. 5, FIG. 5 is a flowchart of a parameter generation model training method provided by an embodiment of the present disclosure. The original sample subset, the generative sample subset, and the constructed sample subset are mixed to obtain the sample set; the plurality of sample image-text pairs in the sample set and the prediction prompt information are input into the pre-trained language model to obtain the image prediction parameters corresponding to each sample image-text pair; and the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and the sample parameter information to obtain the trained parameter generation model. By constructing a sample set that includes the three sample subsets, the image generation parameters output by the parameter generation model can be made more reasonable.
By applying the solution of this embodiment of the present disclosure, the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and the sample parameter information to obtain the trained parameter generation model; by continuously adjusting the model parameters of the pre-trained language model, the resulting parameter generation model can be made more accurate.
In an optional embodiment of the present disclosure, taking the generative sample subset included in the sample set as an example, acquiring the sample set may include the following steps:
acquiring a plurality of sample images;
for a first sample image, inputting the first sample image and construction prompt information into the pre-trained language model to obtain a first sample description text of the first sample image;
generating first sample parameter information of the first sample image according to the first sample image and the first sample description text; and
constructing the sample set according to the plurality of sample images, the sample description texts of the plurality of sample images, and the sample parameter information.
Specifically, the construction prompt information is used to prompt the pre-trained language model to generate the sample description text of a sample image. The construction prompt information is set according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. For example, the construction prompt information may be "Please describe the image".
It should be noted that there are multiple ways to obtain the plurality of sample images, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. In one possible implementation of the present disclosure, a plurality of sample images input by a user can be received. In another possible implementation of the present disclosure, a plurality of sample images can be read from another data acquisition device or from a database.
By applying the solution of this embodiment of the present disclosure, a plurality of sample images are acquired; for the first sample image, the first sample image and the construction prompt information are input into the pre-trained language model to obtain the first sample description text of the first sample image; the first sample parameter information of the first sample image is generated according to the first sample image and the first sample description text; and the sample set is constructed according to the plurality of sample images, their sample description texts, and their sample parameter information. Generating the sample description texts through the image understanding capability of the pre-trained language model ensures the accuracy of the sample description texts.
In practical applications, there are multiple ways to generate the first sample parameter information of the first sample image according to the first sample image and the first sample description text, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. In one possible implementation of the present disclosure, information such as relevant keywords and sentence structure can be extracted from the first sample description text through methods such as text mining and image analysis, and this information can be used as the first sample parameter information.
In another possible implementation of the present disclosure, a detection-and-segmentation model can be used in coordination to extract various kinds of sample parameter information; that is, generating the first sample parameter information of the first sample image according to the first sample image and the first sample description text may include the following steps:
performing region detection on the first sample image according to the first sample description text to determine key-region position information of the first sample image; and
performing visual segmentation on the first sample image according to the key-region position information to obtain the first sample parameter information of the first sample image.
Specifically, the key-region position information refers to the center-point coordinates of the key regions in the first sample image, where a key region can be understood as a bounding box. For example, if the first sample image shows a little girl and a dog running, the key regions in the first sample image are the region where the little girl is located and the region where the dog is located. The first sample parameter information includes, but is not limited to, the height and width of each key region.
It should be noted that, when performing region detection on the first sample image according to the first sample description text, the first sample description text and the first sample image can be input into the detection-and-segmentation model, and the key-region position information of the first sample image is obtained through the region detection of that model. When performing visual segmentation on the first sample image according to the key-region position information, the key-region position information and the first sample image can be input into a visual segmentation model (SAM, Segment Anything Model) to obtain the first sample parameter information of the first sample image.
By applying the solution of this embodiment of the present disclosure, region detection is performed on the first sample image according to the first sample description text to determine the key-region position information of the first sample image, and visual segmentation is performed on the first sample image according to the key-region position information to obtain the first sample parameter information of the first sample image. Combining region detection with visual segmentation improves the accuracy of the first sample parameter information.
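The key-region position information above describes each bounding box by its center point plus width and height. A small helper illustrating the conversion to corner coordinates, which segmentation models typically accept as box prompts; the (left, top, right, bottom) corner convention is an assumption for illustration:

```python
def center_box_to_corners(cx: float, cy: float, w: float, h: float):
    """Convert a key region given as (center x, center y, width, height)
    into (left, top, right, bottom) corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A key region centered at (10, 20) measuring 4 wide by 8 tall.
corners = center_box_to_corners(10.0, 20.0, 4.0, 8.0)
```

Passing boxes in whichever convention the downstream segmentation model expects avoids silently mislocating the key regions during parameter extraction.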
In an optional embodiment of the present disclosure, taking the constructed sample subset included in the sample set as an example, the construction of the constructed sample subset is described. After generating the first sample parameter information of the first sample image according to the first sample image and the first sample description text, the method may further include the following steps:
acquiring a parameter configuration condition; and
when the first sample parameter information does not satisfy the parameter configuration condition, adjusting the first sample parameter information to obtain adjusted first sample parameter information;
and constructing the sample set according to the plurality of sample images, the sample description texts of the plurality of sample images, and the sample parameter information includes:
constructing the sample set according to the plurality of sample images, the sample description texts of the plurality of sample images, and the adjusted sample parameter information.
Specifically, the parameter configuration condition includes, but is not limited to, the brightness being greater than a preset brightness threshold, the sharpness being greater than a preset sharpness threshold, and so on, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard.
In practical applications, there are many ways to acquire the parameter configuration condition, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard. In one possible implementation of the present disclosure, a parameter configuration condition input by a user can be received. In another possible implementation of the present disclosure, the parameter configuration condition can be read from another data acquisition device or from a database.
It should be noted that, after the parameter configuration condition is acquired, it can further be determined whether the first sample parameter information satisfies the parameter configuration condition. If the first sample parameter information satisfies the parameter configuration condition, the first sample parameter information is not adjusted. If the first sample parameter information does not satisfy the parameter configuration condition, the first sample parameter information is adjusted to obtain the adjusted first sample parameter information.
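The check-and-adjust step above can be sketched as a clamp against per-parameter minimum thresholds, mirroring conditions such as "brightness greater than a preset brightness threshold". The dictionary-based representation of the sample parameter information and the clamping policy are assumptions made for illustration:

```python
def adjust_to_conditions(sample_params: dict, min_conditions: dict) -> dict:
    """Return a copy of sample_params where every value below its
    configured minimum (e.g. brightness, sharpness) is raised to that
    minimum; parameters already satisfying their condition are kept."""
    adjusted = dict(sample_params)
    for name, minimum in min_conditions.items():
        if adjusted.get(name, 0.0) < minimum:
            adjusted[name] = minimum
    return adjusted

# Brightness violates its condition and is adjusted; sharpness already passes.
params = adjust_to_conditions({"brightness": 0.2, "sharpness": 0.9},
                              {"brightness": 0.5, "sharpness": 0.5})
```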
By applying the solution of this embodiment of the present disclosure, the parameter configuration condition is acquired; when the first sample parameter information does not satisfy the parameter configuration condition, the first sample parameter information is adjusted to obtain the adjusted first sample parameter information; and the sample set is constructed according to the plurality of sample images, their sample description texts, and the adjusted sample parameter information. Monitoring the reliability of the sample parameter information through the parameter configuration condition improves the accuracy of the sample parameter information.
In an optional embodiment of the present disclosure, before inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text, the method may further include the following steps:
acquiring a sample set, where the sample set includes a plurality of sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and each sample image-text pair carries sample parameter information;
inputting a plurality of sample description texts and the sample parameter information corresponding to the sample description texts into an initial generation model to obtain predicted images corresponding to each of the sample description texts; and
adjusting the model parameters of the initial generation model according to the predicted images and the sample images to obtain a trained image generation model.
Specifically, the image generation model is trained by supervised training; that is, each sample description text in the sample set is paired with a real sample image, and the sample image serves as the generation target of the image generation model and is used to guide its training process.
It should be noted that the manner of "acquiring a sample set" may refer to the manner of acquiring the sample set in the training process of the parameter generation model described above. The implementation of "inputting a plurality of sample description texts and the corresponding sample parameter information into the initial generation model to obtain the predicted images corresponding to each of the sample description texts" may refer to the implementation of "inputting the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text" described above, and is not repeated in the embodiments of the present disclosure.
In practical applications, when adjusting the model parameters of the initial generation model according to the predicted images and the sample images, a loss value can be calculated from the predicted images and the sample images, and the model parameters of the initial generation model are adjusted according to the loss value until a preset stop condition is reached, thereby obtaining the trained image generation model. Many functions can be used to calculate the loss value, such as the cross-entropy loss function, the L1-norm loss function, the maximum loss function, the mean squared error loss function, and the logarithmic loss function, selected according to the actual situation; the embodiments of the present disclosure impose no limitation in this regard.
In one possible implementation of the present disclosure, the preset stopping condition includes the loss value being less than or equal to a preset threshold. After the loss value is calculated from the predicted image and the sample image, it is compared with the preset threshold.
Specifically, if the loss value is greater than the preset threshold, the difference between the predicted image and the sample image is large and the initial generation model predicts poorly; in this case, the model parameters of the initial generation model can be adjusted and training can continue until the loss value is less than or equal to the preset threshold, indicating that the difference between the predicted image and the sample image is small, the preset stopping condition is met, and a trained image generation model is obtained.
In another possible implementation of the present disclosure, in addition to comparing the loss value against the preset threshold, the number of iterations can also be taken into account when determining whether training of the current initial generation model is complete.
Specifically, if the loss value is greater than the preset threshold, the model parameters of the initial generation model are adjusted and training of the initial generation model continues until the preset number of iterations is reached, at which point iteration stops and a trained image generation model is obtained. The preset threshold and the preset number of iterations are chosen according to the actual situation, and the embodiments of the present disclosure impose no limitation on this.
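The two stopping conditions described above (loss at or below the preset threshold, or the preset number of iterations reached) can be sketched with a toy scalar model; `train_step` and `train` are hypothetical placeholders, not the disclosed training procedure:

```python
def train_step(param: float, target: float, lr: float = 0.5) -> float:
    """One gradient step for a toy scalar model whose loss is (param - target)^2."""
    grad = 2.0 * (param - target)           # derivative of the squared-error loss
    return param - lr * grad

def train(param: float, target: float, threshold: float, max_iters: int):
    """Adjust `param` until the loss <= threshold or max_iters is reached."""
    for step in range(max_iters):
        loss = (param - target) ** 2
        if loss <= threshold:               # preset stop condition: small enough loss
            return param, step
        param = train_step(param, target)   # adjust model parameters from the loss
    return param, max_iters                 # preset iteration count reached
```

Whichever condition fires first ends training, mirroring the two implementations above.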
By applying the solution of this embodiment of the present disclosure, the model parameters of the initial generation model are adjusted according to the predicted image and the sample image to obtain a trained image generation model; continuously adjusting these model parameters makes the resulting image generation model more accurate.
Referring to FIG. 6, FIG. 6 shows a flowchart of another parameter generation model training method provided by an embodiment of the present disclosure. The parameter generation model training is applied to a cloud-side device and specifically includes the following steps:
Step 602: Acquire a sample set, wherein the sample set includes a plurality of sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and each sample image-text pair carries sample parameter information.
Step 604: Input the plurality of sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters respectively corresponding to the plurality of sample image-text pairs.
Step 606: Adjust the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
It should be noted that the implementation of steps 602 to 606 is detailed in the training of the parameter generation model in the image generation method above, and the embodiments of the present disclosure impose no limitation on this.
In practical applications, after the trained parameter generation model is obtained, its model parameters can be sent to an end-side device, so that a user can build the parameter generation model locally from those parameters and generate image generation parameters.
By applying the solution of this embodiment of the present disclosure, the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model; continuously adjusting the model parameters of the pre-trained language model makes the resulting parameter generation model more accurate.
Referring to FIG. 7, FIG. 7 shows a flowchart of an automatic question-answering method provided by an embodiment of the present disclosure, which specifically includes the following steps:
Step 702: Receive an image question-answering request, wherein the request carries an image description text.
Step 704: Input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters describe the visual features of an image, and the parameter generation model is trained on a plurality of sample image-text pairs and the sample parameter information they carry.
Step 706: Input the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question-answering request.
It should be noted that the implementation of steps 702 to 706 is detailed in steps 302 to 306 above, and the embodiments of the present disclosure impose no limitation on this.
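A minimal sketch of the two-stage flow of steps 702 to 706, with `param_model` and `image_model` as hypothetical stand-ins for the actual parameter generation and image generation models:

```python
def answer_image_request(description, prompts, param_model, image_model):
    """Two-stage answering: derive image generation parameters, then render."""
    # Stage 1: the parameter generation model semantically decomposes the
    # description into explicit visual parameters (elements, layout, colors, ...).
    generation_params = param_model(description, prompts)
    # Stage 2: the image generation model renders the reply image under
    # the control of both the parameters and the original description text.
    return image_model(generation_params, description)
```

Because the intermediate `generation_params` are explicit, they can be inspected by the user, which is the source of the interpretability claimed below.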
By applying the solution of this embodiment of the present disclosure, the visual elements of an image are semantically decomposed by the parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on those parameters, so that the target image clearly expresses the image description text and the image generation parameters, improving the interpretability and controllability of the automatic question-answering process.
Referring to FIG. 8, FIG. 8 shows a process flowchart of another image generation method provided by an embodiment of the present disclosure. During image generation, the strong semantic understanding and association capabilities of the parameter generation model are used to semantically decompose and recombine the visual elements of the image; the result is parameterized and fed into the image generation model to complete accurate image generation. The process can be divided into two stages, a parameter generation model processing stage and an image generation model processing stage, wherein the parameter generation model processing stage includes fine-tuning the pre-trained language model, constructing the conditional generation sample set, and designing the generation prompt information.
As shown in FIG. 8, a parameter generation model is obtained by training a pre-trained language model on a plurality of sample image-text pairs (each composed of a sample description text and a sample image) and the sample parameter information they carry; an image description text is acquired; the image description text and generation prompt information are input into the parameter generation model to obtain image generation parameters corresponding to the image description text; and the image generation parameters and the image description text are input into the image generation model to obtain a target image corresponding to the image description text.
By applying the solution of this embodiment of the present disclosure, a two-stage diffusion generation architecture implements auxiliary generation based on a pre-trained language model. In the parameter generation model processing stage, the emergent abilities of the pre-trained language model, together with appropriate fine-tuning, produce the image generation parameters required by the image generation model processing stage; in the image generation model processing stage, image generation is then controlled by those parameters to output an image with precise semantics. Moreover, because the generated image clearly expresses the generation prompt information and image generation parameters used in the process, the user can understand and verify the accuracy of the generated result.
Referring to FIG. 9, FIG. 9 shows a schematic diagram of the processing performed by a parameter generation model provided by an embodiment of the present disclosure. As shown in FIG. 9, the image description text "two black cats and a white dog on an orange sofa" is acquired. The image description text and generation prompt information 1 (my title is "two black cats and a white dog on an orange sofa"; "What are the elements in the image?") are input into the parameter generation model to obtain image generation parameter 1 corresponding to the image description text: "[(black cat, 2), (white dog, 1), (sofa, 1)]". The image description text and generation prompt information 2, "What are their bounding boxes?", are input into the parameter generation model to obtain image generation parameter 2: "[(black cat, [30, 171, 212, 286]), (black cat, [40, 200, 123, 412]), (white dog, [24, 543, 231, 332]), (sofa, [264, 173, 222, 221])]". The image description text and generation prompt information 3, "What is the main color of each element?", are input into the parameter generation model to obtain image generation parameter 3: "[(black cat, [black, gray, dark gray]), ..., (white dog, [white, light gray, brown]), (sofa, [orange, brown, gray])]". Finally, image generation parameter 1, image generation parameter 2, and image generation parameter 3 are aggregated to obtain the final image generation parameters.
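The aggregation of the three parameter sets from FIG. 9 might look as follows; `aggregate_params` is a hypothetical helper, and merging by element name into one record per instance is an assumption for illustration, not a policy mandated by this disclosure:

```python
def aggregate_params(counts, boxes, colors):
    """Merge per-prompt outputs into one parameter record per element instance."""
    count_by_name = dict(counts)                 # from prompt 1: element counts
    colors_by_name = dict(colors)                # from prompt 3: main colors
    merged = []
    for name, box in boxes:                      # prompt 2: one box per instance
        merged.append({
            "element": name,
            "count": count_by_name[name],
            "bbox": box,
            "colors": colors_by_name[name],
        })
    return merged
```

The merged list is what would be handed to the image generation model as the final image generation parameters.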
Referring to FIG. 10, FIG. 10 shows a schematic diagram of an image generation interface provided by an embodiment of the present disclosure. The image generation interface is divided into a request input interface and a result display interface. The request input interface includes a request input box, an "OK" control, and a "Cancel" control. The result display interface includes a result display box.
The user inputs an image generation request through the request input box displayed on the client, the request carrying an image description text, and clicks the "OK" control. The server receives the image description text sent by the client and inputs the image description text and generation prompt information into the parameter generation model to obtain the image generation parameters corresponding to the image description text, wherein the image generation parameters describe the visual features of an image, and the parameter generation model is trained on a plurality of sample image-text pairs and the sample parameter information they carry. The server then inputs the image generation parameters and the image description text into the image generation model to obtain the target image corresponding to the image description text and sends the target image to the client. The client displays the target image in the result display box.
In practical applications, the user may operate the controls by clicking, double-clicking, touching, mouse hovering, sliding, long pressing, voice control, shaking, or any other means, chosen according to the actual situation; the embodiments of the present disclosure impose no limitation on this.
Corresponding to the above image generation method embodiments, the present disclosure further provides image generation apparatus embodiments. FIG. 11 shows a schematic structural diagram of an image generation apparatus provided by an embodiment of the present disclosure. As shown in FIG. 11, the apparatus includes:
a first acquisition module 1102, configured to acquire an image description text;
a first input module 1104, configured to input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters describe the visual features of an image, and the parameter generation model is trained on a plurality of sample image-text pairs and the sample parameter information they carry;
a second input module 1106, configured to input the image generation parameters and the image description text into an image generation model to obtain a target image corresponding to the image description text.
Optionally, the image generation model includes a parameter encoding unit and an encoding-decoding unit; the second input module 1106 is further configured to input the image generation parameters and the image description text into the image generation model, encode the image generation parameters via the parameter encoding unit to obtain parameter encoding features, and generate, via the encoding-decoding unit, the target image corresponding to the image description text according to the parameter encoding features and the image description text.
Optionally, the parameter encoding unit includes a one-dimensional parameter encoding unit, a two-dimensional parameter encoding unit, and a feature aggregation unit; the second input module 1106 is further configured to encode the one-dimensional parameters among the image generation parameters via the one-dimensional parameter encoding unit to obtain one-dimensional encoding features, encode the two-dimensional parameters via the two-dimensional parameter encoding unit to obtain two-dimensional encoding features, and aggregate the one-dimensional and two-dimensional encoding features via the feature aggregation unit to obtain the parameter encoding features.
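For illustration, the one-dimensional and two-dimensional parameter encoding units and the feature aggregation unit might be sketched with placeholder linear projections; the weights and the concatenate-then-pool choice below are assumptions for the sketch, not the disclosed design:

```python
import numpy as np

def encode_1d(values: np.ndarray, dim: int) -> np.ndarray:
    """Toy 1-D parameter encoder: project scalar parameters to `dim` features."""
    w = np.ones((1, dim)) / dim                  # placeholder projection weights
    return values.reshape(-1, 1) @ w             # shape (n, dim)

def encode_2d(boxes: np.ndarray, dim: int) -> np.ndarray:
    """Toy 2-D parameter encoder: project (x, y, w, h) boxes to `dim` features."""
    w = np.ones((4, dim)) / (4 * dim)            # placeholder projection weights
    return boxes @ w                             # shape (n, dim)

def aggregate(f1: np.ndarray, f2: np.ndarray) -> np.ndarray:
    """Toy feature aggregation unit: concatenate entries, then average-pool."""
    return np.concatenate([f1, f2], axis=0).mean(axis=0)
```

In a real model the projections would be learned, but the data flow (encode each parameter family separately, then fuse) matches the unit structure described above.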
Optionally, the apparatus further includes a third acquisition module, configured to acquire image update parameters for the target image; determine an image editing region in the target image according to the image update parameters; mask the image editing region to obtain a mask generation sequence; and input the mask generation sequence and the target image into the image generation model to obtain an updated target image.
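A minimal sketch of masking the image editing region, assuming for illustration that the region is given as pixel coordinates (x0, y0, x1, y1) and that masked pixels are zeroed; the disclosure does not mandate this particular mask representation:

```python
import numpy as np

def mask_edit_region(image: np.ndarray, region: tuple) -> np.ndarray:
    """Zero out (mask) the editing region given as (x0, y0, x1, y1)."""
    masked = image.copy()                        # leave the original target image intact
    x0, y0, x1, y1 = region
    masked[y0:y1, x0:x1] = 0                     # masked pixels to be regenerated
    return masked
```

The masked copy plays the role of the mask generation sequence: only the zeroed region is regenerated, while the rest of the target image is preserved.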
Optionally, the apparatus further includes a first sending module, configured to send the target image and the updated target image to the client, so that the client displays both to the user.
Optionally, the apparatus further includes a second receiving module, configured to receive image selection information sent by the user through the client, and to adjust the model parameters of the image generation model based on the image selection information.
Optionally, the apparatus further includes a third receiving module, configured to receive adjustment sample data sent by the user through the client, and to adjust the model parameters of the image generation model according to the adjustment sample data, wherein the adjustment sample data is constructed based on the target image.
Optionally, the apparatus further includes a parameter generation model training module, configured to acquire a sample set, wherein the sample set includes a plurality of sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and each pair carries sample parameter information; input the plurality of sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters respectively corresponding to the pairs; and adjust the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
Optionally, the parameter generation model training module is further configured to acquire a plurality of sample images;
for a first sample image, input the first sample image and construction prompt information into the pre-trained language model to obtain a first sample description text of the first sample image; generate first sample parameter information of the first sample image according to the first sample image and the first sample description text; and construct the sample set from the plurality of sample images, their sample description texts, and the sample parameter information.
Optionally, the parameter generation model training module is further configured to perform region detection on the first sample image according to the first sample description text to determine key-region position information of the first sample image, and to visually segment the first sample image according to the key-region position information to obtain the first sample parameter information of the first sample image.
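The region-detection-then-segmentation pipeline for deriving sample parameter information can be sketched as below, with `detect` and `segment` as hypothetical stand-ins for the actual detection and visual segmentation components:

```python
def build_sample_params(image, description, detect, segment):
    """Derive sample parameter information: detect key regions from the
    description text, then visually segment the image within those regions."""
    regions = detect(image, description)         # key-region position information
    return [segment(image, r) for r in regions]  # per-region parameter information
```

The per-region results form the sample parameter information carried by each sample image-text pair in the constructed sample set.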
Optionally, the apparatus further includes a fourth acquisition module, configured to acquire a parameter configuration condition and, when the first sample parameter information does not satisfy the parameter configuration condition, adjust the first sample parameter information to obtain adjusted first sample parameter information; the parameter generation model training module is further configured to construct the sample set from the plurality of sample images, their sample description texts, and the adjusted sample parameter information.
Optionally, the apparatus further includes an image generation model training module, configured to acquire a sample set, wherein the sample set includes a plurality of sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and each pair carries sample parameter information; input a plurality of sample description texts and their corresponding sample parameter information into an initial generation model to obtain predicted images respectively corresponding to the sample description texts; and adjust the model parameters of the initial generation model according to the predicted images and the sample images to obtain a trained image generation model.
By applying the solution of this embodiment of the present disclosure, the visual elements of an image are semantically decomposed by the parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on those parameters, so that the target image clearly expresses the image description text and the image generation parameters, improving the interpretability and controllability of image generation.
The above is a schematic description of the image generation apparatus of this embodiment. It should be noted that the technical solution of the image generation apparatus belongs to the same concept as that of the image generation method described above; for details not described here, refer to the description of the image generation method.
Corresponding to the above parameter generation model training method embodiments, the present disclosure further provides parameter generation model training apparatus embodiments. FIG. 12 shows a schematic structural diagram of a parameter generation model training apparatus provided by an embodiment of the present disclosure. As shown in FIG. 12, the apparatus is applied to a cloud-side device and includes:
a second acquisition module 1202, configured to acquire a sample set, wherein the sample set includes a plurality of sample image-text pairs, each sample image-text pair includes a sample image and a sample description text, and each pair carries sample parameter information;
a third input module 1204, configured to input the plurality of sample image-text pairs and prediction prompt information into a pre-trained language model to obtain image prediction parameters respectively corresponding to the pairs;
an adjustment module 1206, configured to adjust the model parameters of the pre-trained language model according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model.
By applying the solution of this embodiment of the present disclosure, the model parameters of the pre-trained language model are adjusted according to the image prediction parameters and the sample parameter information to obtain a trained parameter generation model; continuously adjusting the model parameters of the pre-trained language model makes the resulting parameter generation model more accurate.
The above is a schematic description of the parameter generation model training apparatus of this embodiment. It should be noted that its technical solution belongs to the same concept as that of the parameter generation model training method described above; for details not described here, refer to the description of the parameter generation model training method.
Corresponding to the above automatic question-answering method embodiments, the present disclosure further provides automatic question-answering apparatus embodiments. FIG. 13 shows a schematic structural diagram of an automatic question-answering apparatus provided by an embodiment of the present disclosure. As shown in FIG. 13, the apparatus includes:
a first receiving module 1302, configured to receive an image question-answering request, wherein the request carries an image description text;
a fourth input module 1304, configured to input the image description text and generation prompt information into a parameter generation model to obtain image generation parameters corresponding to the image description text, wherein the image generation parameters describe the visual features of an image, and the parameter generation model is trained on a plurality of sample image-text pairs and the sample parameter information they carry;
a fifth input module 1306, configured to input the image generation parameters and the image description text into an image generation model to obtain a reply image corresponding to the image question-answering request.
By applying the solution of this embodiment of the present disclosure, the visual elements of an image are semantically decomposed by the parameter generation model to obtain image generation parameters, and accurate image generation is then completed based on those parameters, so that the target image clearly expresses the image description text and the image generation parameters, improving the interpretability and controllability of the automatic question-answering process.
The above is a schematic description of the automatic question-answering apparatus of this embodiment. It should be noted that its technical solution belongs to the same concept as that of the automatic question-answering method described above; for details not described here, refer to the description of the automatic question-answering method.
图14示出了本公开一个实施例提供的一种计算设备的结构框图。该计算设备1400的部件包括但不限于存储器1410和处理器1420。处理器1420与存储器1410通过总线1430相连接,数据库1450用于保存数据。14 shows a block diagram of a computing device provided by an embodiment of the present disclosure. The components of the computing device 1400 include but are not limited to a memory 1410 and a processor 1420. The processor 1420 is connected to the memory 1410 via a bus 1430, and a database 1450 is used to store data.
计算设备1400还包括接入设备1440,接入设备1440使得计算设备1400能够经由一个或多个网络1460通信。这些网络的示例包括公用交换电话网(PSTN,Public Switched Telephone Network)、局域网(LAN,Local Area Network)、广域网(WAN,Wide Area Network)、个域网(PAN,Personal Area Network)或诸如因特网的通信网络的组合。接入设备1440可以包括有线或无线的任何类型的网络接口(例如,网络接口卡(NIC,Network Interface Card))中的一个或多个,诸如IEEE802.11无线局域网(WLAN,Wireless Local Area Networks)无线接口、全球微波互联接入(Wi-MAX,World Interoperability for Microwave Access)接口、以太网接口、通用串行总线(USB,Universal Serial Bus)接口、蜂窝网络接口、蓝牙接口、近场通信(NFC,Near Field Communication)接口,等等。The computing device 1400 also includes an access device 1440 that enables the computing device 1400 to communicate via one or more networks 1460. Examples of these networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 1440 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE802.11 wireless local area network (WLAN) wireless interface, a World Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and the like.
在本公开的一个实施例中,计算设备1400的上述部件以及图14中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图14所示的计算设备结构框图仅仅是出于示例的目的,而不是对本公开范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。In one embodiment of the present disclosure, the above components of the computing device 1400 and other components not shown in FIG. 14 may also be connected to each other, for example, through a bus. It should be understood that the computing device structure block diagram shown in FIG. 14 is only for illustrative purposes and is not intended to limit the scope of the present disclosure. Those skilled in the art may add or replace other components as needed.
计算设备1400可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或个人计算机(PC,Personal Computer)的静止计算设备。计算设备1400还可以是移动式或静止式的服务器。Computing device 1400 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or a personal computer (PC). Computing device 1400 may also be a mobile or stationary server.
The processor 1420 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the above-described image generation method, parameter generation model training method, or automatic question answering method.
The above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device belongs to the same concept as the technical solutions of the above-described image generation method, parameter generation model training method, and automatic question answering method. For details not described in the technical solution of the computing device, reference may be made to the descriptions of the technical solutions of the above-described image generation method, parameter generation model training method, or automatic question answering method.
An embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described image generation method, parameter generation model training method, or automatic question answering method.
The above is a schematic solution of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solutions of the above-described image generation method, parameter generation model training method, and automatic question answering method. For details not described in the technical solution of the storage medium, reference may be made to the descriptions of the technical solutions of the above-described image generation method, parameter generation model training method, or automatic question answering method.
An embodiment of the present disclosure further provides a computer program that, when executed in a computer, causes the computer to perform the steps of the above-described image generation method, parameter generation model training method, or automatic question answering method.
The above is a schematic solution of a computer program according to this embodiment. It should be noted that the technical solution of the computer program belongs to the same concept as the technical solutions of the above-described image generation method, parameter generation model training method, and automatic question answering method. For details not described in the technical solution of the computer program, reference may be made to the descriptions of the technical solutions of the above-described image generation method, parameter generation model training method, or automatic question answering method.
Specific embodiments of the present disclosure have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content encompassed by the computer-readable medium may be appropriately added to or removed from in accordance with the requirements of patent practice; for example, in some jurisdictions, in accordance with patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that, for ease of description, each of the foregoing method embodiments is expressed as a series of action combinations. However, those skilled in the art should understand that the embodiments of the present disclosure are not limited by the described order of actions, because according to the embodiments of the present disclosure, certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the embodiments of the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a given embodiment, reference may be made to the relevant descriptions of other embodiments.
The preferred embodiments of the present disclosure disclosed above are provided only to help explain the present disclosure. The optional embodiments do not set forth every detail, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made in light of the content of the embodiments of the present disclosure. These embodiments were selected and specifically described in order to better explain the principles and practical applications of the embodiments of the present disclosure, so that those skilled in the art can well understand and make use of the present disclosure. The present disclosure is limited only by the claims and their full scope and equivalents.
Claims (17)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311622640.1A (CN117830447A) | 2023-11-28 | 2023-11-28 | Image generation, automatic question answering, and parameter generation model training methods |
| CN202311622640.1 | 2023-11-28 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025112948A1 (en) | 2025-06-05 |
Family
ID=90521793
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/125023 (WO2025112948A1, pending) | Image generation method, automatic question answering method, and parameter generation model training method | 2023-11-28 | 2024-10-15 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN117830447A (en) |
| WO (1) | WO2025112948A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118570333B (en) * | 2024-06-07 | 2025-02-28 | 广州营客信息科技有限公司 | A method for generating graphic information |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116186545A (en) * | 2023-03-28 | 2023-05-30 | 抖音视界有限公司 | Training, application method, device, electronic device and medium of pre-trained model |
| CN116363261A (en) * | 2023-03-31 | 2023-06-30 | 北京百度网讯科技有限公司 | Training method of image editing model, image editing method and device |
| CN116935169A (en) * | 2023-09-13 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Training method for draft graph model and draft graph method |
| US20230377226A1 (en) * | 2022-05-19 | 2023-11-23 | Google Llc | Generating images using sequences of generative neural networks |
- 2023-11-28: CN application CN202311622640.1A filed (published as CN117830447A, status pending)
- 2024-10-15: WO application PCT/CN2024/125023 filed (published as WO2025112948A1, status pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN117830447A (en) | 2024-04-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN116797868A (en) | Text image generation method and diffusion generation model training method | |
| WO2025007892A1 (en) | Task processing method and task processing system | |
| CN117876941A (en) | Target multi-mode model system, construction method, video processing model training method and video processing method | |
| CN118138856A (en) | Video generation method, device, electronic equipment and readable storage medium | |
| KR20250078545A (en) | Data processing method of task processing model and method for creating virtual character animation | |
| WO2025148629A1 (en) | Document retrieval method and automatic question answering method | |
| WO2025139247A1 (en) | Task processing method, traffic task processing method, and task processing model training method | |
| CN118227770B (en) | Task processing method, legal question answering method and task processing model training method | |
| CN116913278A (en) | Voice processing method, device, equipment and storage medium | |
| CN118377906A (en) | Text processing method, text generation model training method and model training method | |
| CN117893652A (en) | Video generation method and parameter generation model training method | |
| CN116958738A (en) | Training method and device of picture recognition model, storage medium and electronic equipment | |
| WO2025112948A1 (en) | Image generation method, automatic question answering method, and parameter generation model training method | |
| CN117271745A (en) | Information processing method and device, computing equipment and storage medium | |
| CN118864625A (en) | Image generation method and device | |
| CN117150338A (en) | Task processing, automatic question and answer and multimedia data identification model training method | |
| CN119181102B (en) | Short text generation image model training method, system, short text to image generation method, electronic device and storage medium | |
| CN119031208A (en) | Video generation, motion video generation of virtual objects, video editing, video generation model training and information processing methods based on video generation models | |
| CN120523896A (en) | Task processing, automatic question answering, and task processing model training methods | |
| CN117972047A (en) | Document retrieval method and automatic question-answering method | |
| CN120687541A (en) | Task processing methods, text processing methods, automatic question-answering methods, task processing model training methods, information processing methods based on task processing models, and cloud training platforms | |
| CN118132988A (en) | Machine learning model training method, text-based image searching method, automatic question-answering method, computing device, computer-readable storage medium, and computer program product | |
| CN118212460A (en) | Image classification method, automatic question-answering method, image class feature fusion model training method and information processing method based on deep learning model | |
| CN116136869A (en) | Dialogue Content Generation, Virtual Dialogue, and Data Processing Method for Dialogue Content | |
| CN117633540B (en) | Sample data construction method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24896042; Country of ref document: EP; Kind code of ref document: A1 |