JP7507791B2

JP7507791B2 - SYSTEM AND METHOD FOR SYNTHESISING APPAREL ENSEMBLES FOR MODELS - Patent application

Info

Publication number: JP7507791B2
Application number: JP2021569899A
Authority: JP
Inventors: Marcus C. Colbert
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-05-24
Filing date: 2020-05-01
Publication date: 2024-06-28
Anticipated expiration: 2040-05-01
Also published as: CN112912898A; WO2020242718A1; JP2022534082A

Description

Continuation and Priority Claims
This international patent application claims priority to U.S. Patent Application No. 16/422,278, filed May 24, 2019. That parent application is a continuation-in-part of U.S. Patent Application No. 15/634,171, filed June 27, 2017 (now U.S. Patent No. 10,304,227, issued May 28, 2019). The prior disclosures are incorporated herein by reference.

Technical Field
The present invention relates to automatic image generation. More particularly, it relates to methods for creating a synthetic image based on a training set of image pairs. Once trained, an embodiment can accept a new image that is similar to one member of an image pair and create a synthetic image representing what the other member of the pair might look like.

Neural networks are computer systems inspired by biological processes. In particular, they comprise a plurality of modules designed to operate similarly to the way biological neurons are thought to function. A neuron is often modeled as a multiple-input, single-output thresholding accumulator that "fires" if enough of its inputs are sufficiently active. (The output of one model neuron may be connected to the inputs of many subsequent model neurons, or even to earlier model neurons in a feedback loop.) Although the function of a single model neuron is very simple, it has been observed that a large number of such models operating simultaneously, when properly configured or "trained," can produce surprisingly good results on problems that are difficult to address with traditional data processing or programming techniques.
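The thresholding-accumulator model described above can be illustrated with a minimal sketch. The function name `model_neuron`, the weights, and the threshold are illustrative assumptions, not values taken from this application.

```python
def model_neuron(inputs, weights, threshold):
    """Multiple-input, single-output thresholding accumulator.

    Returns 1 ("fires") when the weighted sum of the inputs reaches the
    threshold, otherwise 0. All names and values here are illustrative.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


weights = [0.5, 0.5, 0.5]
# Only one input is active: the accumulated sum stays below the threshold.
quiet = model_neuron([1, 0, 0], weights, threshold=1.0)
# Two inputs are active: the accumulated sum reaches the threshold.
active = model_neuron([1, 1, 0], weights, threshold=1.0)
```

A network is built by feeding the outputs of such units into the inputs of others; training amounts to adjusting the weights.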

One common neural network task is image recognition. A plurality of neurons is arranged in a pyramid-like hierarchy, with an array of input neurons (one for each pixel of the image), followed by one or more layers of neurons of decreasing size, and an output neuron designated to indicate whether the input image has a characteristic of interest. A network of this kind can be "trained" by exposure to a set of images in which the characteristic of interest is either present or absent. For example, the characteristic might be whether a penguin is present in the image. Once trained, the network may be able to determine fairly accurately whether a penguin is depicted in a new image that was not part of the training set.

Neural networks can also be used to generate or synthesize new information based on prior training and a random seed. For example, biophysicist Dr. Mike Tyka has experimented with training neural networks to generate images that resemble artistic portrayals of human faces, although the images contain no actual subject and any resemblance to an individual is purely coincidental.

Although neural networks are not infallible (a recognizer can misidentify a target, and a generative network can construct output that is unusable for its intended purpose), they are often adequate for practical use in applications where deterministic methods are too slow, too complex, or too expensive. Neural networks can thus fill the gap between mechanical methods that are easy to implement on a computer and labor-intensive methods that depend on human judgment to achieve the best results.

A neural network is trained with sets of images, where two or more images in each set show items of apparel (clothing, shoes, accessories, etc.) and one image shows a model wearing those items. Once trained, a new set of "apparel" images is presented to the network, and a synthetic image resembling a model wearing the items of apparel is generated automatically. The synthetic image is displayed to a user.

FIG. 1 is a flow chart outlining a method according to one embodiment of the present invention.
FIG. 2 is a simplified representation of data flow through a neural network implementing one embodiment of the present invention.
FIG. 3 shows the range of synthetic images produced as the "model pose" image control parameter is varied.
FIG. 4 is a flow chart outlining a more comprehensive application of one embodiment.
FIG. 5 is a flow chart outlining another application of an embodiment.
FIG. 6 is a simplified representation of data flow through a neural network implementing one embodiment of the present invention.
FIG. 7 is a more detailed representation of data flow through a neural network implementing one embodiment of the present invention.
FIG. 8 is a flow chart outlining an efficient method for training a neural network implementing an embodiment of the present invention.

Most people wear clothes, and many people select and purchase at least some of their clothing from vendors that do not (or cannot) offer customers the opportunity to try on a garment before purchasing it. Catalogs and online retailers (among others) often go to great lengths to produce photographs of garments worn by representative models and presented in typical settings. These photographs are important for conveying an impression of the fit and appearance of a garment to customers who cannot see it in person.

Embodiments of the invention use neural networks to synthesize an image of a model wearing a garment from a less expensive image of the garment alone. Beneficially, the synthesis process exposes controls that can be used to adjust characteristics of the synthetic image. For example, the skin tone, body shape, pose, and other characteristics of the model in the synthetic image may be varied over a useful range. This allows a user (e.g., a customer) to adjust the synthetic "as-worn" image so that it more closely resembles the user. Thus, in addition to reducing the cost of producing representative photographs, an embodiment gives users a better sense of how a particular garment might look when they wear it. This improved representation may improve a retailer's chances of making a sale (or of avoiding the return of a garment that the buyer turned out not to like once it was tried on).

Embodiments accomplish this synthesis by harnessing the power of software systems known as "neural networks." Neural networks are often used to perform automatic recognition of objects and features in images; they can answer questions such as "Does this image show the Eiffel Tower?" or "Does this image contain a person wearing a hat?" Neural networks can also be configured to generate, or synthesize, new images that are similar to the images on which they were trained. When these two types of neural network (recognition and synthesis) are coupled together, they form a generative adversarial network ("GAN"): the synthesizer optimizes its output to produce images that cause the recognizer to answer "yes," and the recognizer optimizes its analysis to reduce the chance that it will be "fooled" by a synthetic image. Thus, one possible GAN could be trained to produce better and better images of a person wearing a hat. (It should be appreciated that some synthetic images of a "person wearing a hat" may fool the recognizer without actually looking like a person wearing a hat, but in general, if the network is trained carefully, many or most of the synthetic images will be useful in a process that requires an image a human would recognize as "a photo of a person wearing a hat.")
Figure 1 shows an overview of the core of the operation of one embodiment. The method begins by initializing a neural network (100). The network is trained with pairs of images (110), one image of each pair showing a garment (e.g., the garment laid out flat on a neutral surface) and the other image of each pair showing a model wearing the garment. Once training is complete, useful parameters are identified from the "Z-vector" (described below) (120).

In typical use, a garment image is then provided to the trained network (130). The garment image need not be (and preferably is not) one of the training images; instead, it is an image without a corresponding counterpart. The network synthesizes an image of this garment on a model, based on the network's training and the Z-vector parameters (140). The synthetic image is displayed to the user (150). The user may adjust the Z-vector parameters (160), whereupon a new image is synthesized (140) and displayed (150). Adjustment and re-synthesis may be repeated as often as desired to produce a variety of synthetic images of a model wearing the garment.
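The synthesize, display, adjust cycle of steps 140 to 160 can be sketched as a simple loop. This is only an illustration of the control flow: `synthesize` is a hypothetical stand-in for the trained network, and the parameter names (`pose`, `skin_tone`) and file name are assumptions made for the example.

```python
def synthesize(garment_image, z_params):
    """Hypothetical stand-in for the trained network (step 140).

    A real implementation would return image data; this stub only records
    which garment and which Z-vector parameters were used.
    """
    return {"garment": garment_image, "z": dict(z_params)}


def run_session(garment_image, z_params, adjustments):
    """Synthesize and show an image, then repeat after each adjustment."""
    shown = []
    image = synthesize(garment_image, z_params)       # step 140
    shown.append(image)                               # step 150 (display)
    for name, value in adjustments:                   # step 160 (user adjusts)
        z_params = {**z_params, name: value}
        image = synthesize(garment_image, z_params)   # step 140 again
        shown.append(image)                           # step 150 again
    return shown


images = run_session(
    "dress.png",
    {"pose": 0.0, "skin_tone": 0.5},
    [("pose", 0.3), ("skin_tone", 0.8)],
)
```

Each adjustment yields one additional synthetic image, so the session above produces three images from a single garment image.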

The Z-vector parameters may control characteristics such as the model's skin tone, body shape and size, pose or position, and even accessories (e.g., shoe style, handbag, glasses, scarf, or jewelry), so that the user may be able to steer the synthesis process toward images that more closely resemble how the garment would look if worn by the user.

Figure 2 shows a conceptual representation of a generative adversarial network, which is one type of neural network that can be used in an embodiment of the invention. Information can generally be understood as passing from left to right across the drawing. An image of a garment 210 is presented to the input layer 221 of a first neural network 220. The input layer may have approximately as many input elements as there are pixels in the input image (i.e., many more elements than are shown in the drawing). Alternatively, the input layer may comprise an element for each color channel of each input pixel (e.g., red, green, and blue elements per pixel). Input layer 221 is connected to an intermediate layer 223 by a network of variable connection weights 222, and intermediate layer 223 is connected to an output layer 225 by a similar network 224. The connections are shown all-to-all, but because each connection is associated with a variable weight, some connections may be effectively absent (weight = 0). In addition to layer-to-layer connections, an embodiment may use feedback connections 226, which may go to the immediately preceding layer or to an earlier layer. Each neural network may have more than the three layers shown here; in a preferred embodiment, about seven layers are used.

Note that each layer of neural network 220 (proceeding from left to right) is smaller than the preceding layer. The network can therefore be thought of as performing a kind of compression, resulting in Z-vector 250, which in many embodiments is a vector of real numbers comprising the outputs of the last layer of network 220.

Z-vector 250 is used as the input to a second neural network 230. This network (comprising layers of model neurons 231, 233, and 235, interconnected by variable-weight connections 232 and 234) is similar in structure to network 220, but the number of elements in each layer increases in the direction of data flow. Like network 220, network 230 may include feedback connections (e.g., 236) that carry data in the "reverse" direction. Once the GAN has been trained, the output of network 230 is a synthetic image 240 that represents how garment 210 might look if worn by a model.

In addition to the various intra-network connections, an embodiment may use inter-network connections 227, 228, or 237. These typically connect layers of similar depth (227, 237), although connections between different depths (228) may also be provided. (Here, "depth" means "the number of levels away from the input of left-side network 220 or from the output of right-side network 230," in recognition of the fact that the networks are to some extent mirror images of each other.) These connections may be referred to as "skip connections." Conceptually, a skip connection gives the network a simple, direct way to pass along information that has little effect on the "compression" and "synthesis" operations. For example, garment color is largely orthogonal to the question of how fabric tends to hang and drape on a figure. A red dress and a blue dress of equivalent construction look nearly the same on a model, apart from the color itself. Thus, rather than relying on the input network to recognize the color and convey it to the generative network through the Z-vector, a skip connection can carry the color of the sample image directly to the output network.

Returning to the idea that network 220 performs something like compression, it turns out that some of the Z-vector elements encode characteristics that the GAN has learned to recognize (and to use in image synthesis) after processing the training image pairs. For example, one Z-vector element may control the model's skin tone. Another may control the model's pose (including whether the model faces the camera or turns to one side, or the position of the arms or legs). An element may control the style of shoes depicted in the synthetic image. The shoes may resemble one of the styles learned from the training images, but they are not simply copied from the legs of a training-image model. Instead, generative network 230 constructs an image, including shoes or other accessories, that looks like what might have been seen among the training images if input image 210 had had a corresponding model-wearing-garment counterpart. Crucially, adjusting the Z-vector elements can cause the generative network to create new images with a different skin tone, model pose, or shoe style. The Z-vector elements may also control details such as dress length, sleeve length, or collar style. That is, the synthetic images may show ways the garment might be altered as well as ways the model might be altered. In one exemplary embodiment, even though the input dress image is shown from the front, a Z-vector element can control the rotation of the model, so that the network generates a range of plausible images of the model wearing the dress, rotated from left to right across that range. An example is shown in Figure 3: all 25 images were synthesized from a single image of a black dress by a network trained with photographs of various garment/model pairs.
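Varying one Z-vector element while holding the others fixed can be sketched as follows. The vector length, the index of the element, and the numeric range are illustrative assumptions; the count of 25 merely echoes the number of images in Figure 3, and in a real system each variant would be passed to the generative network.

```python
def sweep_element(z, index, low, high, count):
    """Return `count` copies of z in which element `index` steps from low to high.

    The original vector is left unchanged; every other element is held fixed.
    """
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (high - low) / (count - 1)
    variants = []
    for k in range(count):
        variant = list(z)
        variant[index] = low + k * step
        variants.append(variant)
    return variants


# Hypothetical 8-element Z-vector in which element 2 is assumed to control rotation.
base_z = [0.0] * 8
variants = sweep_element(base_z, index=2, low=-1.0, high=1.0, count=25)
```

Because only one element changes between neighbouring variants, any visible difference between the resulting synthetic images can be attributed to that element.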

Although the generative adversarial network ("GAN") depicted and described with reference to Figure 2 is one type of neural network suitable for use in an embodiment, other known types can also be used. For example, network configurations known in the art as recurrent neural networks ("RNN"), recurrent inference machines ("RIM"), and variational autoencoders ("VAE") can all be trained with pairs of images as described above, can generate a new synthetic image that could be a plausible counterpart of a new input image, and expose quantitative control parameters that can be varied proportionally to adjust features or characteristics of the synthetic image. The following aspects of a neural network are important for use in embodiments of the invention:
● It can be trained with pairs of images.
● It generates a synthetic image from a new input image, where the synthetic image resembles the "pair" image corresponding to the new input image (i.e., the new input image and the synthetic image could plausibly have been a pair in the training set).
● It exposes quantitative control parameters that can be manipulated to change characteristics of the synthetic image, including characteristics such as:
○ Model's skin tone
○ Model's pose
○ Model's weight and body shape
○ Accessories
○ Garment length
○ Garment sleeve style
○ Garment collar style
○ Garment fit

One feature of neural networks suitable for use in an embodiment was mentioned briefly above but merits further discussion. In Figure 2, each node at a particular level was depicted as connected to all nodes at the next level, and some connections between levels were also shown. However, a neural network may instead be fully connected: the weighted output of each node may form part of the input signal to every other node (even including the node itself). A network of this kind is very difficult to depict in a two-dimensional drawing, but it is a common topology well known to those skilled in the art. Another alternative is the convolutional neural network. Because this topology is more space-efficient than a fully connected network, it can often operate on larger input images (and produce higher-resolution output images). Convolutional networks are likewise well known to those skilled in the art and can be used to good effect by following the principles and methods described herein.

In an exemplary embodiment of the invention using a generative adversarial network, the input image may be about 192×256 pixels, with three color channels (red, green, and blue) for each pixel. The input layer therefore comprises 192×256×3 = 147,456 model-neuron elements. Subsequent layers decrease by a factor of two, and the Z-vector ends up as 512×4×3 = 6,144 scalars. The generative network may mirror the input network, starting from the Z-vector and emitting a 192×256 synthetic color image.
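The arithmetic above can be checked with a short sketch. It rests on two assumptions that the text does not spell out: that "a factor of two" means the width and height are each halved from one layer to the next, and that the 512 in 512×4×3 is a per-position channel count. Under those assumptions, six halvings (seven layers, in line with the "about seven layers" mentioned earlier) take 192×256 down to 3×4.

```python
def halve_repeatedly(width, height, steps):
    """Return the (width, height) of each layer when both are halved per step."""
    sizes = [(width, height)]
    for _ in range(steps):
        width, height = width // 2, height // 2
        sizes.append((width, height))
    return sizes


# Input layer: one element per color channel of each pixel.
input_elements = 192 * 256 * 3

# Assumed reading: six halvings of the spatial dimensions, seven layers in all.
sizes = halve_repeatedly(192, 256, steps=6)
z_width, z_height = sizes[-1]

# Assumed reading: 512 channels at each remaining spatial position.
z_scalars = 512 * z_height * z_width
```

The same list of sizes, read in reverse, gives the layer sizes of a generative network that mirrors the input network.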

Not every element of the Z-vector corresponds to a recognizable characteristic such as "skin tone," "dress length," "shoe style," or "model pose." The effect of individual vector elements (and of subsets of the vector) may be determined empirically, or by providing additional information about the training image pairs (e.g., by tagging the images with descriptions of their characteristics) and propagating that information through the network during training.

One preferred way of determining the effect of Z-vector components is to perform principal component analysis ("PCA") on the Z-vector and identify a smaller vector Z' whose components are largely linearly independent. The elements of Z' may be tested to determine their effects, and the elements that affect characteristics of interest may be exposed to the user to control synthetic image generation.
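As a rough illustration of the PCA step, the sketch below finds the first principal component of a small set of sample Z-vectors by power iteration on their covariance matrix. The text does not say how the analysis is carried out, so the choice of power iteration, the assumption that PCA is run over a collection of Z-vectors, and the toy data are all assumptions; a production system would more likely use a numerical library.

```python
import math


def first_principal_component(samples, iterations=200):
    """Estimate the direction of greatest variance among the sample vectors.

    Uses power iteration on the sample covariance matrix. The starting
    vector of all ones is an arbitrary choice and must not be orthogonal
    to the dominant direction.
    """
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    return v


# Toy Z-vectors: the first two elements vary together, the third is constant.
z_samples = [[0.0, 0.0, 5.0], [1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0]]
direction = first_principal_component(z_samples)
```

In this toy data the two co-varying elements collapse into a single component of Z', which is the kind of reduction that makes individual controls easier to test and expose.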

Figure 4 outlines a method used by a complete application built around an image synthesizer according to an embodiment of the invention. To begin, the system operator initializes and trains a neural network as described above, using a training set of image pairs in which one member of each pair depicts a garment and the other member depicts a model wearing the garment (410). Next, a database of garment images is populated (420). These images are similar to the first images of the training pairs and may even include the training images. They may be, for example, images of garments offered for sale by the system operator.

When a customer visits the operator's system (e.g., when the customer accesses an e-commerce website), she may search or browse the catalog of garments using any suitable conventional method (430). For example, garments may be grouped and presented by color, style, weight, designer, size, price, or any other desired arrangement. When the user selects a garment (440), the system synthesizes and displays an image of the garment on a model (450). The user may be presented with an array of controls connected to the appropriate elements of the Z-vector, and she may adjust those parameters as desired (460). When the parameters are adjusted, the system synthesizes and displays a new image of the garment on the model (450).

If the user decides not to purchase the garment (470), for example by clicking a "Back" button or returning to the list of search results, she may continue to view other garments for sale (430). If the user decides to purchase the garment (480), information about the selected garment (e.g., its SKU) is passed to a conventional order-fulfillment process for subsequent activity (490).

An embodiment of the invention may combine a garment selection control with controls for other aspects of image generation (skin tone, pose, body shape, accessories, etc.). Then, by manipulating individual controls of this set, the user can change the garment (leaving skin tone, pose, and accessories unchanged) or switch accessories (leaving skin tone, pose, and garment unchanged). Such an embodiment permits quick, spontaneous comparisons among possible complete "outfits" or "looks," a capability that is otherwise provided at great expense by human fashion coordinators and is therefore mostly unavailable to shoppers of ordinary or modest means.

Figure 5 outlines another application of the image synthesis network described above. As before, the method begins by initializing and training an image synthesis network using a training set of image pairs (500). Next, the system acquires a garment image (510), synthesizes an image of the garment on a model (520), and stores the synthetic image (530). If there are more garment images to process (540), these steps are repeated, resulting in a library of synthetic images showing the various garments whose images were acquired and processed. Image synthesis may use randomly chosen Z-vector parameters, which yields a variety of model skin tones, body shapes, poses, and accessories within the synthetic image library.
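The batch process of steps 510 to 540 might be sketched as below. `synthesize` is again a hypothetical stand-in for the trained network, and the parameter names, the value range of -1.0 to 1.0, and the fixed seed are assumptions made so the example is reproducible.

```python
import random


def synthesize(garment_image, z_params):
    """Hypothetical stand-in for the trained network (step 520)."""
    return {"garment": garment_image, "z": z_params}


def build_library(garment_images, z_names, seed=0):
    """Synthesize one image per garment using randomly chosen Z parameters."""
    rng = random.Random(seed)
    library = []
    for garment in garment_images:                     # step 510, looped via 540
        z_params = {name: rng.uniform(-1.0, 1.0) for name in z_names}
        library.append(synthesize(garment, z_params))  # steps 520 and 530
    return library


library = build_library(
    ["coat.png", "dress.png", "scarf.png"],
    ["skin_tone", "body_shape", "pose"],
)
```

Drawing fresh parameters for every garment is what gives the stored library its variety of models.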

When there are no more garment images to process (550), synthetic images from the library may be incorporated into a catalog layout (560) and printed (570), or a plurality of static web pages comprising one or more synthetic images may be generated (580) and served to visitors to a website (590). This process can reduce the cost of producing a catalog of product images or a website that displays many garments.

It should be appreciated that the method may be integrated with a conventional garment processing sequence, for example that of a consignment dealer who receives many garments from a variety of manufacturers. These garments may be passed through a steam chamber to dissipate storage or shipping odors and to remove wrinkles. A garment may be placed on a mannequin during this process, and an image of the freshly steamed garment may be captured automatically at the end. This image may be delivered to the process outlined in Figure 5 as the "acquired image" at 510. Multiple views of the garment may be acquired, which may permit the synthesis of a greater variety of "garment on model" images, provided that the neural network has been trained with a correspondingly greater variety of image pairs.

The neural-network-based image synthesis system described in this application can be extended to perform more complex and useful tasks without departing substantially from the concepts described above. The extension generally involves training a neural network with sets of input images, where the images are classified into a number of categories. After training, the neural network is used to create a new synthetic image from inputs, which preferably include previously unseen images that fit the training categories. To make this concrete, recall that the system described above is trained with pairs of images in which the first image shows only a garment and the second image shows a model wearing the garment. Once the system is trained, a new garment-only image is presented, and the network synthesizes a new image that resembles a hypothetical model wearing the garment. The system can produce synthetic images of a model that include features other than the input garment, provided that some of the training images showed elements of that kind. For example, as described above, an element of the Z-vector may control the style of shoes depicted. However, the shoes, like the model itself, are entirely a fabrication of the image synthesizer: the input analyzer, or compressor, encountered training images that included models wearing shoes and "learned" to produce plausible synthetic images that might include such shoes.

In the system described below, the training images comprise groups or sets of two or more related "component" images, namely garments, shoes and accessories (e.g., handbags, bracelets, hats), together with a "target" image of a model wearing those garments, shoes and accessories. The goal of training is to "teach" the network to recognize the features of a target image that show the model wearing the apparel items depicted in the component images. Finally, following the practice of generative adversarial networks ("GANs"), the image synthesizer network is pitted against the recognizer, so that the synthesizer learns to create new images that the recognizer accepts as valid or plausible photographs of a model wearing the input garments and accessories. Once this system is trained, it can generate images that look like a model wearing the garments, shoes and accessories, without any need for a real model to put on the items and attend a photo session. Of course, the synthesis control parameters described above (i.e., elements of the Z-vector) can likewise be used to adjust the images synthesized by this more complex system.

FIG. 6 shows an overview of the inventive system, with sample input categories on the left, sample output (or training) images on the right, and a neural network 600 between them. Neural network 600 is represented as an input recognition ("compression") network 620 with a decreasing topology that produces a Z-vector 650, which is then operated on by an output "synthesis" network 630 with an expanding topology to produce a synthetic image showing a plausible model wearing the various input items of apparel. One embodiment of this system accepts at least two different types of input apparel data and produces at least one synthetic image that combines the inputs in a way that is plausible given the network's training. Here, "plausible" does not mean "a copy of a previously seen training image with some portion replaced by portions of the input images." Instead, a "plausible" image is one that the recognizer (after training) assesses as an image showing a model wearing the input garments and accessories.

A simple implementation of the FIG. 6 neural-network black box could merely combine all of the input image data into a single large image, rely on the training process to teach the network to recognize the various portions of the large image, and then synthesize an image of a model wearing all of the input items. However, the more sophisticated implementation shown in FIG. 7 can deliver the input information to the neural network in a way that makes training simpler and more efficient. It also exposes more useful control features to the user and allows the system to be used in a more or less "modular" fashion.

In a preferred embodiment, two or more input networks 721, 722, ..., 729 receive inputs of different categories. One network 721 may receive an image of a garment 711, while another network 729 may receive images of shoes 719a, 719b (the inventor has noted that the performance of the "shoes" network can be improved by providing multiple images of the same shoes, e.g., images from the front and from the side). Each input network may be a separate multi-layer neural network in which the number of nodes decreases at each level, distilling the information in its respective input into a corresponding Z-vector 751, 752, ..., 759 by a compression-like process as described above.

From these Z-vectors, a plurality of output networks 731, 732, ..., 739, each operating independently according to the basic inventive principles, can be trained to produce synthetic images that include a model and their respective inputs. The "garment" network can produce an image of a model wearing the input garment, and the "shoes" network can produce an image of a model's leg (or legs) wearing the input shoes. In this preferred embodiment, however, the separate Z-vectors 751, 752, ..., 759 are combined (e.g., concatenated) to form a combined Z-vector 760, which is delivered to a multi-element image synthesizer neural network 770. This output network creates an image of a model wearing the garment and the shoes, plus any other accessories whose images were provided to other input networks and concatenated into the combined Z-vector 760, e.g., pants 712, whose image was provided through input neural network 722.
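The combining step can be sketched as follows. This is a minimal illustration, assuming the per-category Z-vectors are plain lists of numbers; the function name `combine_z_vectors`, the category names, and the values are hypothetical and do not come from the application.

```python
from typing import Dict, List, Tuple


def combine_z_vectors(
    z_by_category: Dict[str, List[float]], order: List[str]
) -> Tuple[List[float], Dict[str, Tuple[int, int]]]:
    """Concatenate per-category Z-vectors in a fixed order.

    Returns the combined vector and a map from category name to the
    (start, end) slice it occupies, so that individual elements can
    still be located and adjusted after concatenation.
    """
    combined: List[float] = []
    slices: Dict[str, Tuple[int, int]] = {}
    for name in order:
        start = len(combined)
        combined.extend(z_by_category[name])
        slices[name] = (start, len(combined))
    return combined, slices


# Made-up Z-vectors standing in for the outputs of input networks 721, 722, 729.
z_parts = {
    "garment": [0.12, -0.40, 0.88],
    "pants": [0.05, 0.31],
    "shoes": [-0.77, 0.09, 0.50, 0.21],
}
combined_z, z_slices = combine_z_vectors(z_parts, ["garment", "pants", "shoes"])
```

Keeping a slice map alongside the combined vector is one way to preserve the "modular" character of the separate input networks: the elements contributed by one category remain addressable after concatenation.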

When the input networks are separated in this way, they can be used separately and trained or retrained independently. This arrangement also simplifies identification of the Z-vector elements that affect aspects of the output image. These elements may be identified via principal component analysis (PCA), as described above. Elements that control useful characteristics of the synthetic image may be described as "results-effective." For example, one Z-vector element (or a set of covarying elements) may be effective to control the model's skin tone. Other results-effective variables may change the model's pose (e.g., facing left or right, or changing the position of the arms or legs). Some results-effective variables may control characteristics of an item of apparel, for example the softness or stiffness of the fabric drape, the length of a sleeve or collar, or the height of a shoe heel.
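As a rough illustration of how an element that controls a given property might be located, the sketch below ranks Z-vector elements by their correlation with a measured property of the corresponding output images. This is a simpler stand-in chosen for brevity, not the principal component analysis the text refers to, and the sample vectors and heel-height values are invented.

```python
from statistics import mean
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def most_correlated_element(z_vectors: List[List[float]], prop: List[float]) -> int:
    """Index of the Z-vector element whose values track `prop` most closely."""
    n = len(z_vectors[0])
    scores = [abs(pearson([z[i] for z in z_vectors], prop)) for i in range(n)]
    return max(range(n), key=scores.__getitem__)


# Invented data: four Z-vectors and the heel height (cm) seen in each output.
sample_z = [[0.1, 0.0, 0.5], [0.3, 1.0, 0.4], [0.2, 2.0, 0.6], [0.4, 3.0, 0.5]]
heel_cm = [2.0, 4.1, 5.9, 8.0]
heel_index = most_correlated_element(sample_z, heel_cm)
```

A single-element correlation cannot find a set of covarying elements that jointly control a property; that is the case PCA is better suited to.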

In this regard, it should be appreciated that two particularly useful results-effective variables control the body size or proportions of the (fictional, synthesized) model and the tightness of the clothing. Together, these variables allow a user to visualize the fit of different garment sizes.

Note that input "recognizer" network 729, which receives multiple images depicting different views of shoes 719a and 719b and produces Z-vector 759, is proposed as a way to synthesize a picture of the shoes on the model's feet 749. The feet are indicated by three black dots for purposes of illustration. In practice, these points are recorded in the image as invisible metadata. They carry additional information about the pose of the synthesized model, so that (for example) the synthesis network can avoid producing synthetic images in which parts of the model's body are in an impossible configuration (e.g., the left foot pointing forward while the right foot points backward). In addition to "pose" information, other information that assists in recognizing or synthesizing images may also be provided to the neural network to improve its performance.

It should be appreciated that adding a new category of input increases the complexity of the training problem and the number of training images the neural network needs in order to recognize and synthesize the desired images. For example, for one dress, one pair of shoes and one handbag, it is preferable to present "model" images showing the dress alone (the model barefoot and without a bag), the model with the shoes but no bag, the model with the bag but no shoes, and the model with both shoes and bag. If the system is to learn and synthesize multiple model poses, still more model images may be required. The training-image requirement grows exponentially with the number of apparel articles and with the number of model body shapes, sizes and poses that the system is expected to handle.
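The dress, shoes and handbag example can be worked through with a short sketch. It assumes one simple counting model, in which each optional item is either worn or not and every outfit is rendered once per pose and per body shape; the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def outfit_combinations(
    required: Sequence[str], optional: Sequence[str]
) -> List[Tuple[str, ...]]:
    """Every outfit containing all required items and any subset of optional ones."""
    outfits: List[Tuple[str, ...]] = []
    for k in range(len(optional) + 1):
        for subset in combinations(optional, k):
            outfits.append(tuple(required) + subset)
    return outfits


def target_image_count(n_optional: int, n_poses: int, n_body_shapes: int) -> int:
    """Target images needed under the counting model described above."""
    return (2 ** n_optional) * n_poses * n_body_shapes


# Dress always worn; shoes and handbag each optional: four outfits.
outfits = outfit_combinations(["dress"], ["shoes", "handbag"])
```

Under this model each additional optional item doubles the number of outfits, so five optional items, ten poses and three body shapes already call for 2^5 × 10 × 3 = 960 target images.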

In the preferred training process outlined in FIG. 8, training images are synthesized from three-dimensional models of the various input-category items. These three-dimensional models are expensive and time-consuming to create, but they can be assembled automatically into different groupings, and photorealistic renderings can be produced automatically with parametrically generated model characteristics and poses. An arbitrarily large number of training images can therefore be created and presented to the neural network. Moreover, these training images can be created without the lighting artifacts, minor pose variations and other flaws present in actual photographs of live models. This fine-grained control over the training images, together with the ability to generate any number of them automatically, allows the neural network to be trained efficiently and reduces the risk of inadvertently training the network to recognize or respond to irrelevant information.

According to this process, a user builds three-dimensional models of several items of apparel (800). These may be, for example, shirts, pants, jackets, dresses, scarves, hats or shoes. The models may include information that supports rendering photorealistic images, such as material properties, weave, color and pattern. The models can be manipulated automatically, for example by simulating their physical characteristics and the forces acting on them, such as the effects of gravity and inertia on their materials. The sizes and dimensions of the models can also be manipulated automatically. For example, the hem length of a skirt or the circumference of a sleeve can be increased or decreased.

Next, images of the apparel models are rendered (810). These images are preferably in an orientation and condition that is inexpensive to reproduce when photographs of actual garments are made. For example, a photograph of a dress laid flat against a background is less expensive to produce than a photograph of the same dress placed on a mannequin. It is therefore preferable for the rendered images likewise to show the dress against a flat surface. Even when the three-dimensional model is placed against a flat surface, information about the garment (material, color, etc.) can be used to render realistic layers and wrinkles. That is, the rendering need not be perfectly flat; rather, it should show what the real garment would look like if it were laid flat on a flat surface without being pressed or constrained.

Finally, in preparation for generating many training images, a three-dimensional poseable model of a human figure is created (820). This figure may include information such as height, weight, body-part lengths and girths, and hair and skin color (indeed, models of many different human figures may be created). These models can also be manipulated automatically, in that their parts can be positioned or posed in the ways that are possible for a real person.

Next, for as many different training images as are needed, the human figure model is posed (830) and dressed in some or all of the input items of apparel (840). "Dressed" means combining the human figure model with the apparel models using an interference-aware software process that places the apparel at an appropriate distance from the surface of the human figure model, and simulating effects such as gravity so that a rendered image of the dressed model looks realistic.

A photorealistic image of the posed, dressed model is now created by rendering (850), and the input images from 810 together with the posed, dressed "target" image from 850 are used to train the neural network (860). Additional training images can be created by re-posing and re-dressing the human figure model and rendering further target images. Beneficially, because posing can be performed automatically and parametrically by software, target images can be produced with little or no deviation from the intended pose. In other words, for two human figure models of different apparent weight or plumpness, for example, both models can be posed in exactly the same position. Any difference between the rendered images can then be learned by the neural network as relating to the model's weight, rather than as an artifact of inadvertent differences between the poses of two real models attempting to assume the same position in a real photo shoot.
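The posing, dressing and rendering loop (830 to 850) can be sketched as an enumeration of render jobs. This is a minimal sketch under stated assumptions: `RenderJob`, `build_render_jobs`, the body labels and the joint-angle tuples are all hypothetical, and the actual rendering step is not shown.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RenderJob:
    """One target image to render: which figure, in which pose, wearing what."""
    body: str                 # label of a human figure model (hypothetical)
    pose: Tuple[float, ...]   # parametric pose, e.g. joint angles in degrees
    outfit: Tuple[str, ...]   # apparel models to dress the figure in


def build_render_jobs(
    bodies: Sequence[str],
    poses: Sequence[Tuple[float, ...]],
    outfits: Sequence[Tuple[str, ...]],
) -> List[RenderJob]:
    """Cross every body with every pose and outfit, reusing identical pose values."""
    return [RenderJob(b, p, o) for b, p, o in product(bodies, poses, outfits)]


bodies = ["slim", "full"]
poses = [(0.0, 15.0), (30.0, 45.0)]
job_outfits = [("dress",), ("dress", "shoes")]
jobs = build_render_jobs(bodies, poses, job_outfits)
```

Because the same pose tuple is shared by every body, two jobs that differ only in `body` differ in nothing else, which is the property the paragraph above relies on.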

Once the neural network has been trained (which may require thousands or tens of thousands of automatically generated training images), photographic images of several real items of apparel are obtained (870). As noted above, these images should be made under conditions similar to those in which the training apparel items were prepared at 810. The photographic images are provided to the trained neural network, which delivers a corresponding synthetic image that appears to show a model wearing the apparel from the photographs (880). In this synthetic image, some or all of the apparel may be real (images of it were provided to the neural network), but the model is not. Because of its training, the neural network can layer the items appropriately (e.g., drawing a jacket over a shirt, or shoes over socks). Finally, the synthetic image may be displayed (890).

It should be noted that although training a network with photorealistic renderings of three-dimensional models of garments, shoes, accessories and the like is very time-consuming and resource-intensive, the resulting trained network can produce the desired synthetic images from inputs that are significantly easier and cheaper to obtain. Once trained, the network can operate on essentially "flat" images of garments (i.e., images of garments laid flat against a planar background), front and side images of shoes, and one or a few images of each accessory type. The training process can include learning "no accessories" and "no shoes" options, so that the synthesizer can produce such images as well as images with any desired combination of shoes, bags, jewelry and other features. A composite network according to an embodiment of the invention includes input sub-networks and can receive images and associated information including:
● Flat garment images
● Mannequin garment images
● Shoe images (preferably two or three views)
● Handbag images (preferably two or three views)
● Hat images
● Necklace images
● Bracelet images
● Ring images
● Scarf images
● Necktie images

In addition to training images, an embodiment of the invention can learn from text and other data sources. This allows the synthesis or image-generator network to produce better results when corresponding text and/or other data are available in later operation. For example, textual "tags" associated with an image are sometimes available. In a "garment" application, the tags may describe things like color, pattern, size, dimensions, collar shape, and fabric composition or properties. These tags may help the recognizer decide whether an image plausibly depicts all of the input items. If the tags describe the garment color as "blue" or "patterned," an output image showing a red or plain garment is unlikely to be considered plausible. Improved recognition and discrimination performance forces the generative part of the network to produce better synthetic images in order to "fool" the recognizer. Because the desired output of the system is a synthetic image that "looks" like a real one, providing additional information (such as text tags) to the network lets the synthesizer produce better results.
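One common way to hand text tags to a network is a fixed-vocabulary multi-hot vector. The sketch below assumes that encoding; the application does not say how tags are represented, and the vocabulary and function name are invented for illustration.

```python
from typing import Iterable, List, Sequence

# Hypothetical tag vocabulary; a real system would derive this from its data.
TAG_VOCABULARY = ["blue", "red", "patterned", "solid", "cotton", "silk"]


def encode_tags(
    tags: Iterable[str], vocabulary: Sequence[str] = TAG_VOCABULARY
) -> List[float]:
    """Multi-hot encoding: 1.0 where a vocabulary term appears among the tags.

    Matching is case-insensitive; tags outside the vocabulary are ignored.
    """
    present = {t.strip().lower() for t in tags}
    return [1.0 if term in present else 0.0 for term in vocabulary]


tag_vector = encode_tags(["Blue", "patterned", "size-8"])
```

A vector like this could be supplied to the recognizer alongside the image data, so that a red or plain garment in the output conflicts with an input that encodes "blue" and "patterned."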

It should be appreciated that in a multi-input neural network as described herein, one of the inputs may be an image of a person (nude or wearing a neutral-colored bodysuit) standing in a neutral pose. If this image is provided to an input recognizer and the network is trained to produce matching output images, the overall system can synthesize an image that appears to show that specific individual wearing a selection of apparel. Thus, instead of merely showing a generic model (with controllable height, weight, skin tone, pose and other characteristics), the synthetic image can show the specific individual whose neutral-pose photograph was provided, wearing the same apparel. In this embodiment, it is still possible to adjust the pose of the (real) model, or her apparent height or weight, using the same Z-vector controls that affect any other synthetic image produced by an embodiment. The output is therefore not a mere morphing or interpolation of garments so as to cover the model image. Instead, all of the input images (including the image of the real person) are "compressed" down to Z-vectors, and the generative network synthesizes a new image based on the (possibly modified) combined Z-vector. A clear distinguishing characteristic of a synthetic image of this kind is that the model is in a different pose than in the input model image. For example, the input model image may show the person standing straight and facing forward, while the synthetic image may show the person facing left or right, or with arms or legs in different positions. The Z-vector can also be adjusted to produce images showing how the person might look after gaining or losing weight. This usage model may be particularly valuable.

The inventor has identified one additional type of input data that can be provided to the network during training to improve the quality of the synthetic images: "pose" data. When three-dimensional models of human figures are used, it is relatively common to specify their poses via reference points for the skeletal joints. For example, the positions of the head, shoulders, elbows, wrists, hips, knees and ankles can vary only over limited ranges dictated by the nature of the joints involved. By specifying angles, orientations and joint-to-joint distances, the pose of a training figure can be defined efficiently. If this information is provided during training along with garments, tags and other information, the neural network can learn to produce synthetic images depicting the fictional model in a variety of poses. It is even possible to produce images of a series of successive poses, which may be displayed in sequence to create an animation of the model showing off the garments and accessories, as in a fashion show. Body-part girth data, like pose or joint-position data, may be used to help the neural network learn to depict models of different weights.
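Joint-based pose data and a sequence of successive poses can be sketched as follows. The joint names, the angle limits and the use of linear interpolation are all illustrative assumptions; the limits are not anatomical reference values.

```python
from typing import Dict, List, Tuple

Pose = Dict[str, float]  # joint name -> angle in degrees

# Illustrative limits only; not anatomical reference values.
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "elbow": (0.0, 150.0),
    "knee": (0.0, 140.0),
    "shoulder": (-60.0, 180.0),
}


def clamp_pose(pose: Pose) -> Pose:
    """Restrict each joint angle to its permitted range."""
    return {
        joint: min(max(angle, JOINT_LIMITS[joint][0]), JOINT_LIMITS[joint][1])
        for joint, angle in pose.items()
    }


def interpolate_poses(start: Pose, end: Pose, steps: int) -> List[Pose]:
    """Return `steps` poses (steps >= 2) moving linearly from start to end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    frames: List[Pose] = []
    for i in range(steps):
        t = i / (steps - 1)
        frames.append({j: start[j] + t * (end[j] - start[j]) for j in start})
    return frames


standing = {"elbow": 10.0, "knee": 0.0, "shoulder": 0.0}
waving = clamp_pose({"elbow": 200.0, "knee": 0.0, "shoulder": 160.0})
frames = interpolate_poses(standing, waving, 5)
```

Each frame could accompany the apparel inputs as pose data, and the resulting synthetic images shown in order would form the kind of animation described above. Interpolating between two clamped poses keeps every intermediate angle within the same limits.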

Some or all of the concepts and procedures described above can be combined in various ways to form useful systems. For example, in one aspect, a system may initialize a multi-layer neural network having a decreasing-topology input section that produces a Z-vector and an expanding-topology output section that produces a synthetic image based on the Z-vector, and may train the multi-layer neural network with training data sets. Each training data set comprises: (i) a first training image showing a first item of apparel; (ii) a second training image showing two views of a second, different item of apparel; and (iii) a training target image showing a model wearing the first item of apparel and the second, different item of apparel. The training step produces a trained image synthesis network. The system can then be used to provide an instance data set to the trained image synthesis network, the instance data set comprising: (I) a first instance image showing a first instance item of apparel; and (II) a second instance image showing two views of a second, different instance item of apparel. The system can be used to obtain an instance Z-vector from the decreasing-topology input section based on the instance data set, and to export a synthetic instance image from the expanding-topology output section based on the instance Z-vector, the synthetic instance image appearing to show a model wearing the first instance item of apparel and the second instance item of apparel.
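The two kinds of data set named in this aspect can be written down as simple record types. This is only a sketch of the data layout: the class names, field names and file names are hypothetical, and no network is implemented.

```python
from dataclasses import dataclass
from typing import List


def _require_two_views(views: List[str]) -> None:
    if len(views) != 2:
        raise ValueError("the second item must be shown in exactly two views")


@dataclass
class TrainingDataSet:
    first_item_image: str          # (i) first item of apparel
    second_item_views: List[str]   # (ii) two views of a second, different item
    target_image: str              # (iii) model wearing both items

    def __post_init__(self) -> None:
        _require_two_views(self.second_item_views)


@dataclass
class InstanceDataSet:
    first_item_image: str          # (I) first instance item of apparel
    second_item_views: List[str]   # (II) two views of a second instance item

    def __post_init__(self) -> None:
        _require_two_views(self.second_item_views)


# Hypothetical file names, for illustration only.
training_example = TrainingDataSet(
    "dress_flat.png", ["shoe_front.png", "shoe_side.png"], "model_wearing_both.png"
)
instance_example = InstanceDataSet(
    "new_dress.png", ["new_shoe_front.png", "new_shoe_side.png"]
)
```

The instance record has no target field because the target is what the trained network is asked to synthesize.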

In a system as described above, further refinements may include identifying results-effective elements of the Z-vector during training, and adjusting a results-effective element of the instance Z-vector to alter a characteristic of the synthetic instance image. Another refinement may include using a training model image showing a human figure in a neutral pose, in which case the training target image (iii) shows that human figure wearing the first item of apparel and the second, different item of apparel; the instance data set further includes (III) a model instance image showing a person in a neutral pose; the synthetic image appears to show that person wearing the first instance item of apparel and the second instance item of apparel; and the pose of the person in the synthetic image differs from the neutral pose of the model instance image.

Another refinement of the system is to include in the training data at least one item of non-image data relating to the first training image. The non-image data may be, among other things, text tags, garment material, garment size or garment dimensions. A useful way to apply the system is to use images in which the apparel is depicted laid flat, but unconstrained, against a flat surface. Such items of apparel may include, for example, shirts, blouses, dresses, skirts or pants. Items of apparel that benefit from images taken from multiple viewpoints may include, for example, shoes, handbags, hats, bracelets and other jewelry.

In another aspect, a system according to an embodiment may create a training image set by: constructing an automatically manipulable model of a first item of apparel; rendering an image of the first item of apparel to produce a first training image; constructing an automatically manipulable model of a second, different item of apparel; rendering two images of the second, different item of apparel to produce a pair of second training images; constructing an automatically manipulable model of a human figure; automatically manipulating the human figure model to produce a posed figure model; automatically dressing the posed figure model with the model of the first item of apparel and the model of the second item of apparel to produce a dressed figure model; rendering an image of the dressed figure model to produce a third training image; and applying the first, second and third training images to a neural network to train the neural network to synthesize an image similar to the third training image from instance images similar to the first training image and the pair of second training images.

Using a neural network trained with an image set produced as described, one may obtain a first instance image of a first instance item of apparel by photographing that item; obtain a pair of second instance images of a second instance item of apparel by photographing that item from two different viewpoints; deliver the first instance image and the pair of second instance images to the trained neural network; and receive a target image from the trained neural network, the target image resembling a non-existent human figure wearing the first instance item of apparel and the second instance item of apparel.

The neural network training images may include a neutral-pose image rendered from the model of the human figure with the model in a neutral pose. The trained network may then receive an image of a particular person in a neutral pose, so that the synthesized image appears to show that particular person wearing the items of apparel.

A neural network trained as described above may be incorporated into a commercial workflow by: obtaining a first instance image of a first instance item of apparel by photographing that item; obtaining a pair of second instance images of a second instance item of apparel by photographing that item from two different viewpoints; obtaining a third instance image of a person by photographing the person in a neutral pose; delivering the first instance image, the pair of second instance images, and the third instance image to the trained neural network; and receiving a target image from the trained neural network. The target image resembles the person wearing the first instance item of apparel and the second instance item of apparel, and either the pose of the person in the target image differs from the neutral pose, or the apparent weight of the person in the target image differs from the apparent weight of the person in the third instance image.
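The inputs and output of that workflow can be sketched as below. The sketch assumes nothing about the network itself: `InstanceSet`, `request_target_image`, and `stub_network` are hypothetical names, images are represented as raw bytes, and the stub merely reports how many images it received.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Image = bytes  # stand-in for an encoded photograph


@dataclass(frozen=True)
class InstanceSet:
    first_image: Image                    # photo of the first instance item
    second_images: Tuple[Image, Image]    # two viewpoints of the second item
    person_image: Optional[Image] = None  # neutral-pose photo, if supplied


def request_target_image(network: Callable[[InstanceSet], Image],
                         instance: InstanceSet) -> Image:
    """Deliver the instance images to the trained network and return its output."""
    if len(instance.second_images) != 2:
        raise ValueError("expected two views of the second item of apparel")
    return network(instance)


def stub_network(instance: InstanceSet) -> Image:
    """Placeholder for a trained network: it only reports what it received."""
    count = 1 + len(instance.second_images) + (instance.person_image is not None)
    return f"target image synthesized from {count} inputs".encode()


photos = InstanceSet(b"dress.jpg", (b"bag-front.jpg", b"bag-side.jpg"), b"customer.jpg")
target = request_target_image(stub_network, photos)
```

Leaving `person_image` unset corresponds to the earlier case in which the target image shows a non-existent human figure.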

An embodiment of the invention may be a machine-readable medium, including without limitation a non-transitory machine-readable medium, that stores data and instructions which cause a programmable processor to perform the operations described above. In other embodiments, the operations may be performed by specific hardware components that contain hardwired logic. Alternatively, the operations may be performed by any combination of programmed computer components and custom hardware components.

Instructions for a programmable processor may be stored in a form that is directly executable by the processor ("object" or "executable" form), or in a human-readable text form called "source code" that can be processed automatically by a development tool commonly known as a "compiler" to produce executable code. Instructions may also be specified as a difference or "delta" from a predetermined version of a basic source code. The delta (also called a "patch") can be used to prepare instructions that implement an embodiment of the invention, beginning with a commonly available source code package that does not contain an embodiment.
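As a small illustration of a delta, the sketch below uses Python's standard `difflib` module to compute the difference between two versions of a source file and then to recover the newer version from a delta. The file names and file contents are invented for the example and do not come from any embodiment.

```python
import difflib

# Hypothetical base package (no embodiment) and patched version.
base = ["def synthesize(images):\n",
        "    raise NotImplementedError\n"]
patched = ["def synthesize(images):\n",
           "    return network(images)\n"]

# A unified diff is the textual "patch" that would be distributed.
patch = list(difflib.unified_diff(base, patched,
                                  fromfile="a/synth.py", tofile="b/synth.py"))
print("".join(patch))

# An ndiff delta carries enough information to rebuild either version.
delta = list(difflib.ndiff(base, patched))
recovered = list(difflib.restore(delta, 2))  # 2 selects the second (patched) sequence
```

Distributing only the patch is enough for a recipient who already holds the base package to reconstruct the full instructions.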

In some embodiments, the instructions for a programmable processor may be treated as data and used to modulate a carrier signal, which can subsequently be sent to a remote receiver, where the signal is demodulated to recover the instructions and the instructions are executed to implement the methods of an embodiment at the remote receiver. In the vernacular, such modulation and transmission are known as "serving" the instructions, while receiving and demodulating are often called "downloading." In other words, one embodiment "serves" (i.e., encodes and sends) the instructions of an embodiment to a client, often over a distributed data network such as the Internet. The instructions thus transmitted can be saved on a hard disk or other data storage device at the receiver to create another embodiment of the invention, meeting the description of a non-transitory machine-readable medium storing data and instructions to perform some of the operations discussed above. Compiling (if necessary) and executing such an embodiment at the receiver may result in the receiver performing operations according to a third embodiment.

In the preceding description, numerous details were set forth. It will be apparent to one skilled in the art, however, that the present invention may be practiced without some of these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

Some portions of the detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as is apparent from the preceding discussion, it should be appreciated that throughout the description, discussions utilizing terms such as "processing," "computing," "calculating," "determining," or "displaying" refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates data represented as physical (electronic) quantities within the computer system's registers and memories and transforms it into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices.

The present invention also relates to an apparatus for performing the operations described herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, including without limitation any type of disk, such as floppy disks, optical disks, compact disc read-only memory ("CD-ROM"), and magneto-optical disks; read-only memory ("ROM"); random access memory ("RAM"); erasable programmable read-only memory ("EPROM"); electrically erasable programmable read-only memory ("EEPROM"); magnetic or optical cards; or any type of medium suitable for storing computer instructions.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform some of the method steps. The required structure for a variety of these systems is set forth in the claims below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

The applications of the present invention have been described largely by reference to specific examples and in terms of particular allocations of functionality to certain hardware and/or software components. However, those skilled in the art will recognize that configurable synthetic images of a model wearing garments, based on images of the garments alone, can also be produced by software and hardware that distribute the functions of embodiments of this invention differently than described herein. Such variations and implementations are understood to be covered by the following claims.
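The claims that follow recite a multi-layer network whose input section reduces the instance images to a Z-vector and whose output section expands the Z-vector into a composite image. A toy sketch of that data flow is given here under strong assumptions: the weights are hand-picked rather than trained, each "image" is two numbers, and directly shifting one Z-vector element is only a simplified stand-in for adjusting factors that influence the Z-vector. All names (`layer`, `encode`, `decode`, `adjust`, `REDUCE`, `EXPAND`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def layer(weights: Matrix, inputs: Sequence[float]) -> Vector:
    """One fully connected layer with a tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Hand-picked, untrained weights (assumptions for illustration only).
REDUCE: Matrix = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.25, 0.25, 0.25, 0.25]]   # 6 inputs -> 2-element Z
EXPAND: Matrix = [[1.0, 0.0], [0.0, 1.0],
                  [0.5, 0.5], [1.0, -1.0]]               # Z -> 4 output pixels


def encode(first: Vector, second_a: Vector, second_b: Vector) -> Vector:
    """Reduction section: three input images in, a short Z-vector out."""
    return layer(REDUCE, first + second_a + second_b)


def decode(z: Vector) -> Vector:
    """Augmentation section: Z-vector in, composite image out."""
    return layer(EXPAND, z)


def adjust(z: Vector, index: int, delta: float) -> Vector:
    """Return a copy of z with one element shifted."""
    changed = list(z)
    changed[index] += delta
    return changed


z = encode([0.2, 0.4], [0.6, 0.8], [0.1, 0.3])
composite = decode(z)
variant = decode(adjust(z, 1, 0.2))
```

In this toy arrangement the first output pixel depends only on the first Z element, so shifting the second element leaves it unchanged while altering the other pixels, which is the kind of targeted change in the composite image that adjusting the Z-vector is meant to achieve.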

Claims

1. A method comprising:
initializing a multi-layer neural network having a reduction-topology input section for generating a Z-vector and an augmentation-topology output section for generating a composite image based on the Z-vector;
training the multi-layer neural network with training data sets to produce a trained image synthesis network, each training data set comprising:
i) a first training image showing a first item of apparel;
ii) two second training images showing two views of a second, different item of apparel; and
iii) a training target image showing a model wearing the first item of apparel and the second, different item of apparel;
providing an instance data set to the trained image synthesis network, the instance data set comprising:
I) a first instance image showing a first instance item of apparel; and
II) two second instance images showing two views of a second, different instance item of apparel;
obtaining an instance Z-vector from the reduction-topology input section based on the instance data set; and
exporting a composite instance image from the augmentation-topology output section based on the instance Z-vector,
wherein the composite instance image appears to show a model wearing the first instance item of apparel and the second instance item of apparel.

2. The method of claim 1, further comprising:
identifying factors that influence the Z-vector during training; and
adjusting the factors that influence the instance Z-vector to change characteristics of the composite instance image.

3. The method of claim 1, wherein:
each training data set further comprises iv) a training model image showing a human figure in a neutral pose;
the training target image (iii) shows the human figure wearing the first item of apparel and the second, different item of apparel;
the instance data set further comprises III) a model instance image showing a person in a neutral pose; and
the composite instance image appears to show the person wearing the first instance item of apparel and the second instance item of apparel, and a pose of the person in the composite instance image differs from the neutral pose of the model instance image.

4. The method of claim 1, wherein each training data set further comprises v) at least one item of non-image data related to the first training image.

5. The method of claim 4, wherein the at least one item of non-image data is selected from the group consisting of a text tag, a garment material, a garment size, and a garment dimension.

6. The method of claim 1, wherein the first item of apparel is selected from the group consisting of a shirt, a blouse, a dress, a skirt, and pants.

7. The method of claim 6, wherein the first item of apparel is flat but is depicted unconstrained relative to a flat surface.

8. The method of claim 1, wherein the second, different item of apparel is selected from the group consisting of shoes, a handbag, a hat, and a bracelet.

9. A method for training a neural network to synthesize an image of a model wearing multiple items of apparel, the method comprising:
constructing an automatically manipulable model of a first item of apparel;
rendering an image of the first item of apparel to generate a first training image;
constructing an automatically manipulable model of a second, different item of apparel;
rendering two images of the second, different item of apparel to generate a pair of second training images;
constructing an automatically manipulable model of a human figure;
automatically manipulating the model of the human figure to generate a posed figure model;
automatically dressing the posed figure model with the model of the first item of apparel and the model of the second item of apparel to generate a dressed figure model;
rendering an image of the dressed figure model to generate a third training image; and
applying the first, second, and third training images to a neural network, thereby training the neural network to synthesize an image similar to the third training image from instance images similar to the first training image and the pair of second training images.

10. The method of claim 9, wherein the applying produces a trained neural network, the method further comprising:
obtaining a first instance image of a first instance item of apparel by photographing the first instance item of apparel;
obtaining a pair of second instance images of a second instance item of apparel by photographing the second instance item of apparel from two different viewpoints;
delivering the first instance image and the pair of second instance images to the trained neural network; and
receiving a target image from the trained neural network, the target image resembling a non-existent human figure wearing the first instance item of apparel and the second instance item of apparel.

11. The method of claim 9, wherein the applying produces a trained neural network, the method further comprising:
rendering a neutral-pose image of the model of the human figure with the model in a neutral pose; and
applying the neutral-pose image, along with the first, second, and third training images, to the neural network.

12. The method of claim 11, further comprising:
obtaining a first instance image of a first instance item of apparel by photographing the first instance item of apparel;
obtaining a pair of second instance images of a second instance item of apparel by photographing the second instance item of apparel from two different viewpoints;
obtaining a third instance image of a person by photographing the person in a neutral pose;
delivering the first instance image, the pair of second instance images, and the third instance image to the trained neural network; and
receiving a target image from the trained neural network, the target image resembling the person wearing the first instance item of apparel and the second instance item of apparel,
wherein a pose of the person in the target image differs from the neutral pose, or an apparent weight of the person in the target image differs from an apparent weight of the person in the third instance image.