RU2844852C1 - Method for face image editing

Info

Publication number: RU2844852C1
Application number: RU2024113765A
Authority: RU
Inventors: Полина Владимировна КАРПИКОВА; Андрей Николаевич Спиридонов; Анна Борисовна ВОРОНЦОВА; Александр Георгиевич ЛИМОНОВ
Original assignee: Samsung Electronics Co., Ltd.
Filing date: 2024-05-21
Publication date: 2025-08-07

Abstract

FIELD: data processing.

SUBSTANCE: the invention relates to image processing. In the method, 3D-GAN inversion of a distorted input face image is performed by optimizing the camera parameters and the latent face code, which yields a generated image. The obtained latent code and new camera parameters are used to estimate depth, to build a 3D mesh, and to render this mesh to obtain a portrait. Blending is then applied, in which visible areas are reprojected and occluded parts are restored using a generative neural network.

EFFECT: possibility of correcting human head rotation, eliminating perspective distortion.

5 cl, 7 dwg, 2 tbl

Description

The present invention relates to image correction, namely to the correction of facial distortion and head pose in an image, for example a selfie photograph captured by a camera.

Description of the related art

Selfies are perhaps the most common type of photograph taken with a smartphone. Modern smartphone cameras produce high-quality images, but those images often suffer from unwanted facial geometry artifacts.

The distance to the camera plays a vital role in how a portrait is perceived. Selfies taken at close range often suffer from perspective distortion, which shows up as deformed and asymmetric features, an oversized nose, and tiny or even invisible ears, producing unnatural and unflattering images.
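The effect of camera distance can be illustrated with simple pinhole-camera arithmetic. The sketch below is not part of the patent; the function name and the assumed 0.10 m depth offset between nose tip and ears are illustrative assumptions only.

```python
def relative_magnification(camera_to_nose_m, nose_to_ear_depth_m=0.10):
    """Ratio of nose magnification to ear magnification under a pinhole camera.

    A feature of physical size s at depth z projects to f * s / z, so the
    focal length f cancels in the ratio and only the two depths remain.
    The 0.10 m nose-to-ear depth offset is an assumed, illustrative value.
    """
    if camera_to_nose_m <= 0:
        raise ValueError("camera distance must be positive")
    ear_depth = camera_to_nose_m + nose_to_ear_depth_m
    return ear_depth / camera_to_nose_m


# Compare a typical arm's-length selfie with a portrait taken from further away.
for distance in (0.3, 1.5):
    ratio = relative_magnification(distance)
    print(f"{distance:.1f} m: nose is magnified {ratio:.2f}x relative to the ears")
```

Under these assumptions the nose is enlarged by roughly a third relative to the ears at 0.3 m, but by only a few percent at 1.5 m, which is why close-range selfies look distorted.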

Another problem related to 3D geometry is an unfavorable head pose. Posing for a selfie takes practice, and choosing an attractive viewpoint is non-trivial: it either takes several attempts to get right or, in some scenarios, is not possible at all. The ability to correct the head pose in post-processing is therefore a sought-after feature in selfie editing.

To date, various approaches to manipulating facial geometry have been proposed, based on 2D and 3D warping, neural radiance fields (NeRF), or generative models (neural networks) that synthesize a portrait under updated viewing conditions.

Known distortion correction methods based on 2D warping rely on estimating a 2D flow map to warp the image. Such methods suffer from severe distortions caused by inaccurate 3D geometry fitting. Approaches based on 3D warping preserve the details of the original image very well, but they cannot fill in the occluded regions that inevitably appear.

NeRF-based methods give full control over the camera parameters for novel view synthesis but do not use any face priors. Optimization therefore starts from scratch and is very slow: despite some progress, these approaches are far from real-time performance.

Another branch of face image manipulation techniques relies on generative neural networks. In a generative pipeline, the original face image is encoded into a latent representation and then reconstructed under new viewing conditions and with updated camera parameters using a pre-trained 3D-aware GAN (3D-GAN). The latent face code, camera position, and focal length are estimated in a joint optimization procedure. However, fitting these parameters from a single distorted image is an ambiguous, ill-posed problem, so existing methods struggle to recover accurate 3D geometry. Although this can be improved to some extent by introducing geometric constraints, there is a more significant drawback: a generative approach cannot guarantee that a person's identity will be preserved. In addition, GANs tend to lose fine details, which seriously degrades image quality.

Known 2D-GAN (generative adversarial network) inversion methods do not preserve facial structure under transformation, so consistency across multiple views is not guaranteed. Recently introduced 3D-GANs have demonstrated the ability to generate view-consistent results based on implicit 3D representations.

To reconstruct a face image under new viewing conditions, the original face must first be mapped into the latent space of a pre-trained GAN. This technique is called GAN inversion. 3D-GAN inversion methods build on 2D-GAN inversion methods with predefined camera parameters, which can be estimated by a separate method and then corrected or further tuned together with the optimization of the latent face code.

Recently, DisCO proposed a sophisticated 3D inversion scheme that optimizes the latent face code and camera parameters and includes initialization with a short camera-to-face distance and camera reparameterization. The results achieved with its multi-stage optimization schedule, which uses geometry and landmark regularization, look promising. However, optimization-based GAN inversion is slow and, being a purely generative approach, DisCO does not preserve identity and produces images that lack fine details.

Instead of optimization, GAN inversion can be performed with encoder-based techniques, which map the input image into the latent space in a single forward pass and are orders of magnitude faster. Until recently, optimization-based methods gave the best reconstruction quality, but the newest encoder-based methods can produce more accurate and more view-consistent geometry [1].

In HFGI3D [2], GAN training is supervised with warped images, which makes the synthesized images more realistic. The output is therefore generated entirely by the GAN, and warping serves only as guidance. In the proposed GAN pipeline, by contrast, warping directly affects the final image.

When the head pose changes, disocclusions are inevitable and more noticeable, because previously hidden parts of the person's head and face become exposed. Warping-based methods are therefore insufficient for head pose correction when used alone, and learnable 3D-aware methods dominate this field.

One line of work uses NeRF-based methods to synthesize a novel view from several images [17] or from a single image [16]. However, such approaches may not guarantee identity preservation, since no face priors are used.

Known GAN-based approaches aim to preserve person-specific details by conditioning on an input image or video [18], a latent code [19], or encoded facial attributes [20]. Nevertheless, identity preservation cannot be guaranteed, and the generated images show slightly different facial features.

Thus, there is a need to bring together the different face manipulation paradigms, namely the GAN-, NeRF-, and warping-based approaches, by combining warping for the visible parts with generation for the occluded parts.

Drawings

The above and/or other aspects will become more apparent from the following description of exemplary embodiments with reference to the accompanying drawings, in which:

Fig. 1 illustrates examples of selfie editing results according to the proposed invention.

Fig. 2 illustrates an overview of the proposed pipeline.

Fig. 3 illustrates, on the left, the HeRo capture setup with smartphones mounted on a tripod and, next to it, series of photographs of the same people taken simultaneously by cameras at the front, left, right, and top.

Fig. 4 illustrates a qualitative comparison on the CMDP dataset.

Fig. 5 illustrates in-the-wild images with corrected distortion.

Fig. 6 illustrates the LPIPS, SSIM, ID, and PSNR metrics for different numbers of iterations.

Fig. 7 illustrates pose correction on an example taken from the HeRo dataset.

SUMMARY OF THE INVENTION

Photographs (images), such as self-portraits, taken at close range can look unnatural or even unattractive because of severe distortions that deform facial features and make head poses look odd. The proposed method performs 3D-GAN inversion of the distorted face image by optimizing the camera parameters and the latent face code, which yields a generated image. Visibility-based blending is then applied, in which visible areas are reprojected and occluded parts are restored using a generative neural network. Experiments on face undistortion benchmarks and on a self-collected head rotation dataset (HeRo) demonstrate that the claimed method outperforms previous approaches both qualitatively and quantitatively, and thus opens up new possibilities for photorealistic selfie editing. Portrait (face image) normalization makes it possible to set a desired head pose and/or remove perspective distortion.
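The visibility-based blending described above can be sketched as a per-pixel mix of the reprojected (rendered) image and the generated image. This is a minimal illustration only: the linear blending formula, the function name, and the tiny grayscale images are assumptions, since this passage does not specify the exact blending rule.

```python
def blend_by_visibility(rendered, generated, visibility):
    """Per-pixel blend: keep reprojected pixels where visible, generated ones elsewhere.

    All three arguments are equally sized 2D lists; visibility values lie in
    [0, 1], with 1 meaning the pixel was visible in the source photo.
    Linear blending is an assumption made for this sketch.
    """
    if not (len(rendered) == len(generated) == len(visibility)):
        raise ValueError("inputs must have the same number of rows")
    blended = []
    for r_row, g_row, v_row in zip(rendered, generated, visibility):
        if not (len(r_row) == len(g_row) == len(v_row)):
            raise ValueError("rows must have the same length")
        blended.append([v * r + (1.0 - v) * g for r, g, v in zip(r_row, g_row, v_row)])
    return blended


rendered = [[0.9, 0.8], [0.7, 0.0]]     # reprojected source pixels (0.0 marks a hole)
generated = [[0.5, 0.5], [0.5, 0.4]]    # generative output for the same view
visibility = [[1.0, 1.0], [1.0, 0.0]]   # the bottom-right pixel was occluded
print(blend_by_visibility(rendered, generated, visibility))
```

Visible pixels keep the detail of the original photograph, while the occluded pixel falls back to the generated value.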

A method for editing face images in a photograph is proposed, the method comprising:

a) selection by a user of an image to be edited that contains at least one face image;

b) detecting face images in the selected image and segmenting one face image from among the detected face images;

c) feeding the segmented face image to the input of a neural network that predicts camera parameters and camera position for the segmented face image, and obtaining initial camera parameters and position at its output;

d) feeding the segmented face image to the input of a neural network that predicts a latent face code, and obtaining the latent face code at its output;

wherein steps (c) and (d) are performed in parallel;

e) performing an iterative 3D-GAN optimization process, in which:

- the predicted camera parameters and position and the predicted latent face code are fed to the input of a 3D-GAN (generative adversarial network),

- a generated image is obtained at the output of the 3D-GAN,

- a loss function is computed between the segmented face image and the generated image,

- the predicted camera parameters and position and the predicted latent face code are iteratively modified, and

- the modified camera parameters and position and the modified latent face code are fed to the input of the 3D-GAN,

wherein the iterative optimization process continues until said loss function reaches its minimum, and the camera parameters and position and the latent face code at which the minimum is reached are the optimal camera parameters and position and the optimal latent face code;

f) feeding the optimal latent face code, together with arbitrary new camera parameters and position, to the input of said 3D-GAN;

g) generating, by means of said 3D-GAN, a new generated image with a face image that corresponds to the segmented face image and has a new face view corresponding to said new camera parameters and position, and predicting a depth map of the new generated image;

h) processing the predicted depth map to build a 3D mesh of the new generated image;

i) projecting the built 3D mesh onto the image plane with the optimal camera parameters and position, and generating a rendered image based on this projection;

j) determining which vertices of the 3D mesh are visible and which are occluded after projection onto the image plane with the optimal camera parameters and position, and obtaining a visibility mask based on this determination;

wherein steps (i) and (j) are performed in parallel;

k) blending the new generated image and the rendered image using the visibility mask, and obtaining an edited segmented face image showing the face from the new view;

l) transferring the edited segmented face image onto the selected image,

wherein steps (b)-(k) are performed for at least one face in the selected image; and

m) displaying the selected image with the at least one edited segmented face image to the user on a screen.
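The iterative optimization of step (e) can be sketched with a toy example. Everything here is a hypothetical stand-in: `toy_generator` replaces the 3D-GAN with a three-value function, the loss is a plain squared error, gradients come from finite differences rather than backpropagation, and the stopping rule (loss change below a tolerance, or a step budget) is an assumption about how "reaching the minimum" might be detected in practice.

```python
def toy_generator(latent, camera):
    """Hypothetical stand-in for the 3D-GAN: maps (latent, camera) to 3 'pixels'."""
    return [latent[0] * camera[0], latent[1] + camera[0], latent[0] - latent[1]]


def l2_loss(a, b):
    """Sum of squared differences between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def invert(target, latent, camera, lr=0.05, steps=2000, tol=1e-12, eps=1e-6):
    """Jointly refine the latent code and camera parameters by gradient descent."""
    params = list(latent) + list(camera)
    n_latent = len(latent)

    def loss_at(p):
        return l2_loss(toy_generator(p[:n_latent], p[n_latent:]), target)

    loss = loss_at(params)
    for _ in range(steps):
        # Central finite differences replace backpropagation in this toy setup.
        grad = []
        for i in range(len(params)):
            hi = params[:i] + [params[i] + eps] + params[i + 1:]
            lo = params[:i] + [params[i] - eps] + params[i + 1:]
            grad.append((loss_at(hi) - loss_at(lo)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grad)]
        new_loss = loss_at(params)
        converged = abs(loss - new_loss) < tol
        loss = new_loss
        if converged:
            break
    return params[:n_latent], params[n_latent:], loss


# The "segmented face image" is produced from known values so the result can be checked.
target = toy_generator([1.0, 0.5], [2.0])
# Initial guesses play the role of the encoder outputs from steps (c) and (d).
latent, camera, loss = invert(target, latent=[0.8, 0.3], camera=[1.7])
print(latent, camera, loss)
```

Starting from predicted values rather than from scratch mirrors the role of steps (c) and (d): the optimizer only has to refine an initial estimate that is already close.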

The arbitrary new camera parameters and position may be selected by the user. The proposed method may additionally comprise a step in which the user selects one face image for segmentation from among the face images detected in the selected image.
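Steps (h) and (j) of the method, building a 3D mesh from a depth map and deciding which vertices are visible, can be sketched as follows. This is a simplified illustration under stated assumptions: a single pinhole camera with hypothetical intrinsics `focal`, `cx`, `cy`, a regular-grid triangulation, and a point-based z-buffer in place of the triangle rasterization a real renderer would use.

```python
def depth_to_mesh(depth, focal, cx, cy):
    """Back-project a depth map to vertices and connect grid neighbours into triangles."""
    rows, cols = len(depth), len(depth[0])
    vertices = []
    for v in range(rows):
        for u in range(cols):
            z = depth[v][u]
            # Inverse pinhole projection: pixel (u, v) at depth z -> camera-space point.
            vertices.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
    faces = []
    for v in range(rows - 1):
        for u in range(cols - 1):
            i = v * cols + u
            # Two triangles per grid cell.
            faces.append((i, i + 1, i + cols))
            faces.append((i + 1, i + cols + 1, i + cols))
    return vertices, faces


def visible_vertices(vertices, focal, cx, cy, depth_tol=1e-3):
    """Point-based z-buffer: a vertex is visible if no other vertex on its pixel is closer.

    Assumes every vertex lies in front of the camera (z > 0).
    """
    nearest = {}
    pixels = []
    for x, y, z in vertices:
        pixel = (round(focal * x / z + cx), round(focal * y / z + cy))
        pixels.append(pixel)
        nearest[pixel] = min(z, nearest.get(pixel, float("inf")))
    return [z <= nearest[p] + depth_tol for (_, _, z), p in zip(vertices, pixels)]


depth = [[1.0, 1.0], [1.0, 2.0]]
vertices, faces = depth_to_mesh(depth, focal=1.0, cx=0.5, cy=0.5)
print(len(vertices), faces)

# Adding a closer point on the same pixel as the last grid vertex occludes that vertex.
mask = visible_vertices(vertices + [(0.25, 0.25, 0.5)], focal=1.0, cx=0.5, cy=0.5)
print(mask)
```

The resulting boolean list plays the role of the visibility mask of step (j); in the method it is computed after projecting the mesh with the optimal camera parameters and position.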

At least one of the plurality of modules may be implemented using an AI model. An AI-related function may be performed through a non-volatile memory, a volatile memory, and a processor. The above-described method, performed by an electronic device, may be carried out using an artificial intelligence model.

The processor may include one or more processors. The one or more processors may be a general-purpose processor, such as a central processing unit (CPU) or an application processor (AP); a graphics-only processing unit, such as a graphics processing unit (GPU) or a vision processing unit (VPU); and/or an AI-dedicated processor, such as a neural processing unit (NPU).

The one or more processors control the processing of input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or AI model is provided through training or learning.

Here, being provided through learning means that a learning algorithm is applied to a plurality of training data to produce a predefined operating rule or AI model with a desired characteristic. The learning may be performed on the device itself in which the AI according to an embodiment runs, and/or may be implemented through a separate server/system.

An AI model may include a plurality of neural network layers. Each layer has a plurality of weight values and performs its computation on the output of the previous layer using those weights. Examples of neural networks include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GANs), and deep Q-networks.
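The layer-by-layer computation described above can be illustrated with a minimal fully connected network. The bias terms, the ReLU activation, and the specific weight values are assumptions chosen for this sketch; the text itself only states that each layer combines the previous layer's output with its weights.

```python
def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of the previous layer's outputs, then ReLU."""
    outputs = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, pre_activation))
    return outputs


def forward(inputs, layers):
    """Feed the input through each (weights, biases) pair in order."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 outputs
    ([[1.0, 1.0]], [0.5]),                    # layer 2: 2 inputs -> 1 output
]
print(forward([2.0, 1.0], layers))
```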

A learning algorithm is a method of training a predetermined target device (for example, a robot) on a plurality of training data so as to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

An AI model may be obtained by training. Here, "obtained by training" means that a predefined operating rule or AI model configured to perform a desired function (or purpose) is obtained by training a base AI model on multiple pieces of training data according to a learning algorithm.

Detailed description

A new approach to correcting head pose and face image (portrait) distortion is provided, based on perspective-aware 3D-GAN inversion. The proposed method corrects face distortion and head pose, for example in selfies or in any photographs (images) containing a face that are captured by a camera. The method processes an original image and produces a corrected image. The invention determines the texture, geometry, and details of the face at different camera positions, yielding selfies with corrected face distortion and corrected head pose. The invention makes it possible to correct the rotation of a person's head or to remove perspective distortion from the original image. The proposed method can efficiently correct camera-captured portraits (for example, selfies taken in real time or portraits selected by the user from the memory of a device such as a smartphone) to improve the images aesthetically.

The proposed method can be used in imaging applications to normalize portraits, including for further processing in image/video analysis applications, and to improve the quality of video streamed from a front camera, for example in video conferencing or video blogging applications.

The proposed invention can be used on a smartphone, laptop, smart device or any electronic device capable of computer image analysis that is connected to an RGB camera or receives RGB images from another source. The proposed method can be implemented on a computer with a GPU. The minimum required set of components includes a data storage device, a processor for processing information, and a graphics processor.

The proposed method can be stored as computer code on a machine-readable medium and can be implemented on a computer.

The proposed invention combines geometric and generative techniques, offering a robust solution for generating high-quality, identity-preserving images from new viewpoints, for example in the context of selfies. Because of the inherent limitations of 3D data, known approaches based on 3D warping (the 3D warping pipeline) tend to miss occluded areas. Known generative approaches (the generative pipeline), meanwhile, cannot guarantee that the person's identity is preserved, since the generated face may differ from the original one.

The proposed method combines the strengths of the generative and warping-based paradigms: it relies on the power of a generative neural network while retaining most of the 3D-based warping approach (that is, the original frame is deformed using some 3D information). The term "warping" denotes deformation of the original frame; the term "generative" denotes artificial generation of an image.

As mentioned earlier, the proposed invention makes it possible to correct the rotation of a person's head or to eliminate perspective distortion in the original image. For example, the head is turned to the left in the original image, and the user wants to obtain from it an image in which the head is turned to the right. With any such manipulation, areas of the face that were not visible in the original image become visible in the final image; that is, previously occluded areas of the face are visible when the face is viewed from the new angle. The occluded areas are taken from the generated image, which makes it possible to restore the invisible parts of the face and head. The visible parts of the face, meanwhile, are reprojected rather than generated, so the person's identity is preserved. Generation means obtaining a new, synthetic image. The reprojected image is the original image after "3D warping".

The superior performance of the proposed method has been demonstrated on face undistortion benchmarks and on a self-collected head rotation dataset called HeRo, which contains a number of identities with various head poses and camera-to-face distances. The HeRo dataset was collected by the authors of the present invention and can be used to evaluate methods aimed at changing a person's head pose.

Fig. 1 illustrates examples of selfie editing results according to the proposed invention. The top row (originals) shows selfie photographs taken by users; these are the original photographs fed to the GAN. The bottom row (corrected) shows the images obtained after processing according to the present invention. As Fig. 1 shows, the present invention can seamlessly change the head pose and eliminate perspective distortion, producing photorealistic and detailed corrected portraits.

As will be described in detail below, the novelty of the proposed invention lies in using a 3D-GAN to produce a new generated image and a depth map, and then constructing a 3D mesh from the depth map. Based on this 3D mesh, the original image is warped to obtain a rendered image, and visibility analysis determines which elements should be taken from the new generated image and which from the rendered image. Finally, the generated image and the rendered image are blended to obtain the final result.

Prior-art pure warping methods based on a 3D mesh cannot recover the invisible parts of the face. Typically, these areas are the cheeks and ears in the case of distortion correction, and the hidden side of the face in the case of head rotation. Depending on the required degree of portrait correction, the areas to be inpainted can range from a few pixels to a considerable part of the image.

The proposed invention uses the GAN both to produce a new generated (synthesized) image (to restore invisible facial details) and to recover depth information for warping based on the 3D mesh. As a result, the new generated image and the rendered image are consistent with each other, so they can easily be blended using a visibility mask, as will be described below.
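The role of the visibility mask can be illustrated with a minimal sketch. It assumes a simple per-pixel linear blend, which is an assumption made for illustration and not necessarily the blending used by the invention; the function name `blend_with_visibility` and the toy 2x2 grayscale images are hypothetical.

```python
def blend_with_visibility(rendered, generated, visibility):
    """Per-pixel linear blend (an assumed, illustrative scheme).

    rendered:   2D list, the original image warped via the 3D mesh.
    generated:  2D list, the image synthesized by the GAN.
    visibility: 2D list of weights in [0, 1]; 1 means the pixel was
                visible in the source photo, 0 means it was occluded.
    """
    blended = []
    for r_row, g_row, v_row in zip(rendered, generated, visibility):
        # Visible pixels keep the reprojected content (identity preserved);
        # occluded pixels are filled from the generated image.
        blended.append([v * r + (1.0 - v) * g
                        for r, g, v in zip(r_row, g_row, v_row)])
    return blended


rendered = [[0.8, 0.6], [0.4, 0.0]]    # 0.0 marks a hole left by warping
generated = [[0.7, 0.5], [0.5, 0.3]]
visibility = [[1.0, 1.0], [0.5, 0.0]]  # soft edge, then fully occluded

result = blend_with_visibility(rendered, generated, visibility)
```

In this toy case the top row is copied from the rendered image, the hole is filled entirely from the generated image, and the soft-edge pixel is an equal mix of both.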

The proposed method is shown schematically in Fig. 2. The proposed method comprises the following steps:

1) The user selects for editing an image containing at least one face image. The image is selected from the device memory, or the user takes a photograph with the camera and selects that photograph for editing. The photograph may be a selfie portrait or any other image containing at least one face. The user may select at least one face image in the photograph chosen for editing, and may also choose arbitrary new camera parameters and position, i.e. a new head rotation angle for the selected image (as will be described in step 5).

2) One face image in the selected image is detected and segmented to obtain the input face image ("Input" in Fig. 2), which is a segmented face image. Detection consists in finding the face images in the selected image. Segmentation consists in cutting the image of one face out of the selected image. The proposed method edits a single face image. To edit all face images, the method must be applied to each face image separately. The face image to be edited can be selected by the user in step 1, or all face images in the selected image are edited one after another. Segmentation is performed using a suitable preprocessing pipeline taken from the corresponding GAN; such pipelines are known in the art (see, for example, source [6]).

3) Initialization. To render images of the same face rotated to a different angle (a new face angle) using the 3D-GAN, it is first necessary to obtain the original face latent code of the segmented face image, as well as the original camera parameters and position for the segmented face image.

The camera position is represented by a matrix that defines the position of the camera in space. It can be represented as a rotation and translation matrix defining the projective transformation from a point X in 3D space to a point x (2D coordinates in image space).

The camera parameters are a matrix characterizing the internal (intrinsic) parameters of the camera. For example, in the simple case a perspective camera has 4 parameters that define the projective transformation onto the image plane:

- Cx and Cy, which define the projection of the camera's optical center onto the image plane,

- Fx and Fy, the focal length of the camera in pixels (note that Fx and Fy may differ, since pixels are not necessarily square; for square pixels, Fx = Fy).
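How the camera position (rotation and translation) and the four intrinsic parameters combine can be sketched with the standard pinhole camera model. The sketch below is illustrative only: the function name `project_point` and the numeric values are hypothetical, and the 3D-GAN's actual camera convention may differ.

```python
def project_point(X, R, t, fx, fy, cx, cy):
    """Project a 3D point X to pixel coordinates with a pinhole camera.

    R (3x3, list of rows) and t (length 3) are the camera position
    (rotation and translation); fx, fy, cx, cy are the intrinsics in pixels.
    """
    # World -> camera coordinates: X_cam = R @ X + t
    xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i]
                  for i in range(3))
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then scaling by the focal length and
    # shifting by the optical center
    return fx * xc / zc + cx, fy * yc / zc + cy


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point([0.1, -0.05, 0.0], IDENTITY, [0.0, 0.0, 2.0],
                     fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)
```

The division by the depth `zc` is what produces perspective distortion: parts of the face closer to the camera are magnified more than parts farther away, and the effect grows as the camera approaches the face.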

More complex camera models with a larger number of parameters are known in the art, but they are not relevant to the present invention. The original camera parameters and position are initially unknown and are determined using a neural network (as described below). The original face latent code and the original camera parameters and position for the segmented face image are obtained as follows:

a) The original camera parameters and position are determined from the segmented face image. Specifically, the segmented face image is fed to the input of a neural network that predicts the camera parameters and position (Deep3D FaceRecon in Fig. 2). After processing, this neural network outputs the original camera parameters and position, c₀, where c₀ is a tensor describing the camera parameters and position. Processes and suitable neural networks for determining camera parameters and position are known in the art (see, for example, source [7]). Fig. 2 shows the known neural network Deep3D FaceRecon [7] as an example.

b) The original face latent code (denoted w₀ in Fig. 2) is determined from the segmented face image. The segmented face image is fed to the input of a neural network that predicts the face latent code, such as TriPlaneNet in Fig. 2. After processing, this neural network outputs the original face latent code w₀ (the TriPlaneNet neural network is shown as an example).

The latent code is a tensor describing the image. The use of latent codes is known in the art (see, for example, source [21]). Processes for determining the face latent code are known in the art (see, for example, source [1]).

TriPlaneNet is trained to predict a latent code for the 3D-GAN such that, using this latent code and given camera parameters, an image close to the original can be obtained. Accordingly, it is trained on photographs (real or synthetic) of different people to predict the latent code from an image.

Steps (a) and (b) are performed in parallel and are together called the initialization process ("Initialization" in Fig. 2).

4) Optimization. With the original parameters w₀ and c₀, the 3D-GAN part (a generative neural network (generative pipeline)) does not produce an output image that is sufficiently similar to the segmented face image. Several additional optimization steps therefore have to be performed in the 3D-GAN to obtain a generated output image ("Generated Image" in Fig. 2) that closely resembles the input segmented face image. The following optimization process is used ("Optimization" in Fig. 2), in which:

the predicted camera parameters and position and the predicted face latent code are fed to the input of the 3D-GAN,

a generated image is obtained at the output of the 3D-GAN,

the loss function ("Loss function" in Fig. 2) between the segmented face image and the generated image is computed,

the predicted camera parameters and position and the predicted face latent code are modified,

the modified camera parameters and position and the modified face latent code are fed to the input of the 3D-GAN.

This iterative optimization process is carried out until the loss function reaches its minimum. The camera parameters and position and the face latent code that satisfy this condition are the optimal camera parameters and position and the optimal face latent code. With the parameters obtained, the person's identity in the original image is preserved in the image produced by the 3D-GAN; that is, the optimal camera parameters and position and the optimal face latent code correspond to the original segmented face image.

Note that the 3D-GAN weights are not changed during the iterative optimization; only w and c are changed, for example by gradient descent on the loss function. The gradients are computed by backpropagating the error through the 3D-GAN network. This determines which components of the vectors w and c contribute most to the error, and those components are adjusted to reduce it.
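The idea of optimizing only the inputs of a frozen generator can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: the linear `generate` function replaces the 3D-GAN, scalars replace the tensors w and c, a squared-error loss replaces the real loss, and central finite differences replace backpropagation. The learning rate and step count are likewise arbitrary toy values.

```python
THETA = ((1.0, 0.5), (0.2, 1.0), (0.7, 0.3))  # frozen "generator weights"


def generate(w, c, theta=THETA):
    """Toy stand-in for G(w, c; theta): returns a 3-pixel 'image'."""
    return [a * w + b * c for a, b in theta]


def loss(target, image):
    """Squared-error stand-in for the image similarity loss."""
    return sum((p - q) ** 2 for p, q in zip(target, image))


def numeric_grad(fn, value, eps=1e-6):
    """Central finite difference, used here in place of backpropagation."""
    return (fn(value + eps) - fn(value - eps)) / (2 * eps)


def invert(target, w0, c0, lr=0.1, steps=500):
    """Gradient descent over the inputs (w, c) only; THETA never changes."""
    w, c = w0, c0
    for _ in range(steps):
        grad_w = numeric_grad(lambda v: loss(target, generate(v, c)), w)
        grad_c = numeric_grad(lambda v: loss(target, generate(w, v)), c)
        w, c = w - lr * grad_w, c - lr * grad_c
    return w, c


target = generate(2.0, -1.0)           # "photo" produced by known inputs
w_opt, c_opt = invert(target, w0=0.0, c0=0.0)
```

Starting from a poor initial guess, the loop recovers inputs close to the ones that produced the target, while the generator weights stay untouched throughout.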

The loss function indicates how dissimilar two images are: the larger its value, the less similar the images are to each other.

In the proposed invention, the loss function consists of two components:

- the LPIPS loss, a measure of the similarity of one image to another that is close to human perception of images (known in the art, see https://richzhang.github.io/PerceptualSimilarity/), and

- the landmark loss, defined as the sum of the distances between key points of the face.

The loss function can be computed in any manner known in the art, provided that it satisfies the main condition: it must measure the degree of similarity of two images.

Example of optimizing the camera position and parameters (c₀):

Following DisCo [3], the initial camera translation t_z0 (the initial distance from the camera to the center of the face (head)) is set sufficiently small. t_z0 is the camera-to-face-center distance used when optimizing the focal length f of the camera; this distance is changed during optimization. The distance t_z0 is chosen experimentally. If it is too small (the camera is too close to the face), the face extends beyond the image; if it is too large (the camera is too far from the face), convergence also suffers, since the optimization may settle in a local minimum that does not correspond to the real distance. Ideally, the initial distance should be one that ensures stable convergence of the algorithm. It is selected experimentally on a large number of photographs as the best initial approximation.

As t_z (the distance from the camera to the center of the face (head)) changes, the focal length f of the camera changes from its initial value f₀ (in this example, f represents the camera position and parameters (c₀)). Each time t_z changes, a new value of f is obtained. The estimated depth d₀ to the eyes in the original face image is used here and remains unchanged. The depth d₀ is estimated for the segmented face image and is the distance from the camera to the eyes in that image. Processing the data containing the new value of f with the 3D-GAN yields a new image. The loss function is computed from the new image, as will be described below.

The focal length f of the camera is computed from f₀ and t_z using the following quantities:

d₀ - the estimated depth to the eyes,

t_z0 - the initial distance from the camera to the center of the face (head).

In detail: after the initialization process, the face latent code and the camera position and parameters are adjusted jointly (using an iterative process known in the art, such as gradient descent). Expression (1) below states that the optimal face latent code and the optimal camera parameters and position, $(w^*, c^*)$, are those at which the loss function L reaches its minimum:

$$(w^*, c^*) = \arg\min_{w,\,c}\; L\big(x,\; G(w, c; \theta)\big) \tag{1}$$

where:

$G(w, c; \theta)$ is the image produced by the pre-trained 3D-GAN, parameterized by weights θ, from the latent code w and the camera parameters c;

x is the input segmented face image.

More precisely, the objective is defined as a combination of an LPIPS loss, which compares the perceptual similarity of the input image x and the generated image $G(w, c; \theta)$, and a loss on facial landmarks, which are specific points in the image (for example, on the face contour and the eyes); that is, the sum of distances between the key points of the face in the input segmented face image and in the generated image is compared:

$$L = \lambda_1\, \mathrm{LPIPS}\big(x,\; G(w, c; \theta)\big) + \lambda_2\, L_{lmk}\big(f(x),\; f(G(w, c; \theta))\big)$$

where:

L is the total loss, the sum of two losses (the LPIPS loss and the landmark loss) with coefficients,

x is the input segmented face image,

G is the 3D-GAN, which takes as input the face latent code w and the camera position and parameters c;

G depends on the GAN parameters θ;

$\lambda_1, \lambda_2$ are the weighting coefficients with which the losses are summed;

LPIPS is the loss that compares the perceptual similarity of the input image x and the generated image $G(w, c; \theta)$,

$L_{lmk}$ is the landmark loss,

f(x) computes the landmarks in the image,

$f(G(w, c; \theta))$ computes the landmarks in the generated image.

The landmarks are determined on the segmented face image using a neural network (for example, the MediaPipe FaceMesh-V2 landmark estimation model from source [8]). During optimization, a new image is generated using the 3D-GAN so that the landmarks in the generated image have the same locations as in the segmented face image. The landmark loss is the sum of the distances between the key points of the face in the original image and in the generated image being compared. In the formula below, the landmark loss is the sum of the squared distances:

The two landmark sets are the (x, y) point coordinates in the generated image and in the reference (original) image, respectively, and ||M|| is the number of such coordinate sets.

The loss accounts for the total deviation of all (x, y) point coordinates from the reference ones, that is, from the point coordinates of the original image: the greater the distance between keypoints, the higher the loss. The two point sets are normalized 3D keypoints, and no additional constraints based on landmark uncertainty are imposed. The latent code and the camera parameters are optimized with a learning rate of 0.001 for 200 iterations.
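The loss described above can be illustrated with a minimal sketch. It assumes the landmark term is the squared Euclidean distance between matched points averaged over the number of points (the averaging is an assumption suggested by the mention of ||M||; the source only states a sum of squared distances). The names landmark_loss, total_loss, lambda_lpips and lambda_landmark are hypothetical, and the LPIPS value is passed in as a precomputed number rather than computed by a perceptual network.

```python
import numpy as np

def landmark_loss(m_gen, m_ref):
    """Mean over landmarks of the squared distance between matched points."""
    m_gen = np.asarray(m_gen, dtype=float)
    m_ref = np.asarray(m_ref, dtype=float)
    if m_gen.shape != m_ref.shape:
        raise ValueError("landmark arrays must have the same shape")
    sq_dist = np.sum((m_gen - m_ref) ** 2, axis=-1)  # one squared distance per landmark
    return float(np.mean(sq_dist))                   # averaging is an assumption

def total_loss(lpips_value, landmark_value, lambda_lpips=1.0, lambda_landmark=1.0):
    """Weighted sum of the two terms; the default weights are placeholders."""
    return lambda_lpips * lpips_value + lambda_landmark * landmark_value

generated = [[0.0, 0.0], [1.0, 1.0]]   # landmarks found in the generated image
reference = [[0.0, 0.0], [1.0, 3.0]]   # landmarks found in the input image
lmk = landmark_loss(generated, reference)
loss = total_loss(lpips_value=0.5, landmark_value=lmk, lambda_landmark=0.1)
```

The further the generated landmarks drift from the reference ones, the larger the second term becomes, which is the behavior the text describes.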

Instead of the exhaustive optimization-based GAN inversion used in DisCo [3], the inversion task is solved using an encoder network E that maps a real image to a latent code. The camera-conditioned encoder TriPlaneNet [1] is chosen because it can separate geometry from camera effects, which is important in this case.

During the optimization, gradients are computed alternately with respect to the latent face code and the camera, and the two are optimized in round-robin fashion, that is, one after the other. In the first step the camera parameters are fixed and the latent code is optimized; in the second step, conversely, the latent code is fixed and the camera parameters and position are optimized. This is repeated until the optimal latent face code and the optimal camera position and parameters are reached. This is only one possible optimization strategy and it need not be followed; any other known method can be used. Experiments show that the alternating update strategy gives better results than the joint optimization used in DisCo.
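The alternating scheme can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the real objective (LPIPS plus landmark loss through a 3D-GAN) is replaced by a small quadratic with hand-written gradients, and the function names are hypothetical. The default learning rate and iteration count mirror the values quoted in the text, while the toy run uses larger values so that it visibly converges.

```python
import numpy as np

def alternate_optimize(grad_w, grad_c, w0, c0, lr=0.001, iterations=200):
    """Gradient descent that updates w and c in turn, never both at once."""
    w = np.array(w0, dtype=float)
    c = np.array(c0, dtype=float)
    for step in range(iterations):
        if step % 2 == 0:
            w = w - lr * grad_w(w, c)   # camera fixed, latent code updated
        else:
            c = c - lr * grad_c(w, c)   # latent code fixed, camera updated
    return w, c

# Toy objective (an assumption): f(w, c) = (w - 2)^2 + (c + 1)^2 + 0.5 * w * c
def toy_grad_w(w, c):
    return 2.0 * (w - 2.0) + 0.5 * c

def toy_grad_c(w, c):
    return 2.0 * (c + 1.0) + 0.5 * w

w_opt, c_opt = alternate_optimize(toy_grad_w, toy_grad_c, [0.0], [0.0],
                                  lr=0.1, iterations=1000)
```

For this toy objective the minimum lies at w = 2.4, c = -1.6, and the alternating updates reach it even though the two variables are coupled.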

The following steps relate to novel view synthesis (see Fig. 2). New Generated Image and Depth Map. The optimal latent face code obtained in the previous step, together with arbitrary new camera parameters and position (the optimal latent code and c_novel in Fig. 2), is fed to the input of the 3D-GAN. The new camera parameters and position c_novel correspond to a new view of the same face and can be chosen either arbitrarily or by the user when selecting an image for editing (step 1). For example, to choose c_novel the user can use sliders on the screen of the electronic device to set the head rotation sideways, up and down, as well as the distance from the center of the head to the center of the camera.
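One possible way to turn such slider values into camera parameters is sketched below. This is an assumption, not the mapping used in the patent: the camera is placed on a sphere around the head center (taken as the world origin) and oriented to look at it, with the camera axes chosen as x right, y down, z forward. The function name camera_from_sliders and the angle conventions are hypothetical.

```python
import numpy as np

def camera_from_sliders(yaw_deg, pitch_deg, distance):
    """Return (R, t, position) for a camera orbiting the head center at the origin.

    R and t map world points to camera coordinates: p_cam = R @ p_world + t.
    """
    yaw, pitch = np.deg2rad(yaw_deg), np.deg2rad(pitch_deg)
    position = distance * np.array([
        np.sin(yaw) * np.cos(pitch),
        np.sin(pitch),
        np.cos(yaw) * np.cos(pitch),
    ])
    forward = -position / np.linalg.norm(position)  # unit vector from camera to head center
    world_up = np.array([0.0, 1.0, 0.0])
    right = np.cross(forward, world_up)
    right /= np.linalg.norm(right)                  # undefined for pitch of +/-90 degrees
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])            # rows are the camera axes in world space
    t = -R @ position
    return R, t, position

R, t, position = camera_from_sliders(yaw_deg=30.0, pitch_deg=10.0, distance=2.5)
```

With this convention the head center always lands on the optical axis at depth equal to the chosen distance, so the distance slider controls the strength of the perspective effect while the two angles control the viewing direction.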

The 3D-GAN produces a new generated image (New Generated Image in Fig. 2) showing a face that corresponds to the segmented face image, but in the new view defined by the new camera parameters and position, that is, from a different angle.

The new generated image can contain new, previously invisible parts of the face (for example, the right ear when the head is turned to the left) while preserving the quality (Original Resolution) of the original segmented face image.

In addition, the 3D-GAN predicts the depth map (Depth Map in Fig. 2) of the new generated image.

b) 3D Mesh. The depth map is processed using a deterministic 3D mesh-based warping approach, for example the one described in [9]. The 3D mesh of the new generated image (a 3D representation of the face as a polygonal mesh) is built from the predicted depth map (3D Mesh in Fig. 2).
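A minimal sketch of building a mesh from a depth map follows. It assumes a pinhole camera with hypothetical intrinsics fx, fy, cx, cy, treats each depth value as the z-coordinate of the pixel in camera space, and connects neighboring pixels into two triangles per pixel quad. It does not reproduce the specific procedure of [9].

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Back-project a depth map to vertices and connect neighboring pixels into triangles."""
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                 # pixel row and column indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    idx = np.arange(h * w).reshape(h, w)      # vertex index of every pixel
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=1),       # upper-left triangle of each quad
        np.stack([tr, bl, br], axis=1),       # lower-right triangle of each quad
    ], axis=0)
    return vertices, faces

vertices, faces = depth_to_mesh(np.ones((3, 4)), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

An H x W depth map yields H*W vertices and 2*(H-1)*(W-1) triangles, so the mesh resolution follows the depth map resolution directly.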

7) 3D mesh-based warping, rendered image. This 3D mesh, which is a parametric 3D approximation of the shape and surface of a three-dimensional object, is then projected onto the image plane using the optimal camera parameters and position.

This projection shows how the 3D polygonal mesh (the geometric representation of the face in 3D space) would appear if this face pose were captured with the optimal camera parameters and position predicted for the original segmented face image (arrows from the 3D mesh and from the optimal camera parameters to "3D mesh-based warping" in Fig. 2).
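The projection step can be illustrated with a standard pinhole model. The sketch below is an assumption about the camera model (intrinsic matrix K, world-to-camera rotation R and translation t, camera looking along its +z axis); the patent text does not specify these conventions, and the name project_points is hypothetical.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project 3D world points to pixel coordinates with a pinhole camera."""
    points = np.asarray(points, dtype=float)
    cam = points @ R.T + t            # world -> camera coordinates
    uvw = cam @ K.T                   # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # perspective division by depth

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pixels = project_points([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]], K, np.eye(3), np.zeros(3))
```

Projecting the mesh vertices with the camera estimated for the input photograph gives, for every vertex, the pixel of the input image from which its color can be taken.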

The rendered image (Rendered Image in Fig. 2) is formed on the basis of this projection. The rendered image is produced using only pixels of the segmented face image: regions that were not visible in the segmented face image are not generated.

The rendered image shows how the face would be seen from a new angle, that is, with the chosen new camera parameters and position (c_novel).

In other words, the rendered image is obtained by transforming the original image using mesh-based warping (in a manner known from the prior art). Figuratively speaking, the original segmented face image is stretched over the 3D mesh, so that the 3D mesh is painted in the colors of the original segmented face image. This painted 3D mesh is then displayed from a different viewpoint (according to the new camera parameters c_novel).

This can expose parts that were not visible in the original segmented image; in that case "holes" appear in the image, and they are identified by analyzing which parts are visible. These holes are filled with parts of the generated image at the blending step (see step 9).

8) Vertex visibility analysis, visibility mask. The visibility (visible or not) of each vertex of the 3D polygonal mesh is determined when the mesh is projected onto the image plane with the optimal camera parameters and position (arrows to "Vertex Visibility Analysis" in Fig. 2). Specifically, the visibility of each mesh vertex is first computed using the z-buffer of the rasterization method. Then, mesh faces that are nearly orthogonal to the viewing direction (forming an angle greater than 80°) are filtered out.

Since the face is represented as a 3D (polygonal) mesh, some vertices of this mesh are not visible from certain angles. Based on this analysis, the visible vertices (those that are not occluded; for example, when a person's left side is turned to the camera, the right side is not visible) are selected as those for which the normal of the corresponding mesh triangle faces the camera (the normal to a triangle is the direction perpendicular to the plane of the triangle). The visibility mask is generated in accordance with the above analysis.
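The normal-based part of this test can be sketched as follows. The example covers only the angle criterion; the z-buffer occlusion test mentioned above is omitted. It assumes counter-clockwise triangle winding when seen from outside, non-degenerate triangles, and a known camera center; the function name visible_faces is hypothetical.

```python
import numpy as np

def visible_faces(vertices, faces, camera_center, max_angle_deg=80.0):
    """Mark faces whose normal is within max_angle_deg of the direction to the camera."""
    vertices = np.asarray(vertices, dtype=float)
    tri = vertices[np.asarray(faces)]                       # (F, 3, 3) triangle corners
    normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    to_cam = np.asarray(camera_center, dtype=float) - tri.mean(axis=1)
    to_cam /= np.linalg.norm(to_cam, axis=1, keepdims=True)
    cos_angle = np.sum(normals * to_cam, axis=1)
    return cos_angle > np.cos(np.deg2rad(max_angle_deg))

triangle = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # normal points along +z
front = visible_faces(triangle, [[0, 1, 2]], camera_center=[0.3, 0.3, 5.0])
back = visible_faces(triangle, [[0, 1, 2]], camera_center=[0.3, 0.3, -5.0])
grazing = visible_faces(triangle, [[0, 1, 2]], camera_center=[10.0, 0.33, 0.1])
```

A face seen from the front passes the test, while a back-facing face and a face seen at a grazing angle beyond 80° are both rejected, which matches the filtering rule in the text.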

The use of visibility masks in computer graphics and rendering is known from the prior art (see, for example, VM Visibility Mask (telecomtrainer.com)). A visibility mask (VM) is a technique used to optimize the rendering process by selectively rendering only the visible parts of a scene. The visibility mask is represented by a buffer or data structure that tracks the visibility of individual pixels or fragments of the scene. A coarse mesh is obtained by connecting neighboring vertices on the grid. It is further refined by bilateral smoothing to avoid unnatural sharp corners in the rendered image; a kernel of size 5 is used. Finally, the mesh is projected using the estimated original camera position to obtain texture coordinates, and the texture is resampled to obtain the new view.

9) Blending, result. The new generated image and the rendered image are blended using the visibility mask (Blending in Fig. 2). These two images are well aligned in structure, which allows them to be composited with minimal effort. Image blending is a technique known from the prior art that allows part of one image to be inserted into another so that the composite looks natural, without seams at the boundaries of the inserted region. The blending is performed by adding the two images (the new generated image and the rendered image) with different weights, where the weights are taken from the visibility mask, together with smoothing of the edges of the two images being added. Such methods are known from the prior art, for example Laplacian pyramid blending: http://graphics.cs.cmu.edu/courses/15-463/2005_fall/www/Lectures/Pyramids.pdf.
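A simplified version of this weighted sum is sketched below. It replaces Laplacian pyramid blending with a single-level blend whose weights come from a feathered visibility mask; the box filter and its kernel size are assumptions chosen for brevity, and the function names are hypothetical.

```python
import numpy as np

def box_blur(mask, k=5):
    """Feather a 2D mask with a k x k box filter (edge-padded); k must be odd."""
    mask = np.asarray(mask, dtype=float)
    pad = k // 2
    padded = np.pad(mask, pad, mode="edge")
    out = np.zeros(mask.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out / (k * k)

def blend(rendered, generated, visibility_mask, k=5):
    """Use rendered pixels where the mask is 1 and generated pixels where it is 0."""
    weight = box_blur(visibility_mask, k)[..., None]   # soft weights in [0, 1]
    return weight * rendered + (1.0 - weight) * generated

rendered = np.ones((4, 8, 3))        # stands in for the warped input pixels
generated = np.zeros((4, 8, 3))      # stands in for the 3D-GAN output
mask = np.zeros((4, 8))
mask[:, :4] = 1.0                    # left half visible in the input
result = blend(rendered, generated, mask)
```

Visible regions keep the pixels of the input photograph, holes take the generated pixels, and the feathered band between them avoids a hard seam.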

This operation synthesizes the regions that were not visible from the original viewing angle of the face but became visible from the changed angle (Result in Fig. 2).

The described method focuses exclusively on cropped face regions, but it can easily be extended to full-frame images, for example by following the steps described in [3]. Faces are detected using the solution of [4], and a mask is applied using MODNet [5].

EXPERIMENTS

Datasets

Face undistortion was evaluated on two different publicly available datasets. Caltech Multi-Distance Portraits (CMDP) [9] contains frontal portraits of 53 people with varying facial attributes, each photographed from seven distances. The in-the-wild images of [3] are heavily distorted portraits collected from the web. Since no reference or ground truth (GT) images exist for them, these images were used only for qualitative comparison.

Head rotation dataset

A Head Rotation (HeRo) dataset was collected, containing portraits of 19 people with differing attributes (glasses, facial hair, facial expressions). In total there were 68 series of four photographs each.

The portraits were taken with the front cameras of four Samsung Galaxy S23FE smartphones mounted on a tripod (Fig. 3). Figure 3 illustrates the setup used to collect the dataset, together with samples from it. The invention was subsequently tested on this dataset. The collected photographs show, for example, how a person should look from the left given the frontal view. Accordingly, the proposed method was applied to the frontal photograph to correct the head pose to the left, top and right views, and the results were compared with the real photographs. All devices were synchronized with 1, so that each series of four images was captured simultaneously.

Metrics

Four standard evaluation metrics were used to assess portrait perspective correction. Specifically, photometric errors between the aligned output images and the corresponding references were computed, including PSNR, SSIM and LPIPS [10]. In addition, identity preservation was assessed using the ID metric, which is the cosine distance between the predicted and reference facial features from ArcFace [11].
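Two of these metrics are simple enough to sketch directly. The example below assumes images scaled to [0, 1] and takes the ID metric as one minus the cosine similarity of two feature vectors; the feature vectors are plain arrays here, standing in for ArcFace embeddings that are not computed in this sketch. SSIM and LPIPS are omitted because they need dedicated implementations.

```python
import numpy as np

def psnr(img_a, img_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    img_a = np.asarray(img_a, dtype=float)
    img_b = np.asarray(img_b, dtype=float)
    mse = np.mean((img_a - img_b) ** 2)
    if mse == 0.0:
        return float("inf")              # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def cosine_distance(feat_a, feat_b):
    """One minus the cosine similarity of two feature vectors."""
    feat_a = np.asarray(feat_a, dtype=float)
    feat_b = np.asarray(feat_b, dtype=float)
    cos_sim = feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return float(1.0 - cos_sim)

score = psnr(np.zeros((8, 8, 3)), np.full((8, 8, 3), 0.1))
id_same = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
id_diff = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

A uniform error of 0.1 on a [0, 1] image gives a PSNR of 20 dB, and feature vectors pointing in the same direction give a cosine distance of zero regardless of their length.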

RESULTS

Correcting facial distortion

Table 1 presents an evaluation of the claimed approach against competing methods of facial distortion correction. The claimed method clearly outperforms the others on most metrics, especially in preserving facial features. By sampling pixels from the original image, the claimed solution preserves important identity details such as eye color, wrinkles, earrings and so on.

Table 1: Quantitative comparison of facial distortion correction methods on CMDP [9]. The best results are highlighted in bold.

In Table 1 the method disclosed here is compared with its competitors. Four metrics that compare images pairwise are used: PSNR and SSIM compare photographic quality, LPIPS compares the perceptual similarity of the images, and ID compares the similarity of identities. An up arrow means that higher is better, and a down arrow means that lower is better. The proposed method outperforms its competitors on all metrics.

Portraits corrected by the different approaches, including the claimed solution, are shown in Fig. 4, which compares the distortion correction method disclosed here with other methods. The claimed method copes very well with heavily distorted faces. The neural network not only successfully restores occluded regions but also preserves important identity details, as highlighted in the enlarged fragments. Two samples are shown: the first column contains the input image and the last column contains the target result. The second and fourth rows show details of rows 1 and 3, illustrating the effectiveness of the method disclosed here. The proposed method preserves details such as earrings and eyes, whereas 3DP, HFGI3D, TriPlaneNet and DisCo do not. At the same time, rows 1 and 3 show that the perspective distortion has also been corrected, in contrast to the methods of Fried and Shin.

The warping-based solutions [9] and [12] do not appear to change the input significantly. In contrast, 3DP introduces noticeable changes but amplifies distortions: the middle part of the face shows less distortion, while the head and chin are deformed. The generative TriPlaneNet and DisCo produce a face that looks noticeably different. HFGI3D [2] is able to preserve identity by combining warping and generation, but it creates visual artifacts and tends to over-smooth the images. The claimed method generates faces with less perspective distortion while preserving identity, as shown in Fig. 5. The proposed method (Proposed method in Fig. 5) produces a realistic result, generates new, previously invisible regions and preserves the original quality. It successfully balances the generation of new parts of the face against the preservation of identity.

To find the necessary and sufficient number of optimization steps, the number of iterations is varied and the resulting quality is evaluated. As the plots in Fig. 6 show, the number of iterations used to optimize the latent face code and the camera parameters affects the quality of the output. The x-axis shows the number of iterations, and the y-axis shows the quality metric of the algorithm. For all quality metrics, about 200 optimization steps are needed for the method to reach a plateau in quality. Optimal quality is achieved within the first 100 iterations, while further optimization gives a slight increase in ID but does not improve LPIPS, SSIM or PSNR.

PSNR peaks after 100 iterations and then decreases slightly as the number of iterations grows. Accordingly, 200 iterations are used by default. Note that the DisCo baseline requires approximately 1200 iterations of comparable complexity.

Head pose correction

Table 2 presents an evaluation of the approach against competing methods of head pose correction. Qualitative results are shown in Fig. 7. The proposed method is able to change the head pose of a person while preserving the original image quality and the person's likeness better than any competing method. The main advantage of the proposed approach is the preservation of identity, as evidenced by its exceptional ID score.

Table 2: Quantitative comparison of head pose correction on the HeRo dataset.

In Table 2 the method disclosed here (ours) is compared with its competitors. Four metrics that compare images pairwise are used: PSNR and SSIM compare photographic quality, LPIPS compares the perceptual similarity of the images, and ID compares the similarity of identities. An up arrow means that higher is better, and a down arrow means that lower is better.

CONCLUSION

This application discloses a selfie editing method that removes facial perspective distortion in close-up face crops. The approach complements robust, physically correct 3D warping with the flexibility and expressiveness of a 3D generative neural network. The final image is a blend of a warped image obtained by mesh-based rendering and another image obtained with a 3D-GAN. Visibility-based blending is applied, in which visible regions are reprojected and occluded parts are restored by the generative model. Evaluation on face undistortion benchmarks and on a new head rotation dataset confirms that the claimed method gives more realistic results, with finer details and better identity preservation, than existing selfie editing techniques, and sets a new state of the art in facial distortion removal and head pose correction.

LIST OF INFORMATION SOURCES

[1] Ananta R. Bhattarai, Matthias Nießner, and Artem Sevastopolsky, "Triplanenet: An encoder for eg3d inversion," 2024.

[2] Jiaxin Xie, Hao Ouyang, Jingtan Piao, Chenyang Lei, and Qifeng Chen, "High-fidelity 3D-GAN inversion by pseudo-multi-view optimization," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 321-331, 2022.

[3] Zhixiang Wang, Yu-Lun Liu, Jia-Bin Huang, Shin'ichi Satoh, Sizhuo Ma, Gurunandan Krishnan, and Jian Wang, "DisCo: Portrait distortion correction with perspective-aware 3D-GANs," 2023.

[4] Valentin Bazarevsky, Yury Kartynnik, Andrey Vakunov, Karthik Raveendran, and Matthias Grundmann, "BlazeFace: Sub-millisecond neural face detection on mobile gpus," 2019.

[5] Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Rynson W.H. Lau, "Modnet: Real-time trimap-free portrait matting via objective decomposition," in AAAI, 2022.

[6] Eric Chan, Connor Z. Lin, Matthew Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J. Guibas, Jonathan Tremblay, S. Khamis, Tero Karras, and Gordon Wetzstein, "Efficient geometry-aware 3d generative adversarial networks," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16102-16112, 2021.

[7] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong, "Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 285-295, 2019.

[8] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, and Matthias Grundmann, "Mediapipe: A framework for perceiving and processing reality," in Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019, 2019.

[9] Ohad Fried, Eli Shechtman, Dan B Goldman, and Adam Finkelstein, "Perspective-aware manipulation of portrait photos," ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1-10, 2016.

[10] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang, "The unreasonable effectiveness of deep features as a perceptual metric," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586-595, 2018.

[11] Jiankang Deng, J. Guo, J. Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, pp. 5962-5979, 2018.

[12] YiChang Shih, Wei-Sheng Lai, and Chia-Kai Liang, "Distortion-free wide-angle portraits on camera phones," ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1-12, 2019.

[13] Meng-Li Shih, Shin-Yang Su, Johannes Kopf, and Jia-Bin Huang, "3d photography using context-aware layered depth inpainting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8028-8038.

[14] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or, "Pivotal tuning for latent-based editing of real images," ACM Transactions on Graphics (TOG), vol. 42, no. 1, pp. 1-13, 2022.

[15] Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwang seok Ryoo, and Seung Wook Kim, "3D-GAN inversion with pose optimization," 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2966-2975, 2022.

[16] Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, and Xuelong Li, "One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field," arXiv preprint, arXiv:2304.05097, 2023.

[17] Jiawei Yang, Marco Pavone, and Yue Wang, "FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization," Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[18] Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, and Jan Kautz, "Generalizable One-shot Neural Head Avatar," arXiv preprint.

[19] Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Ogras, and Linjie Luo, "PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360 degree," arXiv preprint, arXiv:2303.13071, 2023.

[20] Yue Wu, Yu Deng, Jiaolong Yang, Fangyun Wei, Qifeng Chen, and Xin Tong, "AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video Avatars," Advances in Neural Information Processing Systems (NeurIPS), 2022.

[21] Karras, T., Laine, S., & Aila, T. (2018). A Style-Based Generator Architecture for Generative Adversarial Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4396-4405. https://arxiv.org/abs/1812.04948

Claims

1. A method for editing a face image in an image, the method comprising:

a) selecting, by a user, an image for editing that contains at least one face image;

b) detecting face images in the selected image,

and segmenting one face image from among the detected face images;

c) feeding the segmented face image to the input of a neural network that predicts camera parameters and position for the segmented face image,

and obtaining initial camera parameters and position at the output of that neural network;

d) feeding the segmented face image to the input of a neural network that predicts a latent code of the face, and obtaining the latent code of the face at the output of that neural network;

e) performing an iterative optimization process using a 3D generative adversarial network (3D-GAN), the process comprising:

- feeding the predicted camera parameters and position and the predicted latent code of the face to the input of the 3D-GAN,

- obtaining an image generated by the 3D-GAN at its output,

- calculating a loss function between the segmented face image and the generated image,

- iteratively changing the predicted camera parameters and position and the predicted latent code of the face, and

- feeding the modified camera parameters and position and the modified latent code of the face to the input of the 3D-GAN,

wherein the iterative optimization process is carried out until said loss function reaches its minimum, the camera parameters and position and the latent code of the face at which the minimum is reached being the optimal camera parameters and position and the optimal latent code of the face;

f) feeding the optimal latent code of the face to the input of said 3D-GAN,

together with arbitrary new camera parameters and position;

g) generating, using said 3D-GAN, a new generated image containing a face image that corresponds to the segmented face image and shows the face in a new view corresponding to said new camera parameters and position,

and predicting a depth map of the new generated image;

h) processing the predicted depth map to construct a 3D mesh of the new generated image;

i) projecting the constructed 3D mesh onto the image plane with the optimal camera parameters and position,

and generating a rendered image based on this projection;

j) determining which vertices of the 3D mesh are visible and which are occluded after projection onto the image plane with the optimal camera parameters and position,

and obtaining a visibility mask based on said determination;

wherein steps (i) and (j) are performed in parallel;

k) blending the new generated image and the rendered image using the visibility mask,

to produce an edited segmented face image showing the face from a new angle;

l) transferring the obtained edited segmented face image into the selected image;

wherein steps (b)-(k) are performed for at least one face in the selected image; and

m) displaying the selected image with at least one edited segmented face image on the screen to the user.

2. The method according to claim 1, wherein the arbitrary new camera parameters and position are selected by the user.

3. The method according to claim 1 or 2, further comprising the step of

selecting, by the user, one face image for segmentation from among the detected face images in the selected image.

4. The method according to any one of claims 1-3, wherein steps (c) and (d) are performed in parallel.