RU2806009C2

RU2806009C2 - Method for constructing a depth map from a pair of images

Info

Publication number: RU2806009C2
Application number: RU2022110243A
Authority: RU
Inventors: Николай Романович Маслович; Дмитрий Александрович Яшунин; Илья Васильевич Дерендяев
Original assignee: Общество с ограниченной ответственностью "Сбер Автомотив Технологии" (ООО "СберАвтоТех")
Filing date: 2022-04-15
Publication date: 2023-10-25

Abstract

FIELD: image data processing.

SUBSTANCE: the invention aims to increase the accuracy of depth map values. This is achieved by a method comprising stages in which: first and second images are obtained from first and second cameras; the first and second images are rectified; first and second image tensors are generated, where each element of a tensor represents the brightness value of the corresponding pixel; the tensors are normalized; the tensors are combined by means of an encoder; by means of a transformer, vector representations of at least one object contained in the tensor obtained at the previous stage are compared line by line to form a tensor containing information about the shifts of image pixels relative to each other; by means of a decoder, first and second shift maps containing said shift values are generated for the first and second images from the tensor obtained at the previous stage; and an image depth map is formed from the values of the shift maps generated at the previous stage.

EFFECT: increase in the accuracy of the depth map values.

8 cl, 6 dwg

Description

TECHNICAL FIELD

[0001] The presented technical solution relates generally to the field of image data processing, and in particular to a method and device for constructing a depth map from a pair of images obtained, for example, by stereo cameras, using a TUDE (Transformer-Unet for Depth Estimation) device.

BACKGROUND ART

[0002] Existing analogs closest to the presented solution can be roughly divided into two families:

1. Classical - these build a depth map by line-by-line comparison of windows of pixel brightness values from the frames of two cameras;

2. Neural network - these compare features extracted by a neural network.

[0003] Despite their relatively high speed, solutions based on classical approaches have a number of disadvantages:

- they perform poorly in areas with a uniform texture (for example, an image of a road);

- they leave many gaps in their predictions and are often wrong;

- the accuracy of their predictions depends strongly on the size of objects.

[0004] Most openly available solutions are based on methods from the OpenCV library. OpenCV (Open Source Computer Vision Library) is an open-source library of computer vision, image processing and general-purpose numerical algorithms.

[0005] Neural network approaches such as AANet [see AANet: Adaptive Aggregation Network for Efficient Stereo Matching] and LEAStereo [see LEAStereo: Learning Effective Architecture Stereo] significantly outperform classical ones in quality, but run slower. This is critical for applications such as autonomous driving, because it directly affects how quickly the autopilot reacts to obstacles. These solutions also lack a mechanism for discarding neural network predictions below a confidence threshold, which leads to artifacts at the edges of objects and in regions of objects that are visible to one camera but not to the other. These inaccuracies can negatively affect other algorithms that work with the depth map, which degrades the performance of the autopilot algorithm.

[0006] Safe driving of an unmanned vehicle requires a prompt response to events in the road scene, for example, a vehicle suddenly pulling out of the oncoming lane. At high vehicle speeds, fast decision making by the autopilot becomes critical. This hinders the use of algorithms that rely on lidar data, since lidars operate at a lower frequency than cameras. In addition, the point cloud obtained from a lidar is very sparse, so it does not always allow a correct conclusion about the presence and nature of an object.

[0007] Using two cameras combined into a stereo pair makes it possible to obtain a dense depth map quickly and relatively accurately (each camera pixel is assigned a distance to an object), which gives dense 3D point coverage of small objects [see Smolyanskiy, Nikolai; Kamenev, Alexey; Birchfield, Stan, "On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach", 2018].

[0008] The proposed technical solution runs up to five times faster than the comparable neural network approaches AANet and LEAStereo, which also operate on a pair of images, at comparable accuracy. The high speed of the algorithm is achieved by using lightweight layers that operate at a relatively low spatial resolution. The proposed solution also makes it possible to filter out regions with a high depth map estimation error.

DISCLOSURE OF THE INVENTION

[0009] The technical problem addressed by this technical solution is the creation of a new efficient, simple and reliable method for constructing a depth map from a pair of images.

[0010] The technical result is an increase in the accuracy of the depth map values.

[0011] This technical result is achieved by a method for constructing a depth map from a pair of images, performed by at least one computing device and comprising the steps of:

- obtaining, from first and second cameras, first and second images containing an image of at least one object;

- rectifying the first and second images by projecting them onto a single plane;

- determining, for each pixel of the first and second images, a shift value indicating the number of pixels by which the most similar pixel of the other image is shifted;

- generating first and second shift maps for the first and second images, containing said shift values;

- generating an image depth map based on the values of the shift maps generated in the previous step.

[0012] In one particular embodiment of the method, the step of determining the shift value for each pixel of the first and second images comprises the steps of:

- determining the brightness value (the amount of illumination) of each pixel of the first and second images;

- comparing the brightness values of the pixels of the first image with the brightness values of the pixels of the second image to determine the shift value for each pixel of the first and second images, wherein the brightness values of neighboring pixels are also taken into account in the comparison.

[0013] In another particular embodiment of the method, steps of checking the consistency of the pixel shift values of the left and right images are additionally performed.
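The consistency check of paragraph [0013] is not detailed in the text. One possible reading is sketched below in Python with NumPy, under assumptions that are not in the source: the function name `left_right_consistency`, the tolerance `tol`, and the convention that a left-image pixel at column x with shift d matches the right-image pixel at column x - d are all hypothetical.

```python
import numpy as np

def left_right_consistency(shift_left, shift_right, tol=1.0):
    """Return a boolean mask of left-image pixels whose shift agrees with the right map.

    Assumed convention (not stated in the source): a left pixel at column x
    with shift d matches the right pixel at column x - d.
    """
    h, w = shift_left.shape
    rows, cols = np.indices((h, w))
    # Column in the right image that each left pixel points to.
    target = np.rint(cols - shift_left).astype(int)
    inside = (target >= 0) & (target < w)
    target_clipped = np.clip(target, 0, w - 1)
    # Shift stored in the right map at the matched location.
    back = shift_right[rows, target_clipped]
    # Consistent pixels point inside the image and agree within the tolerance.
    return inside & (np.abs(shift_left - back) <= tol)
```

Pixels where the mask is false could then be excluded from the depth map; the choice of tolerance is left open here.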

[0014] In another particular embodiment of the method, the step of generating an image depth map based on the values of the shift maps comprises the steps of:

- determining, based on the pixel shift values contained in the shift maps, the distance from the line connecting the centers of the cameras to each pixel of the at least one object;

- generating an image depth map based on the obtained distances from the line connecting the centers of the cameras to each pixel of the at least one object.

[0015] In another particular embodiment of the method, said distance is determined by the formula: distance = B × f / D,

where B is the baseline (the distance between the cameras),

f is the focal length in pixels,

D is the shift value.

[0016] In another particular embodiment of the method, the computing device is additionally equipped with an encoder, a transformer and a decoder, and the step of determining, for each pixel of the first and second images, a shift value indicating the number of pixels by which the most similar pixel of the other image is shifted comprises the steps of:

- generating first and second image tensors containing vector representations (feature vectors) of at least one object, each element of a tensor representing the brightness value of the corresponding pixel;

- normalizing the tensors obtained in the previous step;

- combining said two tensors into one tensor by means of the encoder;

- comparing line by line, by means of the transformer, the vector representations of the at least one object contained in the tensor obtained in the previous step, to form a tensor containing information about the shifts of image pixels relative to each other;

wherein the step of generating the first and second shift maps for the first and second images is performed by the decoder based on the tensor obtained in the previous step.
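The first two steps of paragraph [0016] (forming brightness tensors and normalizing them) can be illustrated with a minimal Python/NumPy sketch. The source does not specify the normalization scheme, so zero-mean, unit-variance scaling is an assumption, as are the names `to_normalized_tensor` and `stack_pair`. The stacking step only prepares a combined input; it is not a stand-in for the trained encoder, transformer or decoder.

```python
import numpy as np

def to_normalized_tensor(image, eps=1e-6):
    """Convert a brightness image (H, W) to a float tensor with zero mean and unit variance.

    The normalization scheme is an assumption; the source only says the tensors are normalized.
    """
    tensor = image.astype(np.float32)
    return (tensor - tensor.mean()) / (tensor.std() + eps)

def stack_pair(left, right):
    """Stack two normalized tensors along a new leading axis, giving shape (2, H, W).

    This only prepares a combined input; the trained encoder is not reproduced here.
    """
    return np.stack([to_normalized_tensor(left), to_normalized_tensor(right)], axis=0)
```

A hypothetical call would be `pair = stack_pair(left_image, right_image)`, where both arguments are rectified single-channel images of the same size.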

[0017] In another particular embodiment of the method, the encoder, transformer and decoder are implemented on the basis of neural networks pre-trained on a training data set.

[0018] In another particular embodiment of the method, the following steps are additionally performed:

- generating a point cloud in three-dimensional space based on the depth map;

- using the point cloud to plan the trajectory of an autonomous unmanned vehicle.
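The conversion of a depth map into a point cloud in paragraph [0018] is not spelled out in the text. A common way to do it is pinhole back-projection, sketched below in Python with NumPy. The principal point (`cx`, `cy`), the function name `depth_to_point_cloud` and the treatment of non-positive depths as invalid are assumptions made for illustration.

```python
import numpy as np

def depth_to_point_cloud(depth, f, cx, cy):
    """Back-project a depth map (H, W) into an (N, 3) array of points.

    Assumes a pinhole camera with focal length f (pixels) and principal
    point (cx, cy); pixels with non-positive or non-finite depth are skipped.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # pixel row and column indices
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / f        # horizontal offset from the optical axis
    y = (v[valid] - cy) * z / f        # vertical offset from the optical axis
    return np.stack([x, y, z], axis=1)
```

The resulting array lists points in the camera coordinate frame; transforming them into a vehicle frame for trajectory planning would require extrinsic calibration data that this sketch does not cover.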

[0019] In another preferred embodiment of the claimed solution, a device for constructing a depth map from a pair of images is provided, comprising at least one computing device and at least one memory device containing machine-readable instructions which, when executed by the at least one computing device, carry out the above method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] The features and advantages of the present technical solution will become apparent from the following detailed description of the technical solution and the accompanying drawings, in which:

[0021] Fig. 1 shows an example implementation of an image processing system.

[0022] Fig. 2 shows examples of images obtained from the left and right cameras.

[0023] Fig. 3 shows examples of images with occluded regions.

[0024] Fig. 4 shows an example of images with filtering applied to the depth map.

[0025] Fig. 5 shows an example diagram of the neural network architecture.

[0026] Fig. 6 shows an example of a general view of a computing device.

IMPLEMENTATION OF THE INVENTION

[0027] The concepts and terms necessary to understand this technical solution are described below.

[0028] In this technical solution, a system means, among other things, a computer system, a computer (electronic computing machine), a CNC (computer numerical control) system, a PLC (programmable logic controller), computerized control systems and any other devices capable of performing a specified, clearly defined sequence of operations (actions, instructions).

[0029] A command processing device means an electronic unit, a computing device, or an integrated circuit (microprocessor) that executes machine instructions (programs).

[0030] A command processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices can include, but are not limited to, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD) and optical drives.

[0031] A computing device is a calculating device that automatically performs a single mathematical operation or a sequence of them in order to solve one problem or a class of similar problems (Great Soviet Encyclopedia. - Moscow: Soviet Encyclopedia, 1969-1978).

[0032] A program is a sequence of instructions intended for execution by the control device of a computer or by a command processing device.

[0033] A database (DB) is a collection of data organized according to a conceptual structure describing the characteristics of these data and the relationships among them, and supporting one or more application areas (ISO/IEC 2382:2015, 2121423 "database").

[0034] A signal is a material embodiment of a message for use in the transmission, processing and storage of information.

[0035] An attention mechanism (attention model) is a technique used in recurrent neural networks (RNN) and convolutional neural networks (CNN) to find relationships between different parts of input and output data.
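As an illustration of the definition in paragraph [0035], one widely used form of attention is scaled dot-product attention, sketched below in Python with NumPy. The source does not state which form of attention the claimed transformer uses, so this is only a generic example; the function name and array shapes are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Generic scaled dot-product attention.

    queries: (n, d), keys: (m, d), values: (m, dv).
    Returns the attended output (n, dv) and the attention weights (n, m).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ values, weights
```

Each row of the weights sums to one and indicates how strongly a query element relates to each key element, which is the "relationship between parts of the data" referred to in the definition.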

[0036] According to the diagram shown in Fig. 1, the image processing system comprises: a first camera 1, a second camera 2 and a device 10 for constructing a depth map from a pair of images.

[0037] Any cameras widely known in the art can be used, located, for example, on a vehicle at a given distance from each other in any direction, for example 10-100 cm, and pointed in the same direction at at least one object. The objects at which the cameras are pointed can be, for example, pedestrians, road barriers, other vehicles and road users, walls of buildings, curbs, the sky, trees, animals, the road, sidewalks and so on. For convenience, it is assumed below that the cameras are separated horizontally. The cameras may operate in the visible spectral range (see, for example, "Visible radiation") as well as in the UV (see "Ultraviolet radiation") and IR (see "Infrared radiation") ranges. The cameras may use CMOS (complementary metal-oxide-semiconductor) sensors with active sensing elements (Active Pixel Sensor) and CCD (charge-coupled device) sensors.

[0038] For example, the camera may be a "LI-IMX390-GW5200-GMSL2-120H" model (see, for example, https://www.leopardimaqing.com/product/autonomous-camera/maxim-gmsl2-cameras/li-imx390-gw5200-gmsl2/li-imx390-gw5200-gmsl2-120h/) or an "acA2040-25gc - Basler ace" model (see, for example, https://www.baslerweb.com/en/products/cameras/area-scan-cameras/ace/aca2040-25gc/).

[0039] The cameras are calibrated by methods known in the art, and information about the position and tilt angles of the cameras relative to each other is entered by the developer into the device 10 for constructing a depth map from a pair of images. Lens distortions in the images obtained from the cameras can also be corrected by known methods (see, for example, the camera calibration procedure described at https://docs.opencv.org/4.5.2/dc/dbb/tutorial_py_calibration.html).

[0040] The device 10 for constructing a depth map from a pair of images can be implemented on the basis of at least one computing device and comprise: a data acquisition module 11, an image rectification module 12, a shift map generation module 20 and a depth map generation module 30. These modules can be implemented using the software and hardware of the computing device, in particular its processor or microcontroller, and are equipped with appropriate communication interfaces, logic elements, ADCs and DACs for exchanging signals in order to transmit data, including the image information described below.

[0041] Accordingly, the first and second images containing at least one image of an object, for example a left image and a right image, are supplied from cameras 1 and 2 to the data acquisition module 11, which may be equipped, for example, with a buffer, that is, a device ensuring synchronous acquisition of data from the two cameras. Module 11 passes the received images to the image rectification module 12, which projects the images onto a single plane. Rectification of the image pair can be carried out by methods known in the art, for example those described in the book by R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, published in 2004 and available on the Internet at https://www.r-5.org/files/books/computers/algo-list/image-processing/vision/Richard_Hartley_Andrew_Zisserman-Multiple_View_Geometry_in_Computer_Vision-EN.pdf. Rectification yields two frames in which objects correspond row by row (the epipolar lines are projected horizontally), which increases the accuracy of the depth map values.

[0042] The rectified first and second images are then supplied to the shift map generation module 20, which matches the pixels of the images. When pixels are matched, both the brightness value of the pixel itself and the brightness values of its neighboring pixels are compared. The result of the matching is a shift map in which each compared image pixel is assigned a shift value indicating the number of pixels by which the most similar pixel of the image from the other camera is shifted. Each camera gets its own shift map. Thus, first and second shift maps are obtained, for the first and second images respectively, and are passed to the depth map generation module 30.
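The matching described in paragraph [0042] can be illustrated by a classical brute-force search over windows of brightness values. The Python/NumPy sketch below uses the sum of absolute differences (SAD) as the similarity measure; it is not the neural network of the claimed solution. The function name `match_row_shifts`, the parameters `max_shift` and `radius`, the SAD cost and the convention that the match in the right image lies at column x - d are all assumptions.

```python
import numpy as np

def match_row_shifts(left, right, max_shift=8, radius=1):
    """Brute-force shift map for the left image of a rectified pair.

    For every left pixel, the window of brightness values around it is
    compared with windows on the same row of the right image, and the
    shift with the smallest sum of absolute differences is kept.
    """
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    size = 2 * radius + 1
    pad_l = np.pad(left, radius, mode="edge")
    pad_r = np.pad(right, radius, mode="edge")
    shifts = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        for x in range(w):
            window = pad_l[y:y + size, x:x + size]
            best_cost, best_d = np.inf, 0
            # Assumed convention: the match in the right image is at column x - d.
            for d in range(min(max_shift, x) + 1):
                candidate = pad_r[y:y + size, x - d:x - d + size]
                cost = np.abs(window - candidate).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            shifts[y, x] = best_d
    return shifts
```

The shift map for the right image could be obtained in the same way by swapping the roles of the two images and searching in the opposite direction.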

[0043] To generate a depth map from the shift maps, module 30 uses the pixel shift values to determine the distance from the line connecting the camera centers to each pixel of at least one object in the image. This distance is inversely proportional to the shift value and is calculated, for example, by the formula:

distance = B×f/D;

where B is the base size (the distance between the cameras);

f is the focal length in pixels (in particular, identical cameras and images with corrected lens distortion are used);

D is the shift value.

The focal length f and the base size B are set by the developer in the memory of module 30, with which the module may additionally be equipped.
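The formula above can be sketched as follows. This is an illustrative sketch only: the function names, the use of nested lists for the shift map, and the handling of a zero shift (returned as infinity) are assumptions, and the values of B and f are hypothetical.

```python
import math


def shift_to_distance(shift, base, focal_px):
    """distance = B * f / D, with B in metres and f, D in pixels."""
    if shift == 0:
        # Assumption: a zero shift has no finite distance.
        return math.inf
    return base * focal_px / shift


def shift_map_to_depth(shift_map, base, focal_px):
    """Apply the formula to every element of a shift map (list of rows)."""
    return [[shift_to_distance(d, base, focal_px) for d in row] for row in shift_map]


# Hypothetical values: base of 0.5 m, focal length of 1000 px.
depth = shift_map_to_depth([[50, 100], [0, 25]], base=0.5, focal_px=1000.0)
```

Because the distance is inversely proportional to the shift, halving the shift doubles the distance, so small shift errors matter most for distant objects.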

[0044] Accordingly, for the first and second shift maps, module 30 generates a first and a second matrix whose values characterize the distance from points of at least one object in the image to the line connecting the camera centers, after which module 30 designates either the first or the second matrix as the depth map, depending on the software algorithm specified by the developer.

[0045] Additionally, the depth map generation module 30 can be configured to filter the shift maps by checking the consistency of the pixel shift values of the left and right images. As a result of this check, module 30 generates a matrix containing the pixel coordinates and, for each pixel, a value indicating whether the shift values contained in the first and second shift maps are consistent or inconsistent. The algorithm for checking the consistency of pixel shift values is described in more detail later in the text of the application. When the depth map is generated in the manner described above, inconsistent pixel shift values are not taken into account.

[0046] Next, based on the data of the obtained depth map, module 30 forms a point cloud in three-dimensional space by known methods, performing a back projection from the camera into 3D space (see, for example, Bostanci, Gazi Erkan & Kanwal, Nadia & Clark, Adrian. (2015). Augmented reality applications for cultural heritage using Kinect. Human-centric Computing and Information Sciences. 5. 1-18. 10.1186/s13673-015-0040-3). The point cloud (a set of points in three-dimensional space) can be used to plan the trajectory of an autonomous unmanned vehicle by widely known methods.
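A back projection of this kind can be sketched with the standard pinhole camera model. The sketch below is an assumption made for illustration, not the procedure of the cited paper or of module 30: the principal point (`cx`, `cy`), the function name, and the use of `None` for pixels without a depth value do not appear in the description.

```python
def depth_to_point_cloud(depth, focal_px, cx, cy):
    """Back-project a depth map (list of rows) into a list of (X, Y, Z) points.

    Pinhole model: X = (u - cx) * Z / f, Y = (v - cy) * Z / f.
    Pixels whose depth is None are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None:
                continue
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map with one missing value.
depth_map = [[2.0, 2.0], [4.0, None]]
cloud = depth_to_point_cloud(depth_map, focal_px=2.0, cx=0.0, cy=0.0)
```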

[0047] In an alternative embodiment of the claimed solution, the shift map can be generated using a neural network that consists of several other neural networks. In this embodiment, the shift map generation module 20 is additionally equipped with a tensor normalization module 21, an encoder 22, a transformer 23 and a decoder 24.

[0048] At least one processor or microcontroller, configured in its software and hardware to perform the functions assigned to module 21 below, can be used as the tensor normalization module 21.

[0049] At least one neural network that extracts features from the image, producing vector representations of objects at a reduced spatial resolution, can be used as encoder 22. For example, the neural network can be implemented as the standard convolutional neural network ResNet18 (see, for example, He, Kaiming et al., "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)).

[0050] At least one neural network that performs row-by-row matching of the features extracted by the previous network can be used as transformer 23. This neural network can be equipped with an attention mechanism that evaluates how similar the feature vectors of a row of the image captured by the first camera are to the feature vectors of the corresponding row of the image obtained from the second camera. The output tensors with the activation map have the same dimensions as the input data. However, the vectors of the tensor carry information not about the local features of one frame but about the shift of the pixels of one camera relative to the other. During training, transformer 23 predicts a tensor from which a shift map can be obtained at a lower spatial resolution than that of the input image. Transformer 23 can be implemented as a neural network of the BERT family (Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018)) with two layers and eight heads.

[0051] At least one neural network that decodes the feature tensors output by encoder 22 and transformer 23 back to the original spatial resolution can be used as decoder 24. Several feature tensors of different spatial resolutions are taken from the encoder. Decoder 24 can be a part of the Unet network supplemented with an attention mechanism. The output of decoder 24 is a shift map normalized to the range from 0 to 1.

[0052] In a digital environment, images are represented as tensors (or matrices) containing values that characterize the colors of the pixels or their brightness. Accordingly, the rectified first (for example, left) and second (for example, right) images are supplied in the form of tensors to the tensor normalization module 21. The tensors of the first and second images, T1 and T2 respectively, can be represented, for example, as tensors with dimensions C×H×W, where C is the number of channels, H is the image height and W is the image width. For example, for an image with a resolution of 512 by 960 pixels, H=512 and W=960. The images can be either color or black and white. Color images contain three channels (C=3, one channel for each of the three colors: red, green and blue), and black-and-white images contain one channel (C=1). The values of the dimensions C, H and W can be set by the developer.

[0053] Below, the algorithm for constructing the depth map is described using color images as an example. In this embodiment of the technical solution, each element of the image tensor contains an integer from 0 to 255 inclusive, which indicates the illumination level, that is, the brightness of the corresponding pixel of the camera sensor. Accordingly, the generated first and second image tensors are supplied to the input of the tensor normalization module 21.

[0054] Module 21 normalizes the incoming tensors as follows. The value k1 is subtracted from the brightness value (illumination level) of each pixel, and the result is divided by the value k2. The coefficients k1 and k2 are set by the developer. The tensors are then combined into a single tensor N×C×H×W (N=2, C=3, H=512, W=960) so that the first and second image tensors are processed simultaneously by encoder 22. The batch size N is determined by module 21 based on the number of images (tensors) received at the input of module 21 simultaneously or sequentially within a time interval specified by the developer. Since two images arrived at the input of module 21, the batch size N is determined to be 2. Encoder 22, as well as all the other modules, can be implemented using the PyTorch library (see https://pytorch.org/). Other libraries can also be used for the implementation, for example, TensorFlow (https://www.tensorflow.org/) or MxNet (https://mxnet.apache.org). Processing the data of the first and second tensors simultaneously, by combining them into one tensor, speeds up the algorithm compared with sequential processing of the two original tensors. For example, two tensors T1 and T2 of dimension 3×512×960 are combined into one tensor O by adding a new dimension (increasing the rank of the tensor). The first dimension of tensor O in effect numbers the combined tensors: O=(T1, T2).

Along the first dimension of tensor O, index 0 holds tensor T1 and index 1 holds tensor T2:

O[0]=T1

O[1]=T2

[0055] An example of merging for the case of two 2×2 matrices.
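The normalization and merging steps can be sketched for two 2×2 matrices. The sketch uses plain nested lists rather than PyTorch tensors to stay dependency-free; the matrix values, the coefficient values `K1` and `K2`, and the function names are hypothetical and chosen only for illustration.

```python
def normalize(matrix, k1, k2):
    """Subtract k1 from every brightness value, then divide by k2."""
    return [[(value - k1) / k2 for value in row] for row in matrix]


def stack(*tensors):
    """Combine tensors by adding a new leading dimension that numbers them."""
    return list(tensors)


# Hypothetical 2x2 brightness matrices (integers from 0 to 255).
T1 = [[0, 255], [51, 102]]
T2 = [[255, 0], [204, 153]]

# Hypothetical coefficients; in the description they are set by the developer.
K1, K2 = 0.0, 255.0

stacked = stack(normalize(T1, K1, K2), normalize(T2, K1, K2))
batch_size = len(stacked)  # N: the number of images received together
```

Here `stacked[0]` holds the normalized first matrix and `stacked[1]` the normalized second one, mirroring O[0]=T1 and O[1]=T2.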

[0056] When one pair of image tensors combined into a single tensor is processed, the output of encoder 22 is a tensor containing vector representations (feature vectors) of at least one object at a reduced spatial resolution, for example, reduced by a factor of 16, that is, a tensor of dimension 2×256×32×60 (N=2, C=256, H=32, W=60), where the first dimension N corresponds to the batch size, the second dimension C corresponds to the length of the feature vector of each element, and the remaining two dimensions H and W correspond to the spatial size reduced by a factor of 16. The procedure for generating a vector representation of at least one object from an input image tensor is disclosed, for example, in He, Kaiming et al., "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). As a result of the neural network processing the input image tensor, a new tensor is obtained, which can be presented to the user as a feature map containing information about at least one object, including information about the boundaries of the object. This information is encoded as numeric values. Encoder 22 also generates additional tensors at one or more spatial scales (resolutions) specified by the developer, for example, three tensors with dimensions 2×128×64×120, 2×64×128×240 and 2×64×256×480, which are used later by decoder 24. The procedure for obtaining tensors (feature maps) at different spatial scales from encoder 22 is described in He, Kaiming et al., "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
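The tensor dimensions quoted above follow from dividing the input height and width by the reduction factor at each scale. A small sketch of this arithmetic is given below; the pairing of reduction factors with channel counts is read off the example dimensions in the text, and the names `SCALES` and `encoder_shapes` are hypothetical.

```python
# (reduction factor, number of channels) pairs taken from the example
# dimensions in the text; other configurations are possible.
SCALES = [(16, 256), (8, 128), (4, 64), (2, 64)]


def encoder_shapes(n, height, width, scales=SCALES):
    """Return the N x C x H x W shape of the feature tensor at each scale."""
    return [
        (n, channels, height // factor, width // factor)
        for factor, channels in scales
    ]


# A batch of two 512 x 960 images.
shapes = encoder_shapes(2, 512, 960)
```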

[0057] Next, the depth map construction device 10 proceeds to the stage of row-by-row comparison of the vector representations (feature vectors) of at least one object in order to obtain information about the shift of the pixels relative to each other. The vector representations (feature vectors) contain information about at least one object in the images, are extracted from the tensors obtained at the previous stage and, if necessary, can be displayed to the user by means of the well-known indexing operation. To compare the vector representations (feature vectors) of an object in the left and right images row by row, the features from the two images must first be combined row by row by encoder 22. To do this, encoder 22 concatenates into one row the elements of the feature maps that correspond to the same row of the input images. For each batch element (tensor) along dimension N, encoder 22 concatenates the rows along dimension W, which corresponds to the image width. Index 0 holds the data of the first batch element, and index 1 holds the data of the second. For example, let the obtained feature map P have dimension N×C×W (2×2×3, N=2, C=2, W=3). For simplicity, dimension H is omitted:

Concatenating the rows along dimension W gives a tensor of dimension C×2W (2×6):
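The row concatenation can be illustrated on a feature map of dimension 2×2×3. The numeric values below are hypothetical (the values of the original example are not reproduced here), and nested lists stand in for tensors.

```python
def concat_along_width(feature_map):
    """Turn an N x C x W feature map into a C x (N*W) tensor.

    For every channel, the rows of all batch elements are joined end to end.
    """
    n_channels = len(feature_map[0])
    return [
        [value for image in feature_map for value in image[c]]
        for c in range(n_channels)
    ]


P = [
    [[1, 2, 3], [4, 5, 6]],     # P[0]: features of the first image (C=2, W=3)
    [[7, 8, 9], [10, 11, 12]],  # P[1]: features of the second image
]
merged = concat_along_width(P)
```

Each row of `merged` now holds the features of the first image followed by those of the second image for the same channel.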

[0058] Thus, from the tensor N×C×H×W (2×256×32×60, N=2, C=256, H=32, W=60), encoder 22 forms the tensor C×H×2W (256×32×120), after which, for batch processing of all rows, encoder 22 permutes the dimensions of the tensor (the numbers contained in the tensor remain unchanged); for example, the tensor is brought to the form H×2W×C (32×120×256). The algorithm for permuting the dimensions can be specified by the developer using methods known from the prior art. Encoder 22 then passes this tensor to the input of transformer 23, which returns a tensor of the same size, 32×120×256. Transformer 23 extracts from the input tensor the vector representations of the object determined for the first image and the vector representations of the object determined for the second image, and then compares these vector representations to determine the values of the shifts of the image pixels relative to each other (see Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017)); these values may also be called pixel shift prediction values. After processing, transformer 23 forms a tensor that includes the obtained information about the values of the shifts of the image pixels relative to each other. This information is encoded as numeric values. The resulting tensor is then converted back into a tensor of dimension N×C×H×W (2×256×32×60) by splitting the tensor along the row dimension and then permuting the dimensions. The algorithm for permuting the dimensions is specified by the developer.
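The dimension permutation and the reverse conversion can be sketched on a very small tensor. This is one possible bookkeeping scheme under stated assumptions, not the algorithm specified by the developer: the function names are hypothetical, and nested lists are used in place of tensors.

```python
def chw_to_hwc(tensor):
    """Permute C x H x W' into H x W' x C without changing any values."""
    c, h, w = len(tensor), len(tensor[0]), len(tensor[0][0])
    return [
        [[tensor[k][i][j] for k in range(c)] for j in range(w)]
        for i in range(h)
    ]


def hwc_to_nchw(tensor, n=2):
    """Split H x (n*W) x C along the row dimension and permute to n x C x H x W."""
    h, total_w, c = len(tensor), len(tensor[0]), len(tensor[0][0])
    w = total_w // n
    return [
        [
            [[tensor[i][b * w + j][k] for j in range(w)] for i in range(h)]
            for k in range(c)
        ]
        for b in range(n)
    ]


# Toy tensor with C=2, H=1, 2W=4 (first image in columns 0-1, second in 2-3).
merged = [[[1, 2, 3, 4]], [[5, 6, 7, 8]]]
rows = chw_to_hwc(merged)     # H x 2W x C, the layout passed to the transformer
restored = hwc_to_nchw(rows)  # back to N x C x H x W
```

In the sketch the transformer step is omitted, so `restored` simply contains the original values regrouped by image.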

[0059] Accordingly, decoder 24 receives as input the tensor with the normalized images from module 21 (2×3×512×960), the tensor from the output of transformer 23 (2×256×32×60), which contains information about the values of the shifts of the image pixels relative to each other, and additional support in the form of feature maps (tensors) from encoder 22 at the spatial scales specified by the developer (2×128×64×120, 2×64×128×240, 2×64×256×480). The operation of decoder 24 is described in Roy, Abhijit Guha, Nassir Navab, and Christian Wachinger. "Recalibrating fully convolutional networks with spatial and channel "squeeze and excitation" blocks." IEEE transactions on medical imaging 38.2 (2018): 540-549. Based on the received data, decoder 24 forms two tensors containing pixel shift values, which can be presented to the user as shift maps. The tensors correspond to the first (for example, left) and second (for example, right) images and have a dimension specified by the developer, for example, 512×960. Hereinafter, these two tensors are referred to as shift maps. The shift maps are then sent to the depth map generation module 30.

[0060] Obtaining two shift maps at once is necessary for the post-processing algorithm performed by module 30, which checks the consistency of the pixel shift values for the left and right images. The algorithm for checking the consistency of the pixel shift values is as follows. All pixels of the first (left) image are considered in turn. For each pixel, module 30 finds the pixel shift value in the shift map generated by decoder 24 for the first (left) image. From the shift value, module 30 determines which pixel of the second (right) image corresponds to the pixel of the first (left) image. Similarly, for the pixel found in the right image, module 30 uses the shift map generated by decoder 24 to find the corresponding pixel in the left image.

[0061] In particular, for each pixel of the first (left) shift map with coordinates x_left, y_left (along the horizontal and vertical axes of the matrix, respectively) and value d_left (the shift magnitude), module 30 finds the corresponding pixel in the second (right) shift map with coordinates x_right, y_right, where x_right=x_left+d_left and y_right=y_left. Next, using the coordinates of the found pixel, module 30 extracts from the second shift map the shift value determined for that pixel, d_right. Module 30 compares the shift value d_right, determined for the pixel of the second image, with the shift value d_left, determined for the pixel of the first image. If the absolute difference between d_left and d_right is less than a specified threshold (for example, 2), module 30 determines that the shift values are consistent. Module 30 then forms a consistency matrix of pixel shift values, whose size matches the size of the original shift map, and assigns the pixel a value indicating that the shift values determined for it are consistent (for example, the value True in the cell with coordinates x_left, y_left). If the absolute difference between d_left and d_right is greater than the specified threshold, module 30 assigns the pixel a value (for example, False) indicating that the shift values are not consistent. The threshold can be set by the developer of module 30. Accordingly, if a pixel returns to the same place with an error of less than 2 pixels, the pixels are considered consistent. If the error is more than 2 pixels, the shift values are not consistent.
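The consistency check can be sketched as follows. The sketch follows the relations x_right=x_left+d_left and |d_left - d_right| < threshold from the text; everything else is an assumption made for illustration: the function name, rounding of non-integer shifts, marking pixels whose match falls outside the image as inconsistent, and treating a difference exactly equal to the threshold as inconsistent (the text does not define this case).

```python
def consistency_matrix(left_shifts, right_shifts, threshold=2):
    """Return a matrix of booleans the same size as the left shift map.

    True means the shift values of the left pixel and of its matching
    right pixel differ by less than the threshold.
    """
    height, width = len(left_shifts), len(left_shifts[0])
    consistent = [[False] * width for _ in range(height)]
    for y_left in range(height):
        for x_left in range(width):
            d_left = left_shifts[y_left][x_left]
            x_right = x_left + int(round(d_left))  # y_right equals y_left
            if not 0 <= x_right < width:
                continue  # assumption: no match inside the image
            d_right = right_shifts[y_left][x_right]
            consistent[y_left][x_left] = abs(d_left - d_right) < threshold
    return consistent


# Hypothetical one-row shift maps.
left = [[1, 1, 0, 5]]
right = [[0, 1, 4, 0]]
mask = consistency_matrix(left, right)
```

Only the first pixel passes the check in this toy example; the others either disagree by more than the threshold or point outside the image.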

[0062] This check increases the accuracy of the obtained shift values by filtering out zones where the shift values are inconsistent. As a rule, these are zones that are visible to one camera and not to the other. Examples are shown in Figs. 2 and 3. Fig. 2 shows the original images from the left and right cameras. Fig. 3 shows the areas that are visible to one camera but not to the other because of the parallax effect (https://en.wikipedia.org/wiki/Parallax). For such zones the shift value is undefined. Our approach can predict shifts in such areas, but the accuracy of these predictions is much lower than when an object is visible to both cameras simultaneously. Fig. 4 shows the original depth map and the depth map after filtering. As can be seen, object boundaries have become sharper and areas with a high shift-prediction error have been detected, which benefits the obstacle analysis algorithms that use the resulting depth maps.

[0063] Accordingly, as a result of checking the consistency of the pixel shift values, module 30 generates a matrix containing the pixel coordinates and values indicating whether the pixel shift values contained in the first and second shift maps are consistent or inconsistent. Module 30 then proceeds to form a depth map from the values of the shift maps using known methods, for example those disclosed earlier; inconsistent pixel shift values are not taken into account when the depth map is formed.
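A minimal sketch of this step, assuming the relation distance = B×f/D given in the claims and a boolean consistency matrix as described above. The function name `depth_from_shifts`, the use of NaN for excluded pixels, and the exclusion of non-positive shifts are assumptions for illustration only.

```python
import numpy as np

def depth_from_shifts(shift_map, consistent, baseline_m, focal_px):
    """Form a depth map from a shift map, skipping inconsistent pixels.

    shift_map:  2D array of shift values D, in pixels.
    consistent: 2D boolean array, True where the shift values agree.
    baseline_m: base B (distance between the cameras), here in meters.
    focal_px:   focal length f, in pixels.
    """
    # Assumption: excluded pixels are marked with NaN.
    depth = np.full(shift_map.shape, np.nan)
    # Assumption: a non-positive shift gives no finite distance and is skipped.
    valid = consistent & (shift_map > 0)
    depth[valid] = baseline_m * focal_px / shift_map[valid]
    return depth
```

With B = 0.5 m and f = 1000 pixels, a consistent pixel with a shift of 10 pixels is placed at 0.5 × 1000 / 10 = 50 m.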

[0064] Thus, because the images received from the cameras undergo rectification and the depth map is built from the values of the pixel shift maps determined for each received image, the accuracy of the depth map values increases.

[0065] The neural network is trained using the generally accepted approach [see Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016]. A dataset with images and known depth maps is taken. Suitable datasets include, for example, SceneFlow (https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html) and KITTI (http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo, http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo).

[0066] The neural network is trained on a training set of images so that it predicts the labeled depth map.

Because of difficulties with training transformer 23, the whole neural network is trained in three stages:

1. Only part of the neural network is trained, namely encoder 22 and transformer 23, as follows. Images from the two cameras are fed as tensors to normalization module 21. The resulting normalized images are then processed by encoder 22 and after that by transformer 23. The tensor of dimension 2×256×32×60 at the output of transformer 23 is reduced to a shift map by averaging over the channel dimension C=256. The tensor of dimension 2×256×32×60 thus yields a tensor of dimension 2×32×60, which corresponds to two shift maps, for the left and right images (the first dimension of the tensor indexes the shift maps). The resulting maps are passed to the loss function together with the known shift map, downscaled by a factor of 16. This stage is needed to pretrain transformer 23. Without it, transformer 23 would learn slowly, because mostly encoder 22 and decoder 24 would be trained, and the model would remain in a local minimum.

2. All parts of the neural network are trained: encoder 22, transformer 23 and decoder 24. However, the loss function for the shift map obtained from transformer 23 is not computed. In this mode it is mainly the decoder that is trained; it predicts shift maps at full resolution (the same as that of the input image).

3. All parts of the neural network are trained: encoder 22, transformer 23 and decoder 24. The optimized loss function is a weighted sum of the loss values for the two outputs of the neural network: from transformer 23 and from decoder 24. Transformer 23 predicts shift maps at reduced resolution, decoder 24 at full resolution. The weights are 0.1 and 0.001, respectively. This allows finer tuning of the weights of the whole neural network. The loss function for the predictions of transformer 23 acts as a regularizer: the features from transformer 23 must carry enough information to build a shift map.
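The three stages can be summarized by how the training loss is assembled at each one. The following NumPy sketch is illustrative only: the function names are hypothetical, and it assumes that "respectively" assigns the weight 0.1 to the transformer loss and 0.001 to the decoder loss, in the order the two outputs are listed.

```python
import numpy as np

def transformer_shift_maps(features):
    """Average over the channel axis: (2, C, H, W) -> (2, H, W).

    For the output of transformer 23 this maps 2x256x32x60 to 2x32x60,
    one reduced-resolution shift map per image.
    """
    return features.mean(axis=1)

def stage_loss(stage, loss_transformer, loss_decoder,
               w_transformer=0.1, w_decoder=0.001):
    """Combine the two loss values according to the training stage."""
    if stage == 1:
        # Stage 1: only the transformer output is supervised.
        return loss_transformer
    if stage == 2:
        # Stage 2: the transformer loss is not computed.
        return loss_decoder
    if stage == 3:
        # Stage 3: weighted sum (assumed order: transformer, decoder).
        return w_transformer * loss_transformer + w_decoder * loss_decoder
    raise ValueError("stage must be 1, 2 or 3")

# Usage with a dummy feature tensor of the size given in the description.
features = np.zeros((2, 256, 32, 60))
maps = transformer_shift_maps(features)  # shape (2, 32, 60)
```

The per-output loss values passed to `stage_loss` would come from the loss function described in the following paragraphs.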

[0067] During training, the known shift map is normalized by a chosen maximum shift value of 200 pixels. A value greater than this threshold is clipped to 200. All values therefore fall within the interval [0;1], which makes it possible to use a sigmoid as the activation function at the output of the neural network and to train with 2D binary cross-entropy (BCE, Binary Cross Entropy; see https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a and https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
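The normalization of the known shift map can be sketched as follows. This is a minimal sketch under stated assumptions: the names `MAX_SHIFT`, `normalize_shift_map` and `sigmoid` are hypothetical, and clipping negative values to 0 is an assumption (the description only mentions clipping at the upper bound).

```python
import numpy as np

MAX_SHIFT = 200.0  # maximum shift value in pixels, as chosen in [0067]

def normalize_shift_map(shift_px, max_shift=MAX_SHIFT):
    """Clip shifts to [0, max_shift] and scale them into [0, 1]."""
    # Assumption: negative values, if any, are clipped to 0.
    return np.clip(shift_px, 0.0, max_shift) / max_shift

def sigmoid(x):
    """Output activation whose range (0, 1) matches the normalized targets."""
    return 1.0 / (1.0 + np.exp(-x))
```

Because both the normalized targets and the sigmoid outputs lie in [0, 1], they can be compared with binary cross-entropy.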

[0068] Example of computing the BCE function. Let p_i,gt and p_i,pred be the known and predicted normalized shift values for pixel number i, respectively. Then the value of the loss function L_bce is:

$$L_{bce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_{i,gt} \ln p_{i,pred} + (1 - p_{i,gt}) \ln (1 - p_{i,pred}) \right],$$

where N is the number of pixels, the summation runs over all pixels, and the logarithm is natural.
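The binary cross-entropy over N pixels with a natural logarithm can be computed as in the sketch below. It assumes the standard mean-reduced form (the default of torch.nn.BCELoss, which the description links to); the function name `bce_loss` and the small clamp `eps`, added to avoid taking the logarithm of zero, are assumptions of this example.

```python
import numpy as np

def bce_loss(p_gt, p_pred, eps=1e-7):
    """Mean binary cross-entropy between normalized shift maps.

    p_gt:   known normalized shifts, values in [0, 1].
    p_pred: predicted normalized shifts, values in [0, 1].
    eps:    clamp that keeps the logarithm finite (an assumption here).
    """
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    per_pixel = p_gt * np.log(p_pred) + (1.0 - p_gt) * np.log(1.0 - p_pred)
    # Average over all N pixels and flip the sign.
    return float(-np.mean(per_pixel))
```

For example, predicting 0.5 for every pixel when the targets are 0 and 1 gives a loss of ln 2, regardless of the number of pixels.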

[0069] Adabelief was chosen as the optimizer [see Adabelief]. An implementation of the optimizer algorithm is available at https://github.com/jettify/pytorch-optimizer.

[0070] In general (see Fig. 6), the computing device (200) comprises, interconnected by a common data exchange bus, one or more processors (201), memory such as RAM (202) and ROM (203), input/output interfaces (204), input/output devices (205), and a network communication device (206).

[0071] The processor (201) (or several processors, a multi-core processor, etc.) may be selected from the range of devices widely used today, for example from manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, etc. The processor, or one of the processors used in the system (200), should also be understood to include a graphics processor, for example an NVIDIA GPU with a CUDA-compatible programming model, or Graphcore; such processors are also suitable for carrying out the method in whole or in part and can be used to train and apply machine learning models in various information systems.

[0072] The RAM (202) is random access memory intended to store machine-readable instructions executed by the processor (201) to perform the necessary logical data processing operations. The RAM (202) typically contains executable instructions of the operating system and of the associated software components (applications, program modules, etc.). The available memory of a graphics card or graphics processor can also serve as the RAM (202).

[0073] The ROM (203) is one or more permanent data storage devices, for example a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, BlueRay Disc, MD), etc.

[0074] Various types of I/O interfaces (204) are used to organize the operation of the components of the system (200) and of external connected devices. The choice of interfaces depends on the specific design of the computing device; they may include, without limitation: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[0075] Various information I/O means (205) are used for user interaction with the computing device (200), for example a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality tools, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification tools (retina scanner, fingerprint scanner, voice recognition module), etc.

[0076] The network communication means (206) provides data transmission over an internal or external computer network, for example an intranet, the Internet, a LAN, etc. One or more of the means (206) may be, without limitation: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[0077] Satellite navigation means, for example GPS, GLONASS, BeiDou or Galileo, may additionally be used as part of the device (200).

[0078] The specific selection of elements of the device (200) for implementing various software and hardware architectures may vary, provided that the required functionality is preserved.

[0079] Modifications and improvements to the above-described embodiments of the present technical solution will be apparent to those skilled in the art. The foregoing description is provided by way of example only and is not limiting. Thus, the scope of the present technical solution is limited only by the scope of the appended claims.

Claims

1. A method for constructing a depth map from a pair of images, performed by at least one computing device, comprising the steps of:

- obtaining, from first and second cameras, first and second images containing an image of at least one object;

- performing rectification of the first and second images by projecting them onto one plane;

- forming first and second image tensors containing vector representations (feature vectors) of the at least one object, each element of a tensor representing the brightness value of the corresponding pixel;

- normalizing the tensors obtained at the previous step;

- combining, by means of an encoder, said two tensors into one tensor;

- comparing line by line, by means of a transformer, the vector representations of the at least one object contained in the tensor obtained at the previous step, to form a tensor containing information about the values of the shifts of image pixels relative to each other;

- generating, by means of a decoder, first and second shift maps for the first and second images, containing said shift values, based on the tensor obtained at the previous step;

- forming an image depth map based on the values of the shift maps generated at the previous step.

2. The method according to claim 1, characterized in that a step of determining the shift value for each pixel of the first and second images is additionally performed, comprising the steps of:

- determining the brightness value (the amount of illumination) of each pixel of the first and second images;

- comparing the brightness values of the pixels of the first image with the brightness values of the pixels of the second image to determine the shift value for each pixel of the first and second images, the brightness values of neighboring pixels also being taken into account during the comparison.

3. The method according to claim 1, characterized in that a step of checking the consistency of the pixel shift values of the left and right images is additionally performed.

4. The method according to claim 1, characterized in that the stage of generating an image depth map based on the values of the shift maps contains stages in which:

- based on the pixel shift values contained in the shift maps, the distance from the line connecting the centers of the cameras to each pixel of at least one object is determined;

- based on the obtained distance values from the line connecting the centers of the cameras to each pixel of at least one object, an image depth map is formed.

5. The method according to claim 4, characterized in that said distance is determined by the formula distance=B×f/D,

where B is the size of the base (distance between cameras),

f - focal length in pixels,

D - shift value.

6. The method according to claim 1, characterized in that the encoder, transformer and decoder are implemented on the basis of neural networks pre-trained on the training data set.

7. The method according to claim 1, characterized in that additional steps are performed in which:

- based on the depth map, a point cloud is formed in three-dimensional space;

- the point cloud is used to plan the trajectory of an autonomous unmanned vehicle.

8. A device for constructing a depth map from a pair of images, comprising at least one computing device and at least one memory device containing machine-readable instructions which, when executed by the at least one computing device, perform the method according to any one of claims 1-7.