WO2023244135A1

WO2023244135A1 - Method and system for segmenting scenes in a video sequence

Info

Publication number: WO2023244135A1
Application number: PCT/RU2022/000381
Authority: WO
Inventors: Роман Валерьевич ЛЕКСУТИН; Евгений Юрьевич ЖИЛИН
Original assignee: Sberbank PAO
Current assignee: Sberbank PAO
Priority date: 2022-06-16
Filing date: 2022-12-20
Publication date: 2023-12-21
Anticipated expiration: 2024-12-16

Abstract

The present invention relates to the field of computing. A method for segmenting scenes in a video sequence (100) is implemented using a computer and comprises the steps of: receiving an input video sequence containing video and voice data (101); splitting the input video sequence into the following three data streams: facial images, text data, and content images (102); determining features for each frame of the video sequence in each of the data streams (103); vectorizing the features in each of the data streams (104); normalizing and concatenating the vector representations to produce a common set of features for each frame of the video sequence in the form of a single vector (105); determining a distance metric for each data stream (106); calculating, on the basis of said metrics, a common metric for the single vector; segmenting the video sequence into contextually related scenes by comparing the single vector representations of the frames in a vector space (107). The invention is directed toward improving the accuracy with which contextual scenes are determined for segmenting a video by the parallel analysis of data streams forming the video.

Description

METHOD AND SYSTEM FOR SEGMENTING SCENES IN A VIDEO SEQUENCE

TECHNICAL FIELD

[0001] This solution relates to the field of computer technology, in particular to a method and system for segmenting scenes in a video sequence.

BACKGROUND OF THE ART

[0002] Various company processes (for example, sales and product development) require communication with counterparties (customers, employees of other departments, etc.). These communications create and transmit large volumes of unstructured or weakly structured information, which is used to adjust or change how the relevant processes are carried out.

[0003] One of the most common forms of communication is an online meeting, which uses several data transmission channels: participants see each other (video link), talk over audio (possibly a telephone line), and can present content (presentations, screen sharing, etc.).

[0004] For example, in sales processes, communication with a client must identify the client's need so that appropriate items from the company's range of goods and services can then be offered on that basis. Accordingly, it is necessary to determine, within the communication with the client, the period of time and the artifacts or objects that relate to the client's needs (for example, a description of or request for a commercial proposal). Such a period of time is called a scene. Scene detection is the task of segmenting video and other unstructured information exchanged between the parties during communication. Based on the analysis of the corresponding scenes, certain actions can be taken (in effect solving the problem of classifying a scene into a given set of actions or classes): sending the buyer appropriate offers from the available assortment, calculating and offering a discount on additional products, and so on.

[0005] A well-known and readily available approach to video segmentation (included in many open-source video libraries, for example OpenCV) is to split a video sequence into scenes at transitions between frames, by estimating the difference between the characteristics of successive frames. This approach does not take into account the contextual component of the communication (the semantic content of the video images, audio, presentation slides, etc.) and does not allow scenes to be classified in order to obtain practical results or actions that depend on the context (i.e. the meaning or content) of the scenes.

[0006] Patent RU 2628192 C2 (Joint Stock Company "Creative and Production Association "Central Film Studio of Children's and Youth Films named after M. Gorky", 15 August 2017) describes a means for video segmentation and classification, but it uses only one data transmission channel, the video image, as the source of input features. This approach is not suitable for online communications, in which the video image can remain static for a long time while several topics belonging to different contextual scenes are discussed in the audio.

[0007] The article A Local-to-Global Approach to Multi-modal Movie Scene Segmentation (https://arxiv.org/abs/2004.02678) describes a framework for identifying contextual scenes in films that uses multimodal characteristics of each frame (place, cast, action, and audio). The feature extraction method in this framework is the closest solution, but it differs in that scene segmentation and classification rely on a supervised learning approach using the BNet network, designed specifically for feature films. A supervised approach cannot be used when the style of online communications varies (depending on the purpose, the style of the speakers, and the demonstration material used).

[0008] To overcome the shortcomings of the solutions known from the prior art, the claimed solution proposes an approach that segments scenes by classification over the three data channels forming the video sequence, using machine learning models.

SUMMARY OF THE INVENTION

[0009] The claimed invention is aimed at solving the technical problem of creating an effective method for segmenting a video that contains a demonstration of content.

[0010] The technical result is an increase in the accuracy with which contextual scenes are determined for video segmentation, achieved through parallel analysis of the data streams that form the video.

[0011] The claimed result is achieved by a method for segmenting scenes of a video sequence, performed by a computing device and comprising the steps of: receiving an input video sequence containing video and speech data; splitting the input video sequence into three data streams: face images, text data based on transcribed speech information, and images of content presented in the video sequence; determining features for each frame of the video sequence in each of said data streams; vectorizing said features in each of said data streams; normalizing the vector representations obtained in each stream and then concatenating the normalized vector representations to obtain a common set of features for each frame of the video sequence in the form of a single vector; determining a distance metric for each data stream, as the cosine distance between vectors for the face images in the video and as the Euclidean distance between vectors for the text data and the content images; calculating, on the basis of said metrics, a common metric value for said single vector characterizing each frame of the video sequence; and segmenting the video sequence into contextually related scenes by comparing the resulting single vector representations of the frames in a vector space, wherein the division is performed when the common metric of the single vector representations of the frames exceeds a threshold value.

[0012] In one particular embodiment of the method, vector representations are formed from the face images that characterize at least one of: facial characteristics, gender, age, gaze direction, emotions.

[0013] In another particular embodiment of the method, gestures shown in the video sequence are additionally recognized.

[0014] In another particular embodiment of the method, audio characteristics of the voices in the video sequence are additionally extracted from the speech data.

[0015] In another particular embodiment of the method, the audio characteristics of the voices include at least one of: tonality, intensity, formants.

[0016] In another particular embodiment of the method, the demonstrated content is additionally subjected to OCR processing to recognize the information presented.

[0017] The claimed invention is also implemented by a system for segmenting scenes of a video sequence, comprising at least one processor and a memory storing machine-readable instructions which, when executed by the processor, implement the above method for segmenting scenes of a video sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] FIG. 1 illustrates a flowchart of the claimed method.

[0019] FIG. 2 illustrates an example of the data streams of a video sequence.

[0020] FIG. 3 illustrates an example of forming the averaging normalized vector for the video stream.

[0021] FIG. 4 illustrates a general diagram of a computing device.

IMPLEMENTATION OF THE INVENTION

[0022] FIG. 1 shows a flowchart of the claimed method (100) for segmenting a video sequence. At the first step (101), the device executing the method (100) receives input data in the form of video data, in particular a video sequence containing both images of video content and a demonstration of accompanying content, for example a video presentation. Any suitable type of computing equipment, such as a computer or a server, can serve as the executing device. The information can be transferred by any suitable means of communication, for example over the Internet, by loading the data directly into the computing device, or by any other known method of transferring digital information.

[0023] As shown in FIG. 2, at step (102) three data streams are extracted from the received video sequence, and the corresponding features are computed in each of them:

- video data (201);

- images of content presented in the video sequence (202);

- an audio stream (203), with text data (2031) obtained from transcribed speech information and its subsequent vector representations (2032).

[0024] The streams can be extracted using approaches known in the art for extracting the content displayed in a video from a frame of the video stream. For example, a region of interest containing the demonstrated content (specified as a fixed area of the frame, for example using OpenCV algorithms) can be extracted from the frames of the video sequence. Images of people's faces are extracted from the video data stream, for example using face recognition algorithms. The audio stream is transcribed into text for subsequent analysis.

[0025] At step (103), the data obtained in each of the streams is processed to determine the features at each moment of time (frame of the video sequence). In particular, a time window in which the information is processed (T1, T2, T3) can be set for each stream (201 - 203). A frame rate can also be set for analyzing the frames of the video data (F1) and of the images of content (F3) presented in the video sequence. The frame rate is a configurable parameter that determines the balance between accuracy and system performance (the recommended value is 2 frames per second, and no less often than 1 frame every 2 seconds).

[0026] The processing window in the data streams serves two main functions:

- correction of errors and artifacts in the video channels (by subsequently discarding anomalous values in the window and averaging the images): anomalies are detected from the forecast-versus-actual error (Upper Control Limit method) relative to the forecast of a VAR (Vector Auto-Regression) model whose max lag parameter is set equal to the size of the processing window;

- support for speech features in the audio stream: speech-to-text algorithms cannot operate "in the moment" (speech is always processed over some interval of time), so relating the vector representation of the meaning of the spoken words to a video frame requires processing the audio stream with a moving window T3.
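The first function can be illustrated with a minimal sketch. It is not the VAR-based procedure named above: as a loudly stated assumption, the forecast for each frame is simply the mean of the preceding frames in the window, and the Upper Control Limit is taken as the mean error plus `k` standard deviations. The function name `average_without_anomalies` and the default `k = 3.0` are hypothetical.

```python
import math
import statistics


def average_without_anomalies(window, k=3.0):
    """Average the feature vectors of one window, skipping anomalous frames.

    window: list of equal-length feature vectors, one per sampled frame.
    Assumption: the forecast for frame i is the mean of frames 0..i-1,
    a simple stand-in for the VAR forecast mentioned in the text.
    """
    n = len(window)
    dim = len(window[0])
    if n < 3:
        keep = list(window)  # too few frames to estimate a control limit
    else:
        errors = []
        running_sum = list(window[0])
        for i in range(1, n):
            forecast = [s / i for s in running_sum]
            # Forecast-versus-actual error for frame i.
            errors.append(math.dist(window[i], forecast))
            running_sum = [s + x for s, x in zip(running_sum, window[i])]
        # Upper Control Limit: mean error plus k standard deviations.
        ucl = statistics.fmean(errors) + k * statistics.pstdev(errors)
        keep = [window[0]] + [v for v, e in zip(window[1:], errors) if e <= ucl]
    # Average the frames that were not flagged as anomalous.
    return [sum(v[j] for v in keep) / len(keep) for j in range(dim)]
```

With a window of twenty identical vectors and one corrupted frame in the middle, the corrupted frame exceeds the limit and is left out of the average.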

[0027] The processing windows (T1, T2, T3) are configurable parameters that determine the robustness of the algorithm (recommended values: T1 = 100/F1 seconds for the video stream, T2 = 100/F3 seconds for the content, and T3 = 30 seconds for the audio stream).
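As a worked example of these recommendations, the short sketch below computes the three windows from the frame rates. The function name `processing_windows` is hypothetical; only the formulas 100/F1, 100/F3 and the 30-second audio window come from the text.

```python
def processing_windows(f1, f3, t3=30.0):
    """Return the recommended windows (T1, T2, T3) in seconds.

    f1: frame rate used for the video stream, frames per second.
    f3: frame rate used for the content images, frames per second.
    t3: audio window in seconds (30 is the recommended value).
    """
    if f1 <= 0 or f3 <= 0:
        raise ValueError("frame rates must be positive")
    return 100.0 / f1, 100.0 / f3, t3


# At the recommended 2 frames per second each visual window lasts 50 s,
# i.e. it covers 100 sampled frames.
print(processing_windows(2, 2))
```

At the slowest permitted rate of 1 frame every 2 seconds (0.5 frames per second) the same formula gives a 200-second window, so the window always spans 100 sampled frames.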

[0028] For each data stream, features are determined and then converted into vector form (embeddings, at step (104)) for subsequent processing by a machine learning model.

[0029] At each moment of time t (every second of video, audio, etc.), a feature vector is determined in the metric space (P, d), where P is the set of vectors characterizing the context (including the semantic content of the video, the demonstrated content, the audio, etc.) at that moment, and d is a metric that defines the distance between vectors of the set P.

[0030] For example, the following feature vector is generated for a frame (frame No. 17):

"roll": 3.5928382873535156,
"pitch": -3.403892993927002,
"yaw": 11.955580711364746,
"looks_aside": false,
"pos_percent_left": -0.02490421455938696,
"pos_percent_top": 0.0029940119760478723,
"bad_position": false

And for the content broadcast in the same frame:

"duration": 62.879999999999995,
"text": "",
"similar_to_previous": false,
"slide words": "",
"slide sentences":
"num words": 141

[0031] The described solution can work with any numerical features, but the following features are chosen as the reference list:

- video: facial embeddings (including the facial characteristics encoded in them, such as gaze direction, gender, age, and emotions), gesture detection, HSV values;

- audio: speech-to-text conversion and use of a language model to extract sentence embeddings, plus audio characteristics of the voices of all people present in the analyzed time window (tonality, intensity, formants);

- video of the demonstrated content: HSV values, an image embedding, and detection (using OCR) and embedding of the text that describes the content.

[0032] Next, at step (105), the feature vectors obtained in each of the streams (201 - 203) are normalized and then concatenated to obtain an averaging feature vector for each data stream. FIG. 3 shows an example of obtaining the averaging vector (2011) for the video stream (201).

[0033] For the video image stream (201), anomalous vectors within the window are discarded and the remaining vectors are averaged. Similarly, for the content (202) and audio (203) streams, the features of the vectors are normalized by Max/Min Normalization within feature groups and concatenated into a single vector (normalization is needed because concatenating individual embeddings without it would distort the subsequent distance calculation).
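The normalization and concatenation step can be sketched as follows. This is a minimal illustration under stated assumptions: each feature group is a list of per-frame rows, every column is scaled to [0, 1] independently, and a constant column maps to 0. The function names `min_max_normalize` and `build_single_vectors` are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column of a list of per-frame feature rows to [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # constant column: avoid division by zero
        scaled_columns.append([(x - lo) / span for x in column])
    # Transpose back so that each row again describes one frame.
    return [list(row) for row in zip(*scaled_columns)]


def build_single_vectors(groups):
    """Normalize each feature group separately, then concatenate per frame."""
    normalized = [min_max_normalize(group) for group in groups]
    return [sum(parts, []) for parts in zip(*normalized)]


# Two hypothetical feature groups for three frames: head pose and slide size.
pose = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
slides = [[100.0], [100.0], [300.0]]
vectors = build_single_vectors([pose, slides])
```

Without the scaling, the slide feature (values in the hundreds) would dominate any Euclidean distance computed over the concatenated vector.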

[0034] For example, the final single feature vector for frame No. 17 will look like this:

"roll": 0.5928382873535156,
"pitch": 0.403892993927002,
"yaw": 0.955580711364746,
"looks_aside": 0,
"pos_percent_left": 0.02490421455938696,
"bad_position": 0,
"slide sentences": 0,
"num words": 0.01

[0035] Similarly, an averaged vector is also formed for each frame of the video sequence in the corresponding streams (202 - 203).

[0036] Based on the set of features obtained for each data stream (201 - 203), a distance metric is determined at step (106). It defines a metric space in which each vector describes the current state at a moment of time. A distance metric here means a numerical function that satisfies the axioms of a distance in this metric space. Examples of such metrics include the Hamming distance, the Euclidean distance, and the cosine distance. Since vectors in different data streams may be compared using different metrics, the metric d is formally a set of metrics (d1, d2, d3): vectors are compared by applying the individual metrics to the different streams (201 - 203) and then taking a weighted average of the results (using different weights for the component metrics d1, d2, d3).

[0037] For example, the metric d can be composed from the set of metrics (d1, d2, d3) as follows:

- the metric d1 for the video channel with face images (using a pre-trained ResNet50 architecture to obtain embeddings) is the cosine coefficient (cosine similarity) of two vectors;

- the metric d2 for the video of the demonstrated content (using a pre-trained ResNet50-based model to determine the context of the scene and the GPT3 model to obtain sentence embeddings) is the Euclidean distance between the vectors;

- the metric d3 for the audio channel (using a pre-trained embedding model based on the BERT architecture) is the Euclidean distance between the vectors.
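A minimal sketch of such a composite metric is given below. It assumes that each frame is represented by a tuple of three vectors (face, content, text), that d1 is taken as the cosine distance (one minus the cosine similarity, following the wording of paragraph [0011]), and that the weights are equal by default. The function names and the default weights are hypothetical; the text does not specify weight values.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm


def combined_distance(frame_a, frame_b, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the per-stream distances d1, d2, d3.

    frame_a, frame_b: tuples (face_vector, content_vector, text_vector).
    weights: hypothetical weights of d1, d2, d3.
    """
    d1 = cosine_distance(frame_a[0], frame_b[0])   # face images
    d2 = math.dist(frame_a[1], frame_b[1])         # demonstrated content
    d3 = math.dist(frame_a[2], frame_b[2])         # transcribed speech
    w1, w2, w3 = weights
    return (w1 * d1 + w2 * d2 + w3 * d3) / (w1 + w2 + w3)


frame_a = ([1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
frame_b = ([0.0, 1.0], [3.0, 4.0], [0.0, 0.0])
distance = combined_distance(frame_a, frame_b)  # d1 = 1, d2 = 5, d3 = 0
```

Raising the weight of one stream makes scene boundaries more sensitive to changes in that stream, for example to a change of slide when the faces on screen stay the same.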

[0038] As a result of the previous steps, a vector in the metric space (P, (d1, d2, d3)) is determined for each moment of time t. The feature vector P changes over time as the video sequence progresses (since a vector is determined for each frame of the video sequence shown at time t). That is, the video sequence is represented at each moment by a set of vectors tied to time.

[0039] Next, at step (107), the input video sequence is segmented into scenes by comparing the resulting single vector representations of the frames. To identify the data and context of the scenes for subsequent segmentation, the above metric space (a set of features and the corresponding distance metric) is used together with a dataset containing examples of information segmentation in communication channels (in effect, a mathematical description of individual regions in the metric space).

[0040] Since the feature vectors of the individual frames of the video sequence are not independent and must be treated as a sequence, the sequence clustering method Optimal Sequential Grouping is applied instead of ordinary clustering methods.

[0041] To identify sequences in which individual frames may differ greatly yet belong contextually to the same scene, the algorithm proceeds in two stages.

[0042] At the first stage, each pair of consecutive frames is compared using the metric d and, if the sensitivity threshold L is exceeded, a preliminary division into s segments is made, which are determined by a sequence of time

An example of the results of comparing consecutive frames using the metric d:

[..., 0.31, 0.25, 0.79, 0.12, 0.17, ... , 0.47, 0.85, 0.14]

Accordingly, in this example the scene boundaries correspond to the frames whose distances from the preceding frame are 0.79 and 0.85.
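The first stage can be sketched as a simple threshold test on the distances between consecutive frames. The threshold value 0.5 and the function name are assumptions chosen for illustration; the elided parts of the example sequence are omitted.

```python
def preliminary_boundaries(distances, threshold):
    """Return indices of frames that start a new preliminary segment.

    distances[i] is the metric d between frame i and frame i + 1, so a
    value above the threshold marks frame i + 1 as a boundary.
    """
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

# Values taken from the example above, without the elided parts.
distances = [0.31, 0.25, 0.79, 0.12, 0.17, 0.47, 0.85, 0.14]
print(preliminary_boundaries(distances, threshold=0.5))  # [3, 7]
```

The two reported indices are the frames whose distances from the preceding frame are 0.79 and 0.85.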

[0043] For the resulting segments segms, Optimal Sequential Grouping clustering is performed by solving an optimization problem that minimizes the distance between the centroids of the vectors belonging to the segments segms, using the constructed matrix of pairwise distances between segments. As the segmentation algorithm, it is proposed to use clustering methods, with the elbow method used to determine the number of clusters.
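The second stage can be illustrated with a dynamic-programming sketch in the spirit of Optimal Sequential Grouping. This is not the claimed implementation. It assumes that the cost of a group is the sum of pairwise distances between the segment centroids inside it, and that the number of groups k is already known (for example, chosen by the elbow method, which is not shown). The function name is hypothetical.

```python
import numpy as np

def sequential_grouping(dist, k):
    """Split items 0..n-1 into k contiguous groups (requires k <= n).

    dist: (n, n) symmetric matrix of pairwise distances between segments.
    Returns the start indices of groups 2..k and the total cost.
    """
    n = dist.shape[0]
    # block[i, j]: sum of pairwise distances among items i..j inclusive.
    block = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            block[i, j] = block[i, j - 1] + dist[i:j, j].sum()
    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for g in range(1, k + 1):
        for end in range(g, n + 1):
            for start in range(g - 1, end):
                # The last group covers items start..end-1.
                c = cost[g - 1, start] + block[start, end - 1]
                if c < cost[g, end]:
                    cost[g, end] = c
                    back[g, end] = start
    # Walk back through the table to recover the group start indices.
    starts, end = [], n
    for g in range(k, 0, -1):
        end = back[g, end]
        starts.append(int(end))
    return sorted(starts)[1:], float(cost[k, n])
```

Unlike ordinary clustering, this formulation only ever merges neighbouring segments, so the temporal order of the video sequence is preserved.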

[0044] A calibration sample may be used to calibrate and select the parameters used for segmentation in this approach. A calibration sample is a labeled dataset of communications in video, in which scenes (segments) are marked with scene start and end labels. In contrast to the supervised approach, calibrating and selecting parameters for film segmentation requires not 21,000 scenes, as in A Local-to-Global Approach to Multi-modal Movie Scene Segmentation, but only 200 scenes.

[0045] The calibration sample makes it possible to optimize and select the parameters used for segmentation. The calibrated parameters are:

- the weighting coefficients w1, w2, w3 of the metrics d1, d2, d3 used to calculate the metric d;

- the sensitivity threshold L.

[0046] Calibration is performed by adjusting a cost function, which is the sum of errors on the calibration sample (i.e., the classic error minimization problem).
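One way to picture such a calibration is a grid search over the weights and the threshold. The sketch below rests on assumptions not stated in the description: the metric d is a weighted sum of the per-stream distances, the weights sum to 1, and the error for one video is the number of predicted boundaries that do not match the labeled ones (the size of the symmetric difference). All names are hypothetical.

```python
import itertools
import numpy as np

def predicted_boundaries(d, threshold):
    # Frame i + 1 starts a new segment when d[i] exceeds the threshold.
    return {i + 1 for i, v in enumerate(d) if v > threshold}

def cost(weights, threshold, samples):
    """Sum of boundary errors over the calibration sample.

    samples: list of (streams, truth), where streams is a 3 x n array of
    per-stream distances (d1, d2, d3) and truth is a set of frame indices.
    """
    total = 0
    for streams, truth in samples:
        d = np.asarray(weights) @ np.asarray(streams)
        total += len(predicted_boundaries(d, threshold) ^ set(truth))
    return total

def calibrate(samples, weight_grid, threshold_grid):
    """Return (cost, weights, threshold) with the lowest cost on the grid."""
    best = None
    for w in itertools.product(weight_grid, repeat=3):
        if abs(sum(w) - 1.0) > 1e-9:  # assumed constraint: weights sum to 1
            continue
        for threshold in threshold_grid:
            c = cost(w, threshold, samples)
            if best is None or c < best[0]:
                best = (c, w, threshold)
    return best
```

With only four parameters to fit, an exhaustive search of this kind is feasible on a calibration sample of a few hundred scenes.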

[0047] To debug the machine learning model used for scene classification, classification by tags may be performed on the basis of the same single vectors obtained at step (105) and the scenes obtained during segmentation at step (107). A training sample of objects extracted from the scenes (tags) is used for classification. Each scene may have several tags, i.e., a multilabel classification problem is solved.

[0048] A scene is represented for classification by the averaged vector over the entire scene (here, averaging means selecting the vector from the scene that has the smallest Euclidean distance to the mean vector). Since the feature vector has already been prepared at step (105), classification does not require complex end-to-end deep learning approaches; instead, training on the training sample of tags can be performed using classical machine learning (ML) methods, for example, gradient boosting.
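The scene representation described above can be sketched as follows; the function name is hypothetical, and the input is assumed to be an array with one single vector per frame of the scene.

```python
import numpy as np

def scene_representative(vectors):
    """Pick the frame vector closest (Euclidean) to the scene's mean vector.

    vectors: array-like of shape (num_frames, dim).
    Returns the index of the chosen frame and the vector itself.
    """
    vectors = np.asarray(vectors, dtype=float)
    mean = vectors.mean(axis=0)
    idx = int(np.argmin(np.linalg.norm(vectors - mean, axis=1)))
    return idx, vectors[idx]
```

Choosing an actual frame vector rather than the arithmetic mean ensures that the representation corresponds to a point that really occurs in the scene.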

[0049] The proposed approach can be widely applied to efficient automated segmentation of video sequences using machine learning technologies and algorithms which, after training on appropriate datasets, can classify contextually related scenes with high probability in order to extract them from the overall data stream. For example, it can be used to divide presentations or conferences into blocks by analyzing the displayed content and segmenting it into contextually unrelated blocks, which can then be transmitted as segments to external content display systems, such as video-on-demand systems or the like.

[0050] FIG. 4 shows a general view of a computing system based on a computing device (300) suitable for performing the method (100). The device (300) may be, for example, a server or another type of computing device that can be used to implement the claimed method.

[0051] In general, the computing device (300) contains one or more processors (301), memory such as RAM (302) and ROM (303), input/output interfaces (304), input/output devices (305), and a network communication device (306), all connected by a common data exchange bus.

[0052] The processor (301) (or multiple processors, or a multi-core processor) may be selected from the range of devices widely used at present, for example, those of Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, and the like. A graphics processor, for example, from Nvidia, AMD, Graphcore, etc., may also be used as the processor (301).

[0053] The RAM (302) is random access memory designed to store machine-readable instructions executable by the processor (301) to perform the necessary logical data processing operations. The RAM (302) typically contains executable instructions of the operating system and of the associated software components (applications, program modules, etc.).

[0054] The ROM (303) is one or more permanent data storage devices, for example, a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0055] Various types of I/O interfaces (304) are used to organize the operation of the components of the device (300) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device; they may include, without limitation: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[0056] Various I/O means (305) are used to enable user interaction with the computing device (300), for example, a keyboard, a display (monitor), a touch display, a touchpad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality devices, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification devices (retina scanner, fingerprint scanner, voice recognition module), etc.

[0057] The network communication means (306) enables the device (300) to transmit data over an internal or external computer network, for example, an intranet, the Internet, a LAN, or the like. The one or more means (306) may include, without limitation: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[0058] In addition, satellite navigation means, for example, GPS, GLONASS, BeiDou, or Galileo, may be included in the device (300).

[0059] The submitted application materials disclose preferred embodiments of the technical solution and should not be construed as limiting other, particular embodiments that do not go beyond the scope of the legal protection sought and that are obvious to those skilled in the relevant art.

Claims


1. A method for segmenting scenes of a video sequence, performed using a computing device and comprising the steps of: obtaining an input video sequence containing video and speech data; dividing the input video sequence into three data streams: images of faces, text data based on transcribed speech information, and images of content presented in the video sequence; determining features for each frame of the video sequence in each of said data streams; vectorizing said features in each of said data streams; normalizing the vector representations obtained in each stream and subsequently concatenating the normalized vector representations to obtain a common set of features for each frame of the video sequence in the form of a single vector; defining a distance metric for each data stream as the cosine distance between vectors for the face images in the video, and as the Euclidean distance between vectors for the text data and the content images; calculating, based on said metrics, a value of the common metric for said single vector characterizing each frame of the video sequence; segmenting the video sequence into contextually related scenes based on a comparison of the resulting single vector representations of the frames in the vector space, wherein the division is performed when a threshold value of the common metric of the single vector representations of the video frames is exceeded.

2. The method according to claim 1, in which, based on images of faces, vector representations are formed that characterize at least one of: facial characteristics, gender, age, direction of view, emotions.

3. The method according to claim 2, in which the gestures displayed in the video sequence are additionally recognized.

4. The method according to claim 1, in which the audio characteristics of voices in the video sequence are additionally extracted from the speech data.

5. The method of claim 4, wherein the audio characteristics of the voices include at least one of: pitch, intensity, formants.

6. The method according to claim 1, in which the displayed content is additionally subjected to OCR processing to recognize the presented information.

7. A system for segmenting scenes of a video sequence, comprising at least one processor and a memory storing machine-readable instructions which, when executed by the processor, implement the method for segmenting scenes of a video sequence according to any one of claims 1-6.