RU2654126C2

RU2654126C2 - Method and device for highly efficient compression of large-volume multimedia information based on the criteria of its value for storing in data storage systems

Info

Publication number: RU2654126C2
Application number: RU2016136329A
Authority: RU
Inventors: Владимир Александрович Свириденко
Original assignee: Общество с ограниченной ответственностью "Спирит Корп"
Priority date: 2016-09-09
Filing date: 2016-09-09
Publication date: 2018-05-16
Also published as: RU2016136329A3; RU2016136329A

Abstract

FIELD: data processing.

SUBSTANCE: the invention relates to means for compressing, transmitting and storing multimedia information in compact form. In the method for compressing large-volume multimedia information (MI) in digital form for transmission over communication channels or storage in data storage systems, the video, voice and audio streams are encoded, taking their specific features into account, by video, voice and audio codecs respectively. They are compressed into a common multimedia stream that is transmitted over telecommunication channels, or placed into storage devices as separate files or as a common file. They are restored at the channel output or on retrieval from the storage device in a form acceptable to the consumer or decision-maker, either separately for each stream or, after decompression of the common stream and decoding of the compressed video, voice and audio information, combined into a common restored multimedia stream. The overall stream and its individual components are divided, according to information value criteria, into an informationally significant part and an informationally insignificant part, and the informationally insignificant part is significantly reduced in volume.

EFFECT: technical result consists in increasing the speed during compression of multimedia information.

15 cl, 6 dwg

Description

FIELD OF THE INVENTION

The invention relates to the compression, transmission and compact storage of multimedia information (MI) circulating in systems that transmit or record video, images, voice messages, audio signals, and graphic and text files. These include voice communication, video conferencing, video surveillance, TV and radio broadcasting systems, data storage systems (SHD), and search systems in which all of the MI or its individual fragments must meet a criterion of information sufficiency (or usefulness/value) for deciding whether the information reproduced after transmission over a telecommunication medium or after placement in an SHD is adequate to the stated goal. This is achieved with significant data compression, by excluding (losing) the share of MI that does not meet the value criteria of the decision-making system and/or the decision-maker(s) (DM), and by representing with high quality those fragments of MI that do meet the value criteria.

State of the art

Lossy and lossless data compression methods are widely used in science and technology. They reduce the natural redundancy of data and save the resources of information transmission systems (SPI) and/or data storage systems (SHD), as well as of information retrieval systems, by reducing the initial volume of data generated by the source. The information-theoretic aspect of data compression (source coding) was developed in the works of C. Shannon and other researchers and is characterized by the rate-distortion function R = ƒ(e), where R is the data rate in bit/s at the output of the source encoder and e is the error or distortion when the data are reproduced at the output of the source decoder (lossless compression can provide e = 0 if the original data are in digital form). For transmission over a communication channel with capacity C bit/s, the relation R < C must hold to ensure normal communication. Here R and e act as the main quality criteria of source coding. In effect, this describes so-called lossy data compression, i.e. encoding the data stream at the source output, in the absence of channel errors and under the transmission condition R < C, which controls the level of information loss e > 0 that can be estimated at the output of the source decoder [1, 2]. In addition, coding delay and coding complexity, used as additional criteria (factors), must be bounded for real applications.

The efficiency of data compression is characterized by the compression ratio K = I_in / I_out, where I_in is the amount of information at the source output (possibly generated over some time interval T, if the data stream at the source output, i.e. at the input of the source encoder, is analyzed), and I_out is the amount of information at the output of the source encoder (possibly over the same interval T, if the result of encoding the data stream is analyzed). The loss e is held fixed in this definition of compression efficiency. Compression efficiency depends on the level of redundancy in the data, and also on the coding method and its ability to reduce or even eliminate that redundancy.
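The dependence of K on redundancy can be illustrated with a minimal Python sketch. It is not part of the patent: it assumes a lossless encoder (e = 0), uses the standard zlib module as a stand-in for the source encoder, and the two test inputs are hypothetical.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return K = I_in / I_out in bytes for a lossless zlib encoder (e = 0)."""
    encoded = zlib.compress(data, 9)
    # Lossless round trip: the decoder must reproduce the input exactly.
    assert zlib.decompress(encoded) == data
    return len(data) / len(encoded)


# Hypothetical inputs: a highly redundant sequence and pseudo-random noise.
rng = random.Random(0)
redundant = b"ABCD" * 2500
noise = bytes(rng.getrandbits(8) for _ in range(10_000))

k_redundant = compression_ratio(redundant)
k_noise = compression_ratio(noise)
print(f"K for redundant data: {k_redundant:.1f}")
print(f"K for noise-like data: {k_noise:.3f}")
```

The redundant input yields a large K, while the noise-like input stays close to K = 1, since there is almost no redundancy for the coder to remove.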

Multimedia streams such as video/images or speech/audio are examples of data at the output of a source (for example, a still or video camera, or a microphone). Digitizing and encoding them necessarily introduces an error e, which determines the quality of digitization and coding. Objective quality criteria widely used in quality assessment include the root-mean-square (RMS) error and its variations, the maximum error, and the signal-to-noise ratio (SNR) or peak SNR (PSNR).
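The objective criteria named above can be sketched in a few lines of Python. This is an illustration only: the sample values are hypothetical, and the peak value of 255 assumes 8-bit samples.

```python
import math


def mse(original, decoded):
    """Mean-square error between two equally long sample sequences."""
    if len(original) != len(decoded) or not original:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)


def max_error(original, decoded):
    """Largest absolute difference between corresponding samples."""
    return max(abs(a - b) for a, b in zip(original, decoded))


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for a lossless result."""
    err = mse(original, decoded)
    if err == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / err)


# Hypothetical 8-bit samples before and after lossy coding.
original = [52, 55, 61, 66]
decoded = [50, 56, 60, 68]

print("MSE:", mse(original, decoded))
print("RMS error:", math.sqrt(mse(original, decoded)))
print("max error:", max_error(original, decoded))
print("PSNR, dB:", round(psnr(original, decoded), 2))
```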

The JPEG and JPEG-2000 codecs are widely used for image coding, including an option for lossless compression of a digital copy of an image. Video is coded with the MPEG-1, MPEG-2 and MPEG-4 codecs or their versions in the form of the ITU-T H.26x standards (Recommendations), as well as with the proprietary codecs VP8, VP9 and others [3]. The compression ratio K varies from 10 to 500 depending on the admissible bit rate of the video stream or the memory allocated for storing images/video, the specified quality level, the type of codec and the specifics of the images/video. The principle of lossy image and video compression widely used in such codecs (assuming color content, with monochrome or grayscale as a special case, in the well-known RGB or YUV format) is as follows. Spatial redundancy is removed by moving from the spatial domain to the frequency domain: the image matrix, as in JPEG (or the reference frame, as in MPEG-2 and MPEG-4), is transformed using a system of orthogonal functions (the Fourier or Walsh transform, the discrete cosine transform (DCT), wavelets, etc.), the components are quantized finely or coarsely, which introduces the error e, and the quantized components are then coded on the principles of lossless entropy coding (in particular, arithmetic coding). Temporal redundancy is removed between neighboring frames of the video stream, which as a rule differ only slightly because of the movement of objects in the frame and/or of the camera. These changes are detected by a motion estimator, and motion vectors are determined so that only the changed fragments of the new frame are coded relative to the reference frame (as described in detail in the MPEG-2, MPEG-4 and H.26x standards).
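The transform-and-quantize step described above can be sketched on a one-dimensional block. This is a simplified illustration under stated assumptions, not the procedure of any particular standard: it uses an orthonormal 1-D DCT on 8 hypothetical samples and a single uniform quantization step, whereas real codecs use 2-D blocks, per-frequency quantization tables and an entropy coder.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a sequence of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        value = c[0] * math.sqrt(1 / n)
        value += sum(
            math.sqrt(2 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n))
        out.append(value)
    return out


def codec_round_trip(samples, step):
    """Quantize DCT coefficients with a uniform step and reconstruct."""
    levels = [round(c / step) for c in dct(samples)]   # lossy step, source of e
    decoded = idct([q * step for q in levels])
    return levels, decoded


samples = [52, 55, 61, 66, 70, 61, 64, 73]  # hypothetical pixel row
for step in (1, 40):
    levels, decoded = codec_round_trip(samples, step)
    worst = max(abs(a - b) for a, b in zip(samples, decoded))
    print(f"step={step}: levels={levels}, max error={worst:.2f}")
```

A coarser step turns more coefficients into zeros, which an entropy coder can represent cheaply, at the cost of a larger reconstruction error e.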

Speech is coded with speech codecs of the ITU-T G.7xx series standards (G.711, G.718, G.719, G.722.2 (AMR WB), G.723.1, G.726, G.729, G.729.1, etc.), the GSM-FR, SILC, iLBC and IPMR codecs, and other proprietary codecs. When the coding takes into account the specifics of speech production (based on the “source-filter” model) and of auditory perception, and the measure of speech coding quality is subjective (for example, intelligibility according to GOST or the Mean Opinion Score (MOS)), the codec is called a vocoder (voice codec) [4]. The compression ratio varies from 5 to 50 depending on the required bit rate of the speech stream at the encoder output, the specified quality level, the admissible delay and the specifics of the speech signal (taking pauses in speech into account). If instead the shape of the original signal is preserved at the codec output with a controlled error e, such codecs are called waveform codecs. An example is the G.726 speech codec, which implements adaptive differential pulse-code modulation (ADPCM) in the time domain, but its efficiency in terms of K is low: K = 3…5.

Audio signals are coded with well-known and widely used audio codecs such as MP3, AAC, AAC+, WMA, Ogg Vorbis and others [5]. Practically all audio codecs are built on the waveform coding method, but the signal is usually processed in the frequency domain. The compression ratio K of an audio stream varies from 5 to 30 and depends on the bandwidth of the audio signal and the required playback quality after decoding.

The level of loss (error) e can be reduced to zero, as in codecs that provide lossless compression and in the deduplication methods used in data storage systems (SHD) to compress an array of symbolic data by eliminating duplicate copies of repeated data [6]. Examples are the zip and rar archivers, widely used for coding symbolic information (for example, alphanumeric texts), and the entropy coders in the standard image and video compression methods noted above. They are of relatively little independent interest for coding multimedia information because of their very low compression efficiency (a compression ratio of two to three is considered quite normal for text), but they are used as a component of the overall multimedia compression method.
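Deduplication with e = 0 can be sketched as follows. This is a minimal illustration, not the method of reference [6]: the fixed chunk size, the use of SHA-256 digests as chunk identifiers and the function names are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each distinct chunk once."""
    store = {}    # digest -> chunk; every unique chunk is kept only once
    recipe = []   # ordered list of digests needed to rebuild the data
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data exactly (lossless, e = 0)."""
    return b"".join(store[digest] for digest in recipe)


data = b"ABCDABCDABCDWXYZ"  # hypothetical symbolic data with repeats
store, recipe = deduplicate(data)
print("chunks referenced:", len(recipe), "chunks stored:", len(store))
print("restored correctly:", restore(store, recipe) == data)
```

Only two of the four referenced chunks are stored here. The saving in a real system depends on how much of the data actually repeats and on the overhead of the digests.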

One feature of this classical information-theoretic approach is that, when decoding information transmitted over the channels of an information transmission system (SPI) or stored in compressed form in memory systems, the information must be reproduced with the highest possible quality, possibly with a small error e, which is the price paid for a relatively high compression ratio K of the initial data volume I_in. Assume that very high-quality (HD) video at 30 frames per second (fps) requires a rate of R = 50 Mbit/s, while relatively low-quality video requires R = 128 Kbit/s. Converting the high-rate video stream into the low-rate one requires a compression ratio of no more than K = 400. For speech, a high-quality signal in a 7 kHz band requires a rate of R ≥ 32 Kbit/s for transmission or recording. It can be compressed to 1.2 Kbit/s in the 4 kHz “telephone” band (although the speech quality after decoding will be rather low: MOS ≈ 2.8…3.0 on the five-point quality scale). That is, K ≈ 30…50.
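The ratios quoted above follow from dividing the source rate by the target rate. The short check below assumes decimal prefixes (1 Mbit/s = 1000 Kbit/s); with binary prefixes (1 Mbit = 1024 Kbit) the video figure comes to exactly 400. The function name is hypothetical.

```python
def required_ratio(source_rate_bps: float, target_rate_bps: float) -> float:
    """Compression ratio K needed to bring a stream down to the target rate."""
    if source_rate_bps <= 0 or target_rate_bps <= 0:
        raise ValueError("rates must be positive")
    return source_rate_bps / target_rate_bps


# HD video at 50 Mbit/s reduced to 128 Kbit/s.
video_k = required_ratio(50_000_000, 128_000)

# Wideband speech at the minimum quoted rate of 32 Kbit/s reduced to 1.2 Kbit/s.
speech_k = required_ratio(32_000, 1_200)

print(f"video:  K = {video_k:.1f}")
print(f"speech: K = {speech_k:.1f}")
```

For speech, 32 Kbit/s is the lower bound on the source rate, so the computed value is the lower end of the range; higher source rates give correspondingly larger K.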

However, this approach (it may be called traditional), widely covered in the technical literature, sets a fundamental limit on the efficiency of coding a multimedia source because of the requirement “to reproduce the complete signal (images, video, speech, audio) practically in its original form, but with relatively small controlled losses”, so that sight or hearing hardly notices these losses when viewing or listening. It assumes that almost nothing may be missed (“everything is important”) in perception and reproduction, and that the person or recognition machine must receive everything in detail when the image/video or speech/sound is decoded. This holds even when the full video or speech/audio signal as a whole carries no information important to the viewer and/or listener for making decisions about the essential situations or events that are shown, described, accompanied by sound or spoken in the video, audio and/or speech stream. The approach does not treat one part of the multimedia information as “information noise” that distracts attention and consumes substantial resources, including time, or even interferes with making the right decision. In search systems and decision-making systems it is advisable to assign this part to the “losses”. The other part (as a rule, much smaller in volume) is informative, useful and valuable to the user, and it is this part that should be written to the data storage system (SHD) for subsequent reproduction with the required quality and for analysis when searching for relevant data or supporting decision-making systems. 
Thus, in many real situations data compression is considered in the paradigm of the data's value to the decision-maker (DM), rather than in the “rate-distortion” paradigm, which disregards their informational significance to that same DM.

Classification of a data stream for more efficient compression is widely used, in particular in patent [2], where the choice of the coding method and codecs for data compression depends or does not depend on the content. Another example, for a speech stream, is the Voice Activity Detector (VAD), which is included in the speech codecs of practically all packet-switched information transmission systems (SPI), in particular on the Internet. A VAD classifies the speech stream into segments where speech is present and segments where it is absent (pauses), i.e. it is a “speech/pause” classifier. Pauses are treated as non-valuable fragments of the stream and are not transmitted, while segments containing a speech signal are treated as valuable fragments.
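A “speech/pause” classifier of this kind can be sketched with a simple energy threshold. This is a deliberately naive illustration, not the VAD of any standard codec: the frame length, the threshold and the sample values are hypothetical, and practical detectors use adaptive thresholds and spectral features.

```python
def frame_energy(frame):
    """Average power of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def classify_frames(samples, frame_len=4, threshold=100.0):
    """Label each full frame as 'speech' or 'pause' by its energy."""
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        labels.append("speech" if frame_energy(frame) > threshold else "pause")
    return labels


def keep_speech(samples, frame_len=4, threshold=100.0):
    """Drop pause frames and keep only frames classified as speech."""
    kept = []
    labels = classify_frames(samples, frame_len, threshold)
    for index, label in enumerate(labels):
        if label == "speech":
            kept.extend(samples[index * frame_len:(index + 1) * frame_len])
    return kept


# Hypothetical signal: quiet, loud, quiet, loud.
samples = [0, 1, -1, 0, 40, -35, 50, -45, 1, 0, 0, -1, 30, -30, 25, -20]
print(classify_frames(samples))
print(len(keep_speech(samples)), "of", len(samples), "samples kept")
```

Dropping the pause frames halves the data in this example before any codec is applied, which is the sense in which pauses are treated as non-valuable fragments.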

The criterion for judging what in a multimedia data stream is information noise and what is useful information is determined by the task at hand, which includes analysis of relevant information as a component, and by the user (even if the data are processed by a machine) who is interested in the result of solving that task, i.e. in achieving the stated goal. In situations where lossy compression needs to keep only the part of the data that is informative from the standpoint of the user as decision-maker (DM), i.e. an interested viewer or listener, a very large compression ratio can be obtained (for example, K = 1000 or more). This approach can be called supercompression (or data compression with a value criterion). It should not be confused with fractal data compression methods [7], which promised such values of K but were never realized for most types of images and video, and which, as applied to MI compression, fit within the traditional “lossy compression” paradigm and the “rate-distortion” function. The value-based approach, defined by the usefulness or value of information to the user, has been studied by various researchers, including M.M. Bongard, R.L. Stratonovich, A.A. Kharkevich, A.P. Verevchenko and other prominent scientists. An important quotation from [8] on this question reads: “It is known that work with information is carried out for a specific purpose. The increase in the probability of achieving the goal is assessed by the user striving for it. The task is therefore to obtain accurate and unambiguous information, freed from redundancy. 
It has been determined that redundant, repeated information has zero usefulness, since it neither increases nor decreases the probability of achieving the goal… Thus, the usefulness of information is the optimal satisfaction of the specific requirements of consumers' information requests when they make decisions in concrete conditions (situations).”

Below are illustrative examples of selecting multimedia information by a value/usefulness criterion for solving some non-standard tasks. They assume that the known multimedia compression methods listed above can be used, together with a person (or a virtual recognition machine) for intelligent classification of frames in a video sequence, identification of specific fragments (keywords and expressions) in a speech message, and detection of particular acoustic events in an audio stream.

In the first example, the user of the information in an image is a doctor diagnosing a disease from a digital radiograph of a patient's lungs on a computer. The original radiograph was taken with high quality (a small value of e), converted to digital form and encoded with a JPEG codec that controls the RMS error/PSNR. In the source radiograph (the original on film), a small black dot (small relative to the area of the original) is visible in the right lung, indicating the onset of disease. In the JPEG version of the radiograph this dot is absent (it has been removed), although the radiograph as a whole has a fairly high resolution. In other words, by the RMS error criterion the error e caused by removing the dot is small, whereas from the diagnostician's standpoint very important diagnostic information has been lost (no value criterion was specified when the particular coding method was applied). The doctor is interested only in this part of the whole image, in all its detail. The rest is either of no interest or, as background, can be represented at low resolution, i.e. compressed with a much larger K and a larger error e. For the diagnostician (as a decision-maker, DM) it is important not to miss decision-relevant information against a background of insignificant but finely detailed information, which is usually large in volume. 
In this example, the RMS error criterion applied to the whole image, without indicating what is useful/valuable in the digital copy of the radiograph, cannot ensure that the information important to the doctor is preserved. It must be supplemented with a value criterion adequate to the diagnostic task, according to which only the diagnostically valuable part is extracted from the information and the non-valuable part is suppressed (or even excluded entirely), while the valuable part is represented with high quality that ensures the diagnostic data are displayed.

The second example considers a video conference between two remote participants (a P2P session), i.e., there are two video streams over a two-way communication link (virtual or physical) between them. The participants discuss a topic, frequently digressing from it. The background behind the participants is static. The whole video conference is recorded, and the session recording (the source video file) is placed in compressed form in a data storage system. Some time later, a person or organization making decisions (DM) needs to establish from the recording: 1) who exactly took part in the conference on each side, 2) whether there were any special events in the video stream, and 3) whether certain keywords were spoken during the conversation on a particular topic, and by whom. (Other DMs may pose other questions. If there are many such persons, the information recorded in the storage system that they may potentially need must answer the questions of the whole group.) For this case these questions constitute the value criteria: the information selected from the videoconferencing session according to them must allow the listed questions to be answered.

Let the bit rate of the source video compressed according to ITU-T H.264 (MPEG-4 Part 10) be 512 Kbit/s at 30 frames/s, and the speech bit rate be 32 Kbit/s. The total already-compressed multimedia stream is then 544 Kbit/s in one direction and 1.088 Mbit/s in both (signaling overhead is ignored, and the video resolution is assumed to be standard definition, SD). Let the two videoconference participants talk for T = 1 hour. The total volume of transmitted information is I_in = R × T = 1.088 Mbit/s × 3600 s = 3916.8 Mbit = 489.6 MB. This fixes the quality criterion of the source video (the distortion value e) for the entire video stream, i.e., for both its informative and non-informative parts.
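The arithmetic above can be checked with a short sketch. It is only an illustration of the calculation in the text; the function and variable names are hypothetical, and the unit convention (1 byte = 8 bits, 1 MB = 1000 KB) is the one the figures in the text imply.

```python
def stream_volume_mb(rate_kbit_s: float, duration_s: float) -> float:
    """Volume in MB of a constant-bit-rate stream (8 bits per byte, 1000 KB per MB)."""
    return rate_kbit_s * duration_s / 8 / 1000

VIDEO_KBIT_S = 512    # H.264 video, per direction
SPEECH_KBIT_S = 32    # speech, per direction
DURATION_S = 3600     # T = 1 hour

one_way_kbit_s = VIDEO_KBIT_S + SPEECH_KBIT_S   # 544 Kbit/s
both_ways_kbit_s = 2 * one_way_kbit_s           # 1088 Kbit/s = 1.088 Mbit/s
total_mb = stream_volume_mb(both_ways_kbit_s, DURATION_S)

print(f"{one_way_kbit_s} Kbit/s one way, {total_mb:.1f} MB in total")
```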

If only the participants are of interest, along with whether other people appeared in the frame during the call, it suffices to extract only the reference I-frames (images of the participants), which recur on average every 32 frames (this interval can be set in the range 8…100) and are stored in compressed form (~30 KB per frame). Very fast changes in the video sequence may not be captured by the motion estimator built into the H.264 codec [9]. For long-term storage, only informative frames need to be detected and recorded: a frame is declared informative if the scene has changed substantially relative to the previous informative frame, for example when a new person appears or a participant leaves the camera's field of view. Suppose that on each side, besides the session participants themselves, who were constantly in the frame, other people appeared in the camera's field of view for some time, and that the participants' own motion in the frame was relatively small (blinking, talking, head turns, hand movements, and so on). Suppose the following frames were recorded for each participant j, j = 1, 2: participant j alone in the frame; participant j with another person in the frame; participant j alone in the frame again. This gives three informative frames per participant, six in total, which require 30 KB × 6 = 180 KB of memory.

Note that the frames that are informative for the DM by the value criterion (hereinafter, key frames) coincide with some of the reference frames produced by the H.264 encoder, but the average number of the latter over 1 hour is 30 × 3600 / 32 = 3375, far more than the number of frames that are actually informative (key) for the DM. Recording only these informative frames loses the motion and dynamics of the video stream (participants talking, blinking, gesturing), but the time t at which each key frame occurs in the video stream is known and recorded. In other words, the minor dynamics in the behavior of the participants and of the people who entered the frame are not captured in these six frames, nor are the participants' possible emotions; only the presence of the participants and other people at certain moments within the interval T is captured.
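The rule "a frame is informative if the scene changed strongly relative to the previous informative frame" can be sketched as follows. This is a minimal illustration, not the method of the invention: the change measure (mean absolute pixel difference), the threshold value, and the toy four-pixel "frames" are all assumptions made for the example.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized lists of pixel values."""
    return sum(abs(x - y) for x, y in zip(frame_a, frame_b)) / len(frame_a)

def select_key_frames(frames, threshold):
    """Return indices of frames that differ strongly from the last kept frame."""
    if not frames:
        return []
    kept = [0]  # the first frame is always kept as the initial reference
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Toy sequence for one participant (hypothetical pixel values).
frames = [
    [10, 10, 10, 10],  # participant alone
    [10, 11, 10, 10],  # small motion (e.g. blinking): not informative
    [10, 10, 90, 90],  # another person enters the frame
    [10, 11, 90, 90],  # small motion again
    [10, 10, 10, 10],  # participant alone again
]
key = select_key_frames(frames, threshold=20)

# Two participants, ~30 KB per stored frame, as in the text.
storage_kb = 2 * len(key) * 30
print(key, storage_kb)
```

With this toy input the sketch keeps three frames per participant, which reproduces the 6 frames and 180 KB of the example, against 3375 reference I-frames per hour.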

Extracting key speech events (utterances, phrases, words) from speech, whose bit rate is much lower than that of video, is quite difficult without modern, relatively reliable methods for speaker-independent recognition of continuous speech on arbitrary topics, followed by analysis of the recognizer's text output to detect special events at the semantic (and possibly pragmatic) level. If no such reliable mechanism is available, the speech signal must be recorded in full (preferably compressed to save storage, but with acceptable quality). Suppose such key speech events have been detected and that, to preserve a mini-context, the speech before and after each event is recorded (let the recording length be 20 s = 10 s + 10 s). Suppose that five detected speech events (words or utterances) are attributed to participant 1 and two to participant 2. A speech signal with a total duration of (5 + 2) × 20 s = 140 s is then written to the storage system. This requires 32 Kbit/s × 140 s = 4480 Kbit = 560 KB of memory. Together with the recorded frames this amounts to 740 KB. Taken together, this information answers questions 1), 2) and 3) posed by the DM. Methods and tools for keyword recognition in continuous speech have already been developed and are available as commercial products [10, 11].
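The context-window bookkeeping described above can be sketched as follows. The detection times are hypothetical, and the sketch assumes, as the text implicitly does, that the windows do not overlap (overlapping windows would be counted twice here).

```python
def context_windows(event_times_s, before_s=10, after_s=10):
    """Return (start, end) windows in seconds around each detected keyword event."""
    return [(max(0, t - before_s), t + after_s) for t in event_times_s]

def speech_storage_kb(windows, rate_kbit_s):
    """Storage in KB for the windows at a constant bit rate (8 bits per byte)."""
    total_s = sum(end - start for start, end in windows)
    return rate_kbit_s * total_s / 8

# Hypothetical detection times, in seconds from the start of the session.
events_participant_1 = [300, 900, 1500, 2100, 2700]
events_participant_2 = [600, 1800]

windows = context_windows(events_participant_1 + events_participant_2)
speech_kb = speech_storage_kb(windows, rate_kbit_s=32)
total_kb = speech_kb + 180  # plus the six key frames from the example

print(speech_kb, total_kb)
```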

The data compression ratio obtained in this example is K = 489600 / 740 ≈ 661, i.e., the already-compressed video data stream (with K ~ 30) is further thinned out by a factor of more than 600 with no loss of information for the DM. The overall compression ratio is K = 30 × 661.6 ≈ 19848. Note that in this example of extracting the part of the multimedia stream that is valuable to the DM, the speech recording requires more memory than the images (the selected key frames).

If no mechanism for recognizing speech events (i.e., keywords and expressions) at the semantic level is used, the speech signal must be recorded in compressed form in its entirety, including pauses (assuming the participants speak in turn). Let a vocoder operating at 8 Kbit/s be used (for example, the G.729 codec). The total volume of speech information is then 8 Kbit/s × 3600 s = 28800 Kbit = 28.8 Mbit = 3600 KB = 3.6 MB. Together with the frames from the video stream, the total volume becomes 180 KB + 3600 KB = 3780 KB. The compression ratio drops to K = 489600 / 3780 = 129.5, i.e., by a factor of about five. In this case the volume of the full speech recording greatly exceeds that of the selected images (frames). If pauses are removed (they usually make up no more than 20% of speech), K can be increased slightly.
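The two variants of the second example (speech stored only around detected keywords versus speech stored in full at 8 Kbit/s) can be compared with a few lines of arithmetic. The names are hypothetical; the input figures are the ones given in the text.

```python
INPUT_KB = 489_600     # one hour of two-way video plus speech
KEY_FRAMES_KB = 180    # six key frames at ~30 KB each

def compression_ratio(input_kb, stored_kb):
    """Ratio of the source volume to the volume actually stored."""
    return input_kb / stored_kb

# Variant 1: keyword spotting available, 140 s of speech at 32 Kbit/s.
keyword_speech_kb = 32 * 140 / 8
k_with_kws = compression_ratio(INPUT_KB, KEY_FRAMES_KB + keyword_speech_kb)

# Variant 2: no keyword spotting, the whole hour of speech at 8 Kbit/s.
full_speech_kb = 8 * 3600 / 8
k_without_kws = compression_ratio(INPUT_KB, KEY_FRAMES_KB + full_speech_kb)

print(f"{k_with_kws:.1f} {k_without_kws:.1f} {k_with_kws / k_without_kws:.1f}")
```

The comparison makes the dependence explicit: once the video is reduced to a handful of key frames, the stored volume is dominated by speech, so the quality of keyword detection largely determines K.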

The third example concerns video surveillance for security purposes. A video camera is aimed at a particular location and operates 24 hours a day. The quality of the source video is fairly high and matches the parameters of the second example (bit rate 512 Kbit/s, H.264 codec; infrared illumination is used at night). The total volume of information recorded per day is I_in = 512 × 24 × 3600 = 44236800 Kbit = 5529600 KB = 5529.6 MB. The security service is interested only in frames showing 1) the observed scene in the morning, afternoon, evening, and at night, for which 4 frames at SD resolution suffice, 2) dynamics in the scene caused by people walking and vehicles passing, and 3) objects left at various places in the scene, as detected by the object tracking and recognition subsystem of the surveillance system. Thus, static situations are represented by single frames, whereas dynamic situations must be recorded in full (possibly with additional data compression), even when the change in a frame relative to the previous one is very small.

Suppose the scene is static about 60% of the time (dynamics below a certain threshold). In that case, as stated above, it is represented by four frames with a total volume of 30 KB × 4 = 120 KB. For the remaining 40% of the time the dynamic scene is recorded (with any changes that exceed the dynamics threshold), which requires 512 × 0.40 × 24 × 3600 = 17694720 Kbit = 2211840 KB = 2211.84 MB per day. The total volume of information written to the storage system is therefore I_out = 2211840 + 120 = 2211960 KB = 2211.96 MB, and the compression ratio is K = 5529.6 / 2211.96 ≈ 2.5. This value of K is quite modest; taking into account the initial 30-fold compression (as in the previous example), the overall ratio is only about 75. Even so, there is a gain: less memory is needed to record the digital video, and less time is needed for a human or a machine in the security system to analyze situations and object positions in the scene. The gain could be increased substantially by adding intelligent scene analysis to select informative frames, which is a separate task.
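The surveillance estimate can be reproduced, and varied, with the sketch below. The function name and its parameters are hypothetical; the bit rate, frame size, and the 60% static share are the figures from the text.

```python
SECONDS_PER_DAY = 24 * 3600
RATE_KBIT_S = 512   # H.264 source stream
FRAME_KB = 30       # one stored SD frame

def surveillance_ratio(static_percent, static_frames=4):
    """Return (input KB, stored KB, K) for one day of surveillance video."""
    input_kb = RATE_KBIT_S * SECONDS_PER_DAY / 8
    dynamic_s = SECONDS_PER_DAY * (100 - static_percent) / 100
    # Dynamic intervals are stored in full; static ones as a few single frames.
    output_kb = RATE_KBIT_S * dynamic_s / 8 + static_frames * FRAME_KB
    return input_kb, output_kb, input_kb / output_kb

input_kb, output_kb, k = surveillance_ratio(static_percent=60)
print(input_kb, output_kb, round(k, 1))
```

Because the few static frames are negligible next to the recorded dynamic intervals, K is close to the reciprocal of the dynamic share of the day, which is why it stays modest unless the dynamic intervals themselves are thinned out.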

The fourth example considers a recording of a TV program (video, speech, audio) used as the original for censorship review, with the following compression parameters: MPEG-4 video codec, 25 frames/s, bit rate 2.048 Mbit/s, recording duration 1 hour; the sound signal (speech and audio) is recorded with an MP3 audio codec at 128 Kbit/s. The total media stream bit rate is thus 2.176 Mbit/s, and over 1 hour the volume of information is 7833.6 Mbit = 979.2 MB. These parameters determine the quality of the whole multimedia stream (all three of its components), although for censoring this file the quality of both the video and the sound stream can be reduced without loss of information for the censor.

When analyzing the recorded media file, it is necessary to detect the speech fragments and ignore the musical accompaniment. Within the speech fragments, individual utterances from a given dictionary of utterances and words are identified according to the censorship criteria, and the speech context of each utterance is retained for 15 s before and 15 s after it, 30 s in total. In the video sequence, the frames of interest for analysis (by the value criteria) are those in which a person speaks these utterances, as well as scenes or scene fragments similar to those in a database (DB) of special images (frames). The fragments of the video and speech streams selected in this way are placed in the storage system.

Suppose that in a particular media file music takes up 30% of the time, pauses 20%, and speech 50% (possibly accompanied by music at times), and that a reliable speech/audio/pause classifier is available to detect speech fragments and pauses. Suppose also that within the speech fragments we can fairly reliably locate the segments where words or expressions from the given dictionary are spoken, and that 20 such expressions or words are found in the whole file. Since the video of these utterances must be recorded, this corresponds to recording a video stream with a total duration of 20 × 30 s = 600 s. Its bit rate can be lowered from 2.048 Mbit/s to 512 Kbit/s at some cost in quality, because the task is to censor the file and a somewhat reduced quality is fully adequate for that. The corresponding volume is 512 × 600 = 307200 KB = 307.2 MB.

Suppose 15 frames have been selected from the image DB, and the presence of frames close to them must be detected and recorded in the TV film. We assume that a subsystem is available that can detect similar images with acceptable reliability (such solutions already exist in well-known search engines, for example Yandex or Google [12, 13]). Suppose 120 such frames are found in the file; 10% of them match the similarity criterion only weakly, but all of them fall into the set of interest to the censor. Storing each frame individually requires 30 KB (the quality will be slightly lower than in the original). The required memory is 120 × 30 = 3600 KB = 3.6 MB. In total, the frames require 307.2 + 3.6 MB = 313.8 MB of memory. The volume of service information is much smaller (no more than 10 bytes per frame) and is therefore not taken into account.

The recording of the 20 speech utterances with their context lasts 20 × 30 s = 600 s. For this task, at some cost in recorded speech quality, a G.729 vocoder (8 Kbit/s) can be used; it produces the selected speech segments (without pauses) with a volume of 8 × 600 = 4800 Kbit = 600 KB, which has practically no effect on the total volume of 313.8 + 0.6 MB = 314.4 MB.

In total, storing this "trimmed" file for censoring the TV film requires only 314 MB. The compression ratio is K = 3.11, i.e., the already-compressed media stream (compression ratio K of at least 10) is further reduced in volume by about a factor of three for the censorship task, which can be completed much faster because less material has to be reviewed. This, however, requires reliable audio/speech/pause classifiers and a reliable search of the video stream for frames to compare with the scenes in the reference frames from the image DB.

Disclosure of the Invention

The purpose of this invention is to create a method, and a device implementing it, for highly efficient compression of large-volume multimedia information according to criteria of its value/usefulness. Existing methods and devices control "distortion" or averaged errors over the complete media stream, which includes both parts that are informationally important to the DM and parts that are informationally insignificant. In contrast, the proposed method and device select for storage only the informationally meaningful fragments needed to solve real-world tasks and achieve the DM's goals, and represent these fragments as compactly as possible. This saves not only memory in data storage systems but also the time needed to solve the tasks and achieve the goals, because "information noise" and existing redundancy are eliminated or greatly reduced in volume.

To achieve this goal, the following steps are proposed:

1. Define, or select from a predefined list, criteria of value/usefulness for the text (symbolic), graphic, speech, audio, and video information in the multimedia stream, based on the specifics of its use and/or a survey of a representative group of decision makers (DMs) drawn from the people and organizations that use, or may potentially use, the multimedia information stored in the storage system to solve their information tasks within decision-making systems. The composition of the list of criteria, ranked by priority, may change (some entries may be removed and others added).

2. Define reference elements (records or samples) in an image database (DB), a DB of keywords and expressions, and a DB of acoustic signals and acoustic events. These databases are sets of symbolic and signal (or signal-parameter) records that are of value to the DM in terms of whether the corresponding elements occur in the video, speech, and audio streams with some measure of correspondence (similarity) to the reference elements (DB records).

3. When fragments of the multimedia stream or file that are similar in a defined sense to given reference samples from the databases of reference elements must be found by analysis and comparison, take the correspondence of those fragments to the given samples as the value criterion. Treat data in the media stream that satisfy the given criteria as informationally valuable (informative), and data that do not as "information noise", i.e., non-informative. Where the value of media data is determined by thresholding, compare the quantitative estimates of the value criteria with the selected thresholds (the thresholds must be specified).
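The threshold-based split in step 3 can be sketched as follows. This is an illustration only: the fragment identifiers, the similarity scores, the threshold value, and the choice of "greater than or equal" as the comparison are all assumptions, since the text only requires that thresholds be specified.

```python
def split_by_value(fragments, scores, threshold):
    """Split fragments into informative ones and "information noise" by threshold."""
    informative, noise = [], []
    for fragment, score in zip(fragments, scores):
        if score >= threshold:
            informative.append(fragment)
        else:
            noise.append(fragment)
    return informative, noise

# Hypothetical fragments with similarity scores against reference samples.
fragments = ["frame_0012", "frame_0345", "frame_0900", "frame_1500"]
scores = [0.92, 0.15, 0.71, 0.40]

informative, noise = split_by_value(fragments, scores, threshold=0.7)
print(informative, noise)
```

Only the informative list would be passed on for storage; the noise list is what gets discarded or heavily reduced.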

As a quantitative value criterion for images and video frames, it is proposed to use the degree of correlation r_ij of digital images (the DIC method), which characterizes the statistical relationship between two images: a reference sample from the image DB and an individual frame, or a fragment of it, in the video stream, for which the degree of "similarity" to the reference sample is being determined [14]. The DIC method is widely used in practice for accurate planar and volumetric measurement of changes in an image based on estimates of the cross-correlation coefficients of two images, and various tools for computing the coefficients r_ij have been developed to implement it.
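As a minimal sketch of such a similarity score, the snippet below computes a plain Pearson correlation coefficient between a reference image and a frame of the same size, both given as lists of rows of grayscale values. It is an illustration only: the function names, the default threshold of 0.8 and the handling of flat images are assumptions, and the DIC tools cited in [14] may compute r_ij differently.

```python
from math import sqrt


def correlation(ref, frame):
    """Pearson correlation of two equally sized grayscale images (lists of rows)."""
    a = [p for row in ref for p in row]
    b = [p for row in frame for p in row]
    if not a or len(a) != len(b):
        raise ValueError("images must be non-empty and of equal size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Flat image: the coefficient is undefined; returning 0 is an assumption.
        return 0.0
    return cov / sqrt(var_a * var_b)


def is_valuable(ref, frame, threshold=0.8):
    """Hypothetical threshold test: the frame counts as valuable if it is similar enough."""
    return correlation(ref, frame) >= threshold
```

A frame that is a brightness-scaled copy of the reference scores 1, and an inverted copy scores -1, so the threshold separates "similar" from "dissimilar" frames regardless of overall brightness.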

For the speech stream, what is detected is not similar speech signals but keywords and expressions: the detection of such a word or utterance in a stream of continuous speech (Key Word Spotting, KWS) is characterized by the probability of correct detection P_kws and the probability of error P_error. In this case, because a word or expression is symbolically definite, comparison with a threshold is not used. KWS systems are a special case of continuous speech recognition systems and are used more often in practice because of their greater reliability.

For the audio stream, recognition of acoustic events (screams, blows, collisions, loud sounds, etc.) reduces to classifying each sound, taking into account primarily its energy component (sound power) and, possibly, the spectral characteristics of the audio event, which is a well-established approach in audio signal processing and analysis. The decision about each such acoustic event can also be made by comparing estimates of its parameters with thresholds.

For the informationally valuable fragments of video, speech and audio, separately specify a quality criterion in the form of an admissible distortion level e = (e_v, e_s, e_a) (for video, speech and audio respectively) relative to the original. This level applies when the fragments are played back after being read from the data storage system and decoded, the fragments of each stream together forming, where necessary, a multimedia stream or multimedia file.

4. Demultiplex the common multimedia stream into separate streams (video, speech, audio), and process and analyze each of them separately, by its own methods, because the streams differ significantly.

5. In each of the separate streams, select the informationally significant part according to the given criteria of data value/usefulness. To analyze the source multimedia information and select its informative part by these criteria, use signal processing and classification methods and elements of artificial intelligence (AI): to assess significant changes in the video stream frame by frame and select the frames or frame sequences that matter by the value criterion, to detect keywords and expressions from a given dictionary in the speech stream, and to pick out acoustic events significant to the decision maker in the audio stream. When processing and analyzing the sound (speech and audio) stream, ensure reliable separation of its speech part, pauses and audio part, assigning mixed sections ("speech against a background of non-speech sounds") to the speech part. Supply the selected valuable fragments of each stream with service information in a defined packet format that fixes their exact temporal position in the corresponding stream and, possibly, other parameters describing the fragments.

6. Where necessary, before storage in the data storage system, subject the selected informative fragments of the video, speech and audio streams to an encoding procedure for (possibly additional) data compression with a controlled error, or to a transcoding procedure, taking into account the subsequent reconstruction of the corresponding stream by a decoder. The value of this error, e = (e_v, e_s, e_a) for video, speech and audio respectively, must be relevant to solving the information task or achieving the stated goal.

7. When the recorded data are retrieved from the data storage system and the selected informative fragments of the common media stream are reconstructed in the video, speech and audio decoders, ensure (where necessary) that the separate video, speech and audio streams are matched and synchronized on the basis of timestamps and the other parameters describing these fragments, so that a single multimedia stream can be formed (where necessary) for analysis by the decision maker. When reconstructing the separate streams or the common multimedia stream, sections can be excluded where there are no changes in the video, no keywords or expressions in the speech, and no acoustic events in the audio stream. For the common stream, the excluded sections are those where such changes are absent in every separate stream at the same time.
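The exclusion rule in step 7 can be sketched with time intervals: a section of the common stream is dropped only if no separate stream has a valuable fragment there. The example below is a rough illustration under stated assumptions: fragments are represented as hypothetical (start, end) pairs in seconds taken from the service information of step 5, and the function names are invented for this sketch.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def kept_intervals(video, speech, audio):
    """Sections kept in the common stream: valuable in at least one stream."""
    return merge_intervals(list(video) + list(speech) + list(audio))


def excluded_intervals(kept, duration):
    """Sections where every stream is uninformative at the same time."""
    gaps, cursor = [], 0
    for start, end in kept:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        gaps.append((cursor, duration))
    return gaps
```

For a single stream the same functions apply with the other two lists left empty, which matches the per-stream exclusion described in the step.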

Let us explain, for each separate stream, how informative fragments are extracted from it. The largest source in terms of data volume and generation rate is the video stream source (for example, the output of a video camera, the output of an encoder or decoder in the terminal equipment of an information transmission system, or a video file in the memory of some system), which is represented as a sequence of video frames. The video source is usually a digital or analog video camera, or several cameras. If the camera is analog, the video stream at its output is subjected to analog-to-digital conversion (digitized). In what follows the video stream is assumed to be digitized, and it is this stream that is processed before its informative fragments are placed in the memory of the data storage system. The video sequence is also assumed to be finite, i.e. to contain N frames, and to be generated by a camera characterized by its output bit rate (R bit/s) and frame rate (ƒ frames/s). We call such a digital video stream the source stream.

In some cases a video stream that has already been compressed with an error e_о can also be treated as the source stream, provided the decision maker considers its visual quality high (setting aside the situation "this low-quality source stream is all we have, and reducing its quality further is inadvisable"). Such a stream can be stored, including in compressed form, in the memory of some system (including a data storage system), but before processing and analysis it must, as a rule, be reconstructed by a decoder as a sequence of equal-status frames.

The source stream is processed frame by frame in order to select new reference key frames (KF), which differ significantly from the previously selected key frames in the video sequence because they contain an update. Key frames are selected sequentially, and by default the first key frame is the first frame of the sequence (i = 1, where i = 1, …, N).

Assessing the significance of changes (SC) in a new frame relative to an already selected reference frame (regarded as a key frame) is not a purely formal procedure. It presupposes a change threshold Delta whose value is chosen by a person on the basis of information from the decision maker, or by the decision maker directly. If the level of changes exceeds the threshold, the changes are considered significant; otherwise they are considered insignificant. A two-threshold scheme with thresholds Delta1 and Delta2 can also be used: if the level (coefficient) of changes is below Delta1, the changes are considered insignificant, and if it is above Delta2, significant. A value between the thresholds is a borderline case, and the decision maker must determine the significance of the changes in such situations. In the single-threshold scheme, Delta1 = Delta2 = Delta. This approach to assessing SC is reasonable when comparing two frames that are relatively far apart in time and are analyzed to determine how the second has changed relative to the first as the scene develops, or when comparing neighboring frames at an abrupt scene change.
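The single- and two-threshold decisions described above can be sketched as one small function. This is an illustration, not the patented decision device: the function name and return labels are hypothetical, and treating a level exactly equal to a threshold as "not exceeding" it is an assumption made here.

```python
def classify_change(level, delta1, delta2):
    """Classify a change level (percent) against thresholds delta1 <= delta2.

    With delta1 == delta2 this reduces to the single-threshold scheme.
    """
    if delta1 > delta2:
        raise ValueError("delta1 must not exceed delta2")
    if level > delta2:
        return "significant"
    if level < delta1 or delta1 == delta2:
        return "insignificant"
    # Between the thresholds: left to the decision maker.
    return "borderline"
```

For example, with Delta1 = 20 and Delta2 = 40, a change level of 30 is returned as borderline and would be referred to the decision maker.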

When assessing changes and deciding whether they are significant, it is advisable to divide each frame into two parts: informational and non-informational. The informational part is the part of the scene containing the objects (scene elements) that usually interest the decision maker, possibly with some dynamics (for example, people). The non-informational part is the part where there are almost no changes (for example, a static or nearly static background). Changes in the non-informational part of the frame are, as a rule, of no interest (with rare exceptions, which must also be taken into account).

Consider videoconferencing (VC) as an example, where one or more participants (for example, one or several people sitting at a common table) are usually shown against a static background. The informational part of the scene is associated with the participants, and the non-informational part with the background. To isolate the informational part of the frame, the well-known image and video processing procedure called segmentation can be applied. If the participant's dynamics are low (speaking, blinking, shifting position slightly, gesturing, waving hands, etc.), segmentation along the silhouette of the participant(s) can be performed quite reliably, albeit with some errors. If the dynamics are relatively high, they can be estimated by the same methods used in the MPEG-2 and MPEG-4 video codecs, based on a motion estimator. It is quite sufficient to isolate the object of interest, even if small areas of the non-informational part of the frame are captured along with it. If the objects of interest to the decision maker are people, the faces of the VC participants can be detected beforehand using known reliable face detection methods, for example the efficient Viola-Jones algorithm, which is included in the open computer vision library OpenCV and can detect various objects in images (not only faces) in real time [15].

To compare efficiently the informational parts of different frames, obtained by segmenting objects (possibly with some dynamics) that are present in neighboring frames, an approach based on set operations (union, intersection, difference [16]) applied to pixels as picture elements can be used. Each frame is represented by a matrix of elements (pixels) P_ij, i = 1, …, n; j = 1, …, m, of size n×m. At the same time, the frame is treated as a finite set of pixels whose total number (the cardinality of the set) is N_total = n×m, and a part of the frame is treated as a subset.

Suppose that the informational parts have been isolated in the two frames being compared. We number the frames 1 and 2; frame 1 is a key frame, so by default its informational part is taken to be of interest to the decision maker, and the size (cardinality) of that part is the number of pixels it contains. The informational parts are sets of pixels (subsets of the corresponding frames), which we call M1 and M2, with N1 and N2 elements respectively. We find the common part of the two sets (i.e. the pixels with identical indices) by set intersection: M1 ∩ M2. It contains the elements that are present in both M1 and M2. These are compared ("equal / not equal"), with pixels having the same indices i and j being compared with each other. The fragment of the informational part of frame 1 that is not in M2 (M1\M2, the difference of the two sets), containing L1 elements, and the fragment of the informational part of frame 2 that is not in M1 (M2\M1), containing L2 elements, together determine the changes in frame 2 relative to (key) frame 1. Suppose also that L3 differing pixels are found in the intersection M1 ∩ M2. The level of change is then L = [(L1 + L2 + L3) / N_Σ] × 100%, where N_Σ is the number of elements in the union M1 ∪ M2. For a single-threshold decision device with threshold Delta = 30%, the changes in frame 2 relative to frame 1 are significant if L ≥ 30%.
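The change level L defined above can be sketched directly with Python sets. This is an illustration under stated assumptions: the informational parts M1 and M2 are given as sets of (i, j) pixel indices (for instance from a segmentation step), frames are lists of rows with zero-based indices, and the function name and the zero result for two empty parts are hypothetical choices.

```python
def change_level(frame1, frame2, m1, m2):
    """Change level L (percent) of frame2 relative to key frame1.

    m1, m2: sets of (i, j) indices forming the informational parts M1 and M2.
    """
    l1 = len(m1 - m2)            # pixels only in M1 (M1 \ M2)
    l2 = len(m2 - m1)            # pixels only in M2 (M2 \ M1)
    # L3: pixels in the intersection whose values differ between the frames.
    l3 = sum(1 for (i, j) in m1 & m2 if frame1[i][j] != frame2[i][j])
    n_union = len(m1 | m2)       # N_sigma, size of the union of M1 and M2
    if n_union == 0:
        return 0.0               # no informational part in either frame (assumption)
    return 100.0 * (l1 + l2 + l3) / n_union


frame1 = [[1, 1, 0], [1, 1, 0]]
frame2 = [[0, 1, 1], [0, 2, 1]]
m1 = {(0, 0), (0, 1), (1, 0), (1, 1)}
m2 = {(0, 1), (0, 2), (1, 1), (1, 2)}
level = change_level(frame1, frame2, m1, m2)  # L1 = 2, L2 = 2, L3 = 1, N_sigma = 6
```

In this toy case L = 5/6 × 100%, roughly 83%, which is above the example threshold Delta = 30%, so frame 2 would be selected as a new key frame.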

The choice of a key frame is not simple either (except for the first frame of the video sequence, which is regarded as both informative and key). Here is an example. Consider a video sequence showing a VC participant. In the first frame, which shows the participant against some background, he faces the camera; his face occupies 20% of the frame and the visible part of his figure 35%. During the conversation the participant has a habit of swiveling in his chair from -90 to +90 degrees. Another VC participant or the decision maker (or a recognition automaton) sees a "swiveling" participant who never leaves the frame, and the informational changes in each frame relative to the previous one are almost nil (they are small). Yet the participant has turned from a frontal view (0 degrees) to a profile view (90 degrees), and formally the changes between these two frames far apart in time (frontal and profile) are very significant if their informational parts are compared: almost 100% for the informational part, and 35%, but not less than 20%, for the whole frame.

Much the same can be said if the participant puts on a mask, covers his face, and/or throws a cloth or other clothing over his head and shoulders. When the whole video sequence is observed frame by frame, the changes from one frame to the next are small, while frames far apart in time (face without a mask or cover and face with one) differ greatly. Clearly, a simple formal comparison is not enough here to select informative frames, and the choice must be made either by a person or by a recognition machine trained on such situations (in particular, one based on artificial neural networks, ANN). In this case the formal selection described above yields more informative frames (in some cases many more) than intelligent selection. In the first case, however, the decision about significant changes in a frame is made by a relatively simple logic device based on simple computation and comparison, whereas the second case requires a complex recognition subsystem (machine) trained on a variety of test situations. Both are nevertheless decision devices.

Key frames selected on the basis of the SC assessment can be encoded with a controlled error e if they have not previously been compressed. If they are already represented in compressed form with an error e_о, they are subjected to further encoding (or transcoding) with the admissible error e only if e > e_о.

Consider the speech stream as an important component of the overall media stream. Its source is usually an analog signal from a microphone output, but once the speech has been digitized in an analog-to-digital converter (ADC) we can speak of a source "digital" speech stream. It is the most important stream for the decision maker in many applications, because it often carries direct information rather than indirect information that requires interpretation; this information is contained, in particular, in keywords and expressions. Of course, the semantics and pragmatics of the speech stream in its complete phrases and sentences provide more information than individual words and utterances, but extracting them requires a more complex system than recognizing a limited set of keywords and expressions from a predefined dictionary: one that recognizes and understands the speech of an arbitrary speaker (and often identifies the language), i.e. converts speech to text (STT, Speech To Text) and finds the information of interest in it. An STT system provides the limiting case of speech compression, with a ratio K ~ 1000…2000 (from 64 kbit/s down to 20-50 bit/s), since it records speech in text form, but information about the speaker's individual speech characteristics is of course lost. To make the section of the speech signal that contains a key utterance or word more informative, one can, rather than merely recording the moment at which the speaker began to pronounce it and the number of the word in the dictionary, store the corresponding speech fragment in full, together with some surroundings (signal context) of length Δt' before and after the detected key utterance or word (values of 10 s and 15 s can be suggested for Δt', as in the examples considered earlier). The fragment can be encoded with a low-bit-rate vocoder, or transcoded, before it is written to the data storage system, if it was not previously encoded or was encoded with a high-bit-rate speech codec. Obviously, every approach to identifying relevant information in a speech signal has its advantages and disadvantages, and the final choice must take various factors into account, not only the value of K.
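The compression ratio quoted for speech-to-text can be checked with simple arithmetic. The sketch below assumes only the bitrates stated in the text (64 kbit/s source speech, 20-50 bit/s text); the function and constant names are illustrative and not part of the patent.

```python
# Back-of-the-envelope check of the compression ratio K for speech-to-text.
SOURCE_BITRATE_BPS = 64_000  # 64 kbit/s source speech, as stated in the text


def compression_ratio(source_bps: float, compressed_bps: float) -> float:
    """Return K = source bitrate / compressed bitrate."""
    if compressed_bps <= 0:
        raise ValueError("compressed bitrate must be positive")
    return source_bps / compressed_bps


# 20 and 50 bit/s are the text-form bitrates given in the description.
k_at_50 = compression_ratio(SOURCE_BITRATE_BPS, 50)
k_at_20 = compression_ratio(SOURCE_BITRATE_BPS, 20)
print(k_at_50, k_at_20)
```

With these endpoints K comes out at roughly 1300 to 3200, the same order of magnitude as the quoted K ~ 1000…2000.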

For the audio stream, it is reasonable to treat as a significant change, or acoustic event, any individual fragment of the sound signal that differs from the "average sound" in some features (in particular, energy). Examples are gunshots, screams, grinding, impulsive impacts and atypical loud sounds (a car horn, colliding objects, falling objects, etc.). If the energy factor is important, it can be decisive. Spectral characteristics of such sounds can be added to it. The role of acoustic events in the audio stream resembles that of keywords in the speech stream, but detecting them does not require recognition of auditory patterns, only signal-level audio processing.
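A signal-level energy criterion of this kind might be sketched as follows. This is a minimal illustration, not the detector described in the patent: the frame length, the `ratio` multiplier and the use of the mean frame energy as the "average sound" are all assumptions made for the example.

```python
from statistics import mean


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping full frame."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [
        mean(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_events(samples, frame_len, ratio=4.0):
    """Return indices of frames whose energy exceeds ratio * average energy.

    `ratio` is a hypothetical tuning parameter, not a value from the source.
    """
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    average = mean(energies)
    return [i for i, e in enumerate(energies) if e > ratio * average]


# Quiet background with one loud burst in the seventh frame (index 6).
signal = [0.1] * 24 + [1.0] * 4 + [0.1] * 12
event_frames = detect_events(signal, frame_len=4)
```

Spectral features, which the text mentions as a possible addition, would need a further test on each candidate frame.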

FIG. 1 illustrates the possible effect of value-based compression using three substreams (video, speech and audio) that together make up the common multimedia stream. In FIG. 1a all three substreams are shown as a common "ribbon" divided into three rows: upper (video) 101, middle (speech) 102 and lower (audio) 103. Together they conventionally represent the volume of the source multimedia stream.

The top row 104 of FIG. 1b shows, as vertical bars, only the selected key frames. In the middle row 105, dotted bars similarly mark the midpoints of the keywords and expressions detected in the speech message; the rectangles drawn on either side of each bar represent speech signal of duration Δt', i.e. the speech context of the keyword or expression. In the bottom row 106, dash-dotted bars similarly mark the midpoints of the key acoustic events; the arcs drawn on either side of each bar represent acoustic signal of duration Δt'', i.e. the acoustic context of the event. (Note that for some applications it is enough to record the time and the list number of each detected keyword/expression and key acoustic event, without "framing" them with these contexts, which reduces the amount of information stored about them.)

The top row 107 of FIG. 1c shows the processed (compressed) multimedia stream, in which the selected key frames, keywords/expressions and key acoustic events appear at their time positions and each selected key fragment includes service information describing it; the bottom row 108 shows the same stream with the pauses between selected fragments removed. This "compressed" stream takes less time to analyze than the original one (about half as long in the figure). If the speech and acoustic contexts are not used, the analysis time falls considerably further, because it is spent only on viewing the key frames and reading the information about the detected keywords/expressions and key acoustic events. When viewing the key frames (after they have been decoded and restored), one can, if necessary, determine who uttered a keyword or expression, provided the source video substream captured this, and which key acoustic events occurred at the time of the frame being viewed.
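The reduction in analysis time from removing the pauses can be estimated by merging the context windows around the detected fragments. The sketch below assumes symmetric windows of half-width Δt' around each event; the event times and the 140 s stream length are made-up illustration values.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals given in seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def compressed_duration(event_times, half_window, total):
    """Total time kept when only +/- half_window around each event is stored."""
    windows = [
        (max(0.0, t - half_window), min(total, t + half_window))
        for t in event_times
    ]
    return sum(end - start for start, end in merge_intervals(windows))


# Hypothetical stream: 140 s long, four key fragments, context of 10 s each side.
kept = compressed_duration([10.0, 20.0, 60.0, 100.0], 10.0, 140.0)
ratio = 140.0 / kept  # how many times shorter the analysis becomes
```

In this made-up case the first two windows overlap and are merged, so 70 s of 140 s remain, which matches the roughly twofold reduction shown in the figure.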

Brief Description of the Drawings

FIG. 1 schematically shows the result of processing the three components (video, speech and audio substreams) of a multimedia stream.

FIG. 2 is a block diagram of the device for efficient compression of a multimedia stream according to the value criterion.

FIG. 3 is a block diagram of the device for selecting key frames in a video stream.

FIG. 4 is a block diagram of the device for detecting keywords and expressions in a speech stream.

FIG. 5 is a block diagram of the device for detecting acoustic events in an audio stream.

FIG. 6 is a block diagram of the device for restoring key information stored in compressed form in a data storage system so that it can be analyzed by the decision maker.

Implementation of the Invention

FIG. 2 shows a block diagram of the device for highly efficient compression of large-volume multimedia information (hereinafter the compression device) according to the criteria of its value for storage in data storage systems. In accordance with the proposed invention, the device uses the method of identifying informative fragments in the video, speech and audio streams that are the separate components of the common multimedia stream.

A multimedia stream or multimedia file arrives at the input of the compression device. The DeMUX block 201 demultiplexes it into three substreams (video, speech and audio) and a service synchronization signal, which also carries information about the multimedia stream (the encoding parameters of each substream and data about the stream as a whole). Each substream goes to its own key information extraction block, which extracts the key fragments of that substream: the video substream goes to the key frame extraction block 202, the speech substream to the keyword/expression extraction block 203, and the audio substream to the key acoustic event extraction block 204. The service synchronization signal and the information about the stream as a whole pass from block 201 to the local sync generator and stream information analyzer block 205.

Each key fragment (key frame, keyword/expression or key acoustic event) is given a time stamp and service information: the substream type; the fragment parameters, including its duration and the parameters used to compress it with a controlled error (if each frame or each speech or acoustic context signal is compressed by its own encoder); the key frame number in the video substream; the keyword/expression number from the list and the time Δt' for the speech substream; and the key acoustic event number from the list and the time Δt'' for the audio substream. Key fragments are extracted from each substream independently, but using the synchronization information from the sync generator and analyzer block 205.
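The service information attached to each key fragment can be pictured as a small record. The sketch below is one possible layout, assumed for illustration only: the patent does not define a concrete packet format, and all field and class names here are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Substream(Enum):
    VIDEO = "video"
    SPEECH = "speech"
    AUDIO = "audio"


@dataclass
class KeyFragment:
    timestamp: float                  # time stamp, seconds from stream start
    substream: Substream              # which substream the fragment came from
    item_number: int                  # frame number, or index in the keyword/event list
    context_half_width: float = 0.0   # delta-t' or delta-t''; 0 means no context stored
    codec_params: dict = field(default_factory=dict)  # parameters of the encoder used
    payload: Optional[bytes] = None   # encoded frame or context signal, if kept

    @property
    def duration(self) -> float:
        """Length of the stored context: delta-t before plus delta-t after."""
        return 2 * self.context_half_width


keyword = KeyFragment(12.5, Substream.SPEECH, item_number=3, context_half_width=10.0)
frame = KeyFragment(1.0, Substream.VIDEO, item_number=0)
in_time_order = sorted([keyword, frame], key=lambda f: f.timestamp)
```

Leaving `payload` empty corresponds to the case where only the time and list number of a detected keyword or event are recorded.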

To extract key frames from the video substream, the key frame extraction block 202 receives reference images from the image database 206, used to search each frame of the video substream for objects similar to the reference images, together with thresholds for assessing the level of similarity from the video/image criteria block 209 for key frame selection. Alternatively, if the search for similar objects is not required, or is supplemented by identifying key frames that differ by significant changes in the part of the scene regarded as informational relative to the previous key frame with its informational part marked (starting from the first frame, which is automatically treated as a key frame), block 202 receives the selected key frame criteria and their significant-change thresholds from the same criteria block 209. The extracted key frames pass from block 202 to the image encoding (or transcoding [17]) block 212 if the controlled-error condition e > e_o holds; otherwise block 212 is bypassed and the key frames go directly to the key frame (image) packet formation block 216, where each packet contains the key frame itself and the data about it. Note that if additional image encoding/transcoding in block 212 is used, the codec parameters are set in the codec parameters block 215, which receives control signals from the criteria blocks 209, 210 and 211 and information about the parameters of the three components (video, speech and audio) of the input multimedia stream from the DeMUX block 201.

To extract keywords and expressions from the speech substream, the keyword extraction block 203 receives from the keyword database 207 information about the keywords and expressions to be detected; in addition, when a speech fragment of duration 2Δt' surrounding the keyword (the "speech context") must be extracted, it receives a signal from the keyword criteria block 210. It is this speech context that is fed to the speech encoder (or transcoder) 213, whose parameters are set by a signal from the parameters block 215. If the decision maker does not need the context, the list number of the detected keyword and its time stamp go to block 217, which forms the data packet about the detected keyword, bypassing encoder 213.

The key acoustic event extraction block 204 works in much the same way; it receives from the acoustic event database 208 information about the specific events to be extracted from the audio substream. When an event is extracted, block 218 forms a data packet from the data about the event and, if required by the signal from the acoustic event criteria block 211, its context in the form of an acoustic signal of duration Δt''; the acoustic signal is compressed in the acoustic encoder (or transcoder) 214, whose parameters are set by a signal from the parameters block 215. The packet need not include the "acoustic context" if the decision maker considers it unnecessary.

Thus, three independent thinned streams of packets (key frames, keywords/expressions and key acoustic events) are formed at the output of the device, as shown schematically in FIG. 1b, where the packets are drawn as vertical bars (solid for key frames, dotted for keywords/expressions, dash-dotted for key acoustic events). These streams can be sent to the data storage system after the packets have been compacted in the compaction blocks 220 (images, i.e. key frames), 221 (keywords/expressions) and 222 (key acoustic events). If the packets need to be multiplexed, all three packet streams go to the MUX block 219, which also receives a signal from the synchronization and analysis block 205 so that the processed multimedia stream is correctly formatted as a single packet stream, as shown schematically in the bottom row of FIG. 1c; in this stream the packets follow in the order of their time stamps, and the stream can be sent to the data storage system. Block 219 can obviously also group packets by class (for example, all key frame packets first, then the keyword/expression packets, then the key acoustic event packets), with all these classes likewise multiplexed into a common stream of compressed multimedia information.
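The two orderings described for the multiplexer, by time stamp and by class, can be sketched as follows. Packets are modelled here as `(timestamp, kind, number)` tuples, which is an assumption made for the example; the labels "KF", "KW" and "AE" are hypothetical tags for key frames, keywords/expressions and key acoustic events.

```python
import heapq


def mux_by_time(*streams):
    """Merge already time-sorted packet streams into one time-ordered stream."""
    return list(heapq.merge(*streams, key=lambda packet: packet[0]))


def mux_by_class(*streams):
    """Concatenate the streams class by class, keeping order within each class."""
    return [packet for stream in streams for packet in stream]


key_frames = [(0.0, "KF", 0), (12.0, "KF", 1)]
keywords = [(5.0, "KW", 3)]
events = [(12.0, "AE", 7)]

timeline = mux_by_time(key_frames, keywords, events)
grouped = mux_by_class(key_frames, keywords, events)
```

`heapq.merge` reads each input lazily, so this form also suits long streams; packets with equal time stamps come out in the order in which their streams were passed.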

Consider the structure and main functions of the principal blocks of the proposed device. These are, above all, the key information extraction blocks, which process the individual substreams (video, speech and audio) and extract their informative part. The output video substream from the DeMUX block 201 enters, after processing, the memory block 301, which can temporarily hold N frames of the video sequence. Frames are taken from memory in FIFO order according to their numbering and fed simultaneously to blocks 302 and 303. Block 302 compares the current frame with a sample supplied to its second input from the reference image database 206 (for example, using image correlation or a search for fragments of a given scene in the current frame). The correlation coefficient estimate, or the search, yields a measure of similarity between one of the reference images and the scene or scene fragment in the current frame. The decision device 304 decides whether the current frame contains elements shown in one of the reference images, i.e. whether to classify the current frame as a key frame, by comparing the measure with a threshold supplied to block 304 from block 209. On a positive decision the frame passes from block 301 directly to the key frame collection block 305 through switch 307, which receives an enable signal from block 304, and then goes, together with its time stamp, number and accompanying information, to the output of the key frame extraction block. Note that the first frame is treated as a key frame by default: it passes from memory block 301 to block 303, which selects its informational part (for example, by segmenting objects of interest to the decision maker that show some dynamics in the scene), and at the same time to block 305.
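The correlation-based comparison with reference images and the threshold decision can be illustrated with a small sketch. It assumes that frames and reference images are flattened grayscale pixel lists of equal size and uses the Pearson correlation coefficient as the similarity measure; the patent names image correlation only as one possible method, and the threshold value below is an arbitrary illustration.

```python
from math import sqrt


def correlation(a, b):
    """Pearson correlation of two equally sized flattened grayscale images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and of equal size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat image carries no structure to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def is_key_frame(frame, references, threshold):
    """Decide whether the frame resembles any reference image closely enough."""
    return any(correlation(frame, ref) >= threshold for ref in references)


reference = [0, 0, 255, 255]         # tiny 2x2 "reference image"
similar_frame = [10, 12, 240, 250]   # same pattern with slight noise
other_frame = [255, 0, 255, 0]       # different pattern

match = is_key_frame(similar_frame, [reference], threshold=0.9)
no_match = is_key_frame(other_frame, [reference], threshold=0.9)
```

Searching for a reference fragment inside a larger frame would require sliding the reference over the frame, which this sketch leaves out.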

When key frames must be selected whose informational part differs from that of the previous key frame, in which the informational part has already been marked, the current frame goes to block 303, which selects the informational part of the current frame, and then to the comparison block 306, whose second input receives the threshold Delta (or two thresholds, Delta1 and Delta2). Block 306 compares the informational parts of the current frame and the preceding key frame. If the comparison is direct (by the number of pixels that differ between the informational parts), it yields the level of change L in the current frame relative to the preceding key frame. The threshold decision is then made directly by comparing L with Delta: if L > Delta, the current frame is treated as a key frame and passes to block 305 through switch 308. If block 306 uses an "intelligent" comparison instead, the number of key frames for the video substream is expected to be smaller, but this requires an appropriate method to be developed.
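The direct comparison rule (count the differing pixels, compare the count L with Delta, and always treat the first frame as a key frame) can be sketched as follows. The informational parts are assumed to be already extracted and given as equally sized pixel lists; the `tolerance` parameter is an addition for the example and is not mentioned in the source.

```python
def changed_pixels(region_a, region_b, tolerance=0):
    """Level of change L: number of pixels differing by more than tolerance."""
    return sum(1 for p, q in zip(region_a, region_b) if abs(p - q) > tolerance)


def select_key_frames(regions, delta):
    """Return indices of key frames; the first frame is a key frame by default."""
    if not regions:
        return []
    keys = [0]
    last_key = regions[0]
    for index, region in enumerate(regions[1:], start=1):
        # Compare with the preceding key frame, not with the preceding frame.
        if changed_pixels(region, last_key) > delta:
            keys.append(index)
            last_key = region
    return keys


informational_parts = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],   # 1 changed pixel: not a key frame for delta = 1
    [9, 9, 0, 0],   # 2 changed pixels relative to frame 0: key frame
    [9, 9, 0, 1],   # 1 changed pixel relative to frame 2: not a key frame
    [0, 0, 0, 0],   # 2 changed pixels relative to frame 2: key frame
]
key_indices = select_key_frames(informational_parts, delta=1)
```

Because each frame is compared with the last key frame, slow drift accumulates until it crosses Delta instead of being missed frame by frame.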

All blocks operate synchronously, driven by the clock pulses generated by the local sync generator 309.

Keywords and expressions are selected from the speech substream in the keyword extraction block, whose structure is shown in FIG. 4. Its input receives the speech signal as a sequence of speech frames 20…80 ms long (speech samples or speech parameters, depending on how the signal is represented or encoded) and service information about the encoding of the speech substream. A finite number of speech frames is written to the memory and service information analysis block 401. From the output of this block the speech frames pass through switch 402 to the decoder block 403 if they were previously encoded, and then to the keyword detection (selection) block 404; if the speech frames are represented by samples, i.e. are not encoded, they go straight to block 404. Switch 402 is driven by an external control signal, which governs this routing.

If a keyword is detected, block 404 sends, when required, a control signal to the speech context extraction (selection) block 405, which "cuts out" of the speech signal the section of duration 2Δt' that contains the detected keyword. The number of the detected keyword passes from the output of block 404 to block 406, whose second input receives the extracted "speech context". Block 406 thus gathers the current information about the detected keyword and its speech context into one packet at its output. Note that if, instead of detecting keywords in the speech substream, block 404 can be replaced by a speech-to-text (STT) block, such an implementation of block 404 will give the decision maker a more effective solution in many situations.
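Cutting out the 2Δt' context and gathering it into one packet with the keyword number might look like the sketch below. The dictionary packet layout, the function names and the clipping of the window at the signal boundaries are assumptions made for illustration; the patent does not specify them.

```python
def cut_context(samples, sample_rate, keyword_time, half_width):
    """Return the samples within +/- half_width seconds of keyword_time."""
    start = max(0, int(round((keyword_time - half_width) * sample_rate)))
    end = min(len(samples), int(round((keyword_time + half_width) * sample_rate)))
    return samples[start:end]


def build_packet(samples, sample_rate, keyword_number, keyword_time, half_width=None):
    """Assemble the keyword packet; the context is optional."""
    packet = {"keyword_number": keyword_number, "timestamp": keyword_time}
    if half_width is not None:
        packet["context"] = cut_context(samples, sample_rate, keyword_time, half_width)
    return packet


# Toy signal: 10 s at 10 samples per second, sample value equals its index.
speech = list(range(100))
with_context = build_packet(speech, 10, keyword_number=3, keyword_time=5.0, half_width=2.0)
near_start = build_packet(speech, 10, keyword_number=1, keyword_time=1.0, half_width=2.0)
number_only = build_packet(speech, 10, keyword_number=3, keyword_time=5.0)
```

The same pattern would apply to the audio context of duration 2Δt'' around a key acoustic event.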

The key acoustic event extraction block, shown in FIG. 5, is built in the same way. Its input receives the audio signal as a sequence of audio frames and service information about the encoding of the audio substream. A finite number N of audio frames is written to the memory and service information analysis block 501. From the output of this block the audio frames pass through switch 502 to the decoder block 503 if they were previously encoded, and then to the key acoustic event detection (selection) block 504; if the audio frames are represented by samples, i.e. are not encoded, they go straight to block 504. Switch 502, which receives a control signal, governs this routing.

If a key acoustic event is detected, block 504 sends, when required, a control signal to the audio context selection block 505, which "cuts out" of the audio signal the section of duration 2Δt'' that contains the detected event. The number of the detected event passes from the output of block 504 to block 506, whose second input receives the extracted "audio context". Block 506 thus gathers the current information about the detected event and its audio context at its output and forms the corresponding packet.

Thus, the device for highly efficient compression of large-volume multimedia information substantially compacts the multimedia stream: it extracts the fragments most valuable to the decision maker and presents them in compressed form, combined with service data in packets, before they are placed in the data storage system.

To retrieve compressed multimedia information from the data storage system, it must be unpacked, the service data separated out, and the individual substreams, each stored in its own compacted form as a set of packets, decoded. FIG. 6 shows the structure of the device that unpacks and decodes the information for subsequent analysis by the decision maker.

The three substreams (video, speech and audio), stored in the data storage system as interrelated but relatively autonomous data, can be read and restored independently, as shown in FIG. 6; each goes to its own packet disassembly and decoding block: 601 (video), 602 (speech) and 603 (audio). These blocks separate the service data from the information proper in the extracted fragments and restore that information to its original form (images, data about keywords/expressions and key acoustic events and, possibly, the speech and audio contexts accompanying them). The information can be viewed or listened to in the image/video playback block 604, the speech playback block 605 and the audio playback block 606, either on the original time scale or on a shortened time scale that omits the intervals in which the source multimedia information contains only "information noise" for the decision maker.
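Playback on the shortened time scale amounts to placing the restored fragments back to back. A minimal sketch, assuming each fragment is given as an `(original_start, duration)` pair in seconds and that fragments do not overlap; the function name and the sample values are hypothetical.

```python
def shortened_schedule(fragments):
    """Map each fragment's original start time to its start in shortened playback.

    Returns a list of (original_start, playback_start) pairs.
    """
    schedule = []
    cursor = 0.0
    for original_start, duration in sorted(fragments):
        schedule.append((original_start, cursor))
        cursor += duration  # the next fragment follows immediately, no pause
    return schedule


fragments = [(80.0, 20.0), (5.0, 20.0), (200.0, 10.0)]
schedule = shortened_schedule(fragments)
```

Keeping the original start time next to the playback time lets the viewer see when each fragment actually occurred; playback on the original time scale simply uses the original start times directly.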

If necessary, the complete MI stream in compressed form, previously packed by multiplexer 219, can be analyzed after it has been demultiplexed, together with the sync data and service data, in packet sorting block 607. The sync data from block 607 are passed to local sync generator block 610, which, together with matching block 608 (which receives the decoded key frames from block 601 and the speech and audio contexts from blocks 602 and 603), ensures coordinated and synchronized operation of all blocks of the information recovery device in the selected time mode (i.e., full or shortened time).
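The routing performed by a packet sorter such as block 607 can be sketched as follows. This is only an illustration under stated assumptions: the packet layout `(kind, timestamp, payload)`, the kind labels and the `sort_packets` function are hypothetical, since the description does not define a packet format.

```python
from collections import defaultdict


def sort_packets(packets):
    """Route each (kind, timestamp, payload) packet to a per-kind queue.

    `kind` is an assumed label such as "video", "speech", "audio",
    "sync" or "service". Each queue is ordered by timestamp so that a
    downstream matching stage can align the substreams.
    """
    queues = defaultdict(list)
    for kind, timestamp, payload in packets:
        queues[kind].append((timestamp, payload))
    for queue in queues.values():
        queue.sort(key=lambda item: item[0])
    return dict(queues)
```

With such per-kind queues, the sync queue would drive the local sync generator, while the video, speech and audio queues would feed their respective decoders.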

Information sources:

1. http://www.compression.ru/arctest/descript/comp-hist.htm

2. US 9236882 B2. Data compression systems and methods. US patent, priority date 1.06.2015, inventor James J. Fallon.

3. MPEG2 and MPEG4: description of the formats (http://www.truehd.ru/03.htm)

4. RU 2464651. Method and device for multilevel scalable information-loss-resistant speech coding for packet-switched networks, priority date 22.12.2009, inventor V.A. Sviridenko.

5. Audio codec. Wikipedia: https://ru.wikipedia.org/wiki/%D0%90%D1%83%D0%B4%D0%B8%D0%BE%D0%BA%D0%BE%D0%B4%D0%B5%D0%BA

6. Deduplication. Wikipedia: https://ru.wikipedia.org/wiki/%D0%94%D0%B5%D0%B4%D1%83%D0%BF%D0%BB%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F

7. Fractal image compression. Wikipedia: https://ru.wikipedia.org/wiki/%D0%A4%D1%80%D0%B0%D0%BA%D1%82%D0%B0%D0%BB%D1%8C%D0%BD%D0%BE%D0%B5%D1%81%D0%B6%D0%B0%D1%82%D0%B8%D0%B5

8. Usefulness of information: http://studall.org/all-72812.html

9. Structure of the H.264 codec. Reference frames: http://www.videomax-server.ru/articles/opornyj-kadr-v-h-264-malenkij-parametr-s.html

10. Pilipenko V.V. Recognition of keywords in a speech stream using a phonetic stenographer, 2013 (see http://www.km.ru/referats/334471-raspoznavanie-klyuchevykh-slov-v-potoke-rechi-pri-pomoshchi-foneticheskogo-stenografa).

11. http://www.dialog-21.ru/digests/dialog2006/materials/html/Kiselov.htm

12. Search for a similar image: https://yandex.ru/support/images/similar.xml

13. Search by image: https://support.google.com/websearch/answer/1325808?hl=ru

14. Digital image correlation: https://en.wikipedia.org/wiki/Digital_image_correlation

15. Viola-Jones algorithm: https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework

16. Operations on sets: http://umk.portal.kemsu.ru/uch-mathematics/papers/posobie/r2-2.htm

17. Transcoding: https://en.wikipedia.org/wiki/Transcoding

Abbreviations in the text of the application:

MI - multimedia information

KF - key frames

CWS - keywords and expressions

CAS - key acoustic events

RMSE - root-mean-square error

SHD - data storage system

DB - database

TV - television

SPI - information transmission system

SI - significance of changes

VP - video stream

RP - speech stream

AP - audio stream

MP - multimedia stream

PP - similarity threshold

PSI - threshold of significance of changes

SD - storage device

SNR - signal-to-noise ratio

PSNR - peak signal-to-noise ratio

Claims

1. A method of compressing large-volume multimedia information (MI) in digital form for transmission over communication channels or storage in data storage systems (SHD), in which video, speech and audio streams are encoded, taking into account their specifics, by video, speech and audio codecs respectively, are compressed into a common multimedia stream transmitted over telecommunication channels or placed as separate files or as a common file in storage devices, and, at the output of the channel or upon retrieval from the storage device, are restored in a form acceptable to the consumer or the decision maker (DM), either separately for each stream or, after decompression of the common stream and decoding of the compressed video, speech and audio information, combined into a common restored multimedia stream, characterized in that, to increase the compression efficiency of the multimedia stream and/or its components, the common stream and its individual components are divided into an informationally significant part and an informationally insignificant part (information noise) in accordance with information value criteria specified by the DM, the informationally insignificant part being excluded or significantly reduced in volume.

2. The method according to claim 1, in which the DM sets value criteria for fragments of multimedia information for selecting key frames, keywords and expressions (CWS) and key acoustic events (CAS) in its video, speech and audio components, respectively.

3. The method according to claim 1, in which samples of scene parts are specified that are identified in individual frames of the video component of the MI by their level of similarity, based on computing image correlation or on search procedures, and the similarity levels and thresholds are set according to the requirements of the DM.

4. The method according to claim 1, in which lists are specified of the keywords and expressions, as well as the acoustic events, that must be identified in the speech and audio components of the MI.

5. The method according to claim 1, in which the informationally significant part of the video, speech and audio components of the multimedia information is selected based on the value criterion specified by the DM, and the selected informationally significant part is additionally encoded or transcoded in order to compress it with a controlled error in accordance with the requirements of the DM for the quality of the restored information that meets the value criteria.

6. The method according to claim 1, in which, in each frame of the video component of the multimedia stream, the informational part of the current frame is extracted during processing and compared with the corresponding informational part of the reference key frame to obtain an estimate of the degree of difference between the compared frames, and the current frame is designated a key frame if this estimate exceeds a specified threshold.

7. The method according to claim 1, in which keywords and expressions from a specified list are extracted from the speech component of the multimedia stream based on their recognition in a stream of continuous speech.

8. The method according to claim 1, in which acoustic events from a specified list are extracted from the acoustic component of the multimedia stream based on analysis of their energy and spectral characteristics.

9. The method according to claim 1, in which the selected frames in the video stream, the keywords and expressions in the speech stream and the acoustic events in the audio stream are compressed together with service information describing the selected informational fragments of the MI, and the selected fragments are also compressed in the common multimedia stream.

10. The method according to claim 1, in which individual streams are retrieved from the SHD and decompressed (a set of individually selected frames; keywords and expressions together with their speech context (compressed speech signal); acoustic events together with their acoustic context (compressed audio signal)) and these streams are decoded for subsequent analysis by the DM, and the common stream is likewise decompressed and its individual components are decoded in a coordinated manner.

11. A device implementing the method of compressing multimedia information according to claims 1-9 and comprising a demultiplexer, whose input receives the multimedia stream together with service information describing the parameters of this stream and whose three main outputs are fed respectively to a frame extractor for the video stream, a keyword and expression extractor for the speech stream and an acoustic event extractor for the audio stream, the outputs of which are connected to series-connected encoders, formatters and compressors of images, speech signals and acoustic signals; the three service outputs of the demultiplexer, which carry information on the parameters of the three components of the multimedia stream, are fed to a block of video, speech and audio stream parameters, and the fourth service output synchronizes the operation of a local clock generator, whose outputs are fed to the service inputs of the extractors, encoders and formatters to coordinate their operation in time; the outputs of the parameters block are connected respectively to the parametric inputs of the encoders and formatters, and the outputs of the three formatters are connected to the three main inputs of a multiplexer, whose output forms the compressed common MI stream and whose service synchronization input is connected to the fourth output of the local clock generator; to extract key frames, keywords and expressions and key acoustic events, data on the scene samples, keywords and expressions and acoustic events to be extracted are fed from the corresponding databases to the second special input of each extractor, and information on the value criteria for selecting key elements in these extractors is supplied to the second special inputs of the extractors from the blocks of value criteria for video/images, voice messages and acoustic signals, the second outputs of these blocks being connected to the inputs of the corresponding databases.

12. A device implementing the detection of key frames in the video stream according to claims 1 and 6, comprising a storage device for a finite set of frames of the video stream, whose output is connected simultaneously to three branches: series-connected blocks for comparison with image samples and a decision unit in the first branch, a first switch in the second branch, and, in the third branch, a series-connected block for extracting the informational part of each frame, a second switch and a packet forming unit; the second outputs of the block for extracting the informational part of the frame and of the packet forming unit are connected to a frame comparison unit, whose output is connected to the control inputs of the first and second switches, and synchronization of the device is provided by a local clock generator.

13. A device implementing the detection of keywords and expressions in the speech stream according to claims 1 and 7, comprising a series-connected speech frame storage device, a switch, a speech context extractor and a speech packetizer, the switch being connected to a speech decoder whose output is connected to the input of a CWS detector and to the input of the context extractor; the first output of the CWS detector is connected to the second input of the speech packetizer, the second output of the CWS detector is connected to the second input of the context extractor, and the second output of the storage device is connected to the control input of the switch.

14. A device implementing the detection of key acoustic events in the audio stream according to claims 1 and 8, comprising a series-connected audio frame storage device, a switch, an audio context extractor and an audio packetizer, the switch being connected to an audio signal decoder whose output is connected to the input of a CAS detector and to the input of the context extractor; the first output of the CAS detector is connected to the second input of the audio packetizer, the second output of the CAS detector is connected to the second input of the context extractor, and the second output of the storage device is connected to the control input of the switch.

15. A device implementing the method of decompression and restoration of the selected key fragments of multimedia information according to claims 1 and 10 and comprising series-connected blocks for disassembling packets and decompressing the individual components of the MI, and devices for reproducing images as well as information about keywords and expressions and acoustic events, including the associated speech and audio contexts; to restore the key images and the speech and audio information multiplexed into the common stream of selected MI, this common stream is fed to a series-connected packet sorter, which handles packets carrying information about images, keywords and expressions, key acoustic events and sync information, a matching unit and a joint MI playback unit, all of these blocks being synchronized by a common local clock generator, the input of which is connected to the second input of the packet sorter.
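The key-frame test recited in claim 6 can be illustrated with a minimal sketch. It rests on assumptions that the claims do not fix: the whole frame is treated as its informational part, the degree of difference is measured by the mean squared error, and the first frame serves as the initial reference key frame. The names `mse` and `select_key_frames` are hypothetical.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_key_frames(frames, threshold):
    """Return the indices of frames designated as key frames.

    A frame becomes a key frame when its difference from the current
    reference key frame exceeds `threshold`; it then becomes the new
    reference. Taking frame 0 as the first reference is an assumption.
    """
    if not frames:
        return []
    key_indices = [0]
    reference = frames[0]
    for index, frame in enumerate(frames[1:], start=1):
        if mse(frame, reference) > threshold:
            key_indices.append(index)
            reference = frame
    return key_indices
```

Frames whose difference stays at or below the threshold are treated as carrying no significant change and are not stored as key frames.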