RU2534368C2

RU2534368C2 - Method of compressing prefix tree data structure

Info

Publication number: RU2534368C2
Application number: RU2013101654/08A
Authority: RU
Inventors: Павел Анатольевич Марьянов; Андрей Александрович Миняев; Андрей Львович Кузьмин; Наталия Евгеньевна Лукьянченкова
Priority date: 2013-01-11
Filing date: 2013-01-11
Publication date: 2014-11-27
Also published as: RU2013101654A

Abstract

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to information processing and specifically to methods of searching for information and creating data structures to this end. The method of compressing a prefix tree data structure includes node merging after ranking and finding the most suitable merging pair. To this end, values of numerical characteristics of random quantities are calculated at each iteration for all nodes involved in compression. The best node for merging is selected for each node according to a selected criterion. Iterations continue while there is still potential for merging for at least a pair of nodes.

EFFECT: high compression density, which enables to reduce the amount of random-access memory needed to store said data structure without complicating the search algorithm and reducing the information search rate.

5 dwg

Description

Изобретение относится к области обработки информации, а именно к способам и методам поиска информации, а также создания структур данных, предназначенных для этой цели.The invention relates to the field of information processing, and in particular to methods and methods of information retrieval, as well as the creation of data structures designed for this purpose.

Характеристики систем поиска ключевых слов в потоке данных зависят как от используемого алгоритма поиска, так и от количества ключевых слов. Существенными характеристиками являются скорость поиска и объем оперативной памяти, используемой при проведении поиска. Скорость работы большинства алгоритмов поиска существенно падает при увеличении количества ключевых слов. С другой стороны, алгоритмы, скорость работы которых не зависит от количества ключевых слов, требуют наличия больших объемов памяти. Подбор соответствующего алгоритма поиска при заданных условиях является нетривиальной задачей.The characteristics of keyword search systems in a data stream depend both on the search algorithm used and on the number of keywords. The essential characteristics are the search speed and the amount of RAM used in the search. The speed of most search algorithms drops significantly with an increase in the number of keywords. On the other hand, algorithms whose speed does not depend on the number of keywords require large amounts of memory. The selection of an appropriate search algorithm under given conditions is a non-trivial task.

Алгоритм цифрового поиска на основе префиксного дерева является наилучшим по критерию наибольшей скорости поиска [Fredkin, Е. 1960. Trie memory. Communications of the ACM 3, 490-499]. Его достоинствами являются независимость от количества ключевых слов и простота реализации. Недостатком данного метода является большой объем памяти для хранения структуры данных вследствие ее сильной разреженности (количество занятых ненулевых элементов в структуре низко по отношению к их общему количеству). Разреженность структуры данных может быть устранена либо модификацией, либо уплотнением структуры данных.The digital search algorithm based on the prefix tree is the best by the criterion of the highest search speed [Fredkin, E. 1960. Trie memory. Communications of the ACM 3, 490-499]. Its advantages are independence from the number of keywords and ease of implementation. The disadvantage of this method is the large amount of memory for storing the data structure due to its strong sparseness (the number of occupied nonzero elements in the structure is low in relation to their total number). Sparseness of the data structure can be eliminated either by modifying or compacting the data structure.

Известны способы модификации структуры данных, заключающиеся в изменении структуры узла [патент США №7720854 В2 от 18.05.2010, МПК G06F 7/00 (2006.01), G06F 17/30 (2006.01); патент США №6675171 В2 от 06.01.2004, МПК G06F 17/30; патент США №6671694 В1 от 30.12.2003, МПК G06F 17/30]. Данные способы обладают двумя недостатками. Во-первых, усложнение структуры узла приводит к усложнению алгоритма поиска. Во-вторых, уход от прямой адресации элементов внутри узла приводит к увеличению времени итерации по ключу. Это приводит к уменьшению скорости работы алгоритма поиска.Known methods for modifying the data structure, consisting in changing the structure of the node [US patent No. 7720854 B2 dated 05/18/2010, IPC G06F 7/00 (2006.01), G06F 17/30 (2006.01); US patent No. 6675171 B2 dated January 6, 2004, IPC G06F 17/30; US patent No. 6671694 B1 dated 12.30.2003, IPC G06F 17/30]. These methods have two disadvantages. Firstly, the complexity of the node structure complicates the search algorithm. Secondly, avoiding direct addressing of elements inside the node leads to an increase in the iteration time for the key. This leads to a decrease in the speed of the search algorithm.

Процесс уплотнения структуры данных путем слияния узлов с использованием незанятых элементов получил малое распространение из-за требований к статичности пространства ключевых слов (после процесса уплотнения процесс вставки нового ключевого слова приводит к большим временным затратам). Для определенного круга задач данное требование является допустимым.The process of compacting the data structure by merging nodes using unoccupied elements is not widely used due to the requirements for the static space of keywords (after the process of compaction, the process of inserting a new keyword leads to a large amount of time). For a certain range of tasks, this requirement is acceptable.

В общем случае, решение задачи уплотнения сводится к поиску такой перестановки из всех возможных, при которой достигается наилучшее (максимальное) отношение количества занятых элементов к их общему количеству. Существует доказательство NP-полноты данного процесса [Dill, J. М. Optimal trie compaction is NP-complete. TR-CSD-167, Pennsylvania State Univ., March 1987].In the general case, the solution to the compaction problem is reduced to searching for such a permutation of all possible that the best (maximum) ratio of the number of occupied elements to their total number is achieved. There is evidence of NP completeness of this process [Dill, J. M. Optimal trie compaction is NP-complete. TR-CSD-167, Pennsylvania State Univ., March 1987].

Наиболее близким, взятым в качестве прототипа, является метод, система, программа и структура данных компактного массива для хранения символьных строк [патент США №6470347 В1 от 22.10.2002, МПК G06F 17/00]. Метод описывает структуру данных для хранения строк в виде префиксного дерева. Уплотнение структуры данных достигается в два этапа. На первом этапе удаляются повторяющиеся узлы, т.е. узлы с ненулевыми элементами в одних и тех же позициях. На втором этапе производится слияние узлов друг с другом. Условие слияния: ненулевые элементы узла-источника могут быть вставлены на место пустых элементов узла-приемника. Допускается смещение элементов одного узла относительно элементов другого.The closest, taken as a prototype, is the method, system, program and data structure of a compact array for storing character strings [US patent No. 6470347 B1 from 10.22.2002, IPC G06F 17/00]. The method describes the data structure for storing strings in the form of a prefix tree. Data structure compaction is achieved in two steps. At the first stage, duplicate nodes are deleted, i.e. nodes with nonzero elements in the same positions. At the second stage, nodes are merged with each other. Merge condition: non-zero elements of the source node can be inserted in place of the empty elements of the destination node. The displacement of the elements of one node relative to the elements of another is allowed.

Данный метод обладает двумя недостатками. Во-первых, удаление повторяющихся узлов недопустимо, т.к. это приводит к потере свойства ассоциативности, когда в конце итерации поиска можно однозначно сопоставить найденный ключ с определенными данными. Во-вторых, слияние узлов по принципу первого найденного варианта приводит к отклонению результирующей плотности сжатия от максимально возможного значения.This method has two drawbacks. First, deleting duplicate nodes is unacceptable, because this leads to the loss of associativity property, when at the end of the search iteration it is possible to unambiguously match the found key with certain data. Secondly, the merging of nodes according to the principle of the first found option leads to a deviation of the resulting compression density from the maximum possible value.

Указанные недостатки в целом приводят к невозможности использования прототипа при выполнении процедур поиска с количеством ключевых слов, превышающим 10⁶.These shortcomings generally lead to the inability to use the prototype when performing search procedures with the number of keywords in excess of 10 ⁶ .

Задачей заявленного изобретения является создание способа уплотнения структуры данных префиксного дерева, позволяющего добиться уменьшения объема оперативной памяти, требуемого для хранения данной структуры данных. Усложнения алгоритма поиска, а также уменьшения скорости поиска информации при этом не происходит.The objective of the claimed invention is to provide a method of compacting the data structure of the prefix tree, which allows to reduce the amount of RAM required to store this data structure. Complications of the search algorithm, as well as a decrease in the speed of information retrieval, do not occur.

Данная задача решается тем, что способ уплотнения структуры данных префиксного дерева, заключающийся в слиянии узлов дерева друг с другом по принципу первого найденного варианта, при котором ненулевые элементы узла-источника вставляют на место пустых элементов узла-приемника, согласно изобретению дополнен тем, что слияние узлов происходит только после ранжирования и нахождения наиболее подходящих пар слияния.This problem is solved in that the method of compacting the data structure of the prefix tree, which consists in merging the tree nodes with each other according to the principle of the first found option, in which nonzero elements of the source node are inserted into the place of empty elements of the destination node, according to the invention is supplemented by the fact that the merger nodes occurs only after ranking and finding the most suitable merge pairs.

Для этого на каждой итерации происходит вычисление числовых характеристик случайных величин для всех узлов, участвующих в уплотнении. Далее, для каждого узла, согласно выбранному критерию, выбирается наилучший предполагаемый узел для слияния. Если слияние не удалось, выбирается следующий узел и т.д. либо до удачного слияния, либо до конца списка. Итерации продолжаются до тех пор, пока есть потенциальная возможность слияния хотя бы одной пары узлов.To do this, at each iteration, the numerical characteristics of random variables are calculated for all nodes participating in the compaction. Further, for each node, according to the selected criterion, the best prospective node for merging is selected. If the merge failed, the next node is selected, etc. either until a successful merger, or until the end of the list. Iterations continue as long as there is the potential for merging at least one pair of nodes.

Следствием использования предлагаемого способа является увеличение времени, необходимого для уплотнения структуры данных. Учитывая, что добавление новых ключевых слов после уплотнения не предполагается, данным увеличением можно пренебречь.A consequence of the use of the proposed method is an increase in the time required to compact the data structure. Considering that adding new keywords after compaction is not supposed, this increase can be neglected.

Проведенный анализ уровня современной вычислительной техники позволил установить, что аналоги, характеризующиеся совокупностью признаков, тождественных всем признакам заявленного технического решения, отсутствуют, что указывает на соответствие заявленного способа условию патентоспособности «новизна».The analysis of the level of modern computer technology made it possible to establish that there are no analogues that are characterized by a set of features identical to all the features of the claimed technical solution, which indicates the compliance of the claimed method with the condition of patentability “novelty”.

Результаты поиска известных решений в данной и смежных областях техники с целью выявления признаков, совпадающих с отличительными от прототипа признаками заявленного объекта, показали, что они не следуют явным образом из уровня техники. Из уровня техники также не выявлена известность отличительных существенных признаков, обусловливающих тот же технический результат, который достигнут в заявляемом способе. Следовательно, заявленное изобретение соответствует условию патентоспособности «изобретательский уровень».Search results for known solutions in this and related fields of technology in order to identify features that match the distinctive features of the claimed object from the prototype showed that they do not follow explicitly from the prior art. The prior art also did not reveal the fame of the distinctive essential features that determine the same technical result that is achieved in the claimed method. Therefore, the claimed invention meets the condition of patentability "inventive step".

Заявленный способ поясняется чертежами, на которых показано:The claimed method is illustrated by drawings, which show:

фиг.1 - общая схема устройства, реализующего способ уплотнения структуры данных префиксного дерева;figure 1 is a General diagram of a device that implements a method of compacting the data structure of a prefix tree;

фиг.2 - структура полей элементов таблицы состояний и переходов структуры данных;figure 2 - the field structure of the elements of the state table and transitions of the data structure;

фиг.3 - структура полей элементов оптимизированной таблицы состояний и переходов структуры данных;figure 3 - the structure of the fields of elements of the optimized state table and transitions of the data structure;

фиг.4 - структура полей элементов вспомогательной оптимизированной таблицы переходов;figure 4 - structure of the fields of elements of the auxiliary optimized transition table;

фиг.5 - сравнение экспериментальных результатов уплотнения префиксного дерева для способа-прототипа и предлагаемого способа.5 is a comparison of the experimental results of compaction of the prefix tree for the prototype method and the proposed method.

Устройство, реализующее заявленный способ, общая схема которого представлена на фиг.1, работает следующим образом.A device that implements the claimed method, the General scheme of which is presented in figure 1, works as follows.

Из набора ключевых слов (входные данные блока 1) с помощью блока генератора разреженного префиксного дерева (блок 2) формируется разреженная структура данных. Далее, структура данных подается на вход решающего блока 2, который определяет дальнейшее действие: либо завершить процесс, либо начать очередную итерацию уплотнения. Положительное решение о начале новой итерации уплотнения принимается в двух случаях. Во-первых, после первичной генерации структуры данных блоком 1. Во-вторых, после завершения очередной итерации уплотнения с положительным исходом. Положительный исход итерации определяется тем, что нашлось как минимум два узла, для которых удалось выполнить процедуру слияния. Начало новой итерации уплотнения происходит после передачи управления решающего блока 2 в блок вычисления числовых характеристик случайных величин (блок 3). Далее, в блоке вычисления коэффициентов и формирования списка пар слияния (блок 4) производится ранжирование коэффициентов слияния и определение списка пар узлов, для которых должна быть произведена операция слияния в порядке приоритетности. Согласно данному списку в соответствующем блоке производится слияние узлов (блок 5). После завершения итерации управление снова подается на вход решающего блока 2. После принятия решения о завершении цикла итераций уплотнения структура данных передается на вход блока завершающей итерации уплотнения (блок 6). Далее, окончательно уплотненная структура подается на вход блока оптимизации структуры данных (блок 7). Оптимизированная структура данных является результатом работы блока 7.A sparse data structure is formed from a set of keywords (input data of block 1) using a sparse prefix tree generator block (block 2). Further, the data structure is fed to the input of the decision block 2, which determines the further action: either complete the process, or start the next iteration of compaction. A positive decision to start a new iteration of compaction is made in two cases. Firstly, after the initial generation of the data structure by block 1. Secondly, after the completion of the next iteration of compaction with a positive outcome. The positive outcome of the iteration is determined by the fact that at least two nodes were found for which the merge procedure was completed. The beginning of a new iteration of compaction occurs after the control transfer of the decision block 2 to the block for calculating the numerical characteristics of random variables (block 3). Further, in the unit for calculating the coefficients and forming a list of merge pairs (block 4), the merging coefficients are ranked and a list of pairs of nodes is determined for which the merge operation should be performed in order of priority. According to this list, the nodes are merged in the corresponding block (block 5). After the completion of the iteration, control is again fed to the input of the decision block 2. After the decision is made to complete the cycle of iterations of the summarization, the data structure is transferred to the input of the block of the final iteration of the summarization (block 6). Next, the finally compressed structure is fed to the input of the data structure optimization block (block 7). The optimized data structure is the result of block 7.

Каждый элемент узла разреженного префиксного дерева имеет формат, представленный на фиг.2. В элементе присутствуют 5 полей. Поле State (состояние) определяет тип элемента. При нулевом значении поля State (0) элемент считается пустым и может быть использован для размещения ненулевого элемента другого узла при слиянии. В этом случае значения остальных полей игнорируются и могут иметь произвольное значение. Значение поля State, равное единице (1), определяет, что произошло совпадение символа и необходимо произвести переход к следующему узлу, номер которого указан в поле Index. На данном этапе (в разреженном префиксном дереве) значения остальных полей игнорируются и могут иметь произвольное значение. Значение поля State, равное двум (2), определяет, что найдено ключевое слово и ассоциированное с ним уникальное числовое значение указано в поле Index. В качестве уникального значения обычно используется порядковый номер ключевого слова в первоначальном списке ключевых слов, из которых формируется разреженный массив. В этом случае значения остальных полей игнорируются и могут иметь произвольное значение. Другие значения поля State недопустимы.Each element of the sparse prefix tree node has the format shown in FIG. 2. There are 5 fields in the element. The State field defines the type of element. If the State (0) field is zero, the element is considered empty and can be used to place a nonzero element of another node during merging. In this case, the values of the remaining fields are ignored and may have an arbitrary value. The value of the State field, which is equal to unity (1), determines that a match has occurred and it is necessary to proceed to the next node, the number of which is indicated in the Index field. At this stage (in a sparse prefix tree), the values of the remaining fields are ignored and can have an arbitrary value. The value of the State field, equal to two (2), determines that a keyword has been found and the unique numeric value associated with it is indicated in the Index field. As a unique value, the serial number of the keyword is usually used in the initial list of keywords from which a sparse array is formed. In this case, the values of the remaining fields are ignored and may have an arbitrary value. Other values for the State field are not valid.

Вычисление коэффициентов слияния, ранжирования коэффициентов и определения списка приоритетности пар узлов производится только для узлов с количеством ненулевых элементов более одного. Уплотнение узлов с единичным количеством ненулевых элементов производится в блоке завершающей итерации уплотнения (блок 6).The calculation of the coefficients of merging, ranking the coefficients and determining the priority list of pairs of nodes is performed only for nodes with the number of nonzero elements more than one. Compaction of nodes with a single number of nonzero elements is performed in the block of the final iteration of compaction (block 6).

Вычисление числовых характеристик случайных величин происходит следующим образом.The calculation of the numerical characteristics of random variables is as follows.

Префиксное дерево представляется в виде набора независимых дискретных случайных величин (каждая случайная величина определяется отдельным узлом в структуре данных):The prefix tree is represented as a set of independent discrete random variables (each random variable is determined by a separate node in the data structure):

$A = {X_{i}}_{i = 1}^{N}$

A = {X_{i}}_{i = one}^{N}

Значения, принимаемые случайными величинами, определяются размером алфавита M (диапазон значений 0..M-1).The values accepted by random variables are determined by the size of the alphabet M (the range of values is 0..M-1).

Закон распределения каждой случайной величины представляется соответствующим рядом распределения, который определяется совокупностью занятых элементов в соответствующем узле:The distribution law of each random variable is represented by the corresponding distribution series, which is determined by the set of occupied elements in the corresponding node:

$p_{X_{i}} (x_{j}) = {\begin{matrix} \frac{1}{C_{i}^{*}} п р и x_{j} = E_{i, j}^{*} \\ 0 п р и x_{j} = E_{i, j}^{0} \end{matrix},$

p_{X_{i}} (x_{j}) = {\begin{matrix} \frac{one}{C_{i}^{*}} P R and x_{j} = E_{i, j}^{*} \\ 0 P R and x_{j} = E_{i, j}^{0} \end{matrix},

гдеWhere

$E_{n, m}^{*}$

- занятые элементы;

E_{n, m}^{*}

- busy items;

$E_{n, m}^{0}$

- свободные элементы;

E_{n, m}^{0}

- free items;

$C_{i}^{*}$

- количество ненулевых элементов в i-м узле.

C_{i}^{*}

- the number of nonzero elements in the i-th node.

Математическое ожидание для i-го узла:The mathematical expectation for the i-th node:

$ν_{1} = N X_{i} = M [X_{i}] = \sum_{j = 1}^{C_{i}^{*}} {(x_{j} - M X_{j})}^{2} p_{X_{i}} (x_{j})$

ν_{one} = N X_{i} = M [X_{i}] = \sum_{j = one}^{C_{i}^{*}} {(x_{j} - M X_{j})}^{2} p_{X_{i}} (x_{j})

Дисперсия для i-го узла:Dispersion for the i-th node:

$μ_{2} = D [X_{i}] = \sum_{j = 1}^{C_{i}^{*}} {(x_{j} - M X_{j})}^{2} p_{X_{i}} (x_{j})$

μ_{2} = D [X_{i}] = \sum_{j = one}^{C_{i}^{*}} {(x_{j} - M X_{j})}^{2} p_{X_{i}} (x_{j})

Числовые характеристики вычисляются для всех возможных смещений узла. Диапазон смещений определяется по крайним ненулевым (непустым) элементам узла. Диапазон определяется следующим образом. Допустим, в узле первый ненулевой элемент содержится в позиции S_high, а последний - в позиции S_low. При этом максимальное значение позиции элемента в узле равно S_max. Это означает, что можно производить смещение элементов данного узла вверх S_high раз и при этом ненулевые узлы не выйдут за границы диапазона 0..S_max. Также можно производить смещение элементов данного узла вниз S_max-S_low раз и при этом ненулевые узлы не выйдут за границы диапазона 0..S_max. Таким образом, диапазон смещений элементов данного узла задается следующими значениями: -S_high..S_max-S_low. Отрицательное значение определяет смещение вверх. Диапазон всегда имеет, по крайней мере, одно значение: 0.Numerical characteristics are calculated for all possible node offsets. The range of displacements is determined by the extreme non-zero (non-empty) elements of the node. The range is defined as follows. Suppose that in a node the first nonzero element is in position S _high , and the last in position S _low . In this case, the maximum value of the element position in the node is S _max . This means that it is possible to shift elements of a given node up S _high times and non-zero nodes will not go beyond the range 0..S _max . It is also possible to shift the elements of a given node down S _max -S _low times and non-zero nodes will not go beyond the range 0..S _max . Thus, the range of displacements of the elements of this node is set by the following values: -S _high ..S _max -S _low . A negative value indicates an upward offset. A range always has at least one value: 0.

Коэффициент слияния для i-го и j-го узлов с k-м и l-м смещением соответственно определяется следующим образом:The merge coefficient for the i-th and j-th nodes with k-th and l-th offset, respectively, is determined as follows:

$K m_{i_{k}, j_{l}} = | \frac{{(| M [X_{i_{k}}] - M [X_{j_{l}}] |)}^{2}}{9 \cdot \sqrt{D [X_{i_{k}}] \sqrt{D [X_{j_{l}}]}}} | - 1$

K m_{i_{k}, j_{l}} = | | | \frac{{(| | | M [X_{i_{k}}] - M [X_{j_{l}}] | | |)}^{2}}{9 \cdot \sqrt{D [X_{i_{k}}] \sqrt{D [X_{j_{l}}]}}} | | | - one

Ранжирование коэффициентов слияния производится методом сортировки по возрастанию. Чем ближе значение коэффициента к нулю, тем более приоритетной и вероятной является возможность слияния заданных узлов в заданных смещениях.Merging coefficients are ranked by ascending sorting method. The closer the coefficient value to zero, the more priority and probable is the possibility of merging given nodes at given offsets.

Уплотнение двух узлов происходит следующим образом. Узел-источник является перемещаемым узлом, а узел-приемник - узлом, чьи пустые элементы используются для размещения перемещаемого узла. Определяется номер виртуального узла в узле-приемнике VIndex: находится первое неиспользуемое значение поля VIndex_new из всех ненулевых элементов узла-приемника, начиная с 1. Для каждого ненулевого элемента узла-источника происходит копирование значений его полей Index, VTo, Shift в поля нулевого элемента узла-приемника по заданному смещению. При этом в поле VIndex всех формируемых элементов записывается определенное выше значение VIndex_new. Далее, происходит удаление узла-источника и корректировка ссылок на него из других элементов. Для всех элементов-переходов всех узлов, имеющих значение поля Index, равное значению индекса узла-источника, производится замена его на значение индекса узла-приемника. После этого происходит освобождение блока оперативной памяти, используемого для хранения узла-источника.The compaction of the two nodes is as follows. The source node is a roaming node, and the destination node is a node whose empty elements are used to host the roaming node. The number of the virtual node in the VIndex destination node is determined: the first unused value of the VIndex _new field is found from all nonzero elements of the destination node, starting from 1. For each nonzero element of the source node, the values of its Index, VTo, Shift fields are copied to the fields of the zero element node-receiver at a given offset. At the same time, the VIndex _new value defined above is recorded in the VIndex field of all formed elements. Next, the source node is deleted and links to it are adjusted from other elements. For all transition elements of all nodes with the value of the Index field equal to the index value of the source node, it is replaced with the index value of the destination node. After that, the memory block used to store the source node is released.

В случае очередного удачного слияния двух узлов из списка ранжированных коэффициентов удаляются все элементы, ссылающиеся хотя бы на один из узлов пары слияния. Это происходит потому, что после слияния меняется структура узла-приемника и все определенные для него характеристики теряют свою актуальность. Коэффициенты для узла-источника также теряют свою актуальность, так как узел уже не существует.In the case of the next successful merger of two nodes from the list of ranked coefficients, all elements that refer to at least one of the nodes of the merge pair are deleted. This is because after the merger, the structure of the receiver node changes and all the characteristics defined for it lose their relevance. The coefficients for the source node also lose their relevance, since the node no longer exists.

Завершающая итерация уплотнения заключается в слиянии всех узлов-источников, содержащих только один ненулевой элемент, с любым узлом-приемником, содержащим более одного ненулевого элемента. Если не осталось узлов-приемников, содержащих более одного ненулевого элемента, то производится слияние с любым возможным узлом-приемником с минимальным индексом расположения.The final iteration of compaction is to merge all source nodes containing only one non-zero element with any destination node containing more than one non-zero element. If there are no receiver nodes containing more than one nonzero element, then merge with any possible receiver node with a minimum location index.

Уплотненное дерево в общем случае состоит из преобладающего количества элементов двух типов - перехода (State = 1) и удачного завершения поиска (State = 2). Количество пустых элементов (State = 0) относительно мало. В этом случае налицо нерациональное использование памяти устройства, т.к. поля VTo и Shift используются только для элементов перехода. Процесс оптимизации позволяет устранить данную нерациональность. Для этого формируется оптимизированная таблица состояний и переходов и новая вспомогательная оптимизированная таблица переходов. Формат элемента оптимизированной таблицы состояний и переходов представлен на фиг.3. Формат элемента вспомогательной оптимизированной таблицы переходов представлен на фиг.4. Смысл оптимизации состоит в следующем. Для всех элементов типа переход (State = 1) с одновременным одинаковым значением трех полей Index, VTo и Shift формируется новый элемент во вспомогательной оптимизированной таблице переходов с теми же самыми значениями полей Index, VTo и Shift. Тогда сами переходные элементы оптимизированной таблицы состояний и переходов в поле Index будут содержать номер соответствующего элемента во вспомогательной оптимизированной таблице переходов.A compacted tree in the general case consists of a predominant number of elements of two types - a transition (State = 1) and successful completion of the search (State = 2). The number of empty elements (State = 0) is relatively small. In this case, there is an irrational use of the device’s memory, as VTo and Shift fields are used only for transition elements. The optimization process eliminates this irrationality. For this, an optimized table of states and transitions and a new auxiliary optimized transition table are formed. The element format of the optimized state and transition table is shown in FIG. 3. The format of the element of the auxiliary optimized conversion table is presented in figure 4. The meaning of optimization is as follows. For all elements of the transition type (State = 1) with the same value of the three fields Index, VTo and Shift, a new element is formed in the auxiliary optimized transition table with the same values of the Index, VTo and Shift fields. Then the transition elements of the optimized state and transition table in the Index field will contain the number of the corresponding element in the auxiliary optimized transition table.

Алгоритм поиска информации с помощью уплотненной структуры данных может быть описан следующим образом.An information retrieval algorithm using a compressed data structure can be described as follows.

Состояние поиска формируется набором переменных: Index - индекс физического узла, используемого в текущей итерации поиска; VIndex - индекс виртуального узла (внутри физического), используемого в текущей итерации поиска; Shift - текущее смещение кодов символов. Начальное состояние поиска определяется нулевыми значениями всех переменных из набора. Из входного потока, в котором производится поиск, получается очередной байт информации с кодом символа С. Если значение (С + Shift) выходит за границы значений индексов узла Index, текущая итерация поиска завершается неудачей, начальное состояние обнуляется, происходит переход к следующему символу из входного потока. Иначе, если значение (С + Shift) не выходит за границы значений индексов узла Index, производится анализ полей элемента узла Index в позиции (С + Shift). Если значение поля VIndex элемента не равно значению поля VIndex состояния поиска, текущая итерация поиска завершается неудачей, начальное состояние обнуляется, происходит переход к следующему символу из входного потока. Иначе, производится анализ значения поля State элемента узла Index в позиции (С + Shift). В случае нулевого значения поля State (State = 0) текущая итерация поиска завершается неудачей, начальное состояние обнуляется, происходит переход к следующему символу из входного потока. В случае когда поле State имеет значение 2 (State = 2), текущая итерация поиска завершается удачей, числовое значение, ассоциируемое с найденным ключевым словом, содержится в поле Index элемента, далее, если необходимо, начальное состояние обнуляется, происходит переход к следующему символу из входного потока. В случае когда поле State имеет значение 1 (State = 1), значения полей Index, VIndex, Shift, обновляются значениями соответствующих полей Index, VTo, Shift, взятых из вспомогательной таблицы переходов, по индексу, взятому из значения поля Index элемента в позиции (С + Shift), далее происходит переход к следующему символу из входного потока. Для защиты от внешних ошибок, когда поле State имеет иное значение, отличное от 0..2, текущая итерация поиска завершается неудачей, начальное состояние обнуляется, происходит переход к следующему символу из входного потока.The search state is formed by a set of variables: Index - the index of the physical node used in the current iteration of the search; VIndex - index of the virtual node (inside the physical) used in the current iteration of the search; Shift - current shift of character codes. The initial state of the search is determined by the zero values of all the variables in the set. From the input stream in which the search is performed, the next byte of information with the character code C is obtained. If the value (C + Shift) goes beyond the bounds of the index values of the Index node, the current iteration of the search fails, the initial state is reset, and the next character from the input flow. Otherwise, if the value (C + Shift) does not go beyond the bounds of the values of the indices of the Index node, the fields of the element of the Index node in the position (C + Shift) are analyzed. If the value of the VIndex field of the element is not equal to the value of the VIndex field of the search state, the current iteration of the search fails, the initial state is reset, the next character from the input stream is transferred. Otherwise, the value of the State field of the Index node element is analyzed at the position (С + Shift). If the value of the State field is zero (State = 0), the current iteration of the search fails, the initial state is reset, the transition to the next character from the input stream occurs. In the case when the State field has a value of 2 (State = 2), the current iteration of the search succeeds, the numerical value associated with the found keyword is contained in the Index field of the element, then, if necessary, the initial state is reset, the next character from input stream. In the case when the State field has a value of 1 (State = 1), the values of the Index, VIndex, Shift fields are updated by the values of the corresponding Index, VTo, Shift fields taken from the auxiliary transition table, at the index taken from the value of the Index field of the element in position ( C + Shift), then proceeds to the next character from the input stream. To protect against external errors, when the State field has a different value than 0..2, the current iteration of the search fails, the initial state is reset, the next character from the input stream is transferred.

Экспериментальная проверка характеристик способа уплотнения структуры данных префиксного дерева была выполнена на ЭВМ с использованием языка программирования С++ при следующих исходных данных:An experimental verification of the characteristics of the method of compacting the data structure of the prefix tree was performed on a computer using the C ++ programming language with the following initial data:

1. Объем набора F ключевых слов: от 100 тыс. до 10 млн;1. The volume of the set of F keywords: from 100 thousand to 10 million;

2. Размер алфавита М количеством состояний 1 байта информации (количество размещений с повторениями, при n=2, k=8):2. The size of the alphabet M by the number of states 1 byte of information (the number of placements with repetitions, with n = 2, k = 8):

$N_{p} = {\bar{A}}_{n}^{k} = {\bar{A}}_{n}^{k} = n^{k} = 2^{8} = 256$

;

N_{p} = {\bar{A}}_{n}^{k} = {\bar{A}}_{n}^{k} = n^{k} = 2^{8} = 256

;

3. Размер алфавита М не может изменяться в процессе уплотнения. Сравнение результатов эксперимента, приведенное на фиг.5, показывает, что применение предлагаемого способа дает выигрыш по объему оперативной памяти, занимаемой сжатым префиксным деревом (количество узлов в дереве после сжатия) при равных условиях на 5-7% (в зависимости от количества ключевых слов и их символьного содержания) по сравнению со способом-прототипом.3. The size of the alphabet M cannot be changed during the compaction process. A comparison of the results of the experiment, shown in figure 5, shows that the application of the proposed method gives a gain in the amount of RAM occupied by the compressed prefix tree (the number of nodes in the tree after compression) under equal conditions by 5-7% (depending on the number of keywords and their symbolic content) compared with the prototype method.

Результаты эксперимента позволяют сделать вывод о том, что заявленный способ решает поставленную задачу.The results of the experiment allow us to conclude that the claimed method solves the problem.

Промышленная применимость изобретения обусловлена тем, что устройство, реализующее предложенный способ, может быть осуществлено с помощью современной элементной базы, с достижением указанного в изобретении назначения, за счет того, что порядок слияния определяют путем вычисления значений числовых характеристик случайных величин для всех узлов, участвующих в уплотнении, и ранжирования для каждого узла-приемника узлов-источников в последовательности убывания приоритетности.Industrial applicability of the invention is due to the fact that a device that implements the proposed method can be implemented using a modern elemental base, achieving the destination specified in the invention, due to the fact that the merger order is determined by calculating the values of the numerical characteristics of random variables for all nodes involved in compaction, and ranking for each receiver node of the source nodes in the sequence of decreasing priority.

Claims

A method of compacting the data structure of a prefix tree, which consists in merging the tree nodes with each other, in which nonzero elements of the source node are inserted into the place of the empty elements of the destination node, characterized in that the merge order is determined by calculating the values of the numerical characteristics of random variables for all nodes, involved in compaction, and ranking for each destination node of the source nodes in the sequence of decreasing priority.