RU2251148C1 - Method for stream processing of text messages

Info

Publication number: RU2251148C1
Application number: RU2003126918/09A
Authority: RU
Inventors: А.В. Аграновский (RU); Р.Э. Арутюнян (RU); Р.А. Хади (RU); Б.А. Телеснин (RU)
Original assignee: Государственное научное учреждение научно-исследовательский институт "СПЕЦВУЗАВТОМАТИКА"
Priority date: 2003-09-04
Filing date: 2003-09-04
Publication date: 2005-04-27
Also published as: RU2003126918A

Abstract

FIELD: computer science.

SUBSTANCE: the method includes receiving text messages from a data channel, performing linguistic processing of the words, forming a thesaurus of each text message, performing statistical processing of the words in the thesaurus, and storing the text message and thesaurus in a storage. The text message is assigned to one of the categories from a list; the initial informativeness of the text message is determined and stored together with the message; the informativeness values are periodically updated to account for the time elapsed since the messages appeared, and text messages whose informativeness is below a preset threshold are erased; the classification features of the categories are updated as each message is processed.

EFFECT: higher efficiency.

1 dwg

Description

The invention relates to systems for classifying text messages and can be used in information processing systems, databases, and electronic repositories that have a constant source of text information.

A known method of classifying messages [1] converts the message text from a special storage format into natural-language text, converts the words of the document into base word forms, and calculates the weights of the words in the document according to their frequencies of occurrence. At the training stage, a set of classification features is formed from a presented set of manually classified documents. When a document needs to be classified, it is converted from the special storage format into natural-language text, its words are converted into base word forms, the word weights in the document are calculated, and the category to which the document belongs is determined on the basis of the SVM (Support Vector Machines) classification criterion and the classification features.

However, this method has significant limitations: it is intended only for classifying messages in static mode and contains no means for stream processing, such as sequential training of the classifier or evaluation of the informativeness of a text message and of the duration of its storage in the repository.

Also known is a method of classifying messages [2] in which a sequence of text messages is processed continuously and the classification features are updated at each step by the Widrow-Hoff method; this makes it possible to classify messages and train the classifier at the same time. There is no need to present the entire training set of documents at once.

However, this method is also limited: it describes only the procedure for classifying messages and training the classifier, not the complete cycle of processing a message in streaming mode (for example, the initial processing of a document and its storage), and it contains no mechanisms for evaluating the informativeness of a text message or the duration of its storage in the repository.

The closest in technical essence to the proposed method is a method for identifying objects by their descriptions [3], which is suitable for solving the stated problem and is taken as the prototype. In it, text messages in natural languages are received from an information channel, the words of each message undergo linguistic processing, a thesaurus of each text message is formed, the words in the message thesaurus are processed statistically, and the text message and thesaurus are saved in a repository.

This method makes it possible to compare a given text message with the sets of messages received during particular time intervals, and thereby to determine its thematic proximity to those intervals, treated as categories. The disadvantages of the prototype are that categories cannot be defined other than by time, and that it lacks mechanisms for evaluating the informativeness of a text message and for removing the message from the repository once it is no longer informative.

The technical result obtained from implementing the invention is the elimination of the prototype's disadvantages, that is, the ability to define categories arbitrarily. In addition, the informativeness of each text message is determined, and it affects how long the message is kept in the repository.

This technical result is achieved as follows: text messages in natural languages are received from an information channel, the words of each message undergo linguistic processing, a thesaurus of each text message is formed, the words in the message thesaurus are processed statistically, and the text message and thesaurus are saved in a repository. The text message is automatically assigned to one category from a predetermined list of categories. In addition, the initial informativeness of the text message is determined and saved in the repository together with the message; the informativeness values of the text messages stored in the database are periodically updated to take into account the time elapsed since their appearance, and those text messages whose informativeness has dropped below a predetermined threshold are deleted. A further feature of the method is that the values of the classification features of the categories are updated as each text message is processed.

According to this method, the informativeness of a text message is determined by two factors:

- the content of the message (the word forms it contains and the deviation of its vocabulary from that of previously encountered messages);

- the time elapsed since the message was entered into the database.

The values of the classification features are determined by the distribution of the weights of the word forms that occur most frequently in messages of a given category. For each text message and each category, a membership function is defined that measures the degree to which the message belongs to the category. The message is assigned the category for which the value of this function is greatest. If the membership function of the message is small even for this category, the message has an uncharacteristic vocabulary, and under this method it is assigned a higher informativeness value than messages for which the function takes large values.

Immediately after a message enters the database it has, under this method, its maximum informativeness, since it has most likely not yet been read and evaluated by the operators of the message processing system. As time passes, the informativeness of the message decreases.

In accordance with the patented method, each text message s that has entered the database is assigned, at each moment of time t, an informativeness value according to the following formula, which is consistent with the reasoning above:

$$I(s) = 1 - (x(s), c_k) - \alpha\,(t - t_0),$$

where x(s) is the vector of weights of the word forms in the thesaurus of message s, c_k is the vector of classification features of the category k to which message s belongs, t₀ is the moment at which message s entered the database, and α is the informativeness loss coefficient. The values of t and t₀ may be expressed in any units of time, for example seconds; the choice of units affects only the value of the coefficient α.

The coefficient α determines how much the informativeness of a message decreases per unit of time and is chosen according to the requirements of the specific application of the method.

Messages whose informativeness falls below the informativeness threshold ε are removed from the database as uninformative.

Consequently, messages that received the highest informativeness values at the outset remain in the database longer, while those that were of little informativeness from the start are removed sooner.
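The formula and the threshold rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `informativeness` and `lifetime`, the argument `similarity` (standing for the scalar product (x(s), c_k)), and the numeric values of α and ε are all hypothetical.

```python
def informativeness(similarity: float, t: float, t0: float, alpha: float) -> float:
    """I(s) = 1 - (x(s), c_k) - alpha * (t - t0)."""
    return 1.0 - similarity - alpha * (t - t0)


def lifetime(similarity: float, alpha: float, epsilon: float) -> float:
    """Time after arrival at which I(s) reaches the threshold epsilon.

    Obtained by solving 1 - similarity - alpha * dt = epsilon for dt.
    Returns 0.0 if the message starts at or below the threshold.
    """
    return max(0.0, (1.0 - similarity - epsilon) / alpha)


# Hypothetical parameters: alpha per second, epsilon as an absolute threshold.
ALPHA = 1e-5
EPSILON = 0.1

# A message with unusual vocabulary (low similarity to its category) ...
atypical = lifetime(similarity=0.2, alpha=ALPHA, epsilon=EPSILON)
# ... is kept longer than one whose vocabulary is typical for the category.
typical = lifetime(similarity=0.8, alpha=ALPHA, epsilon=EPSILON)
```

Because the decay is linear, the storage time of a message is proportional to the margin between its initial informativeness and ε.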

The method can be implemented using a computer or a computing device, shown as a block diagram in the drawing.

The device for implementing the method consists of an information channel 1, a block 2 for forming text message thesauruses, a control block 3, a training data block 4, a classification block 5, a block 6 for determining the initial informativeness of text messages, a block 7 for saving text messages, a text message storage 8, a block 9 for updating classification features, a classification feature storage 10, a block 11 for generating timing signals, a block 12 for recalculating the informativeness of text messages, and a block 13 for deleting text messages.

According to the method, the device operates as follows. When a text message appears in information channel 1, it is passed to block 2.

In block 2 the text message first undergoes preprocessing, in which the base word form of each of its words is determined. Any of the methods [4-7] may be used for this; the thesaurus of the text message is then formed. The Porter algorithm [4], which applies special rules for stripping and replacing word endings, is the one most often used to find base word forms.

The message thesaurus consists of all the word forms contained in the message. Each word form is assigned its normalized weight in the message text, determined by the TF-IDF (Term Frequency - Inverse Document Frequency) formula:

$$\mathrm{tfidf}(w) = \mathrm{tf}(w)\,\mathrm{idf}(w),$$

where tf(w) is the frequency of occurrence of the word form w in the message, that is,

$$\mathrm{tf}(w) = \frac{c(w)}{N}.$$

Here c(w) is the number of times the word form w occurs in the message and N is the total number of words in the message. The inverse document frequency idf(w) is calculated by the formula

$$\mathrm{idf}(w) = \log M - \log d(w),$$

where d(w) is the number of documents known to the system in which the word form w occurs and M is the total number of documents known to the system. The vector of weights is normalized using the Euclidean norm:

$$x_j(s) = \frac{\mathrm{tfidf}(w_j)}{\sqrt{\sum_i \mathrm{tfidf}(w_i)^2}},$$

where w_j is the j-th word form of message s and the sum runs over all word forms of the message.
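The weighting and normalization just described can be sketched in a few lines. This is an illustrative sketch only: the function name `build_thesaurus`, the dictionary representation of the thesaurus, and the sample document counts are assumptions, and every word form is assumed to have d(w) ≥ 1 (for instance because the current message is already counted among the known documents).

```python
import math
from collections import Counter


def build_thesaurus(word_forms, doc_freq, num_docs):
    """Return {word form: normalized TF-IDF weight} for one message.

    word_forms: list of base word forms of the message (after stemming).
    doc_freq:   mapping word form -> d(w), documents containing the word form.
    num_docs:   M, total number of documents known to the system.
    """
    counts = Counter(word_forms)
    n = len(word_forms)
    weights = {}
    for w, c in counts.items():
        tf = c / n                                         # tf(w) = c(w) / N
        idf = math.log(num_docs) - math.log(doc_freq[w])   # idf(w) = log M - log d(w)
        weights[w] = tf * idf
    norm = math.sqrt(sum(v * v for v in weights.values()))  # Euclidean norm
    if norm == 0.0:
        return weights  # all weights are zero; nothing to normalize
    return {w: v / norm for w, v in weights.items()}


# Hypothetical message and document statistics.
thesaurus = build_thesaurus(
    ["market", "price", "market", "rise"],
    doc_freq={"market": 10, "price": 20, "rise": 50},
    num_docs=100,
)
```

A word form that occurs in every known document gets idf(w) = 0 and therefore contributes nothing to the vector, which is why the zero-norm case is handled explicitly.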

The thesaurus of the text message is then passed to control block 3, which is supervised by the device operator. Control block 3 lets the device work in two modes: initial training mode and normal mode. Initial training mode is needed to build the initial classification features; in this mode the control block obtains the category of the text message from block 4. In normal mode the control block calls block 5 to determine the category to which the text message belongs.

When a text message is classified, the scalar product of the normalized vector of message weights with the weight vector (classification features) of every category is computed. Because these vectors are normalized, their scalar product equals the cosine of the angle between them in the corresponding multidimensional space. The text message is assigned the category for which this scalar product is greatest:

$$k = \arg\max_{k'}\,(x(s), c_{k'}).$$
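The classification rule can be illustrated with sparse vectors stored as dictionaries. The sketch below is a hedged example rather than the device's actual logic: the names `dot` and `classify`, the two sample categories, and their weights are hypothetical, and at least one category is assumed to exist.

```python
def dot(x, c):
    """Scalar product over the word forms shared by two sparse vectors."""
    if len(x) > len(c):
        x, c = c, x  # iterate over the shorter vector
    return sum(v * c[w] for w, v in x.items() if w in c)


def classify(x, categories):
    """Return (best category, its scalar product) for message vector x."""
    return max(((k, dot(x, c)) for k, c in categories.items()),
               key=lambda pair: pair[1])


# Hypothetical normalized category vectors and message vector.
categories = {
    "finance": {"market": 0.8, "price": 0.6},
    "sport": {"match": 0.6, "goal": 0.8},
}
message = {"market": 0.6, "price": 0.8}

category, score = classify(message, categories)
```

Here the message shares no word forms with the "sport" category, so its scalar product with that category is zero and "finance" is selected.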

Information about this category is passed from block 5 to control block 3 and then to block 6 for determining the initial informativeness of text messages.

The initial informativeness of a text message is determined by the formula:

$$I(s) = 1 - (x(s), c_k),$$

where s is the current text message, x(s) is the vector of weights of the word forms in the thesaurus of message s, and c_k is the vector of classification features of the category k to which message s belongs. The scalar product is computed over the word forms in the intersection of the set of word forms of the text message and the set of classification features of category k; the components of the vectors x(s) and c_k correspond to those word forms.
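To make the role of the intersection concrete, the following sketch computes the initial informativeness from two sparse vectors. The function name `initial_informativeness` and the sample weights are assumptions made for illustration.

```python
def initial_informativeness(x, c_k):
    """I(s) = 1 - (x(s), c_k), summing only over shared word forms."""
    shared = x.keys() & c_k.keys()          # intersection of the two vocabularies
    similarity = sum(x[w] * c_k[w] for w in shared)
    return 1.0 - similarity


# Hypothetical normalized vectors: only "market" occurs in both.
x = {"market": 0.6, "price": 0.8}
c_k = {"market": 0.8, "rate": 0.6}

value = initial_informativeness(x, c_k)
```

Word forms that appear in only one of the two vectors contribute nothing, so a message whose vocabulary barely overlaps with its category receives a value close to 1.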

From block 6, the data on the current text message, namely the message itself, its thesaurus, and its initial informativeness, pass to block 7, which saves them in storage 8.

Control then passes to block 9. The classification features of each category are a vector of weights for all the word forms that have occurred in any text message of that category. The weight vectors are normalized according to the Euclidean norm. The classification features of each category are determined iteratively, using the Widrow-Hoff dynamic training algorithm for linear classifiers. The classification features of the categories are retrieved from storage 10 and recalculated by the formula:

$$c_{kj,\mathrm{new}} = c_{kj,\mathrm{old}} - 2\eta\,\bigl((c_{k,\mathrm{old}}, x(s)) - y_k\bigr)\,x_j(s),$$

where c_{kj,old} and c_{kj,new} are the j-th components of the old and new vectors of classification features of the k-th category, y_k is the k-th component of the vector y, which has a one in the position corresponding to the category to which text message s belongs and zeros in all other positions, and η is the learning rate coefficient set by the device operator. The updated values of the classification features are then written back to storage 10.
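One step of this update can be sketched as follows. The sketch applies only the update formula; the renormalization of the weight vectors mentioned in the text is omitted, and the function name `widrow_hoff_update`, the dictionary layout, and the value of η are assumptions.

```python
def widrow_hoff_update(categories, x, true_category, eta):
    """Update every category's weight vector in place for one message.

    categories:    {category: {word form: weight}} (classification features).
    x:             {word form: weight}, normalized thesaurus of the message.
    true_category: category of the message (target 1; all others get 0).
    eta:           learning rate coefficient.
    """
    for k, c_k in categories.items():
        target = 1.0 if k == true_category else 0.0
        # Scalar product with the old weights, computed before any change.
        prediction = sum(v * c_k.get(w, 0.0) for w, v in x.items())
        error = prediction - target
        for w, v in x.items():
            c_k[w] = c_k.get(w, 0.0) - 2.0 * eta * error * v


# Hypothetical example: two empty categories and one message.
categories = {"finance": {}, "sport": {}}
x = {"market": 0.6, "price": 0.8}
widrow_hoff_update(categories, x, true_category="finance", eta=0.25)
```

After this single step the scalar product of the message with the "finance" vector rises from 0 to 0.5, that is, it moves toward the target value 1, while the "sport" vector stays at zero because its error is zero.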

Block 11 sends signals to block 12 at a constant time interval set by the device operator. On receiving a signal, block 12 goes through all the messages in storage 8 and updates their informativeness values according to the formula:

$$I(s) = I(s) - \alpha\,\Delta t,$$

where α is the informativeness loss coefficient and Δt is the time interval between successive signals of the block for generating timing signals. The new informativeness values are written to storage 8. The coefficient α is set by the device operator.
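As a consistency check, write I_0(s) for the initial informativeness and I_n(s) for the value after n updates (notation introduced here for illustration only). Assuming the update is applied n times at a fixed interval Δt to a message stored at time t₀, and that (x(s), c_k) is the value computed when the message arrived, repeated application of the update gives

$$I_n(s) = I_0(s) - n\,\alpha\,\Delta t = 1 - (x(s), c_k) - \alpha\,(t - t_0), \qquad t - t_0 = n\,\Delta t,$$

which agrees with the time-dependent formula for I(s) given earlier. Under the same assumptions, the message falls below the threshold ε once

$$n > \frac{I_0(s) - \varepsilon}{\alpha\,\Delta t}.$$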

Block 13 for deleting text messages goes through all the messages in the text message storage 8 and deletes every message whose informativeness at the time of the check is below the informativeness threshold ε, which is also set by the device operator.
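The periodic recalculation and the deletion pass can be combined in one small routine. The following is a sketch under stated assumptions: the `StoredMessage` class, the function name `tick`, the in-memory list standing in for storage 8, and the parameter values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredMessage:
    text: str
    informativeness: float


def tick(storage, alpha, delta_t, epsilon):
    """Handle one timing signal: decay every stored message, then purge.

    Returns the list of messages that remain in the storage.
    """
    for msg in storage:
        msg.informativeness -= alpha * delta_t  # I(s) = I(s) - alpha * delta_t
    # Keep only messages that are not below the threshold epsilon.
    return [msg for msg in storage if msg.informativeness >= epsilon]


# Hypothetical storage contents and operator-chosen parameters.
storage = [StoredMessage("a", 0.9), StoredMessage("b", 0.15)]
storage = tick(storage, alpha=0.01, delta_t=10.0, epsilon=0.1)
```

In this example message "b" drops from 0.15 to 0.05, which is below ε = 0.1, so only message "a" remains.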

The values of the coefficients α, η, and ε may differ depending on the specifics of how the device is used.

Thus, the method classifies text messages into a predetermined set of categories, determines the informativeness of each text message, and deletes the message as it loses its informativeness, which achieves the technical result stated above.

Sources of information taken into account in preparing the application:

1. US Patent No. 6327581, cl. G 06 F 015/18.

2. Lewis D.D., Schapire R.E., Callan J.P., Papka R. "Training algorithms for linear text classifiers", In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 294-306, Zurich, CH, 1996.

3. RF Patent No. 2167450 C2, cl. G 06 F 17/30 - prototype.

4. Porter M.F. "An algorithm for suffix stripping". Program, Vol.14, No.3, 1980, pp.130-137.

5. RF Patent No. 2096825 C1, cl. G 06 F 17/00.

6. US Patent No. 6308149, cl. G 06 F 17/27.

7. US Patent No. 6430557, cl. G 06 F 017/30; G 06 F 017/27; G 06 F 017/21.

Claims

A method for stream processing of text messages, in which text messages in natural languages are received from an information channel, the words of each message undergo linguistic processing, a thesaurus of each text message is formed, the words in the message thesaurus are processed statistically, and the text message and thesaurus are saved in a repository, characterized in that the text message is automatically assigned to one category from a predetermined list of categories; the initial informativeness of the text message is determined and saved in the repository together with the text message; the informativeness values of the text messages stored in the database are periodically updated to take into account the time elapsed since their appearance, and those text messages whose informativeness has dropped below a predetermined threshold are deleted; and the values of the classification features of the categories are updated as each text message is processed.