RU2487403C1

RU2487403C1 - Method of constructing semantic model of document

Info

Publication number: RU2487403C1
Application number: RU2011148742/08A
Authority: RU
Inventors: Денис Юрьевич Турдаков; Ярослав Ростиславович Недумов; Андрей Анатольевич Сысоев
Priority date: 2011-11-30
Filing date: 2011-11-30
Publication date: 2013-07-10
Also published as: US9201957B2; RU2011148742A; US20130138696A1

Abstract

FIELD: information technology.

SUBSTANCE: the method of constructing a semantic model of a document consists of two basic steps. In the first step, an ontology is extracted from external information resources that contain descriptions of individual objects of the subject domain. In the second step, the text of the document is linked to ontology concepts and a semantic model of the document is constructed. The information sources used are electronic resources, both with and without a hypertext link structure. In the second step, all terms of the document are first extracted and linked to ontology concepts so that each term corresponds to a single concept, which is its meaning; the meanings of the terms are then ranked by their significance for the document.

EFFECT: enables documents to be enriched with metadata that improve and speed up comprehension of the main information, and enables key terms in the text to be identified and highlighted, which speeds up reading and improves understanding.

15 cl, 6 dwg

Description

The invention relates to data processing in the semantic analysis of text data and the construction of semantic models of documents.

The amount of information that a person has to analyze grows every day. This creates a need to enrich documents with metadata that improve and speed up the perception of the main information. The problem is especially acute when analyzing text documents. The invention solves a wide class of problems in this area; some of them are listed below.

The invention makes it possible to identify and highlight the key terms of a text, which speeds up reading and improves understanding. When reading large text documents or collections of text documents, the reader only needs to glance at the keywords to grasp the main content of the text and decide whether a more detailed study is needed.

In addition, the invention allows electronic texts to be enriched with hypertext links to external electronic documents that describe the meanings of specific terms more fully. This is needed when reading domain-specific literature that contains many terms unfamiliar to the reader. For example, the sentence "Piano tuning consists in matching the notes of the chromatic scale to one another by tempering fourths and fifths on the family of stringed keyboard instruments" may be unclear to a person unfamiliar with the subject area. An additional description of the meanings of the terms makes it possible to understand the original text.

In addition, the invention helps the reader with foreign-language literature. It makes it possible to build software systems that offer more complete information about the key concepts of a foreign-language text, including descriptions in the reader's native language.

The proposed method of extracting key concepts and selecting concepts close to them in meaning can be applied in information retrieval. One of the most important problems of modern information retrieval systems (such as Yandex) is that they offer no direct way to search for documents that contain only a previously known meaning of an ambiguous query. For example, the search query "platform" is ambiguous and will return documents from different subject areas (its meanings include "political platform", "computer platform", "railway platform", and so on). To work around this, the user has to refine the query by typing additional context into the search box.

The invention solves this problem by letting the user choose the meaning, or concept, to search for. Information retrieval systems that work with the meanings of terms belong to the field of semantic search, and semantic search systems can be built on the basis of the proposed method. In such systems, documents are ranked according to the semantic similarity between the meanings of the query terms and the meanings of the terms in the documents. To do this, the meaning of each term in its given context is established automatically. The invention also makes it possible to search multilingual document collections.

In addition, the invention can serve as the basis for recommender systems that find and recommend documents whose key terms have meanings semantically similar to the key concepts of the current document. The user of such a system gets a powerful tool for exploring a document collection by navigating it through hypertext links to recommended documents.

Recommendations can also be made for similar document collections. This use case is analogous to the previous one, but the recommendations are made between document collections, or between a document and a document collection. In this case a collection is characterized by the meanings of the keywords of the documents it contains.

Another area where the invention can be applied is the creation of short descriptions of documents and document collections, also known as automatic annotation and summarization. Short descriptions of documents and document collections can be created on the basis of the proposed method, and they let the reader quickly determine what the documents are about. A short description may consist of the key concepts of a document or of sentences containing key concepts or concepts close to them. Thus, a short description may be composed of parts of the original text (or collection of texts), or it may be a standalone, complete document that briefly conveys the main meaning of the sources.

The proposed method can be applied to information extraction tasks. For example, it can serve as the basis for a system that automatically enriches knowledge bases with new concepts and the relations between them. To extend a knowledge base with new concepts, links must be established between them and the concepts already in the knowledge base. The proposed method makes it easy to establish links between a new concept and the concepts of the knowledge base by analyzing the description of the new concept. This application is described in more detail below.

The invention can also be applied in other areas of natural language analysis, such as information extraction from documents, machine translation, discourse analysis, sentiment analysis, and the creation of dialogue and question-answering systems.

Note that the proposed method applies not only to text documents and document collections but also to multimedia objects that carry textual metadata. For example, the metadata of a musical composition may contain a textual title, performer, author, and so on. Video files may likewise contain a textual title, type, and the names of the director and actors (for films). Thus, the invention can be applied to various types of electronic documents containing textual information, across a wide class of problems in natural language processing, information retrieval, and information extraction.

The ideas closest to the proposed method were put forward in work on systems that extract keywords from a text and link them to Wikipedia articles. The methods described in that work consist of two parts: first the key terms are extracted, then the extracted terms are linked to Wikipedia articles.

The best-known works in this area are the "Wikify!" project and the work of David Milne and Ian Witten. In the "Wikify!" project [Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM '07). ACM, New York, NY, USA, 233-242], the authors extract key terms, link them to a vocabulary derived from Wikipedia, and use a combination of predefined rules and a machine learning algorithm to determine the correct meaning. Because the key terms are found before the meanings of the terms are determined, only features that ignore the semantic properties of the text are used. This limits the accuracy of the algorithms.

Milne and Witten, in their work [David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceeding of the 17th ACM conference on Information and knowledge management (CIKM '08). ACM, New York, NY, USA, 509-518], improved the results by proposing more sophisticated classification algorithms for extracting key terms and determining their meanings. As in the previous work, Wikipedia was used as the training corpus for the algorithms. However, as in the Wikify! system, only features that ignore the semantic properties of the text were used to identify key terms. This limits the accuracy of the algorithms.

The patent application [Andras Csomai, Rada Mihalcea. Method, System and Apparatus for Automatic Keyword Extraction. US patent 2010/0145678 A1], whose authors are the authors of the Wikify! system, describes a method of identifying keywords. The patent uses ideas similar to those presented in [Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM '07). ACM, New York, NY, USA, 233-242]. The authors define features on the basis of which a combination of algorithms extracts the keywords of a text. The described method is proposed for building book indexes. This patent shares the drawbacks of the systems described above (that is, only features that ignore the semantic properties of the text are used to identify key terms) and addresses a narrow range of problems.

Recently, works have also begun to appear that address the similar problem of extracting keywords and linking them to an external context, but that use Web sites containing linked open data instead of Wikipedia [Gabor Melli and Martin Ester. 2010. Supervised identification and linking of concept mentions to a domain-specific ontology. In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM '10). ACM, New York, NY, USA, 1717-1720. Delia Rusu, Blaz Fortuna and Dunja Mladenic. Automatically annotating text with linked open data. In Christian Bizer, Tom Heath, Tim Berners-Lee, and Michael Hausenblas, editors, 4th Linked Data on the Web Workshop (LDOW 2011), 20th World Wide Web Conference (WWW 2011), Hyderabad, India, 2011]. These works propose methods of building domain-specific ontologies from specialized Web sites. Unlike the works that use Wikipedia, the resulting ontologies are small, so more resource-intensive algorithms can be applied to text processing. Because the ontologies used are small, these works address only the problem of determining the meanings of terms; the problem of finding key terms is not addressed.

The technical problem solved by the invention is to create a method of constructing a semantic model of a document that can be used to enrich documents with additional information semantically related to the main topic (or topics) of the document (or documents). The semantic model should be able to use ontologies built not only from information sources that contain linked open data (for example, Wikipedia) but also from any other available sources that contain textual descriptions of domain objects without links between them, such as company Web sites, e-books, and specialized documentation. In addition, the meanings of the terms of the document should be determined not only from lexical features but also from their semantic relation to the document.

The essence of the invention is a method of constructing a semantic model of a document in which: an ontology is extracted from information sources; the information sources are electronic resources containing descriptions of individual real-world objects, both those linked by hypertext links and those whose descriptions contain no hypertext links; each concept of the ontology is assigned an identifier by which it can be uniquely identified; where hypertext links between concept descriptions exist, they are converted into relations between concepts; where there is no hypertext link structure, links are added by analyzing the descriptions and determining the meanings of terms using ontologies extracted from hypertext encyclopedias, and are then converted into relations between concepts; the unique identifier of the resource holding the original description of the concept is stored; at least one textual representation is determined for each concept; the co-occurrence frequency of each textual representation with its concept and the informativeness of each textual representation are computed; the natural language to which each textual representation belongs is determined, and the information obtained is stored; the text of the document to be analyzed is received; the terms of the text and their possible meanings are found by matching parts of the text against the textual representations of concepts in the controlled vocabulary; for each term, one of its possible meanings is selected using a word sense disambiguation algorithm and is taken to be the meaning of the term; the concepts corresponding to the meanings of the terms are then ranked by importance to the text; and the most important concepts are taken to be the semantic model of the document.

The word sense disambiguation algorithm may be one that selects the most frequently used meaning: the co-occurrence frequency of the term being processed with each of the concepts associated with it is determined, and the concept with the highest co-occurrence frequency with the term is selected as the meaning of the term.
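The most-frequent-meaning rule described above can be illustrated with a minimal Python sketch. The table `cooccurrence`, its counts, and the function name `most_frequent_sense` are hypothetical and chosen only for illustration; the patent does not prescribe a data structure.

```python
from collections import Counter

# Hypothetical co-occurrence counts c(term, concept); the numbers are made up.
cooccurrence = {
    "platform": Counter({
        "Computing platform": 120,
        "Railway platform": 45,
        "Party platform": 30,
    }),
}

def most_frequent_sense(term, counts):
    """Return the concept that co-occurs most often with `term`, or None if unknown."""
    senses = counts.get(term)
    if not senses:
        return None
    return max(senses, key=senses.get)

print(most_frequent_sense("platform", cooccurrence))  # Computing platform
```

This baseline ignores the context of the term entirely, which is why the description goes on to list context-aware alternatives.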

Alternatively, the word sense disambiguation algorithm may be one that computes the most semantically related sequence of meanings: all possible sequences of concepts for the given sequence of terms are considered; for each possible sequence, its weight is computed as the sum of the weights of the unique pairwise combinations of the concepts in it; and the concepts belonging to the sequence with the greatest weight are taken to be the meanings of the terms.
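A brute-force reading of this sequence-based rule might look as follows. This is only a sketch: the pair weights in `SIM`, the helper `similarity`, and the function name `best_sense_sequence` are assumptions, and the exhaustive enumeration grows exponentially with the number of ambiguous terms, so a practical system would need pruning.

```python
from itertools import combinations, product

# Hypothetical pairwise weights between concepts (symmetric, default 0).
SIM = {
    frozenset(("Computing platform", "Kernel (operating system)")): 0.8,
    frozenset(("Computing platform", "Kernel (seed)")): 0.05,
    frozenset(("Railway platform", "Kernel (seed)")): 0.1,
}

def similarity(a, b):
    return SIM.get(frozenset((a, b)), 0.0)

def best_sense_sequence(candidates, sim):
    """candidates: one list of candidate concepts per term, in text order."""
    best_seq, best_weight = None, float("-inf")
    for seq in product(*candidates):
        # Each unordered pair of distinct concepts is counted once.
        unique = sorted(set(seq))
        weight = sum(sim(a, b) for a, b in combinations(unique, 2))
        if weight > best_weight:
            best_seq, best_weight = seq, weight
    return best_seq, best_weight

candidates = [
    ["Computing platform", "Railway platform"],        # "platform"
    ["Kernel (operating system)", "Kernel (seed)"],    # "kernel"
]
print(best_sense_sequence(candidates, similarity))
```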

Alternatively, the word sense disambiguation algorithm may be one based on supervised machine learning, in which a feature vector is computed for each term and the most suitable meaning is selected on the basis of that vector.

The informativeness of the term may be chosen as a feature of the feature vector.

In addition, the probability that term t is used in a given meaning m_i may be chosen as a feature of the feature vector; it is computed as

$P_{t}(m_{i}) = \frac{c(t, m_{i})}{\sum_{j} c(t, m_{j})},$

where c(t, m_i) is the co-occurrence frequency of term t with meaning m_i, and the sum in the denominator runs over all meanings m_j of term t.
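As a worked instance of this formula, the probability can be computed directly from a table of co-occurrence counts. The nested dictionary `counts` and the function name `sense_probability` below are hypothetical and serve only to illustrate the arithmetic.

```python
def sense_probability(term, meaning, counts):
    """P_t(m_i) = c(t, m_i) / sum_j c(t, m_j); 0.0 if the term is unknown."""
    senses = counts.get(term, {})
    total = sum(senses.values())
    if total == 0:
        return 0.0
    return senses.get(meaning, 0) / total

# Hypothetical counts c(t, m): "platform" seen 3 times in one meaning, once in another.
counts = {"platform": {"Computing platform": 3, "Railway platform": 1}}
print(sense_probability("platform", "Computing platform", counts))  # 0.75
```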

In addition, the semantic similarity between the concept and the context of the document may be chosen as a feature of the feature vector.

The meanings of the unambiguous terms are taken as the context of the document.
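One possible way to turn this into a numeric feature is sketched below. The text does not say how the similarities to the individual context concepts are aggregated; averaging them is an assumption made here, as are the names `document_context`, `context_similarity`, and the weights in `SIM`.

```python
def document_context(candidates):
    """Meanings of terms that have exactly one candidate concept."""
    return {senses[0] for senses in candidates.values() if len(senses) == 1}

def context_similarity(concept, context, sim):
    """Average similarity of `concept` to the context concepts (assumed aggregation)."""
    others = context - {concept}
    if not others:
        return 0.0
    return sum(sim(concept, c) for c in others) / len(others)

# Hypothetical candidate meanings per term and pairwise weights.
candidates = {
    "platform": ["Computing platform", "Railway platform"],
    "Linux": ["Linux"],
    "kernel": ["Kernel (operating system)"],
}
SIM = {
    frozenset(("Computing platform", "Linux")): 0.8,
    frozenset(("Computing platform", "Kernel (operating system)")): 0.6,
}

def similarity(a, b):
    return SIM.get(frozenset((a, b)), 0.0)

context = document_context(candidates)
for sense in candidates["platform"]:
    print(sense, context_similarity(sense, context, similarity))
```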

In addition, the sum of the informativeness of each unambiguous term and the semantic similarity of its meaning to all other concepts in the document context may be chosen as a feature of the feature vector.

To determine the link structure of an information source that contains no hypertext links: an ontology is extracted from a hypertext encyclopedia; the descriptions of the concepts of the link-free source are enriched with links to the existing ontology extracted from the hypertext encyclopedia; the controlled vocabulary of the existing ontology is extended with the textual representations of all concepts of the source being processed; the co-occurrence frequency of these concepts and their textual representations is set to 1 for each unique representation-concept pair; the enrichment of the concepts of the source being processed is repeated, using informativeness computed from the inverse document frequency, which yields additional links between the concepts extracted from the link-free source; and the co-occurrence frequency of each textual representation and concept is updated on the basis of the links obtained.

To rank concepts by importance to the document, a semantic graph of the document is built. It consists of the meanings of all terms of the document and all possible weighted edges between them, where the weight of an edge equals the semantic proximity between the concepts it connects. A clustering algorithm that groups semantically close concepts is applied to the semantic graph; the concepts from the most significant clusters are then ranked by importance to the document, and the most important concepts are taken as the semantic model of the document.

In addition, semantic proximity between concepts is computed when the ontology is extracted. For each concept K, a list of concepts C is compiled, consisting of the concepts c_i to which K has a link or which have a link to K. The semantic proximity from the current concept K to each concept c_i ∈ C is computed, and the computed value for each pair K and c_i is stored together with the corresponding concepts K and c_i. For concepts not in the list C, the semantic proximity to K is taken to be zero.

Here the links between concepts are assigned weights and a threshold value for the weights is chosen. The list C is then compiled from the concepts to which K has a link with a weight above the chosen threshold, or which have a link to K with a weight above the chosen threshold.
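The precomputation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the proximity measure itself is not defined in this passage, so a Dice coefficient over neighbor sets is used purely as a hypothetical stand-in, and the function names and the `(source, target, weight)` link format are assumptions.

```python
from collections import defaultdict


def neighbour_lists(links, threshold=0.0):
    """Build, for every concept K, the list C of concepts linked to or from K.

    links: iterable of (source, target, weight) triples (assumed format).
    Only links whose weight is strictly above `threshold` are kept.
    """
    neighbours = defaultdict(set)
    for source, target, weight in links:
        if weight > threshold:
            neighbours[source].add(target)
            neighbours[target].add(source)
    return neighbours


def dice(a, b, neighbours):
    """Hypothetical stand-in measure: Dice coefficient of the neighbor sets."""
    na, nb = neighbours[a], neighbours[b]
    if not na or not nb:
        return 0.0
    return 2 * len(na & nb) / (len(na) + len(nb))


def precompute_proximity(links, threshold=0.0, measure=dice):
    """Store proximity only for pairs (K, c_i) with c_i in the list C of K."""
    neighbours = neighbour_lists(links, threshold)
    table = {}
    for k in list(neighbours):
        for c in neighbours[k]:
            table[(k, c)] = measure(k, c, neighbours)
    return table


def proximity(table, a, b):
    """Pairs that were never stored are treated as having zero proximity."""
    return table.get((a, b), 0.0)
```

Restricting the computation to linked pairs keeps the stored table roughly proportional to the number of links rather than to the square of the number of concepts.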

In addition, ontologies may be extracted from several sources.

In addition, the metadata of the document may be used as the text of the document.

Thus, the technical problem is solved owing to the differences between the proposed method and the methods described in known works. The main differences are as follows:

- known methods first identify key terms and then link them to external data sources. In the proposed method the order of text processing is reversed: first all terms are identified and linked to concepts of the ontology extracted from external sources, and then the concepts are ranked by importance to the document. This approach is more complex, since the meanings of all terms of the document must be determined, but it allows the decision on whether a term is a key term to be made on the basis of conceptual knowledge about the document rather than textual features;

- the method involves building a semantic model of the document, which, in particular, makes it possible to enrich the text with links to external sources;

- the proposed method allows many more information sources to be used for building the ontology. In addition to Wikipedia and Web sites containing open linked data, it is proposed to use any available sources that contain textual descriptions of objects of the subject area and need not be connected by hypertext links at all: company Web sites, e-books, specialized documentation, and so on;

- the proposed method expands the range of problems that can be solved.

The operation of the invention is illustrated by the materials presented in Fig. 1 to Fig. 6.

Fig. 1 shows the general scheme for building a semantic model of a document.

Fig. 2 shows the general scheme for building a semantic model of a document with precomputed semantic proximity.

Fig. 3 shows a model ontology that can be used to build a semantic model, using the example of a document consisting of the single sentence "The asteroid belt is located between the orbits of Mars and Jupiter and is a place where many objects of all sizes accumulate".

Fig. 4 shows the semantic graph for a document consisting of the single sentence "The asteroid belt is located between the orbits of Mars and Jupiter and is a place where many objects of all sizes accumulate".

Fig. 5 shows the table of semantic proximity values between concepts.

Fig. 6 shows the table of informativeness values of text representations.

The invention operates in two main steps, shown schematically in Fig. 1. In the first step (101), an ontology is extracted from external information resources. In the second step (103-105), the text of the document is linked to the concepts of the ontology and the semantic model of the document is built.

Consider the first step of the proposed method: extracting the ontology from external information sources. Any information resource that contains descriptions of individual objects of the subject area can serve as a source. The description of the first step begins with the structure of the ontology used in this invention and then covers how various information sources are processed to extract an ontology with the required structure.

An ontology consists of concepts and the links between them. Each concept corresponds to exactly one object of the subject area. A link between two concepts means only that the concepts are related in some way. Links may carry richer semantics, but this is not required by the proposed method. For example, in an ontology describing the business of a company that manufactures photographic equipment, the concepts might be camera models, the technologies used ("intelligent autofocus system"), and so on. Camera models can be linked to the technologies used in them and to other models.

Each concept has an identifier by which it can be found unambiguously. Such an identifier may be: (a) a unique integer assigned to the concept when the ontology is created; (b) the textual name of the concept; or (c) any other means of locating the concept unambiguously in the ontology, for example a pointer in programming-language terms or a primary key if a relational model is used.

Each concept has at least one text representation. A text representation is a word or several words by which the concept can be identified (unlike the identifier, possibly ambiguously). The set of all text representations forms the controlled vocabulary used when documents are linked to the ontology.

If several text representations correspond to one concept, these representations are synonyms of one another. For example, "Russia" and "Russian Federation" are text representations of the same concept.

Because of the nature of natural language, one text representation may be associated with several concepts. Such text representations are called ambiguous. For example, the word "platform" can represent the concepts "political platform", "computer platform", "railway platform", and so on.

To link a document to the ontology, the co-occurrence frequency of a text representation and a concept in the given subject area must be known. This frequency is computed while the ontology is built, as described below.

The informativeness of each text representation is also computed while the ontology is built. Informativeness is a numerical measure of how important a text representation is for the subject area. Methods for computing it are also described below.

In addition, the representations of one concept may differ between natural languages. For example, "кошка" and "cat" are text representations of the same concept in Russian and English. The ontology therefore records which natural language each text representation belongs to.

When the ontology is extracted, a reference to the information resource containing the original description of the concept is also stored. In practical applications of the invention, such references can be offered to the reader of a text enriched by the proposed method, for example as links to additional information on the subject of the document.

Thus, the following information is needed to build the ontology:

- the concept and its identifier,

- the unique identifier of the information resource with the original description of the concept,

- the links between concepts,

- the text representations of concepts,

- the co-occurrence frequency of each text representation and concept,

- the informativeness of each text representation,

- the language of each text representation (if multilingual information is available).
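The items listed above map naturally onto a small data structure. The sketch below is only one possible layout under stated assumptions: the class and field names are hypothetical, integer identifiers are chosen from the options the text allows, and links are stored as undirected because the text says a link only expresses that two concepts are related.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Concept:
    concept_id: int                 # identifier of the concept
    source_uri: str                 # resource with the original description
    linked_ids: Set[int] = field(default_factory=set)  # related concepts


@dataclass
class TextRepresentation:
    text: str
    language: Optional[str] = None  # only known for multilingual sources
    informativeness: float = 0.0


@dataclass
class Ontology:
    concepts: Dict[int, Concept] = field(default_factory=dict)
    # Controlled vocabulary: text -> representation metadata.
    vocabulary: Dict[str, TextRepresentation] = field(default_factory=dict)
    # Co-occurrence frequency of each (text representation, concept) pair.
    cooccurrence: Dict[Tuple[str, int], int] = field(default_factory=dict)

    def add_concept(self, concept_id: int, source_uri: str) -> None:
        self.concepts[concept_id] = Concept(concept_id, source_uri)

    def link(self, a: int, b: int) -> None:
        self.concepts[a].linked_ids.add(b)
        self.concepts[b].linked_ids.add(a)

    def add_representation(self, text: str, concept_id: int, count: int = 1,
                           language: Optional[str] = None,
                           informativeness: float = 0.0) -> None:
        self.vocabulary.setdefault(
            text, TextRepresentation(text, language, informativeness))
        key = (text, concept_id)
        self.cooccurrence[key] = self.cooccurrence.get(key, 0) + count

    def candidates(self, text: str) -> List[Tuple[int, int]]:
        """Concepts a (possibly ambiguous) representation may denote,
        most frequent first, as (concept_id, frequency) pairs."""
        hits = [(cid, n) for (t, cid), n in self.cooccurrence.items() if t == text]
        return sorted(hits, key=lambda hit: -hit[1])
```

An ambiguous representation such as "platform" then simply has several entries in `cooccurrence`, one per candidate concept.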

Consider the ontology extraction process. The information sources that are simplest to process are hypertext encyclopedias. This process is known and is described in [Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM '07). ACM, New York, NY, USA, 233-242.] and [David Milne and Ian H. Witten. 2008. Learning to link with wikipedia. In Proceeding of the 17th ACM conference on Information and knowledge management (CIKM '08). ACM, New York, NY, USA, 509-518]. A hypertext encyclopedia is a body of information consisting of objects and descriptions of those objects. Each object is an encyclopedia article, for example "City of Moscow" or "Pythagorean theorem". Thus, each object of the hypertext encyclopedia becomes a concept of the ontology. The concept identifier can be taken from information in the encyclopedia that determines the concept unambiguously, or it can be created by the ontology processing system, which assigns one to each concept. For example, in the open encyclopedia Wikipedia each article already has a unique identifier that can be used in an ontology extracted from it. When the ontology is extracted, the unique resource identifier (URL) at which the original page can be found should also be stored.

The description of an object may mention other objects of the encyclopedia. In hypertext encyclopedias such mentions are represented as hypertext links to the descriptions of the other objects. Thus, each object can have links to other objects, where a link denotes a relation between two objects: (i) the object that contains the link, and (ii) the object to which the link points. These links define the links between concepts. For example, from the description "Moscow - [capital|Capital] of the [Russian Federation|Russia]" it follows that the concept "Moscow" is related to the concepts "Capital" and "Russia". In this and later examples, hypertext links are written in square brackets and consist of two parts separated by a vertical bar: the text the user sees ("capital", "Russian Federation") and the object to which the link points ("Capital", "Russia"). The text visible to the user is called the link caption.

Text representations and their associated frequency characteristics are extracted using the link structure described above. The caption of a link is treated as a text representation of the concept to which the link points. In the previous example, "Russian Federation" is thus a text representation of the concept "Russia". The co-occurrence frequency of a text representation and a concept then equals the number of links whose caption is that text representation and whose target is that concept. Note that in Wikipedia, redirect pages, which make it possible to define synonyms for an article title, are organized as a special case of a hypertext link and are processed in the same way.

However, not every caption should be treated as a text representation and added to the ontology. Captions may contain misspelled words, for example, or uninformative terms that are of interest only in context (such as the word "this"). To filter out such captions, an occurrence threshold is used: a caption counts as a text representation only once its number of occurrences passes the threshold. The threshold is chosen depending on the resource being processed. For the English-language Wikipedia, a threshold of no more than 10 is recommended.
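The caption-based extraction and the occurrence filter can be sketched as below. This is a minimal illustration assuming the `[caption|Target]` notation with a vertical bar that the text uses for its examples; the function names are hypothetical, and two details are assumptions rather than statements of the source: occurrences are counted per caption (not per caption-concept pair), and the threshold comparison is inclusive.

```python
import re
from collections import Counter

# Matches links written as [caption|Target], the notation used in the text.
LINK = re.compile(r"\[([^\[\]|]+)\|([^\[\]|]+)\]")


def extract_links(description):
    """Return (caption, target concept) pairs found in one description."""
    return [(c.strip(), t.strip()) for c, t in LINK.findall(description)]


def count_pairs(articles):
    """articles: dict mapping a concept to its description text.

    Returns the co-occurrence frequency of every (caption, concept) pair
    and the set of (concept, linked concept) relations.
    """
    pair_counts = Counter()
    relations = set()
    for concept, text in articles.items():
        for caption, target in extract_links(text):
            pair_counts[(caption, target)] += 1
            relations.add((concept, target))
    return pair_counts, relations


def filter_captions(pair_counts, threshold):
    """Keep only pairs whose caption occurs at least `threshold` times."""
    totals = Counter()
    for (caption, _), n in pair_counts.items():
        totals[caption] += n
    return {pair: n for pair, n in pair_counts.items()
            if totals[pair[0]] >= threshold}
```

For instance, with the description "Moscow - [capital|Capital] of the [Russian Federation|Russia]", `extract_links` yields the pairs ("capital", "Capital") and ("Russian Federation", "Russia").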

In hypertext encyclopedias it is customary to use links only for notions that are important for understanding the main text. The informativeness (degree of importance) of a text representation can therefore be estimated as the ratio of the number of articles in which the representation occurs as a link to the number of articles in which it occurs at all. For example, the informativeness of the term "Asteroid belt" computed from Wikipedia is 0.3663, while that of the term "Base" is 0.00468. The latter is much lower because the term is ambiguous and its meaning is more often assumed to be known to the reader or is not essential to the description.
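The ratio described above can be illustrated with a short sketch. The input format (plain article text plus the set of link captions in that article) and the case-insensitive substring matching are simplifying assumptions made for this example only; a real system would match tokenized terms.

```python
def informativeness(term, articles):
    """Share of articles mentioning `term` in which it appears as a link.

    articles: iterable of (plain_text, link_captions) pairs, where
    link_captions is the set of captions used as links in that article.
    """
    needle = term.lower()
    as_link = 0
    mentioned = 0
    for text, captions in articles:
        linked = any(caption.lower() == needle for caption in captions)
        if linked or needle in text.lower():
            mentioned += 1
        if linked:
            as_link += 1
    # A term that never occurs gets zero informativeness (assumption).
    return as_link / mentioned if mentioned else 0.0
```

A term that is linked almost every time it is mentioned scores close to 1, while a common word that is rarely linked scores close to 0.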

Hypertext encyclopedias are usually created for one particular language, so the language of a text representation is the language of the encyclopedia. Note that when multilingual ontologies are built, duplicate concepts must be identified. For example, the Wikipedia article "Россия" has a counterpart article in English. Hypertext encyclopedias contain interlanguage links to the counterparts of an article in other languages, which offer a simple way of establishing such duplicates. More sophisticated methods of establishing duplicates exist, but they belong to the field of machine translation and are not considered in this invention.

In addition to the known method of extracting an ontology from hypertext encyclopedias, this invention proposes a method of extracting an ontology from other information sources, for example Web sites, databases, or electronic documents. An ontology can be extracted if individual objects and their descriptions can be identified in the source. For example, a Web site covering new film releases may contain separate pages (or page segments) describing films, as well as personal pages for actors, directors, and so on.

For such sources, each object becomes a concept of the ontology. As with a hypertext encyclopedia, the concept identifier is determined from the available information or assigned automatically by the system that processes the source. The unique identifier of the resource containing the description is also stored. If no such identifier exists for an object, for example because one page contains several objects and their descriptions, the most precise identifier of the enclosing, more general unit is stored (in this example, the page identifier).

Text representations of concepts are extracted using the rules described below, which rely on the structure of the source. For Web pages, text representations may be contained in the page title or marked with special tags. More sophisticated methods that take the structural and textual properties of the document into account can also be used. For example, machine learning algorithms can be applied that use features such as the parts of speech of words, the context formed by surrounding words, the presence of capital letters, and so on [Gabor Melli and Martin Ester. 2010. Supervised identification and linking of concept mentions to a domain-specific ontology. In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM '10). ACM, New York, NY, USA, 1717-1720.].

Links between concepts are determined by analyzing their descriptions. If the descriptions of the concepts have a well-developed link structure, the remaining information is extracted in the same way as for a hypertext encyclopedia.

If the descriptions contain no links, more sophisticated algorithms are needed to build links between objects. This invention can be used to solve that problem.

First, the informativeness of a text representation is determined. It is needed when links between concepts are established, but in this case the link structure cannot be used to compute it. Instead, the degree of importance of a text representation can be determined using the inverse document frequency of the term that lexically coincides with the text representation, a measure known from the field of information retrieval [G. Salton. Динамические библиотечно-поисковые системы (Dynamic Library Search Systems). Moscow: Mir, 1979.]:

$\text{informativeness}(\text{text representation}) = idf(\text{term}) = \log \frac{| D |}{| (d_{i} \supset t_{i}) |}.$
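As an illustration of the inverse document frequency measure, the following minimal sketch computes it over a small collection of descriptions. It assumes that each description has already been split into a set of words; the function name `idf` and the sample data are hypothetical and are not part of the described method.

```python
import math

def idf(term, descriptions):
    """Inverse document frequency of `term` over a list of descriptions.

    Each description is assumed to be a set of already normalised words.
    """
    containing = sum(1 for d in descriptions if term in d)
    if containing == 0:
        raise ValueError("term does not occur in any description")
    return math.log(len(descriptions) / containing)

# Hypothetical descriptions of four concepts.
descriptions = [
    {"asteroid", "belt", "orbit"},
    {"belt", "leather"},
    {"orbit", "planet"},
    {"planet", "sun"},
]

idf_belt = idf("belt", descriptions)          # log(4 / 2)
idf_asteroid = idf("asteroid", descriptions)  # log(4 / 1)
```

A term that occurs in fewer descriptions receives a higher value, which is why the measure can stand in for informativeness when no link structure is available.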

To determine the link structure, the following steps are performed:

1. extract an ontology from a hypertext encyclopedia, for example Wikipedia;

2. enrich the concept descriptions of the information source being processed with links to the existing ontology;

3. extend the controlled vocabulary of the existing ontology with the text representations of all concepts of the information source being processed;

4. set the co-occurrence frequency of a text representation and a new concept to 1 for each unique representation-concept pair;

5. repeat the enrichment of the concepts of the information source being processed, this time using the informativeness computed from the inverse document frequency; additional links between the concepts themselves will appear as a result (see the text processing procedure described below);

6. update the co-occurrence frequency of the text representation and the concept based on the information from the links obtained.

An ontology extracted from a hypertext encyclopedia is needed to build the new ontology because natural-language terms are ambiguous. The present invention makes it possible to determine the meaning of a term in a given context. Using a known ontology therefore makes it possible to resolve the ambiguity of terms in the descriptions of new concepts.

Some information sources contain translations of information into different languages. For such sources, the language of each text representation must be preserved during processing.

The result of the operations described above is a single ontology extracted from several information sources. For some applications, however, it is useful to distinguish between ontologies built from different information sources. To this end, each concept is given an additional attribute indicating the source from which it was extracted, and this attribute is consulted during document processing to obtain information about the source.

Before turning to text processing, we introduce the notion of semantic similarity between concepts, which is used below.

Semantic similarity is a mapping f: X × X → R that assigns a real number to a pair of concepts x and y and has the following properties:

- 0 ≤ f(x, y) ≤ 1,

- f(x, y) = 1 ⇔ x = y.

Known methods for computing semantic similarity fall into two classes:

- methods that define similarity over text fields, and

- methods that use the link structure of the ontology.

The first class comprises the methods used in information retrieval to compare text documents. The best-known method represents a document through the vector space model: every word in every document is assigned a weight, the documents are represented as vectors in the n-dimensional space of all words, and the similarity between the resulting vectors is computed with some mathematical measure. The weight of a word in a document can be defined as

weight = tf * idf

where tf is the number of occurrences of the word in the document and idf is the inverse document frequency described above. The weight of each word then gives the coordinate of the document vector in the corresponding dimension. The cosine measure is often used to compute the similarity between vectors:

$\cos (d_{1}, d_{2}) = \frac{d_{1} \cdot d_{2}}{\| d_{1} \| \| d_{2} \|},$

where

$\| d \| = \sqrt{\sum_{i = 1}^{n} d_{i}^{2}}.$

Thus, the similarity between concepts can be defined as the similarity between their descriptions.
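The vector-model comparison of two descriptions can be sketched as follows. This is a minimal illustration that assumes the descriptions are already tokenized and that idf values are available in a dictionary; the names `tfidf_vector`, `cosine` and `idf_weights`, as well as the numeric values, are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse document vector: word -> tf * idf, with tf the raw count."""
    tf = Counter(tokens)
    return {word: count * idf.get(word, 0.0) for word, count in tf.items()}

def cosine(v1, v2):
    """Cosine measure between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)

# Hypothetical idf values and two tokenised concept descriptions.
idf_weights = {"asteroid": 1.4, "belt": 0.7, "orbit": 0.7, "leather": 1.4}
d1 = tfidf_vector(["asteroid", "belt", "orbit"], idf_weights)
d2 = tfidf_vector(["belt", "leather"], idf_weights)

sim_12 = cosine(d1, d2)  # the descriptions share only the word "belt"
sim_11 = cosine(d1, d1)  # a description compared with itself
```

Storing the vectors as dictionaries keeps only the non-zero coordinates, which matters because the space of all words is large while each description uses few of them.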

However, measures from the second class are used more often. These in turn can be divided into local and global. Local methods define the similarity between concepts A and B as the normalized number of common neighbors N(X):

$\mathrm{sim}(A, B) = \frac{1}{Z} | N(A) \cap N(B) |,$

where Z is the normalization coefficient and $| N(A) \cap N(B) |$ is the size of the intersection of the sets of immediate neighbors of A and B.

The best-known local methods are:

- the cosine measure: $Z = \sqrt{\sum_{i = 1}^{n} A_{i}^{2}} \sqrt{\sum_{j = 1}^{n} B_{j}^{2}}$;

- the Dice coefficient: $Z = \frac{| N(A) | + | N(B) |}{2}$;

- the Jaccard coefficient: $Z = | N(A) \cup N(B) |$ (the size of the union of the sets of immediate neighbors).

For these measures to satisfy the second property in the definition of semantic similarity, we assume that every concept in the ontology has a link to itself. Then the similarity between concepts that have no links to other concepts equals 1 only if the concepts coincide, and 0 otherwise.
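The local measures can be illustrated with the following sketch. It assumes the ontology links are given as an adjacency dictionary and, for the cosine measure, that A and B are binary indicator vectors of the neighbor sets, so that Z reduces to the square root of the product of the set sizes. The names `neighbors`, `local_similarity` and the sample graph are hypothetical.

```python
import math

def neighbors(graph, concept):
    """Immediate neighbors of a concept plus the concept itself (self-link)."""
    return set(graph.get(concept, ())) | {concept}

def local_similarity(graph, a, b, measure="dice"):
    """Normalised number of common neighbors of concepts a and b."""
    na, nb = neighbors(graph, a), neighbors(graph, b)
    common = len(na & nb)
    if measure == "cosine":
        # Assumption: binary indicator vectors, so each norm is sqrt(|N|).
        z = math.sqrt(len(na)) * math.sqrt(len(nb))
    elif measure == "dice":
        z = (len(na) + len(nb)) / 2
    elif measure == "jaccard":
        z = len(na | nb)
    else:
        raise ValueError(f"unknown measure: {measure}")
    return common / z

# Hypothetical link structure; X and Y have no links to other concepts.
graph = {"A": {"C", "D"}, "B": {"C", "E"}, "X": set(), "Y": set()}

dice_ab = local_similarity(graph, "A", "B", "dice")        # 1 / 3
jaccard_ab = local_similarity(graph, "A", "B", "jaccard")  # 1 / 5
isolated_same = local_similarity(graph, "X", "X")          # 1.0
isolated_diff = local_similarity(graph, "X", "Y")          # 0.0
```

Because of the self-link, Z is never zero, and the two isolated concepts X and Y behave as the text requires: similarity 1 with themselves and 0 with each other.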

Note that these measures are defined over sets and cannot take link semantics into account. To remedy this, D. Yu. Turdakov's dissertation proposed weighting links of different types and generalizing the similarity measures to weighted links using fuzzy set theory [Turdakov D.Yu. Methods and software for lexical disambiguation based on document networks. Dissertation, Moscow, 2010].

The best known of the global methods is the SimRank measure. The main premise of this model is: "two objects are similar if they are referenced by similar objects". Because this premise defines similarity in terms of itself, the base case of the SimRank model is the statement: "every object is maximally similar to itself", that is, it has a similarity score of 1 with itself.

Note that global methods have higher computational complexity and can be applied only to small ontologies. Local methods are therefore recommended for the proposed method.

In addition to the known definition of semantic similarity given above, the present invention proposes a generalization for computing semantic similarity between sets of concepts. To do this, a set is represented as a generalized concept that is connected to the neighbors of all the concepts it contains:

$N(c_{1}, c_{2}, \dots, c_{N}) = \bigcup_{i = 1}^{N} N(c_{i}),$

that is, the set of neighbors of the generalized concept is the union of the sets of all immediate neighbors of the concepts it contains.
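A minimal sketch of the generalized concept follows. It assumes an adjacency-dictionary ontology with a self-link for every concept, and it uses the Jaccard normalization purely as an example of a local measure; the function names and sample graph are hypothetical.

```python
def generalized_neighbors(graph, concepts):
    """Union of the neighbor sets of all concepts in the group.

    Each concept is assumed to link to itself, as for single concepts.
    """
    result = set()
    for concept in concepts:
        result |= set(graph.get(concept, ())) | {concept}
    return result

def set_similarity(graph, group_a, group_b):
    """Jaccard-normalised similarity between two non-empty groups of concepts."""
    na = generalized_neighbors(graph, group_a)
    nb = generalized_neighbors(graph, group_b)
    return len(na & nb) / len(na | nb)

# Hypothetical link structure.
graph = {"A": {"C"}, "B": {"D"}, "E": {"C", "D"}}

group_neighbors = generalized_neighbors(graph, ["A", "B"])  # {"A", "B", "C", "D"}
group_sim = set_similarity(graph, ["A", "B"], ["E"])        # 2 common of 5 total
```

Once the neighbor set of a group has been formed, any of the single-concept local measures can be applied to it without change.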

Computing semantic similarity is a frequent operation in text processing. The present invention therefore proposes an additional (optional) step when building the ontology: precomputing the similarity between concepts (Fig. 2, 202). However, precomputing the semantic similarity between all concepts of a sufficiently large ontology, such as one extracted from Wikipedia, is not feasible. Wikipedia describes more than 3.5 million concepts, and the concepts are densely interlinked. Storing all the values would require several terabytes, and computing them would, at the current state of the art, take several machine-years. The present invention therefore proposes several heuristics that allow the similarity to be precomputed for a subset of terms, with only these values used during text processing.

- For an ontology that does not store link types, compute semantic similarity only for concepts that are directly linked.

- For an ontology that stores link semantics, assign link weights according to link type and compute semantic similarity only for links whose weights exceed a threshold. The threshold and link weights are chosen as a trade-off between the number of similarity values to precompute and the quality of term meaning determination.

These heuristics make it possible to precompute semantic similarity without significant loss of quality in text processing.

Semantic similarity is precomputed for all pairs of concepts that are connected by a link selected using the heuristics above. The precomputation proceeds as follows:

For each concept K:

- obtain the list C of neighboring concepts that K links to or that link to K;

- compute the semantic similarity between the current concept K and every neighboring concept $c_{i} \in C$;

- for each concept $c_{i}$ in C, store the computed semantic similarity for the pair K and $c_{i}$ together with the concepts K and $c_{i}$ themselves.

If semantic similarity has been precomputed, the stored values are retrieved from the ontology during text processing. If it has not been precomputed, it is computed on request.
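The precomputation and the fallback to on-request computation can be sketched as follows. The sketch assumes that outgoing links are given as a dictionary, that the similarity function is symmetric (so each pair is stored once), and that an in-memory dictionary stands in for storage in the ontology; all names are hypothetical.

```python
def precompute_similarity(out_links, similarity):
    """Store similarity for every pair of concepts joined by a link."""
    # Invert the links so that incoming neighbors can be found as well.
    in_links = {}
    for source, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, set()).add(source)

    cache = {}
    for k in set(out_links) | set(in_links):
        linked = set(out_links.get(k, ())) | in_links.get(k, set())
        linked.discard(k)
        for c in linked:
            pair = frozenset((k, c))  # unordered pair: similarity assumed symmetric
            if pair not in cache:
                cache[pair] = similarity(k, c)
    return cache

def get_similarity(cache, a, b, similarity):
    """Return the stored value if present, otherwise compute it on request."""
    pair = frozenset((a, b))
    if pair in cache:
        return cache[pair]
    return similarity(a, b)

# Hypothetical links and a stand-in similarity function that records its calls.
out_links = {"A": {"B"}, "B": {"C"}, "C": set()}
calls = []

def toy_similarity(a, b):
    calls.append((a, b))
    return 0.5

cache = precompute_similarity(out_links, toy_similarity)      # pairs A-B and B-C
stored = get_similarity(cache, "B", "A", toy_similarity)      # read from the cache
on_request = get_similarity(cache, "A", "C", toy_similarity)  # not linked: computed
```

Only directly linked pairs are stored, which corresponds to the first heuristic; the threshold on link weights from the second heuristic is not modeled here.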

We now turn to the second step of the proposed method: linking the document to the ontology. A term is a word or several consecutive words of the text. The aim of this step is to find an unambiguous correspondence between terms and ontology concepts. Such concepts are called the meanings of the terms. The aim can thus be restated as finding terms in the text and determining their meanings.

Finding the terms in the text and determining their meanings takes three steps (Fig. 1):

- In the first step, all possible links between terms and concepts are identified (103).

- In the second step, the lexical ambiguity of the terms is resolved (104).

- In the third step, the semantic model of the document is built (105).

Term search (103) consists in matching parts of the text against the text representations in the controlled vocabulary. The simplest and most efficient approach is to search for exactly matching strings. Methods based on partial matching are also known, but they can be applied only to small ontologies because of their considerably higher computational complexity.

Consider a method based on the presence of a term in the vocabulary. Since enumerating every possible part of the text is inefficient, we give several heuristics that speed up the process.

1. Since the controlled vocabulary contains only words or sequences of words, it makes sense to split the text into words and check only word-aligned parts of the text against the vocabulary.

2. A term cannot cross sentence boundaries, so a term needs to be searched for only within a single sentence.

3. In the vast majority of cases, text representations of concepts are noun phrases (or nouns, for single words). To speed up processing, it is therefore recommended to determine the parts of speech of the words and skip combinations that are not noun phrases. This heuristic also improves the accuracy of term detection by resolving morphological ambiguity (for example, in Russian a non-noun use of «стекло» is not treated as the noun «стекло» ("glass"), and in English the verb "cause" is not treated as the noun "cause").

4. Since words can occur in different forms, the possible word forms would have to be stored in the controlled vocabulary. To reduce memory use and speed up processing, it is recommended to use lemmatization to convert all words to their base form, for example nouns to the nominative singular in Russian and to the singular in English. For the same reason, it makes sense to convert all letters to a single case and to remove whitespace and punctuation in compound terms. All words in the controlled vocabulary are then subject to the same transformation. For example, the text representation "Пояс астероидов" ("Asteroid belt") is transformed into "поясастероид".

5. For compound terms that contain other terms as parts, it makes sense to consider only the longest term. For example, for the term "Пояс астероидов" ("Asteroid belt"), the word "Пояс" ("Belt") need not be considered separately.
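Heuristics 1, 2, 4 and 5 can be combined into a simple longest-match lookup, sketched below. The sketch makes several assumptions: sentences are split with a naive regular expression, the lemmatizer is a placeholder that returns the word unchanged, the part-of-speech filter of heuristic 3 is omitted, and the vocabulary keys are already normalized. The names `normalise`, `find_terms`, `max_len` and the sample vocabulary are hypothetical.

```python
import re

def normalise(words, lemmatise=lambda w: w):
    """Lower-case, lemmatise and join words without spaces (heuristic 4)."""
    return "".join(lemmatise(w.lower()) for w in words)

def find_terms(text, vocabulary, max_len=4, lemmatise=lambda w: w):
    """Return (surface form, candidate concepts) for the longest matches."""
    found = []
    for sentence in re.split(r"[.!?]+", text):     # heuristic 2: stay in a sentence
        words = re.findall(r"\w+", sentence)       # heuristic 1: word boundaries
        i = 0
        while i < len(words):
            # Heuristic 5: try the longest candidate first.
            for n in range(min(max_len, len(words) - i), 0, -1):
                key = normalise(words[i:i + n], lemmatise)
                if key in vocabulary:
                    found.append((" ".join(words[i:i + n]), vocabulary[key]))
                    i += n
                    break
            else:
                i += 1  # no term starts at this word
    return found

# Hypothetical controlled vocabulary: normalised representation -> concepts.
vocabulary = {
    "asteroidbelt": ["Asteroid belt"],
    "belt": ["Belt (clothing)", "Asteroid belt"],
    "jupiter": ["Jupiter (planet)", "Jupiter (mythology)"],
}
text = "The asteroid belt lies inside the orbit of Jupiter. A belt holds trousers."
terms = find_terms(text, vocabulary)
```

In the first sentence the two-word term is matched as a whole, so the word "belt" inside it is not reported separately, while the standalone "belt" in the second sentence is found with both of its candidate concepts.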

Once the terms present in the controlled vocabulary have been found, the concepts corresponding to them are treated as the possible meanings of the terms. The next step is to choose, from the possible concepts, the one that will be taken as the meaning of the term. Determining the meanings of terms belongs to the field of word sense disambiguation. The most important results in this field build on recent advances in machine learning.

Machine learning is a subfield of artificial intelligence that studies methods for constructing algorithms capable of learning. A distinction is made between supervised and unsupervised learning. In supervised learning, the algorithm produces a function that maps inputs to outputs in a particular way (the classification task); the training data consist of examples of inputs paired with the desired outputs. In unsupervised learning, the algorithm acts as an agent that models a set of inputs without access to pre-labeled examples. Both types of algorithm use the notion of a feature. Features are individual measurable properties of the observed phenomenon that are used to build its numerical representation (for example, the semantic similarity between the concept being processed and the meaning of the preceding term).

Consider some existing meaning-determination algorithms that can be used in the present invention. The simplest approach is to choose the most frequently used meaning. To do this, we find the co-occurrence frequency of the term being processed with each of the concepts associated with it and choose the concept with the highest frequency as the meaning. This algorithm always chooses the same meaning for a given term regardless of context, and its accuracy is therefore low.
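The most-frequent-meaning baseline amounts to a single lookup, as the following sketch shows. It assumes the co-occurrence frequencies are available as a nested dictionary; the name `most_frequent_meaning` and the counts are hypothetical.

```python
def most_frequent_meaning(term, cooccurrence):
    """Pick the concept that co-occurs most often with the term."""
    candidates = cooccurrence[term]
    return max(candidates, key=candidates.get)

# Hypothetical co-occurrence frequencies: term -> {concept: frequency}.
cooccurrence = {
    "jupiter": {"Jupiter (planet)": 120, "Jupiter (mythology)": 30},
}

meaning = most_frequent_meaning("jupiter", cooccurrence)
```

The function takes no context argument at all, which makes the limitation noted above explicit: the same term always receives the same meaning.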

Another approach is an algorithm that computes the most semantically related sequence of meanings. Consider all possible sequences of meanings for a given sequence of terms. A weight is computed for each possible sequence of concepts as the sum of the weights of the unique pairwise combinations of the concepts it contains. The meanings of the terms are the concepts of the sequence with the highest weight. An example of use is described below.
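The exhaustive search over sequences of meanings can be sketched as follows. The sketch assumes that the weight of a pair of concepts is their semantic similarity, which the text does not state explicitly, and that the similarity function is symmetric; the names and the similarity values are hypothetical.

```python
from itertools import combinations, product

def best_sequence(candidates_per_term, similarity):
    """Return the sequence of meanings with the largest total pairwise weight."""
    best, best_weight = None, float("-inf")
    for sequence in product(*candidates_per_term):
        # Unique unordered pairs of distinct concepts in this sequence.
        pairs = {frozenset(p) for p in combinations(sequence, 2) if p[0] != p[1]}
        weight = sum(similarity(*sorted(pair)) for pair in pairs)
        if weight > best_weight:
            best, best_weight = sequence, weight
    return best, best_weight

# Hypothetical pairwise similarities; unlisted pairs count as 0.
sims = {
    frozenset(("Jupiter (planet)", "Asteroid belt")): 0.6,
    frozenset(("Jupiter (mythology)", "Asteroid belt")): 0.05,
}

def toy_similarity(a, b):
    return sims.get(frozenset((a, b)), 0.0)

candidates = [
    ["Jupiter (planet)", "Jupiter (mythology)"],  # meanings of "Jupiter"
    ["Asteroid belt", "Belt (clothing)"],         # meanings of "belt"
]
sequence, weight = best_sequence(candidates, toy_similarity)
```

The number of sequences is the product of the numbers of candidate meanings, so this direct enumeration is practical only for short sequences of terms.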

The two algorithms described are extreme cases that ignore important information about the text. The best results are therefore achieved by algorithms based on supervised machine learning. For each term, a feature vector is computed, on the basis of which the algorithm chooses the most suitable meaning. The features may include:

- the informativeness of the term;

- the probability that the term t is used in a given meaning $m_{i}$, computed as

$P_{t}(m_{i}) = \frac{c(t, m_{i})}{\sum_{i} c(t, m_{i})},$

where $c(t, m_{i})$ is the co-occurrence frequency of the term and the meaning;

- the semantic similarity between the concept and the context, where the context may consist of meanings that are already determined, for example the meanings of unambiguous terms;

- the quality of the context, defined as the sum of the informativeness of each unambiguous term and the semantic similarity of its meaning to all other concepts in the context;

- as well as other features.
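The first three features can be assembled into a vector as in the sketch below. It assumes that the similarity to the context is taken as the average similarity to the context concepts, which is only one possible choice; the context-quality feature and the classifier itself are omitted. All names and values are hypothetical.

```python
def prior_probability(term, meaning, cooccurrence):
    """P_t(m_i): share of the term's co-occurrences that fall on this meaning."""
    counts = cooccurrence[term]
    total = sum(counts.values())
    return counts.get(meaning, 0) / total if total else 0.0

def feature_vector(term, meaning, context, cooccurrence, informativeness, similarity):
    """[informativeness, prior probability, average similarity to the context]."""
    if context:
        context_sim = sum(similarity(meaning, c) for c in context) / len(context)
    else:
        context_sim = 0.0
    return [
        informativeness.get(term, 0.0),
        prior_probability(term, meaning, cooccurrence),
        context_sim,
    ]

# Hypothetical inputs.
cooccurrence = {"jupiter": {"Jupiter (planet)": 120, "Jupiter (mythology)": 30}}
informativeness = {"jupiter": 1.2}
sims = {frozenset(("Jupiter (planet)", "Asteroid belt")): 0.6}

def toy_similarity(a, b):
    return sims.get(frozenset((a, b)), 0.0)

context = ["Asteroid belt"]  # meaning of an unambiguous term in the same text
features = feature_vector(
    "jupiter", "Jupiter (planet)", context,
    cooccurrence, informativeness, toy_similarity,
)
```

One such vector is computed per candidate meaning, and a supervised classifier trained on labeled examples would then score the candidates.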

Text annotated by experts can serve as training data for the machine learning algorithm. However, annotating text is resource-intensive, so the documents of the source used to build the ontology can be used as the training set instead.

Another important point is that the meaning of a term is not always present in the ontology. To detect this situation, a special concept denoting the absence of a correct meaning in the ontology is added to the list of concepts corresponding to each term. The machine learning algorithm described above will then identify such cases, although this requires a special training set that contains them. Note that such a corpus can also be used to train an algorithm that determines whether the meaning of a term is present in the ontology. Such an algorithm can be combined with the simple meaning-determination algorithms described above. A special case of such an algorithm is filtering out terms whose informativeness is below a threshold. This last approach is described in the literature [Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM '07). ACM, New York, NY, USA, 233-242; and David Milne and Ian H. Witten. 2008. Learning to link with Wikipedia. In Proceeding of the 17th ACM conference on Information and knowledge management (CIKM '08). ACM, New York, NY, USA, 509-518].

In the final step of the proposed method, the semantic model of the document is built (Fig. 1, 105). In the simplest case, the semantic model of a document is the list of all term meanings. Such a model can be useful if the document contains few terms, all of which are therefore important. For large documents, however, the most significant concepts must be identified.

The method for determining the most significant concepts consists of the following steps:

- In the first step, the main topic (or topics) of the document is identified. Large documents often have one or a few main topics and many supplementary descriptions. For example, a document describing an event may contain, besides terms directly related to the event, terms describing the time and place where it occurred. Another case where it is important to understand the main topic is the processing of noisy documents. For example, when processing Web pages it is sometimes difficult to separate the main text from auxiliary elements such as menus.

To identify the main topic, it is proposed to group semantically close meanings and then select the main group (or groups). To do this, we build a complete weighted semantic graph of the document, in which the nodes are the meanings of the terms and each edge has a weight equal to the semantic closeness between the concepts at its endpoints. A clustering algorithm is then applied to this graph. Clustering algorithms are unsupervised machine learning algorithms; they partition the graph into groups so that links within a group are strong and links between groups are weak.

Next, each group is assigned a weight. The weight may combine several parameters of the group. For example, one can use the informativeness of the group, that is, the sum of the informativeness of all terms whose meanings are present in the group.

Next, the group or groups with the greatest weight are selected. The number of groups can be set heuristically, for example by taking all groups whose weight exceeds a certain threshold, or automatically, by analyzing the drops between group weights. For example, one can take all groups whose weight is at least 75% of the largest group weight.

The selected groups contain the meanings that describe the main topics of the document. These meanings are treated as candidates for key concepts.
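The group-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the clustering itself is taken as already done, `group_weight`, `select_main_groups` and the `ratio` parameter are hypothetical names, each concept is mapped directly to the informativeness of its term, and all numeric values are made up for the example.

```python
def group_weight(group, informativeness):
    """Weight of a group: sum of the informativeness of the terms whose meanings it contains."""
    return sum(informativeness[concept] for concept in group)


def select_main_groups(groups, informativeness, ratio=0.75):
    """Keep every group whose weight is at least `ratio` of the largest group weight."""
    if not groups:
        return []
    weights = [group_weight(g, informativeness) for g in groups]
    cutoff = ratio * max(weights)
    return [g for g, w in zip(groups, weights) if w >= cutoff]


# Made-up informativeness values and clusters, for illustration only.
informativeness = {"a": 0.40, "b": 0.30, "c": 0.35, "d": 0.25, "e": 0.10}
groups = [["a", "b"], ["c", "d"], ["e"]]

main_groups = select_main_groups(groups, informativeness)
# Candidate key meanings are all members of the selected groups.
candidates = [concept for group in main_groups for concept in group]
print(candidates)
```

With these toy values the first two groups pass the 75% cutoff and the third does not. A gap-based rule (looking for the largest drop between sorted group weights) could replace the fixed ratio without changing the rest of the sketch.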

- In the second step, the key concepts are chosen from the candidate meanings. All candidate meanings are weighted and sorted in descending order of weight. The N concepts with the greatest weight are chosen as key, where the parameter N is chosen heuristically. Various combinations of concept parameters can be used for weighting. For example, the weight may be the product of

- the average number of words in all textual representations of the concept that occurred as a term of the document,

- the frequency of occurrence of the concept in the document, and

- the maximum informativeness over all textual representations of the concept that occurred as a term in the document.

With a semantic model of the document built and its key concepts identified, the applications described above are easy to implement. For example, for document enrichment, the key terms are those whose meanings are key concepts. By computing the closeness between documents as the semantic closeness between their models, one can build semantic search systems and semantic recommendation systems.

Consider the construction of a semantic model for a document consisting of the single sentence "The asteroid belt is located between the orbits of Mars and Jupiter and is the place of accumulation of many objects of various sizes" and the ontology shown in Fig. 3.

1. Identify the terms of the text to which concepts may correspond.

1.1. Split the input text into tokens (rendered here word for word from the Russian original): "Belt", "asteroids", "located", "between", "orbits", "Mars", "and", "Jupiter", "and", "is", "place", "accumulation", "set", "objects", "various", "sizes".

1.2. Apply an algorithm to find the lemma of each word: "belt", "asteroid", "located", "between", "orbit", "Mars", "and", "Jupiter", "and", "be", "place", "accumulation", "set", "object", "various", "size".

1.3. Apply a greedy algorithm to look up terms in the vocabulary. To do this, scan token sequences of at most five words (n=5) and check whether they are present in the controlled vocabulary. Every word in the vocabulary must be in its normal form, so that the sequence "belt"+"asteroid" finds "asteroid belt". In this way we find the terms "Asteroid belt", "Orbit", "Mars", "Jupiter", "Place", "Set", "Object". Note that the terms "accumulation" and "size" were not found, because they are not present in the ontology.

1.4. For each term, obtain the set of concepts associated with the textual representations from the vocabulary: "Asteroid belt", "Orbit (trajectory)", "Orbit (eye socket)", "Mars (planet)", "Mars (god)", "Jupiter (planet)", "Jupiter (god)", "Place (location)", "Place (novel)", "Set (mathematics)", "Set (programming)", "Object (thing)", "Object (space)".
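The greedy vocabulary lookup used to find terms can be sketched as follows. This is an illustrative sketch only: the function name `find_terms`, the dictionary layout (a tuple of lemmas mapped to a term) and the lower-casing of lemmas are assumptions made for the example, and the lemmas are English glosses of the example sentence rather than output of a real lemmatizer.

```python
def find_terms(lemmas, vocabulary, max_len=5):
    """Greedy longest-match lookup of lemma sequences in a controlled vocabulary."""
    terms = []
    i = 0
    while i < len(lemmas):
        # Try the longest window first (at most max_len lemmas), then shrink it.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in vocabulary:
                terms.append(vocabulary[key])
                i += n  # skip past the matched sequence
                break
        else:
            i += 1  # no vocabulary entry starts at this lemma
    return terms


# Lemmas of the example sentence (English glosses, lower-cased for lookup).
lemmas = ["belt", "asteroid", "located", "between", "orbit", "mars", "and",
          "jupiter", "and", "be", "place", "accumulation", "set", "object",
          "various", "size"]

# Controlled vocabulary: normalized lemma sequence -> term.
vocabulary = {
    ("belt", "asteroid"): "Asteroid belt",
    ("orbit",): "Orbit",
    ("mars",): "Mars",
    ("jupiter",): "Jupiter",
    ("place",): "Place",
    ("set",): "Set",
    ("object",): "Object",
}

terms = find_terms(lemmas, vocabulary)
print(terms)
```

Trying the longest window first is what makes the two-word entry for "belt"+"asteroid" win over any single-word entries that might start at the same position.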

2. Determine the meanings of the terms. To do this, each term must be mapped to exactly one of its possible concepts.

2.1. First, filter out terms whose informativeness is below a certain threshold. The threshold is chosen depending on the ontology. For this ontology we choose a threshold of 0.003. The informativeness of the terms is given in the table in Fig. 6. The filtering removes the terms "Set" and "Place" and their corresponding concepts from further processing. Note that this avoids an error in determining the meaning of the term "Set", since its common indefinite-quantity meaning "many" is not present in the ontology.

2.2. Next, determine the meanings of the remaining terms. This example shows how to apply the algorithm that computes the semantically most related sequence. For each possible sequence of concepts, its weight must be calculated. The weight of a sequence is the sum of the weights of the unique pairwise combinations of the concepts in it. The semantic closeness weights between the concepts of this example are shown in Fig. 5 in the table of semantic closeness values (the closeness between the concepts in the table and the concepts "Orbit (eye socket)" and "Object (thing)" is zero, so these two are omitted from the table to save space). In this example semantic closeness is computed using the Dice measure: the closeness between two concepts is twice the number of their common neighbors divided by the sum of their neighbor counts. For example, for the concepts "Asteroid belt" and "Orbit (trajectory)" the common neighbors are "Mars (planet)" and "Jupiter (planet)". "Asteroid belt" has 3 neighboring concepts (counting the link to itself) and "Orbit (trajectory)" has 8. The semantic closeness is therefore $\frac{2 \cdot 2}{3 + 8} = \frac{4}{11} \approx 0.3636$. Note that the table is symmetric about the diagonal and has ones on the diagonal, since a concept is similar to itself with weight 1; it is therefore enough to fill in the upper part. In addition, closeness weights between concepts corresponding to the same term are not needed, so they need not be computed either. Consider the sequence "Asteroid belt", "Orbit (trajectory)", "Mars (planet)", "Jupiter (planet)", "Object (space)". The weight of this sequence is

weight₁ = 0.3636 + 0.4 + 0.3636 + 0 + 0.5333 + 0.635 + 0 + 0.4 + 0.2222 + 0.2 = 3.1177.

For all other sequences the weight is lower. For example, for the sequence "Asteroid belt", "Orbit (trajectory)", "Mars (god)", "Jupiter (god)", "Object (space)"

weight₂ = 0.3636 + 0.2222 + 0.2222 + 0 + 0.1429 + 0.1429 + 0 + 0.5 + 0 = 1.5938 < weight₁.

In this way the meanings of the terms are determined.
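The Dice computation and the comparison of the two sequence weights can be reproduced with a short Python sketch. Several things here are assumptions rather than facts from the document: the helper names `dice` and `sequence_weight` are hypothetical, the five unnamed neighbors of "Orbit (trajectory)" are placeholders, and the assignment of each closeness value to a particular concept pair is inferred from the order of the summands in the two sums above (Fig. 5 remains the authoritative table). Pairs missing from the table are treated as having zero closeness.

```python
from itertools import combinations


def dice(neighbors_a, neighbors_b):
    """Dice measure: twice the number of shared neighbors over the total neighbor count."""
    total = len(neighbors_a) + len(neighbors_b)
    return 2 * len(neighbors_a & neighbors_b) / total if total else 0.0


def sequence_weight(concepts, closeness):
    """Sum of closeness over all unique unordered pairs of concepts in the sequence."""
    return sum(closeness.get(frozenset(pair), 0.0) for pair in combinations(concepts, 2))


BELT, ORBIT, OBJ = "Asteroid belt", "Orbit (trajectory)", "Object (space)"
MARS_P, JUP_P = "Mars (planet)", "Jupiter (planet)"
MARS_G, JUP_G = "Mars (god)", "Jupiter (god)"

# Neighbor sets include the concept itself, as in the text. "x1".."x5" are
# placeholders for the five Orbit neighbors that the example does not name.
belt_neighbors = {BELT, MARS_P, JUP_P}
orbit_neighbors = {ORBIT, MARS_P, JUP_P, "x1", "x2", "x3", "x4", "x5"}
belt_orbit = dice(belt_neighbors, orbit_neighbors)  # 2*2 / (3+8) = 4/11

# Values are taken from the two sums in the text; the pair each value belongs
# to is ASSUMED from the order of the summands, not read from Fig. 5.
pairs = {
    (BELT, ORBIT): 0.3636, (BELT, MARS_P): 0.4, (BELT, JUP_P): 0.3636,
    (ORBIT, MARS_P): 0.5333, (ORBIT, JUP_P): 0.635, (MARS_P, JUP_P): 0.4,
    (MARS_P, OBJ): 0.2222, (JUP_P, OBJ): 0.2,
    (BELT, MARS_G): 0.2222, (BELT, JUP_G): 0.2222,
    (ORBIT, MARS_G): 0.1429, (ORBIT, JUP_G): 0.1429, (MARS_G, JUP_G): 0.5,
}
closeness = {frozenset(pair): value for pair, value in pairs.items()}

planets = [BELT, ORBIT, MARS_P, JUP_P, OBJ]
gods = [BELT, ORBIT, MARS_G, JUP_G, OBJ]
weights = {tuple(seq): sequence_weight(seq, closeness) for seq in (planets, gods)}
best = max(weights, key=weights.get)

print(round(belt_orbit, 4))
for seq, weight in weights.items():
    print(round(weight, 4), seq)
```

Only the two sequences spelled out in the text are compared here. A full run would enumerate one concept per term over all candidate lists (for instance with `itertools.product`) and keep the sequence with the greatest weight.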

3. In the third step, build a complete weighted graph whose nodes are the meanings found in the previous step and whose edge weights equal the semantic closeness between the nodes. Note that the semantic closeness between the nodes was already computed in the previous step. The semantic graph for this example is shown in Fig. 4.

4. Compute the keywords for the document. To do this, determine the list of candidates using a clustering algorithm. We will not go into the details of how clustering algorithms work, but note that all concepts except "Object (space)" have non-zero semantic closeness to one another. The clustering algorithm will find one cluster containing all these concepts, while whether "Object (space)" belongs to this cluster depends on the algorithm. In this example, we assume that "Object (space)" was assigned to a different cluster. Select the one cluster with the greatest weight. The weight of a cluster is calculated as the sum of the informativeness of the terms whose meanings belong to it. It is easy to see that this is the cluster containing the concepts "Asteroid belt", "Orbit (trajectory)", "Mars (planet)", "Jupiter (planet)". The list of candidates therefore contains only these meanings.

5. Rank the concepts by significance. The weight is defined as the product of the following three quantities:

- the average number of words in all textual representations of the concept that occurred as a term of the document,

- the frequency of occurrence of the concept in the document, and

- the maximum informativeness over all textual representations of the concept that occurred as a term in the document.

The informativeness of the textual representations is given in the table in Fig. 6.

The weights of the concepts, in descending order, are therefore:

weight("Asteroid belt") = 2 * 1 * 0.3663 = 0.7326

weight("Jupiter (planet)") = 1 * 1 * 0.2589 = 0.2589

weight("Mars (planet)") = 1 * 1 * 0.0586 = 0.0586

weight("Orbit (trajectory)") = 1 * 1 * 0.0488 = 0.0488

6. The first n concepts are taken as key, where n is set heuristically depending on the task. In this example we choose n=3. The key concepts are then "Asteroid belt", "Jupiter (planet)", and "Mars (planet)".
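The ranking in steps 5 and 6 can be checked with a small Python sketch. The factor values are the ones given in the example; the names `concept_weight` and `key_concepts` and the tuple layout are hypothetical choices made for this illustration.

```python
def concept_weight(avg_words, frequency, max_informativeness):
    """Weight of a concept as the product of the three quantities from step 5."""
    return avg_words * frequency * max_informativeness


def key_concepts(candidates, n):
    """Return the n candidate concepts with the greatest weight, heaviest first."""
    ranked = sorted(candidates,
                    key=lambda concept: concept_weight(*candidates[concept]),
                    reverse=True)
    return ranked[:n]


# concept -> (average words per representation, frequency in the document,
#             maximum informativeness), as used in the example above.
candidates = {
    "Orbit (trajectory)": (1, 1, 0.0488),
    "Mars (planet)": (1, 1, 0.0586),
    "Asteroid belt": (2, 1, 0.3663),
    "Jupiter (planet)": (1, 1, 0.2589),
}

print(key_concepts(candidates, 3))
```

The choice of n stays a tuning parameter of the task; the sketch only makes the ordering explicit.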

The keywords in the source text are then easy to determine. All textual representations of the key concepts are considered keywords, that is, "asteroid belt", "Mars", and "Jupiter".

Thus, on the basis of the proposed approach it is easy to build applications that assist in reading documents, by highlighting key terms or by enriching the source text with hyperlinks from keywords to descriptions of the concepts.

The invention is also clearly applicable to building semantic information retrieval systems. If the user is allowed to tell the system which meaning of a term is being sought, only those documents whose semantic model contains that meaning need to be processed. To improve recall, one can also search for documents whose semantic model contains concepts close in meaning.

In recommendation systems, the invention is applied by recommending documents whose semantic models are closest to the model of the current document. The closeness between models can be computed either by classical means (normalized set intersection) or through the generalization of semantic closeness to sets of concepts introduced in this invention.
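As an illustration of the classical option, the sketch below scores documents by the Jaccard coefficient of their semantic models, which is one common way of normalizing a set intersection; the document does not say which normalization is meant, so this choice is an assumption. The function names, the `top_k` parameter, and the small library of document models are hypothetical.

```python
def jaccard(model_a, model_b):
    """Normalized intersection of two concept sets: |A & B| / |A | B|."""
    union = model_a | model_b
    return len(model_a & model_b) / len(union) if union else 0.0


def recommend(current, library, top_k=2):
    """Return the names of the top_k documents whose models are closest to `current`."""
    scored = [(jaccard(current, model), name) for name, model in library.items()]
    # Highest score first; ties are broken alphabetically by document name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for score, name in scored[:top_k] if score > 0]


# Semantic models are represented as sets of key concepts (toy data).
current = {"Asteroid belt", "Mars (planet)", "Jupiter (planet)"}
library = {
    "doc_planets": {"Mars (planet)", "Jupiter (planet)", "Orbit (trajectory)"},
    "doc_myth": {"Mars (god)", "Jupiter (god)"},
    "doc_belt": {"Asteroid belt"},
}

print(recommend(current, library))
```

Because the models hold disambiguated concepts rather than words, the mythology document scores zero even though it shares the surface words "Mars" and "Jupiter" with the current document.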

In addition, a similar technique can be used to recommend collections of documents. In this case, a collection of documents is treated as a generalized document containing the semantic models of the documents in it. Consider this application for weblogs. Suppose a user is interested in the posts of one weblog. To recommend weblogs that write about semantically close things, the posts of the weblogs must be compared and the closest ones found. With the proposed method, a weblog can be regarded as a generalized document whose semantic model consists of the models of all its posts. Applying the document recommendation method then yields recommendations for weblogs.

It is also possible to generate short descriptions of documents. These can consist of key terms, or of sentences or paragraphs containing key terms, or more sophisticated techniques from automatic annotation and summarization can be used, with knowledge of key terms and their meanings serving as features.

Claims

1. A method of constructing a semantic model of a document, in which an ontology is extracted from information sources, the information sources being electronic resources that contain descriptions of individual real-world objects, both resources linked by hypertext links and resources whose descriptions contain no hypertext links; each ontology concept is assigned an identifier by which it can be uniquely determined; if hypertext links exist between the descriptions of concepts, they are converted into links between concepts; if there is no hypertext link structure, hypertext links are added by analyzing the descriptions and determining the meanings of terms using ontologies extracted from hypertext encyclopedias, and are then converted into links between concepts; a unique identifier of the resource containing the original description of the concept is stored; at least one textual representation is determined for each concept; the frequency of joint use of each textual representation and the concept and the informativeness of each textual representation are calculated, the natural language to which the textual representation belongs is determined, and the information obtained is stored; the text of the document to be analyzed is obtained; terms of the text and their possible meanings are found by matching parts of the text against the textual representations of concepts in the controlled vocabulary; for each term, an algorithm for resolving the lexical ambiguity of terms is used to select, from its possible meanings, the one that is taken as the meaning of the term; and the concepts corresponding to the meanings of the terms are then ranked by importance to the text, the most important concepts being taken as the semantic model of the document.

2. The method according to claim 1, characterized in that the algorithm used for resolving the lexical ambiguity of terms is an algorithm that selects the most frequently used meaning, for which the frequency of joint use of the term being processed and each of the concepts associated with it is determined, after which the concept with the highest frequency of joint use of the term and the concept is selected as the meaning of the term.

3. The method according to claim 1, characterized in that the algorithm selected for resolving the lexical ambiguity of terms is an algorithm that computes the semantically most related sequence of meanings, in which the possible sequences of concept meanings are considered for a given sequence of terms, the weight of each possible sequence of concepts is calculated as the sum of the weights of the unique pairwise combinations of the concepts in the sequence, and the concepts belonging to the sequence with the highest weight are taken as the meanings of the terms.

4. The method according to claim 1, characterized in that as an algorithm for resolving the lexical ambiguity of terms, an algorithm based on machine learning with a teacher is selected, according to which a feature vector is calculated for each term, based on which the most suitable value is selected.

5. The method according to claim 4, characterized in that the informativeness of the term is selected as a feature.

6. The method according to claim 4, characterized in that the probability of using term t in a given meaning m_i is selected as a feature, calculated as

$P_{t}(m_{i}) = \frac{c(t, m_{i})}{\sum_{i} c(t, m_{i})}$,

where c(t, m_i) is the frequency of joint use of term t in meaning m_i.

7. The method according to claim 4, characterized in that the semantic closeness between the concept and the context of the document is selected as a feature.

8. The method according to claim 7, characterized in that the meanings of unambiguous terms are selected as the context of the document.

9. The method according to claim 4, characterized in that the sum of the informativeness of each unique term and the semantic closeness of its meaning to all other concepts from the context of the document is selected as a feature.

10. The method according to claim 1, characterized in that, to determine the link structure of an information source that contains no hypertext links, an ontology is extracted from a hypertext encyclopedia; the descriptions of the concepts of the information source that contains no hypertext links are enriched with links to the existing ontology extracted from the hypertext encyclopedia; the controlled vocabulary of the existing ontology is extended with the textual representations of all concepts of the information source being processed; the frequency of joint use of these concepts and their textual representations is taken to be 1 for each unique representation-concept pair; the enrichment of the concepts of the information source being processed is repeated using informativeness computed via the inverse document frequency, thereby obtaining additional links between the concepts extracted from the information source that contains no hypertext links; and the frequency of joint use of textual representations and concepts is updated on the basis of the links obtained.

11. The method according to claim 1, characterized in that, to rank the concepts by importance to the document, a semantic graph of the document is constructed, consisting of the meanings of all terms of the document and all possible weighted links between them, where the weight of a link equals the semantic closeness between the concepts it connects; a clustering algorithm is applied to the semantic graph to group semantically close concepts; the concepts from the clusters with the greatest weight are then ranked by importance to the document, and the most important concepts are taken as the semantic model of the document.

12. The method according to claim 1, characterized in that, when the ontology is extracted, the semantic closeness between concepts is calculated, whereby for each concept K a list C is compiled of the concepts c_i to which concept K has a link or from which there is a link to concept K; the semantic closeness from the current concept K to each concept c_i ∈ C is calculated; the calculated semantic closeness between each pair of concepts K and c_i is stored together with the corresponding concepts K and c_i; and for concepts not included in list C the semantic closeness to concept K is taken to be zero.

13. The method according to claim 12, characterized in that the links between concepts are assigned weights, a threshold value for the weights is selected, and list C is compiled of the concepts to which concept K has a link with a weight greater than the selected threshold or from which there is a link to concept K with a weight greater than the selected threshold.

14. The method according to claim 1, characterized in that the ontology is extracted from several sources.

15. The method according to claim 1, characterized in that the document metadata is used as the text of the document.