CN111949800A

CN111949800A - A method and system for establishing a knowledge graph of an open source project

Info

Publication number: CN111949800A
Application number: CN202010643011.7A
Authority: CN
Inventors: 孙艳春; 黄罡; 孙志玉
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2020-07-06
Filing date: 2020-07-06
Publication date: 2020-11-17

Abstract

Embodiments of the invention provide a method and a system for establishing a knowledge graph of an open source project. A data schema for the knowledge graph is defined in advance. Knowledge information about the program code is obtained from the open source project code, and knowledge information related to the project is obtained from the open source community where the project is located and from the project's remote repository. The knowledge information from all of these sources is analyzed and a plurality of triples is extracted. All triples are unified and disambiguated, a knowledge graph of the open source project is constructed based on the data schema, and the knowledge graph is finally analyzed and displayed visually. The resulting knowledge graph lets a developer quickly and accurately find the project code to be learned and understand it through the related code knowledge, which meets the need of newly joined developers to learn an open source project quickly.

Description

A method and system for establishing a knowledge graph of an open source project

Technical Field

The invention relates to the technical field of open source projects, and in particular to a method for establishing a knowledge graph of an open source project and a system for establishing a knowledge graph of an open source project.

Background

An open source project is a software project whose source code is open. Developers can modify the project's source code through the open source community and build their own customized products.

Large open source projects are usually developed jointly by multiple developers and attract many more who study the project's source code. After sustained learning and technical practice, these developers may join the project's main branch and contribute to the project themselves.

Most open source projects lack architecture documentation, as well as management and retrieval functions for knowledge about the project code. The main functions of current open source communities center on version management, and they serve only the project's existing developers and users who do not contribute to development. As a result, a developer who first encounters an open source project can usually understand the code only by reading the source step by step, finds it hard to locate the code relevant to a given requirement directly, and learns very inefficiently.

Most current research on code analysis for open source communities focuses on the code itself, with analysis methods that mainly use information such as syntax trees and static analysis results. Research on intelligent learning for developers mostly focuses on helping developers program intelligently, for example by recommending code snippets, inferring developer intent, or providing intelligent programming tools. In general, none of this work starts from the developer's learning needs to provide targeted code-related knowledge that helps developers quickly understand and join an open source project.

Therefore, a newly joined developer can neither quickly find the needed project code nor quickly obtain the related knowledge required to understand it, which ultimately leads to low learning efficiency.

Summary of the Invention

In view of the above problems, embodiments of the present invention are proposed to provide a method for establishing a knowledge graph of an open source project, and a corresponding system, that overcome or at least partially solve the above problems.

To solve the above problems, an embodiment of the present invention provides a method for establishing a knowledge graph of an open source project, the method comprising: predefining a data schema of the open source project knowledge graph; obtaining knowledge information of the program code itself from the open source project code by a static code analysis method, the knowledge information of the program code itself including functions and files; obtaining knowledge information related to the open source project from the open source community where the project is located and from the project's remote repository, the knowledge information related to the open source project including project commit records, code merge requests, and issue sets; analyzing the knowledge information of the program code itself and the knowledge information related to the open source project, extracting a plurality of triples, unifying the data format of each kind of knowledge entity across all triples according to the different structural features of each data source, and disambiguating each triple to ensure that, in the triple set, each valid knowledge entity corresponds to one and only one entity name; constructing the knowledge graph of the open source project from the triple set based on the data schema; and visually analyzing and displaying the open source project knowledge graph using the visualization tool Gephi.

Optionally, the method for predefining the data schema of the open source project knowledge graph includes: extracting, from multiple perspectives of the open source project, the knowledge information that makes up the knowledge graph, including basic element relationships and entities.

Optionally, the entities include functions, files, project commit records, issue sets, and code merge requests, and the basic element relationships include calling relationships, inclusion relationships, modification relationships, and involvement relationships.

Optionally, the method for obtaining the knowledge information of the program code itself from the open source project code includes: obtaining the knowledge information of the program code itself from the open source project code by a static code analysis method, which includes using a static analysis tool to analyze the open source project code for each file and each project module and to generate a local relationship subgraph for each, the relationship subgraphs being described and output in the Dot language.

Optionally, the static analysis tools include Doxygen, CppCheck, and FindBugs.

Optionally, the plurality of triples includes: a calling-relationship triple from a sub function to an obj function; an inclusion-relationship triple from a sub file to an obj function; a modification-relationship triple from a sub commit record to an obj file; an involvement-relationship triple from a sub issue set to an obj project commit record; an involvement-relationship triple from a sub issue set to an obj issue set; an involvement-relationship triple from a sub issue set to an obj merge request; an involvement-relationship triple from a sub merge request to an obj project commit record; and an involvement-relationship triple from a sub code merge request to an obj file.

Optionally, the method for constructing the knowledge graph of the open source project from the plurality of triples based on the data schema includes: based on the data schema, performing cleaning, triple extraction, and disambiguation for each data source, so as to extract separately a triple set that represents one kind of relationship and forms a relationship triple-set subgraph; running the construction of these triple-set subgraphs concurrently; and finally aggregating all triple-set subgraphs to construct the knowledge graph of the open source project.
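The per-source pipeline described above (clean, extract triples, disambiguate, then aggregate the subgraphs) might be sketched as follows. This is only a minimal illustration under stated assumptions: the function names, the use of a thread pool, and the alias table used for disambiguation are all hypothetical and are not specified in the source.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(records):
    """Drop (sub, obj) pairs with an empty field and strip surrounding whitespace."""
    return [
        tuple(field.strip() for field in record)
        for record in records
        if all(field and field.strip() for field in record)
    ]


def disambiguate(triples, aliases):
    """Map every entity name to one canonical name (hypothetical alias table)."""
    def canon(name):
        return aliases.get(name, name)
    return {(canon(s), p, canon(o)) for s, p, o in triples}


def build_subgraph(source):
    """Build the triple-set subgraph for one relationship from one data source."""
    predicate, records, aliases = source
    triples = [(s, predicate, o) for s, o in clean(records)]
    return disambiguate(triples, aliases)


def build_graph(sources):
    """Build all subgraphs concurrently, then aggregate them into one triple set."""
    with ThreadPoolExecutor() as pool:
        subgraphs = list(pool.map(build_subgraph, sources))
    graph = set()
    for subgraph in subgraphs:
        graph |= subgraph
    return graph


# Hypothetical input: one entry per relationship / data source.
sources = [
    ("func_call", [("main", "init"), ("main ", "init"), ("main", "")], {}),
    ("file_contain_func", [("./src/main.c", "main")],
     {"./src/main.c": "src/main.c"}),
]
graph = build_graph(sources)
```

Representing the graph as a set of triples makes aggregation a plain set union, so duplicate triples produced by different sources collapse automatically.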

An embodiment of the present invention also provides a system for establishing a knowledge graph of an open source project, the system specifically including:

a definition module, configured to predefine the data schema of the open source project knowledge graph;

a knowledge acquisition module, configured to obtain knowledge information of the program code itself from the open source project code by a static code analysis method, the knowledge information of the program code itself including functions, files, and the calling and inclusion relationships between them; and to obtain knowledge information related to the open source project from the open source community where the project is located and from the project's remote repository, the knowledge information related to the open source project including project commit records, code merge requests, issue sets, and the involvement relationships between them;

a knowledge analysis module, configured to analyze the knowledge information of the program code itself and the knowledge information related to the open source project and extract a plurality of triples; to unify the data format of each kind of knowledge entity in the triples according to the different structural features of each data source; and to disambiguate each triple to ensure that, in the triple set, each valid knowledge entity corresponds to one and only one entity name;

a construction module, configured to construct the knowledge graph of the open source project from the plurality of triples based on the data schema;

a display module, configured to visually analyze and display the open source project knowledge graph using the visualization tool Gephi.

Optionally, the definition module extracts, from multiple perspectives of the open source project, the knowledge information that makes up the knowledge graph, including basic element relationships and entities, wherein the entities include functions, files, project commit records, issue sets, and code merge requests, and the basic element relationships include calling relationships, inclusion relationships, modification relationships, and involvement relationships.
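As an illustration of how the display module could hand the triple set to Gephi, the sketch below writes the triples as an edge-list CSV. It assumes Gephi's spreadsheet import with `Source`, `Target`, `Type`, and `Label` columns; the function name is hypothetical, and the source does not state which import format is actually used.

```python
import csv
import io


def to_gephi_edges_csv(triples):
    """Render (sub, predicate, obj) triples as an edge-list CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Label"])
    # Sort for a stable, reproducible file.
    for sub, predicate, obj in sorted(triples):
        writer.writerow([sub, obj, "Directed", predicate])
    return buffer.getvalue()


# Hypothetical triples for illustration.
edges_csv = to_gephi_edges_csv({("main", "func_call", "init")})
```

Every relationship in the schema is directional (for example, caller to callee), so each edge is marked as directed and the predicate is kept as the edge label.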

As can be seen from the above technical solutions, embodiments of the present invention provide a method and a system for establishing a knowledge graph of an open source project. Oriented toward developers' need to learn an open source project, the method and system draw on multiple data sources, namely the open source project code itself, the open source community where the project is located, and the project's remote repository, to extract the knowledge information of the project code itself and the knowledge information related to the project code that developers need in order to learn the project and take part in its development. From this information a knowledge graph of the open source project is constructed, which presents the project's code knowledge comprehensively and effectively. This helps improve developers' learning efficiency, encourages them to participate more fully in the development of the open source project, and contributes to the project's growth.

Description of the Drawings

FIG. 1 is a flow chart of the steps of an embodiment of a method for establishing a knowledge graph of an open source project provided by the present invention;

FIG. 2 is a structure diagram of an SPO triple provided by an embodiment of the present invention;

FIG. 3 is a structure diagram of a data schema designed for an open source project knowledge graph provided by an embodiment of the present invention;

FIG. 4 is a partial function call relationship subgraph from a static code analysis result provided by an embodiment of the present invention;

FIG. 5 is a structural block diagram of an embodiment of a system for establishing a knowledge graph of an open source project provided by the present invention;

FIG. 6 is a visualization of an open source project knowledge graph constructed by an embodiment of the present invention;

FIG. 7 is a framework structure diagram of an embodiment of a method for establishing a knowledge graph of an open source project provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment 1

FIG. 1 is a flow chart of the steps of an embodiment of a method for establishing a knowledge graph of an open source project provided by the present invention.

Referring to FIG. 1, the method provided by this embodiment is applied to open source projects in an open source community. Starting from the developer's need to learn code, this embodiment obtains the code knowledge of the open source project and the related knowledge in the open source community in order to establish a knowledge graph of the project, thereby meeting developers' need to learn unfamiliar code. The method includes the following steps:

Step S101: predefine the data schema of the open source project knowledge graph.

The code of an open source project is written in a particular programming language, and a programming language is a structured language. When a developer wants to understand a specific function in an unfamiliar system, locating and reading that function alone is often not enough. The developer also needs to start from the outermost layer of the program's call relationships and follow the function calls step by step down to the target function, learning the whole feature and its place in the system through the call path. For example, the developer can start from the relevant feature in a unit test and descend level by level to the specific function under test, to understand how that function is called and what it computes under the test case.
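The idea of following calls from an outer entry point down to a target function can be illustrated with a breadth-first search over caller/callee pairs. This is a sketch under assumptions: the function names and the edge list are invented for illustration, and the source does not prescribe any particular search algorithm.

```python
from collections import deque


def find_call_path(call_edges, start, target):
    """Return the shortest call chain from start to target, or None if unreachable."""
    # Build an adjacency list: caller -> list of callees.
    callees = {}
    for caller, callee in call_edges:
        callees.setdefault(caller, []).append(callee)

    # Breadth-first search, remembering how each function was first reached.
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in callees.get(current, []):
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical call relationships, starting from a unit test.
call_edges = [
    ("test_parse", "parse_config"),
    ("parse_config", "read_file"),
    ("parse_config", "tokenize"),
    ("tokenize", "next_char"),
]
path = find_call_path(call_edges, "test_parse", "next_char")
```

Here `path` lists the functions a developer would read in order, from the test case down to the function of interest.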

Most current analysis and research on open source projects targets code search or feature location. The main goal of this research is to help users make better use of an open source project or to analyze the structure of its code. For a real open source community and its contributors and learners, however, being able to contribute to an open source project with a low barrier to entry is very important. Most open source projects in today's communities do not maintain developer-oriented system design documents, which means that a newly joined developer who wants to implement a feature must spend a great deal of time reading and learning the project code. If the project's organization or code comments are incomplete, taking part in actual development becomes very difficult. The present invention therefore addresses developers' learning needs directly by building a corresponding knowledge graph of the open source project to help developers learn unfamiliar code.

A knowledge graph is in essence a knowledge base organized as a semantic network. It aims to describe the entities found in various kinds of real-world knowledge and the relationships between them. An entity can be a concrete object or an abstract concept, while a relationship is a connection between entities together with its semantic description. A knowledge graph can usually be viewed as a graph structure in which the entities are the nodes and the relationships are the edges.

Knowledge graphs were first applied in search engines. When a user searches for a piece of knowledge, the search engine can use a knowledge graph to identify the specific object the user is referring to. For example, when a user searches for the release date of a play, results for the novel of the same name are not mixed in.

A knowledge graph is typically presented on a web page as a knowledge panel, which not only shows links to websites matching the user's search but also aggregates and displays information about the search topic in structured form.

A knowledge graph is thus made up of interconnected relationships and their attributes. These relationships are usually represented as SPO (Subject-Predicate-Object) triples. As shown in FIG. 2, in a triple the Subject is the entity the relationship starts from, the Predicate is the relationship itself, and the Object is the entity the relationship points to. Both the subject and the object are entities of the knowledge graph.
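An SPO triple can be modeled very simply. The sketch below is one possible representation, assuming a named tuple with fields `sub`, `predicate`, and `obj`; the field names and example values are illustrative and do not come from the source.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One Subject-Predicate-Object statement in the knowledge graph."""
    sub: str        # subject entity, e.g. the calling function
    predicate: str  # relationship name
    obj: str        # object entity, e.g. the called function


# Hypothetical triples: the same statement extracted twice, plus one other.
triples = {
    Triple("parse_config", "func_call", "tokenize"),
    Triple("parse_config", "func_call", "tokenize"),
    Triple("src/config.c", "file_contain_func", "parse_config"),
}

# Entities are the graph's nodes; each triple is a directed, labeled edge.
nodes = {t.sub for t in triples} | {t.obj for t in triples}
```

Because named tuples are hashable and compare by value, storing them in a set removes exact duplicates without extra code.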

The data schema of a knowledge graph refines and standardizes the knowledge it holds. Designing a schema in advance and adhering to it aids standardization and simplifies later processing and querying of the knowledge triples. To construct the knowledge graph of an open source project, the data schema of the knowledge graph is therefore defined first. Building a schema for a knowledge graph is equivalent to building an ontology for it. The ontology includes concepts, concept hierarchies, attributes, attribute value types, relationships, the set of concepts forming each relationship's domain, and the set of concepts forming each relationship's range. On this basis, rules or axioms can additionally be added to express more complex constraints at the schema layer.

FIG. 3 shows a data schema designed for an open source project knowledge graph provided by an embodiment of the present invention.

The data schema comprises the basic element relationships and entities that make up the knowledge graph, extracted from multiple perspectives of the open source project. The entities include functions, files, project commit records, issue sets, and code merge requests. The basic element relationships include calling relationships, inclusion relationships, modification relationships, and involvement relationships.

Notably, embodiments of the present invention extract knowledge information from multiple perspectives of the open source project based on the developer's need to learn the project and become able to participate in its development. This is neither the knowledge a user needs in order to use the project, nor the code snippet recommendation knowledge that experienced developers need for intelligent programming. For example, the present application extracts mostly information internal to the open source project, so that developers who understand it can then contribute code to and develop the project, rather than extracting API information for developers to call. Embodiments of the present invention extract the issue sets of the target open source project in the open source community, that is, the knowledge entities in developers' discussion text about the target project and the basic element relationships between those entities. This connects functions, files, project commit records, issue sets, code merge requests, and similar information with one another and helps developers quickly understand and learn the project code and the related code knowledge, rather than extracting only the descriptive text of a discussion and its question-and-answer relationships. Those skilled in the art will also understand that the knowledge graph provided by embodiments of the present invention serves the needs of newly joined developers to learn an open source project quickly and take part in its development.

Table 1 below lists, for the knowledge graph data schema designed in this embodiment, the entities that must be extracted to construct the knowledge graph, together with a description of each.

Table 1. Entities included in the knowledge graph and their descriptions

Specifically, these are: the entity "Func" (function), representing a function in the open source project code; the entity "File" (file), representing a file in the open source project; the entity "Commit" (project commit record), representing one commit in the project's commit history; the entity "Issue" (issue set), representing an issue of the project in the open source community together with its comments; and the entity "Pull Request" (code merge request), representing a code merge request of the project in the open source community.

Table 2 below lists, for the knowledge graph data schema designed in this embodiment, the basic element relationships that must be extracted to construct the knowledge graph, together with a description of each.

Table 2. Basic element relationships included in the knowledge graph

Specifically, these are: the relationship "(sub,func_call,obj)", a calling relationship from the sub function to the obj function; the relationship "(sub,file_contain_func,obj)", an inclusion relationship from the sub file to the obj function; the relationship "(sub,commit_change_file,obj)", a modification relationship from the sub project commit record to the obj file; the relationship "(sub,issue_relate_commit,obj)", an involvement relationship from the sub issue set to the obj project commit record; the relationship "(sub,issue_relate_issue,obj)", an involvement relationship from the sub issue set to the obj issue set; the relationship "(sub,issue_relate_pr,obj)", an inclusion relationship from the sub issue set to the obj code merge request; the relationship "(sub,pr_relate_commit,obj)", an involvement relationship from the sub code merge request to the obj project commit record; and the relationship "(sub,pr_relate_file,obj)", an involvement relationship from the sub code merge request to the obj file.
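The entities and relationships listed above can be written down as a small schema that records, for each relationship, the entity type expected for its subject (domain) and its object (range). The dictionary layout and the `conforms` helper below are assumptions made for illustration; only the entity and relationship names are taken from the text.

```python
# Entity types named in the schema.
ENTITY_TYPES = {"Func", "File", "Commit", "Issue", "Pull Request"}

# Relationship name -> (subject entity type, object entity type).
SCHEMA = {
    "func_call": ("Func", "Func"),
    "file_contain_func": ("File", "Func"),
    "commit_change_file": ("Commit", "File"),
    "issue_relate_commit": ("Issue", "Commit"),
    "issue_relate_issue": ("Issue", "Issue"),
    "issue_relate_pr": ("Issue", "Pull Request"),
    "pr_relate_commit": ("Pull Request", "Commit"),
    "pr_relate_file": ("Pull Request", "File"),
}


def conforms(predicate, sub_type, obj_type):
    """Check that a triple's entity types match the schema for its predicate."""
    # Unknown predicates yield None here and therefore never conform.
    return SCHEMA.get(predicate) == (sub_type, obj_type)


ok = conforms("commit_change_file", "Commit", "File")
bad = conforms("commit_change_file", "File", "Commit")
```

A check of this kind could be applied to each extracted triple before it is added to the graph, so that triples with swapped or mistyped entities are rejected early.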

Take the inclusion relationship from a sub issue set to an obj code merge request as an example. In the discussions held in an open source community, a developer may reference one of a project's code merge requests in a discussion about that project. In that case, the issue set under discussion is regarded as having an inclusion relationship with that code merge request.

Take the involvement relationship from a sub issue set to an obj issue set as another example. In a discussion about an open source project, a developer may reference another discussion thread. In that case, the issue set under discussion is regarded as having an involvement relationship with the issue set discussed in the other thread.
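References of this kind can often be detected directly in the discussion text. The sketch below assumes GitHub-style conventions, where `#123` refers to an issue or pull request and a hexadecimal string of at least seven characters may abbreviate a commit hash. The function name, the entity naming scheme (`issue#12`, `pr#34`), and the sample data are hypothetical.

```python
import re

# "#123" not preceded by a word character or a slash (avoids URLs and anchors).
ISSUE_REF_RE = re.compile(r"(?<![\w/])#(\d+)\b")
# Candidate abbreviated or full commit hashes.
SHA_RE = re.compile(r"\b[0-9a-f]{7,40}\b")


def relate_issue(issue_number, text, pr_numbers, known_commits):
    """Extract issue_relate_* triples from the text of one issue thread."""
    sub = f"issue#{issue_number}"
    triples = set()
    for match in ISSUE_REF_RE.finditer(text):
        number = int(match.group(1))
        if number == issue_number:
            continue  # ignore self-references
        if number in pr_numbers:
            triples.add((sub, "issue_relate_pr", f"pr#{number}"))
        else:
            triples.add((sub, "issue_relate_issue", f"issue#{number}"))
    for match in SHA_RE.finditer(text):
        # Keep only strings that really are a prefix of a known commit hash.
        for full_hash in known_commits:
            if full_hash.startswith(match.group(0)):
                triples.add((sub, "issue_relate_commit", full_hash))
    return triples


# Hypothetical thread text and repository facts.
commit = "1a2b3c4d5e6f7a8b9c0d1a2b3c4d5e6f7a8b9c0d"
found = relate_issue(
    7,
    "Duplicate of #12, fixed by #34 in commit 1a2b3c4.",
    pr_numbers={34},
    known_commits={commit},
)
```

Checking each candidate hash against the set of known commits keeps ordinary words that happen to be valid hexadecimal from producing false triples.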

Step S102: obtain the knowledge information of the program code itself from the open source project code.

This embodiment uses a static code analysis method to obtain the knowledge information of the program code itself from the open source project code. This knowledge information includes the functions and files in the project code and the calling and inclusion relationships between them. The calling and inclusion relationships may specifically be the calling relationship from a sub function to an obj function, the inclusion relationship from a sub file to an obj function, and the calling relationship from a sub file to an obj file.

Compared with dynamic code analysis, static code analysis is fast, general, convenient, and has few dependencies. For a developer who wants to read and learn an open source project, running an unfamiliar large project requires setting up a runtime environment, consulting documentation, and choosing parameters, all of which cost time and effort. Analyzing the code itself with a static analysis method suits this reading and learning scenario well and can solve, or partially solve, these problems.

Optionally, the static analysis tools include Doxygen, CppCheck, and FindBugs.

Further, this embodiment selects Doxygen as the static analysis tool for the open source project code. Doxygen is a multi-language, cross-platform static code analysis tool with broad language support, good generality, and a diverse range of analysis output. It can analyze the input project code and extract code information such as function call relationships, file structure, and function attributes. It supports many common languages, including C++, C, Java, Objective-C, and Python, and can be used on a variety of operating systems.

Preferably, the static analysis tool Doxygen is used to analyze the open source project code for each file and each project module of the project, generating a local relationship subgraph for each.
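To make Doxygen emit call graphs, its configuration file needs a handful of options enabled. The sketch below renders such a configuration as text. It uses standard Doxygen option names (`HAVE_DOT`, `CALL_GRAPH`, `CALLER_GRAPH`, `DOT_CLEANUP`, and so on), but the particular selection and values are an assumption; the source does not give the configuration that was actually used, and the function name and paths are hypothetical.

```python
def render_doxyfile(source_dir, output_dir):
    """Return Doxyfile text that enables call-graph generation."""
    options = {
        "INPUT": source_dir,             # root of the project sources
        "OUTPUT_DIRECTORY": output_dir,  # where Doxygen writes its results
        "RECURSIVE": "YES",              # descend into subdirectories
        "EXTRACT_ALL": "YES",            # include undocumented functions
        "HAVE_DOT": "YES",               # use Graphviz dot for graphs
        "CALL_GRAPH": "YES",             # callee graph per function
        "CALLER_GRAPH": "YES",           # caller graph per function
        "DOT_CLEANUP": "NO",             # keep the intermediate .dot files
        "GENERATE_LATEX": "NO",          # LaTeX output is not needed here
    }
    return "\n".join(f"{key} = {value}" for key, value in options.items()) + "\n"


# Hypothetical paths for illustration.
doxyfile_text = render_doxyfile("src", "doxygen_out")
```

The text can be saved to a file and passed to the `doxygen` command. Paths that contain spaces would need to be quoted in the generated file, which this sketch does not handle.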

Preferably, the relationship subgraphs are described in the Dot language, and the knowledge information in each Dot-described subgraph is mapped to the actual function names for output. Dot is a plain-text graph description language that provides a simple way to describe graphs readable by both humans and computer programs.

FIG. 4 shows part of the function call relationship subgraph of one module in the static analysis result, produced by Doxygen and described and output in the Dot language. It is used as an example to illustrate the structure and expression of the relationship subgraph.
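
The mapping from a Dot-described subgraph to function names can be sketched as follows. This is only an illustrative sketch: the sample Dot text, the node identifiers, and the helper name `dot_to_call_triples` are assumptions, and the regular expressions handle only simple one-statement-per-line Dot files rather than the full Dot grammar.

```python
import re

# Hypothetical Dot text in the general style of a generated call graph.
DOT_TEXT = '''
digraph "main" {
  Node1 [label="main",height=0.2];
  Node1 -> Node2 [color="midnightblue"];
  Node2 [label="parse_args",height=0.2];
  Node1 -> Node3 [color="midnightblue"];
  Node3 [label="run",height=0.2];
  Node3 -> Node2 [color="midnightblue"];
}
'''

# A node statement: identifier followed by an attribute list starting with label.
NODE_RE = re.compile(r'^\s*(\w+)\s*\[label="([^"]*)"', re.MULTILINE)
# An edge statement: identifier -> identifier.
EDGE_RE = re.compile(r'^\s*(\w+)\s*->\s*(\w+)', re.MULTILINE)


def dot_to_call_triples(dot_text):
    """Map node identifiers to their labels and emit (caller, func_call, callee)."""
    labels = dict(NODE_RE.findall(dot_text))
    return [
        (labels.get(src, src), "func_call", labels.get(dst, dst))
        for src, dst in EDGE_RE.findall(dot_text)
    ]


for triple in dot_to_call_triples(DOT_TEXT):
    print(triple)
```

Each edge becomes one call relationship between the labelled functions; identifiers without a label fall back to the raw node identifier.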

Step S103: acquire knowledge information related to the open source project from the open source community and the remote repository of the open source project.

This embodiment takes GitHub, a widely used open source community, as an example. Using the GitHub API together with the PyGithub and PyGit framework tools, information is acquired from the GitHub open source community and the Git repository, and valid knowledge information related to the target open source project is extracted by keyword.

The knowledge information related to the open source project includes project commit records, code merge requests (pull requests), issue sets, and the involvement relationships between them.

Specifically, in this embodiment of the present invention, the PyGithub framework, which wraps GitHub API v3, is used to extract by keyword knowledge information related to the target open source project from the GitHub open source community, including extracting information about issue sets and code merge requests related to the target project from developers' discussions of the project. The PyGit framework is used to obtain knowledge information related to the target project from the Git repository, including traversing the project's commit tree and extracting data from it to obtain information about the project commit records.

The GitHub API is an open query interface officially provided by GitHub that offers simple use of and access to the functions of the open source community. PyGit is a Python wrapper for libgit2; it provides a concise interface for accessing the attributes of objects and performing operations on Git repositories. libgit2 is a portable C implementation of the core methods of the Git version control system. By providing a reliable and stable C link library, it lets developers carry out routine Git operations through API calls in code. libgit2 can therefore be understood as a shared library for Git; compared with Git as a standalone program, libgit2 omits the complex optimizations and non-core functions.

Step S104: analyze the knowledge information of the program code itself and the knowledge information related to the open source project, and extract a plurality of triples.

On the basis of step S103, this embodiment of the present invention further analyzes, based on heuristic rules, the knowledge information of the program code itself and the knowledge information related to the open source project, and extracts a plurality of triples. This includes extracting, by text matching and natural language analysis, a plurality of knowledge entities and basic element relationships related to the target open source project from the text information.
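
One possible text-matching rule of this kind is sketched below. It is an assumption-laden illustration, not the rule set of the embodiment: the patterns, the entity naming scheme (`issue#5`, `pr#12`), and the function name `triples_from_issue_text` are all hypothetical. It relies only on the fact that GitHub discussions commonly reference issues and pull requests as `#N` and commits by an abbreviated or full hash.

```python
import re

# "#123"-style references; the same syntax is used for issues and pull requests.
NUMBER_REF = re.compile(r"(?<![\w/])#(\d+)\b")
# Candidate abbreviated or full commit hashes (7 to 40 hex digits).
HASH_REF = re.compile(r"\b[0-9a-f]{7,40}\b")


def triples_from_issue_text(issue_number, text, pr_numbers, commit_hashes):
    """Extract issue_relate_* triples from the discussion text of one issue.

    pr_numbers and commit_hashes are the pull request numbers and full commit
    hashes already known for the project; they decide how a match is typed.
    """
    subject = f"issue#{issue_number}"
    triples = []
    for match in NUMBER_REF.findall(text):
        number = int(match)
        if number == issue_number:
            continue  # ignore self-references
        if number in pr_numbers:
            triples.append((subject, "issue_relate_pr", f"pr#{number}"))
        else:
            triples.append((subject, "issue_relate_issue", f"issue#{number}"))
    for candidate in HASH_REF.findall(text):
        full = [h for h in commit_hashes if h.startswith(candidate)]
        if len(full) == 1:  # keep only hashes that resolve unambiguously
            triples.append((subject, "issue_relate_commit", full[0]))
    return triples
```

Resolving every candidate against the known commits and pull requests keeps ordinary words or numbers that merely look like references out of the triple set.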

Heuristic rules are widely used in this technical field and come in many varieties, so their specific use is not described in detail in this embodiment of the present invention. Specifically, an ant colony algorithm, a neural network algorithm, or the like may be selected.

In this embodiment of the present invention, the triple relationships in the knowledge graph data schema are represented formally in RDF; that is, the knowledge information of the program code itself and the knowledge information related to the open source project are analyzed, extracted, and converted into a plurality of triples. RDF (Resource Description Framework) is a resource description framework defined by the W3C (World Wide Web Consortium), a technical specification used mainly to describe and express entities and resources more richly.
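
As an illustration of what an RDF form of such a triple might look like, the sketch below serializes one (sub, relation, obj) triple as a line in the N-Triples syntax. The namespace `http://example.org/oskg/`, the IRI layout, and the helper names are hypothetical and chosen only for this example; the embodiment does not specify a serialization.

```python
from urllib.parse import quote

# Hypothetical namespace, used only for this example.
BASE = "http://example.org/oskg/"


def iri(kind, name):
    """Build an IRI for an entity, percent-encoding characters such as '/'."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def to_ntriples(sub, relation, obj, sub_kind, obj_kind):
    """Serialize one (sub, relation, obj) triple as a single N-Triples line."""
    return f"{iri(sub_kind, sub)} <{BASE}rel/{relation}> {iri(obj_kind, obj)} ."


print(to_ntriples("src/main.c", "file_contain_func", "main", "file", "func"))
```

Encoding the entity kind into the IRI keeps a file and a function with the same name from collapsing into one resource.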

In a preferred embodiment of the present invention, a plurality of triples is further obtained based on the basic element relationships and entities that make up the knowledge graph, as included in the data schema of step S101. The categories of triples include: the call relationship triple from a sub function to an obj function; the containment relationship triple from a sub file to an obj function; the modification relationship triple from a sub project commit record to an obj file; the involvement relationship triple from a sub issue set to an obj project commit record; the involvement relationship triple from a sub issue set to an obj issue set; the involvement relationship triple from a sub issue set to an obj code merge request; the involvement relationship triple from a sub code merge request to an obj project commit record; and the involvement relationship triple from a sub code merge request to an obj file.

Step S105: according to the different structural features of each data source of the triples, unify the data format of each kind of knowledge entity across all triples, and disambiguate each triple, so that in the triple set each valid knowledge entity corresponds to one and only one entity name.

Because the knowledge information of the program code itself and the knowledge information related to the open source project are obtained from the open source project code and the open source community respectively, the data come from different sources, so the representation, data format, and data consistency of the knowledge entities are very likely to differ.

In this embodiment of the present invention, text analysis and heuristic rules are used to unify and disambiguate, by rule, all knowledge entities that may carry different prefixes or suffixes. Specifically, all extracted knowledge triples are first checked against the predefined knowledge graph schema. For triples that do not match the schema, natural language analysis is used to define mapping rules that map differently formatted knowledge information from different sources onto the same formatted data, so that in the triple set formed by all triples each valid knowledge entity corresponds to one and only one entity name.

For example, when extracting file knowledge information, knowledge from different sources may represent a file name as an absolute path, a relative path, or a bare file name. These cases need to be disambiguated so that the names and relationships of the same file map to the same entity, which makes it easier to fuse relationships from different sources.
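
A minimal sketch of one such mapping rule for file names follows. It assumes the repository root and the set of repository-relative file paths are already known; the function name `canonical_file_name` and the policy of rejecting ambiguous bare names are illustrative choices, not taken from the embodiment.

```python
import posixpath


def canonical_file_name(name, repo_root, known_files):
    """Map an absolute path, relative path, or bare file name to the single
    repository-relative path used as the entity name, or None if ambiguous."""
    path = posixpath.normpath(name.replace("\\", "/"))
    root = posixpath.normpath(repo_root)
    if path.startswith(root + "/"):
        path = path[len(root) + 1:]  # absolute path inside the repository
    if path in known_files:
        return path  # exact repository-relative match
    # Bare names or partial paths: accept only a unique suffix match.
    candidates = [f for f in known_files if f.endswith("/" + path)]
    return candidates[0] if len(candidates) == 1 else None


files = {"src/util.c", "src/main.c", "test/main.c"}
print(canonical_file_name("/home/dev/proj/src/util.c", "/home/dev/proj", files))
print(canonical_file_name("util.c", "/home/dev/proj", files))
```

Returning None for a name such as `main.c`, which matches two files here, leaves the decision to a later rule instead of silently attaching relationships to the wrong entity.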

Step S106: based on the data schema, construct the knowledge graph of the open source project using the triple set.

In this embodiment of the present invention, the above steps yield a triple set in which all knowledge information is represented in RDF form, with the knowledge entities unified and disambiguated. The triples are then combined on the basis of shared knowledge entities to generate the knowledge graph of the open source project.

Before each triple is fused into the existing graph, it must be checked against the knowledge information already in the graph. Each check often requires several triples to verify jointly, and the next triple can be added only after the fusion is complete, so this way of constructing the graph is relatively inefficient.

Therefore, in a preferred embodiment provided by the present invention, based on the predefined data schema, cleaning, triple extraction, and disambiguation are performed for each data source, and the triple set representing one class of relationship is extracted separately to form the triple-set subgraph for that class. The data pipelines that build these triple-set subgraphs run concurrently; that is, multiple triple-set subgraphs are constructed concurrently, and finally all triple-set subgraphs are aggregated to construct the knowledge graph of the open source project. Compared with the serial processing of the original construction process, the parallel graph construction method of this preferred embodiment is more efficient and also promotes uniformity in development.
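
The concurrent construction described above might be organized as in the following sketch, which builds one triple-set subgraph per relationship class in a thread pool and then aggregates them. The input layout (a dictionary from relationship name to raw subject/object pairs), the trivial cleaning step, and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def build_subgraph(relation, raw_pairs):
    """Clean the raw pairs of one relationship class and return its triple set."""
    triples = set()
    for sub, obj in raw_pairs:
        sub, obj = sub.strip(), obj.strip()
        if sub and obj:  # drop incomplete records
            triples.add((sub, relation, obj))
    return triples


def build_graph(sources):
    """Build every relationship subgraph concurrently, then merge them."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(build_subgraph, relation, pairs)
            for relation, pairs in sources.items()
        ]
        graph = set()
        for future in futures:
            graph |= future.result()  # aggregation is a plain set union
    return graph
```

Because each subgraph holds a single relationship class, the pipelines share no state and the final aggregation needs no cross-checking between them. For CPU-bound extraction in CPython, a process pool may be the better choice than threads.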

Step S107: use the visualization tool Gephi to visually analyze and display the knowledge graph of the open source project.

FIG. 6 shows a visualization of an open source project knowledge graph constructed by an embodiment of the present invention.

Visualizing the open source project knowledge graph with Gephi helps system administrators discover problems in the knowledge graph, such as lost knowledge or entities that cannot be aligned, so that the construction process can be further optimized and improved.
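
One way to hand the triple set to Gephi is as an edge table in CSV form, which Gephi's spreadsheet import can read when the table has Source and Target columns. The sketch below is an assumption about how the export could be done, not the procedure of the embodiment; the function name and the choice to carry the relationship name in the Label column are illustrative.

```python
import csv
import io


def triples_to_edge_csv(triples):
    """Render triples as a directed edge table with the relation as the label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Label"])
    for sub, relation, obj in triples:
        writer.writerow([sub, obj, "Directed", relation])
    return buffer.getvalue()


print(triples_to_edge_csv([("main", "func_call", "run")]))
```

Entities that appear in no edge are absent from such a table, so isolated nodes would need a separate node table if they are to be inspected.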

In addition, based on the visualized content, the relevant knowledge information of the open source project can be displayed according to the developer's learning needs, and the relevant subgraph can be displayed according to the knowledge node the developer queries, providing the developer with knowledge about the target code to be learned. Prompt information can be given to the user to improve the experience, and click and event handling can be added to provide richer information.
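
Selecting the subgraph around a queried knowledge node can be as simple as the following sketch, which returns the triples directly connected to the node. The one-hop scope and the function name `node_subgraph` are assumptions; the embodiment does not state how far the displayed subgraph extends.

```python
def node_subgraph(triples, node):
    """Return the triples in which `node` is the subject or the object."""
    return [t for t in triples if node in (t[0], t[2])]


graph = [
    ("main", "func_call", "run"),
    ("src/main.c", "file_contain_func", "main"),
    ("run", "func_call", "log"),
]
for triple in node_subgraph(graph, "main"):
    print(triple)
```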

As shown in FIG. 7, this embodiment of the present invention addresses developers' need to learn open source projects. It extracts, from the open source project code itself, the open source community where the project is located, and the project's remote repository, the knowledge that developers need in order to learn and participate in the development of the project, and constructs the knowledge graph of the open source project. This helps improve developers' learning efficiency and thereby enables them to participate better in open source project development.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Embodiment 2

Referring to FIG. 5, a structural block diagram of an embodiment of a system for establishing a knowledge graph of an open source project according to the present invention is shown. The system can be applied to open source projects. From the perspective of developers' need to learn code, this embodiment acquires the code knowledge information of the open source project and the related knowledge information in the open source community to establish the knowledge graph of the open source project, thereby meeting developers' need to learn unfamiliar code. The system specifically includes:

A definition module 201, configured to predefine the data schema of the open source project knowledge graph.

A knowledge graph is composed of interconnected relationships and their attributes, and each relationship is usually represented as an SPO (Subject-Predicate-Object) triple. As shown in FIG. 2, in a triple the Subject denotes the entity from which the relationship originates, the Predicate denotes the relationship itself, and the Object denotes the entity to which the relationship points.

It is worth noting that the knowledge information this application extracts from multiple aspects of an open source project is driven by developers' need to learn the project and become able to participate in its development. It is neither the knowledge users need in order to use the open source project, nor the code snippet recommendation knowledge that experienced developers need in intelligent programming. For example, this application mainly extracts information internal to the open source project, so that developers, once they understand it, can contribute code to and develop the project internally, rather than extracting API information for developers to call. In addition, this application extracts the issue sets of the target open source project in the open source community, that is, the knowledge entities in developers' discussion text about the target project and the basic element relationships between those entities, so as to connect functions, files, project commit records, issue sets, code merge requests, and other information to one another. This helps developers quickly understand and learn the project code and related code knowledge, rather than extracting only the descriptive content of discussions and the question-and-answer relationships within them.

Referring to Table 1, in the knowledge graph data schema designed in this embodiment of the present invention, the entities to be extracted to construct the knowledge graph are: the entity "Func" (function), representing a function in the open source project code; the entity "File" (file), representing a file in the open source project; the entity "Commit" (project commit record), representing one commit in the project's commit history; the entity "Issue" (issue set), representing an issue of the open source project in the open source community together with its comments; and the entity "Pull Request" (code merge request), representing a code merge request of the open source project in the open source community.

Referring to Table 2, in the knowledge graph data schema designed in this embodiment of the present invention, the basic element relationships to be extracted to construct the knowledge graph are: the relationship "(sub,func_call,obj)", the call relationship from the sub function to the obj function; the relationship "(sub,file_contain_func,obj)", the containment relationship from the sub file to the obj function; the relationship "(sub,commit_change_file,obj)", the modification relationship from the sub project commit record to the obj file; the relationship "(sub,issue_relate_commit,obj)", the involvement relationship from the sub issue set to the obj project commit record; the relationship "(sub,issue_relate_issue,obj)", the involvement relationship from the sub issue set to the obj issue set; the relationship "(sub,issue_relate_pr,obj)", the involvement relationship from the sub issue set to the obj code merge request; the relationship "(sub,pr_relate_commit,obj)", the involvement relationship from the sub code merge request to the obj project commit record; and the relationship "(sub,pr_relate_file,obj)", the involvement relationship from the sub code merge request to the obj file.
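
The data schema described above can be written down compactly, for instance as a mapping from each relationship name to the entity types it connects. The sketch below is one possible encoding under that assumption; the dictionary layout and the `conforms` helper are illustrative, while the relationship and entity names are those listed in the text.

```python
# Relationship name -> (subject entity type, object entity type).
SCHEMA = {
    "func_call": ("Func", "Func"),
    "file_contain_func": ("File", "Func"),
    "commit_change_file": ("Commit", "File"),
    "issue_relate_commit": ("Issue", "Commit"),
    "issue_relate_issue": ("Issue", "Issue"),
    "issue_relate_pr": ("Issue", "Pull Request"),
    "pr_relate_commit": ("Pull Request", "Commit"),
    "pr_relate_file": ("Pull Request", "File"),
}


def conforms(triple, entity_types):
    """Check one (sub, relation, obj) triple against the schema.

    entity_types maps each known entity name to its entity type.
    """
    sub, relation, obj = triple
    expected = SCHEMA.get(relation)
    if expected is None:
        return False  # relationship not defined in the schema
    return (entity_types.get(sub), entity_types.get(obj)) == expected


types = {"main": "Func", "run": "Func", "src/main.c": "File"}
print(conforms(("main", "func_call", "run"), types))
print(conforms(("src/main.c", "func_call", "run"), types))
```

A check of this kind is what allows extracted triples to be validated against the schema before they are unified and merged.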

A knowledge acquisition module 202, configured to obtain knowledge information of the program code itself from the open source project code by static code analysis, the knowledge information of the program code itself including functions, files, and the call and containment relationships between them; and to obtain knowledge information related to the open source project from the open source community where the project is located and from the project's remote repository, the knowledge information related to the open source project including project commit records, code merge requests, issue sets, and the involvement relationships between them.

Optionally, the static analysis tool may be Doxygen, CppCheck, or FindBugs.

This embodiment takes GitHub, a widely used open source community, as an example. Using the GitHub API together with the PyGithub and PyGit framework tools, information is acquired from the GitHub open source community and the Git repository, and valid knowledge information related to the target open source project is extracted by keyword. The knowledge information related to the open source project includes project commit records, code merge requests, issue sets, and the involvement relationships between them.

Specifically, in this embodiment of the present invention, the PyGithub framework, which wraps GitHub API v3, is used to extract by keyword knowledge information related to the target open source project from the GitHub open source community, including extracting information about issue sets and code merge requests related to the target project from developers' discussions of the project. The PyGit framework is used to obtain knowledge information related to the target project from the Git repository, including traversing the project's commit tree and extracting data from it to obtain information about the project commit records.

The GitHub API is an open query interface officially provided by GitHub that offers simple use of and access to the functions of the open source community. PyGit is a Python wrapper for libgit2; it provides a concise interface for accessing the attributes of objects and performing operations on Git repositories. libgit2 is a portable C implementation of the core methods of the Git version control system. By providing a reliable and stable C link library, it lets developers carry out routine Git operations through API calls in code. libgit2 can therefore be understood as a shared library for Git; compared with Git as a standalone program, libgit2 omits the complex optimizations and non-core functions.

A knowledge analysis module 203, configured to analyze the knowledge information of the program code itself and the knowledge information related to the open source project and extract a plurality of triples; and, according to the different structural features of each data source of the triples, to unify the data format of each kind of knowledge entity in the triples and disambiguate each triple, so that in the triple set each valid knowledge entity corresponds to one and only one entity name.

In this embodiment of the present invention, on the basis of the knowledge acquisition module 202, the knowledge information of the program code itself and the knowledge information related to the open source project are further analyzed based on heuristic rules, and a plurality of triples is extracted. This includes extracting, by text matching and natural language analysis, a plurality of knowledge entities and basic element relationships related to the target open source project from the text information.

In a preferred embodiment of the present invention, the triple relationships in the knowledge graph data schema are represented formally in RDF; that is, the knowledge information of the program code itself and the knowledge information related to the open source project are analyzed, extracted, and converted into a plurality of triples. RDF (Resource Description Framework) is a resource description framework defined by the W3C (World Wide Web Consortium), a technical specification used mainly to describe and express entities and resources more richly.

In a preferred embodiment of the present invention, a plurality of triples is further obtained based on the basic element relationships and entities that make up the knowledge graph, as included in the data schema of the definition module 201. The categories of triples include: the call relationship triple from a sub function to an obj function; the containment relationship triple from a sub file to an obj function; the modification relationship triple from a sub project commit record to an obj file; the involvement relationship triple from a sub issue set to an obj project commit record; the involvement relationship triple from a sub issue set to an obj issue set; the involvement relationship triple from a sub issue set to an obj code merge request; the involvement relationship triple from a sub code merge request to an obj project commit record; and the involvement relationship triple from a sub code merge request to an obj file.

A construction module 204, configured to construct the knowledge graph of the open source project based on the data schema, using the plurality of triples.

In a preferred embodiment of the present invention, based on the predefined data schema, cleaning, triple extraction, and disambiguation are performed for each data source, and the triple set representing one class of relationship is extracted separately to form the triple-set subgraph for that class. The data pipelines that build these triple-set subgraphs run concurrently; that is, multiple triple-set subgraphs are constructed concurrently, and finally all triple-set subgraphs are aggregated to construct the knowledge graph of the open source project. Compared with the serial processing of the original construction process, the parallel graph construction method of this preferred embodiment is more efficient and also promotes uniformity in development.

A display module 205, configured to visually analyze and display the knowledge graph of the open source project using the visualization tool Gephi.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps is performed on the computer or other programmable terminal device to produce computer-implemented processing. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments as well as all changes and modifications that fall within the scope of the embodiments of the present invention.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprising", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article, or terminal device comprising that element.

The method for establishing an open source project knowledge graph and the system for establishing an open source project knowledge graph provided by the present invention have been described above in detail. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A method for establishing an open source project knowledge graph, the method comprising:

predefining a data schema of the open source project knowledge graph;

acquiring knowledge information of the program code from the open source project code by a static code analysis method, wherein the knowledge information of the program code comprises: functions, files, and the calling and containing relationships among the functions and the files;

acquiring knowledge information related to the open source project from the open source community where the open source project is located and from a remote repository of the open source project, wherein the knowledge information related to the open source project comprises: project commit records, code merge requests, issues, and the involvement relationships among them;

analyzing the knowledge information of the program code and the knowledge information related to the open source project, extracting a plurality of triples, unifying the data format of each knowledge entity in all the triples according to the different structural characteristics of each data source of the triples, and disambiguating each triple, so as to ensure that each valid knowledge entity in the triple set has a single corresponding entity name;

constructing the knowledge graph of the open source project from the triple set based on the data schema; and

visually analyzing and displaying the open source project knowledge graph using the visualization tool Gephi.
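
The unification and disambiguation step of claim 1 can be illustrated with a minimal Python sketch. The `Triple` type, the relation labels, and the normalization rules (trimming whitespace, dropping a trailing `()` from function names, using forward slashes in paths) are assumptions made for illustration; the claim does not specify them.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def canonical_name(name: str) -> str:
    """Map surface variants of an entity name to one canonical form.

    Assumed rules (not from the source): trim whitespace, use forward
    slashes in file paths, and drop a trailing "()" from function names.
    """
    name = name.strip().replace("\\", "/")
    if name.endswith("()"):
        name = name[:-2]
    return name


def disambiguate(triples: list[Triple]) -> list[Triple]:
    """Rewrite every triple with canonical names and drop exact duplicates."""
    seen: set[Triple] = set()
    result: list[Triple] = []
    for t in triples:
        unified = Triple(canonical_name(t.subject), t.relation, canonical_name(t.obj))
        if unified not in seen:
            seen.add(unified)
            result.append(unified)
    return result


# Hypothetical triples from two data sources that name the same function differently.
raw = [
    Triple("parse_args()", "calls", "read_config"),
    Triple(" parse_args", "calls", "read_config()"),
    Triple("src\\main.c", "contains", "parse_args()"),
]
clean = disambiguate(raw)
```

After this step the two spellings of `parse_args` collapse into one entity name, so the first two triples become a single edge.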

2. The method of claim 1, wherein predefining the data schema of the open source project knowledge graph comprises:

extracting, from multiple perspectives of the open source project, the knowledge information that forms the knowledge graph, wherein the knowledge information comprises: entities and basic element relationships; wherein the entities comprise: functions, files, project commit records, issues, and code merge requests; and wherein the basic element relationships comprise: a calling relationship, a containing relationship, a modifying relationship, and an involvement relationship.
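
One way to picture such a data schema is as a pair of enumerations plus a typed edge record. The following Python sketch is only an illustration; the class names, field names, and string values are assumptions, not part of the claims.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    # The five entity kinds listed in claim 2.
    FUNCTION = "function"
    FILE = "file"
    COMMIT = "commit"
    ISSUE = "issue"
    MERGE_REQUEST = "merge_request"


class RelationType(Enum):
    # The four basic element relationships listed in claim 2.
    CALLS = "calls"
    CONTAINS = "contains"
    MODIFIES = "modifies"
    INVOLVES = "involves"


@dataclass(frozen=True)
class Entity:
    type: EntityType
    name: str


@dataclass(frozen=True)
class Edge:
    source: Entity
    relation: RelationType
    target: Entity


# Hypothetical example: a file that contains a function.
main_c = Entity(EntityType.FILE, "src/main.c")
parse_args = Entity(EntityType.FUNCTION, "parse_args")
edge = Edge(main_c, RelationType.CONTAINS, parse_args)
```

Making the records frozen lets them be hashed, so entities and edges can be stored in sets when duplicates are removed.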

3. The method of claim 1, wherein acquiring knowledge information of the program code from the open source project code by a static code analysis method comprises:

for each file and each project module of the open source project, analyzing the open source project code using a static analysis tool and generating a local relationship subgraph for each, wherein the relationship subgraphs are described and output in the DOT language;

wherein the static analysis tool comprises: Doxygen, Cppcheck, and FindBugs.
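
As a rough illustration of how a DOT-language relationship subgraph might be turned into calling-relationship triples, the Python sketch below reads a simplified DOT input with one node or edge statement per line. The input layout, the `calls` label, and the function name `dot_to_call_triples` are assumptions; real tool output carries more attributes and is better handled by a full DOT parser.

```python
import re

# Matches a node statement such as:  Node1 [label="main"];
NODE_RE = re.compile(r'^\s*(\w+)\s*\[.*?label="([^"]*)"')
# Matches an edge statement such as:  Node1 -> Node2;
EDGE_RE = re.compile(r'^\s*(\w+)\s*->\s*(\w+)')


def dot_to_call_triples(dot_text: str) -> list[tuple[str, str, str]]:
    """Convert simplified DOT text into (caller, "calls", callee) triples."""
    labels: dict[str, str] = {}
    edges: list[tuple[str, str]] = []
    for line in dot_text.splitlines():
        edge = EDGE_RE.match(line)
        if edge:
            edges.append((edge.group(1), edge.group(2)))
            continue
        node = NODE_RE.match(line)
        if node:
            labels[node.group(1)] = node.group(2)
    # Replace node identifiers with their labels; fall back to the identifier.
    return [(labels.get(s, s), "calls", labels.get(t, t)) for s, t in edges]


# Hypothetical subgraph for one file.
sample = '''
digraph "call graph" {
  Node1 [label="main"];
  Node2 [label="parse_args"];
  Node3 [label="read_config"];
  Node1 -> Node2;
  Node2 -> Node3;
}
'''
triples = dot_to_call_triples(sample)
```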

4. The method of claim 1, wherein the categories of the plurality of triples comprise:

calling-relationship triples from a subject function to an object function; containing-relationship triples from a subject file to an object function; modifying-relationship triples from a subject commit record to an object file; involvement-relationship triples from a subject issue to an object project commit record; involvement-relationship triples from a subject issue to an object issue; involvement-relationship triples from a subject issue to an object merge request; involvement-relationship triples from a subject merge request to an object project commit record; and involvement-relationship triples from a subject code merge request to an object file.
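
The eight triple categories can be written down as a small lookup table that rejects any triple whose subject type, relationship, and object type do not match one of them. This Python sketch is illustrative only; the string labels and the helper `is_valid_triple` are assumptions.

```python
# (subject type, relationship, object type) for the eight categories in claim 4.
ALLOWED_TRIPLE_TYPES = {
    ("function", "calls", "function"),
    ("file", "contains", "function"),
    ("commit", "modifies", "file"),
    ("issue", "involves", "commit"),
    ("issue", "involves", "issue"),
    ("issue", "involves", "merge_request"),
    ("merge_request", "involves", "commit"),
    ("merge_request", "involves", "file"),
}


def is_valid_triple(subject_type: str, relation: str, object_type: str) -> bool:
    """Return True if the typed triple belongs to one of the allowed categories."""
    return (subject_type, relation, object_type) in ALLOWED_TRIPLE_TYPES


ok = is_valid_triple("commit", "modifies", "file")
rejected = is_valid_triple("file", "calls", "function")
```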

5. The method of claim 1, wherein the step of constructing the knowledge graph of the open source project comprises:

based on the data schema, performing cleaning, triple extraction, and disambiguation on each data source; extracting, for each type of relationship, the triple set representing that type of relationship to form a triple-set subgraph, wherein the data pipelines that construct the triple-set subgraphs are executed concurrently; and finally aggregating all the triple-set subgraphs to construct the knowledge graph of the open source project.
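
A minimal Python sketch of this build step follows: one pipeline per relationship type runs concurrently, and the resulting subgraphs are merged into a single adjacency map. The use of a thread pool, the adjacency-map representation, and all function names are assumptions for illustration; the claim does not prescribe a concurrency mechanism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

Triple = tuple[str, str, str]


def build_subgraph(relation: str, triples: list[Triple]) -> set[Triple]:
    """One pipeline: keep only the triples of a single relationship type."""
    return {t for t in triples if t[1] == relation}


def build_knowledge_graph(triples: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Run one pipeline per relationship type concurrently, then aggregate."""
    relations = sorted({t[1] for t in triples})
    with ThreadPoolExecutor() as pool:
        subgraphs = list(pool.map(lambda r: build_subgraph(r, triples), relations))
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subgraph in subgraphs:
        for subject, relation, obj in sorted(subgraph):
            graph[subject].append((relation, obj))
    return dict(graph)


# Hypothetical, already disambiguated triples.
graph = build_knowledge_graph([
    ("main", "calls", "parse_args"),
    ("src/main.c", "contains", "main"),
    ("c1", "modifies", "src/main.c"),
    ("main", "calls", "read_config"),
])
```

In CPython, threads mainly help when the pipelines wait on I/O such as repository or community API requests; CPU-bound pipelines would more likely use processes.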

6. A system for establishing an open source project knowledge graph, the system comprising:

a definition module, configured to predefine a data schema of the open source project knowledge graph;

a knowledge acquisition module, configured to acquire knowledge information of the program code from the open source project code by a static code analysis method, wherein the knowledge information of the program code comprises: functions, files, and the calling and containing relationships among the functions and the files; and to acquire knowledge information related to the open source project from the open source community where the open source project is located and from a remote repository of the open source project, wherein the knowledge information related to the open source project comprises: project commit records, code merge requests, issues, and the involvement relationships among them;

a knowledge analysis module, configured to analyze the knowledge information of the program code and the knowledge information related to the open source project and extract a plurality of triples; and to unify the data format of each knowledge entity in the triples according to the different structural characteristics of each data source of the triples, and disambiguate each triple, so as to ensure that each valid knowledge entity in the triple set has a single corresponding entity name;

a construction module, configured to construct the knowledge graph of the open source project from the plurality of triples based on the data schema; and

a display module, configured to visually analyze and display the open source project knowledge graph using the visualization tool Gephi.
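
The text does not say how the graph is handed to Gephi. One common route, assumed here purely for illustration, is an edge table in CSV form with `Source` and `Target` columns, which Gephi's spreadsheet importer accepts. The function name `write_gephi_edges` and the sample triple are hypothetical.

```python
import csv
import io


def write_gephi_edges(triples, out) -> None:
    """Write triples as a Gephi edge table: one directed, labeled edge per row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type", "Label"])
    for subject, relation, obj in triples:
        writer.writerow([subject, obj, "Directed", relation])


# Written to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
write_gephi_edges([("main", "calls", "parse_args")], buffer)
edge_csv = buffer.getvalue()
```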

7. The system of claim 6, wherein the definition module is configured to:

extract, from multiple perspectives of the open source project, the knowledge information that forms the knowledge graph, wherein the knowledge information comprises: entities and basic element relationships; wherein the entities comprise: functions, files, project commit records, issues, and code merge requests; and wherein the basic element relationships comprise: a calling relationship, a containing relationship, a modifying relationship, and an involvement relationship.

8. The system of claim 6, wherein the knowledge acquisition module comprises:

a static analysis tool, comprising: Doxygen, Cppcheck, and FindBugs.

9. The system of claim 6, wherein in the knowledge analysis module, the categories of the plurality of triples comprise:

10. The system of claim 6, wherein the construction module comprises: