
CN120256636A - File classification method, device, equipment and readable storage medium - Google Patents


Info

Publication number
CN120256636A
CN120256636A (application CN202510321767.2A)
Authority
CN
China
Prior art keywords
classification
description information
category
node
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510321767.2A
Other languages
Chinese (zh)
Inventor
何轶凡
郑佳斌
耿启涛
谢亚娟
朱珊丽
何永亮
蒋忠林
陈勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Geely Holding Group Co Ltd, Geely Automobile Research Institute Ningbo Co Ltd filed Critical Zhejiang Geely Holding Group Co Ltd
Priority to CN202510321767.2A priority Critical patent/CN120256636A/en
Publication of CN120256636A publication Critical patent/CN120256636A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Creation or modification of classes or clusters
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/185Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present application provides a file classification method, apparatus, device, and readable storage medium that convert the file classification task into a generation task for a large model. For a preset multi-level classification structure, classification path description information is determined for each category node in each level; this information covers all nodes passed from the level root node to the category node, together with the description of the classification category each node represents. The large model then combines this classification path description information to infer the hierarchical category result to which a file belongs, which characterizes the file's hierarchical classification category. This reduces the cumulative error of layer-by-layer classification, avoids the classification performance degradation caused by information loss, and improves both the accuracy and the efficiency of file classification.

Description

File classification method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for classifying files.
Background
Under the background of big data and informatization, the efficient classification of document data plays an important role in improving the management efficiency and business processing capacity of digital assets. However, existing file classification methods have significant drawbacks in handling multi-level classification and dynamically changing file classification scenarios.
The current hierarchical classification approach classifies layer by layer, but accumulated errors easily arise across the layers and degrade classification accuracy. Likewise, the approach that directly uses the leaf nodes of a multi-level classification system as the basis for file classification ignores the important classification information carried by the parent and ancestor nodes of those leaf nodes, which reduces classification performance.
Disclosure of Invention
In view of the above, the present application provides a method, apparatus, device and readable storage medium for classifying files.
Specifically, the application is realized by the following technical scheme:
According to a first aspect of an embodiment of the present application, there is provided a method for classifying files, the method including:
determining, for a preset multi-level classification structure, the classification path description information of each category node in each level, wherein the classification path description information is determined based on all nodes passed from the level root node to the category node and the description information of the classification category represented by each of those nodes;
acquiring a text sequence obtained by analyzing a file to be classified;
and inputting the text sequence together with the classification path description information of each category node into a large model, and obtaining a hierarchical classification result output by the large model after reasoning over the classification path description information of each category node, the result indicating the hierarchical classification category to which the file to be classified belongs in the multi-level classification structure.
Optionally, the method further comprises:
constructing, for each level of the multi-level classification structure, a generation prompt used to prompt a large language model to enhance the distinctiveness between the description information of the category nodes of the same level;
and inputting the category nodes of the same level in the multi-level classification structure together with the generation prompt into the large language model, and obtaining the description information of each category node of that level output by the large language model.
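As an illustration, the description-generation step above can be sketched in Python; the prompt wording below is an assumption of this sketch, not text from the application:

```python
def build_description_prompt(level_categories):
    # Ask an LLM for one description per sibling category, stressing the
    # differences between siblings (the wording is illustrative only).
    names = ", ".join(level_categories)
    return (
        "Write a one-sentence description for each of the following "
        f"sibling classification categories: {names}. "
        "Emphasize what distinguishes each category from the others."
    )
```

The returned string would be sent to the large language model, whose reply supplies the description information for every category node of that level.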
Optionally, in the case that the same level in the multi-level hierarchical structure includes a plurality of category nodes, the method further includes:
acquiring multiple batches of description information generated in multiple passes for the plurality of category nodes in the same level, wherein each batch includes description information for all category nodes in that level;
determining, for the description information in each batch, the text-description distinctiveness of that batch based on the text similarity between the description information of every two category nodes in the same level;
and taking the batch with the highest text-description distinctiveness among the multiple batches as the description information of the plurality of category nodes in that level.
Optionally, determining the text-description distinctiveness of a batch based on the text similarity between the description information of every two category nodes in the same level includes:
for the pieces of description information in the same batch that correspond one-to-one to the category nodes, determining the complement with respect to 1 of the text similarity between every two pieces of description information, and taking that complement as the text-description distinctiveness between the two pieces;
and summing the text-description distinctiveness over all distinct pairs of description information, and determining an average distinctiveness from the sum and the number of pairs as the text-description distinctiveness of the batch.
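A minimal sketch of this batch-scoring scheme, assuming `difflib.SequenceMatcher` as a stand-in for the unspecified text-similarity measure:

```python
from difflib import SequenceMatcher
from itertools import combinations

def batch_distinctiveness(descriptions):
    # Average pairwise distinctiveness of one batch: for every pair of
    # descriptions, distinctiveness = 1 - similarity; the batch score is
    # the sum divided by the number of pairs.
    pairs = list(combinations(descriptions, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)

def pick_best_batch(batches):
    # Keep the batch whose descriptions are most mutually distinct.
    return max(batches, key=batch_distinctiveness)
```

Identical descriptions score 0.0, fully dissimilar ones approach 1.0, so `pick_best_batch` selects the generation pass with the most mutually distinguishable descriptions.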
Optionally, the determining the classification path description information of each class node in each hierarchy includes:
for each node passing from the hierarchical root node to the class node, organizing the description information of the class represented by the node into an information item;
And integrating all the information items through set symbol links according to the path sequence of the passed nodes to form the classified path description information of the nodes.
Optionally, the large model determines, according to the classification path description information of each category node, the category node to which the text sequence of the file to be classified belongs, and generates the category sequence formed by the nodes passed from the level root node to that category node as the hierarchical classification result of the text sequence; the method further includes:
performing layer-by-layer character-matching similarity calculation between the category sequence output by the large model and the category nodes of the multi-level classification structure, and determining the node path in the multi-level classification structure with the highest character similarity to the category sequence;
and taking the category sequence represented by all nodes on that node path as the updated hierarchical classification result of the text sequence corresponding to the file to be classified.
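A sketch of this path-correction step, assuming the taxonomy is stored as a parent-to-children mapping and using `difflib.SequenceMatcher` as an illustrative character-similarity measure:

```python
from difflib import SequenceMatcher

def snap_to_taxonomy(predicted_labels, children_of, root="root"):
    # Walk the tree top-down, replacing each predicted label with the
    # most character-similar valid child, so the final result is always
    # a real node path of the multi-level classification structure.
    path, node = [], root
    for label in predicted_labels:
        children = children_of.get(node, [])
        if not children:
            break
        node = max(children,
                   key=lambda c: SequenceMatcher(None, label, c).ratio())
        path.append(node)
    return path
```

This maps a possibly noisy generated sequence (typos, paraphrases) back onto an existing node path, which is then used as the updated hierarchical classification result.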
Optionally, inputting the text sequence together with the classification path description information of each category node into the large model includes:
constructing a classification prompt for classifying the file from the text sequence and the classification path description information of each category node, wherein the classification prompt further includes at least the classification task to be executed by the large model and a description of the required classification output;
and inputting the classification prompt into the large model so that the large model generates, according to the classification prompt, the hierarchical classification result of the text sequence corresponding to the file to be classified.
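A hedged sketch of how such a classification prompt might be assembled; the prompt layout and the `' -> '` path separator are assumptions of this example, not details from the application:

```python
def build_classification_prompt(text_sequence, path_descriptions):
    # path_descriptions: one classification-path description string per
    # candidate category node (the output of the S101 preprocessing).
    categories = "\n".join(f"- {d}" for d in path_descriptions)
    return (
        "Task: assign the document below to exactly one of the candidate "
        "category paths.\n"
        f"Candidate category paths:\n{categories}\n"
        "Output: the full category path from the root, with levels "
        "separated by ' -> '.\n"
        f"Document:\n{text_sequence}\n"
    )
```

The resulting string bundles the task statement, the output requirements, the candidate paths, and the file's text sequence into a single model input.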
Optionally, the obtaining a text sequence obtained by analyzing the file to be classified includes:
Under the condition that the file to be classified comprises a picture, at least one of text information, file name and text description aiming at content included in the picture is obtained as a text sequence of the picture;
In the case that the file to be classified includes audio or video, text information obtained by converting speech included in the audio or video is acquired as a text sequence of the audio or video.
According to a second aspect of an embodiment of the present application, there is provided a document sorting apparatus, the apparatus including:
The classification preprocessing module is used for determining classification path description information of each class node in each level according to a preset multi-level classification structure, wherein the classification path description information is determined and obtained based on all nodes passing from a level root node to the class node and the description information of the classification class represented by each node;
the file analysis module is used for acquiring a text sequence obtained by analyzing the files to be classified;
and the large model classification module is used for inputting the text sequence and the classification path description information of each class node into a large model in a correlated way, acquiring a hierarchical classification result which is output by the large model after being inferred according to the classification path description information of each class node, and indicating the hierarchical classification class to which the file to be classified belongs in the multi-layer classification structure.
Optionally, the apparatus further comprises:
a description generation module, configured to construct, for each level of the multi-level classification structure, a generation prompt used to prompt a large language model to enhance the distinctiveness between the description information of the category nodes of the same level, and to input the category nodes of the same level together with the generation prompt into the large language model and obtain the description information of each category node of that level output by the large language model.
Optionally, in the case that the same level in the multi-level hierarchical structure includes a plurality of category nodes, the apparatus further includes:
The batch generation module is used for acquiring multi-batch description information which is generated for a plurality of category nodes in the same level for a plurality of times, wherein each batch comprises description information of all category nodes in the level;
The distinguishing degree calculation module is used for determining the text description distinguishing degree of each batch based on the text similarity between the description information of every two types of nodes in the same level aiming at the description information in each batch;
And the screening module is used for taking a batch of description information with the highest text description distinction degree in the multiple batches of description information as the description information of the multiple category nodes in the same hierarchy.
Optionally, the distinguishing degree calculating module is specifically configured to:
for the pieces of description information in the same batch that correspond one-to-one to the category nodes, determine the complement with respect to 1 of the text similarity between every two pieces of description information, and take that complement as the text-description distinctiveness between the two pieces;
and sum the text-description distinctiveness over all distinct pairs of description information, and determine an average distinctiveness from the sum and the number of pairs as the text-description distinctiveness of the batch.
Optionally, the classification preprocessing module is specifically configured to:
for each node passing from the hierarchical root node to the class node, organizing the description information of the class represented by the node into an information item;
And integrating all the information items through set symbol links according to the path sequence of the passed nodes to form the classified path description information of the nodes.
Optionally, the large model determines, according to the classification path description information of each category node, the category node to which the text sequence of the file to be classified belongs, and generates the category sequence formed by the nodes passed from the level root node to that category node as the hierarchical classification result of the text sequence; the apparatus is further configured to:
perform layer-by-layer character-matching similarity calculation between the category sequence output by the large model and the category nodes of the multi-level classification structure, and determine the node path in the multi-level classification structure with the highest character similarity to the category sequence;
and take the category sequence represented by all nodes on that node path as the updated hierarchical classification result of the text sequence corresponding to the file to be classified.
Optionally, the large model classification module is specifically configured to:
Constructing a classification prompt word for classifying the document according to the text sequence and the classification path description information of each class node, wherein the classification prompt word at least further comprises a classification task executed by the large model and description information required by corresponding classification output;
and inputting the classification prompt words into the large model so that the large model generates a hierarchical classification result of the text sequence corresponding to the file to be classified according to the classification prompt words.
Optionally, the file parsing module is specifically configured to:
Under the condition that the file to be classified comprises a picture, at least one of text information, file name and text description aiming at content included in the picture is obtained as a text sequence of the picture;
In the case that the file to be classified includes audio or video, text information obtained by converting speech included in the audio or video is acquired as a text sequence of the audio or video.
According to a third aspect of the embodiments of the present application, there is provided an electronic device including a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to execute the above file classification method by invoking the computer program.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described file classification method.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
In the technical solution provided by the present application, the file classification task is converted into a generation task for a large model. The classification path description information of each category node in each level is determined from the preset multi-level classification structure; this information includes all nodes passed from the level root node to the category node and the description information of their classification categories. The large model combines this classification path description information to infer the hierarchical category result of a file, which characterizes the file's hierarchical classification category. This reduces the cumulative error of layer-by-layer classification, avoids the classification performance degradation caused by information loss, and improves the accuracy and efficiency of file classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed. Furthermore, not all of the above-described effects need be achieved in any one embodiment of the present application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1A is a flow chart illustrating steps of a method for classifying documents according to an exemplary embodiment of the present application;
FIG. 1B is a diagram illustrating a multi-level taxonomy of files according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram illustrating the generation of explanatory information for classification categories represented by a class node according to an exemplary embodiment of the present application;
FIG. 3A is a flowchart illustrating steps for generating specification information and evaluating multiple lots according to an exemplary embodiment of the present application;
FIG. 3B is a schematic representation of the calculation of textual description differentiation of a batch, according to an exemplary embodiment of the application;
FIG. 4 is a diagram illustrating an exemplary multi-level hierarchical structure according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a document sorting apparatus according to an exemplary embodiment of the present application;
fig. 6 is a hardware schematic of an electronic device according to an exemplary embodiment of the application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of the application, a first classification threshold may also be referred to as a second classification threshold, and similarly, a second classification threshold may also be referred to as a first classification threshold. Depending on the context, the term "if" as used herein may be interpreted as "upon", "when", or "in response to determining".
In the context of big data and informatization, efficient classification of document materials plays an important role in improving the management efficiency and business processing capacity of digital assets. For example, the large volume of documents generated inside an enterprise constitutes a rich digital asset; to make full use of it, the documents must be classified efficiently so as to improve the efficiency of business query and use and to provide strong support for enterprise decision-making. However, existing file classification methods have significant drawbacks when handling multi-level classification and dynamically changing classification scenarios.
The current hierarchical classification approach can classify files layer by layer, but it requires detailed data labeling for every level, increasing the workload of data preparation; at the same time, errors easily accumulate across the layers and affect the final classification accuracy. The approach that directly uses the leaf nodes of a multi-level classification system as the classification basis ignores the important classification information carried by the parent and ancestor nodes of those leaf nodes, reducing classification performance. In addition, most existing methods rely on fixed classification categories: when the category system changes, the model must be rebuilt and retrained, which is time-consuming and labor-intensive, and frequent model updates can harm the continuity and stability of the service. These limitations seriously affect the efficiency and accuracy of document classification in practical applications.
In view of the above, and addressing the limitations of existing file classification methods in handling multi-level categories and dynamically changing scenarios, the present application provides a zero-shot multi-level file classification method based on a large model. The method converts the file classification task into a generation task for the large model, so that the task better matches the training paradigm of the large model and fully exploits its strong capabilities in text understanding and generation. Before classifying with the large model, each classification category of the hierarchy is given an expanded description and a complete path-related description is generated, which ensures that the key information of classification categories at different levels is retained and is easier for the large language model to understand and process, avoiding the classification performance degradation caused by information loss in traditional methods.
The file classification method provided by the present application can be widely applied to multi-level file classification scenarios in different fields, such as: enterprise file management; classification of academic papers by subject field, research direction, publication year, and so on; classification of policy documents, regulations, and notices by policy field, issuing authority, and publication year; classification by law firms or legal institutions of legal cases, contracts, and legal documents by legal field, case type, and parties; classification of teaching resources by educational institutions or online learning platforms by subject, grade, and course type; and classification of personal files such as photos, documents, and videos by time, subject, and type.
Referring to the step flowchart of the file classification method exemplarily shown in fig. 1A, the file classification method provided by the present application may at least include the following steps:
S101, determining, for a preset multi-level classification structure, the classification path description information of each category node in each level, wherein the classification path description information is determined based on all nodes passed from the level root node to the category node and the description information of the classification category represented by each of those nodes;
The multi-level classification structure represents a hierarchy for classifying files that contains multiple classification levels, whose classification categories are refined level by level according to preset logic, attribute relationships, or other classification bases, with inclusion relationships between adjacent levels. Each level contains one or more category nodes representing different classification categories; the categories represented by the nodes of the same level are parallel subdivisions within that level, and together they further subdivide the classification categories of the adjacent upper level.
In the multi-level classification structure, the top-level node is called the level root node and is the starting point, or base classification, of all other nodes. Each specific classification category is regarded as a category node, and the nodes are connected according to the hierarchical relationships to form the multi-level classification structure.
For example, referring to the multi-level file taxonomy shown schematically in FIG. 1B, the first level contains the node "document" as the level root node, which is divided by document type into the two second-level category nodes "office document" and "design document". The node "office document" is further divided in the third level into the two category nodes "report" and "plan", each of which may be further divided by period in the fourth level into the three category nodes "annual report/plan", "quarterly report/plan", and "monthly report/plan". Similarly, the node "design document" is further subdivided in the third level into the two category nodes "drawing" and "model"; by design field, the node "drawing" may be subdivided in the fourth level into the four category nodes "building drawing", "mechanical drawing", "electrical drawing", and "other drawing", and the node "model" may be divided in the fourth level into the two category nodes "three-dimensional model" and "two-dimensional model".
The description information of the classification category represented by each node is an expanded description of that category. It may include, but is not limited to, a specific definition, such as the category's coverage and its main features or attributes, used to establish the category's boundaries, as well as examples of the category, such as concrete objects, concepts, events, or application scenarios that belong to it and embody its core characteristics.
For example, for the category node "mechanical drawing" in FIG. 1B, the description information may include, but is not limited to: drawings drawn to convey the structure, dimensions, materials, and technical requirements of mechanical components, in which the shape, dimensions, assembly relationships, and manufacturing process requirements of the components are described precisely through graphic symbols, lines, dimension marks, and text annotations, covering details such as geometry, dimension marking, material marking, tolerance requirements, surface roughness, and heat treatment requirements.
In this embodiment, the large model's handling of the file classification task depends on an accurate understanding of the classification categories. Adding description information to every category node of every level in the multi-level classification structure provides the large model with rich context. Moreover, subtle differences or overlaps may exist between different categories in the same level; detailed description information highlights the uniqueness and distinguishing points of each category and aids the separation of similar but different categories, so that the large model can grasp the essential characteristics of each classification category more accurately, improving both the accuracy and the efficiency of classification.
The classification path description information of each category node is used to express clearly the position of the category represented by that node within the whole multi-level classification structure, and its relationship to, and distinction from, the other categories. It includes the node path information from the level root node to the category node, and can be determined from all nodes passed from the root node to the category node together with the description information of the classification category represented by each of those nodes.
Description information is added for the classification category represented by each category node in the multi-layer classification structure. For any category node in the multi-layer classification structure, the classification categories represented by the nodes on the node path from the hierarchical root node to that category node, together with the description information of each of those nodes, can be organized into ordered data according to the node order on the path, so as to form the classification path description information of the category node. The classification path description information can be organized in data formats such as key-value pairs, sets, or JSON.
Still taking the category node "mechanical drawing" in FIG. 1B as an example, the classification path description information of the category node at least includes {root node "design part document" + description information}, {second-level node "design document" + description information}, {third-level node "drawing" + description information}, and {fourth-level node "mechanical drawing" + description information}.
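As a sketch of how such ordered classification path description information might be assembled, with JSON as one of the data formats mentioned above (the node names follow the FIG. 1B example; the description texts are hypothetical placeholders):

```python
import json

def build_path_description(path_nodes):
    # Organize the nodes on the path from the root to a category node,
    # together with their description information, into ordered data
    # (serialized here as JSON, one of the formats mentioned above).
    return json.dumps(
        [{"category": name, "description": desc} for name, desc in path_nodes],
        ensure_ascii=False,
    )

# Hypothetical path for the "mechanical drawing" node of FIG. 1B
path = [
    ("design part document", "documents produced by design departments"),
    ("design document", "documents recording a product design"),
    ("drawing", "graphical design documents"),
    ("mechanical drawing", "drawings expressing structure, size and material"),
]
print(build_path_description(path))
```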
S102, acquiring a text sequence obtained by analyzing a file to be classified;
The files to be classified represent files that need to be classified, and may be files in any format, including but not limited to documents (e.g., PDF, word, excel, PPT, etc.), pictures, audio, video, etc. The text sequence comprises a data sequence extracted from the file to be classified and presented in a text form, and the text sequence is the basis of the subsequent classification task.
Regarding the acquisition of the text sequence, for text information included in the file to be classified, the text information may be directly extracted as a constituent part of the text sequence of the file to be classified. In the case that a picture is included in the file to be classified, at least one of text information, a file name, and a text description for content included in the picture, which are included in the picture, may be acquired as a text sequence of the picture. That is, for a picture that itself contains text information, the text information in the picture may be directly extracted as a text sequence of the picture, or for a picture that does not contain text information, a text description may be generated for image content included in the picture by a large language model or other picture processing tool, thereby taking the text description as a text sequence of the picture.
In the case that the file to be classified includes audio or video, text information obtained by converting the speech included in the audio or video is acquired as the text sequence of the audio or video. For audio or video that already carries subtitles, the subtitle information can be collected directly in playback order to obtain a text sequence composed of the subtitle information; for video or audio without subtitles, the speech can be converted into text information using existing speech-to-text technology, and the text information is arranged in the playback order of the speech to form the text sequence of the audio or video.
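A minimal sketch of assembling the text sequence of an audio/video file from transcribed fragments, assuming a hypothetical speech-to-text or subtitle tool returns fragments with playback start times:

```python
def audio_text_sequence(segments):
    # Arrange speech/subtitle fragments by playback start time and join
    # them into the text sequence of the audio or video.
    ordered = sorted(segments, key=lambda s: s["start"])
    return " ".join(s["text"] for s in ordered)

# Hypothetical fragments, deliberately out of order
segments = [
    {"start": 12.0, "text": "second sentence"},
    {"start": 0.5, "text": "first sentence"},
]
print(audio_text_sequence(segments))  # → first sentence second sentence
```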
In the process of parsing the files to be classified into text sequences, various files such as PDF, Office documents (Word, Excel, PPT), pictures and the like can be automatically converted into text sequences in Markdown format. The Markdown format is preferred for its simplicity, good readability, and structured content, which enable large language models to understand and generate text more efficiently.
In order to further improve the quality of the acquired text sequence and improve classification accuracy, the acquired text sequence can be subjected to data cleaning to remove noise and useless information in the text, such as redundant spaces, line-break symbols, special characters, HTML tags and the like, so that the content which best represents the data characteristics and benefits the document classification task is retained. The text sequence can be further processed through cleaning operations such as spelling correction, grammar correction, missing-word filling, wrong-word correction, duplicate removal, special-symbol removal, stop-word removal, and text-format regularization, so as to ensure that the text data input to subsequent modules is cleaner, more accurate and easier to understand. This embodiment can flexibly select data cleaning operations to adapt to different actual scenes and realize a comprehensive and clear extraction of the contents of the files to be classified.
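A minimal cleaning pass illustrating a few of the operations above (the regular expressions here are illustrative; a real pipeline would add spelling correction, stop-word removal, and the other operations listed):

```python
import re

def clean_text(raw):
    text = re.sub(r"<[^>]+>", " ", raw)             # strip HTML tags
    text = re.sub(r"[\r\n\t]+", " ", text)          # line breaks / tabs
    text = re.sub(r"[^\w\s.,;:!?()\-]", "", text)   # special characters
    return re.sub(r"\s{2,}", " ", text).strip()     # redundant spaces

print(clean_text("<p>Design   doc:\n mechanical ## drawing</p>"))
# → Design doc: mechanical drawing
```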
And S103, inputting the text sequence and the classification path description information of each class node into a large model in an associated manner, and obtaining a hierarchical classification result which is output by the large model after reasoning according to the classification path description information of each class node, wherein the result indicates the hierarchical classification class to which the file to be classified belongs in the multi-layer classification structure.
The large model in this embodiment includes a large neural network model in the deep learning field, which has a huge number of parameters and a complex network structure and can capture and process complex features and patterns in data; for example, the large model may include, but is not limited to, Transformer models such as BERT and the GPT series, convolutional neural network models, and the like. The large model can automatically extract useful feature representations from the input text data through its complex network structure and large number of parameters, and capture context information in the text data, so as to accurately understand the meaning of the text data and classify it.
The text sequence of the file to be classified and the classification path description information of each category node in the multi-layer hierarchical structure are input into the large model in an associated manner. Since the classification path description information of a category node includes all the nodes passed from the root node to the category node and the description information of the classification category represented by each node, the large model can fully understand the hierarchy and logical relationships of the multi-layer hierarchical structure after receiving the classification path description information.
Based on detailed classification path description information, the large model can perform in-depth reasoning analysis on the file to be classified, and gradually reasoning the most probable classification category of the file to be classified on each classification level according to the content of the text sequence and the classification path description information of each classification node. The process not only considers the direct relevance of text content, but also integrates the hierarchical and logical relationship of the classification structure, thereby improving the classification accuracy and rationality. The hierarchical classification result output by the large model clearly indicates the hierarchy and specific classification category of the file to be classified in the multi-layer classification structure, and the hierarchical classification result is based on deep understanding of the text content and accurate grasp of the classification structure by the large model, so that the method has higher reliability and practicability.
In order to better guide the classification logic of the large model, a classification prompt word for indicating the large model to classify files can be constructed according to the text sequence and the classification path description information of each class node, and the classification prompt word can provide a structured information framework to help the large model to better understand the input content and conduct accurate and efficient classification reasoning. After the classification prompt word is obtained, the classification prompt word can be input into the large model, so that the large model generates a hierarchical classification result of the text sequence corresponding to the file to be classified according to the classification prompt word.
The text sequence and the classification path description information are integrated in the classification prompt word so as to clearly classify the target, the path and the possible classification options of the large model, thereby enhancing the interpretation capability of the large model on the input information and improving the classification accuracy and efficiency. The classification prompt word at least can also comprise the classification task executed by the large model and the description information of the corresponding classification output requirement so as to obtain the hierarchical classification result which is output by the large model and meets the classification output requirement. In addition, besides basic requirements on language, style and the like of text generation, customized requirements can be added into the classified prompt words, so that the large model generated content based on the prompt words meets the requirements of document classification, such as a customized requirement example:
Requirement 1: "Please select the most likely category from the given #category nodes#."
Requirement 2: "Please give the reason for the classification based on the content of the #text sequence# and the selected category."
Requirement 1 limits the scope of the content the large model generates for the document category, preventing the generative model from outputting information that is completely irrelevant to the document category. Requirement 2 leverages the powerful understanding, summarizing and question-answering capabilities of the large model so that it can provide interpretability for its classification results.
The construction of the classification prompt word can be implemented by using a classification prompt word template, wherein the template comprises a fixed prompt structure and replaceable variable parameters, in the embodiment, the text sequence and the classification path description belong to the replaceable variable parameters, and the template filling can be completed by inserting the text sequence of the file to be classified and the classification path description information of each class node into the classification prompt word template as the variable parameters, so that the classification prompt word aiming at specific input is generated.
For example, exemplary classification hints may include the following:
[ you are a file classification assistant, you can infer the text sequence input according to the classification path description information of each class node provided by the multi-layer classification structure module, determine the class node to which the text sequence belongs, and output the hierarchical classification result for the text sequence.
Please fully understand and learn the classification path description information of each class node provided by the multi-layer class classification structure module:
Multi-layer hierarchical structure module (input all class nodes and their classification path description information)
Please recognize the category nodes to which the following text sequences belong according to the learned classification categories, and output hierarchical classification results according to the following output format requirements:
[ text sequence ]
Output Format requirement (e.g., outputting the position information of the class node to which the text sequence belongs in the multi-layer hierarchical structure, including the complete node path from the root node to the class node to which the text sequence belongs)
]
After the constructed classification prompt words are input into the large model, the classification process of the large model can be guided more effectively, so that the large model can conduct more accurate classification reasoning based on the structured prompt information, a hierarchical classification result of a text sequence corresponding to the file to be classified is generated, and classification accuracy and classification efficiency are improved.
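The template-filling step can be sketched as follows, with the fixed prompt structure condensed from the example above (the wording and sample arguments are hypothetical paraphrases, not the exact template):

```python
CLASSIFY_TEMPLATE = (
    "You are a file classification assistant.\n"
    "Classification path description information of each category node:\n"
    "{path_descriptions}\n\n"
    "Text sequence to classify:\n{text_sequence}\n\n"
    "Output the complete node path from the root node to the category node "
    "to which the text sequence belongs."
)

def build_classification_prompt(path_descriptions, text_sequence):
    # Insert the replaceable variable parameters into the fixed prompt structure.
    return CLASSIFY_TEMPLATE.format(
        path_descriptions=path_descriptions, text_sequence=text_sequence
    )

prompt = build_classification_prompt(
    'root "design part document" > ... > "mechanical drawing" + descriptions',
    "Assembly drawing of a gearbox with tolerance and roughness marks",
)
print(prompt)
```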
The hierarchical classification result refers to a result of distributing the files to be classified into the most suitable classification hierarchy and classification category according to a preset file multi-hierarchy classification structure, and indicates the specific position of the files to be classified in the multi-hierarchy classification structure.
For example, taking a file to be classified that is a mechanical drawing, a text description generated for the contents of the mechanical drawing serves as the text sequence obtained by parsing. The text sequence and the classification path description information of each category node in the multi-layer classification structure shown in FIG. 1B are input into the large model, and the hierarchical classification result generated by the large model under ideal conditions is obtained, such as "design part document-design document-drawing-mechanical drawing".
Because the large model is a generative model and is affected by different model parameters and settings, it may generate a hierarchical classification result that does not conform to the specification; for example, for the hierarchical classification result of the mechanical drawing, the large model may generate "design part document-design class drawing-design class mechanical drawing". The hierarchical classification result generated by the large model can therefore be further processed to obtain a hierarchical classification result conforming to the multi-layer hierarchical structure, as described in subsequent embodiments.
Regarding the hierarchical classification result, it may further include a classification basis for determining the text sequence as the hierarchical classification category by the large model, where the classification basis represents information or logic on which the large model is based when determining the text sequence as a hierarchical classification category, and may include analysis information about keywords, phrases, sentence structures, context relations, and the like in the text sequence, and knowledge, rules, and patterns learned inside the large model, and by comprehensively considering these factors, the large model can determine the matching degree of the text sequence and the classification category represented by each class node, and give a corresponding classification result.
The classification basis can be used as a basis for explaining the classification result, enhances the interpretability and traceability of the hierarchical classification result output by the large model, enables a user to know how the large model makes a classification decision, helps the user understand why the text is classified into a certain category, and increases the trust degree of the user on the hierarchical classification result.
In the embodiment of the disclosure, the file classification task is converted into the large model generation task, and the class path to which the large model generation file belongs is used for capturing the semantic information of the file more accurately, so that the accumulated error in the layer-by-layer classification process is reduced, and the classification accuracy is remarkably improved. Specifically, the method firstly determines classification path description information of each class node in each level aiming at a preset multi-level classification structure, the information comprises description information of all nodes passing from a level root node to the class node and classification categories thereof, further obtains a text sequence of a file to be classified, and associates the text sequence with the classification path description information of each class node to input the text sequence into a large model so that the large model can more accurately understand and process key information of different level categories, and accurately positions the file to be classified in the multi-level classification structure according to input information reasoning, thereby determining the class classification of the file to be classified in the multi-level classification structure. By the method, the description information of each category is ensured to be completely reserved, the problem of classification performance degradation caused by information loss in the traditional method is avoided, and the accuracy and the classification efficiency of file classification are improved.
In addition, based on the mode, under the condition that the multi-layer classification structure is changed, classification path description information can be reconstructed for newly added or deleted or replaced class nodes according to the multi-layer classification structure, and then the classification path description information is input into a large model to carry out file classification, so that the quick classification of files under the change of a classification class system is realized, the change of the classification class system can be adapted in real time without reconstructing and training the model, the maintenance cost is reduced, and the continuity and the stability of the classification system are ensured.
In some embodiments, in order to enhance the accuracy and efficiency of description information of each class node in the multi-layer class classification structure, the present embodiment provides a method for automatically constructing description information, and for the description information of each class node in each level of the multi-layer class classification structure described in the foregoing embodiments, text generation and understanding capability of a large language model can be utilized, and in combination with algorithm advantages of deep learning, description information of a class represented by the class node can be automatically generated, and the description information can clearly and accurately convey core features and meanings of each class node, so as to provide more visual and understandable class navigation for users.
Based on the portion of the multi-level hierarchical structure where there may be overlap between the classification categories represented by the respective category nodes in the same level, it may be difficult to sufficiently ensure the degree of distinction between the respective category node specification information in the same level by means of the automatic generation capability of the large language model alone. Therefore, in order to further improve the accuracy and the practicability of the description information, the embodiment further introduces a construction process of generating the prompt word, and the great language model is guided to pay more attention to the difference between different types of nodes in the same level when the description information is generated through the carefully designed prompt word, so that the identification degree and the practicability of the description information are enhanced.
That is, referring to the description information generation schematic diagram for the classification category represented by a category node exemplarily shown in FIG. 2, this embodiment can be implemented by:
S201, constructing and generating a prompt word aiming at each level of the multi-level hierarchical structure, wherein the generated prompt word is used for prompting a large language model to enhance the degree of distinction between the description information of each category node of the same level;
The construction of the generated prompt word can be realized through a prompt word template. The prompt word template includes fixed prompt text such as "please generate description information for the following same-level classification categories: {replaceable parameter variable}; the description information of the category nodes in the same level is required to be distinguishable from one another", with the classification categories represented by the category nodes of the same level as the replaceable parameter variable. When constructing the generated prompt word for the current level, the classification categories of the category nodes included in the current level are filled into the prompt word template to generate the generated prompt word of that level.
Alternatively, to guide the large language model more precisely, a generated prompt word can be constructed for each category node in each level of the multi-level classification structure. For example, the prompt word template may include fixed prompt text such as "please generate description information for the classification category {X1} represented by the given category node {M1}, and the description of the category {X1} is required to be distinguishable from the other classification categories {X2, ...} in the same level", and the generated prompt word specific to a category node can then be constructed for each category node by filling the template.
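The level-wide variant of the template filling can be sketched as follows (the fixed prompt text is a hypothetical rendering of the template quoted above):

```python
GEN_TEMPLATE = (
    "Please generate description information for the following same-level "
    "classification categories: {categories}. The description information "
    "of the category nodes in the same level must be clearly "
    "distinguishable from one another."
)

def build_generation_prompt(level_categories):
    # Fill the same-level category names into the replaceable parameter variable.
    return GEN_TEMPLATE.format(categories=", ".join(level_categories))

print(build_generation_prompt(["mechanical drawing", "electrical drawing"]))
```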
S202, inputting all class nodes of the same level in the multi-level hierarchical structure and the generated prompt words into a large language model, and acquiring the description information of all class nodes of the same level output by the large language model.
The large language model used in the present embodiment and the large model for file classification described above are two independent models, and the large language model can fully understand the input class node and automatically generate explanatory information for the classification class represented by the class node. The large language model can be realized by adopting BERT and GPT series models, and can also belong to the same model type as the large model.
And inputting all class nodes of the same level in the multi-level class classification structure and the generated prompt words into a large language model. Through the step, the large language model can fully understand and fuse the category nodes and the information for generating the prompt words, further output the description information of each category node in the same level, accurately reflect the core characteristics of each category node, and effectively improve the distinction degree and the identification degree among the description information of different category nodes in the same level through the guidance of the generated prompt words.
In the embodiment of the disclosure, the degree of distinction between the information is emphasized when the description information of the same-level class node is generated by constructing the generated prompt word, so that the generated description information is prevented from being too similar or fuzzy, blindness and trial-and-error cost of the large language model when the description information is generated are reduced, and the model can generate the information meeting the requirements more quickly after receiving the explicit prompt word, thereby improving the overall generation efficiency and the accuracy and the readability of the generated description information. By guiding the large model in this way, the generated description information of different categories can be as different as possible, so that the classification interval between different classification categories is enlarged, and the classification difficulty of the large language model is reduced.
In some embodiments, because the generation of the large language model has uncertainty, in order to further improve the accuracy and the readability of the description information of each class node in the multi-level classification structure and ensure that the description information of a plurality of class nodes in the same level has enough distinction, the embodiment provides a mechanism for generating the description information and evaluating the description information in multiple batches, which is suitable for the case that the same level in the multi-level classification structure comprises a plurality of class nodes, and the mechanism finds out the description information batch which best meets the requirements through multiple times of generation of the description information and batch screening, thereby ensuring the distinction degree between the description information of a plurality of class nodes in the same level. Referring to the flow chart of steps for generating specification information and evaluating multiple batches exemplarily shown in fig. 3A, this embodiment may be implemented by:
S301, acquiring multi-batch description information which is generated for a plurality of category nodes in the same hierarchy for a plurality of times, wherein each batch comprises description information of all category nodes in the hierarchy;
In the multi-layer hierarchical structure, when a plurality of category nodes are contained in the same layer, the category classification represented by the plurality of category nodes in the same layer is taken as one batch, and the description information of a plurality of batches is generated in a multi-time generation mode, wherein each batch contains the description information of all category nodes in the layer, so that the whole coverage of each batch on the category nodes in the layer is ensured. The description information of the single category node can be realized by the mode described in the foregoing embodiment.
For example, assuming that a third level includes category nodes A, B, C, D, multiple batches of description information may be generated using the category nodes A, B, C, D as one batch, where each batch of description information includes E_A, E_B, E_C, E_D, corresponding one-to-one to the category nodes A, B, C, D.
S302, determining the text description distinction degree of each batch based on the text similarity between the description information of every two types of nodes in the same level aiming at the description information in each batch;
Text similarity is an indicator that measures the similarity of two text paragraphs or sentences in terms of content, vocabulary, structure, etc., and the higher the similarity, the closer the two texts are expressed, possibly containing similar information or views.
The text description distinction is used for representing the degree that the explanatory information of the nodes of different categories can be clearly distinguished from each other in terms of content, expression mode and the like in the same hierarchy, and the higher the degree of distinction is, the more unique the explanatory information of the nodes of different categories is, and the easier the large model is to identify and understand the unique characteristics and differences of each category.
Based on this, if the average text similarity between all explanatory information of the same lot is higher, the text description discrimination for characterizing the lot is lower. Therefore, when determining the text description distinction degree of each batch, the description information generated by each batch can be firstly evaluated by adopting a text similarity calculation method, and the semantic similarity degree between every two kinds of node description information in the same level can be evaluated. The text similarity can be determined by adopting a calculation mode such as cosine similarity and Jaccard similarity, and particularly, the most suitable algorithm can be selected according to actual requirements. Then, based on the calculated text similarity, the text description distinction degree can be determined based on the reciprocal of the similarity, the difference value or other index calculation modes capable of reflecting the information difference.
Regarding the calculation of the text description discrimination, it is possible to determine by calculating the text description discrimination between every two pieces of explanatory information in the same batch, thereby obtaining the text description discrimination average value of the whole batch. That is, referring to the computational schematic of the textual description discrimination of a batch exemplarily shown in fig. 3B, the textual description discrimination may be determined by:
S3021, respectively determining the complement of the text similarity between every two pieces of description information relative to 1 as the text description distinction between the two pieces of description information according to a plurality of pieces of description information which are in one-to-one correspondence with each category node in the same batch;
Text similarity between any two pieces of description information in the same batch is calculated using a text similarity algorithm such as cosine similarity, jaccard similarity, edit distance, etc., the similarity value is typically between 0 and 1, where 0 indicates complete dissimilarity and 1 indicates complete identity.
For each pair of explanatory information, a complement of the text similarity with respect to 1 is calculated, that is, if the text similarity of the two explanatory information is S (0≤S≤1), the text description discrimination between them is 1-S, the complement reflects the degree of difference in content of the two explanatory information, and the calculated text description discrimination between each pair of explanatory information is stored.
For example, for the n=4 pieces of description information in batch 2 illustrated in FIG. 3B, n(n-1)/2=6 description information pairs are formed, and the complement of the text similarity between each pair of description information with respect to 1 is calculated, so as to obtain the text description discrimination Fi (1≤i≤n(n-1)/2) of the i-th pair of description information.
And S3022, summing the text description discrimination between all the different pieces of description information, and determining the average discrimination according to the summation result and the logarithm of all the different pieces of description information as the text description discrimination of the batch.
The logarithm of all the different pairwise explanatory information is the number of all the different pairwise explanatory information pairs when the pairwise combination is carried out on a plurality of explanatory information corresponding to each category node in the same batch, and the total number of the explanatory information pairs needing to be subjected to the discrimination calculation is represented.
Based on the above description, the procedure provided in this embodiment for determining the text description discrimination of the whole batch from the pairwise text description discriminations can be expressed as the following formula:
D = ( Σ_{1≤i<j≤n} (1 - sim(text(i), text(j))) ) / ( n(n-1)/2 )
wherein text(i) represents the description information of the i-th category, and sim(·,·) represents the text similarity calculation function. That is, for the description information of n category nodes of a certain level, the similarity is calculated pairwise, the complement with respect to 1 is taken as the text description discrimination of the description information pair (text(i), text(j)) of the node pair (i, j), and finally the average of the text description discriminations of the n(n-1)/2 description information pairs is taken as the text description discrimination D of the batch.
This step sums up the text description distinction between all the different pieces of explanatory information calculated in step S3021, and the result of the summation obtained reflects the degree of the overall difference between the explanatory information in the entire batch. Next, the result of the summation is divided by the logarithm to obtain an average distinction degree as the text description distinction degree of the batch, the average distinction degree reflecting the average degree of difference between the explanatory information in the batch, the higher the average distinction degree, the more unique in content the explanatory information in the batch, and the easier it is to distinguish.
Alternatively, the text similarity can be computed between every two pieces of description information in the same batch, the similarities summed over all distinct pairs, and the average taken; the text description discrimination of the batch is then evaluated as the complement of this average with respect to 1. It will be appreciated that the text description discrimination may be calculated in other suitable ways, and the application is not limited in this regard.
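As a concrete illustration, the pairwise-complement averaging described above can be sketched in a few lines. difflib.SequenceMatcher is used here only as a stand-in similarity function, since the embodiment does not prescribe a particular text similarity measure:

```python
from difflib import SequenceMatcher
from itertools import combinations

def batch_discrimination(descriptions):
    # Average the complement of the pairwise text similarity over all
    # n(n-1)/2 distinct description pairs in the batch.
    pairs = list(combinations(descriptions, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

A batch of identical descriptions scores 0, while a batch of mutually distinct descriptions scores closer to 1.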
S303: use the batch of description information with the highest text description discrimination among the multiple batches as the description information of the category nodes in the same hierarchy.
After the multiple batches of description information for the category nodes of the same level have been obtained, the text description discrimination of each batch is compared, and the batch with the highest discrimination is selected as the final description information for the category nodes of that level. This ensures that the output descriptions accurately reflect the characteristics of the category nodes and are sufficiently distinct, avoiding confusion and misunderstanding.
For example, for the X category nodes in the same hierarchy shown in fig. 3A, 4 batches of description information are generated, each containing description information in one-to-one correspondence with the X category nodes. Taking batch 1 as an example, its text description discrimination is determined by computing the pairwise text similarity among its X descriptions; the discriminations of batches 2-4 are obtained in the same way. The batch with the highest text description discrimination, say batch 2, is selected from batches 1-4; its X descriptions become the description information of the X category nodes at that level, and batches 1, 3 and 4 are discarded.
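The batch screening in this example can be sketched as follows; SequenceMatcher again stands in for whatever similarity function is actually deployed, and the batch contents are invented purely for illustration:

```python
from difflib import SequenceMatcher
from itertools import combinations

def discrimination(batch):
    # Complement-of-similarity averaged over all description pairs.
    pairs = list(combinations(batch, 2))
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(1.0 - s for s in sims) / len(pairs) if pairs else 0.0

def select_batch(batches):
    # Keep the batch whose descriptions differ most from one another;
    # the remaining batches are discarded.
    return max(batches, key=discrimination)
```

Near-duplicate descriptions within a batch drive its score toward 0, so such batches lose the screening.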
In the embodiment of the disclosure, generating description information for the category nodes of the same hierarchy multiple times ensures that each category node has several candidate descriptions, avoiding the randomness a single generation may bring. Determining the text description discrimination of each batch quantitatively evaluates how different the descriptions of the category nodes within a batch are, and keeping the batch with the highest discrimination optimizes the description of every category node, maximizing the differences between the extended descriptions of different categories and effectively enlarging the classification margin between them. This optimization not only improves the distinguishability of the information but also eases the large language model's subsequent classification task.
In some embodiments, regarding the determination of the classification path description information of each class node in each hierarchy described in the foregoing embodiments, in order to more effectively convey the location of the class node and its belonging path in the multi-layer class classification structure, this embodiment provides a manner of generating classification path description information based on the set symbol links, which may be implemented by:
First, for each node passed from the hierarchical root node to the category node, the node and the description information of the classification category it represents are organized into one information item. That is, an information item contains both the node and the description of its classification category; if Y nodes are passed from the root node to the category node, Y information items are formed.
Then, in the order of the nodes along the path, the information items are linked with a set symbol to form the classification path description information of the node. The set symbol connects the information items and gives the classification path description structure and logic: each node and its classification information are strung together in order into a clear, coherent classification path. The resulting path intuitively shows the position of the category node within the classification structure and, through the detailed classification information it carries, provides strong support for understanding and distinguishing the different categories.
For example, for the multi-level hierarchical structure schematically illustrated in fig. 4, the classification path description information of each category node can be generated by linking with the set symbol "|"; the classification path description information of each category node may then be expressed as:
Level-1 category { one }: { description information }
......
Level-1 category { two }: { description information } | Level-2 category { 1 }: { description information } | Level-3 category { B }: { description information }
......
Level-1 category { two }: { description information } | Level-2 category { 3 }: { description information } | Level-3 category { C }: { description information } | Level-4 category { d }: { description information }
As shown in fig. 4, there are 13 category nodes in total, so 13 pieces of classification path description information are constructed. The classification path description information of all category nodes in the multi-layer classification structure forms a set that serves as the description of that structure and is input into the large model, so that the large model accurately understands the file classification system and its logic.
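One possible sketch of assembling classification path description information via set-symbol links; the nested-dictionary tree layout, the key names, and the exact item template are illustrative assumptions, not the patented format:

```python
def build_path_descriptions(tree, sep=" | "):
    # Walk a nested category tree and emit one classification-path-description
    # string per node, chaining "level-k category {name}: {desc}" items.
    results = {}

    def walk(node, prefix_items, depth):
        item = f"level-{depth} category {{{node['name']}}}: {{{node['desc']}}}"
        items = prefix_items + [item]
        results[node['name']] = sep.join(items)
        for child in node.get('children', []):
            walk(child, items, depth + 1)

    for root_child in tree.get('children', []):
        walk(root_child, [], 1)
    return results
```

Every node's string accumulates the items of all its ancestors, so deeper nodes carry their full path.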
In the embodiment of the disclosure, each category node is defined by its complete classification path, which avoids category confusion or misunderstanding caused by missing information. By accumulating the description information of every node and its classification category along the path, the completeness of the information about the classification category represented by any category node is fully guaranteed. The resulting classification path description information not only carries rich semantic information but also markedly improves the discrimination between different categories, conveying the node's classification attributes and semantic characteristics along with its position.
In some embodiments, the large model determines, according to the classification path description information of each category node, the category node to which the text sequence corresponding to the file to be classified belongs, and generates the category sequence formed by the nodes passed from the hierarchical root node to that category node as the hierarchical classification result of the text sequence.
Because the large model is a generative model, the hierarchical classification result it produces is not necessarily identical, character for character, to the category descriptions in the multi-layer classification structure; it varies with model parameters and settings. For example, if the correct result under the multi-level structure is "car|new energy|three electric system", the large model may generate "car|new energy three electric system". Therefore, to ensure that the final hierarchical classification of the file to be classified is consistent with a category path in the preset multi-level classification structure, to improve accuracy, and to avoid errors caused by imprecise model output, the hierarchical classification result can be further optimized after it is obtained by matching it against the classification structure level by level on character similarity, through the following steps:
A1: perform a layer-by-layer character matching similarity calculation between the category sequence output by the large model and the category nodes of the multi-layer classification structure, and determine the node path in the structure with the highest character similarity to the category sequence. A2: use the category sequence represented by all nodes on that path as the updated hierarchical classification result of the text sequence corresponding to the file to be classified.
In this embodiment, the class sequence output by the large model is compared with the preset multi-layer class classification structure layer by layer, and the node path in the multi-layer class classification structure with the highest character similarity can be found by calculating the character matching similarity between the class represented by the class node and each layer class in the class sequence, so as to correct the classification deviation possibly generated by the large model and ensure that the final classification result accords with the preset multi-layer class classification structure.
To calculate the character matching similarity between two character strings (i.e., the category at each level of the model-generated category sequence and the category represented by the class node at the corresponding level of the multi-level classification structure), this embodiment uses the shortest edit distance. The shortest edit distance (also called the Levenshtein distance) is the minimum number of edit operations — inserting, deleting, or replacing a character — required to convert one string into the other. Using dynamic programming, it is computed with the recurrence:

D[i][0] = i, D[0][j] = j;
D[i][j] = D[i-1][j-1], if s1[i] = s2[j];
D[i][j] = 1 + min(D[i-1][j], D[i][j-1], D[i-1][j-1]), otherwise.

With this recurrence, D[m][n] (where m and n are the lengths of the two strings s1 and s2) gives the shortest edit distance, and the character matching similarity can then be taken as its complement after length normalisation, e.g. 1 − 2·D[m][n]/(m+n).
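The dynamic-programming computation can be sketched directly; the length normalisation in char_match_similarity (the complement of 2·D[m][n]/(m+n)) is one plausible reading of the similarity definition above:

```python
def edit_distance(s1, s2):
    # Levenshtein distance via dynamic programming: D[i][j] is the minimum
    # number of insert/delete/replace operations turning s1[:i] into s2[:j].
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all i characters
    for j in range(n + 1):
        D[0][j] = j          # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j],      # delete
                                  D[i][j - 1],      # insert
                                  D[i - 1][j - 1])  # replace
    return D[m][n]

def char_match_similarity(s1, s2):
    # Similarity as the complement of the length-normalised edit distance.
    if not s1 and not s2:
        return 1.0
    return 1.0 - 2.0 * edit_distance(s1, s2) / (len(s1) + len(s2))
```

For example, edit_distance("kitten", "sitting") is 3 (two replacements and one insertion).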
Take the earlier example in which the hierarchical classification result defined by the multi-level classification structure is "car|new energy|three electric system" while the large model generates "car|new energy three electric system". For the third level, the two strings to compare are s1 = "三电系统" ("three electric system", 4 characters) and s2 = "新能源三电系统" ("new energy three electric system", 7 characters), and the shortest edit distance between them is computed as follows:

Initialize a matrix D of size (len(s1)+1) x (len(s2)+1). The first row and first column are initialized to 0..len(s2) and 0..len(s1) respectively, representing the edit distance for converting an empty string into a prefix of the other string; ( ) marks the values still to be filled:

[0, 1, 2, 3, 4, 5, 6, 7]
[1, (), (), (), (), (), (), ()]
[2, (), (), (), (), (), (), ()]
[3, (), (), (), (), (), (), ()]
[4, (), (), (), (), (), (), ()]

The rest of the matrix is filled according to the recurrence for the shortest edit distance. For each position (i, j), s1[i-1] is compared with s2[j-1]:

If they are equal, D[i][j] = D[i-1][j-1] (no edit is needed).
If they are not equal, D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + 1) (corresponding to a delete, an insert, and a replace, respectively).

The filled matrix D is:

[0, 1, 2, 3, 4, 5, 6, 7]
[1, 1, 2, 3, 3, 4, 5, 6]
[2, 2, 2, 3, 4, 3, 4, 5]
[3, 3, 3, 3, 4, 4, 3, 4]
[4, 4, 4, 4, 4, 5, 4, 3]

Finally, D[len(s1)][len(s2)] = D[4][7] = 3 is the minimum edit distance required to convert s1 into s2: the three characters of "新能源" ("new energy") must be inserted. The character matching similarity then follows from the shortest edit distance, here 1 − 2·3/(4+7) ≈ 0.45.
In this way, the character matching similarity between each node of the category sequence output by the large model and the corresponding node of the multi-layer classification structure can be computed dynamically, the best-matching node path found, and the hierarchical classification result updated: the category path in the multi-level classification structure with the highest character similarity to the model-generated classification is selected as the final hierarchical classification result.
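A minimal sketch of this level-by-level matching. The scoring rule (average per-level similarity, penalising length mismatch) is an assumption, and SequenceMatcher stands in for the edit-distance-based similarity of the embodiment:

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    # Stand-in for the shortest-edit-distance similarity of the embodiment.
    return SequenceMatcher(None, a, b).ratio()

def best_matching_path(model_sequence, candidate_paths):
    # Score each preset node path by its per-level character similarity to the
    # model-generated category sequence, and keep the best-scoring path.
    def score(path):
        levels = min(len(model_sequence), len(path))
        matched = sum(char_similarity(model_sequence[k], path[k])
                      for k in range(levels))
        return matched / max(len(model_sequence), len(path), 1)
    return max(candidate_paths, key=score)
```

The returned path, not the raw model output, then becomes the final hierarchical classification result.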
Corresponding to the foregoing embodiment of the document classification method, referring to fig. 5, the present application further provides an embodiment of a document classification apparatus, where the apparatus includes:
a classification preprocessing module 501, configured to determine, for a preset multi-level classification structure, classification path description information of each class node in each level, where the classification path description information is determined based on all nodes passing from a level root node to the class node and description information of classification class represented by each node;
The file analysis module 502 is configured to obtain a text sequence obtained by analyzing a file to be classified;
And the large model classification module 503 is configured to input the text sequence and the classification path description information of each class node into a large model, and obtain a hierarchical classification result that is output by the large model after reasoning according to the classification path description information of each class node, where the result indicates a hierarchical classification class to which the file to be classified belongs in the multi-layer hierarchical classification structure.
In some embodiments, the apparatus is further configured to:
Constructing a generated prompt word aiming at each level of the multi-level hierarchical structure, wherein the generated prompt word is used for prompting a large language model to enhance the degree of distinction between the description information of each category node of the same level;
And inputting each class node of the same level in the multi-level hierarchical structure and the generated prompt word into a large language model, and acquiring the description information of each class node of the same level output by the large language model.
In some embodiments, where the same level in the multi-level hierarchical structure includes multiple category nodes, the apparatus further comprises:
The batch generation module is used for acquiring multi-batch description information which is generated for a plurality of category nodes in the same level for a plurality of times, wherein each batch comprises description information of all category nodes in the level;
The distinguishing degree calculation module is used for determining the text description distinguishing degree of each batch based on the text similarity between the description information of every two types of nodes in the same level aiming at the description information in each batch;
And the screening module is used for taking a batch of description information with the highest text description distinction degree in the multiple batches of description information as the description information of the multiple category nodes in the same hierarchy.
In some embodiments, the discrimination computation module is specifically configured to:
For a plurality of pieces of description information corresponding to each category node one by one in the same batch, respectively determining the complement of the text similarity between every two pieces of description information relative to 1, and taking the complement as the text description distinction between the two pieces of description information;
And summing the text description discriminations over all distinct description pairs, and determining the average discrimination from the sum and the number of distinct description pairs, as the text description discrimination of the batch.
In some embodiments, the classification preprocessing module is specifically configured to:
for each node passing from the hierarchical root node to the class node, organizing the description information of the class represented by the node into an information item;
And integrating all the information items through set symbol links according to the path sequence of the passed nodes to form the classified path description information of the nodes.
In some embodiments, the large model determines a category node to which the text sequence corresponding to the file to be classified belongs according to the classification path description information of each category node, and generates a category sequence formed from the hierarchical root node to a node through which the category node belongs as a hierarchical classification result of the text sequence corresponding to the file to be classified, where the apparatus further includes:
Performing character matching similarity calculation on the class sequence output by the large model and the class nodes of the multi-layer class classification structure layer by layer, and determining a node path in the multi-layer class classification structure, which has the highest character similarity with the class sequence;
and taking the class sequences represented by all the nodes in the node path as a hierarchical classification result after the text sequences corresponding to the files to be classified are updated.
In some embodiments, the large model classification module is specifically configured to:
Constructing a classification prompt word for classifying the document according to the text sequence and the classification path description information of each category node, wherein the classification prompt word at least further includes descriptions of the classification task to be performed by the large model and of the required classification output;
and inputting the classification prompt words into the large model so that the large model generates a hierarchical classification result of the text sequence corresponding to the file to be classified according to the classification prompt words.
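A hedged sketch of what such a classification prompt word might look like; the wording, section order, and output requirement are illustrative assumptions rather than the prompt actually claimed:

```python
def build_classification_prompt(text_sequence, path_descriptions):
    # Assemble a classification prompt: task statement, the catalogue of
    # classification-path descriptions, the document text, and the required
    # output format (a '|'-linked category path).
    catalogue = "\n".join(path_descriptions)
    return (
        "Task: assign the document below to exactly one category path "
        "from the catalogue.\n"
        f"Category catalogue:\n{catalogue}\n"
        f"Document text:\n{text_sequence}\n"
        "Output: the chosen path only, levels joined by '|'."
    )
```

The assembled string is then sent to the large model as a single classification prompt.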
In some embodiments, the file parsing module is specifically configured to:
In the case where the file to be classified includes a picture, obtaining at least one of the text information contained in the picture, the file name, and a textual description of the picture's content as the text sequence of the picture;
In the case that the file to be classified includes audio or video, text information obtained by converting speech included in the audio or video is acquired as a text sequence of the audio or video.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the application without undue effort.
The embodiment of the application further provides an electronic device, the schematic structural diagram of which is shown in fig. 6, the electronic device 600 comprises at least one processor 601, a memory 602 and a bus 603, the at least one processor 601 is electrically connected with the memory 602, the memory 602 is configured to store at least one computer executable instruction, and the processor 601 is configured to execute the at least one computer executable instruction, thereby executing the steps of any file classification method provided in any embodiment or any optional implementation manner of the application.
Further, the processor 601 may be an FPGA (Field-Programmable Gate Array) or another device with logic processing capability, such as an MCU (Microcontroller Unit) or a CPU (Central Processing Unit).
The embodiment of the application also provides another readable storage medium, which stores a computer program for implementing the steps of any one of the file classification method provided by any one of the embodiments or any one of the optional implementations of the application when the computer program is executed by a processor.
The readable storage medium provided by the embodiments of the present application includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROMs (Read-Only Memories), RAMs (Random Access Memories), EPROMs (Erasable Programmable Read-Only Memories), EEPROMs (Electrically Erasable Programmable Read-Only Memories), flash memories, magnetic cards, or optical cards. That is, a readable storage medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather to enable any modification, equivalent replacement, improvement or the like to be made within the spirit and principles of the application.

Claims (11)

1.一种文件分类方法,其特征在于,所述方法包括:1. A file classification method, characterized in that the method comprises: 针对预设的多层级分类结构,确定每一层级中每一类别节点的分类路径描述信息;其中,所述分类路径描述信息被基于从层级根节点到该类别节点经过的所有节点以及每一节点所表示的分类类别的说明信息确定得到;For a preset multi-level classification structure, determining classification path description information of each category node in each level; wherein the classification path description information is determined based on all nodes from the level root node to the category node and description information of the classification category represented by each node; 获取解析待分类文件得到的文本序列;Obtain the text sequence obtained by parsing the file to be classified; 将所述文本序列与所述每一类别节点的分类路径描述信息关联输入大模型,并获取所述大模型根据所述每一类别节点的分类路径描述信息推理后输出的层级分类结果,该结果表明所述待分类文件在所述多层级分类结构中所属的层级分类类别。The text sequence is associated with the classification path description information of each category node and input into the big model, and the hierarchical classification result output by the big model after reasoning based on the classification path description information of each category node is obtained, and the result indicates the hierarchical classification category to which the file to be classified belongs in the multi-level classification structure. 2.根据权利要求1所述的方法,其特征在于,所述方法还包括:2. 
The method according to claim 1, characterized in that the method further comprises: 针对所述多层级分类结构的每个层级,构建生成提示词;其中,所述生成提示词用于提示大语言模型增强同一层级的各个类别节点的说明信息之间的区分度;For each level of the multi-level classification structure, construct a generation prompt word; wherein the generation prompt word is used to prompt the large language model to enhance the distinction between the description information of each category node at the same level; 将所述多层级分类结构中同一层级的各个类别节点与所述生成提示词一同输入至大语言模型,获取所述大语言模型输出的对同一层级中各个类别节点的说明信息。Each category node at the same level in the multi-level classification structure is input into a large language model together with the generated prompt word, and description information of each category node at the same level output by the large language model is obtained. 3.根据权利要求1或2所述的方法,其特征在于,在所述多层级分类结构中同一层级包括多个类别节点的情况下,所述方法还包括:3. The method according to claim 1 or 2, characterized in that, when the same level in the multi-level classification structure includes multiple category nodes, the method further comprises: 获取针对所述同一层级中的多个类别节点多次生成得到的多批次说明信息;每一批次包括该层级所有类别节点的说明信息;Acquire multiple batches of description information generated multiple times for multiple category nodes in the same level; each batch includes description information of all category nodes in the level; 针对每一批次中的说明信息,基于同一层级中两两类别节点的说明信息之间的文本相似度,确定该批次的文本描述区分度;For each batch of description information, the text description discrimination of the batch is determined based on the text similarity between the description information of each pair of category nodes in the same level; 将所述多批次说明信息中文本描述区分度最高的一批说明信息,作为该同一层级中的多个类别节点的说明信息。A batch of description information with the highest text description differentiation among the multiple batches of description information is used as description information for multiple category nodes in the same level. 4.根据权利要求3所述的方法,其特征在于,所述基于同一层级中两两类别节点的说明信息之间的文本相似度,确定该批次的文本描述区分度,包括:4. 
The method according to claim 3, characterized in that the determining the text description discrimination of the batch based on the text similarity between the description information of two categories of nodes in the same level comprises: 针对同一批次中与各个类别节点一一对应的多个说明信息,分别确定每两个说明信息之间的文本相似度相对于1的补数,作为该两个说明信息之间的文本描述区分度;For multiple description information corresponding to each category node in the same batch, respectively determine the complement of the text similarity between each two description information relative to 1 as the text description discrimination between the two description information; 将所有不同的两两说明信息之间的文本描述区分度求和,并根据求和结果与所有不同的两两说明信息的对数确定出平均区分度,作为该批次的文本描述区分度。The text description discriminations between all different pairwise description information are summed up, and the average discrimination is determined based on the summation result and the logarithm of all different pairwise description information, which is used as the text description discrimination of the batch. 5.根据权利要求1所述的方法,其特征在于,所述确定每一层级中每一类别节点的分类路径描述信息,包括:5. The method according to claim 1, wherein determining the classification path description information of each category of nodes in each level comprises: 对于从层级根节点到该类别节点所经过的每个节点,将该节点与其所表示的分类类别的说明信息组织成一个信息项;For each node from the root node of the hierarchy to the category node, organize the description information of the node and the classification category it represents into an information item; 按照所经过节点的路径顺序,将各个信息项通过设定符号链接整合,形成该类别节点的分类路径描述信息。According to the path sequence of the passed nodes, each information item is integrated by setting symbolic links to form the classification path description information of the category node. 6.根据权利要求1所述的方法,其特征在于,所述大模型依据该每一类别节点的分类路径描述信息,确定所述待分类文件对应的文本序列所属的类别节点,并生成从所述层级根节点到该所属的类别节点所经过的节点形成的类别序列,作为所述待分类文件对应的文本序列的层级分类结果;所述方法还包括:6. 
The method according to claim 1 is characterized in that the large model determines the category node to which the text sequence corresponding to the to-be-classified file belongs based on the classification path description information of each category node, and generates a category sequence formed by the nodes passed from the hierarchical root node to the category node as the hierarchical classification result of the text sequence corresponding to the to-be-classified file; the method further comprises: 将所述大模型输出的类别序列与所述多层级分类结构进行逐层级的类别节点的字符匹配相似度计算,确定出与所述类别序列的字符相似度最高的、所述多层级分类结构中的节点路径;Calculate the character matching similarity of the category nodes of the category sequence output by the large model and the multi-level classification structure layer by layer, and determine the node path in the multi-level classification structure with the highest character similarity to the category sequence; 将由所述节点路径中所有节点所表示的类别序列,作为所述待分类文件对应的文本序列更新后的层级分类结果。The category sequence represented by all nodes in the node path is used as the hierarchical classification result after the text sequence corresponding to the file to be classified is updated. 7.根据权利要求1所述的方法,其特征在于,所述将所述文本序列与所述每一类别节点的分类路径描述信息关联输入大模型,包括:7. 
7. The method according to claim 1, wherein the associating the text sequence with the classification path description information of each category node and inputting them into the large model comprises:
constructing a classification prompt for document classification from the text sequence and the classification path description information of each category node, the classification prompt at least further comprising description information of the classification task to be performed by the large model and of the corresponding classification output requirements; and
inputting the classification prompt into the large model, so that the large model generates, according to the classification prompt, the hierarchical classification result of the text sequence corresponding to the file to be classified.

8. The method according to claim 1, wherein the obtaining a text sequence obtained by parsing the file to be classified comprises:
when the file to be classified includes an image, obtaining at least one of the text contained in the image, the file name, and a textual description of the content of the image, as the text sequence of the image; and
when the file to be classified includes audio or video, obtaining the text converted from the speech contained in the audio or video as the text sequence of the audio or video.
9. A file classification device, wherein the device comprises:
a classification preprocessing module, configured to determine, for a preset multi-level classification structure, the classification path description information of each category node in each level, the classification path description information being determined based on all nodes passed through from the hierarchy root node to the category node and the description information of the classification category represented by each node;
a file parsing module, configured to obtain a text sequence obtained by parsing a file to be classified; and
a large model classification module, configured to associate the text sequence with the classification path description information of each category node, input them into a large model, and obtain the hierarchical classification result output by the large model after reasoning over the classification path description information of each category node, the result indicating the hierarchical classification category to which the file to be classified belongs in the multi-level classification structure.

10. An electronic device, comprising a memory and a processor;
the memory being configured to store a computer program; and
the processor being configured to invoke the computer program to implement the method according to any one of claims 1-9.

11. A readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-9.
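The pairwise discrimination metric of claim 4 and the path-description construction of claim 5 can be sketched in code. This is a minimal illustration only: the function names, the choice of similarity measure, and the " > " delimiter are assumptions for the sketch, since the claims do not fix a specific similarity function or delimiter symbol.

```python
from itertools import combinations
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Illustrative text similarity in [0, 1]; the claims leave the measure open."""
    return SequenceMatcher(None, a, b).ratio()


def batch_discrimination(descriptions: list[str]) -> float:
    """Claim 4 sketch: for each distinct pair of sibling-category descriptions,
    take (1 - similarity) as the pairwise discrimination, then average over
    the number of distinct pairs to get the batch discrimination."""
    pairs = list(combinations(descriptions, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - text_similarity(a, b) for a, b in pairs)
    return total / len(pairs)  # divide by the number of distinct pairs, not a logarithm


def path_description(nodes: list[tuple[str, str]], sep: str = " > ") -> str:
    """Claim 5 sketch: organize each (node name, category description) along the
    root-to-node path into an information item, then join the items in path
    order with a predefined delimiter."""
    return sep.join(f"{name}: {desc}" for name, desc in nodes)
```

For example, two identical sibling descriptions yield a batch discrimination of 0.0, fully disjoint ones yield 1.0, and `path_description([("Vehicle", "vehicle documents"), ("Maintenance", "repair records")])` produces a single delimiter-joined string suitable for inclusion in the classification prompt of claim 7.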
CN202510321767.2A 2025-03-18 2025-03-18 File classification method, device, equipment and readable storage medium Pending CN120256636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510321767.2A CN120256636A (en) 2025-03-18 2025-03-18 File classification method, device, equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN120256636A true CN120256636A (en) 2025-07-04

Family

ID=96188588




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination