CN119988694A

CN119988694A - Graph partition retrieval method, system and readable storage medium based on distance perception

Info

Publication number: CN119988694A
Application number: CN202510472804.XA
Authority: CN
Inventors: 王鑫炜; 李恒; 黄炎; 陈书俊
Original assignee: Shanghai Aikesheng Information Technology Co ltd
Current assignee: Shanghai Aikesheng Information Technology Co ltd
Priority date: 2025-04-16
Filing date: 2025-04-16
Publication date: 2025-05-13

Abstract

The invention relates to a graph division retrieval method, a system and a readable storage medium based on distance perception, wherein the method comprises the steps of constructing original data into an initial graph structure; selecting distance measurement, distributing weight for each edge, merging nearest neighbor node pairs based on the weight to obtain a coarsened graph structure, carrying out hierarchical segmentation on the coarsened graph structure, projecting back to the initial graph structure based on segmentation conditions to obtain a plurality of initial cutting edge sets, finding out corresponding nodes according to each initial cutting edge in the initial cutting edge sets, randomly placing the nodes into different partitions, optimizing each initial cutting edge set by taking the sum of minimized cutting edge weights as an objective function to obtain a graph index structure, inputting a query, calling the graph index structure, respectively calculating the distances between the query and the central nodes of each partition of the graph index structure, and selecting the partition with the nearest distance for retrieval. The method reduces the times of cross-partition query and improves the query efficiency.

Description

Graph division retrieval method and system based on distance perception and readable storage medium

Technical Field

The invention relates to the technical field of vector databases, in particular to a graph division retrieval method and system based on distance perception and a readable storage medium.

Background

In the field of computer vision, image matching and retrieval are core tasks, and are widely applied to the fields of face recognition, automatic driving, medical image analysis and the like. With explosive growth of image data volume (such as social media, medical image library, etc.), traditional retrieval methods based on text labels or low-level visual features (color, texture) gradually expose limitations:

1) The manual annotation has strong dependence, the traditional method needs to manually annotate the image content, has high cost and is difficult to cover complex scenes;

2) The feature dimension is high, the image features (such as SIFT and ORB) are usually high-dimensional vectors, the direct matching calculation complexity is as high as O (N2), and large-scale data are difficult to deal with;

semantic gap-low-level features are difficult to capture image semantic information, so that cross-modal retrieval effect is limited.

In order to solve the problems, an Approximate Nearest Neighbor (ANN) method becomes a research hotspot, and the core aim is to remarkably improve the retrieval efficiency on the premise of ensuring a certain precision by a dimension reduction or structured indexing technology.

The currently prevailing ANN methods can be divided into four classes:

1. tree-based methods (e.g., KD-Tree, ball-Tree) accelerate nearest neighbor search by spatial partitioning, but have limited effect on high-dimensional data;

2. based on quantization methods (e.g., LSH, PQ) by compressing the feature dimension through quantization, but quantization errors may affect the matching accuracy;

3. hash-based methods (e.g., CNNH, DPSH) achieve fast matching by hash coding, but are not robust to complex geometric deformations and illumination changes.

4. The graph-based method (such as Graph Neural Networks) realizes efficient retrieval by constructing an image similarity graph structure, has optimal performance, but has the following bottlenecks:

the construction cost is high, and a large amount of computation resources and storage space are consumed for constructing the graph of a large data set (such as millions of images);

the graph division defect is that a traditional graph division algorithm (such as METIS) aims at minimizing edge cut numbers, and can divide high-similarity nodes into different subgraphs, so that index performance is reduced;

Therefore, it is necessary to study a new graph partitioning algorithm to maintain the spatial proximity of data so that the query result is more accurate.

Disclosure of Invention

The invention aims to provide a graph division searching method, a system and a readable storage medium based on distance perception, which are used for solving the technical problems of the existing graph division algorithm, reducing the number of times of cross-partition query and improving the query efficiency and the accuracy of query results.

In order to achieve the above object, the present invention provides a graph division searching method based on distance sensing, comprising the following steps:

constructing the original data into an initial graph structure;

Selecting a distance measure, distributing weight for each edge, merging nearest neighbor node pairs based on the weight to gradually simplify the initial graph structure, and obtaining a simplified coarsened graph structure;

Hierarchical segmentation is carried out on the coarsened graph structure, and based on the segmentation condition, the coarsened graph structure is projected back to the initial graph structure in an iterative mode, and a plurality of initial cutting edge sets are finally obtained;

Finding out corresponding nodes according to each initial cutting edge in the initial cutting edge set, randomly putting the nodes into different partitions, and optimizing each initial cutting edge set one by taking the sum of minimized cutting edge weights as an objective function to obtain a final graph index structure;

and inputting a query, calling the graph index structure, respectively calculating the distances between the query and the central nodes of each partition of the graph index structure, and selecting the partition closest to the query for retrieval.

Optionally, the distance measure is used to measure the proximity between adjacent nodes, and the distance measure is euclidean distance, cosine similarity or manhattan distance.

Optionally, the weights are expressed as follows:

where w _uv represents the weight, d _uv represents the distance between adjacent nodes u and v, ε is a constant.

Based on the same inventive concept, the invention also provides a graph division retrieval system based on distance perception, which comprises:

the diagram construction module is used for constructing the original data into an initial diagram structure;

the coarsening module is used for selecting a distance measure, distributing weight to each edge, merging nearest neighbor node pairs based on the weight so as to gradually simplify the initial graph structure, and obtaining a simplified coarsened graph structure;

the graph segmentation module is used for carrying out hierarchical segmentation on the coarsened graph structure, and carrying out back projection on the initial graph structure in an iterative mode based on the segmentation condition to finally obtain a plurality of initial cutting edge sets;

The cutting edge optimizing module is used for finding out corresponding nodes according to each initial cutting edge in the initial cutting edge set, randomly placing the nodes into different partitions, and optimizing each initial cutting edge set one by taking the sum of minimized cutting edge weights as an objective function to obtain a final graph index structure;

And the retrieval module is used for inputting a query, calling the graph index structure, respectively calculating the distances between the query and the central nodes of the partitions of the graph index structure, and selecting the partition closest to the query for retrieval.

Optionally, the weights are expressed as follows:

Based on the same inventive concept, the present invention also provides a readable storage medium having stored thereon a computer program which, when executed, enables the distance-aware based graph division search method as described above.

According to the graph division retrieval method, system and readable storage medium based on distance perception, the distance information among the nodes is considered in the division process, so that the nodes with the relatively close distance are effectively divided into the same partition, the number of times of cross-partition query is reduced, and the query efficiency is improved. Meanwhile, the spatial proximity of the data can be maintained by adopting a node partition mode of minimizing the sum of the edge cutting weights, so that the query result is more accurate.

Drawings

It will be appreciated by those skilled in the art that the drawings are provided for a better understanding of the invention and do not constitute any limitation on the scope of the invention. Wherein:

fig. 1 is a flowchart of a graph division searching method based on distance sensing according to an embodiment of the present invention.

Detailed Description

As described in the background, conventional graph partitioning methods have focused on minimizing the number of edges, often ignoring the distance between partitioned adjacent nodes, which may result in closer nodes being assigned to different partitions. In order to better maintain a local neighborhood structure, the invention provides a graph division retrieval method based on distance perception, which effectively divides nodes with relatively close distances into the same partition by considering the distance information among the nodes in the division process, thereby reducing the times of cross-partition query and improving the query efficiency. Meanwhile, the spatial proximity of the data can be maintained by adopting a node partition mode of minimizing the sum of the edge cutting weights, so that the query result is more accurate.

The invention will be described in further detail with reference to the drawings and the specific embodiments thereof in order to make the objects, advantages and features of the invention more apparent. It should be noted that the drawings are in a very simplified form and are all to a non-precise scale, merely for the purpose of facilitating and clearly aiding in the description of embodiments of the invention. For a better understanding of the invention with objects, features and advantages, refer to the drawings. It should be understood that the structures, proportions, sizes, etc. shown in the drawings are shown only in connection with the present disclosure for the understanding and reading of the present disclosure, and are not intended to limit the scope of the invention, which is defined by the appended claims, and any structural modifications, proportional changes, or dimensional adjustments, which may be made by the present disclosure, should fall within the scope of the present disclosure under the same or similar circumstances as the effects and objectives attained by the present invention.

As used in this disclosure, the singular forms "a," "an," and "the" include plural referents unless the content clearly dictates otherwise. As used in this disclosure, the term "or" is generally employed in its sense including "and/or" unless the content clearly dictates otherwise.

Referring to fig. 1, the present embodiment provides a graph division searching method based on distance sensing, which includes the following steps:

S1, constructing original data into an initial graph structure;

S2, selecting a distance measure, distributing weight to each edge, and merging nearest neighbor node pairs based on the weight to gradually simplify the initial graph structure, so as to obtain a simplified coarsened graph structure;

s3, carrying out hierarchical segmentation on the coarsened graph structure, and carrying out iterative projection on the initial graph structure based on segmentation conditions to finally obtain a plurality of initial cutting edge sets;

S4, finding out corresponding nodes according to each initial cutting edge in the initial cutting edge set, randomly placing the nodes into different partitions, and optimizing each initial cutting edge set one by taking the sum of minimized cutting edge weights as an objective function to obtain a final graph index structure;

S5, inputting a query, calling the graph index structure, respectively calculating the distances between the query and the central nodes of all the partitions of the graph index structure, and selecting the partition closest to the query for retrieval.

First, S1 is performed on the original data to construct an initial graph structure. In this embodiment, the original data may be constructed into the initial graph structure by using a conventional graph construction algorithm, which is not limited in the present invention.

And S2, selecting a distance measure, distributing weight to each edge, and merging nearest neighbor node pairs based on the weight to gradually simplify the initial graph structure, so as to obtain a simplified coarsened graph structure.

In this embodiment, the distance measure is used to measure the proximity between neighboring nodes, and common choices include euclidean distance, cosine similarity, and manhattan distance, according to the nature of the data set. Depending on the distance metric selected, each edge in the graph may be assigned a weight that helps prioritize the connections between neighboring nodes during the partitioning process.

After selecting the distance metric, each edge is assigned a weight that is inversely proportional to the distance connecting the nodes. This ensures that the edges connecting the closely related nodes get a higher weight. The weights are expressed as follows:

Where w _uv represents the weight, d _uv represents the distance between adjacent nodes u and v, ε is a constant to avoid dividing by zero.

This approach ensures that edges between closer nodes are prioritized during partitioning, thereby preserving local connectivity more efficiently.

The S2 is actually a stepwise simplification of the graph by merging node pairs. In implementations, node pairs that are weighted may be preferentially merged, the weights indicating that the node pairs are closer. By merging the nodes with their nearest neighbors, this stage preserves the critical neighborhood relationships, maintaining the local structure of the graph, which is important for accurate segmentation of the graph. This weighted coarsening method allows the graph to be reduced in size while also maintaining its proximity-based structure.

It should be noted that, the simplified diagrams mentioned herein refer to not only one simplification but also a plurality of simplifications. For example, the initial graph structure includes 100 nodes, 50 nodes remain after the first merging of node pairs, and 25 nodes remain after the second merging of node pairs, and so on, until a minimum and simplest version of the coarsened graph structure is obtained.

And then executing S3, carrying out hierarchical segmentation on the coarsened graph structure, and carrying out iterative mode back projection on the initial graph structure based on the segmentation condition to finally obtain a plurality of initial cutting edge sets.

In this embodiment, the partitions are divided according to the number of data points on the final coarsened graph structure, and the corresponding number of partitions is divided by how many points.

And S4, finding out corresponding nodes according to each initial cutting edge in the initial cutting edge set, randomly placing the nodes into different partitions, and optimizing each initial cutting edge set one by taking the sum of minimized cutting edge weights as an objective function to obtain a final graph index structure.

It is emphasized that the conventional graph partitioning algorithm minimizes the total number of the cut edges by not considering the intensity or importance of each edge, ignores the distance information between the nodes, and easily causes adjacent nodes to be partitioned into different partitions, thereby affecting the query efficiency and recall rate. In the technology, the optimization target is adjusted to minimize the sum of the edge cutting weights instead of the number of edges by introducing the distance-based weights, so that the possibility of separating closely connected nodes is effectively reduced by the design, the quality of the segmentation in the aspect of maintaining a local structure is improved, and meanwhile, the spatial proximity of data can be maintained, so that the query result is more accurate.

In this embodiment, the modified objective function becomes:

where E _cut represents the initial set of edges and w _uv represents the weight of each edge, reflecting the proximity between connected nodes.

In this embodiment, each initial cutting edge set is optimized in turn. The optimization process is further described below using one of the sets of cut edges as an example.

1) And finding out the corresponding graph nodes according to the initial cutting edge set, randomly placing the nodes into different partitions so as to update the initial cutting edge set, then respectively calculating which corresponding objective function value in each updated cutting edge set is the smallest, and ending the optimization of the initial cutting edge set after finding out the smallest value.

2) The optimization of the second initial set of cutting edges is then started, the method being consistent with the procedure described above.

3) And (3) until all the initial cutting edge sets are optimized, ending the whole optimization process.

It should be noted that, the process of updating the initial cutting edge set is the most accurate to traverse all the possibilities, but considering that the points involved are more (for example, the number of involved nodes is m), the schemes of moving the partitions are also correspondingly more (the schemes involved are 2 ^m), so a threshold value x can be set, and the minimum value of the objective function is selected after n partition schemes are calculated. The specific x value needs to be set according to the number of nodes involved, for example x=1/3×2 ^m.

And finally, executing S5, inputting a query, calling the graph index structure, respectively calculating the distances between the query and the central nodes of all the partitions of the graph index structure, and selecting the partition closest to the query for retrieval.

For example, after a query (data) is obtained, the query is first computed with the center node of each partition of the graph index structure obtained by the graph partitioning and searching method based on distance perception, so as to select which partition should be entered for computation, and finally top-k values are returned as final search results, so that the computation steps can be reduced, and the index efficiency is improved.

The recall rate and the query number per second of the graph division retrieval method based on distance perception provided by the invention are compared with those of the existing method through specific experimental data.

Experimental environment:

Ubuntu 20.04.5 LTS

CPU:Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz

256G memory

Data set SIFT1M, GIST

The recall rates for each of the methods for SIFT128, 1M when calculating 12000 distance calculations are shown in table 1 below:

TABLE 1

The number of queries per second for each method for SIFT128, 1M when querying 1000 queries is shown in table 2 below:

TABLE 2

GIST960, 1M recall rates for each method when calculating 12000 distance calculations are shown in table 3 below:

TABLE 3 Table 3

GIST960, 1M the number of queries per second for each method when querying 1000 queries is shown in table 4 below:

TABLE 4 Table 4

As can be seen from tables 1 to 4, the method of the present invention is superior to the conventional method in Recall (Recall) and Query Per Second (QPS) when the initial degrees of the same edges are identical.

Based on the same inventive concept, the embodiment of the invention also provides a graph division retrieval system based on distance perception, which comprises the following steps:

In this embodiment, the distance measure is used to measure the proximity between adjacent nodes, and the distance measure is euclidean distance, cosine similarity, or manhattan distance.

Preferably, the weights are expressed as follows:

Based on the same inventive concept, the embodiment of the invention also provides a readable storage medium, on which a computer program is stored, which when executed can implement the graph division searching method based on distance perception as described above.

The readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, such as, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the preceding. More specific examples of a readable storage medium (a non-exhaustive list) include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, punch cards or in-groove protrusion structures such as those having instructions stored thereon, and any suitable combination of the foregoing. The computer program described herein may be downloaded from a readable storage medium to a respective computing/processing device or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer program from the network and forwards the computer program for storage in a readable storage medium in the respective computing/processing device. Computer programs for performing the operations of the present invention can be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as SMALLTALK, C ++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuits, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information for a computer program, which can execute computer-readable program instructions.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, systems, and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer programs. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the programs, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer programs may also be stored in a readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the readable storage medium storing the computer program includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the computer program which is executed on the computer, other programmable apparatus or other devices implements the functions/acts specified in the flowchart and/or block diagram block or blocks.

In summary, the embodiments of the present invention provide a graph division search method, a system and a readable storage medium based on distance perception, which effectively divide nodes with a relatively close distance into the same partition by considering distance information between the nodes in the division process, thereby reducing the number of times of cross-partition query and improving the query efficiency. Meanwhile, the spatial proximity of the data can be maintained by adopting a node partition mode of minimizing the sum of the edge cutting weights, so that the query result is more accurate.

It should also be appreciated that while the present invention has been disclosed in the context of a preferred embodiment, the above embodiments are not intended to limit the invention. Many possible variations and modifications of the disclosed technology can be made by anyone skilled in the art without departing from the scope of the technology, or the technology can be modified to be equivalent. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims

1. A distance-aware graph partitioning retrieval method, comprising the following steps:

Construct the original data into an initial graph structure;

Selecting a distance metric, assigning a weight to each edge, and merging nearest neighbor node pairs based on the weight to gradually simplify the initial graph structure to obtain a simplified coarsened graph structure;

The coarsened graph structure is hierarchically segmented, and based on the segmentation situation, it is iteratively back-projected onto the initial graph structure, and finally a plurality of initial cut edge sets are obtained;

Find the corresponding node according to each initial cut edge in the initial cut edge set, randomly put the node into different partitions, and optimize each initial cut edge set one by one by minimizing the sum of the cut edge weights as the objective function to obtain the final graph index structure;

A query is input, the graph index structure is called, and the distances between the query and the central nodes of each partition of the graph index structure are calculated respectively, and the partition with the closest distance is selected for retrieval.

2. According to the distance-aware graph partitioning retrieval method according to claim 1, it is characterized in that the distance metric is used to measure the proximity between adjacent nodes, and the distance metric is Euclidean distance, cosine similarity or Manhattan distance.

3. The distance-aware graph partitioning retrieval method according to claim 1, wherein the weight is expressed as follows:

Among them, w _uv represents the weight, d _uv represents the distance between adjacent nodes u and v, and ε is a constant.

4. A distance-aware graph partitioning retrieval system, comprising:

A graph construction module, used to construct the original data into an initial graph structure;

A coarsening module, configured to select a distance metric, assign a weight to each edge, and merge nearest neighbor node pairs based on the weight to gradually simplify the initial graph structure to obtain a simplified coarsening graph structure;

A graph segmentation module, used for hierarchically segmenting the coarsened graph structure, and iteratively back-projecting it onto the initial graph structure based on the segmentation situation, and finally obtaining a plurality of initial cut edge sets;

The edge cutting optimization module is used to find the corresponding node according to each initial edge cutting in the initial edge cutting set, randomly put the nodes into different partitions, and optimize each initial edge cutting set one by one by minimizing the sum of edge cutting weights as the objective function to obtain the final graph index structure;

The retrieval module is used to input a query, call the graph index structure and respectively calculate the distance between the query and the central node of each partition of the graph index structure, and select the partition with the closest distance for retrieval.

5. According to the distance-aware graph partitioning retrieval system of claim 4, it is characterized in that the distance metric is used to measure the proximity between adjacent nodes, and the distance metric is Euclidean distance, cosine similarity or Manhattan distance.

6. The distance-aware graph partitioning retrieval system according to claim 4, wherein the weight is expressed as follows:

7. A readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed, it can implement the distance-aware graph partitioning retrieval method according to any one of claims 1-3.