CN116304736A - Data set matching method based on number taking algorithm - Google Patents
Data set matching method based on number taking algorithm
- Publication number
- CN116304736A (application number CN202310254184.3A)
- Authority
- CN
- China
- Prior art keywords
- data set
- matching
- queue
- data sets
- queues
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/06—Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
- G06F7/08—Sorting, i.e. grouping record carriers in numerical or other ordered sequence according to the classification of at least some of the information they carry
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a data set matching method based on a number taking algorithm, which comprises the following steps: classifying the data sets to be matched through a clustering algorithm to generate slicing queues; acquiring the arrangement coefficient of the data sets in each slicing queue and screening out the data sets whose arrangement coefficient is smaller than a processing threshold; reordering the data sets in each slicing queue from large to small according to the arrangement coefficient; summing the arrangement coefficients of all data sets in each slicing queue to obtain an ordering value, ordering the slicing queues from large to small according to the ordering value, and generating an ordering table; and, when the data sets are matched, selecting the matching order of the slicing queues in the positive order of the ordering table and selecting the matching order of the data sets in the positive order of each slicing queue. The invention effectively optimizes the matching order of the data sets, ensures that data sets of high importance are matched preferentially, effectively reduces the processing amount of the data sets, and improves the matching efficiency of the data sets.
Description
Technical Field
The invention relates to the technical field of data set matching, in particular to a data set matching method based on a number taking algorithm.
Background
Data set matching refers to finding matching records or entities in two or more data sets, and mainly includes primary key matching, fuzzy matching, rule matching, data mining matching, machine learning matching and text matching. A number taking algorithm is an algorithm for managing the order in which customers are served, and is generally used in places that require queuing for service, such as restaurants, banks and hospitals, to ensure that customers are served on a first-come, first-served basis. Existing methods apply the number taking algorithm to data set matching, which effectively improves the efficiency of data set matching and reduces the calculation time.
The prior art has the following defects:
1. The existing data set matching method based on the number taking algorithm mainly divides the data sets into fragments of a fixed size, generates a unique number for each fragment, and adds the fragments to a queue in the order of the numbers. However, because the data sets are sorted and matched after being divided by fixed size, the importance of each data set cannot be obtained, so important data sets may be sorted and processed late, which reduces the data set matching efficiency and the analysis value of the data sets;
2. When existing data sets are matched, all data sets are added to the queue in the order of their numbers for matching, but a large batch of data sets often contains data sets that do not need to be matched; matching these together increases the calculation cost and reduces the matching efficiency.
Disclosure of Invention
The invention aims to provide a data set matching method based on a number taking algorithm, so as to solve the defects mentioned in the background art.
In order to achieve the above object, the present invention provides the following technical solutions: a data set matching method based on a number taking algorithm, the matching method comprising the steps of:
s1: classifying the data sets to be matched through a clustering algorithm, generating slicing queues according to the classification result of the data sets, and assigning a class label to each slicing queue;
s2: acquiring the arrangement coefficient of the data set in each slicing queue, and screening out the data set with the arrangement coefficient smaller than the processing threshold;
s3: reordering the data sets in each slicing queue from large to small according to the arrangement coefficient, summing the arrangement coefficients of all data sets in each slicing queue to obtain an ordering value, ordering the slicing queues from large to small according to the ordering value, and generating an ordering table;
s4: when the data sets are matched, selecting the matching sequence of the slicing queues from the positive sequence of the sorting table, and selecting the matching sequence of the data sets from the positive sequence of the slicing queues;
s5: taking out the data set at the head of the queue, generating a unique identifier for the data set, matching the record in the data set with the target data set, storing the matching result into a result set, and deleting the data set from the slicing queue;
s6: when no unprocessed data set remains in the slicing queues, the matching algorithm ends.
In a preferred embodiment, in step S2, obtaining the arrangement coefficients of the data sets in each slicing queue includes the following steps:
collecting the index parameter and the time parameter of the data sets in each slicing queue, and establishing the arrangement coefficient PL_x of each data set from the index parameter and the time parameter through a formula, in which α and β are respectively the scale coefficients of the index parameter and the time parameter, with α > β > 0 and α + β = 2.468.
In a preferred embodiment, in step S2, a processing threshold PL_y is set, and the arrangement coefficient PL_x is compared with the processing threshold PL_y;
if the arrangement coefficient PL_x of a data set satisfies PL_x < PL_y, the system judges that the data set is unimportant data and does not reach the matching standard, and screens the data set out of the slicing queue;
if the arrangement coefficient PL_x of a data set satisfies PL_x ≥ PL_y, the system judges that the data set reaches the matching standard and sorts the data set into the slicing queue.
In a preferred embodiment, the index parameter is used to represent the value of a data set and is calculated from yw_i, the traffic demand of the data set, qs_i, the missing-value ratio of the data set, and kz_i, the null-value proportion of the data set.
In a preferred embodiment, the time parameter is used to represent the time sensitivity of a data set and is calculated from gx_j, the update frequency of the data set, bc_j, the retention period of the data set, and yc_j, the delay time of the data set.
In a preferred embodiment, in step S1, the class labels assigned to the slicing queues are {A, B, C}, the data sets in slicing queue A are {A1, A2, A3, A4, A5}, the data sets in slicing queue B are {B1, B2, B3}, and the data sets in slicing queue C are {C1, C2, C3, C4}.
In a preferred embodiment, in step S3, the data sets in each slicing queue are reordered from large to small according to the arrangement coefficient, as follows:
s3.1: if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue A is A3 > A2 > A1 > A5 > A4, and the arrangement coefficient PL_x of data set A4 satisfies PL_x < the processing threshold PL_y, the updated order of the data sets in slicing queue A is {A3, A2, A1, A5};
s3.2: if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue B is B2 > B3 > B1, the updated order of the data sets in slicing queue B is {B2, B3, B1};
s3.3: if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue C is C4 > C2 > C3 > C1, and the arrangement coefficients PL_x of data sets C3 and C1 satisfy PL_x < the processing threshold PL_y, the updated order of the data sets in slicing queue C is {C4, C2}.
In a preferred embodiment, in step S3, the sorting of the slice queues from large to small according to the sorting value includes the following steps:
s3.4: summing the arrangement coefficients PL_x of all data sets in each of the {A, B, C} slicing queues to obtain the ordering value of each queue;
s3.5: if the comparison result of the ordering values is B > A > C;
s3.6: sorting the slicing queues from large to small according to the ordering value generates the ordering table {B, A, C}.
In a preferred embodiment, the number of data sets in the same batch whose arrangement coefficient satisfies PL_x < the processing threshold PL_y is calibrated as H_i, and the number of data sets in the same batch whose arrangement coefficient satisfies PL_x ≥ the processing threshold PL_y is calibrated as H_j; the screening rate and the sorting rate of the batch of data sets are obtained by formula calculation, where sc_i denotes the screening rate of the data sets, px_i denotes the sorting rate of the data sets, and (H_i + H_j) is the total number of data sets in the batch; the batch number value is then obtained by formula calculation, where PH_z is the batch number value of the batch of data sets. After the batch number value PH_z is entered into the database, the database automatically sorts the batch of data sets; the data sets of each batch are sorted in the database according to the batch number value PH_z from large to small.
In a preferred embodiment, in step S1, the classification of the data sets to be matched by means of a clustering algorithm comprises the following steps:
s1.1: determining K values of data set clusters, and randomly selecting K points from the data set as initial centroids to serve as representative points of each cluster;
s1.2: calculating the distance from each data set point to each centroid, determining the nearest centroid, and distributing the data set points into corresponding clusters;
s1.3: recalculating the centroid of each cluster as the average value of all data set points in the cluster;
s1.4: repeating steps S1.2 and S1.3 until the cluster centers no longer change; for a new data set to be matched, the data set is assigned to the cluster whose centroid is closest to it.
In the technical scheme, the invention has the technical effects and advantages that:
1. The invention classifies the data sets to be matched through a clustering algorithm to generate slicing queues, acquires the arrangement coefficient of the data sets in each slicing queue, screens out the data sets whose arrangement coefficient is smaller than the processing threshold, reorders the data sets in each slicing queue from large to small according to the arrangement coefficient, sums the arrangement coefficients of all data sets in each slicing queue to obtain an ordering value, orders the slicing queues from large to small according to the ordering value, and generates an ordering table; when the data sets are matched, the matching order of the slicing queues is selected in the positive order of the ordering table, and the matching order of the data sets is selected in the positive order of each slicing queue. This effectively optimizes the matching order of the data sets, ensures that data sets of high importance are matched preferentially, effectively reduces the processing amount of the data sets, and improves the matching efficiency of the data sets;
2. The invention obtains the importance of each data set in each slicing queue by collecting the index parameter and the time parameter and establishing the arrangement coefficient from them through a formula, and screens out the data sets whose arrangement coefficient satisfies PL_x < the processing threshold PL_y before the data sets are sorted into the slicing queues, which effectively reduces the data processing amount and improves the data processing efficiency;
3. The invention obtains the screening rate and the sorting rate of each batch of data sets from the comparison results between the arrangement coefficients PL_x and the processing threshold PL_y, obtains the batch number value from the sorting rate and the screening rate, and finally sorts the batches of data sets in the database from large to small according to the batch number value PH_z, which facilitates later querying and management of the data sets.
Drawings
For a clearer description of embodiments of the present application or of the solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments described in the present invention, and that other drawings may be obtained according to these drawings for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, the data set matching method based on the number taking algorithm according to the present embodiment includes the following steps:
The data sets to be matched are classified through a clustering algorithm, slicing queues are generated according to the classification result of the data sets, and each slicing queue is assigned a class label, so that the data sets are divided into several parts to facilitate subsequent processing. The arrangement coefficient of the data sets in each slicing queue is acquired, and the data sets whose arrangement coefficient is smaller than the processing threshold are screened out. The data sets in each slicing queue are reordered from large to small according to the arrangement coefficient, the arrangement coefficients of all data sets in each slicing queue are summed to obtain an ordering value, the slicing queues are ordered from large to small according to the ordering value, and an ordering table is generated. When the data sets are matched, the matching order of the slicing queues is selected in the positive order of the ordering table, and the matching order of the data sets is selected in the positive order of each slicing queue. The data set at the head of the queue is taken out, a unique identifier is generated for it, the records in the data set are matched against the target data set, the matching result is stored in the result set, and the data set is deleted from the slicing queue. When no unprocessed data set remains in the slicing queues, the matching algorithm ends.
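The flow of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the weighted-sum form of the arrangement coefficient, the threshold value, and the helper names (arrangement_coefficient, match_records) are assumptions, since the patent gives its formula only as an image and leaves the record-matching technique open; the queues are drained one after another in ordering-table order, which is one reading of steps S4-S6.

```python
import uuid
from collections import defaultdict

ALPHA, BETA = 1.5, 0.968        # assumed scale coefficients: ALPHA > BETA > 0, ALPHA + BETA = 2.468
PROCESSING_THRESHOLD = 1.0      # assumed processing threshold PL_y

def arrangement_coefficient(ds):
    # assumed weighted combination of the index parameter and the time parameter
    return ALPHA * ds["index_param"] + BETA * ds["time_param"]

def match_records(ds, target_records):
    # placeholder for the actual record-matching step (primary key, fuzzy, rule matching, ...)
    return [r for r in ds["records"] if r in target_records]

def match_datasets(datasets, cluster_labels, target_records):
    # S1: group the classified data sets into slicing queues by cluster label
    queues = defaultdict(list)
    for ds, label in zip(datasets, cluster_labels):
        queues[label].append(ds)

    # S2: screen out data sets whose arrangement coefficient is below the threshold
    for label in queues:
        queues[label] = [ds for ds in queues[label]
                         if arrangement_coefficient(ds) >= PROCESSING_THRESHOLD]

    # S3: sort each queue (descending) and build the ordering table from the queue sums
    for queue in queues.values():
        queue.sort(key=arrangement_coefficient, reverse=True)
    ordering_table = sorted(queues, reverse=True,
                            key=lambda lbl: sum(arrangement_coefficient(d) for d in queues[lbl]))

    # S4-S6: take data sets from the head of each queue, in the order given by the ordering table
    results = {}
    for label in ordering_table:
        while queues[label]:
            ds = queues[label].pop(0)       # S5: data set at the head of the queue
            ds_id = str(uuid.uuid4())       # unique identifier for the data set
            results[ds_id] = match_records(ds, target_records)
    return results                          # S6: all queues empty, matching ends
```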
According to the method, the data sets to be matched are classified through a clustering algorithm to generate the slicing queues, the arrangement coefficient of the data sets in each slicing queue is acquired, the data sets whose arrangement coefficient is smaller than the processing threshold are screened out, the data sets in each slicing queue are reordered from large to small according to the arrangement coefficient, the arrangement coefficients of all data sets in each slicing queue are summed to obtain an ordering value, the slicing queues are ordered from large to small according to the ordering value, and an ordering table is generated; when the data sets are matched, the matching order of the slicing queues is selected in the positive order of the ordering table and the matching order of the data sets is selected in the positive order of each slicing queue, which effectively optimizes the matching order of the data sets, ensures that data sets of high importance are matched preferentially, effectively reduces the processing amount of the data sets, and improves the matching efficiency of the data sets.
In the embodiment, the data sets to be matched are classified through a clustering algorithm, wherein the clustering algorithm comprises K-means clustering, hierarchical clustering and DBSCAN clustering;
wherein:
classifying the data set to be matched by K-means clustering comprises the following steps:
(1) Determining a K value: the number of clusters, i.e., the K value, first needs to be determined, which can be determined based on the characteristics of the dataset, the target, and the application requirements;
(2) Randomly initializing a centroid: randomly selecting K points from the data set as initial centroids to serve as representative points of each cluster;
(3) Calculating the distance: for each data set point, calculating its distance to each centroid, determining the nearest centroid, and assigning the data set point to the corresponding cluster;
(4) Updating the centroid: for each cluster, recalculating the centroid position, namely the average value of all data set points in the cluster;
(5) Repeating the steps (3) and (4) until the cluster center is no longer changed or the maximum iteration number is reached;
(6) And for the new data set to be matched, distributing the new data set to the corresponding cluster according to the centroid which is closest to the new data set, taking the data set in the cluster as a matching result, and carrying out subsequent processing and analysis.
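The K-means steps (1)-(6) above can be sketched as follows; the Euclidean distance metric, the convergence test and the NumPy-based implementation are assumptions, since the patent describes the steps only in words. `points` is assumed to be an (n, d) array of data set points.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (2) randomly pick K data set points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # (3) assign each point to the cluster of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (4) recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # (5) stop when the cluster centers no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

def assign_new(point, centroids):
    # (6) a new data set to be matched goes to the cluster with the nearest centroid
    return int(np.linalg.norm(centroids - point, axis=1).argmin())
```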
Specifically, in this embodiment, the K value of the data set is determined by the contour (silhouette) coefficient method, with the following processing logic: initialize the K value with K = 2; for each candidate K value, calculate for each data set point the average distance to all other data set points in the same cluster, recorded as P_avg, and the average distance to the data set points of each other cluster, taking the minimum of these averages, recorded as P_min; calculate the contour coefficient of each data set point as LK_x = (P_min − P_avg) / max(P_avg, P_min); calculate the average of the contour coefficients LK_x over all data set points, recorded as LK_avg. If the K value that maximizes the average LK_avg is not 1, that K value is taken as the data set K value; if the K value that maximizes LK_avg is 1, other methods (e.g., the elbow rule or cross-validation) need to be considered to select the data set K value.
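The contour-coefficient selection of K described above can be sketched as follows; the use of scikit-learn and the candidate range of K values are assumptions made only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(points, k_min=2, k_max=10):
    scores = {}
    for k in range(k_min, min(k_max, len(points) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        scores[k] = silhouette_score(points, labels)   # average of LK_x over all points = LK_avg
    best_k = max(scores, key=scores.get)               # K value with the largest LK_avg
    return best_k, scores
```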
Classifying the data sets to be matched through hierarchical clustering comprises the following steps:
(1) Initializing each data set point into a single cluster;
(2) Calculating the similarity between each pair of data set points, for example using Euclidean distance or cosine similarity;
(3) Combining the two most similar data sets into a new cluster according to the similarity;
(4) Recalculating the similarity of the new cluster and other clusters, and carrying out merging again until all the data set points are merged into one cluster or the stopping condition is met;
(5) The clustering result can be intuitively represented by drawing a clustering tree diagram;
(6) The final clustering result is selected according to the requirements, for example, meaningful clustering results can be selected by cutting off the dendrogram.
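A brief sketch of the hierarchical clustering steps above; the SciPy implementation, the 'average' linkage method and the way the dendrogram is cut are assumptions, since the patent leaves the similarity measure and the stopping condition open.

```python
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

def hierarchical_clusters(points, n_clusters):
    distances = pdist(points, metric="euclidean")                 # (2) pairwise distances
    tree = linkage(distances, method="average")                   # (3)-(4) repeatedly merge the closest clusters
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")   # (6) cut the tree into the desired clusters
    return tree, labels                                           # tree can be drawn with dendrogram() for (5)
```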
Classifying the data sets to be matched by DBSCAN clustering comprises the following steps:
(1) Initializing parameters including a neighborhood radius epsilon and a minimum density threshold MinPts;
(2) Randomly selecting an unaccessed data set point p, taking the unaccessed data set point p as a center, establishing a neighborhood with epsilon as a radius, and marking the data set point as a noise point if the number of the data set points in the neighborhood is smaller than MinPts;
(3) If the number of the data set points in the neighborhood is greater than or equal to MinPts, marking the data set point p as a core point, and establishing a new cluster by taking p as a center;
(4) Adding all unvisited data set points in the neighborhood to the cluster; if any of these points is also a core point, the data set points in its neighborhood are likewise added to the cluster;
(5) Repeating steps (2) - (4) until all data set points have been accessed;
(6) The clustering result may be evaluated by defining the density of data set points and the number of noise points within the neighborhood.
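A minimal sketch of the DBSCAN steps above; the scikit-learn implementation and the example values for epsilon and MinPts are assumptions and would be tuned in practice, as the summary below notes.

```python
from sklearn.cluster import DBSCAN

def dbscan_clusters(points, eps=0.5, min_samples=5):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    noise_count = int((labels == -1).sum())   # points labelled -1 are the noise points
    return labels, noise_count
```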
In summary, K-means clustering can divide a data set into K clusters and aggregate similar data set points together for further analysis and processing; the result of hierarchical clustering depends to a great extent on the similarity measure and the merging strategy, and its processing efficiency on large-scale data sets is low; DBSCAN requires the parameters epsilon and MinPts to be chosen reasonably, and its clustering procedure is more complex and computationally expensive. Therefore, the K-means clustering algorithm is preferred for classifying the data sets.
Example 2
Obtaining the arrangement coefficient of the data sets in each slicing queue, screening out the data sets whose arrangement coefficient is smaller than the processing threshold, reordering the data sets in each slicing queue from large to small according to the arrangement coefficient, summing the arrangement coefficients of all data sets in each slicing queue to obtain an ordering value, ordering the slicing queues from large to small according to the ordering value, and generating an ordering table specifically comprises the following steps:
collecting the index parameter and the time parameter of the data sets in each slicing queue, and establishing the arrangement coefficient PL_x of each data set from the index parameter and the time parameter through a formula, in which α and β are respectively the scale coefficients of the index parameter and the time parameter, with α > β > 0 and α + β = 2.468.
The index parameter is used to reflect the value of a data set and is calculated from yw_i, the traffic demand of the data set, qs_i, the missing-value ratio of the data set, and kz_i, the null-value proportion of the data set; the larger the index parameter, the higher the value of the data set.
The time parameter is used to reflect the time sensitivity of a data set and is calculated from gx_j, the update frequency of the data set, bc_j, the retention period of the data set, and yc_j, the delay time of the data set; the larger the time parameter, the higher the time sensitivity of the data set and the sooner it needs to be matched.
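A sketch of how the arrangement coefficient PL_x might be computed; because the patent's formulas appear only as images, the weighted-sum form below and the example values of α and β are assumptions for illustration, not the patented formula.

```python
ALPHA, BETA = 1.5, 0.968   # assumed values satisfying ALPHA > BETA > 0 and ALPHA + BETA = 2.468

def arrangement_coefficient(index_param: float, time_param: float) -> float:
    """Assumed weighted combination of the two parameters:
    index_param reflects the data set's value (derived from yw_i, qs_i, kz_i),
    time_param reflects its time sensitivity (derived from gx_j, bc_j, yc_j)."""
    return ALPHA * index_param + BETA * time_param
```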
Setting a processing threshold PL_y, and comparing the arrangement coefficient PL_x with the processing threshold PL_y;
if the arrangement coefficient PL_x of a data set satisfies PL_x < PL_y, the system judges that the data set is unimportant data and does not reach the matching standard, and screens the data set out of the slicing queue;
if the arrangement coefficient PL_x of a data set satisfies PL_x ≥ PL_y, the system judges that the data set reaches the matching standard and sorts the data set into the slicing queue.
In this way, the index parameter and the time parameter of the data sets in each slicing queue are collected and the arrangement coefficient is established through a formula, so that the importance of each data set is obtained; the data sets whose arrangement coefficient satisfies PL_x < the processing threshold PL_y are screened out before the data sets are sorted into the slicing queues, which effectively reduces the data processing amount and improves the data processing efficiency.
If the arrangement coefficient PL_x of a data set satisfies PL_x ≥ the processing threshold PL_y, the system judges that the data set reaches the matching standard and sorts it into the slicing queue, specifically as follows:
classifying the data sets to be matched through a clustering algorithm, generating slicing queues according to the classification result of the data sets, and assigning a class label to each slicing queue, with the class labels set as {A, B, C}; the data sets in slicing queue A are {A1, A2, A3, A4, A5}, the data sets in slicing queue B are {B1, B2, B3}, and the data sets in slicing queue C are {C1, C2, C3, C4};
if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue A is A3 > A2 > A1 > A5 > A4, and the arrangement coefficient PL_x of data set A4 satisfies PL_x < the processing threshold PL_y, the updated order of the data sets in slicing queue A is {A3, A2, A1, A5};
if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue B is B2 > B3 > B1, the updated order of the data sets in slicing queue B is {B2, B3, B1};
if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue C is C4 > C2 > C3 > C1, and the arrangement coefficients PL_x of data sets C3 and C1 satisfy PL_x < the processing threshold PL_y, the updated order of the data sets in slicing queue C is {C4, C2};
summing the arrangement coefficients PL_x of all data sets in each of the {A, B, C} slicing queues gives the ordering values; the comparison result of the ordering values is B > A > C, and sorting the slicing queues from large to small according to the ordering value generates the ordering table {B, A, C};
the matching order of the data sets is { B2, A3, C4, B3, A2, C2, B1, A5}.
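The queue ordering of this example can be sketched as follows; the coefficient values are made-up placeholders chosen only to satisfy the comparison results stated above (A3 > A2 > A1 > A5 > A4, B2 > B3 > B1, C4 > C2 > C3 > C1 and B > A > C), and the threshold value is likewise assumed.

```python
PL_Y = 1.0   # assumed processing threshold

coefficients = {  # placeholder arrangement coefficients consistent with the stated comparisons
    "A1": 2.0, "A2": 2.5, "A3": 3.0, "A4": 0.5, "A5": 1.5,
    "B1": 2.0, "B2": 4.0, "B3": 3.5,
    "C1": 0.4, "C2": 2.0, "C3": 0.8, "C4": 2.6,
}
queues = {"A": ["A1", "A2", "A3", "A4", "A5"],
          "B": ["B1", "B2", "B3"],
          "C": ["C1", "C2", "C3", "C4"]}

# screen out data sets below the threshold, then sort each queue by coefficient (descending)
for label, queue in queues.items():
    queues[label] = sorted((d for d in queue if coefficients[d] >= PL_Y),
                           key=coefficients.get, reverse=True)

# ordering value = sum of coefficients per queue; the ordering table sorts queues by it (descending)
ordering_table = sorted(queues, key=lambda lbl: sum(coefficients[d] for d in queues[lbl]),
                        reverse=True)

print(queues)          # {'A': ['A3', 'A2', 'A1', 'A5'], 'B': ['B2', 'B3', 'B1'], 'C': ['C4', 'C2']}
print(ordering_table)  # ['B', 'A', 'C']
```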
Example 3
In the above-described Embodiment 2, a processing threshold PL_y is set, and the arrangement coefficient PL_x is compared with the processing threshold PL_y;
if the arrangement coefficient PL_x of a data set satisfies PL_x < PL_y, the system judges that the data set is unimportant data and does not reach the matching standard, and screens the data set out of the slicing queue;
if the arrangement coefficient PL_x of a data set satisfies PL_x ≥ PL_y, the system judges that the data set reaches the matching standard and sorts the data set into the slicing queue.
After matching the data sets of the same batch, the system needs to set a batch number for the data sets of the batch and store the batch number into a database, so that the data sets of each batch need to be ordered for convenient management and searching of subsequent data;
Thus, the number of data sets in the same batch whose arrangement coefficient satisfies PL_x < the processing threshold PL_y is calibrated as H_i, and the number of data sets in the same batch whose arrangement coefficient satisfies PL_x ≥ the processing threshold PL_y is calibrated as H_j; the screening rate and the sorting rate of the batch of data sets are obtained by formula calculation, where sc_i denotes the screening rate of the data sets, px_i denotes the sorting rate of the data sets, and (H_i + H_j) is the total number of data sets in the batch; the batch number value is then obtained by formula calculation, where PH_z is the batch number value of the batch of data sets. After the batch number value PH_z is entered into the database, the database automatically sorts the batch of data sets; the data sets of each batch are sorted in the database according to the batch number value PH_z from large to small.
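A sketch of the batch bookkeeping described above; the patent's formulas for the screening rate sc_i, the sorting rate px_i and the batch number value PH_z are given only as images, so the simple proportions and the ratio used below are assumptions rather than the patented expressions.

```python
def batch_statistics(coefficients, threshold):
    h_i = sum(1 for c in coefficients if c < threshold)    # H_i: screened-out data sets
    h_j = sum(1 for c in coefficients if c >= threshold)   # H_j: data sets sorted into the queue
    total = h_i + h_j                                      # total number of data sets in the batch
    sc = h_i / total                                       # assumed screening rate sc_i
    px = h_j / total                                       # assumed sorting rate px_i
    ph = px / sc if sc else float("inf")                   # assumed batch number value PH_z from the two rates
    return sc, px, ph

# batches would then be stored and listed in descending order of their PH_z value
batches = {"batch_1": batch_statistics([0.5, 1.2, 2.0, 3.1], 1.0),
           "batch_2": batch_statistics([0.2, 0.4, 1.5], 1.0)}
ordered = sorted(batches, key=lambda b: batches[b][2], reverse=True)
```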
In the present embodiment, the screening rate and the sorting rate of each batch of data sets are obtained from the comparison results between the arrangement coefficients PL_x and the processing threshold PL_y, the batch number value is obtained from the sorting rate and the screening rate, and finally the batches of data sets are sorted in the database from large to small according to the batch number value PH_z, which facilitates later querying and management of the data sets.
The above formulas are all dimensionless numerical calculations; each formula is obtained by software simulation over a large number of collected data sets so as to reflect the latest real situation, and the preset parameters in the formulas are set by those skilled in the art according to the actual situation.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A data set matching method based on a number taking algorithm, characterized in that the matching method comprises the following steps:
s1: classifying the data sets to be matched through a clustering algorithm, generating slicing queues according to the classification result of the data sets, and assigning a class label to each slicing queue;
s2: acquiring the arrangement coefficient of the data set in each slicing queue, and screening out the data set with the arrangement coefficient smaller than the processing threshold;
s3: reordering the data sets in each slicing queue from large to small according to the arrangement coefficient, summing the arrangement coefficients of all data sets in each slicing queue to obtain an ordering value, ordering the slicing queues from large to small according to the ordering value, and generating an ordering table;
s4: when the data sets are matched, selecting the matching sequence of the slicing queues from the positive sequence of the sorting table, and selecting the matching sequence of the data sets from the positive sequence of the slicing queues;
s5: taking out the data set at the head of the queue, generating a unique identifier for the data set, matching the record in the data set with the target data set, storing the matching result into a result set, and deleting the data set from the slicing queue;
s6: when no unprocessed data set remains in the slicing queues, the matching algorithm ends.
2. The method for matching a data set based on a number taking algorithm as claimed in claim 1, wherein: in step S2, obtaining the arrangement coefficients of the data sets in each slicing queue includes the following steps:
collecting the index parameter and the time parameter of the data sets in each slicing queue, and establishing the arrangement coefficient PL_x of each data set from the index parameter and the time parameter through a formula.
3. The method for matching a data set based on a number taking algorithm as claimed in claim 2, wherein: in step S2, a processing threshold PL_y is set, and the arrangement coefficient PL_x is compared with the processing threshold PL_y;
if the arrangement coefficient PL_x of a data set satisfies PL_x < PL_y, the system judges that the data set is unimportant data and does not reach the matching standard, and screens the data set out of the slicing queue;
if the arrangement coefficient PL_x of a data set satisfies PL_x ≥ PL_y, the system judges that the data set reaches the matching standard and sorts the data set into the slicing queue.
4. The method for matching a data set based on a number taking algorithm as claimed in claim 3, wherein: the index parameter is used to represent the value of a data set and is calculated from yw_i, the traffic demand of the data set, qs_i, the missing-value ratio of the data set, and kz_i, the null-value proportion of the data set.
5. The method for matching a data set based on a number taking algorithm as claimed in claim 4, wherein: the time parameter is used to represent the time sensitivity of a data set and is calculated from gx_j, the update frequency of the data set, bc_j, the retention period of the data set, and yc_j, the delay time of the data set.
6. The method for matching a data set based on a number taking algorithm as claimed in claim 3, wherein: in step S1, the class labels assigned to the slicing queues are {A, B, C}, the data sets in slicing queue A are {A1, A2, A3, A4, A5}, the data sets in slicing queue B are {B1, B2, B3}, and the data sets in slicing queue C are {C1, C2, C3, C4}.
7. The method for matching a data set based on a number taking algorithm as claimed in claim 6, wherein: in step S3, the data sets in each slicing queue are reordered from large to small according to the arrangement coefficient, as follows:
s3.1: if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue A is A3 > A2 > A1 > A5 > A4, and the arrangement coefficient PL_x of data set A4 satisfies PL_x < the processing threshold PL_y, the updated order of the data sets in slicing queue A is {A3, A2, A1, A5};
s3.2: if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue B is B2 > B3 > B1, the updated order of the data sets in slicing queue B is {B2, B3, B1};
s3.3: if the comparison result of the arrangement coefficients PL_x of the data sets in slicing queue C is C4 > C2 > C3 > C1, and the arrangement coefficients PL_x of data sets C3 and C1 satisfy PL_x < the processing threshold PL_y, the updated order of the data sets in slicing queue C is {C4, C2}.
8. The method for matching a data set based on a number taking algorithm as defined in claim 7, wherein: in step S3, the sorting of the slice queues from large to small according to the sorting value includes the following steps:
s3.4: summing the arrangement coefficients PL_x of all data sets in each of the {A, B, C} slicing queues to obtain the ordering value of each queue;
s3.5: if the comparison result of the ordering values is B > A > C;
s3.6: sorting the slicing queues from large to small according to the ordering value generates the ordering table {B, A, C}.
9. The method for matching a data set based on a number taking algorithm as claimed in claim 3, wherein: the number of data sets in the same batch whose arrangement coefficient satisfies PL_x < the processing threshold PL_y is calibrated as H_i, and the number of data sets in the same batch whose arrangement coefficient satisfies PL_x ≥ the processing threshold PL_y is calibrated as H_j; the screening rate and the sorting rate of the batch of data sets are obtained by formula calculation, where sc_i denotes the screening rate of the data sets, px_i denotes the sorting rate of the data sets, and (H_i + H_j) is the total number of data sets in the batch; the batch number value is then obtained by formula calculation, where PH_z is the batch number value of the batch of data sets; after the batch number value PH_z is entered into the database, the database automatically sorts the batch of data sets, and the data sets of each batch are sorted in the database according to the batch number value PH_z from large to small.
10. A method for matching datasets based on a number taking algorithm as claimed in any one of claims 1 to 9 wherein: in step S1, classifying the data sets to be matched by the clustering algorithm includes the following steps:
s1.1: determining K values of data set clusters, and randomly selecting K points from the data set as initial centroids to serve as representative points of each cluster;
s1.2: calculating the distance from each data set point to each centroid, determining the nearest centroid, and distributing the data set points into corresponding clusters;
s1.3: recalculating the centroid of each cluster as the average value of all data set points in the cluster;
s1.4: repeating steps S1.2 and S1.3 until the cluster centers no longer change; for a new data set to be matched, the data set is assigned to the cluster whose centroid is closest to it.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310254184.3A CN116304736A (en) | 2023-03-16 | 2023-03-16 | Data set matching method based on number taking algorithm |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310254184.3A CN116304736A (en) | 2023-03-16 | 2023-03-16 | Data set matching method based on number taking algorithm |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116304736A true CN116304736A (en) | 2023-06-23 |
Family
ID=86820028
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310254184.3A Withdrawn CN116304736A (en) | 2023-03-16 | 2023-03-16 | Data set matching method based on number taking algorithm |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116304736A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116860876A (en) * | 2023-07-13 | 2023-10-10 | 上海热璞网络科技有限公司 | A recommended method, device and server for data fragmentation |
-
2023
- 2023-03-16 CN CN202310254184.3A patent/CN116304736A/en not_active Withdrawn
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US6012058A (en) | Scalable system for K-means clustering of large databases | |
| WO2015035864A1 (en) | Method, apparatus and system for data analysis | |
| US7069264B2 (en) | Stratified sampling of data in a database system | |
| WO2008154029A1 (en) | Data classification and hierarchical clustering | |
| CN109919781A (en) | Case recognition methods, electronic device and computer readable storage medium are cheated by clique | |
| CN102841916A (en) | Registration and maintenance of address data for each service point in a territory | |
| KR20130036094A (en) | Managing storage of individually accessible data units | |
| CN110910991B (en) | A medical automatic image processing system | |
| US12013855B2 (en) | Trimming blackhole clusters | |
| CN113168544A (en) | Method and system for servicing complex industrial systems | |
| US6563952B1 (en) | Method and apparatus for classification of high dimensional data | |
| US20170316071A1 (en) | Visually Interactive Identification of a Cohort of Data Objects Similar to a Query Based on Domain Knowledge | |
| US20200301966A1 (en) | Attribute diversity for frequent pattern analysis | |
| CN116304736A (en) | Data set matching method based on number taking algorithm | |
| CN104965846B (en) | Visual human's method for building up in MapReduce platform | |
| CN113901037A (en) | Data management method, device and storage medium | |
| CN113590559A (en) | Method for managing whole process of enterprise project management document | |
| CN118708580A (en) | Method and computing device for managing data | |
| CN112286874B (en) | Time-based file management method | |
| CN117520994B (en) | Method and system for identifying abnormal air ticket searching user based on user portrait and clustering technology | |
| CN113505172A (en) | Data processing method and device, electronic equipment and readable storage medium | |
| KR101085066B1 (en) | Association classification method for meaningful knowledge exploration in large multi-attribute datasets | |
| WO2016107297A1 (en) | Clustering method based on local density on mapreduce platform | |
| CN115687539A (en) | Knowledge base data information clustering method and system based on MapReduce model | |
| Glava et al. | Searching similar entities in models of various subject domains based on the analysis of their tuples |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20230623 |