Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a database security management and control platform SQL sentence complementation method based on a grammar tree, the method uses a parser to convert a portion of the SQL statement input by the user into a syntax tree, and then predicts the next possible content based on the syntax tree structure and context information. The method specifically comprises the following steps:
(1) The construction of a grammar tree, namely, inputting SQL sentences in an editor by a user to generate an original grammar tree;
(2) The module receives the original grammar tree generated by the grammar tree construction module, traverses the original grammar tree, adds ID and father node ID for each node, marks the table name and column name, adds type information for clause nodes and outputs the grammar tree after reinforcement;
(3) Generating a complement suggestion, namely giving the complement suggestion according to the enhanced grammar tree and the current cursor position;
(4) Database object management, namely returning the database object information according to the database object information request;
(5) Displaying the complement suggestion in the editor UI;
And when the completion display is performed, the SQL sentence completion result is displayed to the user according to the keyword, the table name and the column name, the higher ranking score is given to the newer record, the lower ranking score is given to the longer record, and the disturbance is added into the decay function.
Further, in the editor integration module, sorting is performed according to the keywords, the table names and the column names, and the method specifically comprises the following steps:
3.1 preparing a table name sequence x n,xn-1,…,x2,x1, initializing an empty set S for storing the processed table names and the corresponding sorting scores thereof, wherein n is the number of the table names in the history record, and x i represents the table name with the sequence number i, and i is more than or equal to 1 and less than or equal to n.
3.2 Calculating parametersWherein, α max and α min are respectively the maximum value and the minimum value of the preset parameter α, log () represents the base 10 logarithm, and n max is the preset maximum value of n;
3.3, calculating the sorting score for the table name with the sequence number of i, wherein i is circularly calculated from 1 to n, and the calculation process is as follows:
① Calculating a ranking score f (i) =e -αi(1+b×sin2 ((1- α) i+c)), where sin () is a sine function, b e [0,1 ], c is the phase offset;
② Inquiring whether the information of the table name x i exists in the set S, if the information of the table name x i exists in the set S, finding out the sorting score corresponding to the table name x i from the set S, and marking as At this point, a new rank score for table name x i is calculatedAnd will beUpdate to set S, order the ranking score of table name x i if the table name x i information was not previously present in set SAnd will beAdded to the set S.
③ See if there are other table names following table name x i in the table name sequence. If there are other table names, go to step ①, and if there are no other table names, the description is completed, go to step 3.4.
3.4 Rank the table names in set S from big to small in rank score.
Based on the method, the invention also provides a database security management and control platform SQL sentence completion system based on the grammar tree, which comprises the following steps:
(1) The grammar tree construction module receives SQL sentences input by a user and generates an original grammar tree;
(2) The grammar tree enhancement module is used for receiving the original grammar tree generated by the grammar tree construction module, traversing the original grammar tree, adding an ID and a father node ID for each node, marking a table name and a column name, adding type information for clause nodes, and outputting the enhanced grammar tree;
(3) The completion suggestion generation module is used for giving a completion suggestion according to the enhanced grammar tree and the current cursor position;
(4) The database object management module returns the database object information according to the database object information request;
(5) An editor integration module that displays the complement suggestion in an editor UI;
And when the completion display is performed, the SQL sentence completion result is displayed to the user according to the keyword, the table name and the column name, the higher ranking score is given to the newer record, the lower ranking score is given to the longer record, and the disturbance is added into the decay function.
Further, in the editor integration module, sorting is performed according to the keywords, the table names and the column names, and the method specifically comprises the following steps:
3.1 preparing a table name sequence x n,xn-1,…,x2,x1, initializing an empty set S for storing the processed table names and the corresponding sorting scores thereof, wherein n is the number of the table names in the history record, and x i represents the table name with the sequence number i, and i is more than or equal to 1 and less than or equal to n.
3.2 Calculating parametersWherein, α max and α min are respectively the maximum value and the minimum value of the preset parameter α, log () represents the base 10 logarithm, and n max is the preset maximum value of n;
3.3, calculating the sorting score for the table name with the sequence number of i, wherein i is circularly calculated from 1 to n, and the calculation process is as follows:
① Calculating a ranking score f (i) =e -αi(1+b×sin2 ((1- α) i+c)), where sin () is a sine function, b e [0,1 ], c is the phase offset;
② Inquiring whether the information of the table name x i exists in the set S, if the information of the table name x i exists in the set S, finding out the sorting score corresponding to the table name x i from the set S, and marking as At this point, a new rank score for table name x i is calculatedAnd will beUpdate to set S, order the ranking score of table name x i if the table name x i information was not previously present in set SAnd will beAdded to the set S.
③ See if there are other table names following table name x i in the table name sequence. If there are other table names, go to step ①, and if there are no other table names, the description is completed, go to step 3.4.
3.4 Rank the table names in set S from big to small in rank score.
The invention can provide accurate complement suggestion according to the structure and logic of SQL sentences, has better processing capability for complex SQL sentences such as nested sub-queries, joint queries and the like, can accurately identify grammar errors in the SQL sentences, and is easy to realize the functions such as SQL sentence formatting, code highlighting and the like. The invention ensures that for each table name in the historical data, the ranking score can be calculated based on the time decay and the specific periodic disturbance, so that the data score of longer time is lower and the data score of more recent time is higher as a whole, and the repeatability of the data is considered.
Detailed Description
Example 1
Based on the analysis, the invention designs an SQL sentence completion method based on a grammar tree (Abstract Syntax Tree, AST), which is named AST-SQL. The method uses a parser to convert a part of SQL sentences input by a user into a grammar tree, and predicts the next possible content according to the grammar tree structure and the context information, and takes the next possible content as a complement suggestion.
A syntax tree is a tree-like representation of the source code syntax structure, that is, the structure of a program is represented in the form of a tree, each node on the tree representing a syntax structure such as an expression, statement or type declaration, etc. In SQL, in particular, the syntax tree can represent the constituent parts of the SQL statement in a hierarchical and clear manner. The AST-SQL method can provide accurate complement suggestion according to the current context because the structure and logic of SQL sentences can be obtained through a grammar tree, and has better processing capability even for complex SQL sentences such as nested sub-queries, joint queries and the like. Meanwhile, the implementation basis of the method is a grammar tree, so that grammar errors in SQL sentences have congenital advantages in recognition capability. In addition, functions like SQL statement formatting, code highlighting and the like can be easily realized as auxiliary functions on the basis of a grammar tree.
In the AST-SQL method, each node of the syntax tree has some fixed attributes, including:
(1) ID-a unique ID per node.
(2) Type Each node has a type attribute, which indicates the type of the node, and specifically comprises statement,clause,keyword,property_access,identifier,comma,function_call,parenthesis,operator,literal.
(3) Index, the elements of the array child have index subscript attributes, so that the elements are convenient to search forward and backward.
According to the definition of the AST-SQL method for the node, taking a simple SQL statement as an example, it can be roughly split into fragments as in fig. 1.
On the basis of an AST-SQL method, the invention designs an SQL sentence completion system based on a grammar tree, and key components of the system comprise:
(1) SQLFormatterAst the component is used to generate the syntax tree of the SQL statement. In the database field, the components are relatively mature, and in general, tools for formatting SQL sentences have the core of building a grammar tree around the SQL sentences.
(2) SqlAst the method is used for analyzing the grammar tree generated by SQLFormatterAst, the meta information contained in the grammar tree is more conventional, and in order to be suitable for being used in SQL sentence completion, additional meta information such as node ID (ID field), father node ID (parentId field) and the like is added in the conventional grammar tree, and the additional meta information is helpful for establishing the hierarchical relationship among different nodes, so that a precondition is provided for improving the efficiency of the method. The component ultimately outputs an enhanced syntax tree that contains structural information about the SQL statement.
(3) SqlAdvisor determining appropriate complement suggestions based on the cursor position and the syntax tree. The component locates the node where the cursor is located, then analyzes the context of the node (e.g., clause type, node type, etc.), and generates a completion suggestion based on the analysis result. The completion suggestion is finally output, including the type of completion (table name, column name, keywords, etc.) and related context information.
(4) DatabaseObjectStore storing information about database objects (e.g., tables and columns) for providing completion options. The component loads the information of the corresponding database object according to the authority of the user and provides the complement option according to the request.
(5) SqlCompletionProvider implementing completion logic in the editor, including exposing completion suggestions. In flow, the component receives SqlAdvisor the generated completion suggestion and generates the completion term in the editor based on the completion suggestion and then presents the completion term to the user.
The key modules included in the system are:
(1) Grammar tree construction this module receives the SQL statement entered by the user and then uses SQLFormatterAst components to generate the original grammar tree.
(2) Syntax tree enhancement-the module receives the original syntax tree generated by the syntax tree construction module and then uses SqlAst components to do the following:
1) Traversing the original grammar tree, adding ID, father node ID and other information for each node,
2) Specific nodes are marked, such as table names, column names, etc.
3) Type information is added for clause nodes.
And finally, outputting the enhanced grammar tree.
(3) And generating a complement suggestion, namely giving the complement suggestion according to the enhanced grammar tree and the current cursor position. The module mainly uses SqlAdvisor components which firstly locate the node where the cursor is located, then analyze the context of the cursor node, determine the completion type and finally generate the completion suggestion, including the completion type and the context information.
(4) And (3) managing the database object, namely returning the database object information by the module according to the database object information request on the basis of the DatabaseObjectStore component.
(5) Editor integration the module uses SqlCompletionProvider components to follow up the complement suggestion and displays it specifically in the editor UI.
The SQL statement based on AST-SQL complements the complete module and component diagram of the system and the data flow process is shown in figure 2. The method mainly comprises the following implementation steps:
(1) Syntax tree construction
The user inputs SQL sentences in the editor, and the editor records the cursor position in the input process. In the AST-SQL method, the cursor position is marked with a special string __ cursor __. The SQLFormatterAst component is used to parse the SQL statement with __ cursor __ markup, generating the original syntax tree.
It should be noted that, the user may input many SQL statements in one SQL editor, and when the SQL statements are completed, only the SQL statement in which the current cursor is located need to be analyzed, instead of all the SQL statements in the whole SQL editor. The SQL sentences are divided by the marks, the method for acquiring the SQL sentences where the cursor is located is to search the nearest mark position before the cursor and the nearest mark position after the cursor, and the content between the 2 marks is the SQL sentences where the cursor is located. The semicolons used in this process are not included in either case, one in SQL notes, such as Hello, world, and the other in strings, such as 'Hello, world'.
FIG. 3 illustrates in specific contrast the user input SQL and SQL used to generate the grammar tree (where | represents the cursor).
(2) Syntax tree enhancement
In this step, the SqlAst component augments the original syntax tree that was generated, assigning each node information such as ID, type, parent and child nodes. Traversing clauses in the grammar tree, and carrying out different marking operations, such as marking table names, column names and the like, according to the clause types.
(3) Cursor position location
The SqlAdvisor component looks up the node labeled __ cursor __ and determines the node where the cursor is located. And searching the parent node of the cursor node by a recursion method until the nearest clause node is found. The completion type (table name, column name, keyword, etc.) is judged according to the type and position of the cursor node.
(4) Complement suggestion generation
A completion suggestion is generated SqlAdvisor based on the position and context of the cursor node. In short, the complement suggestion is to obtain what type of data, such as table name, column name, and common keywords, should be prompted to input by the current cursor position. Therefore, the content generated by the partial complement proposal is trivial and includes more head ends, and mainly includes the following cases:
1) Complement table name:
The table names, including view names, are completed in the system. The case of complementing table names mainly includes:
in the FROM clause, traversing forward in the grammar tree, complementing column names if ON is encountered, complementing table names if JOIN is encountered, complementing table names if comma or the first element is in front.
INSERT INTO clause, complement the table name, and may require automatic addition of aliases, etc. at full time.
INSERT IGNORE INTO clauses, complement table names.
Replacement INTO clause, complement table name.
DELETE FROM clause, complement table name.
UPDATE clause, complement table name.
DROP TABLE clause, complement TABLE name.
CREATE VIEW clauses-View map title does not require complementation.
VALUES clauses, in which the values entered by the user are not required to be completed.
DROP VIEW clause-complement VIEW map title.
The TRUNCATE TABLE clause, complement the TABLE name.
ALTER TABLE clause, complement TABLE name.
2) Complement column name:
SELECT clause, complement column names according to tables in the FROM clause.
WHERE clause, complement column name.
GROUP BY clause, complement column name.
ORDER BY clause, complement column name.
HAVING clause, complement column name.
3) Complement key:
in other cases, the keywords are complemented.
(5) Database object retrieval
The corresponding database object is retrieved from DatabaseObjectStore components according to the type of complement suggestion. If some tables or columns have not been loaded, an asynchronous load is performed and the completion list is updated.
For table name completion, if a user inputs a schema in SQL, then the table under the schema is completed, and if no schema is input, then all tables are completed.
For column name completion, if a table name is input by the user, the column of the table is completed, and if no table name is input, the completion is performed according to the table in the FROM clause.
(6) Completion presentation in an editor
The completion suggestions generated by the SqlAdvisor component are displayed to the user by the SqlCompletionProvider component. The presentation of the complement suggestion takes into account differences in different types of hinting items, such as keywords, table names, column names, etc., and supports different types of icons.
In the complement presentation, a ranking algorithm is designed, which can rank according to keywords, table names and column names, respectively. And when the SQL sentence completion result is displayed to the user, the result is sequenced according to the sequencing algorithm and then displayed. The algorithm is calculated by giving a higher ranking score to newer records and a lower ranking score to longer records, wherein the lower ranking score contains a decay function. At the same time, in order to avoid that the attenuation process is too ideal, a slight disturbance is added to the attenuation function.
The system stores the usage records of the key words, table names and column names of the user, and the usage records of the table names are shown in fig. 4.
The usage record of Table names includes Table names and the order of usage, and in the above example, the user sequentially accesses tables such as Table4, table3, table2. The first Table4 was used a long time ago, while the last Table2 was last used just recently. The Table names are numbered according to the reverse order of the actual use order, so that the corresponding serial numbers of the Table names can be obtained, the last Table2 serial number is 1, the first Table4 serial number is 10, and the other serial numbers are sequentially numbered. The method and process of keywords and column names are consistent with table names, so the algorithm will be described only by taking the table names as examples.
According to the above rule, let the sequence of table names be x n,xn-1,…,x2,x1, where n is the number of table names in the history, it should be noted that there may be duplication in these n table names. In this sequence, x i represents the table name with the sequence number i, 1.ltoreq.i.ltoreq.n.
The specific steps of the algorithm are shown in fig. 5. The specific information of each step is as follows:
1) Initialization of
The table name sequence x n,xn-1,…,x2,x1 is prepared and the empty set S is initialized for storing the processed table names and their corresponding ranking scores.
2) Parameter calculation
The parameter alpha is calculated, and the parameter is required to be used in the following steps, and the calculation method comprises the following steps:
Where α max and α min are the maximum and minimum values of the preset parameter α, respectively, log () represents the base-10 logarithm, and n max is the preset maximum value of n.
3) Historical data traversal
The calculation of the rank scores is performed on table names with the sequence number i, note that i is calculated circularly from 1 to n.
The calculation process is as follows:
① The method for calculating the sequencing score comprises the following steps:
f(i)=e-αi(1+b×sin2((1-α)i+c))
where sin () is a sine function, b e [0, 1) controls the amplitude of the periodic disturbance, c is the phase offset, and controls the starting position of the disturbance.
② Repeating the table name check:
Query set S is made as to whether there is information for table name x i already in set S.
If the inspection result shows that the information of table name x i already exists in set S, the ranking score corresponding to table name x i is found out from set S and recorded asAt this time, a new ranking score of table name x i is calculated by: And the new ranking score of the calculated table name x i Updated into the set S.
If the inspection results show that the information for table name x i was not previously present in set S, then let the table name x i rank scoreAnd the table name x i and the calculated ranking scoreAdded to the set S.
③ Traversing:
See if there are other table names following table name x i in the table name sequence. If there are other table names, the step a is skipped to continue processing, otherwise, if there are no other table names, the step a is continued after the traversal is completed.
4) Result output
The processed table names and the corresponding sorting scores thereof are stored in the set S, and the table names are sorted according to the sorting scores from big to small, so that the table name sequence for display output can be obtained.
The algorithm ensures that for each table name in the historical data, a ranking score can be calculated based on the time decay and the particular periodic disturbance, resulting in a lower data score for longer and higher data score for more recent as a whole, while taking into account the repeatability of the data.
Example 2
The following describes the use of the SQL statement completion system with specific examples.
Assume that there are 3 SQL statements:
these 3 SQL statements are not complete, and each SQL is actually in the middle of editing. In the SQL editor, there should be one SQL sentence complement suggestion for each possible input content of the SQL current position (the current input position, i.e. cursor position, is represented by "|" in the 3 SQL sentences). The specific process is as follows:
(1) Syntax tree construction
The user inputs SQL sentences in the editor, and the editor records the cursor position in the input process. In the AST-SQL method, the cursor position is marked with a special string __ cursor __. According to the method, the 3 SQL sentences are respectively converted into:
The SQLFormatterAst component then parses the SQL statement with __ cursor __ markup, generating the original syntax tree.
(2) Syntax tree enhancement
In this step, the SqlAst component augments the original syntax tree that was generated, assigning each node information such as ID, type, parent and child nodes. Traversing clauses in the grammar tree, and carrying out different marking operations, such as marking table names, column names and the like, according to the clause types. According to the method, the enhanced grammar trees corresponding to the 3 SQL sentences are respectively as follows:
(3) Cursor position location
The SqlAdvisor component looks up the node labeled __ cursor __ and determines the node where the cursor is located. And searching the parent node of the cursor node by a recursion method until the nearest clause node is found. The completion type (table name, column name, keyword, etc.) is judged according to the type and position of the cursor node.
From the generated grammar tree, the 3 SQL sentences should be complemented with column names, table names and table names, respectively.
(4) Complement suggestion generation
A completion suggestion is generated SqlAdvisor based on the position and context of the cursor node. In short, the complement suggestion is to obtain what type of data, such as table name, column name, and common keywords, should be prompted to input by the current cursor position. From the generated grammar tree, the 3 SQL sentences should be presented with column names, table names and table names, respectively.
(5) Database object retrieval
The corresponding database object is retrieved from DatabaseObjectStore components according to the type of complement suggestion. If some tables or columns have not been loaded, an asynchronous load is performed and the completion list is updated.
For table name completion, if a user inputs a schema in SQL, then the table under the schema is completed, and if no schema is input, then all tables are completed.
For column name completion, if a table name is input by the user, the column of the table is completed, and if no table name is input, the completion is performed according to the table in the FROM clause.
For the 3 SQL sentences, the 1 st SQL sentence needs to supplement column names, but no table names are provided, so that the table names cannot be prompted, and for the 2 nd and 3 rd SQL sentences, the table names need to be supplemented, and at the moment, the corresponding database tables are retrieved from the DatabaseObjectStore component, so that a table name list is obtained.
(6) Completion presentation in an editor
The completion suggestions generated by the SqlAdvisor component are displayed to the user by the SqlCompletionProvider component. The presentation of the complement suggestion takes into account differences in different types of hinting items, such as keywords, table names, column names, etc., and supports different types of icons.
For the 3 SQL sentences, the 1 st SQL sentence needs to supplement column names, but no table names are provided, so that the 3 st SQL sentence cannot be prompted, and for the 2 nd and 3 rd SQL sentences, the corresponding database tables are retrieved from the DatabaseObjectStore components to obtain a table name list, and the table name list is provided for the SqlAdvisor components. If the part of the list is found in the use record, the sorting score of the part of the list needs to be calculated and displayed in front of the prompt area, and if the part of the list is not used in the use record, the list is displayed in back of the prompt area after being sequentially arranged by letters.
The calculation of the ranking score adopts a ranking algorithm which can rank according to keywords, table names and column names respectively. And when the SQL sentence completion result is displayed to the user, the result is sequenced according to the sequencing algorithm and then displayed. The algorithm is calculated by giving a higher ranking score to newer records and a lower ranking score to longer records, wherein the lower ranking score contains a decay function. At the same time, in order to avoid that the attenuation process is too ideal, a slight disturbance is added to the attenuation function.
The specific steps of the algorithm of the ordered set of table name historical data :X={x10,x9,…,x1}={"student","scores","courses","student","departments","scores","student","scores","courses","student"}, are:
1) Initialization of
The initialization empty set S is used to store the processed table names and their corresponding ranking scores, n=10.
2) Parameter calculation
Let a max=0.5,amin=0.1,nmax =15, b=0.2, c=0.5, according to the given formula
The parameter α, α≡0.366 can be calculated.
3) Historical data traversal
The calculation of the rank scores is performed on table names with the sequence number i, note that i is calculated circularly from 1 to n.
The calculation process is as follows:
① The method for calculating the sequencing score comprises the following steps:
f(i)=e-αi(1+b×sin2((1-α)i+c))
where sin () is a sine function, b e [0, 1) controls the amplitude of the periodic disturbance, c is the phase offset, and controls the starting position of the disturbance.
② Repeating the table name check:
Query set S is made as to whether there is information for table name x i already in set S.
If the inspection result shows that the information of table name x i already exists in set S, the ranking score corresponding to table name x i is found out from set S and recorded asAt this time, a new ranking score of table name x i is calculated by: And the new ranking score of the calculated table name x i Updated into the set S.
If the inspection results show that the information for table name x i was not previously present in set S, then let the table name x i rank scoreAnd the table name x i and the calculated ranking scoreAdded to the set S.
③ Traversing:
See if there are other table names following table name x i in the table name sequence. If there are other table names, the step a is skipped to continue processing, otherwise, if there are no other table names, the step a is continued after the traversal is completed.
The specific calculation process is as follows:
when i=1, the table name x 1, i.e. the table name "student",
f(1)=e-0.366×1(1+0.2×sin2((1-0.366)×1+0.5))≈0.697
The table name "student" and its rank score of 0.697 are added to set S.
When i=2, the processing table name x 2, i.e. table name "courses",
f(2)=e-0.366×2(1+0.2×sin2((1-0.366)×2+0.5))≈0.483
The table name "courses" and its rank score 0.483 are added to set S.
When i=3, the table name x 3, i.e. the table name "score",
f(3)=e-0.366×3(1+0.2×sin2((1-0.366)×3+0.5))≈0.332
The table name "score" and its rank score 0.332 are added to set S.
When i=4, the table name x 4, i.e. the table name "student",
f(4)=e-0.366×4(1+0.2×sin2((1-0.366)×4+0.5))≈0.228
Because the table name "student" was previously present in set S, the rank score corresponding to table name x 4 was found from set S and recorded asOn the basis of which the weight is updated:
The ranking score of the table name "student" in set S is updated to 0.732.
When i=5, the table name x 5, i.e. the table name "score",
f(5)=e-0.366×5(1+0.2×sin2((1-0.366)×5+0.5))≈0.157
Because the table name "score" was previously present in set S, the rank score corresponding to table name x 5 was found from set S and recorded asOn the basis of which the weight is updated:
The ranking score of the table name "score" in set S is updated to 0.415.
And so on, the following steps are respectively:
When i=6, the table name "departments" and its ranking score 0.105 are added to the set S.
When i=7, the ranking score of the table name "student" in the set S is updated to 0.753.
When i=8, the ranking score of table name "courses" in set S is updated to 0.502.
When i=9, the ranking score of the table name "score" in the set S is updated to 0.432.
When i=10, the ranking score of the table name "student" in the set S is updated to 0.767.
4) Result output
The processed table names and the corresponding sorting scores thereof are stored in the set S, and the table names are sorted according to the sorting scores from big to small, so that the table name sequence for display output can be obtained. The results of the above example are as follows:
Finally, four table names of student, courses, score, departments may be displayed in order in the completion suggestion for presentation to the user.
The algorithm sorting process has the advantages that ① is simple and easy to implement, a complex mathematical model or a deep learning framework is not needed, understanding and implementation are easy, a large amount of computing resources are not needed, and implementation difficulty is low. ② The time sensitivity, algorithm takes into account the time decay factor, which means that more recent data has a higher score, which helps to capture the latest trend with higher accuracy. ③ The repeated data processing is carried out, the algorithm effectively processes the repeated data, and the score calculation method accurately reflects the repeatability of the data points and simultaneously adapts to the time attenuation characteristic. ④ The algorithm can calculate results quickly in response, which is very useful for application scenarios requiring real-time feedback.