WO2025107166A1

WO2025107166A1 - Text to structured query language conversion

Info

Publication number: WO2025107166A1
Application number: PCT/CN2023/133168
Authority: WO
Inventors: Ying Ma; Zhou HUANG; Xiang Zhang; Yiming Huang; Qiuyu WANG
Original assignee: eBay Inc
Current assignee: eBay Inc
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2025-05-30
Anticipated expiration: 2026-05-22

Abstract

Various embodiments described herein support or provide operations including receiving a natural language text from a device; using a first machine learning model to generate a first embedding vector that represents the natural language text; matching the first embedding vector with a second embedding vector that represents a data table; generating, based on the natural language text, a prompt as an input to a second machine learning model; using the second machine learning model to generate a response based on the prompt; and causing display of the SQL query in a user interface of the device.

Description

TEXT TO STRUCTURED QUERY LANGUAGE CONVERSION

TECHNICAL FIELD

The present disclosure generally relates to data processing using machine learning technologies. More particularly, various embodiments described herein provide for systems, methods, techniques, instruction sequences, and devices that facilitate the generation of structured query language outputs based on natural language inputs.

BACKGROUND

Existing systems face challenges in effectively handling the complexity and scale of database structures when it comes to generating structured query language (SQL) outputs based on user inputs in natural language. Specifically, users are often required to provide specifics of the databases to help the existing systems identify the database relevant to the user inputs to generate SQL outputs. Further, due to the complexity and scale of database structures in a production environment, existing systems also face challenges in improving the quality of SQL output generation to align more closely with human-level performance.

SUMMARY

An embodiment provides a system comprising one or more hardware processors; and at least one non-transitory machine-readable medium for storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations comprising: receiving a natural language text from a device; generating, using a first machine learning model, a first embedding vector that represents the natural language text; matching the first embedding vector with a second embedding vector that represents a data table, the data table being associated with a table schema; generating, based on the natural language text, a prompt as an input to a second machine learning model, the prompt including the table schema; and generating, using the second machine learning model, a response based on the prompt, the response including a Structured query language (SQL) query that can be executed on a database; and causing display of the SQL query in a user interface of the device.

In an embodiment of the system, further comprising accessing a plurality of data tables from one or more SQL databases, each data table being associated with a table schema; generating, using the first machine learning model, a plurality of embedding vectors, each embedding vector representing a corresponding data table; and storing the plurality of embedding vectors in a vector database.

In an embodiment of the system, the plurality of embedding vectors comprises the second embedding vector that represents the data table.

In an embodiment of the system, further comprising detecting, via the device, a successful execution of the SQL query against the database; generating a question-SQL pair that includes both the natural language text and the SQL query; and storing the question-SQL pair in a mapping data repository.

In an embodiment of the system, the natural language text comprises a question. The embodiment of the system further comprises accessing a plurality of historical questions associated with the data table, each historical question being associated with a syntactically correct SQL query; identifying, using the first machine learning model, a historical question based on the natural language text received from the device; and generating the prompt that includes both the historical question and the syntactically correct SQL query as a training example for the first machine learning model.

In an embodiment of the system, the historical questions associated with the data table are identified from a mapping data repository that includes a plurality of question-SQL pairs, each question-SQL pair including a syntactically correct SQL query that matches a respective question.

In an embodiment of the system, further comprising detecting a failed execution of the SQL query against the database; determining that a modification is made to the SQL query; and detecting a successful execution of a modified SQL query against the database.

In an embodiment of the system, further comprising generating a question-SQL pair that includes both the natural language text and the modified SQL query; and storing the question-SQL pair in a mapping data repository.

In an embodiment of the system, the first machine learning model comprises at least one Bidirectional Encoder Representations from Transformers (BERT) machine learning model.

In an embodiment of the system, the second machine learning model comprises at least one large language machine learning model.

An embodiment provides a method comprising receiving a natural language text from a device; generating, using a first machine learning model, a first embedding vector that represents the natural language text; matching the first embedding vector with a second embedding vector that represents a data table, the data table being associated with a table schema; generating, based on the natural language text, a prompt as an input to a second machine learning model, the prompt including the table schema; and generating, using the second machine learning model, a response based on the prompt, the response including a Structured query language (SQL) query that can be executed on a database; and causing display of the SQL query in a user interface of the device.

In an embodiment of the method, further comprising accessing a plurality of data tables from one or more SQL databases, each data table being associated with a table schema; generating, using the first machine learning model, a plurality of embedding vectors, each embedding vector representing a corresponding data table; and storing the plurality of embedding vectors in a vector database.

In an embodiment of the method, the plurality of embedding vectors comprises the second embedding vector that represents the data table.

In an embodiment of the method, further comprising detecting, via the device, a successful execution of the SQL query against the database; generating a question-SQL pair that includes both the natural language text and the SQL query; and storing the question-SQL pair in a mapping data repository.

In an embodiment of the method, the natural language text comprises a question. The embodiment of the method further comprises accessing a plurality of historical questions associated with the data table, each historical question being associated with a syntactically correct SQL query; identifying, using the first machine learning model, a historical question based on the natural language text received from the device; and generating the prompt that includes both the historical question and the syntactically correct SQL query as a training example for the first machine learning model.

In an embodiment of the method, the historical questions associated with the data table are identified from a mapping data repository that includes a plurality of question-SQL pairs, each question-SQL pair including a syntactically correct SQL query that matches a respective question.

In an embodiment of the method, further comprising detecting a failed execution of the SQL query against the database; determining that a modification is made to the SQL query; and detecting a successful execution of a modified SQL query against the database.

In an embodiment of the method, further comprising generating a question-SQL pair that includes both the natural language text and the modified SQL query; and storing the question-SQL pair in a mapping data repository.

In an embodiment of the method, the first machine learning model comprises at least one Bidirectional Encoder Representations from Transformers (BERT) machine learning model, and wherein the second machine learning model comprises at least one large language machine learning model.

An embodiment provides a machine-storage medium comprising receiving a natural language text from a device; generating, using a first machine learning model, a first embedding vector that represents the natural language text; matching the first embedding vector with a second embedding vector that represents a data table, the data table being associated with a table schema; generating, based on the natural language text, a prompt as an input to a second machine learning model, the prompt including the table schema; and generating, using the second machine learning model, a response based on the prompt, the response including a Structured query language (SQL) query that can be executed on a database; and causing display of the SQL query in a user interface of the device.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced. Some embodiments are illustrated by way of examples, and not limitations, in the accompanying figures.

FIG. 1 is a block diagram showing an example data system that includes a data management system, according to various embodiments of the present disclosure.

FIG. 2 is a block diagram illustrating an example data management system that facilitates the generation of structured query language outputs, according to various embodiments of the present disclosure.

FIG. 3 is a flowchart illustrating an example method for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure.

FIG. 4 is a flowchart illustrating an example method for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure.

FIG. 5 is a flowchart illustrating an example method for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure.

FIG. 6 is a diagram illustrating data flow within an example data management system that facilitates the generation of structured query language outputs during operation, according to various embodiments of the present disclosure.

FIG. 7 is a flowchart illustrating an example method for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure.

FIG. 8 is a block diagram illustrating a representative software architecture, which may be used in conjunction with various hardware architectures herein described, according to various embodiments of the present disclosure.

FIG. 9 is a block diagram illustrating components of a machine able to read instructions from a machine storage medium and perform any one or more of the methodologies discussed herein according to various embodiments of the present disclosure.

DETAILED DESCRIPTION

The description that follows includes systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments of the present disclosure. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of embodiments. It will be evident, however, to one skilled in the art that the present inventive subject matter may be practiced without these specific details.

Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present subject matter. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.

For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be apparent to one of ordinary skill in the art that embodiments of the subject matter described may be practiced without the specific details presented herein, or in various combinations, as described herein. Furthermore, well-known features may be omitted or simplified in order not to obscure the described embodiments. Various embodiments may be given throughout this description. These are merely descriptions of specific embodiments. The scope or meaning of the claims is not limited to the embodiments given.

Existing systems face challenges when it comes to using machine learning models to generate structured query language (SQL) outputs based on user inputs in natural language (also referred to as text-SQL process) . A first example challenge is the inefficient handling of large data volumes and complex data structures. Specifically, research on text-to-SQL is typically done based on limited data volume and relatively simple SQL queries. In the real-world production environment, the size of the dataset is significantly larger (e.g., tens of thousands of data tables) , posing challenges in efficiently identifying the correct data table (s) relevant to a user's inquiry. A second example challenge is the low accuracy of SQL query generation. Current studies show the accuracy of SQL outputs (e.g., SQL queries) generated using machine learning models is below 50%. Improving the accuracy of SQL query generation to align closely with human-level performance (e.g., 90%accuracy) is a pivotal challenge that needs to be addressed.

Various embodiments include systems, methods, and non-transitory computer-readable media that facilitate the generation of structured query language outputs based on natural language inputs (e.g., text inputs) . Specifically, a data management system can use machine learning models (e.g., large language models) to convert text inputs to SQL outputs that can be executed on databases. SQL outputs are also referred to as SQL queries or SQL results described herein. In various embodiments, the data management system can use machine learning models (e.g., Bidirectional Encoder Representations from Transformers (BERT) machine learning model) to identify one or more data tables relevant to a user’s inquiry. An example inquiry can include one or more questions in natural language (e.g., text format) . The data management system can use one or more large language models to generate one or more SQL queries in response to the one or more questions. Users can then execute the SQL queries against the data table to look for answers to the questions in the inquiry. The generation of structured query language outputs based on natural language inputs can also be referred to as text-to-SQL process, as described herein.

In various embodiments, upon receiving a user inquiry in natural language (e.g., text format) , the data management system can generate one or more embeddings (also referred to as word embeddings, or embedding vectors described herein) to represent the inquiry. An embedding can be a representation (e.g., a real-valued vector) of a word used in text analysis. Embeddings can also be generated (e.g., pre-generated) for each data table in one or more databases. The data management system can use machine learning models (e.g., BERT machine learning models) to identify specific data tables relevant to the inquiry, such as matching the embedding of the inquiry against numerous embeddings generated for the data tables based on semantic similarity. Under this approach, the data management system can automatically locate data tables based on the user’s inquiry such that users no longer need to deal with the hassle of providing specifics (e.g., table schemas) of data tables to the large language models.

In various embodiments, upon identifying the one or more data tables, the data management system can generate one or more prompts that include the received text inquiry (e.g., natural language texts) and one or more table schemas associated with the one or more data tables. A table schema can be a structure or an organization of data within a data table. A table schema can define columns and/or fields in the data table, as well as data types, constraints, and relationships with other data tables. A well-defined table schema is crucial for maintaining data integrity and consistency, making it possible to query and work with the data effectively.

In various embodiments, the data management system uses the one or more prompts as inputs to one or more large language models and generates one or more model responses based on the one or more prompts. A model response includes one or more SQL queries that can be executed on (or against) the respective data tables in one or more databases. The data management system can cause the display of one or more SQL queries on a user interface of a device, from which one or more text inquiries were received.

In various embodiments, when generating prompts as inputs to one or more machine learning models (e.g., large language ML models) , the data management system can identify one or more historical questions (e.g., historical inquiries) associated with a data table. Each historical question is associated with a syntactically correct SQL query that matches the historical question. In various embodiments, the data management system generates question-SQL pairs based on such historical questions and the corresponding syntactically correct SQL queries. The data management system can use BERT ML models to identify one or more historical questions similar (e.g., semantically similar) to the questions in the received inquiry. The data management system can generate a prompt that includes such question-SQL pairs (e.g., semantically similar historical questions and the corresponding syntactically correct SQL queries) as training examples for the large language ML models to generate SQL queries that match the user’s inquiry. Under this approach, since the training examples guide the generation of model responses, the large language ML models are more likely to generate syntactically accurate SQL queries, thereby improving the accuracy of SQL query generation. Further, user feedback plays a crucial role in continuously improving the text-to-SQL process. By allowing users to modify and execute SQL queries on the SQL development platform, the data management system can capture users’ corrections and improvements on model-generated results. Such feedback can be used to refine the text-SQL repository, ensuring that future queries can be generated under the guidance of more accurate text-to-SQL examples.

In various embodiments, the data management system detects, via the device, successful or unsuccessful (e.g., failed) execution of the SQL query against the database. Execution can be detected based on user interactions with an SQL development platform. An SQL development platform, often referred to as an Integrated Development Environment (IDE) for SQL, can be a software tool or application that provides an environment for software developers and administrators to create, manage, and query SQL databases. The data system described herein can include (or communicatively coupled to) the SQL development platform. In various embodiments, the data management system can be incorporated into a plugin tool installed on the SQL development platform. A plugin, or plugin tool, can be referred to as a software component (or system) that adds specific features or functionality to extend the capabilities of an existing application or program (e.g., web browsers, media players, or other types of software or platforms) .

In various embodiments, upon detecting a successful execution of the SQL query, the data management system can generate a question-SQL pair as an example for prompt generation. The question-SQL pair includes both the natural language text (e.g., text inquiry) and the successfully executed SQL query. The data management system stores the question-SQL pair in a mapping data repository.

In various embodiments, upon detecting a failed execution of the SQL query, the data management system can determine whether the SQL query is modified based on the detection of a successful execution later on. The data management system can generate a question-SQL pair that includes both the natural language text (e.g., text inquiry) and the modified SQL query. The question-SQL pair can be associated with the relevant data tables and stored in the mapping data repository. Each data table can be associated with a plurality of question-SQL pairs that can be used as examples to guide the large language models to generate SQL outputs in the future.

In various embodiments, one or more BERT ML models (e.g., first machine learning models) can be used to generate and match the embeddings of text inquiries with embeddings of data tables based on semantic similarity.

In various embodiments, one or more large language models (e.g., second machine learning models) can be used to generate SQL results based on prompts described herein.

The technical advantages of the disclosed invention include, without limitation: suitable implementation in complex production environment, similar questions as training examples (e.g., “few shots” ) in prompt generation, user feedback-driven improvement, seamless integration into SQL development platforms, and real-time adaptation and learning, as respectively discussed below.

Suitable implementation in complex production environment: Existing research related to text-to-SQL processes is done based on limited number of databases. There are hundreds of databases and tens of thousands of tables in production environment. The disclosed invention is more suitable for complex production environment where it allows similar data tables to be automatically retrieved by the system instead of asking users to provide such information (e.g., table schemas) .

Similar Questions as “few shots” in prompt generation: Few shots, also referred to as training examples described herein, in prompts can guide large language machine learning models to generate more accurate SQL queries. Instead of using static few shots, the disclosed invention automatically identifies SQL answers that are semantically similar to the user's questions and integrates them into prompts as few shots, thereby helping large language machine learning models generate more accurate SQL queries.

User Feedback-driven Improvement: User feedback plays a crucial role in continuously improving the text-to-SQL process. By allowing users to modify and execute SQL queries within the system, the disclosed invention captures users’ modifications (e.g., corrections and improvements) to model-generated SQL queries. Such feedback is used to refine the question-SQL repository, ensuring that future queries can be generated under the guidance of more accurate text-to-SQL examples.

Seamless Integration into SQL Development Platforms: The systems discussed under the disclosed invention seamlessly can be integrated into existing SQL development platforms as a plugin tool. This allows users to access the enhanced text-to-SQL functionality within their familiar working environment without the need for additional tools or interfaces. The seamless integration enhances user convenience and adoption.

Real-time Adaptation and Learning: The systems discussed under the disclosed invention continuously learns and adapts based on user interactions and newly collected data. By dynamically incorporating user modifications and capturing semantic similarities between queries, the system can provide more accurate and context-aware SQL suggestions over time.

Reference will now be made in detail to embodiments of the present disclosure, examples of which are illustrated in the appended drawings. The present disclosure may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein.

FIG. 1 is a block diagram showing an example data system 100 that includes a data management system 122 (also referred to as system 122) , according to various embodiments of the present disclosure. By including the data management system 122, the data system 100 can facilitate the generation of structured query language outputs using machine learning technologies. As shown, the data system 100 includes one or more client devices 102, a server system 108, and a network 106 (e.g., Internet, wide-area-network (WAN) , local-area-network (LAN) , wireless network) that communicatively couples them together. Each client device 102 can host a number of applications, including a client software application 104. The client software application 104 can communicate data with the server system 108 via a network 106. Accordingly, the client software application 104 can communicate and exchange data with the server system 108 via network 106.

The server system 108 provides server-side functionality via the network 106 to the client software application 104. While certain functions of the data system 100 are described herein as being performed by the data management system 122 on the server system 108, it will be appreciated that the location of certain functionality within the server system 108 is a design choice. For example, it may be technically preferable to initially deploy certain technology and functionality within the server system 108, but to later migrate this technology and functionality to the client software application 104.

The server system 108 supports various services and operations that are provided to the client software application 104 by the data management system 122. Such operations include transmitting data from the data management system 122 to the client software application 104, receiving data from the client software application 104 at the data management system 122, and the data management system 122 processing data generated by the client software application 104. Data exchanges within the data system 100 may be invoked and controlled through operations of software component environments available via one or more endpoints, or functions available via one or more user interfaces of the client software application 104, which may include web-based user interfaces provided by the server system 108 for presentation at the client device 102.

With respect to the server system 108, an Application Program Interface (API) server 110 and a web server 112 is coupled to an application server 116, which hosts the data management system 122. The application server 116 is communicatively coupled to a database server 118, which facilitates access to a database 120 that stores data associated with the application server 116, including data that may be generated or used by the data management system 122.

The API server 110 receives and transmits data (e.g., API calls, commands, requests, responses, and authentication data) between the client device 102 and the application server 116. Specifically, the API server 110 provides a set of interfaces (e.g., routines and protocols) that can be called or queried by the client software application 104 in order to invoke the functionality of the application server 116. The API server 110 exposes various functions supported by the application server 116 including, without limitation, user registration; login functionality; data object operations (e.g., generating, storing, retrieving, encrypting, decrypting, transferring, access rights, licensing) ; and/or user communications.

The server system 108, or the data management system 122 may extract user data from one or more third-party platforms (e.g., third-party social media platforms) . The extracted data may be open-source poster data associated with targeted influencers on the one or more third-party platforms 124 and may include user profile data, activity data, and media posted (either created and/or shared) by the one or more influencers. The media (or media data) include text, image, video, audio, and metadata. Example metadata may include hashtags and labels.

Through one or more web-based interfaces (e.g., web-based user interfaces) , the web server 112 can support various functionality of the data management system 122 of the application server 116.

FIG. 2 is a block diagram illustrating an example data management system 200 that facilitates the generation of structured query language outputs, according to various embodiments of the present disclosure. For some embodiments, the data management system 200 represents an example of the data management system 122 described with respect to FIG. 1. As shown, the data management system 200 comprises a text inquiry receiving component 210, an embedding vector generating component 220, an embedding vector matching component 230, a prompt generating component 240, a SQL query generating component 250, a SQL query execution detecting component 260, and a question-SQL pair generating component 270. According to various embodiments, one or more of the text inquiry receiving component 210, the embedding vector generating component 220, the embedding vector matching component 230, the prompt generating component 240, the SQL query generating component 250, the SQL query execution detecting component 260, and the question-SQL pair generating component 270 are implemented by one or more hardware processors 202. Data generated by one or more of the text inquiry receiving component 210, the embedding vector generating component 220, the embedding vector matching component 230, the prompt generating component 240, the SQL query generating component 250, the SQL query execution detecting component 260, and the question-SQL pair generating component 270 may be stored in a database (or datastore) 280 of the data management system 200.

The text inquiry receiving component 210 is configured to receive one or more user inquiries in natural language (e.g., text format) . A natural language can be any language that occurs naturally in a human community through the process of use. An example inquiry can include one or more questions.

The embedding vector generating component 220 is configured to use one or more machine learning models (e.g., BERT ML models) to generate one or more embeddings (also referred to as embedding vectors described herein) that represent the natural language text inquiries. An embedding can be a representation (e.g., a real-valued vector) of a word used in text analysis.

The embedding vector matching component 230 is configured to use BERT ML models to match the one or more embeddings of the inquiry against (or on) numerous embeddings generated for the data tables based on semantic similarity. Embeddings generated for the data tables can also be referred to as table schema embeddings. Such table schema embeddings can be stored in a table schema embedding database.

The prompt generating component 240 is configured to generate one or more prompts that include the received one or more text inquiries and one or more table schemas associated with the one or more identified data tables.

The SQL query generating component 250 is configured to use the one or more prompts as inputs to one or more large language models and generates one or more model responses (e.g., SQL queries) based on the one or more prompts.

The SQL query execution detecting component 260 is configured to detect successful or unsuccessful (e.g., failed) execution of the SQL query against the data tables (or databases) . Execution can be detected based on user interactions with an SQL development platform described herein.

The question-SQL pair generating component 270 is configured to generate question-SQL pairs (also referred to as text-SQL examples) as examples for future prompt generation. Such user feedback can be used to refine the text-SQL repository, ensuring that future queries can be generated under the guidance of more accurate text-to-SQL examples.

FIG. 3 is a flowchart illustrating an example method 300 for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure. It will be understood that example methods described herein may be performed by a machine in accordance with some embodiments. For example, method 300 can be performed by the data management system 122 described with respect to FIG. 1, the data management system 200 described with respect to FIG. 2, or individual components thereof. An operation of various methods described herein may be performed by one or more hardware processors (e.g., central processing units or graphics processing units) of a computing device (e.g., a desktop, server, laptop, mobile phone, tablet, etc. ) , which may be part of a computing system based on a cloud architecture. Example methods described herein may also be implemented in the form of executable instructions stored on a machine-readable medium or in the form of electronic circuitry. For instance, the operations of method 300 may be represented by executable instructions that, when executed by a processor of a computing device, cause the computing device to perform method 300. Depending on the embodiment, an operation of an example method described herein may be repeated in different ways or involve intervening operations not shown. Though the operations of example methods may be depicted and described in a certain order, the order in which the operations are performed may vary among embodiments, including performing certain operations in parallel.

At operation 302, a processor receives one or more user inquiries in natural language (e.g., text format) . A natural language can be any language that occurs naturally in a human community through the process of use. An example inquiry can include one or more questions.

At operation 304, a processor uses one or more machine learning models (e.g., BERT ML models) to generate one or more embeddings (also referred to as embedding vectors described herein) that represent the natural language text inquiries.

At operation 306, a processor uses BERT ML models to match the one or more embeddings of the inquiry against (or on) numerous embeddings generated for the data tables based on semantic similarity. Embeddings generated for the data tables can also be referred to as table schema embeddings. Such table schema embeddings can be stored in a table schema embedding database.

At operation 308, a processor generates one or more prompts that include the received one or more text inquiries and one or more table schemas associated with the one or more identified data tables. A table schema can be a structure or an organization of data within a data table. A table schema can define columns and/or fields in the data table, as well as data types, constraints, and relationships with other data tables. A well-defined table schema is crucial for maintaining data integrity and consistency, making it possible to query and work with the data effectively.

At operation 310, a processor uses the one or more prompts as inputs to one or more large language ML models and generates one or more model responses (including SQL queries) based on the one or more prompts.

At operation 312, a processor causes the display of the generated SQL queries on a device (e.g., a device via which text inquiries were received) .

Though not illustrated, method 300 can include an operation where a graphical user interface is displayed (or caused to be displayed) by the hardware processor. For instance, the operation can cause a client device (e.g., the client device 102 communicatively coupled to the data management system 122) to display the graphical user interface. This operation for displaying the graphical user interface can be separate from operations 302 through 312 or, alternatively, form part of one or more of operations 302 through 312.

FIG. 4 is a flowchart illustrating an example method 400 for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure. It will be understood that example methods described herein may be performed by a machine in accordance with some embodiments. For example, method 400 can be performed by the data management system 122 described with respect to FIG. 1, the data management system 200 described with respect to FIG. 2, or individual components thereof. An operation of various methods described herein may be performed by one or more hardware processors (e.g., central processing units or graphics processing units) of a computing device (e.g., a desktop, server, laptop, mobile phone, tablet, etc. ) , which may be part of a computing system based on a cloud architecture. Example methods described herein may also be implemented in the form of executable instructions stored on a machine-readable medium or in the form of electronic circuitry. For instance, the operations of method 400 may be represented by executable instructions that, when executed by a processor of a computing device, cause the computing device to perform method 400. Depending on the embodiment, an operation of an example method described herein may be repeated in different ways or involve intervening operations not shown. Though the operations of example methods may be depicted and described in a certain order, the order in which the operations are performed may vary among embodiments, including performing certain operations in parallel. Operations in method 400 can be performed dependently or independently from operations in methods 300 and 500.

At operation 402, a processor identifies (or accesses) a plurality of historical questions (e.g., historical inquiries) associated with one or more identified data tables that are relevant to the text inquiry described herein. Each historical question is associated with (or paired with) a syntactically correct SQL query that matches the historical question.

At operation 404, a processor uses BERT ML models to identify, among the plurality of historical questions, one or more historical questions that are similar (e.g., semantically similar) to one or more questions in the received text inquiry.

At operation 406, a processor generates a prompt that includes such question-SQL pairs (e.g., semantically similar historical questions and the corresponding syntactically correct SQL queries) as training examples for the large language ML models to generate SQL queries that match the user’s inquiry. Under this approach, since the training examples guide the generation of model responses, the large language ML models are more likely to generate syntactically accurate SQL queries, thereby improving the accuracy of SQL query generation.

Though not illustrated, method 400 can include an operation where a graphical user interface can be displayed (or caused to be displayed) by the hardware processor. For instance, the operation can cause a client device (e.g., the client device 102 communicatively coupled to the data management system 122) to display the graphical user interface. This operation for displaying the graphical user interface can be separate from operations 402 through 406 or, alternatively, form part of one or more of operations 402 through 406.

FIG. 5 is a flowchart illustrating an example method 500 for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure. It will be understood that example methods described herein may be performed by a machine in accordance with some embodiments. For example, method 500 can be performed by the data management system 122 described with respect to FIG. 1, the data management system 200 described with respect to FIG. 2, or individual components thereof. An operation of various methods described herein may be performed by one or more hardware processors (e.g., central processing units or graphics processing units) of a computing device (e.g., a desktop, server, laptop, mobile phone, tablet, etc. ) , which may be part of a computing system based on a cloud architecture. Example methods described herein may also be implemented in the form of executable instructions stored on a machine-readable medium or in the form of electronic circuitry. For instance, the operations of method 500 may be represented by executable instructions that, when executed by a processor of a computing device, cause the computing device to perform method 500. Depending on the embodiment, an operation of an example method described herein may be repeated in different ways or involve intervening operations not shown. Though the operations of example methods may be depicted and described in a certain order, the order in which the operations are performed may vary among embodiments, including performing certain operations in parallel. Operations in method 500 can be performed dependently or independently from operations in methods 300 and 400.

At operation 502, a processor detects successful or unsuccessful (e.g., failed) execution of the SQL query against the data tables (or databases) . Execution can be detected based on user interactions with an SQL development platform described herein. Upon detecting a failed execution of a SQL query, the processor can determine whether the SQL query is modified by a user of the device based on the detection of a successful execution later on.

At operation 504, upon detecting a failed execution of a SQL query, the processor can determine whether the SQL query is modified by a user of the device based on the detection of a successful execution later on.

At operation 506, a processor detecting a successful execution of a modified SQL query against the database.

At operation 508, a processor generates a question-SQL pair that includes both the natural language text and the modified SQL query. Such a question-SQL pair can be used as an example to guide large language ML models to generate SQL outputs in response to receiving semantically similar text inquiries in the future.

At operation 510, a processor stores the question-SQL pair in a mapping data repository.

Though not illustrated, method 500 can include an operation where a graphical user interface can be displayed (or caused to be displayed) by the hardware processor. For instance, the operation can cause a client device (e.g., the client device 102 communicatively coupled to the data management system 122) to display the graphical user interface. This operation for displaying the graphical user interface can be separate from operations 502 through 510 or, alternatively, form part of one or more of operations 502 through 510.

FIG. 6 is a diagram illustrating data flow 600 within an example data management system that facilitates the generation of structured query language outputs during operation, according to various embodiments of the present disclosure. As shown, upon receiving from user 602 a text inquiry 604 in natural language, the data management system (e.g., system 122 and 200) generates a question embedding 628 that is used to match table schema embeddings 614. An example table schema embedding can be an embedding pre-generated for a particular data table. By matching a question embedding 628 with (or against) the table schema embeddings 614, the data management system can identify one or more data tables that are relevant to (or associated with) the text inquiry 604. The data management system, via the SQL genius server 620, can generate a prompt 622 as an input to one or more large language ML models 624. Prompt 622 can include, without limitation, the text inquiry 604, table schemas associated with the relevant data tables, and one or more training examples described herein. The one or more large language ML models 624 can generate one or more SQL queries 626 based on prompt 622. User 602 can run the one or more SQL queries 626 on the SQL development platform 606.

A training example can include a question-SQL pair that is stored in question-SQL mapping repository 612. As shown, embedding service 608 can be included in the data management system described herein, or be communicatively coupled to the data management system. Embedding service 608 can include one or more BERT ML models that are used to generate embeddings for text inquiries and data tables and identify relevant data tables by matching embeddings of data tables to embeddings of text inquiries based on semantic similarity.

The data management system can include the SQL genius server 620, which generates SQL queries based on text inquiries described herein.

FIG. 7 is a flowchart illustrating an example method 700 for generating structured query language outputs using machine learning models, according to various embodiments of the present disclosure. As shown, the data management system (e.g., systems 122 and 200) can detect whether an execution of a model-generated SQL query is run successfully on the SQL development platform described herein. Upon detecting a failed execution, the data management system can determine whether modifications to the SQL query are made by a user. Upon detecting a successful execution of either a model-generated SQL query or a user-modified SQL query, the data management system can generate a question-SQL pair that includes the SQL query and the associated question and store the pair in the question-SQL mapping repository (repository 612) for future prompt generation.

As used herein, an ML model can be generated (or built) based on configured parameters and trained based on training data. Once generated and trained, a machine learning model can receive one or more inputs, extract one or more features, and generate an output for the inputs based on the model’s training. A system administrator (or an authorized user) can provide feedback to each component based on the outputs generated by the associated ML models and adjust the model parameters as needed to improve the accuracy of the outputs. Such a feedback loop mechanism (also referred to as closed-loop learning) leverages the outputs of ML models and the corresponding user inputs to retrain and improve models over time for better performance.

FIG. 8 is a block diagram illustrating an example of a software architecture 802 that may be installed on a machine, according to some example embodiments. FIG. 8 is merely a non-limiting example of a software architecture, and it will be appreciated that many other architectures may be implemented to facilitate the functionality described herein. The software architecture 802 may be executing on hardware such as a machine 900 of FIG. 9 that includes, among other things, processors 910, memory 930, and input/output (I/O) components 950. A representative hardware layer 804 is illustrated and can represent, for example, the machine 900 of FIG. 9. The representative hardware layer 804 comprises one or more processing units 806 having associated executable instructions 808. The executable instructions 808 represent the executable instructions of the software architecture 802. The hardware layer 804 also includes memory or storage modules 810, which also have the executable instructions 808. The hardware layer 804 may also comprise other hardware 812, which represents any other hardware of the hardware layer 804, such as the other hardware illustrated as part of the machine 900.

In the example architecture of FIG. 8, the software architecture 802 may be conceptualized as a stack of layers, where each layer provides particular functionality. For example, the software architecture 802 may include layers such as an operating system 814, libraries 816, frameworks/middleware 818, applications 820, and a presentation layer 844. Operationally, the applications 820 or other components within the layers may invoke API calls 824 through the software stack and receive a response, returned values, and so forth (illustrated as messages 826) in response to the API calls 824. The layers illustrated are representative in nature, and not all software architectures have all layers. For example, some mobile or special-purpose operating systems may not provide a frameworks/middleware 818 layer, while others may provide such a layer. Other software architectures may include additional or different layers.

The operating system 814 may manage hardware resources and provide common services. The operating system 814 may include, for example, a kernel 828, services 830, and drivers 832. The kernel 828 may act as an abstraction layer between the hardware and the other software layers. For example, the kernel 828 may be responsible for memory management, processor management (e.g., scheduling) , component management, networking, security settings, and so on. The services 830 may provide other common services for the other software layers. The drivers 832 may be responsible for controlling or interfacing with the underlying hardware. For instance, the drivers 832 may include display drivers, camera drivers, drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers) , drivers, audio drivers, power management drivers, and so forth depending on the hardware configuration.

The libraries 816 may provide a common infrastructure that may be utilized by the applications 820 and/or other components and/or layers. The libraries 816 typically provide functionality that allows other software modules to perform tasks in an easier fashion than by interfacing directly with the underlying operating system 814 functionality (e.g., kernel 828, services 830, or drivers 832) . The libraries 816 may include system libraries 834 (e.g., C standard library) that may provide functions such as memory allocation functions, string manipulation functions, mathematic functions, and the like. In addition, the libraries 816 may include API libraries 836 such as media libraries (e.g., libraries to support presentation and manipulation of various media formats such as MPEG4, H. 264, MP3, AAC, AMR, JPG, and PNG) , graphics libraries (e.g., an OpenGL framework that may be used to render 2D and 3D graphic content on a display) , database libraries (e.g., SQLite that may provide various relational database functions) , web libraries (e.g., WebKit that may provide web browsing functionality) , and the like. The libraries 816 may also include a wide variety of other libraries 838 to provide many other APIs to the applications 820 and other software components/modules.

The frameworks 818 (also sometimes referred to as middleware) may provide a higher-level common infrastructure that may be utilized by the applications 820 or other software components/modules. For example, the frameworks 818 may provide various graphical user interface functions, high-level resource management, high-level location services, and so forth. The frameworks 818 may provide a broad spectrum of other APIs that may be utilized by the applications 820 and/or other software components/modules, some of which may be specific to a particular operating system or platform.

The applications 820 include built-in applications 840 and/or third-party applications 842. Examples of representative built-in applications 840 may include, but are not limited to, a home application, a contacts application, a browser application, a book reader application, a location application, a media application, a messaging application, or a game application.

The third-party applications 842 may include any of the built-in applications 840, as well as a broad assortment of other applications. In a specific example, the third-party applications 842 (e.g., an application developed using the Android^TM or iOS^TM software development kit (SDK) by an entity other than the vendor of the particular platform) may be mobile software running on a mobile operating system such as iOS^TM, Android^TM, or other mobile operating systems. In this example, the third-party applications 842 may invoke the API calls 824 provided by the mobile operating system such as the operating system 814 to facilitate functionality described herein.

The applications 820 may utilize built-in operating system functions (e.g., kernel 828, services 830, or drivers 832) , libraries (e.g., system libraries 834, API libraries 836, and other libraries 838) , or frameworks/middleware 818 to create user interfaces to interact with users of the system. Alternatively, or additionally, in some systems, interactions with a user may occur through a presentation layer, such as the presentation layer 844. In these systems, the application/module “logic” can be separated from the aspects of the application/module that interact with the user.

Some software architectures utilize virtual machines. In the example of FIG. 8, this is illustrated by a virtual machine 848. The virtual machine 848 creates a software environment where applications/modules can execute as if they were executing on a hardware machine (e.g., the machine 900 of FIG. 9) . The virtual machine 848 is hosted by a host operating system (e.g., the operating system 814) and typically, although not always, has a virtual machine monitor 846, which manages the operation of the virtual machine 848 as well as the interface with the host operating system (e.g., the operating system 814) . A software architecture executes within the virtual machine 848, such as an operating system 850, libraries 852, frameworks 854, applications 856, or a presentation layer 858. These layers of software architecture executing within the virtual machine 848 can be the same as corresponding layers previously described or may be different.

FIG. 9 illustrates a diagrammatic representation of a machine 900 in the form of a computer system within which a set of instructions may be executed for causing the machine 900 to perform any one or more of the methodologies discussed herein, according to an embodiment. Specifically, FIG. 9 shows a diagrammatic representation of the machine 900 in the example form of a computer system, within which instructions 916 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 900 to perform any one or more of the methodologies discussed herein may be executed. For example, the instructions 916 may cause the machine 900 to execute the method 400 described above with respect to FIG. 4. The instructions 916 transform the general, non-programmed machine 900 into a particular machine 900 programmed to carry out the described and illustrated functions in the manner described. In alternative embodiments, the machine 900 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 900 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 900 may comprise, but not be limited to, a server computer, a client computer, a personal computer (PC) , a tablet computer, a laptop computer, a netbook, a personal digital assistant (PDA) , an entertainment media system, a cellular telephone, a smart phone, a mobile device, or any machine capable of executing the instructions 916, sequentially or otherwise, that specify actions to be taken by the machine 900. Further, while only a single machine 900 is illustrated, the term “machine” shall also be taken to include a collection of machines 900 that individually or jointly execute the instructions 916 to perform any one or more of the methodologies discussed herein.

The machine 900 may include processors 910, memory 930, and I/O components 950, which may be configured to communicate with each other such as via a bus 902. In an embodiment, the processors 910 (e.g., a hardware processor, such as a central processing unit (CPU) , a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU) , a digital signal processor (DSP) , an application-specific integrated circuit (ASIC) , a radio-frequency integrated circuit (RFIC) , another processor, or any suitable combination thereof) may include, for example, a processor 912 and a processor 914 that may execute the instructions 916. The term “processor” is intended to include multi-core processors that may comprise two or more independent processors (sometimes referred to as “cores” ) that may execute instructions contemporaneously. Although FIG. 9 shows multiple processors 910, the machine 900 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor) , multiple processors with a single core, multiple processors with multiples cores, or any combination thereof.

The memory 930 may include a main memory 932, a static memory 934, and a storage unit 936 including machine-readable medium 938, each accessible to the processors 910 such as via the bus 902. The main memory 932, the static memory 934, and the storage unit 936 store the instructions 916 embodying any one or more of the methodologies or functions described herein. The instructions 916 may also reside, completely or partially, within the main memory 932, within the static memory 934, within the storage unit 936, within at least one of the processors 910 (e.g., within the processor’s cache memory) , or any suitable combination thereof, during execution thereof by the machine 900.

The I/O components 950 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 950 that are included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones will likely include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 950 may include many other components that are not shown in FIG. 9. The I/O components 950 are grouped according to functionality merely for simplifying the following discussion, and the grouping is in no way limiting. In various embodiments, the I/O components 950 may include output components 952 and input components 954. The output components 952 may include visual components (e.g., a display such as a plasma display panel (PDP) , a light-emitting diode (LED) display, a liquid crystal display (LCD) , a projector, or a cathode ray tube (CRT) ) , acoustic components (e.g., speakers) , haptic components (e.g., a vibratory motor, resistance mechanisms) , other signal generators, and so forth. The input components 954 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components) , point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument) , tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components) , audio input components (e.g., a microphone) , and the like.

In further embodiments, the I/O components 950 may include biometric components 956, motion components 958, environmental components 960, or position components 962, among a wide array of other components. The motion components 958 may include acceleration sensor components (e.g., accelerometer) , gravitation sensor components, rotation sensor components (e.g., gyroscope) , and so forth. The environmental components 960 may include, for example, illumination sensor components (e.g., photometer) , temperature sensor components (e.g., one or more thermometers that detect ambient temperature) , humidity sensor components, pressure sensor components (e.g., barometer) , acoustic sensor components (e.g., one or more microphones that detect background noise) , proximity sensor components (e.g., infrared sensors that detect nearby objects) , gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere) , or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 962 may include location sensor components (e.g., a Global Positioning System (GPS) receiver component) , altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived) , orientation sensor components (e.g., magnetometers) , and the like.

Communication may be implemented using a wide variety of technologies. The I/O components 950 may include communication components 964 operable to couple the machine 900 to a network 980 or devices 970 via a coupling 982 and a coupling 972, respectively. For example, the communication components 964 may include a network interface component or another suitable device to interface with the network 980. In further examples, the communication components 964 may include wired communication components, wireless communication components, cellular communication components, near field communication (NFC) components, components (e.g., Low Energy) , components, and other communication components to provide communication via other modalities. The devices 970 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a USB) .

Moreover, the communication components 964 may detect identifiers or include components operable to detect identifiers. For example, the communication components 964 may include radio frequency identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes) , or acoustic detection components (e.g., microphones to identify tagged audio signals) . In addition, a variety of information may be derived via the communication components 964, such as location via Internet Protocol (IP) geolocation, location viasignal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.

Certain embodiments are described herein as including logic or a number of components, modules, elements, or mechanisms. Such modules can constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A “hardware module” is a tangible unit capable of performing certain operations and can be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) are configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In various embodiments, a hardware module is implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware module can include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware module can be a special-purpose processor, such as a field-programmable gate array (FPGA) or an ASIC. A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware module can include software encompassed within a general-purpose processor or other programmable processor. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) can be driven by cost and time considerations.

Accordingly, the phrase “module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired) , or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed) , each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where a hardware module comprises a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different special-purpose processors (e.g., comprising different hardware modules) at different times. Software can accordingly configure a particular processor or processors, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules can be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously, communications can be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between or among such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module performs an operation and stores the output of that operation in a memory device to which it is communicatively coupled. A further hardware module can then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules can also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information) .

The various operations of example methods described herein can be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented module” refers to a hardware module implemented using one or more processors.

Similarly, the methods described herein can be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method can be performed by one or more processors or processor-implemented modules. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS) . For example, at least some of the operations may be performed by a group of computers (as examples of machines 900 including processors 910) , with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an API) . In certain embodiments, for example, a client device may relay or operate in communication with cloud computing systems and may access circuit design information in a cloud environment.

The performance of certain of the operations may be distributed among the processors, not only residing within a single machine 900, but deployed across a number of machines 900. In some example embodiments, the processors 910 or processor-implemented modules are located in a single geographic location (e.g., within a home environment, an office environment, or a server farm) . In other example embodiments, the processors or processor-implemented modules are distributed across a number of geographic locations.

EXECUTABLE INSTRUCTIONS AND MACHINE STORAGE MEDIUM

The various memories (i.e., 930, 932, 934, and/or the memory of the processor (s) 910) and/or the storage unit 936 may store one or more sets of instructions 916 and data structures (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. These instructions (e.g., the instructions 916) , when executed by the processor (s) 910, cause various operations to implement the disclosed embodiments.

As used herein, the terms “machine-storage medium, ” “device-storage medium, ” and “computer-storage medium” mean the same thing and may be used interchangeably. The terms refer to a single or multiple storage devices and/or media (e.g., a centralized or distributed database, and/or associated caches and servers) that store executable instructions 916 and/or data. The terms shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, including memory internal or external to processors. Specific examples of machine-storage media, computer-storage media and/or device-storage media include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM) , electrically erasable programmable read-only memory (EEPROM) , FPGA, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms “machine-storage media, ” “computer-storage media, ” and “device-storage media” specifically exclude carrier waves, modulated data signals, and other such media, at least some of which are covered under the term “signal medium” discussed below.

TRANSMISSION MEDIUM

In various embodiments, one or more portions of the network 980 may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN) , a LAN, a wireless LAN (WLAN) , a WAN, a wireless WAN (WWAN) , a metropolitan-area network (MAN) , the Internet, a portion of the Internet, a portion of the public switched telephone network (PSTN) , a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, anetwork, another type of network, or a combination of two or more such networks. For example, the network 980 or a portion of the network 980 may include a wireless or cellular network, and the coupling 982 may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling 982 may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1xRTT) , Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS) , High-Speed Packet Access (HSPA) , Worldwide Interoperability for Microwave Access (WiMAX) , Long-Term Evolution (LTE) standard, others defined by various standard-setting organizations, other long-range protocols, or other data transfer technology.

The instructions may be transmitted or received over the network using a transmission medium via a network interface device (e.g., a network interface component included in the communication components) and utilizing any one of a number of well-known transfer protocols (e.g., hypertext transfer protocol (HTTP) ) . Similarly, the instructions may be transmitted or received using a transmission medium via the coupling (e.g., a peer-to-peer coupling) to the devices 970. The terms “transmission medium” and “signal medium” mean the same thing and may be used interchangeably in this disclosure. The terms “transmission medium” and “signal medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying the instructions for execution by the machine, and include digital or analog communications signals or other intangible media to facilitate communication of such software. Hence, the terms “transmission medium” and “signal medium” shall be taken to include any form of modulated data signal, carrier wave, and so forth. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.

COMPUTER-READABLE MEDIUM

The terms “machine-readable medium, ” “computer-readable medium, ” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure. The terms are defined to include both machine- storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals. For instance, an embodiment described herein can be implemented using a non-transitory medium (e.g., a non-transitory computer-readable medium) .

Throughout this specification, plural instances may implement resources, components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components.

As used herein, the term “or” may be construed in either an inclusive or exclusive sense. The terms “a” or “an” should be read as meaning “at least one, ” “one or more, ” or the like. The presence of broadening words and phrases such as “one or more, ” “at least, ” “but not limited to, ” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent. Additionally, boundaries between various resources, operations, modules, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

It will be understood that changes and modifications may be made to the disclosed embodiments without departing from the scope of the present disclosure. These and other changes or modifications are intended to be included within the scope of the present disclosure.

Claims

A system comprising:

one or more hardware processors; and

at least one non-transitory machine-readable medium for storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations comprising:

receiving a natural language text from a device;

generating, using a first machine learning model, a first embedding vector that represents the natural language text;

matching the first embedding vector with a second embedding vector that represents a data table, the data table being associated with a table schema;

generating, based on the natural language text, a prompt as an input to a second machine learning model, the prompt including the table schema; and

generating, using the second machine learning model, a response based on the prompt, the response including a Structured query language (SQL) query that can be executed on a database; and

causing display of the SQL query in a user interface of the device.
The system of claim 1, wherein the operations comprise:

accessing a plurality of data tables from one or more SQL databases, each data table being associated with a table schema;

generating, using the first machine learning model, a plurality of embedding vectors, each embedding vector representing a corresponding data table; and

storing the plurality of embedding vectors in a vector database.
The system of claim 2, wherein the plurality of embedding vectors comprises the second embedding vector that represents the data table.
The system of claim 1, wherein the operations comprise:

detecting, via the device, a successful execution of the SQL query against the database;

generating a question-SQL pair that includes both the natural language text and the SQL query; and

storing the question-SQL pair in a mapping data repository.
The system of claim 1, wherein the natural language text comprises a question, and wherein the operations comprise:

accessing a plurality of historical questions associated with the data table, each historical question being associated with a syntactically correct SQL query;

identifying, using the first machine learning model, a historical question based on the natural language text received from the device; and

generating the prompt that includes both the historical question and the syntactically correct SQL query as a training example for the first machine learning model.
The system of claim 5, wherein the historical questions associated with the data table are identified from a mapping data repository that includes a plurality of question-SQL pairs, each question-SQL pair including a syntactically correct SQL query that matches a respective question.
The system of claim 1, wherein the operations comprise:

detecting a failed execution of the SQL query against the database;

determining that a modification is made to the SQL query; and

detecting a successful execution of a modified SQL query against the database.
The system of claim 7, wherein the operations comprise:

generating a question-SQL pair that includes both the natural language text and the modified SQL query; and

storing the question-SQL pair in a mapping data repository.
The system of claim 1, wherein the first machine learning model comprises at least one Bidirectional Encoder Representations from Transformers (BERT) machine learning model.
The system of claim 1, wherein the second machine learning model comprises at least one large language machine learning model.
A method comprising:

receiving a natural language text from a device;

generating, using a first machine learning model, a first embedding vector that represents the natural language text;

matching the first embedding vector with a second embedding vector that represents a data table, the data table being associated with a table schema;

generating, based on the natural language text, a prompt as an input to a second machine learning model, the prompt including the table schema; and

generating, using the second machine learning model, a response based on the prompt, the response including a Structured query language (SQL) query that can be executed on a database; and

causing display of the SQL query in a user interface of the device.
The method of claim 11, comprising:

accessing a plurality of data tables from one or more SQL databases, each data table being associated with a table schema;

generating, using the first machine learning model, a plurality of embedding vectors, each embedding vector representing a corresponding data table; and

storing the plurality of embedding vectors in a vector database.
The method of claim 12, wherein the plurality of embedding vectors comprises the second embedding vector that represents the data table.
The method of claim 11, comprising:

detecting, via the device, a successful execution of the SQL query against the database;

generating a question-SQL pair that includes both the natural language text and the SQL query; and

storing the question-SQL pair in a mapping data repository.
The method of claim 11, wherein the natural language text comprises a question, comprising:

accessing a plurality of historical questions associated with the data table, each historical question being associated with a syntactically correct SQL query;

identifying, using the first machine learning model, a historical question based on the natural language text received from the device; and

generating the prompt that includes both the historical question and the syntactically correct SQL query as a training example for the first machine learning model.
The method of claim 15, wherein the historical questions associated with the data table are identified from a mapping data repository that includes a plurality of question-SQL pairs, each question-SQL pair including a syntactically correct SQL query that matches a respective question.
The method of claim 11, comprising:

detecting a failed execution of the SQL query against the database;

determining that a modification is made to the SQL query; and

detecting a successful execution of a modified SQL query against the database.
The method of claim 17, comprising:

generating a question-SQL pair that includes both the natural language text and the modified SQL query; and

storing the question-SQL pair in a mapping data repository.
The method of claim 11, wherein the first machine learning model comprises at least one Bidirectional Encoder Representations from Transformers (BERT) machine learning model, and wherein the second machine learning model comprises at least one large language machine learning model.
A machine-storage medium for storing instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform operations comprising:

receiving a natural language text from a device;

generating, using a first machine learning model, a first embedding vector that represents the natural language text;

matching the first embedding vector with a second embedding vector that represents a data table, the data table being associated with a table schema;

generating, based on the natural language text, a prompt as an input to a second machine learning model, the prompt including the table schema; and

generating, using the second machine learning model, a response based on the prompt, the response including a Structured query language (SQL) query that can be executed on a database; and

causing display of the SQL query in a user interface of the device.