US20220164349A1 - Distributed queries through dynamic views - Google Patents
- Publication number: US20220164349A1 (application US 17/529,898)
- Authority
- US
- United States
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
Definitions
- a data query and retrieval system receives a query request and identifies a plurality of data warehouses, or stores, where data needed to satisfy the query is stored. Data items, or fields, called for by the query request are grouped into sets based on fields stored at the same data warehouse. Individual data connections are generated for each data warehouse accessed, and a database view defined for the fields available at each respective data warehouse. The individual views are invoked in a single query statement by a query processor for accessing the respective data warehouses via the corresponding database view. In this manner, a query operator or user need not define multiple query expressions or perform separate queries for accessing data fields or dimensions residing at different data warehouses.
- Configurations disclosed below therefore provide a method for querying a multidimensional database, including identifying, for each of a plurality of fields sought by a query request, a data warehouse where the sought fields are stored, and forming, for each data warehouse corresponding to the plurality of fields, a view indicative of the fields sought from the respective data warehouse.
- a query engine generates a query expression for retrieval of each of the fields sought by the query request by accessing the respective fields via the formed views.
- FIG. 1 is a context diagram of a database environment suitable for use with configurations herein;
- FIG. 2 is a block diagram of a file system supporting a data warehouse in the environment of FIG. 1 ;
- FIG. 3 is a modified Venn diagram of a multidimensional database, or datacube, environment suitable for configurations herein;
- FIG. 4 shows a processing arrangement of a query engine as in FIG. 1 for querying a database as disclosed herein;
- FIG. 5 shows an example subquery and connection arrangement based on a query request using the approach of FIGS. 1-4 ;
- FIG. 6 is a flowchart of query processing as in FIGS. 1-5 .
- Configurations disclosed herein allow for a query to obtain data from multiple data sources in a single request.
- The data sources may be completely different from each other: a standard relational database, a distributed database, a cloud-based database, or any other database or data warehouse accessible by the query engine. They can all be used seamlessly to bring data together and form the results the user requests.
- Modern databases have benefited from advancements in computer networking technology to allow a distributed physical structure. While databases ultimately exist for the purpose of accessing particular data items as a single concise report or stream, the data so gathered may be stored in a multitude of different physical locations, ranging from the same file or table on a storage device, to different storage devices/disks that are part of the same physical assembly (e.g., a rack-mounted array), to tables in different physical locations joined by network connectivity.
- the term “data warehouse” may often be used to collectively refer to the aggregate body of distributed data accessible via a single set of control instructions, or DBMS (DataBase Management System).
- The term DBMS and related terms (RDBMS, for Relational DBMS; disk farm) are often used to describe a common point or channel of control that regulates appropriate access to the stored data, providing access control and security in addition to robust query capability.
- Configurations herein are based, in part, on the observation that database managers (DBMSs) emanate from different vendors, and owners of data often select a particular vendor's product for an individual usage, application or focus. While interoperability is often supported in the way of APIs (Application Programming Interfaces), data conversion approaches, and remote login capability, conventional approaches suffer from the shortcoming that it may not be a seamless process to access data across multiple vendors' platforms. Accordingly, configurations herein substantially overcome the shortcomings of separate data storage regimes by providing a query engine having multiple connections to each of a plurality of data warehouses, such that each data warehouse corresponds to a vendor-imposed storage arrangement such as a DBMS.
- FIG. 1 is a context diagram of a database environment suitable for use with configurations herein.
- a user 103 employs a user device 104 for interacting with a server 105 .
- the server 105 launches one or more applications (apps) 120 - 1 . . . 120 -N ( 120 generally) for performing various tasks that invoke database exchanges via query requests.
- the server 105 also includes a compute engine 400 and query logic 450 for receiving the query requests and generating a query response, discussed further below.
- the apps 120 and compute engine 400 may occupy the same server 105 or computing device, or the app 120 may reside on a user machine and exchange remotely with the compute engine.
- the compute engine 400 receives query requests 115 from the apps 120 , based on a user request.
- the query requests seek data that resides on multiple data warehouses 110 - 1 . . . 110 - 2 ( 110 generally).
- the compute engine 400 establishes a respective connection 420 - 1 . . . 420 - 2 ( 420 generally) to each data warehouse 110 -N for accessing and returning the data stored thereby.
- Each data warehouse generally corresponds to a DBMS of a particular vendor, and may have different access and query syntax requirements.
- Each data warehouse 110 also accesses a file system 210 - 1 . . . 210 -N, which physically stores the data and related indices and metadata. Any suitable physical arrangement and combination of storage media devices 116 may be included in the file system, and they may be distributed across different physical locations via a network 130 .
- Configurations herein identify, for each of a plurality of fields sought by a query request, a data warehouse 110 where the sought fields are stored, and form, for each respective data warehouse 110 corresponding to the plurality of fields, a view indicative of the fields sought from the respective data warehouse, where each data warehouse has an independent access medium and storage domain.
- a view is defined to meet the vendor imposed syntax and access forms for the particular data warehouse 110 .
- the access medium is based on a set of control instructions for accessing the respective fields stored in the data warehouse, and may be fulfilled by a vendor specific DBMS or similar software.
- the storage domain is based on a storage repository under the control of the access medium, typically a collection of hard disk drives (HDD) or solid state drives (SSD), discussed further in FIG. 2 .
- the query logic 450 generates a respective subquery from the user request 115 based on each view of the plurality of views, where the subquery is for retrieval of each of the fields sought by the query request by accessing the respective fields via the formed view. Each subquery, therefore, seeks the fields available in the data warehouse to which it is directed.
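The grouping and per-view subquery generation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the catalog, field names, and view-naming convention are all hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog mapping each requested field to the data warehouse
# that stores it (names are illustrative, not from the patent).
CATALOG = {
    "customer_id": "warehouse_a",
    "customer_name": "warehouse_a",
    "order_total": "warehouse_b",
    "ship_date": "warehouse_b",
}

def group_fields_by_source(fields):
    """Group the fields sought by a query request into sets, one per warehouse."""
    groups = defaultdict(list)
    for field in fields:
        groups[CATALOG[field]].append(field)
    return dict(groups)

def subqueries_for(fields):
    """Emit one subquery per warehouse, selecting its fields from its view."""
    return {
        source: "SELECT {} FROM {}_view".format(", ".join(cols), source)
        for source, cols in group_fields_by_source(fields).items()
    }
```

Each generated subquery touches only the fields available in the data warehouse to which it is directed, matching the per-view retrieval described above.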
- FIG. 2 is a block diagram of an example file system supporting a data warehouse in the environment of FIG. 1 ;
- FIG. 2 depicts a physical file storage system suitable for supporting the data warehouses of FIG. 1; its purpose is to illustrate the various physical storage domains that may underlie a particular vendor's DBMS. Configurations herein rest on the premise that the user need not know which tables and data warehouse(s) are invoked by the user request 115.
- the computing environment 201 comprises one or more instances of a client device 204 (e.g., a desktop computer), one or more instances of an application server 205 , and one or more instances of a distributed data file system 210 .
- the user application 120 and database 140 may comprise and/or operate in conjunction with a client device 204 and any number of physically distributed and networked storage entities, typically disk drives or SSD (solid state drive) devices.
- the aforementioned devices, servers, and systems can communicate through a network 202 (e.g., a wireless network, a wide area network (WAN), a local area network (LAN), the Internet, an intranet, a private network, etc.).
- the distributed data file system 210 can be configurable to store and process large volumes of data (e.g., using an HDFS framework) and comprise one or more instances of a client gateway node 212, one or more instances of a primary name node 213, one or more instances of a secondary name node 214, one or more instances of a support server 215 (e.g., executing data analysis processes, etc.), and a plurality of data nodes 216 (e.g., data node 1 216-1, data node 2 216-2, data node 3 216-3, to data node N 216-N).
- the distributed data file system 210 is configurable to store large files of data across multiple computing devices (e.g., the plurality of data nodes 216 ), rather than store all of the files on a single computing device with a large amount of storage capacity.
- the plurality of data nodes 216 can be configured with one or more rack-mounted servers coupled (e.g., using SATA or SAS interconnects) to multiple hard disk drives for storing the data.
- the primary name node 213 is responsible for storage and management of metadata associated with the distributed data file system 210 , such that, for example, the primary name node 213 can provide data location information for data processing operations.
- other configurations, partitions, and architectures of the distributed data file system 210 shown in computing environment 201 are possible.
- a single-node implementation may have all of the nodes (e.g., client gateway node 212 , primary name node 213 , the plurality of data nodes 216 , etc.) reside on a single computing device.
- a small cluster of two nodes may have a first computing device operate as a name node and a second computing device operate as a data node.
- the distributed data file system 210 can represent any database management system.
- Each connection 420 is made between the compute engine 400 and the data warehouse 110 , assigned a view, and the file system 210 of each respective data warehouse (typically a vendor DBMS) accesses the physical data via the view.
- FIG. 3 is a modified Venn diagram of a multidimensional database, or datacube, environment suitable for configurations herein.
- a query approach 300 includes the following components
- Previously, query creation for the leaf (data-containing) nodes of the virtual cube system was a manual process of column identification and joining of tables.
- The disclosed approach, in contrast, integrates a virtual "island" system for joining together tables within the same data source connection 420, and aggregates the virtual islands into an "archipelago" satisfying the query request from data across different connections.
- the general idea is to build the query by identifying a starting table and creating a virtual island 302 containing that table. As tables are joined, if they are on the same connection, they are added to that virtual island. When a table being joined requires a connection different from that island, a new island is created with that table. This process continues until all of the required tables have been joined together. This effectively creates a linked list of virtual islands, which we identify as a virtual archipelago 301 (literally meaning a “collection or set of islands”).
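The island-building walk above can be sketched directly: start an island at the first table, extend it while successive tables share the same connection, and start a new linked island when the connection changes. A minimal sketch, with hypothetical table and connection names:

```python
class Island:
    """A group of tables reachable over a single data source connection."""
    def __init__(self, connection, table):
        self.connection = connection
        self.tables = [table]
        self.next = None  # link to the next island in the archipelago

def build_archipelago(join_order, table_connections):
    """Walk the tables in join order; extend the current island while the
    connection matches, otherwise start a new island. The islands form a
    linked list, i.e. the virtual archipelago."""
    head = tail = None
    for table in join_order:
        conn = table_connections[table]
        if tail is not None and tail.connection == conn:
            tail.tables.append(table)     # same connection: extend the island
        else:
            island = Island(conn, table)  # new connection: start a new island
            if head is None:
                head = island
            else:
                tail.next = island
            tail = island
    return head
```

For example, joining tables on connections con1, con1, con2, con1 in that order yields three linked islands, the first holding two tables.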
- the disclosed method for querying a database includes identifying, for each of a plurality of fields sought by a view in a query request, a source, defined by a data warehouse where the sought fields are stored.
- Each data source stores tables, including a starting table, aggregated into an island, meaning a group of tables at one source.
- a view encompasses all the tables in an island.
- Each table includes at least the fields, arranged in columns, needed for the query request. The table likely also contains other fields not called for by the query request.
- a virtual cube 350 - 1 . . . 350 - 2 ( 350 generally) assigns or creates a dimension for each field for an island.
- the virtual cube need only include dimensions called for by a respective query.
- the archipelago therefore is a grouping or list of islands, each island corresponding to a source.
- Island1 302 - 1 defines a view:
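The view text from the figure is not reproduced in this extract. As a hypothetical illustration only (table, key, and view names are invented), such a view can be composed by joining all tables in the island:

```python
def island_view_sql(view_name, tables, join_keys):
    """Compose a CREATE VIEW statement joining all tables in one island.
    Assumes each subsequent table joins to the first table on one key;
    all names here are illustrative placeholders."""
    base, *rest = tables
    joins = "".join(
        f" JOIN {t} ON {base}.{k} = {t}.{k}" for t, k in zip(rest, join_keys)
    )
    return f"CREATE VIEW {view_name} AS SELECT * FROM {base}{joins}"
```

The resulting view then stands in for the whole island when the outer query is assembled.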
- each connection 420 is driven by a set of control instructions of the respective DBMS for accessing the set of tables in the respective data warehouse.
- the set of tables in one data warehouse are distinct from tables in other data warehouses accessible via other connections of the plurality of connections, meaning under the control of a particular DBMS.
- the query logic 450 defines an island 302 based on the set of accumulated tables at the respective data warehouse, and generates the archipelago 301 by defining an aggregation, such as a linked list or similar structure, for interconnecting the islands 302 corresponding to the plurality of connections.
- FIG. 4 shows a simplified processing arrangement of a query engine as in FIG. 1 for querying a database as disclosed herein.
- the query engine includes the compute engine 400 , which has an interface, access or API to a query processor 410 .
- a SPARK™ query implementation may be invoked; however, other suitable query engines may also be used.
- the use of a query processor 410 avoids dissemination of sensitive login or access control information for the respective data warehouses.
- the query engine establishes connections 420-1 . . . 420-3 (420 generally) to each of data warehouses 110-1 . . . 110-3 (110 generally), and delegates subqueries as needed to the query processor 410. It then forms or defines, for each data warehouse corresponding to the plurality of fields, a view indicative of the fields sought from the respective data warehouse, and generates a query expression for retrieval of each of the fields sought by the query request by accessing the respective fields via the formed view, as now described with respect to FIG. 5.
- the compute engine 400 invokes the query logic 450 that analyzes fields as the tree is being built to identify the sub-trees that reside within a single data source connection 420 .
- the sub-trees being joined or unioned will be turned into compute views.
- the query will substitute a select from the view that is created, and the compute node will be able to do the appropriate data substitution at run time.
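The substitution step can be sketched on a toy query tree: any sub-tree that has been turned into a compute view is replaced by a select from that view. The tree shape and view names below are hypothetical:

```python
def substitute_views(node, view_of_subtree):
    """Replace any sub-tree that maps to a compute view with a select
    from that view; otherwise recurse into the remaining children.
    Nodes are plain dicts with a "name" and optional "children"."""
    name = node["name"]
    if name in view_of_subtree:
        return {"name": name, "sql": f"SELECT * FROM {view_of_subtree[name]}"}
    return {
        "name": name,
        "children": [
            substitute_views(c, view_of_subtree) for c in node.get("children", [])
        ],
    }
```

At run time, the compute node can then perform the appropriate data substitution for each view reference.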
- FIG. 5 shows an example subquery and connection arrangement based on a query request using the approach of FIGS. 1-4
- the compute engine 400 performs query parsing and processing based on the query request 115 in the environment of FIGS. 1 and 2 .
- multiple connections 420 exist to each of a plurality of data warehouses 110, such that each data warehouse corresponds to a vendor-imposed storage arrangement
- FIG. 3 shows satisfaction of a query request invoking each data warehouse 110 .
- the compute engine 400 receives the query request 115 having a plurality of fields and one or more conditions, in which the conditions define selective retrieval of the data in the fields in a conditional syntax such as SQL.
- the query logic 450 identifies, for each field of the plurality of fields, a source that stores the field. Each field is defined by one or more dimensions in the respective source. Each source is a data warehouse 110 or other repository, set of tables or collection configured for storing the field as a dimension of a multidimensional set of tables.
- the query logic 450 defines, for each of the identified sources, a view for accessing the fields stored in a respective source, and establishes a connection 420 to each source of a plurality of the sources.
- the compute engine 400, invoking the query processor 410 as needed, accesses each of the sources via the established connection 420.
- the example query involves three different data source connections.
- the query logic 450 forms a hierarchical structure 500 based on dependencies in the query request, and identifies, from the hierarchical structure 500, one or more sub-trees v1 . . . v4 corresponding to the same source 430 (data warehouse). It then forms a view corresponding to each of the sub-trees sharing the same source.
- a view generally corresponds to a subset or subquery of fields common to the respective source.
- the lower leaf nodes v 1 -v 4 denote subqueries.
- the nodes in the bottom right section labeled v4 all emanate from the same connection Con1.
- v1 is likewise accessible via the same connection Con1. While building this query, we can identify that all of that data comes from one database and build that subquery into a query. Then, when it needs to be joined with the connection to v3, the subquery for v3 is also turned into a view. The join above it can then select from these two views to join the data together. Likewise, v1 and v2 can also be joined together using views. These joins are then unioned and further aggregated to produce a final result.
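The composition just described, joining pairs of views, unioning the joins, then aggregating, can be sketched as nested SQL text over the per-island views. This is a simplified, hypothetical illustration (view names, join keys, and the SUM aggregate are all invented):

```python
def final_query(join_pairs, agg_column):
    """Join each pair of views, union the joined results, then aggregate.
    join_pairs is a list of (left_view, right_view, key) tuples; names
    are illustrative placeholders, not the patent's identifiers."""
    joined = [
        f"SELECT * FROM {l} JOIN {r} ON {l}.{k} = {r}.{k}"
        for l, r, k in join_pairs
    ]
    union = " UNION ALL ".join(f"({q})" for q in joined)
    return f"SELECT SUM({agg_column}) FROM ({union}) AS unioned"
```

The outer SELECT here stands in for the "further aggregated" final step; a real implementation would shape the union and aggregate to the specific query request.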
- the compute engine 400 is a package wrapping an underlying database (Spark in the initial implementation). It has the responsibility of the view creation and delegation. When the create view requests are received, they contain all of the information required to get the data from the appropriate source.
- the final query can be run on the compute database itself, which then sends the final results.
- the query logic 450, for each view, identifies a set of tables stored at the source that contain the fields, and iteratively accumulates the tables identified in the view by a join. In this manner, multiple dimensions are accumulated by traversing the joined tables.
- the query logic substitutes a selection operation for each field called for by the view to effect retrieval.
- the delegation to the query processor 410 avoids dissemination of authentication and access information such as passwords.
- the compute engine 400 may establish access via the connection 420 to the respective data warehouses 430 , and permit the query processor 410 to perform computationally intensive tasks.
- FIG. 6 is a flowchart of query processing as in FIGS. 1-5 .
- the compute engine 400 receives a query request 115 having a plurality of fields and one or more conditions, such that the conditions define selective retrieval of the data in the fields.
- This is typically a SQL or SQL-like construct.
- Query logic 450 identifies, for each field of the plurality of fields, a source that stores the field, in which each field is defined by one or more dimensions in the respective source, as depicted at step 602 .
- Each source generally denotes a data warehouse 110 .
- a view identifies a set of tables stored at the source that contain the fields, as disclosed at step 603 .
- the query processor 410 iteratively accumulates the tables identified in the view by a join, as depicted at step 604 .
- the view is then defined for each of the identified sources for accessing the fields stored in a respective source, as shown at step 605 .
- This includes forming a hierarchical structure based on dependencies in the query request, as depicted at step 606, and identifying, from the hierarchical structure, one or more sub-trees corresponding to the same source, shown at step 607 and described above with respect to FIG. 5.
- the query processor 410 forms a view corresponding to each of the sub-trees sharing the same source (step 608 ), corresponding to the island 302 structure discussed above.
- a connection 420 is defined for each of the identified sources of the plurality of the sources needed for the query request 115, as depicted at step 609. Configurations herein are particularly beneficial when the connections pertain to data warehouses under different vendors.
- the query processor 410 establishes a connection to a data warehouse 110 defining the source, as depicted at step 611, and accumulates a set of tables corresponding to the fields sought from the respective source, shown at step 612.
- the query processor aggregates the collective connections (the “islands” described above) by forming a linked list of the set of accumulated tables corresponding to each connection, as depicted at step 613 .
- the query proceeds by accessing each of the sources via the established connection, as disclosed at step 614. This may include, for each view, substituting a selection operation for the fields called for by the view.
- programs and methods defined herein are deliverable to a user processing and rendering device in many forms, including but not limited to a) information permanently stored on non-writeable storage media such as ROM devices, b) information alterably stored on writeable non-transitory storage media such as solid state drives (SSDs) and media, flash drives, floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media, or c) information conveyed to a computer through communication media, as in an electronic network such as the Internet or telephone modem lines.
- the operations and methods may be implemented in a software executable object or as a set of encoded instructions for execution by a processor responsive to the instructions, including virtual machines and hypervisor controlled execution environments.
- Alternatively, the operations and methods may be embodied in whole or in part using hardware components, such as Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software, and firmware components.
Description
- This patent application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent App. No. 63/116,351, filed Nov. 20, 2020, entitled “DISTRIBUTED QUERIES THROUGH DYNAMIC VIEWS,” incorporated herein by reference in entirety.
- Electronic databases store tremendous amounts of data, and have been doing so for several decades, ever since the cost of computer hardware came within reach for most businesses and consumers. Large "data warehouses" now store vast amounts of data, indexed according to a storage format, often tables or multidimensional arrangements, with indices that allow access to the data through interfaces and software defined by the particular vendor.
- The foregoing and other objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
-
FIG. 1 is a context diagram of a database environment suitable for use with configurations herein; -
FIG. 2 is a block diagram of a file system supporting a data warehouse in the environment ofFIG. 1 ; -
FIG. 3 is a modified Venn diagram of a multidimensional database, or datacube, environment suitable for configurations herein; -
FIG. 4 shows a processing arrangement of a query engine as inFIG. 1 for querying a database as disclosed herein; -
FIG. 5 shows an example subquery and connection arrangement based on a query request using the approach ofFIGS. 1-4 ; and -
FIG. 6 is a flowchart of query processing as inFIGS. 1-5 . - Various configurations depicting the above features and benefits as disclosed herein are shown and described further below. Configurations disclosed herein allow for a query to obtain data from multiple data sources in a single request. The data sources may be completely different from each other. They can be a standard relational database, distributed database, a cloud based database or any other database or data warehouse accessible by the query engine. They can all be used seamlessly to bring data together and form the results the user requests.
- Modern databases have benefited from advancements in computer networking technology to allow a distributed physical structure. While databases ultimately exist for the purpose of accessing particular data items as a single concise report or stream, the data so gathered may be stored in a multitude of different physical locations, ranging from the same file or table on a storage device, to different storage devices/disks that are part of the same physical assembly (i.e. rack mounted array), to tables in different physical locations joined by network connectivity. The term “data warehouse” may often be used to collectively refer to the aggregate body of distributed data accessible via a single set of control instructions, or DBMS (DataBase Management System). The DBMS and related synonyms (RDBMS, for Relational DBMS, Disk Farm) are often used to describe a common point or channel of control that regulates appropriate access to the data so stored for, in addition to robust query capability, access control and security.
- Configurations herein are based, in part, on the observation that database managers (DBMS) emanate from different vendors, and owners of data often select a particular vendor's product for an individual usage, application or focus. While interoperability is often supported in the way of APIs (Application Programming Interfaces), data conversion approaches, and remote login capability, conventional approaches suffer from the shortcoming that it may not be a seamless process to access data across multiple vendor's platforms. Accordingly, configurations herein substantially overcome the shortcomings of separate data storage regimes by providing a query engine having multiple connections to each of a plurality of data warehouses, such that each data warehouse corresponds to a vendor imposed storage arrangement such as a DBMS.
-
FIG. 1 is a context diagram of a database environment suitable for use with configurations herein. Referring toFIG. 1 , in adatabase environment 101, auser 103 employs auser device 104 for interacting with aserver 105. Theserver 105 launches one or more applications (apps) 120-1 . . . 120-N (120 generally) for performing various tasks that invoke database exchanges via query requests. Theserver 105 also includes acompute engine 400 andquery logic 450 for receiving the query requests and generating a query response, discussed further below. The apps 120 andcompute engine 400 may occupy thesame server 105 or computing device, or the app 120 may reside on a user machine and exchange remotely with the compute engine. - The
compute engine 400 receives query requests 115 from the apps 120, based on a user request. In configurations herein, the query requests seek data that resides on multiple data warehouses 110-1 . . . 110-2 (110 generally). Thecompute engine 400 establishes a respective connection 420-1 . . . 420-2 (420 generally) to each data warehouse 110-N for accessing and returning the data stored thereby. Each data warehouse generally corresponds to a DBMS of a particular vendor, and may have different access and query syntax requirements. Each data warehouse 110 also accesses a file system 210-1 . . . 210-N, which physically stores the data and related indices and metadata. Any suitable physical arrangement and combination of storage media devices 116 may be included in the file system, and they may be distributed across different physical locations via anetwork 130. - Configurations herein identify, for each of a plurality of fields sought by a query request, a data warehouse 110 where the sought fields are stored, and forms, for each respective data warehouse 110 corresponding to the plurality of fields, a view indicative of the fields sought from the respective data warehouse, where each data warehouse has an independent access medium and storage domain. Stated differently, a view is defined to meet the vendor imposed syntax and access forms for the particular data warehouse 110. In general, the access medium is based on a set of control instructions for accessing the respective fields stored in the data warehouse, and may be fulfilled by a vendor specific DBMS or similar software. The storage domain is based on a storage repository under the control of the access medium, typically a collection of hard disk drives (HDD) or solid state drives (SSD), discussed further in
FIG. 2. - The
query logic 450 generates a respective subquery from the user request 115 based on each view of the plurality of views, where each subquery retrieves the fields sought by the query request by accessing the respective fields via the formed view. Each subquery, therefore, seeks the fields available in the data warehouse to which it is directed.
-
FIG. 2 is a block diagram of an example file system supporting a data warehouse in the environment of FIG. 1. FIG. 2 depicts a physical file storage system suitable for supporting the data warehouses of FIG. 1; however, the purpose of FIG. 2 is to illustrate the various physical storage domains that may underlie a particular vendor's DBMS. Configurations herein rest on the premise that the user need not know which tables and data warehouse(s) are invoked by the user request 115. Referring to FIGS. 1 and 2, the computing environment 201 comprises one or more instances of a client device 204 (e.g., a desktop computer), one or more instances of an application server 205, and one or more instances of a distributed data file system 210. The user application 120 and database 140 may comprise and/or operate in conjunction with a client device 204 and any number of physically distributed and networked storage entities, typically disk drives or SSD (solid state drive) devices. - The aforementioned devices, servers, and systems can communicate through a network 202 (e.g., a wireless network, a wide area network (WAN), a local area network (LAN), the Internet, an intranet, a private network, etc.). More specifically, the distributed
data file system 210 can be configurable to store and process large volumes of data (e.g., using an HDFS framework) and comprise one or more instances of a client gateway node 212, one or more instances of a primary name node 213, one or more instances of a secondary name node 214, one or more instances of a support server 215 (e.g., executing data analysis processes, etc.), and a plurality of data nodes 216 (e.g., data node 1 216-1, data node 2 216-2, data node 3 216-3, to data node N 216-N). The distributed data file system 210 is configurable to store large files of data across multiple computing devices (e.g., the plurality of data nodes 216), rather than store all of the files on a single computing device with a large amount of storage capacity. To accommodate such distributed storage, the plurality of data nodes 216 can be configured with one or more rack-mounted servers coupled (e.g., using SATA or SAS interconnects) to multiple hard disk drives for storing the data. - As an example, when a file is loaded into distributed
data file system 210, it is replicated and partitioned into “blocks” of data, which are distributed and stored across the plurality of data nodes 216. The primary name node 213 is responsible for storage and management of metadata associated with the distributed data file system 210, such that, for example, the primary name node 213 can provide data location information for data processing operations. Further, other configurations, partitions, and architectures of the distributed data file system 210 shown in computing environment 201 are possible. For example, a single-node implementation may have all of the nodes (e.g., client gateway node 212, primary name node 213, the plurality of data nodes 216, etc.) reside on a single computing device. Also, a small cluster of two nodes may have a first computing device operate as a name node and a second computing device operate as a data node. Further, the distributed data file system 210 can represent any database management system. Each connection 420 is made between the compute engine 400 and the data warehouse 110, assigned a view, and the file system 210 of each respective data warehouse (typically a vendor DBMS) accesses the physical data via the view.
-
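The replication and block partitioning just described can be sketched in a few lines; the block size, replication factor of three, and round-robin placement below are illustrative assumptions rather than the actual placement policy of any particular distributed file system:

```python
def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Return a mapping of block index -> list of data nodes holding a replica."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        # Round-robin placement: each replica lands on a distinct node.
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(min(replication, len(data_nodes)))]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
# A 1000-byte file in 256-byte blocks yields 4 blocks, each replicated 3 times.
layout = place_blocks(file_size=1000, block_size=256, data_nodes=nodes)
```

A name node in such a system would keep a mapping like `layout` as metadata, answering the data-location questions described above.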
FIG. 3 is a modified Venn diagram of a multidimensional database, or datacube, environment suitable for configurations herein. - Referring to
FIGS. 1-3, a query approach 300 includes the following components:
-
- Virtual island 302-1 . . . 302-2 (302 generally)
- Virtual cube
- Compute engine 400
A virtual cube is an arrangement of fields, each of which defines a dimension of the virtual cube for multidimensional queries. Although the number of dimensions and records therein may be quite large, the stored data items are generally referenced in database tables indexed and arranged for defining the virtual cube.
- In conventional approaches, query creation for the leaf (data-containing) nodes of the virtual cube system was a manual process of column identification and joining of tables. The disclosed approach, in contrast, integrates a virtual “Island” system for joining together tables within the same data source connection 420, and aggregating the virtual islands into an “archipelago” satisfying the query request from data across different connections.
- The general idea is to build the query by identifying a starting table and creating a virtual island 302 containing that table. As tables are joined, if they are on the same connection, they are added to that virtual island. When a table being joined requires a connection different from that island, a new island is created with that table. This process continues until all of the required tables have been joined together. This effectively creates a linked list of virtual islands, which we identify as a virtual archipelago 301 (literally meaning a “collection or set of islands”).
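The island-building pass described above can be sketched as follows; the table and connection names are hypothetical, and the real query logic 450 would operate on parsed query structures rather than simple (table, connection) pairs:

```python
def build_archipelago(joined_tables):
    """Group an ordered join list of (table, connection) pairs into islands.

    Returns the archipelago: an ordered list (effectively a linked list)
    of islands, each island being (connection, [tables])."""
    archipelago = []
    for table, conn in joined_tables:
        if archipelago and archipelago[-1][0] == conn:
            archipelago[-1][1].append(table)     # same connection: extend island
        else:
            archipelago.append((conn, [table]))  # new connection: new island
    return archipelago

joins = [("fact_sales", "conn1"), ("dim_customer", "conn2"),
         ("dim_occupation", "conn2")]
archipelago = build_archipelago(joins)
```

Because a new island is opened whenever the connection changes, the result is the ordered list of islands, the virtual archipelago, described above.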
- Whenever the virtual archipelago consists of more than one island, hence requiring multiple data warehouses to satisfy the query request 115, the query logic takes each island and creates a view selecting the required columns from each table available at the respective connection 420. Any constraints and aggregations that can be pushed down into that table are also included in the view creation to benefit query performance. The virtualized query is then created joining these views together as if they were physical tables in a non-virtualized query. The disclosed method for querying a database includes identifying, for each of a plurality of fields sought by a view in a query request, a source, defined by a data warehouse where the sought fields are stored.
- Summarizing the elements presented above, configurations herein define a data source, or simply a source, as the data accessible by one connection. Each data source stores tables, including a starting table, aggregated into an island, meaning a group of tables at one source. A view encompasses all the tables in an island. Each table includes at least the fields, arranged in columns, needed for the query request. The table likely also contains other fields not called for by the query request. A virtual cube 350-1 . . . 350-2 (350 generally) assigns or creates a dimension for each field in an island. The virtual cube need only include dimensions called for by a respective query. The archipelago, therefore, is a grouping or list of islands, each island corresponding to a source.
- Extending to an example query:
-
SELECT d1.customer_id, d1.customer_name, d2.occupation_id, d2.occupation_name, sum(f1.sales)
FROM fact_sales f1
JOIN dim_customer d1 ON f1.customer_id = d1.customer_id
JOIN dim_occupation d2 ON d1.occupation_id = d2.occupation_id
GROUP BY 1, 2, 3, 4
Assume fact_sales comes from data source connection 1, and
dim_customer/dim_occupation come from data source connection 2. - This will create a
Virtual Archipelago 301 as in FIG. 3. - Island1 302-1 defines a view:
-
CREATE VIEW tmp_view_1 AS
SELECT customer_id, sum(sales) AS sales
FROM fact_sales
GROUP BY customer_id
Island2 302-2 defines a view: -
CREATE VIEW tmp_view_2 AS
SELECT d1.customer_id, d1.customer_name, d2.occupation_id, d2.occupation_name
FROM dim_customer d1
JOIN dim_occupation d2 ON d1.occupation_id = d2.occupation_id
The original query would then be turned into: -
SELECT v1.customer_id, v2.customer_name, v2.occupation_id, v2.occupation_name, v1.sales
FROM tmp_view_1 v1
JOIN tmp_view_2 v2 ON v1.customer_id = v2.customer_id
- For computing a query response to a query request, therefore, each connection 420 is driven by a set of control instructions of the respective DBMS for accessing the set of tables in the respective data warehouse. Thus, the set of tables in one data warehouse is distinct from the tables in other data warehouses accessible via other connections of the plurality of connections, meaning under the control of a particular DBMS. The
query logic 450 defines an island 302 based on the set of accumulated tables at the respective data warehouse, and generates the archipelago 301 by defining an aggregation, such as a linked list or similar structure, for interconnecting the islands 302 corresponding to the plurality of connections.
-
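The worked example above can be exercised end to end as a sketch using sqlite3, with two attached in-memory databases standing in for the two data source connections and the per-island views materialized as temporary tables on the compute side; the sample rows are invented for illustration:

```python
import sqlite3

# The compute database; two attached in-memory databases stand in for the
# two data source connections (ATTACH is only a stand-in for the separate
# warehouse connections described in the text).
compute = sqlite3.connect(":memory:")
compute.execute("ATTACH ':memory:' AS src1")  # data source connection 1
compute.execute("ATTACH ':memory:' AS src2")  # data source connection 2

compute.executescript("""
    CREATE TABLE src1.fact_sales (customer_id INT, sales REAL);
    INSERT INTO src1.fact_sales VALUES (1, 50.0), (1, 25.0), (2, 10.0);
    CREATE TABLE src2.dim_customer (customer_id INT, customer_name TEXT, occupation_id INT);
    INSERT INTO src2.dim_customer VALUES (1, 'Ada', 10), (2, 'Grace', 20);
    CREATE TABLE src2.dim_occupation (occupation_id INT, occupation_name TEXT);
    INSERT INTO src2.dim_occupation VALUES (10, 'Engineer'), (20, 'Admiral');

    -- Island 1: the aggregation is pushed down and materialized per connection 1.
    CREATE TEMP TABLE tmp_view_1 AS
        SELECT customer_id, sum(sales) AS sales
        FROM src1.fact_sales GROUP BY customer_id;

    -- Island 2: the dimension join stays local to connection 2.
    CREATE TEMP TABLE tmp_view_2 AS
        SELECT d1.customer_id, d1.customer_name, d2.occupation_id, d2.occupation_name
        FROM src2.dim_customer d1
        JOIN src2.dim_occupation d2 ON d1.occupation_id = d2.occupation_id;
""")

# The rewritten query joins the materialized views as if they were tables.
rows = compute.execute("""
    SELECT v1.customer_id, v2.customer_name, v2.occupation_id,
           v2.occupation_name, v1.sales
    FROM tmp_view_1 v1 JOIN tmp_view_2 v2 ON v1.customer_id = v2.customer_id
    ORDER BY v1.customer_id
""").fetchall()
```

Note that the pushed-down aggregation means the final join needs no GROUP BY of its own, matching the rewritten query above.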
FIG. 4 shows a simplified processing arrangement of a query engine as in FIG. 1 for querying a database as disclosed herein. Referring to FIGS. 3 and 4, the query engine includes the compute engine 400, which has an interface, access or API to a query processor 410. In an example arrangement, a SPARK™ query implementation may be invoked; however, other suitable query engines may also be used. The use of a query processor 410 avoids dissemination of sensitive login or access control information for the respective data warehouses. - The query engine 200 establishes connections 420-1 . . . 420-3 (420 generally) to each of data warehouses 110-1 . . . 110-3 (110 generally), and delegates subqueries as needed to the query processor 410. It then forms or defines, for each data warehouse corresponding to the plurality of fields, a view indicative of the fields sought from the respective data warehouse, and generates a query expression for retrieval of each of the fields sought by the query request by accessing the respective fields via the formed view, as now described with respect to
FIG. 5. - Building on virtual tree-based cube approaches, and making use of the virtual islands 302-1 . . . 302-2 (302 generally) system discussed above, the
compute engine 400 invokes the query logic 450 that analyzes fields as the tree is being built to identify the sub-trees that reside within a single data source connection 420. As the tree is being built, if a join or a union involves data from different sources, the sub-trees being joined or unioned will be turned into compute views. The query will substitute a select from the view that is created, and the compute node will be able to do the appropriate data substitution at run time.
-
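A minimal sketch of that sub-tree analysis, assuming a hypothetical tuple-based node shape in place of the real parsed query tree:

```python
def source_of(node):
    """Resolve a query-tree node to the single connection that can satisfy it.

    node is either ("scan", table, connection) or (op, child1, child2, ...).
    Returns the connection name, or None when the children span connections,
    which is the point at which those children must become compute views."""
    if node[0] == "scan":
        return node[2]
    child_sources = {source_of(child) for child in node[1:]}
    if len(child_sources) == 1 and None not in child_sources:
        return child_sources.pop()  # whole sub-tree pushable to one source
    return None  # join/union spans connections: children become views

tree = ("join",
        ("scan", "fact_sales", "conn1"),
        ("join",
         ("scan", "dim_customer", "conn2"),
         ("scan", "dim_occupation", "conn2")))
```

Here the inner join resolves to a single connection and can be pushed down whole, while the outer join spans connections and so marks where its children are turned into compute views.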
FIG. 5 shows an example subquery and connection arrangement based on a query request using the approach of FIGS. 1-4. In FIG. 5, the compute engine 400 performs query parsing and processing based on the query request 115 in the environment of FIGS. 1 and 2. From the query engine 400, multiple connections 420 exist to each of a plurality of data warehouses 110, such that each data warehouse corresponds to a vendor-imposed storage arrangement. FIG. 3 shows satisfaction of a query request invoking each data warehouse 110. The compute engine 400 receives the query request 115 having a plurality of fields and one or more conditions, in which the conditions define selective retrieval of the data in the fields in a conditional syntax such as SQL. - The
query logic 450 identifies, for each field of the plurality of fields, a source that stores the field. Each field is defined by one or more dimensions in the respective source. Each source is a data warehouse 110 or other repository, set of tables, or collection configured for storing the field as a dimension of a multidimensional set of tables. The query logic 450 defines, for each of the identified sources, a view for accessing the fields stored in a respective source, and establishes a connection 420 to each source of a plurality of the sources. The compute engine 400, invoking the query processor 410 as needed, accesses each of the sources via the established connection 420. - Referring to
FIGS. 3-5, the example query involves three different data source connections. The query logic 450 forms a hierarchical structure 300 based on dependencies in the query request, and identifies, from the hierarchical structure 500, one or more sub-trees v1 . . . v4 corresponding to the same source 430 (data warehouse). It then forms a view corresponding to each of the sub-trees sharing the same source. A view generally corresponds to a subset or subquery of fields common to the respective source. - In
FIG. 5, the lower leaf nodes v1-v4 denote subqueries. The subqueries in the bottom-right section labeled v4 all emanate from the same connection con1, and may be accessed by that same connection con1. While building this query, we can identify that all of that data comes from one database and build that subquery into a query. Then, when it needs to be joined with the connection to v3, the subquery for v3 is also turned into a view. The join above it can then select from these two views to join the data together. Likewise, v1 and v2 can also be joined together using views. These joins are then unioned and further aggregated to produce a final result. - In the example configuration, the
compute engine 400 is a package wrapping an underlying database (Spark in the initial implementation). It is responsible for view creation and delegation. When the create-view requests are received, they contain all of the information required to get the data from the appropriate source. - This includes:
-
- the connection details
- the dialect
- the structure of the query results
- the fully planned textual query to be run, i.e., the query request character string
Using this information, the compute engine 400 can make a request to the query processor 410, sending the textual query and connection information. This allows the query logic 450 to manage all of the authentication and direct connections to the underlying data source. None of this sensitive information has to be sent to the query processor. The results of the query are streamed back to the compute engine, where they are stored temporarily until all query results are generated based on subqueries from all connections 420.
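A hypothetical sketch of this separation of duties; the request shape, function names, and stand-in callbacks are illustrative assumptions, not the actual interface:

```python
def run_create_view_request(request, authenticate, execute):
    """Handle a create-view request; sensitive fields never leave this function.

    The request carries connection details, the dialect, the result
    structure, and the fully planned textual query; only the structure and
    the result rows are passed onward."""
    handle = authenticate(request["connection"])  # credentials used here only
    rows = execute(handle, request["query"])      # run the planned textual query
    return {"structure": request["structure"], "rows": rows}

request = {
    "connection": {"host": "warehouse1.example", "user": "svc", "password": "s3cret"},
    "dialect": "ansi-sql",
    "structure": ["customer_id", "sales"],
    "query": "SELECT customer_id, sum(sales) FROM fact_sales GROUP BY customer_id",
}
result = run_create_view_request(
    request,
    authenticate=lambda conn: "connection-handle",       # stand-in for real auth
    execute=lambda handle, sql: [(1, 75.0), (2, 10.0)])  # stand-in for the DBMS
```

The password and connection details are consumed inside the handler, so what is streamed back carries no credentials.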
- Once all of the data has been retrieved from the views and put into temporary tables on a compute database 412, the final query can be run on the compute database itself, which then sends the final results.
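This materialization step can be sketched with sqlite3 standing in for the compute database 412; the streamed rows and table names are illustrative:

```python
import sqlite3

# Subquery results streamed back from each connection, modeled as plain rows.
streamed = {  # view name -> (column names, rows streamed back)
    "tmp_view_1": (["customer_id", "sales"], [(1, 75.0), (2, 10.0)]),
    "tmp_view_2": (["customer_id", "customer_name"], [(1, "Ada"), (2, "Grace")]),
}

# Load each result set into a temporary table on the compute database.
compute_db = sqlite3.connect(":memory:")
for name, (columns, rows) in streamed.items():
    compute_db.execute(f"CREATE TEMP TABLE {name} ({', '.join(columns)})")
    compute_db.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' * len(columns))})", rows)

# The final query then runs entirely on the compute database.
final = compute_db.execute("""
    SELECT v2.customer_name, v1.sales
    FROM tmp_view_1 v1 JOIN tmp_view_2 v2 ON v1.customer_id = v2.customer_id
    ORDER BY v1.customer_id
""").fetchall()
```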
- This includes, for each source of the plurality of sources, establishing the connection 420 to a data warehouse 110 defining the source, accumulating a set of tables corresponding to the fields sought from the respective source, and forming a linked list of the set of accumulated tables corresponding to each connection. Using the generated views, the
query logic 450, for each view, identifies a set of tables stored at the source that contain the fields, and iteratively accumulates the tables identified in the view by a join. In this manner, multiple dimensions are accumulated by traversing the joined tables. To materialize the query results, for each of the data warehouses, and hence for each view, the query logic substitutes a selection operation for each field called for by the view to effect retrieval. - In particular configurations, the delegation to the query processor 410 avoids transmitting authentication and access information such as passwords. The
compute engine 400 may establish access via the connection 420 to the respective data warehouses 430, and permit the query processor 410 to perform computationally intensive tasks. -
FIG. 6 is a flowchart of query processing as in FIGS. 1-5. Referring to FIGS. 1-6, at step 601, the compute engine 400 receives a query request 115 having a plurality of fields and one or more conditions, such that the conditions define selective retrieval of the data in the fields. This is typically a SQL or SQL-like construct. Query logic 450 identifies, for each field of the plurality of fields, a source that stores the field, in which each field is defined by one or more dimensions in the respective source, as depicted at step 602. Each source generally denotes a data warehouse 110. For each source, a view identifies a set of tables stored at the source that contain the fields, as disclosed at step 603. The query processor 410 iteratively accumulates the tables identified in the view by a join, as depicted at step 604. - The view is then defined for each of the identified sources for accessing the fields stored in a respective source, as shown at
step 605. This includes forming a hierarchical structure based on dependencies in the query request, as depicted at step 606, and identifying, from the hierarchical structure, one or more sub-trees corresponding to the same source, shown at step 607 and described above with respect to FIG. 5. The query processor 410 forms a view corresponding to each of the sub-trees sharing the same source (step 608), corresponding to the island 302 structure discussed above. A connection 420 is defined for each of the identified sources of the plurality of the sources needed for the query request 115, as depicted at step 609. Configurations herein are particularly beneficial when the connections pertain to data warehouses under different vendors. At step 610, for each source of the plurality of sources, the query processor 410 establishes a connection to a data warehouse 110 defining the source, as depicted at step 611, and accumulates a set of tables corresponding to the fields sought from the respective source, shown at step 612. The query processor aggregates the collective connections (the “islands” described above) by forming a linked list of the set of accumulated tables corresponding to each connection, as depicted at step 613. The query proceeds by accessing each of the sources via the established connection, as disclosed at step 614. This may include, for each view, substituting a selection operation for each field called for by the view. - By forming different views to each warehouse or repository, computer operation and efficiency are improved by executing fewer instructions than would be required for accessing each data warehouse separately with a source-specific query.
- Those skilled in the art should readily appreciate that the programs and methods defined herein are deliverable to a user processing and rendering device in many forms, including but not limited to a) information permanently stored on non-writeable storage media such as ROM devices, b) information alterably stored on writeable non-transitory storage media such as solid state drives (SSDs) and media, flash drives, floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media, or c) information conveyed to a computer through communication media, as in an electronic network such as the Internet or telephone modem lines. The operations and methods may be implemented in a software executable object or as a set of encoded instructions for execution by a processor responsive to the instructions, including virtual machines and hypervisor controlled execution environments. Alternatively, the operations and methods disclosed herein may be embodied in whole or in part using hardware components, such as Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software, and firmware components.
- While the system and methods defined herein have been particularly shown and described with references to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.
Claims (18)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/529,898 US12045246B2 (en) | 2020-11-20 | 2021-11-18 | Distributed queries through dynamic views |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202063116351P | 2020-11-20 | 2020-11-20 | |
| US17/529,898 US12045246B2 (en) | 2020-11-20 | 2021-11-18 | Distributed queries through dynamic views |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20220164349A1 true US20220164349A1 (en) | 2022-05-26 |
| US12045246B2 US12045246B2 (en) | 2024-07-23 |
Family
ID=81658295
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/529,898 Active 2042-03-31 US12045246B2 (en) | 2020-11-20 | 2021-11-18 | Distributed queries through dynamic views |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US12045246B2 (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115033605A (en) * | 2022-06-16 | 2022-09-09 | 未鲲(上海)科技服务有限公司 | Data query method and device, electronic equipment and storage medium |
| US20250028714A1 (en) * | 2023-07-17 | 2025-01-23 | Splunk Inc. | Query execution using a data processing scheme of a separate data processing system |
| US12271389B1 (en) | 2022-06-10 | 2025-04-08 | Splunk Inc. | Reading query results from an external data system |
| US12360987B2 (en) * | 2023-10-05 | 2025-07-15 | Capital One Services, Llc | Data attribute retrieval |
| US12393631B2 (en) | 2016-09-26 | 2025-08-19 | Splunk Inc. | Processing data using nodes in a scalable environment |
| US12436963B2 (en) | 2022-04-29 | 2025-10-07 | Splunk Inc. | Retrieving data identifiers from queue for search of external data system |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20010054042A1 (en) * | 1996-05-17 | 2001-12-20 | Richard M. Watkins | Computing system for information management |
| US20050120021A1 (en) * | 2003-12-02 | 2005-06-02 | Microsoft Corporation | Metadata driven intelligent data navigation |
| US20060136354A1 (en) * | 2004-12-17 | 2006-06-22 | Bell Christopher R | Apparatus and method for data warehousing |
| US20100223296A1 (en) * | 1998-09-04 | 2010-09-02 | Kalido Limited | Data Processing System |
| US20120150791A1 (en) * | 2008-06-02 | 2012-06-14 | Ian Alexander Willson | Methods and systems for loading data into a temporal data warehouse |
| US20130238549A1 (en) * | 2012-03-07 | 2013-09-12 | Microsoft Corporation | Using Dimension Substitutions in OLAP Cubes |
| US11216511B1 (en) * | 2019-07-16 | 2022-01-04 | Splunk Inc. | Executing a child query based on results of a parent query |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20010054042A1 (en) * | 1996-05-17 | 2001-12-20 | Richard M. Watkins | Computing system for information management |
| US20100223296A1 (en) * | 1998-09-04 | 2010-09-02 | Kalido Limited | Data Processing System |
| US20050120021A1 (en) * | 2003-12-02 | 2005-06-02 | Microsoft Corporation | Metadata driven intelligent data navigation |
| US20060136354A1 (en) * | 2004-12-17 | 2006-06-22 | Bell Christopher R | Apparatus and method for data warehousing |
| US20120150791A1 (en) * | 2008-06-02 | 2012-06-14 | Ian Alexander Willson | Methods and systems for loading data into a temporal data warehouse |
| US20130238549A1 (en) * | 2012-03-07 | 2013-09-12 | Microsoft Corporation | Using Dimension Substitutions in OLAP Cubes |
| US11216511B1 (en) * | 2019-07-16 | 2022-01-04 | Splunk Inc. | Executing a child query based on results of a parent query |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12393631B2 (en) | 2016-09-26 | 2025-08-19 | Splunk Inc. | Processing data using nodes in a scalable environment |
| US12436963B2 (en) | 2022-04-29 | 2025-10-07 | Splunk Inc. | Retrieving data identifiers from queue for search of external data system |
| US12271389B1 (en) | 2022-06-10 | 2025-04-08 | Splunk Inc. | Reading query results from an external data system |
| CN115033605A (en) * | 2022-06-16 | 2022-09-09 | 未鲲(上海)科技服务有限公司 | Data query method and device, electronic equipment and storage medium |
| US20250028714A1 (en) * | 2023-07-17 | 2025-01-23 | Splunk Inc. | Query execution using a data processing scheme of a separate data processing system |
| US20250028720A1 (en) * | 2023-07-17 | 2025-01-23 | Splunk Inc. | Identifying components to obtain and process data according to a query |
| US12265525B2 (en) | 2023-07-17 | 2025-04-01 | Splunk Inc. | Modifying a query for processing by multiple data processing systems |
| US12360987B2 (en) * | 2023-10-05 | 2025-07-15 | Capital One Services, Llc | Data attribute retrieval |
Also Published As
| Publication number | Publication date |
|---|---|
| US12045246B2 (en) | 2024-07-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12045246B2 (en) | Distributed queries through dynamic views | |
| US12222944B2 (en) | Processing database queries using format conversion | |
| US11816126B2 (en) | Large scale unstructured database systems | |
| US12235835B2 (en) | Systems and methods for efficiently querying external tables | |
| US11163757B2 (en) | Querying over external tables in database systems | |
| US8935232B2 (en) | Query execution systems and methods | |
| US9684699B2 (en) | System to convert semantic layer metadata to support database conversion | |
| US10885031B2 (en) | Parallelizing SQL user defined transformation functions | |
| US20170011090A1 (en) | Parallelizing sql on distributed file systems | |
| CN103430144A (en) | Data source analytics | |
| US11657069B1 (en) | Dynamic compilation of machine learning models based on hardware configurations | |
| Borkar et al. | Have your data and query it too: From key-value caching to big data management | |
| US11636124B1 (en) | Integrating query optimization with machine learning model prediction | |
| US8103624B2 (en) | Apparatus and method for automating the logging of table changes in a database | |
| WO2024239782A1 (en) | Query plan construction method and apparatus, electronic device and storage medium | |
| US10776368B1 (en) | Deriving cardinality values from approximate quantile summaries | |
| US20200311067A1 (en) | Database partition pruning using dependency graph | |
| US10558637B2 (en) | Modularized data distribution plan generation | |
| US11995078B2 (en) | Query intake for legacy databases | |
| Lazos | Migrating a data warehouse from a relational database to a document store and lessons learned | |
| Wadkar et al. | Data Warehousing Using Hadoop | |
| Maccioni et al. | NoXperanto: Crowdsourced polyglot persistence | |
| Pal | SQL for Streaming, Semi-Structured, and Operational Analytics | |
| Castrejon-Castillo | HAL Id: hal-01002695 | |
| Becla et al. | Database Architecture |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
| FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
| AS | Assignment |
Owner name: ATSCALE, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SAFAIE, ROUZBEH;DOSCH, DANIEL J.;BAREFOOT, JOSEPH;AND OTHERS;SIGNING DATES FROM 20220331 TO 20240424;REEL/FRAME:067949/0641 |