WO2019189962A1 - Query parallelization method for data having an existing replica in a distributed database - Google Patents

Query parallelization method for data having an existing replica in a distributed database

Info

Publication number
WO2019189962A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
range
data
server
master server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2018/003696
Other languages
English (en)
Korean (ko)
Inventor
최재용
정태균
백성인
한혁
진성일
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
REALTIMETECH Co Ltd
Original Assignee
REALTIMETECH Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by REALTIMETECH Co Ltd
Publication of WO2019189962A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention changes a query on data for which a replica exists in a distributed database into a plurality of range condition queries corresponding to the number of nodes in which a replica is stored, and executes the range condition queries simultaneously on multiple nodes, thereby reducing query execution time for large distributed data that has replicas.
  • the present invention relates to a technique for reducing query execution time for distributed data.
  • a distributed database system exists as a system for managing such a large amount of data.
  • a distributed database system includes a master server 10 and a plurality of slave servers 20, as shown in FIG.
  • the master server 10 manages the slave servers 20 and keeps track of which slave server 20 each piece of data belongs to.
  • the slave server 20 is a server that manages the partition to which the actual data belongs, and the data is arranged and managed sequentially based on the key.
  • a distributed database creates and manages a plurality of replicas of each file, distributed across the servers, to improve data stability and performance. Depending on the characteristics of a file, a replica may not be created.
  • database replication is one of distributed database technologies that copies an object stored in one database to another physically separate database so that it can be used in two or more database servers.
  • This replication technology can improve performance by distributing access to applications that use the same object across multiple database servers, or by allowing the replicated database server to be used for other purposes to meet different operational requirements.
  • in general, when a client requests a query, the result is obtained by executing the query on one specific slave server in which the original data or a replica is stored.
  • the data is actually stored in multiple slave servers (nodes) for high availability in a distributed database, but the utilization of replicas is low unless a failure occurs.
  • the present invention was created in view of the above circumstances. Its technical purpose is to provide a query parallelization method for data for which a replica exists in a distributed database: for a table that has a replica, the query is changed into a plurality of range condition queries that are executed simultaneously on a plurality of nodes, which can reduce query execution time for large distributed tables.
  • the distributed database comprises a master server that distributes queries in response to a query request from a client, and a plurality of slave servers that store the data tables, execute the queries, and return the results.
  • in this distributed database, a method for parallelizing queries for data for which a replica exists comprises: a first step in which the master server analyzes a query requested from a client and judges it to be a split target query when the target table is a distributed table and includes a column on which a range can be specified; a second step in which the master server judges whether the number of search target records for the split target query exceeds a preset reference record number; and a further step carried out by the master server when the second step finds that the preset reference record number is exceeded ...
  • the master server partitions the record area based on the number of records to be searched and the number of slave servers where a replica exists.
  • the master server parses the query and determines that it is a split target query when the query is not a unique scan that retrieves a single record.
  • the master server determines that the query is a split target query when the target table includes a column whose data type is a number (INT) or a date (DATE).
  • the master server sets the reference record number differently according to the query condition, based on query execution time.
  • the master server sets the record range of the range query to be provided to the slave server based on the current load of the slave server in which the copy is stored.
  • the master server converts the sharding condition column in the WHERE condition of the query into a range condition column and sets the range corresponding to each divided record area as the range condition column value, thereby generating a range query with a different range condition column value for each slave server.
  • according to the present invention, a query on a table for which a replica exists in a distributed database is replicated, and the copies are executed simultaneously, each over a different query range, on the plurality of nodes where the target table exists. This can shorten query execution time for large distributed tables that have replicas.
  • FIG. 1 is a conceptual diagram illustrating a general distributed database configuration.
  • FIG. 2 is a diagram illustrating the configuration of a distributed database, to which the present invention is applied, having a query parallelization function for data for which a replica exists.
  • FIG. 3 is a view for explaining a query parallelizing method for data in which a replica exists in a distributed database according to a first embodiment of the present invention.
  • FIG. 4 is a diagram illustrating a process of converting an original query into a plurality of range queries in FIG.
  • FIG. 2 is a diagram illustrating the configuration of a distributed database, to which the present invention is applied, having a query parallelization function for data for which a replica exists.
  • a distributed database having a query parallelization function for a table with a replica, to which the present invention is applied, includes a master server 100 and a plurality of slave servers 200. The master server 100 distributes queries in response to a query request from a client and merges the results returned by the slave servers 200 before providing them to the client. The slave servers 200 store the data, execute the queries, and return the results.
  • the slave server 200 includes a database server for storing the original data and a database server for storing the replica.
  • alternatively, the original data may be stored in the master server 100, and the slave servers 200 may be configured as replica servers that store replicas.
  • the master server 100 includes a query analysis module 110, a query optimization module 120, a query splitting module 130, a query distribution module 140, and a result merging module 150.
  • the query analysis module 110 parses the query requested from the client to analyze the command. For example, the query analysis module 110 analyzes command types such as search (SELECT syntax), storage (INSERT syntax), join (JOIN syntax), and the like.
  • the query optimization module 120 optimizes the client's query and analyzes whether the query is a split target query.
  • the query optimization module 120 determines through syntax analysis whether the target table of the original query is a distributed table for which a replica table exists. When the "PARTITION BY" syntax exists in the query, the table is determined to be a distributed table, and a query that involves the distributed table is determined to be a split target query.
  • a table is a basic structure for storing data in a database, and one table is composed of one or more records.
  • only when the target table is a distributed table for which a replica exists, the query optimization module 120 parses the query to check whether it includes a condition on which a range can be partitioned. If it does, the query is finally determined to be a split target query.
  • the range-partitionable condition of the query may be on a column of numeric type (INT) or date type (DATE).
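The two checks made by the query optimization module 120 can be sketched as follows. This is only an illustration under assumptions: the function name `is_split_target`, the `column_types` mapping, and the idea of passing the SQL text as a plain string are hypothetical and do not come from the patent text.

```python
import re

# Column types that the description names as range-partitionable.
RANGE_TYPES = {"INT", "DATE"}


def is_split_target(sql_text: str, column_types: dict) -> bool:
    """Return True when the input looks like a split target query.

    Hypothetical checks: (1) the text contains PARTITION BY, which the
    description treats as the marker of a distributed table, and (2) at
    least one column of the target table has a range-partitionable type.
    """
    has_partition_by = re.search(r"\bPARTITION\s+BY\b", sql_text, re.IGNORECASE) is not None
    has_range_column = any(t.upper() in RANGE_TYPES for t in column_types.values())
    return has_partition_by and has_range_column
```

A real optimizer would work on a parsed syntax tree rather than on a regular expression; the string match is used here only to keep the sketch short.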
  • the query splitting module 130 generates a plurality of range queries to be sent to the slave server 200 in which the copy exists for the split target query.
  • the query splitting module 130 converts the split target query into as many range queries as there are slave servers 200, that is, nodes, in which a replica of the target table is stored. Each range query is given a different column range in its WHERE condition, and that column is the field corresponding to the range-partitionable condition.
  • the query distribution module 140 creates threads and transmits the range queries to the slave servers 200 simultaneously.
  • the result merging module 150 receives the results of the range query from each slave server 200, collects them, and provides them to the query requesting client.
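The interplay of the query distribution module 140 and the result merging module 150 might look like the sketch below. It assumes each slave server is represented by a Python callable that takes a range query and returns a list of rows; that representation, and the names `run_on_slave` and `distribute_and_merge`, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_slave(slave, range_query):
    """Stand-in for sending one range query to one slave server."""
    return slave(range_query)


def distribute_and_merge(slaves, range_queries):
    """Send each range query to its slave concurrently, then concatenate the rows."""
    with ThreadPoolExecutor(max_workers=max(1, len(slaves))) as pool:
        # map() preserves input order, so results line up with range_queries.
        partial_results = list(pool.map(run_on_slave, slaves, range_queries))
    merged = []
    for rows in partial_results:
        merged.extend(rows)
    return merged
```

One worker thread per slave server mirrors the description's "creates threads and transmits simultaneously"; a production system would also need timeouts and error handling, which the description does not specify.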
  • FIG. 3 is a diagram illustrating a query parallelizing method for data in which a replica exists in a distributed database according to an embodiment of the present invention.
  • the master server 100 parses and analyzes the original query requested from the client to determine whether the query is a split target query (ST100).
  • the master server 100 determines whether the target table of the original query is a distributed table for which a replica exists in the slave servers 200. If the original query includes the phrase "PARTITION BY", the table is determined to be a distributed table.
  • the master server 100 determines whether the target table, more specifically the query condition, includes a column on which a range can be specified (ST200). To do so, the master server 100 may parse the query and check whether it is a unique scan that retrieves a single record; a query that is not a unique scan can be treated as a range designation query. In addition, when a column satisfies a preset range expression column condition, for example when its data type is a number (INT) or a date (DATE), the master server 100 may determine that the query is a range designation query.
  • the master server 100 determines that the original query is a split target query when its target table is a distributed table and includes a column on which a range can be specified.
  • the master server 100 determines whether the total number of records affected by the query, that is, the number of search target records, exceeds a predetermined record reference value (ST300).
  • the record reference value is used to determine whether to divide the range query. If the total number of records affected by the query, that is, the number of records to be searched, is less than the preset reference value, the original query is not divided.
  • the record reference value may be set differently according to the query condition in consideration of the query execution time according to the condition. For example, the record reference value may be set smaller when the range condition is a date than when the range condition is an ID.
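The threshold check of ST300 can be illustrated with a small sketch. The concrete numbers are placeholders: the description gives no values, only that a date condition may use a smaller reference value than an ID condition. The names `REFERENCE_RECORDS` and `should_split` are hypothetical.

```python
# Hypothetical thresholds; the description states no concrete values,
# only that a DATE range condition may use a smaller value than an ID one.
REFERENCE_RECORDS = {"DATE": 500_000, "INT": 1_000_000}


def should_split(target_record_count: int, condition_type: str) -> bool:
    """Split only when the record count exceeds the reference value for this condition type."""
    return target_record_count > REFERENCE_RECORDS[condition_type]
```

With these placeholder values, a query touching 600,000 records would be split when its range condition is a date but left whole when its range condition is an integer ID.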
  • the master server 100 identifies the slave servers 200 in which the original table and the replica tables are stored (ST400).
  • the master server 100 generates a plurality of range queries having different condition ranges so as to correspond to the number of slave servers 200 that can execute the query based on the state of the slave server 200 (ST500).
  • the master server 100 may designate, as query executable slave servers 200, those slave servers 200 that store a replica and whose current load is less than or equal to a predetermined level.
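Selecting the query executable slave servers by load could be sketched as below. The load is modeled as a number per server and the cutoff as `max_load`; both the representation and the function name `executable_slaves` are assumptions, since the description does not say how load is measured.

```python
def executable_slaves(loads: dict, max_load: float) -> list:
    """Keep the replica-holding slaves whose current load is at or below max_load."""
    return [name for name, load in loads.items() if load <= max_load]
```

For instance, with loads of 0.2, 0.5, and 0.9 for Node1 to Node3 and a cutoff of 0.5, only Node1 and Node2 would receive range queries, so the query would be split two ways instead of three.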
  • the master server 100 divides the number of search target records matching the query condition by the number of slave servers 200 that can execute the query, and generates range queries so that each one covers a different set of search target records.
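An equal division of the record area can be sketched as follows. It assumes the range column holds contiguous integer values starting at `first_id`, so that a record count maps directly onto a value range; that assumption, the half-open range convention, and the name `split_evenly` are not from the patent text.

```python
def split_evenly(first_id: int, record_count: int, node_count: int) -> list:
    """Divide [first_id, first_id + record_count) into node_count half-open ranges."""
    base, extra = divmod(record_count, node_count)
    ranges, start = [], first_id
    for i in range(node_count):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first ranges
        ranges.append((start, start + size))
        start += size
    return ranges
```

Half-open ranges are used so that adjacent ranges never overlap and no record is searched twice.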
  • the master server 100 may divide the query range differently based on the current load amount of each slave server 200 without equally dividing the query range.
  • FIG. 4 is a diagram illustrating an example of dividing an original query into a plurality of range queries.
  • (A) illustrates a table schema
  • (B) illustrates that the original query 300 for (A) is divided into a plurality of range queries 310 to 330.
  • the original query 300 may be converted into three range queries, namely the first to third range queries 310 to 330.
  • the loc column condition, which is the sharding condition in the WHERE condition of the original query, is converted into a condition on id, which is the range condition column, and the id column range is set based on the number of records to be searched and the number of slave servers.
  • the query is divided into three range queries. Since the total number of records to be searched is 30 million, the id ranges are set so that each node searches a different area of 10 million records, and the range queries are generated accordingly.
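Turning id ranges into range query text might look like the sketch below. The table name `t`, the `SELECT *` shape, and the half-open `>=` / `<` comparison are placeholders, because the actual queries of FIG. 4 are not reproduced in the text; `build_range_queries` is a hypothetical helper.

```python
def build_range_queries(table: str, range_column: str, ranges: list) -> list:
    """Return one SELECT per range, each restricted to a half-open interval of range_column."""
    return [
        f"SELECT * FROM {table} WHERE {range_column} >= {lo} AND {range_column} < {hi}"
        for lo, hi in ranges
    ]


# Three 10-million-record areas, matching the 30-million-record example.
queries = build_range_queries(
    "t", "id",
    [(0, 10_000_000), (10_000_000, 20_000_000), (20_000_000, 30_000_000)],
)
```

String formatting is acceptable here only because the inputs are generated by the master server itself; values taken from a client should be passed as bound parameters.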
  • the master server 100 may assign different id ranges to the slave servers Node1 to Node3 in consideration of the state of each slave server in which a replica is stored, for example its load level or a failure. For example, a 15-million-record area may be set for the first slave server Node1, which has the lightest load, a 10-million-record area for the second slave server Node2, which has a medium load, and a 5-million-record area for the third slave server Node3, which has a relatively heavy load.
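The 15/10/5 million example can be reproduced with a proportional split. In this sketch each server gets an integer weight that is larger when its load is lighter; how a load measurement becomes a weight is not stated in the description, so the weights `[3, 2, 1]` and the name `split_by_weight` are assumptions.

```python
def split_by_weight(first_id: int, record_count: int, weights: list) -> list:
    """Divide a record area into half-open ranges proportional to integer weights."""
    total = sum(weights)
    ranges, start, assigned = [], first_id, 0
    for i, w in enumerate(weights):
        if i == len(weights) - 1:
            size = record_count - assigned  # last range absorbs any rounding remainder
        else:
            size = record_count * w // total
        ranges.append((start, start + size))
        start += size
        assigned += size
    return ranges


# Node1 (lightest load) gets weight 3, Node2 weight 2, Node3 (heaviest) weight 1.
areas = split_by_weight(0, 30_000_000, [3, 2, 1])
```

The resulting areas hold 15 million, 10 million, and 5 million records respectively.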
  • the master server 100 creates threads and transmits the range queries in parallel to the slave servers 200 in which the target table and its replicas are stored (ST600).
  • the master server 100 transmits identification information on the original query, for example, table schema information, to each slave server 200 by including it in the range query.
  • each slave server 200 executes its range query to obtain the result from the corresponding table and provides the result to the master server 100. At this time, each slave server 200 returns the original query identification information together with the result. Based on that identification information, the master server 100 collects the range query results received from the slave servers 200 that belong to the same original query and provides them to the query requesting client.
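Collecting results by original query identification information can be sketched as below. The shape of a response, a `(query_id, rows)` pair, is an assumption made for illustration, as is the name `collect_by_query_id`.

```python
from collections import defaultdict


def collect_by_query_id(responses):
    """Group rows from slave responses by the original query id they carry.

    Each response is assumed to be a (query_id, rows) pair.
    """
    merged = defaultdict(list)
    for query_id, rows in responses:
        merged[query_id].extend(rows)
    return dict(merged)
```

Keying on the original query id lets the master server keep results apart when range queries from several client requests are in flight at the same time.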
  • the query is executed simultaneously on the plurality of nodes where replicas exist, which can shorten query time for large tables that have replicas.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a technology for transforming a query on data, a replica of which exists in a distributed database, into a plurality of range condition queries corresponding to the number of nodes storing the replica, and for executing the range condition queries simultaneously on a plurality of nodes. The technique thereby reduces query execution time for large-capacity distributed data for which a replica exists.
PCT/KR2018/003696 2018-03-27 2018-03-29 Query parallelization method for data having an existing replica in a distributed database Ceased WO2019189962A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020180035203A KR102049420B1 (ko) 2018-03-27 2018-03-27 Query parallelization method for data for which a replica exists in a distributed database
KR10-2018-0035203 2018-03-27

Publications (1)

Publication Number Publication Date
WO2019189962A1 (fr) 2019-10-03

Family

ID=68060363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2018/003696 Ceased WO2019189962A1 (fr) 2018-03-27 2018-03-29 Query parallelization method for data having an existing replica in a distributed database

Country Status (2)

Country Link
KR (1) KR102049420B1 (fr)
WO (1) WO2019189962A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694803A (zh) * 2020-06-15 2020-09-22 深圳前海微众银行股份有限公司 数据查询方法、装置、设备及计算机存储介质
CN114817402A (zh) * 2022-04-25 2022-07-29 山东浪潮科学研究院有限公司 分布式数据库于多region部署场景下的SQL执行优化方法
CN115934670A (zh) * 2023-03-09 2023-04-07 智者四海(北京)技术有限公司 Hdfs多机房的副本放置策略验证方法与装置
CN120416254A (zh) * 2025-06-30 2025-08-01 北京奥星贝斯科技有限公司 一种副本分发系统、方法、装置、存储介质及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100132752A (ko) * 2009-06-10 2010-12-20 (주)자이네스 데이터베이스 분산을 통한 서비스 성능 향상을 위한 질의 데이터 분산 처리시스템
JP2015106219A (ja) * 2013-11-29 2015-06-08 Kddi株式会社 分散型データ仮想化システム、クエリ処理方法及びクエリ処理プログラム
JP2015125726A (ja) * 2013-12-27 2015-07-06 Kddi株式会社 分散クエリ処理装置、クエリ処理方法及びクエリ処理プログラム
KR20160092259A (ko) * 2015-01-27 2016-08-04 전북대학교산학협력단 큐브리드 기반 미들웨어, 및 큐브리드 기반 미들웨어를 이용한 분산 병렬 질의 처리 방법
KR20170096302A (ko) * 2016-02-16 2017-08-24 전북대학교산학협력단 이종 데이터 처리를 위한 분산 병렬 처리 시스템

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101078484B1 (ko) * 2004-08-30 2011-10-31 주식회사 케이티 부하를 고려한 네트워크 관리시스템 및 관리방법
KR101666064B1 (ko) 2010-08-05 2016-10-13 에스케이텔레콤 주식회사 분산 파일 시스템에서 url정보를 이용한 데이터 관리 장치 및 그 방법

Also Published As

Publication number Publication date
KR102049420B1 (ko) 2019-11-27
KR20190113055A (ko) 2019-10-08

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18912465

Country of ref document: EP

Kind code of ref document: A1