WO2022107994A1

WO2022107994A1 - Big data augmented analysis profiling system

Info

Publication number: WO2022107994A1
Application number: PCT/KR2021/000685
Authority: WO
Inventors: 송광헌; 최범진; 이우성
Original assignee: PCN Inc
Current assignee: PCN Inc
Priority date: 2020-11-20
Filing date: 2021-01-18
Publication date: 2022-05-27
Anticipated expiration: 2023-05-20
Also published as: KR102541934B1; KR20220069482A

Abstract

The purpose of the present invention is to propose a big data augmented analysis profiling system which can increase data processing efficiency, and the present invention provides a big data augmented analysis profiling system comprising: a data collection module; a data preprocessing module; a data quality management module; an augmented analysis profiling module; and an augmented analysis collaboration module.

Description

Big Data Augmented Analysis Profiling System

The present invention relates to a big data augmented analysis profiling system.

Big data analysis is a technology that extracts value from, and analyzes the results of, large sets (tens of terabytes) of structured or unstructured data that exceed the capabilities of existing database management tools.

Big data technology is characterized by the generation, collection, analysis, and representation of many kinds of large-scale data. Its development makes it possible to predict a diversified modern society more accurately so that it operates efficiently, to provide, manage, and analyze customized information for each individual member of society, and to realize technologies that were previously impossible.

Big data thus offers the possibility of providing valuable information to society and humanity across all areas, including politics, society, the economy, culture, and science and technology, and its importance is increasingly recognized.

In general, big data analysis proceeds through data collection/storage, data preprocessing, data cleansing (profiling), data analysis, and data visualization.

Data sources for big data collection now include unstructured data such as voice, document, and SNS data; semi-structured data such as log data, machine data, and operational data; and structured data such as DB/DW data.

Because big data analysis requires a series of complex processes from data collection through visualization, there is a need to shorten data processing time and to raise data processing efficiency by sharing work among users.

To solve the above problems of the prior art, the present invention proposes a big data augmented analysis profiling system capable of increasing data processing efficiency.

To achieve the above object, according to an embodiment of the present invention, there is provided a big data augmented analysis profiling system comprising: a data collection module that receives, from a data source, data including at least one of unstructured data, semi-structured data, and structured data, extracts features of the data to classify the class of the filtered data, and stores the classified data in one or more databases; a data preprocessing module that partitions the data stored in the one or more databases and applies a multi-threading technique for simultaneous preprocessing of the partitioned data, thereby improving the amount of data that can be processed per node; a data quality management module that improves data quality by adding regular expressions (Regex) as metadata so that missing values, outliers, or sensitive information can be detected according to the characteristics of the domain of the data; an augmented analysis profiling module that configures a workflow according to a user's purpose and provides a machine learning-based runtime environment; and an augmented analysis collaboration module that provides an interface for identifying the author, comments, and data throughput of the processes performed in the data collection module, the data preprocessing module, the data quality management module, and the augmented analysis profiling module, records a log of the process performed in each module, and applies machine learning to the recorded logs to provide guide information for processes performed by other workers.

The data collection module may classify classes in advance according to features extracted from each of a plurality of data items, using a preset algorithm.

The features of the data may include the type, domain, and maximum/minimum values of each of one or more fields.

The type may be at least one of number, text, and binary data, and the domain may be at least one of category, date, time, amount, coordinate, percentage, fraction, and exponent.

The data preprocessing module may include a memory cache that provides a metadata sharing and management function for preprocessing large amounts of data, covering items such as the preprocessing target list, preprocessing definition templates (Recipe), and data partitioning.

The data preprocessing module may include a Split module that partitions the data to be preprocessed and performs a multi-thread management function of creating and reclaiming threads for preprocessing the partitioned data.

The data preprocessing module may include an integration module that integrates and stores the data whose preprocessing has been completed by the Split module.

The data quality management module may manage semantic meta-information on a data dictionary (Dictionary), keywords (Keyword), and regular expressions.

The regular expressions may be configured on a rule basis, may be used to manage data quality through detection and classification according to data characteristics, and may be accompanied by user-defined functions such as registering and modifying regular expression algorithms for the data dictionary.

The augmented analysis collaboration module may output a workspace interface in which each work process can be created, deleted, and modified.

According to the present invention, the accuracy of data domain identification can be increased, as can the accuracy of data quality evaluation and of analysis results.

FIG. 1 is a diagram showing the configuration of a big data augmented analysis platform according to a preferred embodiment of the present invention.

FIG. 2 is a diagram for explaining the data collection process of the data collection module according to a preferred embodiment of the present invention.

FIGS. 3 and 4 are diagrams illustrating data of different classes.

FIG. 5 is a diagram showing the architecture of the data preprocessing module according to the present embodiment.

FIG. 6 is a diagram for explaining the data quality management module according to an embodiment of the present invention.

FIG. 7 is a diagram illustrating the configuration of the augmented analysis profiling module according to an embodiment of the present invention.

FIG. 8 is a diagram illustrating an example of the workspace interface provided by the augmented analysis collaboration module according to the present embodiment.

FIG. 9 is a diagram illustrating an example of the collaboration interface provided by the augmented analysis collaboration module according to the present embodiment.

Since the present invention may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

As shown in FIG. 1, the big data augmented analysis platform according to this embodiment may include a data collection module 100, a data preprocessing module 102, a data quality management module 104, an augmented analysis profiling module 106, and an augmented analysis collaboration module 108.

The data collection module 100 intelligently collects and classifies data for big data analysis.

As shown in FIG. 1, the data collection module 100 (the big data intelligent collection module) according to this embodiment is connected to a data source 200 through a network and stores the collected data in one or more databases 202-n.

Data collected from the data source 200 may include structured data, semi-structured data, and unstructured data.

Structured data is data and files stored in fixed fields (columns), like the tables of a relational database system, as well as spreadsheet-style data whose attributes are distinguished by designated rows and columns.

Because structured data has a schema, the process of exploring it is standardized in the order of table exploration, column structure exploration, and row exploration.

Semi-structured data carries, inside the data itself, metadata corresponding to the schema of structured data. It is generally stored in file form and includes log data, machine data, and operational data.

Because semi-structured data carries metadata about its structure inside the data, it is necessary to determine what form the data takes. The regularities within the data are identified, and parsing rules capable of parsing the data are applied.

Unstructured data includes text data amenable to language analysis, such as voice, document, and SNS data, and multimedia data such as voice, image, and video data.

The data collection module 100 according to this embodiment receives data including structured, semi-structured, and unstructured data from the data source 200 connected through the network, and classifies the class of the input data.

The data collection module 100 according to this embodiment classifies the class of data by extracting features of the data.

According to this embodiment, the classification may be performed by an algorithm that has been trained in advance by machine learning.

According to this embodiment, k-NN, RNN, and BERT algorithms are trained in advance on data whose classes are already known.

After this prior training is complete, when new data is input, the data collection module 100 extracts features of the data and feeds the extracted features to the trained algorithm to calculate the probability that the data belongs to each of a plurality of classes.

A data classification database may be built through the prior training. The data classification database stores a plurality of predefined classes and information on the features corresponding to each class.

The features of the data may include the type, domain, and maximum/minimum values of each of one or more fields (columns).

Here, the type may be number, text, or binary data, and the domain may include category, date, time, amount, coordinate, percentage, fraction, and exponent.
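The per-field features described above (type, domain, and maximum/minimum values) can be illustrated with a minimal Python sketch. The patent does not disclose how the features are extracted, so the regular expressions, the "category" heuristic, and the names `infer_type`, `infer_domain`, and `extract_features` are all assumptions made for illustration.

```python
import re

# Hypothetical domain patterns; the patent does not specify how domains are detected.
DOMAIN_PATTERNS = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "time": re.compile(r"^\d{2}:\d{2}(:\d{2})?$"),
    "percentage": re.compile(r"^\d+(\.\d+)?%$"),
}


def infer_type(values):
    """Return 'binary', 'number', or 'text' for one field's raw values."""
    if all(isinstance(v, (bytes, bytearray)) for v in values):
        return "binary"
    try:
        for v in values:
            float(v)
        return "number"
    except (TypeError, ValueError):
        return "text"


def infer_domain(values):
    """Return a domain name, or None when no assumed rule applies."""
    texts = [str(v) for v in values]
    for name, pattern in DOMAIN_PATTERNS.items():
        if all(pattern.match(t) for t in texts):
            return name
    # Assumed heuristic: heavily repeated values suggest a category.
    if len(set(texts)) <= len(texts) // 2:
        return "category"
    return None


def extract_features(columns):
    """Map each field name to its type, domain, and min/max (numbers only)."""
    features = {}
    for name, values in columns.items():
        field_type = infer_type(values)
        numbers = [float(v) for v in values] if field_type == "number" else []
        features[name] = {
            "type": field_type,
            "domain": infer_domain(values),
            "min": min(numbers) if numbers else None,
            "max": max(numbers) if numbers else None,
        }
    return features


sample = {
    "observed_at": ["09:00", "10:00", "11:00", "12:00"],
    "temperature": ["12.5", "14.0", "15.2", "16.1"],
    "station": ["A", "A", "B", "A"],
}
features = extract_features(sample)
```

With this sample, `observed_at` is recognized as text in the time domain, `temperature` as a number with a minimum of 12.5 and a maximum of 16.1, and `station` as a text category.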

The data has a plurality of fields. The data collection module 100 inputs the types, domains, and maximum/minimum values of the fields into the pre-trained algorithm, compares the data against classes whose features are similar to those of the data, and calculates a probability value for each of the plurality of classes.

Referring to FIG. 3, the data collection module 100 inputs the types, domains, and maximum/minimum values of the plurality of fields into the pre-trained algorithm, compares the data against classes with similar features, and calculates a probability value for each class.

In FIG. 3, the first field is a time and the second through fifth fields are numbers having maximum and minimum values, so the data collection module 100 may determine that the data belongs to a weather-related class.

Likewise, as in FIG. 4, when a given integer appears repeatedly in the first field, the second field holds data having minimum and maximum values, and the third field is text in which the same text appears repeatedly, these features are extracted and the data is classified into a room-related class.
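As a rough illustration of how per-class probabilities could be computed from such features, the following Python sketch uses a plain k-NN vote. The text names k-NN, RNN, and BERT but gives no model details, so the three-element feature encoding, the labelled examples in `TRAINING`, and the function name `knn_class_probabilities` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled examples. Each vector is an assumed encoding:
# (number of time fields, number of numeric fields, number of repeated-text fields).
TRAINING = [
    ((1, 4, 0), "weather"),
    ((1, 3, 0), "weather"),
    ((1, 5, 0), "weather"),
    ((0, 2, 1), "room"),
    ((0, 1, 1), "room"),
    ((0, 2, 2), "room"),
]


def knn_class_probabilities(vector, training=TRAINING, k=3):
    """Estimate class probabilities as the share of votes among the k nearest examples."""
    nearest = sorted(training, key=lambda item: math.dist(vector, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    classes = sorted({label for _, label in training})
    return {label: votes[label] / k for label in classes}


# A data set shaped like FIG. 3: one time field followed by four numeric fields.
probs = knn_class_probabilities((1, 4, 0))
best_class = max(probs, key=probs.get)
```

Here the three nearest examples are all labelled "weather", so that class receives the full probability mass; a production system would use a far richer feature encoding and a larger training set.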

Referring to FIG. 5, to ease the capacity constraints of file-based (comma-separated values, CSV) data preprocessing, the data preprocessing module 102 according to this embodiment partitions the data and applies a multi-threading technique for simultaneous preprocessing of the partitions, thereby improving the amount of data that can be processed per node.

The memory cache (Memory Cache) of the data preprocessing module 102 provides a meta-information sharing and management function for preprocessing large amounts of data, covering items such as the preprocessing target list, preprocessing definition templates (Recipe), and data partitioning.

The Split module of the data preprocessing module 102 partitions the data to be preprocessed and performs a multi-thread management function of creating and reclaiming threads for preprocessing the partitioned data.

The integration module (Integration) of the data preprocessing module 102 integrates and stores the data whose preprocessing has been completed by the Split module.
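The split, parallel preprocessing, and integration flow can be sketched as follows. This is a minimal Python illustration using a thread pool, not the implementation described in the patent (which refers to Java threading); the chunk size, the cleaning rule in `preprocess`, and the function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, chunk_size):
    """Partition the rows into consecutive chunks (the Split step)."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def preprocess(chunk):
    """Hypothetical recipe: trim whitespace, lower-case, and drop empty rows."""
    return [row.strip().lower() for row in chunk if row.strip()]


def run(rows, chunk_size=2, workers=4):
    """Preprocess chunks on worker threads, then merge them (the Integration step)."""
    chunks = split(rows, chunk_size)
    # Threads are created here and reclaimed when the with-block exits.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(preprocess, chunks))  # map() keeps chunk order
    return [row for chunk in results for row in chunk]


rows = [" Seoul,21 ", "", "BUSAN,24", " Daegu,26", "   "]
cleaned = run(rows)
```

Because `pool.map` returns results in submission order, the merged output preserves the original row order. In CPython, threads mainly help with I/O-bound steps such as reading file partitions; CPU-bound recipes would need processes or a runtime without a global interpreter lock.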

The DFS (Distributed File Storage) of the data preprocessing module 102 supports data preprocessing storage using remote distributed file storage.

To improve data processing speed, the data preprocessing module 102 according to this embodiment uses Redis (REmote DIctionary Server), which manages data in a memory-based key-value structure and offers fast data access through an in-memory store, to share the meta-information needed for data partitioning and simultaneous data processing.

Here, Redis is a non-relational database that stores and retrieves all data in memory and therefore provides fast read and write speeds. It supports five basic data types: String, Set, Sorted Set, Hash, and List.
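A minimal Python sketch of sharing preprocessing meta-information through a key-value store is shown below. The key names, the JSON encoding of the recipe, and the functions `publish_job` and `load_job` are hypothetical. To keep the example runnable without a server, it uses an in-memory stand-in that exposes only the `hset`, `hgetall`, `rpush`, and `lrange` methods in the style of the redis-py client; a real redis-py client could be passed instead, with the caveat that it returns bytes unless created with `decode_responses=True`.

```python
import json


class InMemoryStore:
    """Stand-in exposing the few redis-py-style methods used below (assumption)."""

    def __init__(self):
        self._hashes = {}
        self._lists = {}

    def hset(self, name, mapping):
        self._hashes.setdefault(name, {}).update(mapping)

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))

    def rpush(self, name, *values):
        self._lists.setdefault(name, []).extend(values)

    def lrange(self, name, start, end):
        # Only the "0 to -1" (whole list) and non-negative end forms are handled here.
        items = self._lists.get(name, [])
        return items[start:] if end == -1 else items[start:end + 1]


def publish_job(store, job_id, recipe, chunk_ids):
    """Store the recipe in a Hash and the pending chunk ids in a List."""
    store.hset(f"preprocess:{job_id}:recipe", mapping={"steps": json.dumps(recipe)})
    store.rpush(f"preprocess:{job_id}:chunks", *chunk_ids)


def load_job(store, job_id):
    """Read back the shared recipe and chunk list, as a worker thread would."""
    recipe = json.loads(store.hgetall(f"preprocess:{job_id}:recipe")["steps"])
    chunks = store.lrange(f"preprocess:{job_id}:chunks", 0, -1)
    return recipe, chunks


store = InMemoryStore()
publish_job(store, "job1", ["trim", "dedupe"], ["chunk-0", "chunk-1"])
recipe, chunks = load_job(store, "job1")
```

Storing the recipe as a Hash and the chunk list as a List maps directly onto two of the Redis data types named above.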

In addition, the data preprocessing module 102 uses HDFS-based remote distributed file storage so that data can be preprocessed through external storage as well as local storage.

The data preprocessing module 102 also applies Java threading and uses the volatile keyword so that a data stream (DataStream) read from HDFS can be shared through main memory, thereby securing data integrity when the partitioned data is preprocessed and integrated.

Referring to FIG. 6, the data quality management module 104 defines the domains to be identified and registers each domain as metadata. The data quality management module 104 also improves data quality by adding regular expressions (Regex) as metadata so that missing values, outliers, or sensitive information can be detected according to the characteristics of the domain of the managed data.

The data quality management module 104 manages semantic meta-information on the data dictionary (Dictionary), keywords (Keyword), and regular expressions.

The data dictionary can be extended and modified, and data categories can be viewed, registered, and edited so that data domains requiring additional quality measurement can be added.

The regular expressions according to this embodiment may be configured on a rule basis. They are used to manage data quality through detection and classification according to data characteristics, and user-defined functions such as registering and modifying regular expression algorithms for the data dictionary are provided.

The regular expressions defined in the data quality management module 104 make it possible to detect missing values and outliers and to identify sensitive information.
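The following Python sketch illustrates regex-as-metadata checks of this kind. The patent does not publish its rule set, so every pattern here (the missing-value markers, the percentage rule, the e-mail and phone patterns) and the function name `profile_value` are assumptions for illustration only.

```python
import re

# Hypothetical rule metadata; real rules would be registered per domain by users.
MISSING = re.compile(r"^\s*(?:|null|n/a|na|-)\s*$", re.IGNORECASE)
DOMAIN_RULES = {
    # Values from 0% to 100% are treated as valid for the percentage domain.
    "percentage": re.compile(r"^(?:100(?:\.0+)?|\d{1,2}(?:\.\d+)?)%$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}
SENSITIVE = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"),
}


def profile_value(value, domain):
    """Label one value as missing, sensitive, outlier, or ok for the given domain."""
    if MISSING.match(value):
        return "missing"
    for label, pattern in SENSITIVE.items():
        if pattern.search(value):
            return f"sensitive:{label}"
    if not DOMAIN_RULES[domain].match(value):
        return "outlier"
    return "ok"


values = ["85%", "", "250%", "N/A", "kim@example.com"]
labels = [profile_value(v, "percentage") for v in values]
```

In this sketch an "outlier" simply means a value that does not match the domain's registered pattern; statistical outlier detection would need the distribution measurements that the module is also described as providing.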

In addition, domains can be classified automatically based on the metadata for the regular expressions defined in the data quality management module 104.

In the data quality management module 104, imputation (mean, regression) for improving data quality is modularized and applied, together with data distribution and statistical measurement functions.
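Mean imputation, the simpler of the two imputation methods mentioned, can be sketched in a few lines of Python. Representing missing values as `None` and the function name `impute_mean` are assumptions; regression imputation is not shown.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to base an estimate on
    fill = mean(observed)
    return [fill if v is None else v for v in values]


filled = impute_mean([10.0, None, 14.0, None, 12.0])
```

The observed values average 12.0, so both gaps are filled with 12.0. Mean imputation preserves the column mean but shrinks its variance, which is one reason a regression-based alternative may be preferred for correlated fields.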

According to this embodiment, a search engine based on Apache Lucene may be applied to improve automatic domain identification performance, and a search index (Index) makes it possible to identify the domains of large amounts of data.

With the Apache Lucene-based search engine, the Apache Lucene index file structure and search results can be analyzed through Look, and indexes for the data dictionary can be created, queried, and deleted using the Apache Lucene open API. Semantics are also shared with the data preprocessing module 102 for data integration.

FIG. 7 is a diagram illustrating the configuration of the augmented analysis profiling module according to an embodiment of the present invention.

The augmented analysis profiling module 106 according to this embodiment configures a workflow so that the augmented analysis profiling steps run according to the user's purpose. It is developed as a framework that can provide a machine learning-based run-time environment for the independent execution and organic combination of profiling modules.

Referring to FIG. 7, the service pool provides a pool for executing machine learning-based augmented analysis profiling.

The augmented analysis profiling module 106 defines the augmented analysis profiling execution flow and generates the Config required for every execution, such as data loading and storage. It also configures and deploys pipelines.

The pipeline performs management functions for predefined workflows, such as the execution order and repetition cycle of augmented analysis profiling, and their batch execution, stopping, and reclamation.
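A workflow of ordered steps that share one configuration and can be stopped can be sketched in plain Python as follows. This is not Apache Beam code and not the patented framework; the `Pipeline` class, the step functions, and the placeholder `source` value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Pipeline:
    """Runs the registered steps in order, passing each one the shared config."""

    config: Dict[str, Any]
    steps: List[Callable] = field(default_factory=list)
    stopped: bool = False

    def add(self, step):
        self.steps.append(step)
        return self  # allow chained registration

    def stop(self):
        self.stopped = True

    def run(self, data):
        executed = []
        for step in self.steps:
            if self.stopped:
                break
            data = step(data, self.config)
            executed.append(step.__name__)
        return data, executed


def drop_missing(rows, config):
    return [row for row in rows if row is not None]


def count_rows(rows, config):
    return {"source": config["source"], "rows": len(rows)}


pipeline = Pipeline(config={"source": "example-source"})
pipeline.add(drop_missing).add(count_rows)
result, executed = pipeline.run([1, None, 3])
```

Keeping the execution order in the pipeline object rather than in the steps themselves is what lets each profiling step stay independently runnable while still being combined into a larger flow.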

According to this embodiment, the augmented analysis profiling modules, each developed as a standalone module, are wrapped with Apache Beam so that they can be processed as streams within the pipeline.

Apache Beam pipelines can be deployed to and run on various runtimes such as Google Dataflow and Spark. Beam also provides various artifacts for common functions such as data read/write and inter-module interfaces.

Two execution environments can be provided as the pipeline runtime: the Direct Runner and the Apache Spark Runner.

Because the Direct Runner does not require a separate external analysis cluster, it can be used in the development/test stage. For the actual production environment, a stand-alone setup is provided so that Apache Spark can be used.

A pipeline executed through Apache Spark allows the augmented analysis profiling module 106 to read data from, and write analysis results to, externally configured HDFS and RDBMS storage.

The augmented analysis collaboration module 108 according to this embodiment enables collaboration among a plurality of workers in carrying out the data collection, cleansing, and profiling analysis processes.

As shown in FIG. 8, the augmented analysis collaboration module 108 according to this embodiment provides a workspace interface for managing each work process.

In the workspace interface according to this embodiment, each worker can create, modify, and delete work processes.

The augmented analysis collaboration module 108 also provides history management and sharing services for the analyzed profiling.

As shown in FIG. 9, the augmented analysis collaboration module 108 provides an interface for identifying the author, comments, and data throughput of the processes performed in a series of modules, such as data collection, preprocessing, META profiling, and visualization profiling, and records a log of the process performed in each module.

The recorded logs can be used for machine learning and later provided as guide information for processes performed by other workers.
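The logging and guidance idea can be illustrated with a small Python sketch. The patent does not describe the learning method, so a simple frequency count over past sessions is swapped in here as a stand-in for machine learning; the `LogEntry` fields, the sample sessions, and the function name `suggest_next` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LogEntry:
    """One recorded step: which module ran, who ran it, their comment, and throughput."""

    module: str
    author: str
    comment: str
    rows_processed: int


def suggest_next(sessions, current_module):
    """Suggest the module that most often followed current_module in past sessions."""
    followers = Counter()
    for entries in sessions:
        for previous, following in zip(entries, entries[1:]):
            if previous.module == current_module:
                followers[following.module] += 1
    return followers.most_common(1)[0][0] if followers else None


sessions = [
    [
        LogEntry("collection", "worker-a", "loaded sensor CSV", 10000),
        LogEntry("preprocessing", "worker-a", "trimmed whitespace", 9800),
        LogEntry("quality", "worker-a", "flagged e-mail fields", 9800),
    ],
    [
        LogEntry("collection", "worker-b", "loaded room logs", 5000),
        LogEntry("preprocessing", "worker-b", "dropped empty rows", 4700),
    ],
]
guide = suggest_next(sessions, "collection")
```

Both recorded sessions moved from collection to preprocessing, so that is the suggestion offered to the next worker; a learned model could additionally take the comments and throughput figures into account.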

The process in each module for big data augmented analysis profiling according to this embodiment may be performed in a device including a processor and a memory.

The processor may include a CPU (central processing unit) capable of executing a computer program, or a virtual machine or the like.

The memory may include a non-volatile storage device such as a fixed hard drive or a removable storage device. The removable storage device may include a compact flash unit, a USB memory stick, and the like. The memory may also include volatile memory such as various random access memories.

The embodiments of the present invention described above have been disclosed for purposes of illustration. Those of ordinary skill in the art will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications, changes, and additions should be regarded as falling within the scope of the following claims.

Claims

1. A big data augmented analysis profiling system comprising:

a data collection module that receives, from a data source, data including at least one of unstructured data, semi-structured data, and structured data, extracts features of the data to classify the class of the filtered data, and stores the classified data in one or more databases;

a data preprocessing module that partitions the data stored in the one or more databases and improves the amount of data that can be processed per node by applying a multi-threading technique for simultaneous preprocessing of the partitioned data;

a data quality management module that improves data quality by adding regular expressions (Regex) as metadata so that missing values, outliers, or sensitive information can be detected according to the characteristics of the domain of the data;

an augmented analysis profiling module that configures a workflow according to a user's purpose and provides a machine learning-based runtime environment; and

an augmented analysis collaboration module that provides an interface for identifying the author, comments, and data throughput of the processes performed in the data collection module, the data preprocessing module, the data quality management module, and the augmented analysis profiling module, records a log of the process performed in each module, and applies machine learning to the recorded logs to provide guide information for processes performed by other workers.

2. The system of claim 1,

wherein the data collection module classifies classes in advance according to features extracted from each of a plurality of data items, using a preset algorithm.

3. The system of claim 2,

wherein the features of the data include the type, domain, and maximum/minimum values of each of one or more fields.

4. The system of claim 3,

wherein the type is at least one of number, text, and binary data, and

the domain is at least one of category, date, time, amount, coordinate, percentage, fraction, and exponent.

5. The system of claim 1,

wherein the data preprocessing module includes a memory cache that provides a metadata sharing and management function for preprocessing large amounts of data, covering items such as the preprocessing target list, preprocessing definition templates (Recipe), and data partitioning.

6. The system of claim 5,

wherein the data preprocessing module includes a Split module that partitions the data to be preprocessed and performs a multi-thread management function of creating and reclaiming threads for preprocessing the partitioned data.

7. The system of claim 6,

wherein the data preprocessing module includes an integration module that integrates and stores the data whose preprocessing has been completed by the Split module.

8. The system of claim 1,

wherein the data quality management module manages semantic meta-information on a data dictionary, keywords, and regular expressions.

9. The system of claim 8,

wherein the regular expressions can be configured on a rule basis, are used to manage data quality through detection and classification according to data characteristics, and are accompanied by user-defined functions such as registering and modifying regular expression algorithms for the data dictionary.

10. The system of claim 8,

wherein the augmented analysis collaboration module outputs a workspace interface in which each work process can be created, deleted, and modified.