KR20160133719A - Chunk file generating apparatus and method thereof

Info

Publication number: KR20160133719A
Application number: KR1020150066601A
Authority: KR
Inventors: 오경진; 신응석
Original assignee: 삼성에스디에스 주식회사
Priority date: 2015-05-13
Filing date: 2015-05-13
Publication date: 2016-11-23

Abstract

A method for generating chunk files is disclosed. A method for generating chunk files according to an embodiment of the present invention includes: dividing data into chunk files of a preset size; comparing the chunk files with a plurality of pre-stored fixed-length chunk files; comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of pre-stored variable-length chunk files; and comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files, and generating, as a new chunk file, a data region that partially matches a fixed-length chunk file.

Description

CHUNK FILE GENERATING APPARATUS AND METHOD THEREOF

The present invention relates to a chunk file generating apparatus and a method thereof, and more particularly, to a chunk file generating apparatus and method that can generate chunk files with a high data deduplication ratio by comparing chunk files obtained by dividing data with pre-stored fixed-length chunk files and variable-length chunk files.

Data deduplication is a technique that detects portions duplicated across different pieces of data and removes them, thereby improving the efficiency of storage utilization and reducing the traffic generated during data transmission.

FIG. 1 is a diagram for explaining a method of transmitting files from a transmitting server (10) to a receiving server (20) using a conventional data deduplication technique.

As shown in FIG. 1, the transmitting server (10) and the receiving server (20) may each store a plurality of chunk files obtained by dividing a large file into files of a preset size.

Suppose that a first file (30) composed of chunk file A (11), chunk file B (12), and chunk file C (13), and a second file (40) composed of chunk file A (11) and chunk file C (13), are to be transmitted from the transmitting server (10) to the receiving server (20). If the first file (30) and the second file (40) are transmitted as they are, the portions corresponding to chunk file A (11) and chunk file C (13) are transmitted twice, which wastes resources.

When data deduplication is applied to the transmission of the first file (30) and the second file (40), the identifiers corresponding to chunk file A (11), chunk file B (12), and chunk file C (13), for example the hash value of each chunk file, are transmitted from the transmitting server (10) to the receiving server (20) together with information on the first file (30) and information on the second file (40).

Here, the information on the first file (30) and the information on the second file (40) may include information indicating which chunk files each of the two files is composed of.

The receiving server (20) can generate the first file (30) and the second file (40) using the information on the first file (30) and the information on the second file (40) received from the transmitting server (10).

When files are transmitted in this way, the bulk data itself is not actually sent, so network resources are saved while the same effect as transmitting the data itself is achieved.
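
The conventional transfer described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of SHA-256 as the identifier, and the 8-byte toy chunks are all assumptions made for the example. Each file is reduced to a list of chunk identifiers (its "file information"), and each distinct chunk is kept only once.

```python
import hashlib


def chunk_id(chunk: bytes) -> str:
    """Identifier of a chunk. SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(chunk).hexdigest()


def build_transfer(files):
    """Return (recipes, unique_chunks) for files given as lists of chunks.

    recipes maps a file name to the ordered identifiers of its chunks;
    unique_chunks maps each identifier to the chunk bytes, stored once.
    """
    recipes = {}
    unique_chunks = {}
    for name, chunks in files.items():
        ids = [chunk_id(c) for c in chunks]
        recipes[name] = ids
        for cid, c in zip(ids, chunks):
            unique_chunks.setdefault(cid, c)
    return recipes, unique_chunks


def rebuild(recipe, store):
    """Reassemble a file on the receiving side from its identifier list."""
    return b"".join(store[cid] for cid in recipe)


# Toy stand-ins for chunk files A (11), B (12), and C (13).
A, B, C = b"A" * 8, b"B" * 8, b"C" * 8
files = {"file1": [A, B, C], "file2": [A, C]}
recipes, unique_chunks = build_transfer(files)
```

Five chunk references across the two files resolve to three stored chunks, so chunks A and C are held only once even though both files use them.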

Meanwhile, a fixed-length chunking algorithm and a variable-length chunking algorithm have conventionally been used to generate such chunk files.

The fixed-length chunking algorithm generates chunk files by dividing a large data file into a plurality of files of the same preset size, whereas the variable-length chunking algorithm divides a data file into chunk files of varying sizes.
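
As a minimal sketch of the fixed-length case only, the split can be written as below. The function name and the 4-byte chunk size are assumptions for illustration; variable-length chunking is not shown because the text does not specify how its boundaries are chosen.

```python
def split_fixed(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes.

    The final chunk may be shorter when len(data) is not a multiple of size.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


chunks = split_fixed(b"abcdefghij", 4)
```

Joining the chunks in order reproduces the input, which is what lets a receiver rebuild the data from its chunk list.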

Conventionally, one of these two chunk file generation methods was selected as needed to generate chunk files. As a result, it was not possible to benefit simultaneously from the advantage of the fixed-length chunking algorithm, which can quickly divide target data into a plurality of chunk files, and the advantage of the variable-length chunking algorithm, which can effectively raise the duplicate-data hit rate by varying the chunk file size.

Accordingly, a need has arisen for a new type of deduplication technique that adaptively selects between the fixed-length chunking algorithm and the variable-length chunking algorithm so that the advantages of both can be obtained at once.

US Patent No. 7,733,910
US Patent No. 6,667,700

The present invention has been devised to solve the above problems, and an object of the present invention is to provide a technique for generating chunk files of optimal size by combining a fixed-length chunking algorithm and a variable-length chunking algorithm.

Another object is to provide a data deduplication technique that shortens chunk file generation time and raises the deduplication ratio by generating chunk files with a combination of the fixed-length chunking algorithm and the variable-length chunking algorithm.

The technical objects of the present invention are not limited to those mentioned above, and other technical objects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above objects, a method for generating chunk files according to an embodiment of the present invention includes: dividing data into chunk files of a preset size; comparing the chunk files with a plurality of pre-stored fixed-length chunk files; comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of pre-stored variable-length chunk files; and comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files, and generating, as a new chunk file, a data region that partially matches a fixed-length chunk file.

According to an embodiment of the present invention, comparing the chunk file with the plurality of pre-stored fixed-length chunk files may include converting a chunk file that matches one of the pre-stored fixed-length chunk files into the identifier corresponding to the matching fixed-length chunk file, and transmitting the identifier.

According to an embodiment of the present invention, converting the chunk file into the identifier corresponding to the matching fixed-length chunk file and transmitting it may include calculating a hash value of the fixed-length chunk file identical to the chunk file, and transmitting the calculated hash value.

According to an embodiment of the present invention, dividing the data into chunk files of a preset size may include dividing the data into the same size as the pre-stored fixed-length chunk files.

According to an embodiment of the present invention, the size of the variable-length chunk files may be less than or equal to the size of the fixed-length chunk files.

According to an embodiment of the present invention, comparing a chunk file that does not match the plurality of fixed-length chunk files with the plurality of pre-stored variable-length chunk files may include converting a data region that partially matches a variable-length chunk file into the identifier corresponding to that variable-length chunk file, and transmitting the chunk file in which part of the data region has been converted into the identifier corresponding to the variable-length chunk file.

According to an embodiment of the present invention, comparing a chunk file that does not match the plurality of fixed-length chunk files with the plurality of pre-stored variable-length chunk files may include: searching for a variable-length chunk file whose first byte contains the same data as the first byte of the chunk file; calculating a hash value over a portion of the chunk file equal in length to the found variable-length chunk file; comparing the calculated hash value with the hash value of the found variable-length chunk file; and, if the two hash values are identical, determining that the chunk file and the variable-length chunk file partially match.
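
The partial-match test above (first-byte lookup, then a hash over a prefix of the same length as the candidate) might look like the following sketch. The class name `VariableStore`, its index layout, and the use of SHA-256 are assumptions for illustration; the text does not say how the stored variable-length chunk files are indexed.

```python
import hashlib
from collections import defaultdict


def digest(data: bytes) -> str:
    """Hash value of a byte string. SHA-256 is an assumed choice."""
    return hashlib.sha256(data).hexdigest()


class VariableStore:
    """Hypothetical index of variable-length chunks keyed by first byte."""

    def __init__(self):
        # first byte value -> list of (length, hash) of stored chunks
        self.by_first_byte = defaultdict(list)

    def add(self, chunk: bytes) -> None:
        if not chunk:
            raise ValueError("empty chunk")
        self.by_first_byte[chunk[0]].append((len(chunk), digest(chunk)))

    def prefix_matches(self, chunk: bytes):
        """Return (length, hash) of stored chunks equal to a prefix of chunk."""
        if not chunk:
            return []
        hits = []
        # Step 1: only candidates sharing the first byte are considered.
        for length, stored_hash in self.by_first_byte.get(chunk[0], []):
            # Steps 2-3: hash the same number of bytes and compare hashes.
            if length <= len(chunk) and digest(chunk[:length]) == stored_hash:
                hits.append((length, stored_hash))
        return hits


store = VariableStore()
store.add(b"bcd")
store.add(b"bxy")
hits = store.prefix_matches(b"bcdefgh")
```

Both stored chunks pass the first-byte filter, but only `b"bcd"` survives the hash comparison, so it is the only partial match reported.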

According to an embodiment of the present invention, the method may further include transmitting, as the data itself, the region of the chunk file other than the region that matches the variable-length chunk file.

According to an embodiment of the present invention, converting the data region that partially matches the variable-length chunk file into the identifier corresponding to the variable-length chunk file may include: when a plurality of variable-length chunk files partially match the chunk file, selecting the largest of those variable-length chunk files; and converting the data region that partially matches the largest variable-length chunk file into the identifier corresponding to the largest variable-length chunk file.
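
A minimal sketch of the two preceding paragraphs, under stated assumptions: the function name `encode_chunk`, the `("ref", …)` / `("raw", …)` message tuples, and SHA-256 are hypothetical, and for brevity the prefix test uses a direct byte comparison rather than the hash comparison described earlier. The largest matching variable-length chunk is replaced by its identifier, and the remaining region is kept as literal data.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hash value used as the chunk identifier (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def encode_chunk(chunk: bytes, stored: list[bytes]):
    """Replace the longest stored prefix of chunk with its identifier.

    Returns a list of ("ref", identifier) and ("raw", bytes) items.
    """
    candidates = [s for s in stored if s and chunk.startswith(s)]
    if not candidates:
        return [("raw", chunk)]
    best = max(candidates, key=len)  # largest partially matching chunk
    out = [("ref", digest(best))]
    rest = chunk[len(best):]
    if rest:
        out.append(("raw", rest))  # unmatched region is sent as data itself
    return out


encoded = encode_chunk(b"bcdefgh", [b"bc", b"bcde", b"bx"])
```

Both `b"bc"` and `b"bcde"` match the start of the chunk; choosing the larger one leaves only three bytes to send as literal data.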

According to an embodiment of the present invention, the method may further include storing, as a fixed-length chunk file, a chunk file that does not match the fixed-length chunk files and does not even partially match the variable-length chunk files.

A chunk file generating apparatus according to another embodiment of the present invention includes: a first storage unit that stores a plurality of fixed-length chunk files; a second storage unit that stores a plurality of variable-length chunk files; a data division unit that divides data into chunk files of a preset size; a fixed-length chunk file comparison unit that compares the divided chunk files with the plurality of fixed-length chunk files stored in the first storage unit; a variable-length chunk file comparison unit that compares a chunk file that does not match the plurality of fixed-length chunk files with the plurality of variable-length chunk files stored in the second storage unit; and a chunk file generation unit that compares a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files and generates, as a new chunk file, a data region that partially matches a fixed-length chunk file.

According to an embodiment of the present invention, the fixed-length chunk file comparison unit may convert a chunk file that matches one of the pre-stored fixed-length chunk files into the identifier corresponding to the matching fixed-length chunk file.

According to an embodiment of the present invention, the identifier corresponding to the fixed-length chunk file may be a hash value of the fixed-length chunk file.

According to an embodiment of the present invention, the data division unit may divide the data into the same size as the fixed-length chunk files stored in the first storage unit.

According to an embodiment of the present invention, the variable-length chunk file comparison unit may convert a data region of the chunk file that partially matches a variable-length chunk file into the identifier corresponding to that variable-length chunk file.

According to an embodiment of the present invention, the variable-length chunk file comparison unit may search for a variable-length chunk file whose first byte contains the same data as the first byte of the chunk file, calculate a hash value over a portion of the chunk file equal in length to the found variable-length chunk file, compare the calculated hash value with the hash value of the found variable-length chunk file, and, if the two hash values are identical, determine that the chunk file and the variable-length chunk file partially match.

According to an embodiment of the present invention, when a plurality of variable-length chunk files partially match the chunk file, the variable-length chunk file comparison unit may select the largest of those variable-length chunk files and convert the data region that partially matches the largest variable-length chunk file into the identifier corresponding to the largest variable-length chunk file.

According to an embodiment of the present invention, the chunk file generation unit may store, in the first storage unit as a fixed-length chunk file, a chunk file that does not match the fixed-length chunk files and does not even partially match the variable-length chunk files.

A chunk file generating apparatus according to still another embodiment of the present invention includes one or more processors, a memory into which a computer program executed by the processors is loaded, and storage that stores a computer program for generating chunk files. The computer program includes: an operation of dividing data into chunk files of a preset size; an operation of comparing the chunk files with a plurality of pre-stored fixed-length chunk files; an operation of comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of pre-stored variable-length chunk files; and an operation of comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files and generating, as a new chunk file, a data region that partially matches a fixed-length chunk file.

A computer program according to still another embodiment of the present invention is stored on a recording medium in order to execute, in combination with a computing device: dividing data into chunk files of a preset size; comparing the chunk files with a plurality of pre-stored fixed-length chunk files; comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of pre-stored variable-length chunk files; and comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files and generating, as a new chunk file, a data region that partially matches a fixed-length chunk file.

Generating chunk files according to the embodiments of the present invention described above and using them for data deduplication can maximize the deduplication ratio.

In addition, network resources can be kept from being wasted by redundant transmission of the same data.

FIG. 1 is a diagram for explaining a method of transmitting files from a transmitting server (10) to a receiving server (20) using a conventional data deduplication technique.
FIG. 2 is a diagram for explaining a process of dividing a data stream into chunk files of the same size as pre-stored fixed-length chunk files, according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a process of determining whether a variable-length chunk file that partially matches a chunk file is stored, according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a method of generating and storing a new chunk file, according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining a process of transmitting data using chunk files, according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a storage unit updated with a newly generated chunk file, according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a process of using chunk files for data deduplication, according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining a method of performing data deduplication when a plurality of variable-length chunk files partially match a chunk file, according to an embodiment of the present invention.
FIG. 9 is a flowchart for explaining a process by which chunk files used for data deduplication are generated, according to an embodiment of the present invention.
FIG. 10 is a functional block diagram for explaining a chunk file generating apparatus according to an embodiment of the present invention.
FIG. 11 is a diagram for explaining a method of restoring data received by the above-described method to the original data, according to an embodiment of the present invention.
FIG. 12 is a diagram for explaining a chunk file generating apparatus according to another embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless expressly and specifically defined.

In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those stated.

FIG. 2 is a diagram for explaining a process of dividing a data stream into chunk files of the same size as pre-stored fixed-length chunk files, according to an embodiment of the present invention.

The first server (200) includes a first storage unit (210) that stores fixed-length chunk files of the same preset size and a second storage unit (230) that stores variable-length chunk files of differing sizes.

Likewise, the second server (300) includes a third storage unit (310) that stores fixed-length chunk files and a fourth storage unit (330) that stores variable-length chunk files.

Here, the size of a variable-length chunk file may be less than or equal to the size of a fixed-length chunk file.

For example, if the fixed-length chunk files are divided into 2048-byte units and stored in the first storage unit (210) and the third storage unit (310), the variable-length chunk files may be divided into sizes no larger than 2048 bytes, such as 1024 bytes or 512 bytes, and stored in the second storage unit (230) and the fourth storage unit (330).

In addition, the storage units of the first server (200) and the second server (300) may be synchronized so that they store the same chunk files.

For example, when a first fixed-length chunk file (211) is newly generated, the first server (200) transmits the first fixed-length chunk file (211) to the second server (300) together with its corresponding identifier.

The second server (300), having received the first fixed-length chunk file (211) and its corresponding identifier, stores them in the third storage unit (310).

Here, the identifier corresponding to the first fixed-length chunk file (211) may be the hash value of the first fixed-length chunk file (211).

Through the above process, the newly generated first fixed-length chunk file (211) is stored on both the first server (200) and the second server (300). Thereafter, when data corresponding to the first fixed-length chunk file (211) is to be transmitted from the first server (200) to the second server (300), the first fixed-length chunk file (211) itself need not be transmitted; only its corresponding identifier, for example the hash value of the first fixed-length chunk file (211), may be transmitted.

Network resources are thus saved while the same effect as transmitting the first fixed-length chunk file (211) itself is achieved.
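
The synchronization step described above can be sketched as below. The `Server` class, its `fixed_store` dictionary, and the `register_new_chunk` helper are hypothetical names chosen for this example, and SHA-256 is an assumed hash; the network transfer is reduced to a direct method call.

```python
import hashlib


class Server:
    """Hypothetical server holding a store of fixed-length chunk files."""

    def __init__(self):
        self.fixed_store = {}  # identifier -> chunk bytes

    def receive_new_chunk(self, identifier: str, chunk: bytes) -> None:
        # The receiving side stores the chunk under the identifier it was sent.
        self.fixed_store[identifier] = chunk


def register_new_chunk(sender: Server, receiver: Server, chunk: bytes) -> str:
    """Store a newly generated chunk on the sender and mirror it to the receiver."""
    identifier = hashlib.sha256(chunk).hexdigest()  # assumed identifier scheme
    sender.fixed_store[identifier] = chunk
    receiver.receive_new_chunk(identifier, chunk)  # chunk and identifier sent together
    return identifier


first_server, second_server = Server(), Server()
cid = register_new_chunk(first_server, second_server, b"x" * 2048)
```

Once both stores hold the chunk, later transfers of the same 2048 bytes only need to carry `cid`.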

The following describes, as an example, the process of transmitting a data stream (400) from the first server (200) to the second server (300).

When a transmission request for the data stream (400) is received, the data stream (400) is divided into chunk files (410, 420, 430, 440) of a preset size. The data stream (400) may be divided into the same size as the fixed-length chunk files.

The chunk files (410, 420, 430, 440), divided into the same size as the fixed-length chunk files, are compared with the fixed-length chunk files (211, 212, 213) stored in the first storage unit (210) of the first server (200), which is the transmitting side.

That is, the first storage unit (210) is searched to determine whether any of the divided chunk files (410, 420, 430, 440) is identical to a pre-stored fixed-length chunk file.

In this embodiment, a first fixed-length chunk file (211) identical to the first chunk file (410) is already stored in the first storage unit (210).

Therefore, when the data stream (400) is transmitted, the portion corresponding to the first chunk file (410) is not transmitted as data; it is converted into the identifier corresponding to the first chunk file (410), and that identifier is transmitted.

The second server (300), having received the identifier corresponding to the first chunk file (410), restores the data using the fixed-length chunk file (311) that corresponds to the received identifier among its pre-stored chunk files.

According to an embodiment of the present invention, when the portion corresponding to the first chunk file (410) is transmitted, the hash value of the first chunk file (410) may be transmitted instead of the first chunk file (410) itself, informing the second server (300) that the data being transmitted is the first chunk file (410).
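
A minimal sketch of this fixed-length stage, assuming SHA-256 identifiers, a 4-byte chunk size chosen only to keep the example small, and hypothetical `("ref", …)` / `("raw", …)` message tuples. Chunks that do not match are simply passed through as raw data here; in the described method they would go on to the variable-length comparison instead.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; the text's example uses 2048 bytes


def digest(data: bytes) -> str:
    """Hash value used as the chunk identifier (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def encode_stream(data: bytes, fixed_store: dict):
    """Replace each fixed-size chunk found in the store by its identifier."""
    messages = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = digest(chunk)
        if h in fixed_store:
            messages.append(("ref", h))      # send the identifier only
        else:
            messages.append(("raw", chunk))  # simplification: send as data
    return messages


def decode_stream(messages, fixed_store: dict) -> bytes:
    """Receiver side: look identifiers up in the synchronized store."""
    return b"".join(
        fixed_store[payload] if kind == "ref" else payload
        for kind, payload in messages
    )


store = {digest(b"aaaa"): b"aaaa"}  # shared by sender and receiver
messages = encode_stream(b"aaaabbbb", store)
restored = decode_stream(messages, store)
```

The first chunk travels as a hash and the second as literal bytes, yet the receiver reconstructs the original stream exactly.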

Meanwhile, there may be cases in which no chunk file matching the chunk files 410, 420, 430, and 440 generated by dividing the data stream 400 is stored in the first storage unit 210.

The chunk files 420, 430, and 440 that do not match any of the fixed-length chunk files are compared with the variable-length chunk files stored in the second storage unit 230.

FIG. 3 is a diagram for explaining a process of determining whether a variable-length chunk file that partially matches a chunk file is stored, according to an embodiment of the present invention.

When no file identical to a chunk file 410, 420, 430, or 440 generated by dividing the data stream 400 is stored in the first storage unit 210, it is checked whether a chunk file that partially matches that chunk file is stored in the second storage unit 230.

That is, the second chunk file 420 through the fourth chunk file 440, which do not match the fixed-length chunk files 211, 212, and 213, are compared with the variable-length chunk files 231 and 233 stored in the second storage unit 230.

Specifically, it is checked whether there is a variable-length chunk file storing the same information as that stored in the first byte of the chunk file 420, 430, or 440 being compared.

For example, to find a variable-length chunk file that partially matches the second chunk file 420 shown in FIG. 2, it is searched whether there is a variable-length chunk file containing the same information as "b", the information contained in the first byte.

Then, when the first variable-length chunk file 231 containing "b" is found, a hash value is computed over a portion of the second chunk file 420 equal in length to the found first variable-length chunk file 231.

For example, if the first variable-length chunk file 231, which contains "b" in its first byte, is 1024 bytes long, a hash value is computed over a 1024-byte portion of the second chunk file 420 to check whether the found first variable-length chunk file 231 matches part of the second chunk file 420.

When the hash comparison confirms that the first variable-length chunk file 231 matches part of the second chunk file 420, the data area that matches the first variable-length chunk file 231 is converted into the corresponding identifier and transmitted.

Here, the transmitted identifier may be the hash value of the first variable-length chunk file 231.

Because the files stored in the first storage unit 210 and the second storage unit 230 of the first server 200 are synchronized with those stored in the third storage unit 310 and the fourth storage unit 330 of the second server 300, the second server 300 can, upon receiving the identifier corresponding to the first variable-length chunk file 231, look up the corresponding file and restore the data.

As for "x", the portion of the second chunk file 420 that does not match the first variable-length chunk file 231, no corresponding chunk file is stored in the second storage unit 230, so the data itself may be transmitted for that portion.

That is, in this embodiment, the portion of the second chunk file 420 corresponding to "bb" is transmitted to the second server 300 in the form of the hash value of the first variable-length chunk file 231, and the remaining portion corresponding to "x" is transmitted to the second server 300 as the data itself.

However, if a variable-length chunk file corresponding to "x" had been stored in the second storage unit 230, the corresponding identifier could be transmitted to the second server 300 instead.
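The partial-match search described above (filter candidates by first byte, then hash a portion of the chunk equal in length to the candidate) might be sketched as follows. The first-byte index, SHA-256, and the function names are assumptions made for illustration; the sketch also assumes the match is anchored at the start of the chunk, as in the "bbx" example, and the contents b"pq" for a second variable-length chunk are made up.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Identifier of a chunk; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()


def build_first_byte_index(variable_chunks) -> dict:
    """Group (length, identifier) pairs of stored chunks by their first byte."""
    index = {}
    for vc in variable_chunks:
        index.setdefault(vc[:1], []).append((len(vc), chunk_id(vc)))
    return index


def match_prefix(chunk: bytes, index: dict):
    """Return (identifier, remaining bytes) for the first candidate that matches."""
    for length, cid in index.get(chunk[:1], []):
        # Hash only as many bytes of the chunk as the candidate is long.
        if length <= len(chunk) and chunk_id(chunk[:length]) == cid:
            return cid, chunk[length:]
    return None


# Chunk 231 is "bb" per the description; b"pq" is a hypothetical second entry.
index = build_first_byte_index([b"bb", b"pq"])
result = match_prefix(b"bbx", index)
```

Here `result` carries the identifier of "bb" together with the unmatched remainder b"x", which corresponds to the identifier-plus-data transmission described for the second chunk file 420.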

Meanwhile, for the third chunk file 430 and the fourth chunk file 440, no identical fixed-length chunk file is stored in the first storage unit 210, nor is any partially matching variable-length chunk file stored in the second storage unit 230.

Therefore, the third chunk file 430 and the fourth chunk file 440 must be stored in the first storage unit 210 or the second storage unit 230 so that they can be used when other data streams are transmitted later.

However, if the third chunk file 430 and the fourth chunk file 440 generated through the above process are stored as they are in the first storage unit 210 or the second storage unit 230, they may merely occupy storage space without ever being used for data deduplication.

That is, if data streams rarely contain the sequence "mkj" of the third chunk file 430 or the sequence "yde" of the fourth chunk file 440, the third chunk file 430 and the fourth chunk file 440 will seldom be used.

Accordingly, to increase the deduplication ratio, a new chunk file must go through a predetermined process before it is stored in the first storage unit 210 or the second storage unit 230.

FIG. 4 is a diagram for explaining a method of generating and storing a new chunk file according to an embodiment of the present invention.

Among the chunk files generated by dividing the data stream 400, the third chunk file 430 and the fourth chunk file 440, for which no identical or partially matching chunk file exists in the first storage unit 210 or the second storage unit 230, are stored in the first storage unit 210 or the second storage unit 230 through a predetermined process.

Specifically, the third chunk file 430 and the fourth chunk file 440 are compared again with the fixed-length chunk files 211, 212, and 213 stored in the first storage unit 210 to check whether there is any overlapping data area.

In this embodiment, the fourth chunk file 440 and the second fixed-length chunk file 212 match in the portion containing "de".

The overlapping data area, that is, the portion containing "de", is generated as a variable-length chunk file and stored in the second storage unit 230.

As described above, the third chunk file 430 and the fourth chunk file 440 are not themselves stored in the first storage unit 210 or the second storage unit 230; instead, they are compared with the previously stored fixed-length chunk files and the matching data area is generated as a variable-length chunk file. The reason is to increase the deduplication ratio.

This is because a data sequence that has appeared once in a data stream is likely to be repeated in other data streams.

Therefore, generating a variable-length chunk file from "de", which has already been repeated once, can help increase the deduplication ratio more than generating and storing the first-seen "yde" as a chunk file.
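The description does not say how the overlapping data area is located. One plausible reading, sketched below under that assumption, is to take the longest contiguous byte run shared between the unmatched chunk and any stored fixed-length chunk. The function names, the `min_len` threshold (used here to ignore single-byte coincidences), and the contents of the fixed-length chunks other than the "de" overlap are all hypothetical.

```python
def longest_common_run(a: bytes, b: bytes) -> bytes:
    """Longest contiguous run of bytes occurring in both a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at a[i-2], b[j-2]
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def derive_new_chunk(chunk: bytes, fixed_chunks, min_len: int = 2):
    """Return the longest area of chunk that also appears in a fixed-length chunk."""
    best = b""
    for fc in fixed_chunks:
        run = longest_common_run(chunk, fc)
        if len(run) > len(best):
            best = run
    return best if len(best) >= min_len else None


# Hypothetical fixed-length chunks; b"cde" stands in for chunk 212, which contains "de".
fixed_chunks = [b"abc", b"cde", b"fgh"]
new_variable_chunk = derive_new_chunk(b"yde", fixed_chunks)
```

With these inputs the fourth chunk "yde" yields "de" as the new variable-length chunk, while a chunk such as "mkj" yields nothing and would be handled as described next for the third chunk file 430.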

Meanwhile, the third chunk file 430 does not even partially match the fixed-length chunk files 211, 212, and 213 stored in the first storage unit 210.

That is, since the third chunk file 430 is stored on neither the first server 200 nor the second server 300, the third chunk file 430 is stored in the first storage unit 210 and the data itself is transmitted to the second server 300.

Thereafter, when a data stream containing the same data sequence as the third chunk file 430 is transmitted from the first server 200 to the second server 300, the third chunk file 430 stored in the first storage unit 210 can be used.

Generating and storing chunk files through the above process can maximize the deduplication ratio when data deduplication is performed.

FIG. 5 is a diagram for explaining a process of transmitting data using chunk files according to an embodiment of the present invention.

As described above, for the first chunk file 410, identical fixed-length chunk files 211 and 311 are stored in the first storage unit 210 of the first server 200 and the third storage unit 310 of the second server 300, so the corresponding first identifier 510 is transmitted rather than the first chunk file 410 itself.

According to an embodiment of the present invention, the first identifier 510 may be the hash value of the first chunk file 410.

For the second chunk file 420, the first variable-length chunk file 231, which is identical to part of the second chunk file 420, is stored in the second storage unit 230 of the first server 200 and the fourth storage unit 330 of the second server 300, so the identifier corresponding to the first variable-length chunk file 231 and the remaining data are transmitted.

Specifically, for the second chunk file 420, the portion corresponding to "bb" may be transmitted as its corresponding identifier and the remaining "x" portion may be transmitted as data.

In this embodiment, the data itself is described as being transmitted for the "x" portion of the second chunk file 420 because no variable-length chunk file corresponding to it is stored in the second storage unit 230 and the fourth storage unit 330. If such a variable-length chunk file exists, however, its identifier may be transmitted instead of the data itself.

Meanwhile, for the third chunk file 430 and the fourth chunk file 440, no corresponding chunk files are stored in the first storage unit 210 through the fourth storage unit 330, so the data itself is transmitted.

However, once the first storage unit 210 through the fourth storage unit 330 have been updated through the predetermined process described above, those chunk files may be used when subsequent data streams are transmitted.

FIG. 6 is a diagram for explaining storage units updated with newly generated chunk files according to an embodiment of the present invention.

As described above, since no chunk file identical to all or part of the third chunk file 430 is stored in the first storage unit 210 through the fourth storage unit 330, the portion corresponding to the third chunk file 430 is transmitted as the data itself and is stored in the first storage unit 210 and the third storage unit 310 as the fixed-length chunk files 214 and 314.

In addition, when the third chunk file 430 is stored on the second server 300 as the fixed-length chunk file 314, its corresponding identifier, for example the hash value of the third chunk file 430, may be stored with it.

Accordingly, when a data stream containing the sequence "mkj" is subsequently transmitted, the hash value of the corresponding fixed-length chunk files 214 and 314 is transmitted rather than the data itself, which saves network resources while achieving the same effect as transmitting the actual "mkj".

Likewise, the third variable-length chunk files 235 and 335, which are identical to part of the fourth chunk file 440 and are stored as new chunk files, may be stored in the second storage unit 230 and the fourth storage unit 330.

For the third variable-length chunk files 235 and 335 as well, the corresponding identifier is stored with them, so that if a data stream to be transmitted later contains "de", the corresponding identifier is transmitted instead of the data itself, saving network resources.

Meanwhile, storing chunk files on the first server 200 and the second server 300 and using them for data transmission as described above prevents network resources from being wasted by transmitting the same data repeatedly.

FIG. 7 is a diagram for explaining a process of using chunk files for data deduplication according to an embodiment of the present invention.

FIG. 7 illustrates the process of transmitting a data stream 700 from the first server 200 to the second server 300 using chunk files according to an embodiment of the present invention.

The data stream 700 is divided into pieces of the same length as the fixed-length chunk files stored in the first storage unit 210.

For the fifth chunk file 710 generated in this way, the identical third fixed-length chunk file 213 is stored in the first storage unit 210, so the identifier corresponding to the third fixed-length chunk file 213, for example its hash value, is transmitted rather than the data of the fifth chunk file 710 itself.

For the sixth chunk file 720, the second variable-length chunk file 233, which matches part of the sixth chunk file 720, is stored in the second storage unit 230, so the identifier of that portion, for example the hash value 780 of the second variable-length chunk file 233, is transmitted.

Here, for x 770, the remaining portion of the sixth chunk file 720, no corresponding chunk file is stored in the first storage unit 210 or the second storage unit 230, so the data itself is transmitted.

For the seventh chunk file 730, the first variable-length chunk file 231, which matches part of the seventh chunk file 730, is stored in the second storage unit 230, so the identifier corresponding to the first variable-length chunk file 231, for example its hash value, is transmitted.

Likewise, h 800, the remaining portion of the seventh chunk file 730, is transmitted as the data itself.

For the eighth chunk file 740 and the ninth chunk file 750, the corresponding fixed-length chunk files are stored in the first storage unit 210, so the hash value of the first fixed-length chunk file 211 and the hash value of the third fixed-length chunk file 213 are transmitted, respectively.

In the data transmission process described above, the fifth chunk file 710 and the ninth chunk file 750 of the data stream 700 have the same data sequence, so transmitting the data itself would send the same data repeatedly.

However, by transmitting the identifier corresponding to the chunk file stored in the storage unit, for example the hash value of the third fixed-length chunk file 213, according to an embodiment of the present invention, the same effect as transmitting the data itself is obtained without transmitting duplicate data.

The second server 300, having received the chunk file identifiers and data described above, can reconstruct the original data stream 700 using the identifiers received from the first server 200, that is, the hash values of the chunk files.
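The reconstruction on the receiving side can be sketched as a lookup over the received sequence of identifiers and raw data. The token layout, SHA-256, the single combined store, and the concrete chunk contents (b"fgh" for chunk 213, b"abc" for chunk 211, b"pq" for chunk 233) are assumptions for illustration only; only "bb" for chunk 231 and the raw remainders "x" and "h" come from the description.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Identifier of a chunk; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()


def restore(tokens, store: dict) -> bytes:
    """Rebuild the stream: look identifiers up in the store, pass raw data through."""
    parts = []
    for kind, value in tokens:
        if kind == "id":
            parts.append(store[value])  # a KeyError would mean the stores are out of sync
        else:
            parts.append(value)
    return b"".join(parts)


# Receiver-side store (third and fourth storage units merged for brevity).
store = {chunk_id(c): c for c in (b"abc", b"fgh", b"bb", b"pq")}

# Token sequence mirroring FIG. 7: chunks 710, 720, 730, 740, 750.
tokens = [
    ("id", chunk_id(b"fgh")),
    ("id", chunk_id(b"pq")), ("raw", b"x"),
    ("id", chunk_id(b"bb")), ("raw", b"h"),
    ("id", chunk_id(b"abc")),
    ("id", chunk_id(b"fgh")),
]
stream = restore(tokens, store)
```

Only two bytes of this 15-byte stream travel as raw data in the sketch; everything else is carried by identifiers, including the repeated first and last chunks.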

Meanwhile, there may be cases in which several variable-length chunk files partially match a chunk file.

FIG. 8 is a diagram for explaining a method of performing data deduplication when a plurality of variable-length chunk files partially match a chunk file, according to an embodiment of the present invention.

For the tenth chunk file 810, an identical fixed-length chunk file is stored in the first storage unit 210, so its corresponding identifier, for example the hash value of the tenth chunk file 810, is transmitted.

For the eleventh chunk file 820, no identical fixed-length chunk file is stored in the first storage unit 210, so it is determined whether a variable-length chunk file that partially matches the chunk file 820 exists in the second storage unit 230.

Looking at the second storage unit 230, there are two variable-length chunk files that partially match the eleventh chunk file 820.

Specifically, the fourth variable-length chunk file 814, which matches the eleventh chunk file 820 in the "hc" data area, and the sixth variable-length chunk file 816, which matches it in the "hcj" data area, are stored.

In this case, among the variable-length chunk files containing a data area that partially matches the eleventh chunk file 820, the one with the largest file size is selected.

That is, of the fourth variable-length chunk file 814 and the sixth variable-length chunk file 816, both of which partially match the eleventh chunk file 820, the larger sixth variable-length chunk file 816 is selected, and the area of the eleventh chunk file 820 having the same data sequence as the sixth variable-length chunk file 816 is converted into the identifier corresponding to the sixth variable-length chunk file 816 and transmitted.

This is because, if the fourth variable-length chunk file 814 were selected, only the portion of the eleventh chunk file 820 corresponding to "hc" would be converted into the identifier of the fourth variable-length chunk file 814, and the remaining "jhm" data area would have to be transmitted as the data itself, whereas if the sixth variable-length chunk file 816 is selected, only "hm", which remains after excluding the corresponding data area, is transmitted as the data itself.

In other words, the larger the selected variable-length chunk file, the smaller the area transmitted as the data itself, so network resources can be managed more efficiently.
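The selection rule above (prefer the largest partially matching variable-length chunk) can be illustrated with the "hcjhm" example. This is a sketch under assumptions: SHA-256 as the hash, a store keyed by identifier, matching anchored at the start of the chunk, and a hypothetical function name.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Identifier of a chunk; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()


def best_variable_match(chunk: bytes, variable_store: dict):
    """Among stored chunks matching the start of chunk, pick the longest one.

    Returns (identifier, remaining raw bytes), or None when nothing matches.
    """
    best = None  # (length, identifier) of the longest match seen so far
    for vid, data in variable_store.items():
        n = len(data)
        if 0 < n <= len(chunk) and chunk_id(chunk[:n]) == vid:
            if best is None or n > best[0]:
                best = (n, vid)
    if best is None:
        return None
    n, vid = best
    return vid, chunk[n:]


# Chunks 814 ("hc") and 816 ("hcj") from FIG. 8; chunk 820 is "hcjhm".
variable_store = {chunk_id(c): c for c in (b"hc", b"hcj")}
match = best_variable_match(b"hcjhm", variable_store)
```

Choosing "hcj" leaves two raw bytes ("hm") to send, whereas choosing "hc" would leave three ("jhm"), which is the saving the passage describes.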

FIG. 9 is a flowchart for explaining a process of generating chunk files used for data deduplication according to an embodiment of the present invention.

The data to be transmitted is divided into chunk files of a predetermined size (S910). Here, the data is divided so that each chunk file has the same size as the previously stored fixed-length chunk files.

The chunk files generated in this way are compared with the fixed-length chunk files previously stored in the first storage unit 210 (S920) to determine whether any of them is identical to a stored fixed-length chunk file.

For a chunk file determined to be identical to a fixed-length chunk file previously stored in the first storage unit 210, its corresponding identifier, for example the hash value of that chunk file, is transmitted (S930).

Chunk files that do not match the fixed-length chunk files are compared with the variable-length chunk files stored in the second storage unit 230 to determine whether any of them partially matches a variable-length chunk file (S940).

For a chunk file determined to partially match a variable-length chunk file previously stored in the second storage unit 230, the portion corresponding to the variable-length chunk file is transmitted as the hash value of that variable-length chunk file (S950).

Chunk files that do not even partially match the variable-length chunk files stored in the second storage unit 230 are compared again with the fixed-length chunk files stored in the first storage unit 210 (S960).

Then, the data area that partially matches a fixed-length chunk file is generated as a new chunk file (S970) and stored in the second storage unit 230 as a variable-length chunk file.

As described above, generating chunk files from data areas that partially match previously stored fixed-length chunk files can increase the data deduplication ratio.

This is because a data sequence stored in the first storage unit 210 as a fixed-length chunk file has already been repeated in another data stream, and a data sequence that has been repeated once is likely to be repeated again.

That is, deriving a variable-length chunk file from a fixed-length chunk file, which is a data sequence that has already been repeated, yields a variable-length chunk file made of a data sequence likely to be repeated yet again in other data streams, and this increases the data deduplication ratio.
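Steps S910 through S970 can be put together in one compact sketch. Everything below is illustrative rather than the patented implementation: SHA-256 as the hash, stores as dictionaries keyed by identifier, a brute-force overlap search, a `min_len` threshold of 2, storing a wholly unmatched full-size chunk as a new fixed-length chunk (following the handling of the third chunk file 430), and the sample store contents are all assumptions. Synchronizing the receiver's stores is left out.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Identifier of a chunk; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()


def common_run(a: bytes, b: bytes) -> bytes:
    """Longest contiguous run of a that also occurs in b (brute force)."""
    for size in range(min(len(a), len(b)), 0, -1):
        for start in range(len(a) - size + 1):
            if a[start:start + size] in b:
                return a[start:start + size]
    return b""


def process(stream: bytes, size: int, fixed: dict, variable: dict, min_len: int = 2):
    """Encode stream into tokens and update the stores with new chunks."""
    tokens = []
    for i in range(0, len(stream), size):                      # S910: split
        chunk = stream[i:i + size]
        cid = chunk_id(chunk)
        if cid in fixed:                                       # S920: fixed-length match
            tokens.append(("id", cid))                         # S930: send identifier
            continue
        hits = [(len(v), vid) for vid, v in variable.items()   # S940: partial match
                if len(v) <= len(chunk) and chunk_id(chunk[:len(v)]) == vid]
        if hits:
            length, vid = max(hits)                            # prefer the longest match
            tokens.append(("id", vid))                         # S950: identifier + rest
            if chunk[length:]:
                tokens.append(("raw", chunk[length:]))
            continue
        tokens.append(("raw", chunk))                          # no match: send data
        runs = [common_run(chunk, f) for f in fixed.values()]  # S960: compare again
        best = max(runs, key=len, default=b"")
        if len(best) >= min_len:
            variable[chunk_id(best)] = best                    # S970: new variable chunk
        elif len(chunk) == size:
            fixed[cid] = chunk                                 # keep as new fixed chunk
    return tokens


# Hypothetical store contents, except "bb" (chunk 231) and the "de" overlap.
fixed = {chunk_id(c): c for c in (b"abc", b"cde", b"fgh")}
variable = {chunk_id(c): c for c in (b"bb", b"pq")}
tokens = process(b"abcbbxmkjyde", 3, fixed, variable)
```

After this run the fixed-length store additionally holds "mkj" and the variable-length store additionally holds "de", matching the updated storage units described for FIG. 6.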

FIG. 10 is a functional block diagram for explaining a chunk file generating apparatus according to an embodiment of the present invention.

The chunk file generating apparatus 1000 according to an embodiment of the present invention includes a data division unit 1010, a first storage unit 1020, a second storage unit 1030, a fixed-length chunk file comparison unit 1040, a variable-length chunk file comparison unit 1050, a chunk file generation unit 1060, and a transmission unit 1070.

FIG. 10 shows only the components related to the embodiments of the present invention. Accordingly, those of ordinary skill in the art to which the present invention pertains will understand that other general-purpose components may be included in addition to those shown in FIG. 10.

The chunk file generating apparatus 1000 may be implemented as a computing device capable of transmitting and receiving files, for example a general server, a proxy server, a personal computer, or a laptop computer.

The data division unit 1010 divides data into chunk files of a predetermined size. Here, the data is divided into pieces of the same size as the fixed-length chunk files stored in the first storage unit 1020.

The first storage unit 1020 stores a plurality of fixed-length chunk files. Specifically, the first storage unit 1020 stores fixed-length chunk files having data sequences that have been repeated in other data.

Here, a fixed-length chunk file means a chunk file generated with a predetermined constant size by a fixed-length chunking algorithm.

In addition, the first storage unit 1020 may also store identifiers corresponding to the stored fixed-length chunk files, for example the hash value of each fixed-length chunk file.

The second storage unit 1030 stores a plurality of variable-length chunk files. Specifically, the second storage unit 1030 stores variable-length chunk files having data sequences that have been repeated in other data.

Here, a variable-length chunk file means a chunk file generated with an arbitrary size by a variable-length chunking algorithm.

고정 길이 청크 파일 비교부(1040)는 분할된 청크 파일을 제1 저장부(1020)에 기 저장된 고정 길이 청크 파일과 비교한다.The fixed length chunk file comparison unit 1040 compares the divided chunk file with the fixed length chunk file previously stored in the first storage unit 1020.

비교 결과, 데이터 분할부(1010)를 통해 생성된 청크 파일 중 제1 저장부(1020)에 기 저장된 고정 길이 청크 파일과 일치하는 청크 파일은 그에 대응되는 식별자로 외부에 전송된다.As a result of comparison, the chunk file corresponding to the fixed length chunk file previously stored in the first storage unit 1020 among the chunk files generated through the data division unit 1010 is transmitted to the outside as an identifier corresponding thereto.

이때, 청크 파일과 일치하는 고정 길이 청크 파일의 식별자는 그 고정 길이 청크 파일의 해시값일 수 있다.At this time, the identifier of the fixed-length chunk file matching the chunk file may be the hash value of the fixed-length chunk file.

데이터 분할부(1010)에서 생성된 청크 파일 중 제1 저장부(1020)에 기 저장된 고정 길이 청크 파일과 불일치 하는 청크 파일들은 가변 길이 청크 파일 비교부(1050)에서 제2 저장부(1030)에 저장된 가변 길이 청크 파일들과 비교된다.The variable length chunk file comparison unit 1050 compares the chunk files generated by the data division unit 1010 with the fixed length chunk files previously stored in the first storage unit 1020 in the second storage unit 1030 It is compared with stored variable length chunk files.

As a result of the comparison, for a chunk file generated by the data division unit 1010 that partially matches a variable-length chunk file previously stored in the second storage unit 1030, the data area that partially matches the variable-length chunk file is transmitted as the identifier of that variable-length chunk file.

In this case, the identifier of the variable-length chunk file may be the hash value of that variable-length chunk file.
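One possible reading of this partial-match step, following the first-byte search, hash comparison, and largest-match selection recited in the claims, might be sketched in Python as follows. SHA-256 and the names `chunk_id` and `match_variable` are assumptions made for illustration only.

```python
import hashlib


def chunk_id(chunk):
    # SHA-256 is an assumption; the disclosure only says "hash value".
    return hashlib.sha256(chunk).hexdigest()


def match_variable(chunk, var_store):
    # var_store maps identifier (hash) -> stored variable-length chunk.
    # Returns (identifier, remaining_data) for the largest stored chunk
    # that matches the start of `chunk`, or None if nothing matches.
    best = None
    for ident, stored in var_store.items():
        if not stored or len(stored) > len(chunk):
            continue
        if stored[0] != chunk[0]:
            continue  # cheap first-byte filter before hashing
        # Hash the part of the chunk as long as the candidate and compare.
        if chunk_id(chunk[:len(stored)]) == ident:
            if best is None or len(stored) > len(var_store[best]):
                best = ident
    if best is None:
        return None
    # The matched area is sent as an identifier, the rest as raw data.
    return best, chunk[len(var_store[best]):]
```

The first-byte check only avoids unnecessary hashing; the hash comparison is what decides whether the data area is treated as matching.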

A chunk file that matches none of the fixed-length chunk files stored in the first storage unit 1020 and does not even partially match any variable-length chunk file stored in the second storage unit 1030 is compared, in the chunk file generation unit 1060, with the fixed-length chunk files stored in the first storage unit 1020.

If the comparison shows that the chunk file contains a data area that partially matches a fixed-length chunk file, a new chunk file is generated from that data area.

The generated new chunk file is smaller than the fixed-length chunk files stored in the first storage unit 1020, and the new chunk file is stored in the second storage unit 1030.
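The generation of a new chunk file might be sketched in Python as follows. This passage does not state how the partially matching data area is located, so the sketch assumes, purely for illustration, that it is the longest common prefix shared with a stored fixed-length chunk. The names `chunk_id`, `common_prefix_len`, and `make_new_chunk`, the `min_len` threshold, and the use of SHA-256 are likewise hypothetical.

```python
import hashlib


def chunk_id(chunk):
    # SHA-256 is an assumption; the disclosure only says "hash value".
    return hashlib.sha256(chunk).hexdigest()


def common_prefix_len(a, b):
    # Number of leading bytes that a and b have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def make_new_chunk(chunk, fixed_store, var_store, min_len=2):
    # Find the longest area of `chunk` shared with any fixed-length chunk
    # (assumed here to be a common prefix) and register it as a new
    # variable-length chunk in var_store.
    best = b""
    for stored in fixed_store.values():
        n = common_prefix_len(chunk, stored)
        if n > len(best):
            best = chunk[:n]
    if len(best) < min_len:
        return None  # no usable partial match
    var_store[chunk_id(best)] = best
    return best
```

Because the chunk did not fully match any fixed-length chunk, the shared area is necessarily shorter than the fixed chunk size, which is consistent with the new chunk file being stored among the variable-length chunk files.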

FIG. 11 is a diagram for explaining a method of restoring data received by the above-described method to the original data, according to an embodiment of the present invention.

As shown in FIG. 11, the data stream is transmitted as a combination of data and identifiers corresponding to the respective chunk files, for example the hash values of the chunk files.

The second server 300, having received the combination of chunk file hash values and data described above, searches the third storage unit 310 or the fourth storage unit 330 for the chunk files corresponding to the hash values and restores the original data.

For example, if the data transmitted from the first server 200 after deduplication is first identifier (760) - data x (770) - second identifier (780) - third identifier (790) - data n (800) - third identifier (810) - first identifier (760), the chunk files corresponding to the respective identifiers are searched for and the original data is restored.

Specifically, if the chunk file corresponding to the first identifier 760 is the third fixed-length chunk file 313, the data can be restored step by step by replacing the area where the first identifier 760 is located with the third fixed-length chunk file 313.
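The restoration described above might be sketched in Python as follows. The tuple-based stream format, the names `chunk_id` and `restore`, and the use of SHA-256 are assumptions made for illustration. The two dictionaries stand in for the third storage unit 310 and the fourth storage unit 330.

```python
import hashlib


def chunk_id(chunk):
    # SHA-256 is an assumption; the disclosure only says "hash value".
    return hashlib.sha256(chunk).hexdigest()


def restore(stream, third_store, fourth_store):
    # stream is a list of ("id", identifier) or ("data", raw_bytes) items.
    # Each identifier is replaced by the chunk found in either store.
    out = bytearray()
    for kind, value in stream:
        if kind == "data":
            out += value
        else:
            chunk = third_store.get(value)
            if chunk is None:
                # Raises KeyError if the identifier is unknown to both stores.
                chunk = fourth_store[value]
            out += chunk
    return bytes(out)
```

For instance, a stream of identifier, raw data, identifier, identifier is rebuilt by substituting each identifier with its stored chunk in order and copying the raw data through unchanged.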

In general, each identifier is smaller than the data of its corresponding chunk file, so transmitting data through the process described above can save network resources.

Furthermore, replacing the duplicated data sequences contained in the data stream with identifiers reduces the amount of data transmitted.
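The size of this saving can be illustrated with a short Python calculation. All figures here are hypothetical: a 4096-byte chunk size, 32-byte identifiers (the length of a SHA-256 digest), and a stream in which three chunks are replaced by identifiers and 100 bytes are sent as raw data. The function name `wire_size` is also an assumption.

```python
def wire_size(stream, id_size=32):
    # Bytes actually transmitted: id_size per identifier, full length for raw data.
    return sum(id_size if kind == "id" else len(value) for kind, value in stream)


CHUNK = 4096  # hypothetical fixed chunk size in bytes
stream = [("id", "h1"), ("data", b"x" * 100), ("id", "h2"), ("id", "h1")]

original = 3 * CHUNK + 100   # size of the data before deduplication
sent = wire_size(stream)     # size of the deduplicated stream
saved = original - sent
```

Under these assumed figures the stream shrinks from 12388 bytes to 196 bytes. The actual saving depends on how much of the data duplicates stored chunk files.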

FIG. 12 is a diagram for explaining a chunk file generating apparatus according to still another embodiment of the present invention.

The chunk file generating apparatus 1200 according to still another embodiment of the present invention includes a processor 1210, a storage 1220, a memory 1230, a network interface 1240, and a bus 1250.

The processor 1210 is a processor capable of executing the chunk file generation program 1231. However, it is not limited thereto and may be implemented to execute other programs.

The storage 1220 includes a variable-length chunk file storage unit 1221 and a fixed-length chunk file storage unit 1223. It may also contain the executable file of the chunk file generation program 1231 and other resource files.

The memory 1230 loads the chunk file generation program 1231. The chunk file generation program 1231 loaded into the memory 1230 is executed by the processor 1210.

Other computing devices may be connected to the network interface 1240. The other computing devices connected to the network interface 1240 may be a display device, a user terminal, or the like.

In addition, the network interface 1240 according to an embodiment of the present invention may be implemented as Ethernet, FireWire, USB, or the like.

The bus 1250 connects the above-described processor 1210, storage 1220, memory 1230, and the like, and serves as a data transfer path.

Meanwhile, the above-described method can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. In addition, the structure of the data used in the above-described method can be recorded on a computer-readable recording medium by various means. The computer-readable recording medium includes storage media such as magnetic storage media (for example, ROM, floppy disks, hard disks, etc.) and optically readable media (for example, CD-ROM, DVD, etc.).

Those of ordinary skill in the art related to the present embodiment will understand that it may be implemented in modified forms without departing from the essential characteristics of the above description. The disclosed methods should therefore be considered from an illustrative point of view, not a restrictive one. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of their equivalents should be construed as being included in the present invention.

Claims

A chunk file generating method comprising:
dividing data into chunk files of a predetermined size;
comparing the chunk files with a plurality of previously stored fixed-length chunk files;
comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of previously stored variable-length chunk files; and
comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files, and generating a data area that partially matches a fixed-length chunk file as a new chunk file.

The method according to claim 1,
wherein comparing the chunk files with the plurality of previously stored fixed-length chunk files comprises:
converting a chunk file that matches one of the previously stored fixed-length chunk files into the identifier corresponding to the matching fixed-length chunk file, and transmitting the identifier.

The method according to claim 2,
wherein converting the chunk file into the identifier corresponding to the matching fixed-length chunk file and transmitting the identifier comprises:
calculating a hash value of the fixed-length chunk file that is identical to the chunk file; and
transmitting the calculated hash value.

The method according to claim 1,
wherein dividing the data into chunk files of a predetermined size comprises:
dividing the data into chunk files of the same size as the previously stored fixed-length chunk files.

The method according to claim 1,
wherein the size of each variable-length chunk file is equal to or smaller than the size of the fixed-length chunk files.

The method according to claim 1,
wherein comparing the chunk file that does not match the plurality of fixed-length chunk files with the plurality of previously stored variable-length chunk files comprises:
converting a data area that partially matches a variable-length chunk file into the identifier corresponding to that variable-length chunk file; and
transmitting the chunk file in which part of the data area has been converted into the identifier corresponding to the variable-length chunk file.

The method according to claim 6,
wherein comparing the chunk file that does not match the plurality of fixed-length chunk files with the plurality of previously stored variable-length chunk files comprises:
searching for a variable-length chunk file whose first byte contains the same data as the first byte of the chunk file;
calculating, within the chunk file, a hash value over a length corresponding to the length of the found variable-length chunk file;
comparing the calculated hash value with the hash value of the found variable-length chunk file; and
determining that the chunk file and the variable-length chunk file partially match if the calculated hash value is identical to the hash value of the found variable-length chunk file.

The method according to claim 6,
further comprising transmitting, as the data itself, the area of the chunk file's data other than the area that matches the variable-length chunk file.

The method according to claim 6,
wherein converting a data area that partially matches a variable-length chunk file into the identifier corresponding to that variable-length chunk file comprises:
when a plurality of variable-length chunk files partially match the chunk file, selecting the variable-length chunk file having the largest file size among the variable-length chunk files that partially match the chunk file; and
converting the data area that partially matches the largest variable-length chunk file into the identifier corresponding to the largest variable-length chunk file.

The method according to claim 1,
further comprising storing, as a fixed-length chunk file, a chunk file among the chunk files that does not match the fixed-length chunk files and does not even partially match the variable-length chunk files.

A chunk file generating apparatus comprising:
a first storage unit that stores a plurality of fixed-length chunk files;
a second storage unit that stores a plurality of variable-length chunk files;
a data division unit that divides data into chunk files of a predetermined size;
a fixed-length chunk file comparison unit that compares the divided chunk files with the plurality of fixed-length chunk files stored in the first storage unit;
a variable-length chunk file comparison unit that compares a chunk file that does not match the plurality of fixed-length chunk files with the plurality of variable-length chunk files stored in the second storage unit; and
a chunk file generation unit that compares a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files and generates a data area that partially matches a fixed-length chunk file as a new chunk file.

The apparatus according to claim 11,
wherein the fixed-length chunk file comparison unit
converts a chunk file that matches one of the previously stored fixed-length chunk files into the identifier corresponding to the matching fixed-length chunk file.

The apparatus according to claim 12,
wherein the identifier corresponding to the fixed-length chunk file is a hash value of the fixed-length chunk file.

The apparatus according to claim 11,
wherein the data division unit
divides the data into chunk files of the same size as the fixed-length chunk files stored in the first storage unit.

The apparatus according to claim 11,
wherein the size of each variable-length chunk file is equal to or smaller than the size of the fixed-length chunk files.

The apparatus according to claim 11,
wherein the variable-length chunk file comparison unit
converts the data area of the chunk file that partially matches a variable-length chunk file into the identifier corresponding to that variable-length chunk file.

The apparatus according to claim 16,
wherein the variable-length chunk file comparison unit
searches for a variable-length chunk file whose first byte contains the same data as the first byte of the chunk file, calculates, within the chunk file, a hash value over a length corresponding to the length of the found variable-length chunk file, compares the calculated hash value with the hash value of the found variable-length chunk file, and determines that the chunk file and the variable-length chunk file partially match if the calculated hash value is identical to the hash value of the found variable-length chunk file.

The apparatus according to claim 16,
wherein the variable-length chunk file comparison unit,
when a plurality of variable-length chunk files partially match the chunk file, selects the largest variable-length chunk file among the variable-length chunk files that partially match the chunk file, and converts the data area that partially matches the largest variable-length chunk file into the identifier corresponding to the largest variable-length chunk file.

The apparatus according to claim 11,
wherein the chunk file generation unit
stores, in the first storage unit as a fixed-length chunk file, a chunk file among the chunk files that does not match the fixed-length chunk files and does not even partially match the variable-length chunk files.

A chunk file generating apparatus comprising:
one or more processors;
a memory into which a computer program executed by the processor is loaded; and
a storage that stores a computer program for generating chunk files,
wherein the computer program comprises:
an operation of dividing data into chunk files of a predetermined size;
an operation of comparing the chunk files with a plurality of previously stored fixed-length chunk files;
an operation of comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of previously stored variable-length chunk files; and
an operation of comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files and generating a data area that partially matches a fixed-length chunk file as a new chunk file.

A computer program stored on a medium, which, in combination with a computing device, executes the steps of:
dividing data into chunk files of a predetermined size;
comparing the chunk files with a plurality of previously stored fixed-length chunk files;
comparing a chunk file that does not match the plurality of fixed-length chunk files with a plurality of previously stored variable-length chunk files; and
comparing a chunk file that does not even partially match the plurality of variable-length chunk files with the fixed-length chunk files, and generating a data area that partially matches a fixed-length chunk file as a new chunk file.