WO2025170271A1

WO2025170271A1 - Subsequent nucleic acid sequence storage device and method for storing subsequent nucleic acid sequence using same

Info

Publication number: WO2025170271A1
Application number: PCT/KR2025/001453
Authority: WO
Inventors: 전재현; 이원경; 윤기석
Original assignee: Seegene Inc
Current assignee: Seegene Inc
Priority date: 2024-02-05
Filing date: 2025-01-24
Publication date: 2025-08-14
Anticipated expiration: 2026-08-05

Abstract

Provided is a subsequent nucleic acid sequence storage device. A subsequent nucleic acid sequence storage device according to an embodiment may perform: a step of dividing a reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence; a step of aligning a subsequent nucleic acid sequence to the reference nucleic acid sequence; a step of assigning PIs to the fragments of the subsequent nucleic acid sequence that respectively correspond to the PI-assigned fragments of the reference nucleic acid sequence, and assigning an SI to the nucleic acid sequence included in each of those fragments; and a step of storing the PIs and SIs for the fragments of the subsequent nucleic acid sequence in a nucleic acid sequence database.

Description

Subsequent nucleic acid sequence storage device and method for storing subsequent nucleic acid sequence using the same

The present invention relates to a subsequent nucleic acid sequence storage device and a method for storing a subsequent nucleic acid sequence using the same.

Conventionally, nucleic acid sequences have been stored and managed by downloading them from public databases in which they are held, typically the NCBI or GISAID databases.

However, because databases such as NCBI and GISAID hold a very large number of nucleic acid sequences, downloading the sequences takes considerable time, and storing them to build a separate database also takes considerable time and cost. Furthermore, when new nucleic acid sequences are added to the NCBI or GISAID database, or previously stored nucleic acid sequences are modified, all nucleic acid sequences in the database have to be downloaded again.

In the conventional approach, every nucleic acid sequence had to be downloaded anew regardless of whether part of it duplicated a sequence already held. For example, even if part of a newly added nucleic acid sequence was already stored in the database, the whole sequence still had to be downloaded and stored again. As a result, redundant information was stored repeatedly in the database, and storage space was consumed unnecessarily.

In particular, in a pandemic such as the one caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the number of nucleic acid sequences of the same or similar species held in databases such as NCBI or GISAID increases explosively, and the nucleic acid sequences of that species or similar species need to be updated continuously. In this case as well, all updated nucleic acid sequences were downloaded anew: even though partially duplicated nucleic acid sequences were already stored, the entire set of nucleic acid sequences had to be downloaded again. Repeatedly storing duplicated nucleic acid sequences in order to keep up with the explosively increasing sequence information made inefficient use of storage space. In addition, when the required nucleic acid sequences were later extracted from the database, the duplicated nucleic acid sequences had to be searched again, so the extraction operation was slow.

(Prior art literature)

(Patent document)

Korean Patent Application Publication No. 10-2015-0016572 (published February 12, 2015)

One problem to be solved by the present specification is to use storage space efficiently when storing, in a database, a subsequent nucleic acid sequence that partially duplicates a nucleic acid sequence already stored.

Another problem to be solved by the present specification is to make extraction of stored subsequent nucleic acid sequences easier and to reduce the time and cost involved.

However, the problems to be solved by the present specification are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art to which the present invention pertains from the description below.

A method for storing a subsequent nucleic acid sequence according to one embodiment may include: dividing a reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence; aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence; assigning a PI to each fragment of the subsequent nucleic acid sequence that corresponds to a PI-assigned fragment of the reference nucleic acid sequence, and assigning an SI to the nucleic acid sequence included in each of those fragments; and storing the PIs and SIs for the fragments of the subsequent nucleic acid sequence in a nucleic acid sequence database.
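The first step of the method, dividing the reference into consecutive fragments and numbering them, can be illustrated with a minimal sketch. The function name `fragment_reference`, the 1-based PI numbering, and the toy sequence are assumptions made for illustration; the specification does not prescribe them.

```python
def fragment_reference(reference: str, fragment_length: int) -> dict[int, str]:
    """Split a reference sequence into consecutive, non-overlapping fragments.

    Returns a mapping PI -> fragment sequence. PIs are numbered from 1
    (an assumption); the last fragment may be shorter than fragment_length.
    """
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    starts = range(0, len(reference), fragment_length)
    return {
        pi: reference[start:start + fragment_length]
        for pi, start in enumerate(starts, start=1)
    }


reference = "ATGCATGCATGA"          # toy 12-base reference
fragments = fragment_reference(reference, 5)
print(fragments)                    # {1: 'ATGCA', 2: 'TGCAT', 3: 'GA'}
```

Concatenating the fragments in PI order gives back the original reference, which is what later allows a stored sequence to be rebuilt from identifiers alone.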

In one embodiment, the reference nucleic acid sequence may be a nucleic acid sequence of an animal, a plant, a fungus, a protozoan, a bacterium, or a virus.

In one embodiment, the reference nucleic acid sequence may be a respiratory virus nucleic acid sequence selected from the group consisting of an influenza virus nucleic acid sequence, a respiratory syncytial virus (RSV) nucleic acid sequence, an adenovirus nucleic acid sequence, an enterovirus nucleic acid sequence, a parainfluenza virus nucleic acid sequence, a metapneumovirus (MPV) nucleic acid sequence, a bocavirus nucleic acid sequence, a rhinovirus nucleic acid sequence, and/or a coronavirus nucleic acid sequence.

In one embodiment, the predetermined length into which the reference nucleic acid sequence is fragmented may be the same or different for each species to which the reference nucleic acid sequence belongs.

In one embodiment, the SI for each fragment of the subsequent nucleic acid sequence may be assigned as follows: if the nucleic acid sequence of the fragment of the subsequent nucleic acid sequence is identical to the nucleic acid sequence of the corresponding fragment of the reference nucleic acid sequence, the identifier of the SI already assigned to that reference fragment is assigned; if it is different, a new identifier is assigned.
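The reuse-or-create rule for SIs can be sketched as follows. This is an illustrative assumption rather than the specification's implementation: SIs are taken to be small integers scoped to each PI, with SI 1 reserved for the reference fragment, and the helper name `assign_si` is hypothetical.

```python
def assign_si(known_by_pi: dict[int, list[str]], pi: int, fragment: str) -> int:
    """Return the SI for `fragment` at position `pi`.

    known_by_pi maps each PI to the list of distinct sequences seen so far;
    the SI of a sequence is its 1-based index in that list (assumption).
    An already known sequence reuses its SI; a new one is appended and
    receives the next free SI.
    """
    known = known_by_pi.setdefault(pi, [])
    if fragment in known:
        return known.index(fragment) + 1
    known.append(fragment)
    return len(known)


# Reference fragments are registered first, so they hold SI 1 at each PI.
known_by_pi = {1: ["ATGCA"], 2: ["TGCAT"]}

print(assign_si(known_by_pi, 1, "ATGCA"))  # 1: identical to the reference fragment
print(assign_si(known_by_pi, 2, "TGAAT"))  # 2: differs, so a new SI is created
print(assign_si(known_by_pi, 2, "TGAAT"))  # 2: a later sequence reuses that SI
```

The linear list search keeps the sketch short; a real store would more likely index sequences with a hash map or a database constraint.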

In one embodiment, the SI assigned to each fragment of the reference nucleic acid sequence and the SI assigned to each fragment of the subsequent nucleic acid sequence may be stored in a reference database in association with their respective nucleic acid sequences.

In one embodiment, the nucleic acid sequences matched to the respective SIs of the reference nucleic acid sequence and the subsequent nucleic acid sequence stored in the reference database may differ from one another.

In one embodiment, an SI to which a new identifier has been assigned is stored in the reference database together with the nucleic acid sequence matched to that SI, and the reference database may store data in which identifiers are assigned to mutually different nucleic acid sequences.

In one embodiment, the nucleic acid sequence database may include the SI matched to each PI of the reference nucleic acid sequence and of the subsequent nucleic acid sequence.
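One way to picture the two stores is as a pair of relational tables: a reference database that holds each distinct fragment sequence once, and a nucleic acid sequence database that holds only PI and SI pairs per stored sequence. The sketch below uses Python's standard `sqlite3` module; the table names, column names, and the choice to key SIs per PI are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reference_db (          -- distinct fragment sequences
    pi       INTEGER NOT NULL,
    si       INTEGER NOT NULL,
    sequence TEXT    NOT NULL,
    PRIMARY KEY (pi, si),
    UNIQUE (pi, sequence)            -- each sequence stored once per PI
);
CREATE TABLE sequence_db (           -- identifiers only, no bases
    seq_name TEXT    NOT NULL,
    pi       INTEGER NOT NULL,
    si       INTEGER NOT NULL,
    PRIMARY KEY (seq_name, pi),
    FOREIGN KEY (pi, si) REFERENCES reference_db (pi, si)
);
""")

conn.executemany("INSERT INTO reference_db VALUES (?, ?, ?)", [
    (1, 1, "ATGCA"), (2, 1, "TGCAT"),   # reference fragments
    (2, 2, "TGAAT"),                    # variant first seen in subsequent_1
])
conn.executemany("INSERT INTO sequence_db VALUES (?, ?, ?)", [
    ("reference", 1, 1), ("reference", 2, 1),
    ("subsequent_1", 1, 1), ("subsequent_1", 2, 2),
])

# Resolve the identifiers of one stored sequence back to bases.
rows = conn.execute("""
    SELECT s.pi, s.si, r.sequence
    FROM sequence_db AS s
    JOIN reference_db AS r ON r.pi = s.pi AND r.si = s.si
    WHERE s.seq_name = ?
    ORDER BY s.pi
""", ("subsequent_1",)).fetchall()
print(rows)  # [(1, 1, 'ATGCA'), (2, 2, 'TGAAT')]
```

In this toy data, two full sequences of two fragments each are represented by three stored fragment strings, since the fragment at PI 1 is shared.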

In one embodiment, in the step of aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence, the subsequent nucleic acid sequence may be aligned so as to correspond to the fragments of the reference nucleic acid sequence while reflecting any region of the subsequent nucleic acid sequence in which an insertion mutation or a deletion mutation has occurred.

In one embodiment, when the subsequent nucleic acid sequence includes an insertion or deletion mutation, the fragment of the subsequent nucleic acid sequence at the PI containing the insertion or deletion mutation may differ in length from the corresponding fragment of the reference nucleic acid sequence.
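The effect of insertions and deletions on fragment length can be seen in a small sketch that cuts an already aligned subsequent sequence at the reference fragment boundaries. The input is assumed to be a pairwise alignment given as two equal-length strings with `-` for gaps; the function name `split_aligned` and the rule that inserted bases join the fragment of the preceding reference base are assumptions, since the specification does not fix these details here.

```python
def split_aligned(aligned_ref: str, aligned_sub: str, fragment_length: int) -> dict[int, str]:
    """Cut an aligned subsequent sequence at the reference fragment boundaries.

    Each alignment column is assigned to the PI of the reference base it
    aligns to. Columns where the reference has a gap (insertions in the
    subsequent sequence) are attached to the PI of the preceding reference
    base (assumption). Gap characters in the subsequent sequence are dropped.
    """
    if len(aligned_ref) != len(aligned_sub):
        raise ValueError("aligned strings must have equal length")
    fragments: dict[int, str] = {}
    ref_pos = 0  # number of reference bases consumed so far
    for ref_base, sub_base in zip(aligned_ref, aligned_sub):
        if ref_base != "-":
            pi = ref_pos // fragment_length + 1
            ref_pos += 1
        else:
            pi = max(ref_pos - 1, 0) // fragment_length + 1
        fragments.setdefault(pi, "")
        if sub_base != "-":
            fragments[pi] += sub_base
    return fragments


# Deletion of two bases inside the first fragment: that fragment gets shorter.
print(split_aligned("ATGCATGCAT", "ATG--TGCAT", 5))      # {1: 'ATG', 2: 'TGCAT'}
# Insertion of two bases after the first fragment: that fragment gets longer.
print(split_aligned("ATGCA--TGCAT", "ATGCAGGTGCAT", 5))  # {1: 'ATGCAGG', 2: 'TGCAT'}
```

Because the cut points follow the reference, fragments outside the mutated region keep the same PI and the same sequence as the reference, so they can share its SIs.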

In one embodiment, the method may further include: aligning an additional subsequent nucleic acid sequence, obtained after the subsequent nucleic acid sequence, to the reference nucleic acid sequence; and assigning a PI to each fragment of the additional subsequent nucleic acid sequence that corresponds to a PI-assigned fragment of the reference nucleic acid sequence, and assigning an SI to the nucleic acid sequence included in each fragment of the additional subsequent nucleic acid sequence. The SI assigned to each fragment of the additional subsequent nucleic acid sequence may be an identifier indicating whether that fragment is identical to or different from the reference nucleic acid sequence and the subsequent nucleic acid sequence that were already stored before the additional subsequent nucleic acid sequence is stored.

In one embodiment, the method may further include obtaining a subsequent nucleic acid sequence from the nucleic acid sequence database and the reference database.

In one embodiment, the step of obtaining a subsequent nucleic acid sequence from the nucleic acid sequence database and the reference database may include: generating a query including at least one of gene information and nucleic acid sequence interval information for the subsequent nucleic acid sequence; and reconstructing nucleic acid sequence information corresponding to the query using the nucleic acid sequence database and the reference database.

In one embodiment, when the query is gene information of a nucleic acid sequence, the step of reconstructing nucleic acid sequence information corresponding to the query may include: designating, within the subsequent nucleic acid sequence, the PIs corresponding to the gene information of the query; consecutively concatenating the nucleic acid sequences of the SIs assigned to the designated PIs; and trimming both ends of the concatenated nucleic acid sequence to the region corresponding to the gene information of the query.

In one embodiment, when the query is nucleic acid sequence interval information, the step of reconstructing nucleic acid sequence information corresponding to the query may include: designating, within the subsequent nucleic acid sequence, the PIs corresponding to the interval information of the query; consecutively concatenating the nucleic acid sequences of the SIs assigned to the designated PIs; and trimming both ends of the concatenated nucleic acid sequence to the region corresponding to the interval information of the query.
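The designate, concatenate, and trim steps for an interval query can be sketched as below. Several points are assumptions made for illustration: the interval is given in 1-based inclusive reference coordinates, the fragments touched by the query have the reference length (no insertions or deletions), SIs are keyed per PI, and the name `reconstruct_interval` is hypothetical.

```python
def reconstruct_interval(
    pi_to_si: dict[int, int],
    reference_db: dict[tuple[int, int], str],
    fragment_length: int,
    start: int,
    end: int,
) -> str:
    """Rebuild the bases of one stored sequence for positions start..end.

    1) designate the PIs that cover the interval,
    2) concatenate the sequences of the SIs stored for those PIs,
    3) trim both ends down to the requested interval.
    """
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    first_pi = (start - 1) // fragment_length + 1
    last_pi = (end - 1) // fragment_length + 1
    joined = "".join(
        reference_db[(pi, pi_to_si[pi])] for pi in range(first_pi, last_pi + 1)
    )
    offset = (start - 1) - (first_pi - 1) * fragment_length
    return joined[offset:offset + (end - start + 1)]


reference_db = {
    (1, 1): "ATGCA",
    (2, 1): "TGCAT",
    (2, 2): "TGAAT",   # variant fragment at PI 2
    (3, 1): "GA",
}
subsequent_1 = {1: 1, 2: 2, 3: 1}   # PI -> SI row from the sequence database

print(reconstruct_interval(subsequent_1, reference_db, 5, 4, 8))  # CATGA
```

A gene query could be handled the same way once the gene is resolved to an interval, for example through an annotation table of gene start and end coordinates; such a table is not shown here and would be an additional assumption.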

A subsequent nucleic acid sequence storage device according to another embodiment is a device for storing a subsequent nucleic acid sequence using a reference nucleic acid sequence including consecutive fragments of a predetermined length, and includes a memory storing at least one command, a network, and a processor. By executing the at least one command, the processor may carry out: an instruction for dividing the reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence; an instruction for aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence; an instruction for assigning a PI to each fragment of the subsequent nucleic acid sequence that corresponds to a PI-assigned fragment of the reference nucleic acid sequence, and assigning an SI to the nucleic acid sequence included in each of those fragments, wherein the PI assigned to each fragment of the subsequent nucleic acid sequence is an identifier corresponding to the PI of the reference nucleic acid sequence, and the SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier representing whether the fragment of the reference nucleic acid sequence at that PI and the fragment of the subsequent nucleic acid sequence are identical or different; and an instruction for obtaining a nucleic acid sequence database including the PIs and SIs for the fragments of the subsequent nucleic acid sequence.

A computer-readable recording medium according to still another embodiment stores a computer program, and the computer program may be programmed to perform: dividing a reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence; aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence; assigning a PI to each fragment of the subsequent nucleic acid sequence that corresponds to a PI-assigned fragment of the reference nucleic acid sequence, and assigning an SI to the nucleic acid sequence included in each of those fragments, wherein the PI assigned to each fragment of the subsequent nucleic acid sequence is an identifier corresponding to the PI of the reference nucleic acid sequence, and the SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier representing whether the fragment of the reference nucleic acid sequence at that PI and the fragment of the subsequent nucleic acid sequence are identical or different; and obtaining a nucleic acid sequence database including the PIs and SIs for the fragments of the subsequent nucleic acid sequence.

A computer program according to still another embodiment is stored in a computer-readable recording medium, and the computer program may be programmed to perform: dividing a reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence; aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence; assigning a PI to each fragment of the subsequent nucleic acid sequence that corresponds to a PI-assigned fragment of the reference nucleic acid sequence, and assigning an SI to the nucleic acid sequence included in each of those fragments, wherein the PI assigned to each fragment of the subsequent nucleic acid sequence is an identifier corresponding to the PI of the reference nucleic acid sequence, and the SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier representing whether the fragment of the reference nucleic acid sequence at that PI and the fragment of the subsequent nucleic acid sequence are identical or different; and obtaining a nucleic acid sequence database including the PIs and SIs for the fragments of the subsequent nucleic acid sequence.

According to one embodiment, because a subsequent nucleic acid sequence is stored in the database by means of identifiers, less database storage space is used than with the conventional approach.

According to one embodiment, even when the number of nucleic acid sequences of the same or similar species increases explosively, nucleic acid sequences that duplicate already stored sequences are not stored again, and the time and cost of extracting nucleic acid sequences from the database can be reduced.

FIG. 1 is a diagram conceptually illustrating an operation in which a nucleic acid sequence storage device according to one embodiment obtains nucleic acid sequences from a public database.

FIG. 2 is a flowchart illustrating the steps of a method for storing a subsequent nucleic acid sequence according to one embodiment.

FIG. 3 is a diagram showing an example in which an obtained reference nucleic acid sequence and a subsequent nucleic acid sequence are aligned.

FIG. 4 is a diagram showing an example in which a reference nucleic acid sequence and a subsequent nucleic acid sequence are divided into consecutive fragments of a predetermined length and a position identifier (PI) is assigned to each of the fragments.

FIG. 5 is a diagram showing an example in which a subsequent nucleic acid sequence is aligned to a reference nucleic acid sequence in consideration of a deletion mutation that has occurred in the subsequent nucleic acid sequence.

FIG. 6 is a diagram showing an example in which the subsequent nucleic acid sequence of FIG. 5 is divided into consecutive fragments of a predetermined length and a PI is assigned to each of the fragments.

FIG. 7 is a diagram showing an example in which a subsequent nucleic acid sequence is aligned to a reference nucleic acid sequence in consideration of an insertion mutation that has occurred in the subsequent nucleic acid sequence.

FIG. 8 is a diagram showing an example in which the subsequent nucleic acid sequence of FIG. 7 is divided into consecutive fragments of a predetermined length and a PI is assigned to each of the fragments.

FIG. 9 is a diagram showing a process of assigning PIs and sequence identifiers (SIs) to a reference nucleic acid sequence.

FIG. 10 is a diagram showing a process of assigning PIs and SIs to an additional subsequent nucleic acid sequence stored after FIG. 9.

FIG. 11 is a diagram showing an example of a reference database in which SIs have been assigned through the processes of FIGS. 9 and 10.

FIG. 12 is a diagram showing an example of the obtained nucleic acid sequence database.

FIG. 13 is a diagram conceptually illustrating an operation of obtaining a subsequent nucleic acid sequence from the nucleic acid sequence database.

FIG. 14 is a diagram showing an example of designating the fragments corresponding to a query using the PIs and SIs stored in the database.

FIG. 15 is a diagram showing an example of interval information of a nucleic acid sequence query.

FIG. 16 is a hardware configuration diagram of a nucleic acid sequence storage device according to one embodiment.

The advantages and features of the present invention, and the methods for achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided merely so that the disclosure of the present invention is complete and so that those skilled in the art to which the present invention pertains are fully informed of the scope of the invention, and the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, detailed descriptions of known functions or configurations will be omitted where they are judged likely to obscure the gist of the invention unnecessarily. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary depending on the intention or custom of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

Before describing FIG. 1, the terms used herein are explained.

As used in the present disclosure, "nucleic acid sequence" means a nucleotide polymer of DNA or RNA and may be used interchangeably with "base sequence". The term covers both the biological entity itself and nucleic acid information in the bioinformatics sense. Such a nucleic acid sequence may be a nucleic acid sequence contained in any type of organism.

As used in the present disclosure, an organism may mean an organism belonging to a single genus, species, subspecies, subtype, genotype, serotype, strain, isolate, or cultivar. The organism may include a prokaryotic cell (e.g., Mycoplasma pneumoniae, Chlamydophila pneumoniae, Legionella pneumophila, Haemophilus influenzae, Streptococcus pneumoniae, Bordetella pertussis, Bordetella parapertussis, Neisseria meningitidis, Listeria monocytogenes, Streptococcus agalactiae, Campylobacter, Clostridium difficile, Clostridium perfringens, Salmonella, Escherichia coli, Shigella, Vibrio, Yersinia enterocolitica, Aeromonas, Chlamydia trachomatis, Neisseria gonorrhoeae, Trichomonas vaginalis, Mycoplasma hominis, Mycoplasma genitalium, Ureaplasma urealyticum, Ureaplasma parvum, Mycobacterium tuberculosis, Treponema pallidum, Candida, Mobiluncus, Megasphaera, Lacto spp., Mycoplasma genitalium, Clostridium difficile, Helicobacter pylori, ClariR, CPE, Group B Streptococcus, Enterobacter cloacae complex, Proteus mirabilis, Klebsiella aerogenes, Pseudomonas aeruginosa, Klebsiella oxytoca, Serratia marcescens, Klebsiella pneumoniae, Actinomycetaceae actinotignum, Enterococcus faecium, Staphylococcus epidermidis, Enterococcus faecalis, Staphylococcus saprophyticus, Staphylococcus aureus, Acinetobacter baumannii, Morganella morganii, Aerococcus urinae, Pantoea agglomerans, Citrobacter freundii, Providencia stuartii, Citrobacter koseri, Streptococcus anginosus, Trichophyton mentagrophytes complex, Microsporum spp., Trichophyton rubrum, Epidermophyton floccosum, Trichophyton tonsurans), a eukaryotic cell (e.g., protozoa and parasites, fungi, yeasts, higher plants, lower animals, and higher animals including mammals and humans), a virus, or a viroid. Among the eukaryotic cells, parasites may include, for example, Giardia lamblia, Entamoeba histolytica, Cryptosporidium, Blastocystis hominis, Dientamoeba fragilis, Cyclospora cayetanensis, stercoralis, trichiura, hymenolepis, Necator americanus, Enterobius vermicularis, Taenia spp., Ancylostoma duodenale, Ascaris lumbricoides, and Enterocytozoon spp./Encephalitozoon spp. The viruses may include, for example, influenza A virus (Flu A), influenza B virus (Flu B), respiratory syncytial virus A (RSV A), respiratory syncytial virus B (RSV B), Covid-19 virus, parainfluenza virus 1 (PIV 1), parainfluenza virus 2 (PIV 2), parainfluenza virus 3 (PIV 3), parainfluenza virus 4 (PIV 4), metapneumovirus (MPV), human enterovirus (HEV), human bocavirus (HBoV), human rhinovirus (HRV), coronavirus, and adenovirus, which cause respiratory disease, and norovirus, rotavirus, adenovirus, astrovirus, and sapovirus, which cause gastrointestinal disease. As another example, the viruses may include human papillomavirus (HPV), Middle East respiratory syndrome-related coronavirus (MERS-CoV), dengue virus, herpes simplex virus (HSV), human herpes virus (HHV), Epstein-Barr virus (EBV), varicella zoster virus (VZV), cytomegalovirus (CMV), HIV, parvovirus B19, parechovirus, mumps virus, dengue virus, chikungunya virus, Zika virus, West Nile virus, hepatitis virus, and poliovirus. The target analyte may be a GBS serotype, a bacterial colony, or v600e. The target analyte in the present disclosure may include not only the viruses described above but also various other analytes such as bacteria, and may also be a specific region of a gene that is cleaved using CRISPR technology; the scope of the target analyte is not limited to the examples given above.

The "biological classification system" in this disclosure is a classification system for delimiting the group to which an organism belongs. For example, a classification system expressed in terms of species, genus, family, order, class, phylum, kingdom, and domain may fall within the scope of the "biological classification system" in this disclosure. As one example, the biological classification system according to an embodiment of the present disclosure may have a hierarchical structure, that is, a biological lineage structure in which an upper level encompasses the lower levels.

Here, FIG. 1 is merely exemplary, and the spirit of the present invention is not to be construed as limited to what is illustrated in FIG. 1. For example, the database from which the subsequent nucleic acid sequence storage device (100) obtains subsequent nucleic acid sequences may be an external database other than the illustrated public database (1).

Referring to FIG. 1, the subsequent nucleic acid sequence storage device (100) may be connected to a database via a network. The network may refer to a wireless or wired network that carries data between the subsequent nucleic acid sequence storage device (100) and the database. The wireless network may include, for example, at least one of LTE (Long-Term Evolution), LTE-A (LTE-Advanced), CDMA (code division multiple access), WCDMA (wideband CDMA), UMTS (Universal Mobile Telecommunications System), WiBro (Wireless Broadband), Wi-Fi (wireless fidelity), Bluetooth, NFC (near field communication), and GNSS (global navigation satellite system). The wired network may include, for example, at least one of USB (Universal Serial Bus), HDMI (High-Definition Multimedia Interface), RS-232 (Recommended Standard 232), LAN (local area network), WAN (wide area network), the Internet, and a telephone network, but is not limited thereto.

The public database (1) that provides nucleic acid sequences to the subsequent nucleic acid sequence storage device (100) may be a database that stores and manages biological information and provides nucleic acid sequences. In one embodiment, the public database (1) may be a database of the National Center for Biotechnology Information (NCBI) of the United States or of the Global Initiative on Sharing All Influenza Data (GISAID), but is not limited thereto.

The subsequent nucleic acid sequence storage device (100) according to one embodiment can store a reference nucleic acid sequence and subsequent nucleic acid sequences obtained from the public database (1). The reference nucleic acid sequence may be the nucleic acid sequence that serves as the standard for the nucleic acid sequences of a specific pathogen. In one embodiment, the reference nucleic acid sequence may be the first nucleic acid sequence obtained among the nucleic acid sequences acquired for that pathogen. In another embodiment, the reference nucleic acid sequence may be a nucleic acid sequence other than the one obtained first.

The reference nucleic acid sequence may be a nucleic acid sequence of an animal, plant, fungus, protozoan, bacterium, or virus.

In one embodiment, the reference nucleic acid sequence may be a nucleic acid sequence of a respiratory virus selected from the group consisting of influenza virus, respiratory syncytial virus (RSV), adenovirus, enterovirus, parainfluenza virus, metapneumovirus (MPV), bocavirus, rhinovirus, and coronavirus, but is not limited thereto.

A subsequent nucleic acid sequence may refer to any nucleic acid sequence, other than the reference nucleic acid sequence, among the nucleic acid sequences stored in the subsequent nucleic acid sequence storage device (100) for a specific pathogen. In one embodiment, every nucleic acid sequence acquired by the subsequent nucleic acid sequence storage device (100) other than the reference nucleic acid sequence may be referred to as a subsequent nucleic acid sequence.

The subsequent nucleic acid sequence storage device (100) can store subsequent nucleic acid sequences after storing the reference nucleic acid sequence. There is at least one subsequent nucleic acid sequence, and there may be a plurality of them. The number of subsequent nucleic acid sequences is not limited.

The subsequent nucleic acid sequence storage device (100) can store the reference nucleic acid sequence and the subsequent nucleic acid sequences using its own database (2), which is capable of storing and managing nucleic acid sequences. This database (2) may be provided in the subsequent nucleic acid sequence storage device (100) itself or in a separate device. In one embodiment, the database (2) may be implemented in the cloud. The database (2) is not limited to these forms and may be implemented in various ways.

The subsequent nucleic acid sequence storage device (100) can fragment the reference nucleic acid sequence and store it as a set of fragments. The reference nucleic acid sequence may comprise consecutive fragments of a predetermined length. Here, the predetermined length may be a pre-specified number of bases and may be the same for every fragment.

The subsequent nucleic acid sequence storage device (100) can assign a position identifier (PI) to each fragment of the reference nucleic acid sequence and a sequence identifier (SI) to the nucleic acid sequence contained in each fragment. Here, the PI is an identifier that identifies the fragment, and the SI is an identifier that identifies the nucleic acid sequence contained in the fragment; details are given later.

The reference nucleic acid sequence can be divided into consecutive fragments of a predetermined length. The subsequent nucleic acid sequence storage device (100) can divide the reference nucleic acid sequence into consecutive fragments of the predetermined length, assign a PI to each of the fragments, and assign an SI to the nucleic acid sequence contained in each of the fragments.

The subsequent nucleic acid sequence storage device (100) can store the PIs and SIs of the fragments of a subsequent nucleic acid sequence in the nucleic acid sequence database (2).

By storing subsequent nucleic acid sequences in this manner, the subsequent nucleic acid sequence storage device (100) according to one embodiment can use the storage space of the database efficiently.

Hereinafter, the operation of the subsequent nucleic acid sequence storage device (100) is described with reference to FIGS. 2 to 13. Each step is performed by a computing device. In one embodiment, the computing device may be the subsequent nucleic acid sequence storage device (100).

In step S100 of FIG. 2, a PI may be assigned to each fragment of the reference nucleic acid sequence, and an SI may be assigned to the nucleic acid sequence contained in each fragment. Here, the PI is an identifier that identifies the fragment, and the SI is an identifier that identifies the nucleic acid sequence contained in the fragment; details are given later.

Thereafter, in step S200, the subsequent nucleic acid sequence may be aligned to the reference nucleic acid sequence. Here, alignment refers to sequence alignment: the operation of comparing sequences to find mutually similar regions in order to reveal functional and structural relationships between DNA, RNA, or protein sequences. A well-known sequence may be aligned with an unknown nucleic acid sequence, or two unknown nucleic acid sequences may be aligned with each other; the well-known sequence is the reference nucleic acid sequence described above, also called the reference sequence. The alignment used in this specification may be performed with a conventionally known algorithm as a global alignment, a local alignment, a pairwise sequence alignment, or a multiple sequence alignment. The alignment method is not limited to these and may be any method capable of aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence.
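As a rough illustration of the global alignment option mentioned above, the following sketch implements a basic Needleman-Wunsch pairwise alignment. The function name `global_align`, the scoring values (match +1, mismatch -1, gap -2), and the toy sequences are assumptions made for illustration; the disclosure does not prescribe a particular algorithm or scoring scheme.

```python
def global_align(ref: str, sub: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment.

    Returns (aligned_ref, aligned_sub), where '-' marks a gap.
    The scoring parameters are illustrative assumptions.
    """
    n, m = len(ref), len(sub)
    # score[i][j] is the best score for aligning ref[:i] with sub[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if ref[i - 1] == sub[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in the subsequent sequence
                score[i][j - 1] + gap,       # gap in the reference sequence
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_sub = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and ref[i - 1] == sub[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned_ref.append(ref[i - 1])
            aligned_sub.append(sub[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_sub.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_sub.append(sub[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_sub))


# Toy example: the subsequent sequence lacks one base of the reference.
aligned_reference, aligned_subsequent = global_align("ACGTACGT", "ACGACGT")
```

The two returned strings have equal length, so each column pairs a reference base with either a subsequent base or a gap. In practice an established alignment tool would likely be used instead of a hand-written routine.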

In step S300, a PI may be assigned to each fragment of the subsequent nucleic acid sequence that corresponds to a fragment of the reference nucleic acid sequence to which a PI (position identifier) has been assigned, and an SI may be assigned to the nucleic acid sequence contained in each of those fragments. The PI assigned to each fragment of the subsequent nucleic acid sequence is the identifier corresponding to the PI of the reference nucleic acid sequence. The SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier indicating whether that fragment is identical to or different from the fragment of the reference nucleic acid sequence that has the same PI.

As used herein, the term "sequence identifier" refers to an identifier that represents a specific nucleic acid sequence (e.g., a DNA sequence or an RNA sequence). For example, a sequence identifier may include an annotated sequence name (e.g., a gene name). Examples of annotated sequence names include sequence names annotated in publicly accessible sequence databases (e.g., GenBank, EMBL, DDBJ, and GSD). As another example, a sequence identifier may include an arbitrarily assigned sequence identifier, such as the sequence identifiers (IDs) assigned to all sequence fragments generated by fragmenting a whole genome sequence. For example, the whole genome sequence of Chlamydia trachomatis may be fragmented by a sequence fragmentation algorithm that cuts the genome sequence, with fragments further cut and merged if necessary, and every sequence fragment thus generated may be assigned a sequence identifier. If the sequence is divided into 100 fragments, for instance, the fragments may be assigned SEQID1 to SEQID100. The examples in this specification assume that when a sequence is divided into 100 fragments that all differ from one another, the fragments are assigned the Arabic numerals 1 to 100 as sequence identifiers.

When storing the reference nucleic acid sequence and a subsequent nucleic acid sequence, the subsequent nucleic acid sequence storage device (100) may use a data structure that includes a PI for each fragment and, for each PI, an SI representing a unique nucleic acid sequence. By assigning PIs and SIs to an acquired subsequent nucleic acid sequence, the subsequent nucleic acid sequence storage device (100) can store that sequence in a space-efficient manner. The storage method using this data structure is described later.
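One way to picture the PI/SI data structure described above is as a list of small records, one per fragment. The sketch below is only an assumption about how such a structure might look; the class names `FragmentRecord` and `StoredSequence`, the `si_at` helper, and the convention that SIs are numbered per PI starting at 1 are all hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FragmentRecord:
    pi: str  # position identifier of the fragment, e.g. "P3"
    si: int  # sequence identifier of the bases found at that position


@dataclass
class StoredSequence:
    name: str
    fragments: list[FragmentRecord] = field(default_factory=list)

    def si_at(self, pi: str) -> int:
        """Return the SI recorded for the fragment with the given PI."""
        for record in self.fragments:
            if record.pi == pi:
                return record.si
        raise KeyError(pi)


# Hypothetical example: five fragments, with the subsequent sequence
# differing from the reference only at P3.
reference = StoredSequence(
    "reference", [FragmentRecord(f"P{k}", 1) for k in range(1, 6)]
)
subsequent = StoredSequence(
    "subsequent",
    [FragmentRecord(f"P{k}", 2 if k == 3 else 1) for k in range(1, 6)],
)
```

Under this layout, a stored sequence holds only short identifier pairs; the bases themselves are kept once per distinct (PI, SI) combination elsewhere.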

In step S400, a nucleic acid sequence database (2) containing the PIs and SIs of the fragments of the subsequent nucleic acid sequence can be obtained. Here, the nucleic acid sequence database (2) is a database that stores and manages the nucleic acid sequences stored by the subsequent nucleic acid sequence storage device (100), and it can provide the stored subsequent nucleic acid sequences to the subsequent nucleic acid sequence storage device (100).
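The storing step can be sketched with an in-memory SQLite database. This is a minimal illustration under stated assumptions: the table name `fragment_ref`, its columns, and the helper functions `make_db`, `store_sequence`, and `load_sequence` are hypothetical, and the disclosure does not specify a database engine or schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one row per (sequence, PI)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fragment_ref ("
        " sequence_name TEXT NOT NULL,"
        " pi TEXT NOT NULL,"
        " si INTEGER NOT NULL,"
        " PRIMARY KEY (sequence_name, pi))"
    )
    return conn


def store_sequence(conn: sqlite3.Connection, name: str, pi_si_pairs) -> None:
    """Store the (PI, SI) pairs of one sequence instead of its bases."""
    conn.executemany(
        "INSERT INTO fragment_ref (sequence_name, pi, si) VALUES (?, ?, ?)",
        [(name, pi, si) for pi, si in pi_si_pairs],
    )
    conn.commit()


def load_sequence(conn: sqlite3.Connection, name: str):
    """Return the stored (PI, SI) pairs ordered by fragment number."""
    cursor = conn.execute(
        "SELECT pi, si FROM fragment_ref WHERE sequence_name = ?"
        " ORDER BY CAST(SUBSTR(pi, 2) AS INTEGER)",
        (name,),
    )
    return cursor.fetchall()


conn = make_db()
store_sequence(conn, "subsequent_1", [("P1", 1), ("P2", 1), ("P3", 2)])
stored = load_sequence(conn, "subsequent_1")
```

Ordering by the numeric part of the PI keeps "P10" after "P9", which a plain text sort would not.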

The operation is described in more detail below with reference to FIGS. 3 to 10.

As shown in FIG. 3, the reference nucleic acid sequence and the subsequent nucleic acid sequence may have the same length. The reference nucleic acid sequence may span positions 0 to 499, and the subsequent nucleic acid sequence may likewise span positions 0 to 499. In this case, both the reference nucleic acid sequence and the subsequent nucleic acid sequence may be full sequences.

FIG. 4 is a diagram showing an example of dividing a reference nucleic acid sequence and a subsequent nucleic acid sequence into consecutive fragments of a predetermined length and assigning a PI to each of the fragments.

The subsequent nucleic acid sequence storage device (100) can divide the reference nucleic acid sequence into consecutive fragments of a predetermined length. For example, when the reference nucleic acid sequence is 30,000 bases long in total, the subsequent nucleic acid sequence storage device (100) can divide it into consecutive fragments of 100 bases each. In this case the reference nucleic acid sequence is divided into 300 fragments, and the fragments can be assigned PIs from P1 to P300.
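The arithmetic in this example can be checked with a short sketch, reading the example as a 30,000-base reference cut into 100-base fragments. The function names `fragment_count` and `pi_labels` are hypothetical, and rounding up so that a shorter final fragment still receives a PI is an assumption; the source only covers the case where the length divides evenly.

```python
def fragment_count(sequence_length: int, fragment_length: int) -> int:
    """Number of consecutive fragments needed to cover the sequence."""
    if sequence_length < 0 or fragment_length <= 0:
        raise ValueError("lengths must be non-negative and the fragment length positive")
    # Ceiling division: a trailing partial fragment counts as one fragment (assumption).
    return -(-sequence_length // fragment_length)


def pi_labels(count: int) -> list[str]:
    """PI labels in the P1, P2, ... style used in the description."""
    return [f"P{k}" for k in range(1, count + 1)]


count = fragment_count(30_000, 100)  # 300 fragments
labels = pi_labels(count)            # "P1" through "P300"
```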

The length of the fragments into which the reference nucleic acid sequence is divided may be 10 bp to 100 bp, 10 bp to 500 bp, 10 bp to 1,000 bp, 10 bp to 10,000 bp, 50 bp to 100 bp, 50 bp to 500 bp, 50 bp to 1,000 bp, or 50 bp to 10,000 bp, but is not limited thereto.

The predetermined length into which the reference nucleic acid sequence is fragmented may differ according to the level of the biological classification system to which the reference nucleic acid sequence belongs. In one embodiment, the predetermined length may be the same or different for each species to which the reference nucleic acid sequence belongs. In FIG. 4, the subsequent nucleic acid sequence storage device (100) divides the reference nucleic acid sequence, spanning positions 0 to 499, into consecutive fragments of 100 bases each and assigns P1, P2, P3, P4, and P5 as PIs to the resulting fragments. As shown in FIG. 4, fragment P1 of the reference nucleic acid sequence may cover positions 0 to 99, fragment P2 positions 100 to 199, fragment P3 positions 200 to 299, fragment P4 positions 300 to 399, and fragment P5 positions 400 to 499.
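The FIG. 4 layout described above can be reproduced with a small splitting routine. This is a sketch only: the function name `split_into_fragments`, the returned tuple layout, and the toy 500-base sequence are assumptions for illustration.

```python
def split_into_fragments(sequence: str, fragment_length: int) -> dict[str, tuple[int, int, str]]:
    """Map each PI to (first position, last position, bases), using 0-based positions."""
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    fragments = {}
    for k, start in enumerate(range(0, len(sequence), fragment_length), start=1):
        bases = sequence[start:start + fragment_length]
        fragments[f"P{k}"] = (start, start + len(bases) - 1, bases)
    return fragments


# Toy reference of 500 bases, split into 100-base fragments as in the FIG. 4 example.
toy_reference = "ACGT" * 125
reference_fragments = split_into_fragments(toy_reference, 100)
for pi, (first, last, _bases) in reference_fragments.items():
    print(pi, first, last)
```

With these inputs the routine yields five fragments, P1 covering positions 0 to 99 through P5 covering positions 400 to 499, matching the ranges given in the text.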

Likewise, the subsequent nucleic acid sequence storage device (100) can divide the subsequent nucleic acid sequence into consecutive fragments of a predetermined length.

In one embodiment, the predetermined length into which the subsequent nucleic acid sequence is fragmented is the same as the predetermined length into which the reference nucleic acid sequence is fragmented. The subsequent nucleic acid sequence storage device (100) divides the subsequent nucleic acid sequence, spanning positions 0 to 499, into consecutive fragments of 100 bases each and assigns P1, P2, P3, P4, and P5 as PIs to the resulting fragments, as shown in FIG. 4.

Similarly, as shown in FIG. 4, fragment P1 of the subsequent nucleic acid sequence may cover positions 0 to 99, fragment P2 positions 100 to 199, fragment P3 positions 200 to 299, fragment P4 positions 300 to 399, and fragment P5 positions 400 to 499.

The subsequent nucleic acid sequence storage device (100) can align the subsequent nucleic acid sequence so that it corresponds to the fragments of the reference nucleic acid sequence while reflecting any region of the subsequent nucleic acid sequence in which an insertion mutation or a deletion mutation has occurred. Hereinafter, a method of assigning PIs when a mutation has occurred in the subsequent nucleic acid sequence is described with reference to FIGS. 5 to 8.

When the subsequent nucleic acid sequence is aligned to the reference nucleic acid sequence, the alignment can take into account the region of the subsequent nucleic acid sequence in which an insertion or deletion mutation has occurred, so that the subsequent nucleic acid sequence corresponds to the fragments of the reference nucleic acid sequence.

If the subsequent nucleic acid sequence contains an insertion or deletion mutation, the fragment of the subsequent nucleic acid sequence at the PI containing that mutation may differ in length from the corresponding fragment of the reference nucleic acid sequence.

FIG. 5 is a diagram showing an example of aligning a subsequent nucleic acid sequence to a reference nucleic acid sequence while taking into account a deletion mutation in the subsequent nucleic acid sequence.

When a deletion mutation has occurred in the subsequent nucleic acid sequence, aligning it to the reference nucleic acid sequence can place it in correspondence with the reference nucleic acid sequence with the deleted region reflected, as shown in FIG. 5.

As shown in FIG. 6, the subsequent nucleic acid sequence storage device (100) can fragment the subsequent nucleic acid sequence so that it corresponds to the consecutively fragmented reference nucleic acid sequence. In this case, fragments of the subsequent nucleic acid sequence that contain no deletion mutation can be cut to the same predetermined length as the reference nucleic acid sequence. Fragments that contain a region with a deletion mutation are cut to a length different from the predetermined length of the reference. In one embodiment, when a deletion mutation has occurred in the subsequent nucleic acid sequence, the fragment containing the deletion, and hence the subsequent nucleic acid sequence as a whole, may be shorter than its counterpart in the reference nucleic acid sequence.

In FIG. 6, a deletion mutation has occurred across fragments P3 and P4 of the subsequent nucleic acid sequence, and the subsequent nucleic acid sequence storage device (100) has aligned the subsequent nucleic acid sequence to the reference nucleic acid sequence with the deleted region between fragments P3 and P4 reflected.

When an insertion mutation has occurred in the subsequent nucleic acid sequence, aligning it to the reference nucleic acid sequence can place it in correspondence with the reference nucleic acid sequence with the inserted region reflected, as shown in FIG. 7.

As shown in FIG. 8, the subsequent nucleic acid sequence storage device (100) can fragment the subsequent nucleic acid sequence so that it corresponds to the consecutively fragmented reference nucleic acid sequence. In this case, fragments of the subsequent nucleic acid sequence that contain no insertion mutation can be cut to the same predetermined length as the reference nucleic acid sequence. Fragments that contain a region with an insertion mutation are cut to a length different from the predetermined length of the reference. In one embodiment, when an insertion mutation has occurred in the subsequent nucleic acid sequence, the fragment containing the insertion, and hence the subsequent nucleic acid sequence as a whole, may be longer than its counterpart in the reference nucleic acid sequence.

In FIG. 8, an insertion mutation has occurred across fragments P1 and P2 of the subsequent nucleic acid sequence, and the subsequent nucleic acid sequence storage device (100) has aligned the subsequent nucleic acid sequence to the reference nucleic acid sequence with the inserted region between fragments P1 and P2 reflected.
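The deletion and insertion cases above can both be handled by cutting the aligned subsequent sequence at the reference fragment boundaries. The sketch below assumes the alignment is given as two equal-length strings with '-' for gaps; the function name `fragment_aligned` is hypothetical, and attaching inserted bases to the fragment of the preceding reference base is an assumption, since the source does not state which fragment receives an insertion that falls on a boundary.

```python
def fragment_aligned(aligned_ref: str, aligned_sub: str, fragment_length: int) -> dict[str, str]:
    """Cut an aligned subsequent sequence at the reference fragment boundaries.

    Returns a mapping from PI to the (ungapped) subsequent bases at that PI.
    """
    if len(aligned_ref) != len(aligned_sub):
        raise ValueError("aligned sequences must have the same length")
    buckets: dict[str, list[str]] = {}
    ref_pos = 0  # number of reference bases consumed so far
    for ref_base, sub_base in zip(aligned_ref, aligned_sub):
        # A gap in the reference marks an insertion; keep the inserted base
        # with the fragment of the preceding reference base (assumption).
        index = ref_pos if ref_base != "-" else max(ref_pos - 1, 0)
        bucket = buckets.setdefault(f"P{index // fragment_length + 1}", [])
        if sub_base != "-":  # a gap in the subsequent sequence is a deletion
            bucket.append(sub_base)
        if ref_base != "-":
            ref_pos += 1
    return {pi: "".join(bases) for pi, bases in buckets.items()}


# Deletion spanning two fragments (toy data, fragment length 4).
deletion_case = fragment_aligned("AAAACCCCGGGG", "AAAACC----GG", 4)
# Insertion at the boundary between the first two fragments.
insertion_case = fragment_aligned("AAAA--CCCCGGGG", "AAAATTCCCCGGGG", 4)
```

In the deletion case the fragments touched by the deletion come out shorter than the fragment length, and in the insertion case the fragment receiving the inserted bases comes out longer, while untouched fragments keep the predetermined length.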

As shown in FIG. 9, the fragments can be stored in a reference database that holds, for each fragment, its PI, an SI that identifies each unique nucleic acid sequence under that PI, and the nucleic acid sequence corresponding to the SI.

Here, the reference database is a database for looking up, by SI, the nucleic acid sequence of the corresponding fragment. It may have the same configuration as the nucleic acid sequence database (2) described above or a different one, and the form in which the reference database is implemented is not limited.

The subsequent nucleic acid sequence storage device (100) can store nucleic acid sequences in the reference database as shown in FIG. 9. It can assign the same identifier to a fragment of the subsequent nucleic acid sequence that is identical to the corresponding fragment of the reference nucleic acid sequence, and a different identifier to a fragment that differs from it.

That is, the SI for each fragment of the subsequent nucleic acid sequence can be assigned as follows: if the fragment's nucleic acid sequence is identical to that of the corresponding reference fragment, the SI already assigned to the reference fragment is reused; if it differs, a new identifier is assigned.

When the reference nucleic acid sequence is first stored and no nucleic acid sequence has been stored before, a unique identifier may be assigned to every fragment of the reference nucleic acid sequence. In one embodiment, the subsequent nucleic acid sequence storage device (100) stores nucleic acid sequences per PI in the reference database, assigning the same SI to identical nucleic acid sequences and different SIs to different nucleic acid sequences.

The SI assigned to each fragment of the reference nucleic acid sequence and the SI assigned to each fragment of the subsequent nucleic acid sequence may be stored in the reference database together with the nucleic acid sequences they match. The nucleic acid sequences matched to the SIs of the reference nucleic acid sequence and of the subsequent nucleic acid sequence may differ from each other.

In addition, the SI assigned to each fragment of an additional subsequent nucleic acid sequence, obtained after the subsequent nucleic acid sequence, is assigned by considering both the previously stored subsequent nucleic acid sequence and the reference nucleic acid sequence.

For example, if a fragment of the additional subsequent nucleic acid sequence is identical to a fragment already stored for the subsequent nucleic acid sequence or the reference nucleic acid sequence, the subsequent nucleic acid sequence storage device (100) can apply the SI already assigned to that fragment; if it differs from all of them, the device can assign a new identifier.
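The assignment rule above (reuse an SI when the same sub-sequence is already stored under the same PI, otherwise issue a new one) can be sketched as follows. This is an illustrative sketch only: the class `ReferenceDatabase`, the helper `store_sequence`, the choice of consecutive integers starting at 1 for each PI, and the four-base toy fragments are assumptions, not details taken from the specification.

```python
from collections import defaultdict


class ReferenceDatabase:
    """Toy reference database: for each PI, maps a sub-sequence to its SI."""

    def __init__(self) -> None:
        # pi -> {sub-sequence: SI}; SIs are numbered per PI, starting at 1.
        self._si_by_subseq: dict[int, dict[str, int]] = defaultdict(dict)

    def assign_si(self, pi: int, subseq: str) -> int:
        """Return the existing SI for subseq under pi, or issue a new one."""
        known = self._si_by_subseq[pi]
        if subseq not in known:
            known[subseq] = len(known) + 1  # new identifier, sequence stored once
        return known[subseq]

    def lookup(self, pi: int, si: int) -> str:
        """Return the sub-sequence stored under (pi, si)."""
        for subseq, known_si in self._si_by_subseq[pi].items():
            if known_si == si:
                return subseq
        raise KeyError((pi, si))


def store_sequence(db: ReferenceDatabase, fragments: list[str]) -> list[int]:
    """Store fragments P1..Pn and return the SI list kept for the sequence."""
    return [db.assign_si(pi, frag) for pi, frag in enumerate(fragments, start=1)]


# Hypothetical three-fragment sequences (not the sequences of FIG. 9 or FIG. 10).
db = ReferenceDatabase()
ref_sis = store_sequence(db, ["ATCG", "AAAA", "AGAT"])   # first sequence stored
seq1_sis = store_sequence(db, ["ATCG", "AGTT", "AGAT"])  # differs at P2
seq2_sis = store_sequence(db, ["ATCG", "AGTT", "AGCG"])  # reuses seq1's P2
```

Because the lookup considers everything already stored under a PI, a later sequence can reuse an SI first issued for an earlier subsequent sequence rather than for the reference, as `seq2_sis` does at P2.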

The reference database may store each SI that receives a new identifier together with the nucleic acid sequence corresponding to that SI. The subsequent nucleic acid sequence is divided into fragments of a predetermined length, and the nucleic acid sequence contained in each fragment is called a sub-sequence.

As in the example of FIG. 9, assume that fragmenting the reference nucleic acid sequence yields fragments P1 to P300, where the nucleic acid sequence of P1 is "ATCGCGCATGC...", that of P2 is "AAAAGATGCTCA...", that of P3 is "AGATGGCAACGG...", and that of P300 is "CTAATTCAGGGTC...".

When the subsequent nucleic acid sequence storage device (100) obtains the reference nucleic acid sequence, no identical nucleic acid sequence exists among the previously stored sequences, so it can assign a unique SI to the nucleic acid sequence of every fragment of the reference nucleic acid sequence. Because SIs are assigned per PI, every fragment of the reference nucleic acid sequence may be assigned an SI of "1".

The nucleic acid sequence "ATCGCGCATGC..." of P1 of the reference nucleic acid sequence may be assigned an SI of 1, and the nucleic acid sequence "AAAAGATGCTCA..." of P2 may likewise be assigned an SI of 1. That is, the storage device (100) stores "ATCGCGCATGC..." for P1 with SI 1 and stores "AAAAGATGCTCA..." for P2 with SI 1.

Here, the SI of P1 of the reference nucleic acid sequence is "1", and the SI of P2 is also "1". In one embodiment, the SIs assigned to the fragments of the reference nucleic acid sequence are identifiers of the same type; for example, the SIs assigned to all fragments may be Arabic numerals. Although the same SI value is shown for each fragment here, a different SI value may be assigned to each fragment depending on the implementation.

In another embodiment, the SIs assigned to the fragments of the reference nucleic acid sequence may be identifiers of different types. For example, the SI of P1 of the reference nucleic acid sequence may be the Arabic numeral "1" while the SI of P2 is the letter "a". This is merely one implementation example, and SIs may be assigned in various ways.

As shown in FIG. 12, the reference nucleic acid sequence can be stored in the nucleic acid sequence database (2) as {1, 1, 1, 1, 1, ..., 1} for fragments P1, P2, P3, P4, P5, ..., P300.

With a conventional storage method, the reference nucleic acid sequence would instead be stored as "ATCGCGCATGC AAAAGATGCTCAAGATGGCAACGG ... CTAATTCAGGGTC...".

In this way, the subsequent nucleic acid sequence storage device (100) according to one embodiment can store nucleic acid sequences in less storage space than the conventional method.

A method for storing a subsequent nucleic acid sequence (seq.1) is described below with reference to FIG. 10.

The PI assigned to each fragment of the subsequent nucleic acid sequence is the identifier corresponding to the PI of the reference nucleic acid sequence. The SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier that represents whether that fragment is identical to or different from the reference fragment with the same PI.

For example, assume that the fragments of the subsequent nucleic acid sequence (seq.1) are assigned PIs corresponding to the reference nucleic acid sequence, giving fragments P1 to P300, and that the nucleic acid sequence of P1 of seq.1 is "ATCGCGCATGC...", that of P2 is "AGTTGCAATCAC...", that of P3 is "AGATGGCAACGG...", and that of P300 is "CTAATTCAGGGTC...".

In this case, to store the nucleic acid sequence "ATCGCGCATGC..." of P1 of the subsequent nucleic acid sequence (seq.1), the subsequent nucleic acid sequence storage device (100) can search the nucleic acid sequences already stored for P1. In the reference database of FIG. 10, "ATCGCGCATGC..." is already stored under P1. The device therefore reuses the SI of the stored sequence and does not store "ATCGCGCATGC..." again.

To store the nucleic acid sequence "AGTTGCAATCAC..." of P2 of the subsequent nucleic acid sequence (seq.1), the subsequent nucleic acid sequence storage device (100) can search the nucleic acid sequences already stored for P2. Because no sequence identical to "AGTTGCAATCAC..." is stored under P2 of the reference database of FIG. 10, "AGTTGCAATCAC..." is assigned an SI other than the existing SI "1". In FIG. 10, the SI of the P2 sequence "AGTTGCAATCAC..." is "2".

To store the nucleic acid sequence "AGATGGCAACGG..." of P3 and the nucleic acid sequence "CTAATTCAGGGTC..." of P300 of the subsequent nucleic acid sequence (seq.1), the subsequent nucleic acid sequence storage device (100) can search the data structure for "AGATGGCAACGG..." among the sequences stored for the P3 fragment and for "CTAATTCAGGGTC..." among the sequences stored for the P300 fragment. In the reference database of FIG. 10, "AGATGGCAACGG..." is already stored under P3 and "CTAATTCAGGGTC..." is already stored under P300.

Because the subsequent nucleic acid sequence storage device (100) reuses the SI of "AGATGGCAACGG..." already stored under P3, it does not store "AGATGGCAACGG..." again. Likewise, because it reuses the SI of "CTAATTCAGGGTC..." already stored under P300, it does not store "CTAATTCAGGGTC..." again.

Referring to FIG. 12, the subsequent nucleic acid sequence (seq.1) can be stored in the nucleic acid sequence database (2) as {1, 2, 4, 1, 5, ..., 4} for fragments P1, P2, P3, ..., P300.

Referring again to FIG. 10, a method for storing an additional subsequent nucleic acid sequence (seq.2) is described.

In one embodiment, when storing an additional subsequent nucleic acid sequence, the subsequent nucleic acid sequence storage device (100) can assign SIs by comparing the additional subsequent nucleic acid sequence with the reference nucleic acid sequence and with the subsequent nucleic acid sequences stored before it.

The subsequent nucleic acid sequence storage device (100) can obtain an additional subsequent nucleic acid sequence after the subsequent nucleic acid sequence, and can assign an SI to each unique fragment among the fragments of the additional subsequent nucleic acid sequence. If a fragment of the additional subsequent nucleic acid sequence is identical to a fragment of the previously stored subsequent nucleic acid sequence or the reference nucleic acid sequence, the device assigns the same identifier; if it differs from them, the device assigns a different identifier.

That is, the SI assigned to each fragment of the additional subsequent nucleic acid sequence (seq.2) may be an identifier representing whether that fragment is identical to or different from the fragment with the same PI in the reference nucleic acid sequence (ref) and in the subsequent nucleic acid sequence (seq.1) stored before seq.2.

For example, assume that the fragments of the additional subsequent nucleic acid sequence (seq.2) are assigned PIs corresponding to the reference nucleic acid sequence, giving fragments P1 to P300, and that the nucleic acid sequence of P1 of seq.2 is "ATCGCGCATGC...", that of P2 is "AGTTGCAATCAC...", that of P3 is "AGCGCGGATGC...", and that of P300 is "CTAATTCAGGGTC...".

In this case, to store the nucleic acid sequence "ATCGCGCATGC..." of P1 of the additional subsequent nucleic acid sequence (seq.2), the subsequent nucleic acid sequence storage device (100) can search the nucleic acid sequences already stored for P1. Because "ATCGCGCATGC..." is already stored under P1 of the reference database of FIG. 10, the device reuses the SI of the stored sequence and does not store "ATCGCGCATGC..." again.

To store the nucleic acid sequence "AGTTGCAATCAC..." of P2 of the additional subsequent nucleic acid sequence (seq.2), the subsequent nucleic acid sequence storage device (100) can search the nucleic acid sequences already stored for P2. In the reference database of FIG. 10, a sequence identical to "AGTTGCAATCAC..." is already stored under P2 with SI "2". The device reuses that SI and does not store "AGTTGCAATCAC..." again.

To store the nucleic acid sequence "AGCGCGGATGC..." of P3 of the additional subsequent nucleic acid sequence (seq.2), the subsequent nucleic acid sequence storage device (100) can search the nucleic acid sequences already stored for P3. In the reference database of FIG. 10, a sequence identical to "AGCGCGGATGC..." is already stored under P3 with SI "2". The device reuses that SI and does not store "AGCGCGGATGC..." again.

Likewise, to store the nucleic acid sequence "CTAATTCAGGGTC..." of P300 of the additional subsequent nucleic acid sequence (seq.2), the subsequent nucleic acid sequence storage device (100) can search the data structure for "CTAATTCAGGGTC..." among the sequences stored for the P300 fragment. In the reference database of FIG. 10, "CTAATTCAGGGTC..." is already stored under P300.

Because the subsequent nucleic acid sequence storage device (100) reuses the SI of "CTAATTCAGGGTC..." already stored under P300, it does not store the sub-sequence "CTAATTCAGGGTC..." again.

An example of a reference database stored in this manner is described below.

A reference database built in this manner is later used to reconstruct an entire subsequent nucleic acid sequence from the sub-sequences corresponding to its SIs.

As shown in FIG. 12, the additional subsequent nucleic acid sequence (seq.2) can be stored in the nucleic acid sequence database (2) as {2, 2, 5, 1, 5, ..., 1} for fragments P1, P2, P3, ..., P300.

In this way, when a nucleic acid sequence duplicates one that is already stored, the subsequent nucleic acid sequence storage device (100) does not store the duplicated sequence again; in the nucleic acid sequence database (2) it stores only the SI for the PI of each fragment.

That is, whereas the conventional storage method stores the additional subsequent nucleic acid sequence (seq.2) in the nucleic acid sequence database (2) as "ATCGCGCATGC...AGTTGCAATCAC...AGCGCGGATGC...CTAATTCAGGGTC...", the subsequent nucleic acid sequence storage device (100) according to one embodiment stores only the SI information for seq.2 in the nucleic acid sequence database (2).

The reference database (1) can store data in which an identifier is assigned to each distinct nucleic acid sequence. Through the processes of FIG. 9 and FIG. 10, the nucleic acid sequence "ATCGCGCATGC..." may be assigned SI 1, "AAAAGATGCTCA..." SI 2, "AGATGGCAACGG..." SI 3, "CTAATTCAGGGTC..." SI 4, and "GCGTGAGCGCG..." SI 240. An SI that receives a new identifier, such as SI 240, can be stored in the reference database together with its matching nucleic acid sequence.

According to one embodiment, storing subsequent nucleic acid sequences by way of the reference database (1) has the advantage of using less database storage space than the conventional approach.

FIG. 13 conceptually illustrates the operation of obtaining a subsequent nucleic acid sequence from the nucleic acid sequence database (2), and FIG. 14 illustrates an example of designating the fragments that correspond to a query by using the PIs and SIs stored in the database.

In one embodiment, the subsequent nucleic acid sequence storage device (100) can extract the subsequent nucleic acid sequence corresponding to a generated query from the nucleic acid sequence database (2) and the reference database. The query generated to extract the subsequent nucleic acid sequence may include at least one of gene information and nucleic acid sequence interval information for the subsequent nucleic acid sequence.

The subsequent nucleic acid sequence storage device (100) can then reconstruct the stored nucleic acid sequence information into the nucleic acid sequence information that corresponds to the query.

That is, the subsequent nucleic acid sequence storage device (100) uses the SIs stored in the nucleic acid sequence database (2) to look up the reference database (1), which stores each SI with its sub-sequence, and uses the lookup results to reconstruct the nucleic acid sequence information corresponding to the query.
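The lookup-and-join step can be sketched as follows. The layout shown here, a dictionary keyed by `(PI, SI)` for the reference database and a dictionary of SI lists for the nucleic acid sequence database, is an assumption made for illustration, as are the function name `reconstruct` and the four-base toy fragments.

```python
def reconstruct(si_list: list[int], reference_db: dict[tuple[int, int], str]) -> str:
    """Rebuild a full sequence by joining the sub-sequence of each (PI, SI)."""
    return "".join(
        reference_db[(pi, si)] for pi, si in enumerate(si_list, start=1)
    )


# Hypothetical reference database: (PI, SI) -> sub-sequence.
reference_db = {
    (1, 1): "ATCG",
    (2, 1): "AAAA",
    (2, 2): "AGTT",
    (3, 1): "AGAT",
}

# Hypothetical nucleic acid sequence database: only SI lists are kept.
sequence_db = {
    "ref": [1, 1, 1],
    "seq.1": [1, 2, 1],
}

ref_full = reconstruct(sequence_db["ref"], reference_db)
seq1_full = reconstruct(sequence_db["seq.1"], reference_db)
```

The sequence database never holds bases in this sketch; every base is recovered from the reference database at read time.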

In one embodiment, when the query is gene information for a nucleic acid sequence, the subsequent nucleic acid sequence storage device (100) can designate the fragments of the subsequent nucleic acid sequence that correspond to the gene information of the query.

As shown in FIG. 15, when the gene corresponding to the gene information of the query spans nucleotide positions 150 to 297, the subsequent nucleic acid sequence storage device (100) can designate fragments P2 to P4 for every nucleic acid sequence stored in the nucleic acid sequence database (2).

Specifically, when the query is gene information for a nucleic acid sequence, the subsequent nucleic acid sequence storage device (100) can designate the PIs of the subsequent nucleic acid sequence that correspond to the gene information of the query, join in order the nucleic acid sequences of the SIs assigned under the designated PIs, and trim both ends of the joined nucleic acid sequence to the region corresponding to the gene information of the query.

The subsequent nucleic acid sequence storage device (100) can join the sub-sequences matched to the designated fragments into one continuous nucleic acid sequence. It can then trim both ends of the joined sequence to the region corresponding to the gene information of the query. That is, as post-processing, the device can cut away the nucleotides located before position 150 and those located after position 297.

Similarly, when the query is nucleic acid sequence interval information, the fragments of the subsequent nucleic acid sequence that correspond to the interval of the query can be designated.

Specifically, when the query is nucleic acid sequence interval information, the subsequent nucleic acid sequence storage device (100) can designate the PIs of the subsequent nucleic acid sequence that correspond to the interval of the query, join in order the nucleic acid sequences of the SIs assigned under the designated PIs, and trim both ends of the joined nucleic acid sequence to the region corresponding to the interval of the query.
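The designate, join, and trim steps for an interval query can be sketched as follows. This is a simplified illustration under stated assumptions: every stored fragment is assumed to have exactly `frag_len` bases (so fragments lengthened by an insertion are not handled), positions are 1-based and inclusive, and the names `pis_for_interval` and `query_interval`, the `(PI, SI)` dictionary layout, and the fragment lengths used in the example are hypothetical.

```python
def pis_for_interval(start: int, end: int, frag_len: int) -> tuple[int, int]:
    """Return the first and last PI that cover positions start..end (1-based)."""
    if frag_len <= 0 or not 1 <= start <= end:
        raise ValueError("need frag_len > 0 and 1 <= start <= end")
    return (start - 1) // frag_len + 1, (end - 1) // frag_len + 1


def query_interval(
    si_list: list[int],
    reference_db: dict[tuple[int, int], str],
    start: int,
    end: int,
    frag_len: int,
) -> str:
    """Return bases start..end of the sequence described by si_list."""
    first_pi, last_pi = pis_for_interval(start, end, frag_len)
    # Join the sub-sequences of the designated fragments in PI order.
    joined = "".join(
        reference_db[(pi, si_list[pi - 1])] for pi in range(first_pi, last_pi + 1)
    )
    # Trim both ends down to the queried region.
    offset = (first_pi - 1) * frag_len  # bases that precede the first fragment
    return joined[start - 1 - offset : end - offset]


# Hypothetical data: four 4-base fragments, P2 carrying a variant sub-sequence.
reference_db = {
    (1, 1): "ATCG",
    (2, 1): "AAAA",
    (2, 2): "AGTT",
    (3, 1): "AGAT",
    (4, 1): "CTAA",
}
seq1_sis = [1, 2, 1, 1]
region = query_interval(seq1_sis, reference_db, start=6, end=14, frag_len=4)
```

As a consistency check on the numbers used in the text, a fragment length of 75 (chosen here only because it fits, not because the specification states it) maps positions 150 to 297 onto P2 through P4.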

As shown in FIG. 14, the subsequent nucleic acid sequence storage device (100) can designate fragments P2 to P4, which correspond to nucleotide positions 150 to 297, among the nucleic acid sequences stored in the nucleic acid sequence database (2). The device can join the sub-sequences matched to the designated fragments into one continuous nucleic acid sequence. It can then trim both ends of the joined sequence to the region corresponding to the query. That is, as post-processing, the device can cut away the nucleotides located before position 150 and those located after position 297. According to one embodiment, even when the number of nucleic acid sequences of the same or similar species increases explosively, nucleic acid sequences that duplicate already stored sequences are not stored again, and the time and cost of extracting nucleic acid sequences from the database can be reduced.

FIG. 16 is a hardware configuration diagram of a nucleic acid sequence storage device according to one embodiment.

A component, module, or unit in the present disclosure includes a routine, procedure, program, component, reference database, or the like that performs a particular task or implements a particular abstract data type. Those skilled in the art will also appreciate that the methods presented in the present disclosure can be practiced with other computer system configurations, including single-processor or multiprocessor computing systems, minicomputers, and mainframe computers, as well as personal computers, handheld computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which may operate in connection with one or more associated devices.

The embodiments described in the present disclosure can also be practiced in distributed computing environments in which certain tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

Computing devices typically include a variety of computer-readable media. Any medium accessible by a computer can be a computer-readable medium, and such computer-readable media include volatile and nonvolatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer-readable media can include computer-readable storage media and computer-readable transmission media.

Computer-readable storage media include volatile and nonvolatile media, transitory and non-transitory media, and removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, reference databases, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be accessed by a computer and used to store the desired information.

Computer-readable transmission media typically embody computer-readable instructions, reference databases, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed so as to encode information in the signal. By way of example, and not limitation, computer-readable transmission media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also intended to be included within the scope of computer-readable transmission media.

An exemplary environment (2000) for implementing various aspects of the present invention is illustrated, including a computer (2002), which includes a processing unit (2004), a system memory (2006), and a system bus (2008). The term computer (2002) herein may be used interchangeably with computing device. The system bus (2008) connects system components, including but not limited to the system memory (2006), to the processing unit (2004). The processing unit (2004) may be any of a variety of commercially available processors. Dual-processor and other multiprocessor architectures may also be used as the processing unit (2004).

The system bus (2008) may be any of several types of bus structures that may further interconnect to a memory bus, a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. The system memory (2006) includes read-only memory (ROM) (2010) and random access memory (RAM) (2012). A basic input/output system (BIOS) is stored in non-volatile memory (2010), such as ROM, EPROM, or EEPROM, and contains the basic routines that help transfer information between components within the computer (2002), such as during start-up. The RAM (2012) may also include high-speed RAM, such as static RAM, for caching data.

The computer (2002) also includes an internal hard disk drive (HDD) (2014) (e.g., EIDE, SATA), a magnetic floppy disk drive (FDD) (2016) (e.g., for reading from or writing to a removable diskette (2018)), a solid state drive (SSD), and an optical disk drive (2020) (e.g., for reading a CD-ROM disk (2022), or for reading from or writing to other high-capacity optical media such as a DVD). The hard disk drive (2014), the magnetic disk drive (2016), and the optical disk drive (2020) may be connected to the system bus (2008) by a hard disk drive interface (2024), a magnetic disk drive interface (2026), and an optical drive interface (2028), respectively. The interface (2024) for external drive implementations includes, for example, at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

These drives and their associated computer-readable media provide nonvolatile storage of data, data structures, computer-executable instructions, and the like. For the computer (2002), the drives and media accommodate the storage of any data in a suitable digital format. Although the description of computer-readable storage media above refers to an HDD, a removable magnetic disk, and removable optical media such as a CD or DVD, those skilled in the art will appreciate that other types of computer-readable storage media, such as zip drives, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the exemplary operating environment, and that any such media may contain computer-executable instructions for performing the methods of the present invention.

A number of program modules, including an operating system (2030), one or more application programs (2032), other program modules (2034), and program data (2036), may be stored in the drives and the RAM (2012). All or portions of the operating system, applications, modules, and/or data may also be cached in the RAM (2012). It will be appreciated that the present invention may be implemented with various commercially available operating systems or combinations of operating systems.

A user may enter commands and information into the computer (2002) through one or more wired/wireless input devices, for example, a keyboard (2038) and a pointing device such as a mouse (2040). Other input devices (not shown) may include a microphone, an IR remote control, a joystick, a game pad, a stylus pen, a touch screen, and the like. These and other input devices are often connected to the processing unit (2004) through an input device interface (2042) that is connected to the system bus (2008), but may be connected by other interfaces such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, an IR interface, and the like.

A monitor (2044) or other type of display device is also connected to the system bus (2008) via an interface, such as a video adapter (2046). In addition to the monitor (2044), a computer typically includes other peripheral output devices (not shown), such as speakers and a printer.

The computer (2002) may operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) (2048), via wired and/or wireless communications. The remote computer(s) (2048) may be a workstation, a server computer, a router, a personal computer, a portable computer, a microprocessor-based entertainment device, a peer device, or other common network node, and typically include many or all of the components described for the computer (2002), although, for brevity, only a memory storage device (2050) is shown. The logical connections shown include wired/wireless connections to a local area network (LAN) (2052) and/or a larger network, for example, a wide area network (WAN) (2054). Such LAN and WAN networking environments are common in offices and companies and facilitate enterprise-wide computer networks, such as intranets, all of which may be connected to a worldwide computer network, for example, the Internet.

When used in a LAN networking environment, the computer (2002) is connected to the local network (2052) through a wired and/or wireless communication network interface or adapter (2056). The adapter (2056) may facilitate wired or wireless communication to the LAN (2052), which may also include a wireless access point installed therein for communicating with the wireless adapter (2056). When used in a WAN networking environment, the computer (2002) may include a modem (2058), may be connected to a communications server on the WAN (2054), or may have other means for establishing communications over the WAN (2054), such as via the Internet. The modem (2058), which may be internal or external and a wired or wireless device, is connected to the system bus (2008) via the serial port interface (2042). In a networked environment, the program modules described for the computer (2002), or portions thereof, may be stored in the remote memory/storage device (2050). It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computers may be used.

The computer (2002) is operable to communicate with any wireless device or entity that is deployed and operating in wireless communication, for example, a printer, a scanner, a desktop and/or portable computer, a portable data assistant (PDA), a communications satellite, any equipment or location associated with a wirelessly detectable tag, and a telephone. This includes at least Wi-Fi and Bluetooth wireless technologies. Accordingly, the communication may have a predefined structure, as in a conventional network, or may simply be an ad hoc communication between at least two devices.

Meanwhile, the methods according to the various embodiments described above can be implemented in the form of a computer program stored in a computer-readable recording medium and programmed to perform each step of the method, and can also be implemented in the form of a computer-readable recording medium storing a computer program programmed to perform each step of the method.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Therefore, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the following claims, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

Claims

A method for storing a subsequent nucleic acid sequence using a reference nucleic acid sequence comprising continuously fragmented fragments of a predetermined length, the method comprising:

dividing the reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence;

aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence;

assigning a PI to each fragment of the subsequent nucleic acid sequence corresponding to each of the fragments of the reference nucleic acid sequence to which the PIs are assigned, and assigning an SI to the nucleic acid sequence included in each of those fragments, wherein the PI assigned to each fragment of the subsequent nucleic acid sequence is an identifier corresponding to the PI of the reference nucleic acid sequence, and the SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier indicating whether the fragment of the reference nucleic acid sequence corresponding to that PI and the fragment of the subsequent nucleic acid sequence are the same or different; and

storing the PIs and SIs for the fragments of the subsequent nucleic acid sequence in a nucleic acid sequence database.

The method of claim 1, wherein the reference nucleic acid sequence is a nucleic acid sequence of an animal, a plant, a fungus, a protozoan, a bacterium, or a virus.

The method of claim 1, wherein the reference nucleic acid sequence is a nucleic acid sequence of a respiratory virus selected from the group consisting of an influenza virus nucleic acid sequence, a respiratory syncytial virus (RSV) nucleic acid sequence, an adenovirus nucleic acid sequence, an enterovirus nucleic acid sequence, a parainfluenza virus nucleic acid sequence, a metapneumovirus (MPV) nucleic acid sequence, a bocavirus nucleic acid sequence, a rhinovirus nucleic acid sequence, and/or a coronavirus nucleic acid sequence.

The method of claim 1, wherein the predetermined length into which the reference nucleic acid sequence is fragmented is the same or different depending on the species to which the reference nucleic acid sequence belongs.

The method of claim 1, wherein, for each fragment of the subsequent nucleic acid sequence, the SI previously assigned to the corresponding fragment of the reference nucleic acid sequence is assigned if the nucleic acid sequence of the fragment is identical to that of the fragment of the reference nucleic acid sequence, and a new identifier is assigned if it is different.

The method of claim 1, wherein the SI assigned to each fragment of the reference nucleic acid sequence and the SI assigned to each fragment of the subsequent nucleic acid sequence are each stored in a reference database in association with the matching nucleic acid sequence.

The method of claim 6, wherein the nucleic acid sequences matched to the respective SIs of the reference nucleic acid sequence and the subsequent nucleic acid sequence stored in the reference database are different from one another.

The method of claim 6, wherein the SI to which the new identifier is assigned is stored in the reference database together with the nucleic acid sequence matching that SI, and the reference database stores data in which identifiers are assigned to mutually different nucleic acid sequences.

The method of claim 1, wherein the nucleic acid sequence database includes the SI matched to each PI of the reference nucleic acid sequence and of the subsequent nucleic acid sequence.

The method of claim 1, wherein aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence comprises aligning the subsequent nucleic acid sequence so that it corresponds to the fragments of the reference nucleic acid sequence while reflecting any region of the subsequent nucleic acid sequence in which an insertion mutation or a deletion mutation has occurred.

The method of claim 10, wherein, when the subsequent nucleic acid sequence contains an insertion or deletion mutation, the fragment of the subsequent nucleic acid sequence corresponding to the PI that includes the insertion or deletion mutation differs in length from the corresponding fragment of the reference nucleic acid sequence.

The method of claim 1, further comprising:

aligning an additional subsequent nucleic acid sequence, obtained after the subsequent nucleic acid sequence, to the reference nucleic acid sequence; and

assigning a PI to each fragment of the additional subsequent nucleic acid sequence corresponding to each of the fragments of the reference nucleic acid sequence to which the PIs are assigned, and assigning an SI to the nucleic acid sequence included in each of the fragments of the additional subsequent nucleic acid sequence,

wherein the SI assigned to each fragment of the additional subsequent nucleic acid sequence is an identifier indicating, by comparison with the reference nucleic acid sequence and the subsequent nucleic acid sequence stored before the additional subsequent nucleic acid sequence, whether the additional subsequent nucleic acid sequence is the same as or different from them.

The method of claim 6, further comprising obtaining a subsequent nucleic acid sequence from the nucleic acid sequence database and the reference database.

The method of claim 13, wherein obtaining the subsequent nucleic acid sequence from the nucleic acid sequence database and the reference database comprises:

generating a query including at least one of genetic information and nucleic acid sequence section information for the subsequent nucleic acid sequence; and

reconstructing nucleic acid sequence information corresponding to the query using the nucleic acid sequence database and the reference database.

The method of claim 14, wherein reconstructing the nucleic acid sequence information corresponding to the query comprises:

when the query is genetic information of a nucleic acid sequence, designating the PIs corresponding to the genetic information of the query in the subsequent nucleic acid sequence;

sequentially connecting the nucleic acid sequences of the SIs assigned to the nucleic acid sequences included in the designated PIs; and

trimming both ends of the continuously connected nucleic acid sequence to the region corresponding to the genetic information of the query.

The method of claim 15, further comprising:

when the query is nucleic acid sequence section information, designating the PIs corresponding to the section information of the query in the subsequent nucleic acid sequence;

sequentially connecting the nucleic acid sequences of the SIs assigned to the nucleic acid sequences included in the designated PIs; and

trimming both ends of the continuously connected nucleic acid sequence to the region corresponding to the section information of the query.

A device for storing a subsequent nucleic acid sequence using a reference nucleic acid sequence comprising continuously fragmented fragments of a predetermined length, the device comprising:

a memory storing at least one instruction;

a network; and

a processor,

wherein the processor executes the at least one instruction, the at least one instruction including:

an instruction for dividing the reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence;

an instruction for aligning the subsequent nucleic acid sequence to the reference nucleic acid sequence;

an instruction for assigning a PI to each fragment of the subsequent nucleic acid sequence corresponding to each of the fragments of the reference nucleic acid sequence to which the PIs are assigned, and assigning an SI to the nucleic acid sequence included in each of those fragments, wherein the PI assigned to each fragment of the subsequent nucleic acid sequence is an identifier corresponding to the PI of the reference nucleic acid sequence, and the SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier indicating whether the fragment of the reference nucleic acid sequence corresponding to that PI and the fragment of the subsequent nucleic acid sequence are the same or different; and

an instruction for storing the PIs and SIs for the fragments of the subsequent nucleic acid sequence in a nucleic acid sequence database.

A computer-readable recording medium storing a computer program, the computer program being programmed to perform steps comprising:

dividing a reference nucleic acid sequence into consecutive fragments of a predetermined length, assigning a position identifier (PI) to each of the fragments, and assigning a sequence identifier (SI) to the nucleic acid sequence included in each of the fragments of the reference nucleic acid sequence;

aligning a subsequent nucleic acid sequence to the reference nucleic acid sequence;

assigning a PI to each fragment of the subsequent nucleic acid sequence corresponding to each of the fragments of the reference nucleic acid sequence to which the PIs are assigned, and assigning an SI to the nucleic acid sequence included in each of those fragments, wherein the PI assigned to each fragment of the subsequent nucleic acid sequence is an identifier corresponding to the PI of the reference nucleic acid sequence, and the SI assigned to each fragment of the subsequent nucleic acid sequence is an identifier indicating whether the fragment of the reference nucleic acid sequence corresponding to that PI and the fragment of the subsequent nucleic acid sequence are identical or different; and

storing the PIs and SIs for the fragments of the subsequent nucleic acid sequence in a nucleic acid sequence database.

A computer program stored on a computer-readable recording medium,

The above computer program,

A computer program stored on a computer-readable recording medium.