KR20030096658A

KR20030096658A - A method of identifying drug targets using microbial genomic sequence data

Info

Publication number: KR20030096658A
Application number: KR1020020033637A
Authority: KR
Inventors: 채승표
Original assignee: 주식회사 아이디알
Priority date: 2002-06-17
Filing date: 2002-06-17
Publication date: 2003-12-31
Anticipated expiration: 2022-06-17
Also published as: KR100441143B1

Abstract

A protein sequence similarity comparison program is run against protein databases built from microbial genome information to find homologs common to the microbial genomes, and these homologs are used to search for novel target proteins (drug targets) for antibiotic and antifungal development. The method can find homologs with low sequence similarity and provides a basis for predicting the function of genes whose function is not yet known.

Description

A method of identifying drug targets using microbial genomic sequence data

The present invention relates to a method for searching protein sequence information and, more particularly, to a search method capable of finding homologs with low sequence similarity.

When the genomes of two or more different organisms are compared, the protein sequences encoded by those genomes are compared, and the functions of two proteins are inferred from their degree of similarity. If the similarity between two proteins is high, the proteins are said to be related by homology and are called homologs.

As genome projects for many organisms progress, the number of known nucleic acid and protein sequences is growing exponentially. Determining the function of a newly discovered protein normally requires a range of biological experiments, but its function can also be predicted without experiments by comparing its sequence with those of proteins whose functions have already been confirmed experimentally.

Many programs exist for this kind of sequence comparison; BLAST and FASTA are representative. A protein sequence of unknown function is used as a query in a similarity search against a protein sequence database. If a highly similar protein sequence is found and the function of that protein is known experimentally, the protein of unknown function used as the query can be predicted to have the same function.

This approach is widely used because it predicts the function of a newly discovered protein simply, from the protein sequence alone and without experiments. Its weakness is that when sequence similarity is low, two proteins cannot be identified as homologs even if they perform the same function.

Over a long course of evolution, the nucleic acid sequences of organisms change, and their protein sequences change as a result. Protein sequences are similar among evolutionarily close organisms, but the similarity becomes very weak between evolutionarily distant organisms. For example, humans are evolutionarily closer to monkeys than to bacteria, so human proteins are more similar to monkey proteins than to bacterial proteins. Finding homologs that perform the same function by analyzing the genomes of many organisms is useful in many fields, but BLAST or FASTA used on their own are limited in their ability to find homologs with low sequence similarity.

An object of the present invention is to provide a method capable of finding homologs with low sequence similarity.

Another object of the present invention is to provide a method for searching for novel target proteins for antibiotics and antifungal agents.

Yet another object of the present invention is to provide a method that supplies a basis for predicting the function of a protein whose function is unknown.

To achieve these objects, the target protein search method according to the present invention comprises the following steps:

(S1) selecting, from among a plurality of databases holding protein sequence information of organisms, a plurality of protein sequence databases whose protein sequences are to be examined for homology;

(S2) selecting one of the protein sequences included in any one of the protein databases as the query of a sequence similarity search program;

(S3) running the similarity search program to compare similarity against each of the databases other than the protein database of the organism containing the query;

(S4) generating a computer file in which, among the protein sequences in the search results that satisfy a predetermined condition, the protein sequence information items having the smallest cut-off value (each called a "besthit") are stored in the order in which the similarity comparisons were performed;

(S5) selecting the protein sequence located first in the file holding the besthit protein sequence information as the next query of the similarity search program;

(S6) performing steps S5, S3 and S4 a plurality of times in the order S5->S3->S4;

(S7) finding, among the besthit protein sequence information stored in the files generated by step S2, S4 or S6 and the query protein sequence information used to generate those besthits, at least three items of protein sequence information that belong to different databases and are related such that any one of the protein sequences stands in a besthit or query relationship to the other two; and

(S8) extracting homolog information present in the different protein databases using the protein sequence information found in step S7.

FIG. 1 is a flowchart showing the target protein search process according to the present invention.

FIG. 2 is an illustration of the protein sequence databases of five organisms.

FIG. 3 is an illustration of a query and its besthits when the BLASTP program is used.

FIG. 4 is an illustration of the seed formation process in the search method according to the present invention.

FIG. 5 is an illustration of the final seed formation process in the search method according to the present invention.

FIG. 6 is an illustration of the seed formation process in another case of the search method according to the present invention.

FIG. 7 is an illustration of the seed formation process in yet another case of the search method according to the present invention.

The target protein search method according to the present invention is described below with reference to the accompanying drawings. The homolog search method according to the present invention takes an arbitrary protein sequence in any one of the selected protein databases and finds, in the other databases, the protein sequences that are homologous to it.

FIG. 1 is a flowchart showing the steps of the target protein search method according to one embodiment of the present invention.

Implementing the present invention requires the complete protein sequence databases of the organisms to be compared and a protein sequence similarity comparison program. Because genome projects for many organisms are under way, protein sequence databases are in many cases available free of charge. For example, the required protein sequence databases can be downloaded from the NCBI web pages. The protein sequence similarity program used is BLAST, provided by NCBI. Within BLAST, BLASTP takes a protein sequence as the query and searches a selected protein database for homology. Other sequence similarity comparison programs such as FASTA exist, but this description uses BLAST (BLASTP in particular) as the example.

From among the many databases holding protein sequence information of various organisms, a plurality of protein sequence databases whose homology is of interest are selected, as in FIG. 2. FIG. 2 is a schematic illustration of the protein sequence databases of five organisms. (Step S1)

First, a file (for example, mge1.doc) is created to hold the protein sequence information (A1) of an arbitrarily chosen database A as the query. Here a file is a means of handling electronic information in computer-readable form; the information may instead be held in a memory device such as RAM. (Step S2)

BLASTP is run with one protein sequence from one of the databases under search selected as the query, and similarity is compared against every database except the protein database of the organism that contains the query. That is, an arbitrary protein sequence from an arbitrary database (in this example, A1 of organism A) is used as the query against the remaining protein databases (B, C, D, E). The gapped local alignment mode (gapped BLASTP) is used, with an e-value threshold of < 0.0001. (Step S3)
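The search in step S3 can be sketched as one BLASTP invocation per non-query database. The sketch below only builds the argument lists and does not run anything. It assumes the current NCBI BLAST+ `blastp` command-line tool (an assumption: the source names BLASTP but not a particular release), and the helper names `build_blastp_command` and `commands_for_other_databases`, as well as the database names, are hypothetical.

```python
from typing import Dict, List, Optional


def build_blastp_command(query_file: str, database: str,
                         evalue: float = 1e-4,
                         out_path: Optional[str] = None) -> List[str]:
    """Build (but do not run) the argument list for one BLASTP search.

    Assumes the BLAST+ `blastp` executable; `-outfmt 6` requests
    tab-separated hits, which are easy to parse afterwards.
    """
    cmd = [
        "blastp",
        "-query", query_file,     # file holding the query protein sequence
        "-db", database,          # protein database of one other organism
        "-evalue", str(evalue),   # e-value threshold from the description
        "-outfmt", "6",
    ]
    if out_path is not None:
        cmd += ["-out", out_path]
    return cmd


def commands_for_other_databases(query_file: str, query_db: str,
                                 all_dbs: List[str]) -> Dict[str, List[str]]:
    """One command per database, skipping the database the query came from."""
    return {db: build_blastp_command(query_file, db)
            for db in all_dbs if db != query_db}
```

Each returned list could be passed to `subprocess.run` on a machine where BLAST+ and the formatted databases are installed. The description requires an e-value strictly below 0.0001, so hits exactly at the threshold would still need to be filtered out after the search.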

Similarity is compared against protein sequences B1 through BN of database B. Among the protein sequences whose e-value is below 10^-4, the one with the smallest e-value is called the "besthit". Assuming that this sequence is B1, protein sequence B1 is stored at the first position of the result file (for example, mge1.out).

Similarity is then compared in the same way against protein sequences C1 through CN of the next database, C. If the besthit there is C2, that protein sequence is stored immediately after protein sequence B1. The same procedure is carried out for databases D and E. A besthit is determined for a database only if it contains a protein sequence that satisfies the predetermined condition, that is, e-value < 0.0001. The besthit data file (mge1.out) is stored in FASTA format. A FASTA record consists of a header describing the characteristics of the protein, followed by the protein sequence.
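The besthit selection just described can be sketched as follows. This is a minimal illustration, not the patent's implementation: hits are assumed to be already parsed into `(sequence id, e-value)` pairs per database, the function names are hypothetical, and the e-values in the usage note are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, float]  # (subject sequence id, e-value); assumed input shape


def pick_besthit(hits: List[Hit], cutoff: float = 1e-4) -> Optional[str]:
    """Return the id of the hit with the smallest e-value below the cutoff.

    Returns None when no hit satisfies e-value < cutoff, in which case the
    database contributes no besthit.
    """
    passing = [hit for hit in hits if hit[1] < cutoff]
    if not passing:
        return None
    return min(passing, key=lambda hit: hit[1])[0]


def collect_besthits(hits_by_db: Dict[str, List[Hit]],
                     db_order: List[str],
                     cutoff: float = 1e-4) -> List[str]:
    """Collect one besthit per database, in the order the databases were searched."""
    besthits = []
    for db in db_order:
        best = pick_besthit(hits_by_db.get(db, []), cutoff)
        if best is not None:
            besthits.append(best)
    return besthits


def to_fasta(records: List[Tuple[str, str]]) -> str:
    """Serialize (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)
```

With invented hits such as `{"B": [("B1", 1e-30), ("B2", 1e-6)], "C": [("C1", 1e-5), ("C2", 1e-40)]}` and the order `["B", "C"]`, `collect_besthits` yields B1 followed by C2, matching the storage order described for mge1.out.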

[Table 1] below shows the relationship between the query and its besthits. The example is schematic, for explanation only, and does not represent actual data.

[Table 1]

FIG. 3 is a schematic illustration of the BLAST search result. It shows that when A1, one of the protein sequences of organism A, is used as the query in a sequence similarity search against the remaining protein databases B, C, D and E, the besthits are B1, C2, D1 and E2. (Step S4)

The second BLAST search uses as its query the protein sequence stored at the first position of the besthit file (mge1.out), namely B1. (Step S5)

Sequence similarity is compared against all the databases under comparison (A, C, D, E) other than the database (B) that contains the query protein sequence (B1). Suppose that A2, C2, D2 and E2 turn out to be the besthits. That is, protein sequence A2 in database A, C2 in database C, D2 in database D and E2 in database E each satisfy the besthit condition for the query protein sequence B1.

The besthits that satisfy the condition are again stored in a file. They are stored in order of the databases adjacent to the database used for the query, except that a besthit (A2) found in a database (A) containing a protein sequence already used as a query in an earlier search is stored last. The storage order is therefore (C2, D2, E2, A2).
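One way to read this storage rule is as a cyclic rotation of the database list that starts just after the query's database, so that databases already used for earlier queries fall to the end. That reading is an interpretation, although it matches the order (C2, D2, E2, A2) given above. The function name `storage_order` is hypothetical.

```python
from typing import List


def storage_order(db_order: List[str], query_db: str) -> List[str]:
    """Order in which the other databases' besthits are stored.

    Assumes the rule amounts to a cyclic rotation: databases after the
    query's database come first, then the ones before it.
    """
    i = db_order.index(query_db)
    return db_order[i + 1:] + db_order[:i]
```

For the five databases A to E and a query from B, this gives C, D, E, A.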

[Table 2] below shows the relationship between the query and its besthits. Like [Table 1] above, it is schematic, for explanation only, and does not represent actual data.

[Table 2]

Next, the first protein sequence among the besthits obtained with B1 as the query, namely C2, is used as the query, and similarity is compared against the protein sequences of every database except C (A, B, D, E).

These steps are repeated a number of times, and the sequence similarity search ends when a protein sequence belonging to the database that contains the initial query appears at the first position of a besthit file. The files generated by the procedure described so far, and their contents, are summarized in [Table 3] below. (Step S6)

[Table 3]

| File name | Data (QUERY) | ⇒ | File name | Data (BESTHIT) |
|---|---|---|---|---|
| mge1.doc | A1 | ⇒ | mge1.out | B1, C2, D1, E2 |
| mge2.doc | B1 | ⇒ | mge2.out | C2, D2, E2, A2 |
| mge3.doc | C2 | ⇒ | mge3.out | D1, E2, A1, B1 |
| mge4.doc | D1 | ⇒ | mge4.out | E2, A1, B1, C2 |
| mge5.doc | E2 | ⇒ | mge5.out | A1, B1, C2, D2 |
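The repeated S5->S3->S4 cycle can be sketched as a loop that keeps promoting the first besthit to the next query and stops once that first besthit belongs to the starting database. In this sketch the similarity search is replaced by a lookup in the schematic data of [Table 3]; the names `run_chain`, `TABLE3` and `db_of`, and the convention that the first letter of a sequence id names its database, are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Schematic data of [Table 3]: query -> besthits in stored order.
TABLE3: Dict[str, List[str]] = {
    "A1": ["B1", "C2", "D1", "E2"],
    "B1": ["C2", "D2", "E2", "A2"],
    "C2": ["D1", "E2", "A1", "B1"],
    "D1": ["E2", "A1", "B1", "C2"],
    "E2": ["A1", "B1", "C2", "D2"],
}


def db_of(seq_id: str) -> str:
    """Hypothetical convention: the leading letter identifies the database."""
    return seq_id[0]


def run_chain(first_query: str,
              search: Callable[[str], List[str]],
              max_rounds: int = 100) -> List[Tuple[str, List[str]]]:
    """Repeat query -> besthits -> next query until the chain closes.

    Returns one (query, besthits) pair per generated file. The loop stops
    when the first besthit comes from the database of the initial query,
    or when a search returns no besthit at all.
    """
    start_db = db_of(first_query)
    files = []
    query = first_query
    for _ in range(max_rounds):  # guard against chains that never close
        besthits = search(query)
        files.append((query, besthits))
        if not besthits or db_of(besthits[0]) == start_db:
            break
        query = besthits[0]
    return files
```

Calling `run_chain("A1", TABLE3.__getitem__)` visits the queries A1, B1, C2, D1 and E2 in turn and stops after the fifth file, whose first besthit is A1.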

Next, two files are chosen arbitrarily from the files holding besthit protein sequence information (mge1.out, mge2.out … mge5.out), and, using the files holding the query information (mge1.doc, mge2.doc … mge5.doc), the relationships among at least three items of protein sequence information belonging to different databases are examined.

The relationships among these items can take various forms. Examining them therefore requires two checks: determining whether any besthit protein sequence in one of the two selected files is used as the query of the other selected file, and confirming whether a besthit protein sequence is stored in both files.

Suppose first that mge1.out and mge2.out are selected. It is determined whether any besthit protein sequence in mge1.out is used as the query of mge2.out, and whether any besthit protein sequence is present in both files. In this example, B1 is used as the query of mge2.out, and C2 and E2 are the besthits common to both files. A1, B1 and C2 are therefore mutually related as query and besthit, and so are A1, B1 and E2. Among the besthit protein sequences and the query protein sequences used to generate them, a group of at least three protein sequences that belong to different databases and in which any one sequence stands in a besthit or query relationship to the other two is called a seed. The seed is stored in a separate file (A.doc).

FIG. 4 is a schematic illustration of seed formation. It shows that B1, C2, D1 and E2 are the besthits when A1 is the query, and that A2, C2, D2 and E2 are the besthits when B1 is the query. A1, B1 and C2 satisfy the seed formation condition described above, and so do A1, B1 and E2. FIG. 4 serves only to explain the seeds formed by the present invention; in practice, a program stored in the computer extracts the groups that meet the seed condition and stores them in the file (A.doc).
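The first seed case, where a besthit of one file is the query of the other and the two files share besthits, can be sketched as below. This is a simplified reading of the rule, not the patent's program: each file is represented as a `(query, besthits)` pair, the function name is hypothetical, and the leading letter of a sequence id is assumed to name its database.

```python
from typing import List, Tuple

BesthitFile = Tuple[str, List[str]]  # (query, besthits); assumed representation


def db_of(seq_id: str) -> str:
    """Hypothetical convention: the leading letter identifies the database."""
    return seq_id[0]


def seeds_from_linked_pair(file_a: BesthitFile,
                           file_b: BesthitFile) -> List[Tuple[str, str, str]]:
    """Seeds for two files where one file's query is a besthit in the other.

    Each seed is (query of a, query of b, shared besthit), kept only when
    the three sequences come from three different databases.
    """
    (qa, hits_a), (qb, hits_b) = file_a, file_b
    if qb not in hits_a and qa not in hits_b:
        return []  # no query/besthit link between the two files
    seeds = []
    for hit in hits_a:
        if hit in hits_b and len({db_of(qa), db_of(qb), db_of(hit)}) == 3:
            seeds.append((qa, qb, hit))
    return seeds
```

For mge1.out (query A1, besthits B1, C2, D1, E2) and mge2.out (query B1, besthits C2, D2, E2, A2), the shared besthits are C2 and E2, so the function returns the two seeds named in the text: (A1, B1, C2) and (A1, B1, E2).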

Seeds are then found in the remaining files in the same way. If the number of output files (mge1.out, mge2.out … mge5.out) is N, the seed formation process is carried out by comparing two files at a time, NC2 = N(N-1)/2 times in total. FIG. 5 is a schematic illustration of the seed formation carried out over all the files.
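As a small illustration of the pair count, the unordered pairs of output files can be enumerated with the standard library. The helper name `file_pairs` is hypothetical; the file names are those used in the example.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple


def file_pairs(names: List[str]) -> List[Tuple[str, str]]:
    """All unordered pairs of besthit files to be compared for seeds."""
    return list(combinations(names, 2))


outputs = [f"mge{i}.out" for i in range(1, 6)]
pairs = file_pairs(outputs)
# With N = 5 files there are 5 * 4 / 2 = 10 comparisons.
assert len(pairs) == comb(len(outputs), 2) == 10
```

The number of comparisons grows quadratically with the number of files, which stays small when there is one file per organism.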

This example assumes that a besthit protein sequence stored in one of the two selected files is used as the query of the other selected file, but this is not always so. Even when no such relationship holds, the two files may still share a protein sequence. For example, another embodiment might yield the results in [Table 4] below.

[Table 4]

| File name | Data (QUERY) | ⇒ | File name | Data (BESTHIT) |
|---|---|---|---|---|
| mge1.doc | A1 | ⇒ | mge1.out | B1, C1, E2 |
| mge2.doc | B1 | ⇒ | mge2.out | C2, D2 |
| mge3.doc | C2 | ⇒ | mge3.out | D2, A1, B1 |
| mge4.doc | D2 | ⇒ | mge4.out | E2, B2 |
| mge5.doc | E2 | ⇒ | mge5.out | A1, D2 |

As described above, two files are chosen arbitrarily and the relationship between their stored data is examined. Consider mge1.out and mge4.out. None of the besthits in mge1.out is the query in mge4.doc that produced mge4.out. Even so, the two files (mge1.out and mge4.out) may share besthit protein sequence information; here, E2 is present in both. In this case the seed formation condition is satisfied by the query protein sequences used to generate the two selected files (A1, D2), all the query protein sequences (B1, C2) selected while the S3->S4->S5 cycle is performed one or more times starting from one of those queries (A1) until the other query (D2) is selected, and the shared besthit E2. Thus A1, B1, C2, D2 and E2 form a seed (see FIG. 6).
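This second case can be sketched under the assumption that the files are kept in the order in which the query chain produced them, so that the queries between two files are simply the queries of the files lying between them. The function name and the list-of-pairs representation are hypothetical; the data are the schematic values of [Table 4].

```python
from typing import List, Optional, Tuple

BesthitFile = Tuple[str, List[str]]  # (query, besthits); assumed representation

# Schematic data of [Table 4], in the order the chain generated the files.
TABLE4: List[BesthitFile] = [
    ("A1", ["B1", "C1", "E2"]),
    ("B1", ["C2", "D2"]),
    ("C2", ["D2", "A1", "B1"]),
    ("D2", ["E2", "B2"]),
    ("E2", ["A1", "D2"]),
]


def seed_from_shared_besthit(files: List[BesthitFile],
                             i: int, j: int) -> Optional[List[str]]:
    """Seed for files i < j that share a besthit but have no query link.

    The seed is every query on the chain from file i to file j, followed
    by the besthits the two files have in common.
    """
    qi, hits_i = files[i]
    qj, hits_j = files[j]
    if qj in hits_i or qi in hits_j:
        return None  # query/besthit link exists: the first case applies
    common = [hit for hit in hits_i if hit in hits_j]
    if not common:
        return None  # nothing shared: this case does not apply
    chain = [query for query, _ in files[i:j + 1]]
    return chain + common
```

Comparing mge1.out (index 0) with mge4.out (index 3) gives the chain A1, B1, C2, D2 plus the shared besthit E2, the seed shown in FIG. 6.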

In yet another embodiment of the present invention, the results may be as in [Table 5] below. In this case no besthit protein sequence stored in one of the two selected files is used as the query of the other selected file, and no protein sequence is present in both files.

[Table 5]

| File name | Data (QUERY) | ⇒ | File name | Data (BESTHIT) |
|---|---|---|---|---|
| mge1.doc | A1 | ⇒ | mge1.out | B1, C1, E1 |
| mge2.doc | B1 | ⇒ | mge2.out | C2, D1 |
| mge3.doc | C2 | ⇒ | mge3.out | D2, A2, B2 |
| mge4.doc | D2 | ⇒ | mge4.out | E2 |
| mge5.doc | E2 | ⇒ | mge5.out | A1 |

Consider the files mge1.out and mge5.out, which correspond to this situation. The two besthit files share no protein, but one query protein sequence (A1) is identical to the first besthit protein sequence (A1) in the file generated by the other query protein (E2). In that case the result includes the two query protein sequences (A1, E2) together with all the query protein sequences (B1, C2, D2) selected while the S3->S4->S5 cycle is performed one or more times starting from one of the queries (A1) until the other query (E2) is selected. Here A1, E2, B1, C2 and D2 form a seed, as illustrated in FIG. 7. (Step S7)
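The third case can be sketched in the same style: the two files share no besthit, but the first besthit of the later file is the query of the earlier one, which closes the chain. As before, the files are assumed to be listed in chain order, the names are hypothetical, and the data are the schematic values of [Table 5].

```python
from typing import List, Optional, Tuple

BesthitFile = Tuple[str, List[str]]  # (query, besthits); assumed representation

# Schematic data of [Table 5], in the order the chain generated the files.
TABLE5: List[BesthitFile] = [
    ("A1", ["B1", "C1", "E1"]),
    ("B1", ["C2", "D1"]),
    ("C2", ["D2", "A2", "B2"]),
    ("D2", ["E2"]),
    ("E2", ["A1"]),
]


def seed_from_closed_chain(files: List[BesthitFile],
                           i: int, j: int) -> Optional[List[str]]:
    """Seed for files i < j with no shared besthit whose chain closes.

    Applies when the first besthit of file j equals the query of file i;
    the seed is then every query on the chain from file i to file j.
    """
    qi, hits_i = files[i]
    _, hits_j = files[j]
    if any(hit in hits_j for hit in hits_i):
        return None  # shared besthit: another case applies
    if not hits_j or hits_j[0] != qi:
        return None  # chain does not close back on the first query
    return [query for query, _ in files[i:j + 1]]
```

For mge1.out (index 0) and mge5.out (index 4) the result is A1, B1, C2, D2, E2, the same five sequences the text lists for FIG. 7.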

When the seeds found by the seed formation process are stored in one file (A.doc), the besthit protein sequences in it can be regarded as homologs. Because the stored seeds may contain duplicate besthits, duplicates are removed. Two or more besthits may also come from the same microbial genome, so duplicate microorganism names are consolidated as well. If, for example, 26 microorganism names remain, homologs have been found in 26 microorganisms.
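The clean-up in step S8 amounts to taking the union of all seeds and then counting the distinct source organisms. A minimal sketch follows, in which the function name `extract_homologs` and the `organism_of` callback are hypothetical, and the leading letter of a sequence id stands in for the microorganism name.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def extract_homologs(seeds: Iterable[Sequence[str]],
                     organism_of: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Merge seeds into a duplicate-free homolog list and list the organisms.

    Returns (homologs in first-seen order, sorted distinct organism names).
    """
    homologs: List[str] = []
    seen = set()
    for seed in seeds:
        for seq_id in seed:
            if seq_id not in seen:  # keep only one copy of each sequence
                seen.add(seq_id)
                homologs.append(seq_id)
    organisms = sorted({organism_of(seq_id) for seq_id in homologs})
    return homologs, organisms


homologs, organisms = extract_homologs(
    [("A1", "B1", "C2"), ("A1", "B1", "E2")],
    organism_of=lambda seq_id: seq_id[0],  # assumed naming convention
)
```

For the two seeds from the first example, the merged list is A1, B1, C2, E2, drawn from four organisms (A, B, C, E).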

The foregoing has described the target protein search method according to the present invention, that is, a method that takes an arbitrary protein sequence in any one of the selected protein databases and finds, in the other databases, the protein sequences that are homologous to it.

The target protein search method using microbial genome information according to the present invention makes it possible to find homologous genes common to the microorganisms. Because the proteins encoded by such genes are thought to perform functions essential to the survival of the organism, they can be good target proteins for new drug development. The method allows novel target proteins for antibiotics and antifungal agents to be identified efficiently. It also has the effect of making it possible to predict the function of proteins whose function is unknown.

Claims

Selecting, from among a plurality of databases holding protein sequence information of organisms, a plurality of protein sequence databases whose protein sequences are to be examined for homology (S1);

Selecting one of the protein sequences included in any one of the protein databases as the query of a sequence similarity search program (S2);

Running the similarity search program to compare similarity against each of the databases other than the protein database of the organism containing the query (S3);

Generating a computer file in which, among the protein sequences in the results of the similarity search program that satisfy a predetermined condition, the protein sequence information items having the smallest cut-off value (each a "besthit") are stored in the order in which the similarity comparisons were performed (S4);

Selecting the protein sequence located first in the file holding the besthit protein sequence information as the query of the similarity search program (S5);

Performing steps S5, S3 and S4 a plurality of times in the order S5->S3->S4 (S6);

Finding, among the besthit protein sequence information stored in the files generated by the S2, S4 or S6 process and the query protein sequence information used to generate that besthit information, at least three items of protein sequence information that belong to different databases and in which any one protein sequence satisfies a besthit or query relationship with respect to the other two (S7);

A method of searching for a target protein using similarity comparison of microbial genome sequence information, comprising the above steps and a step (S8) of extracting homolog information present in the different protein databases using the protein sequence information found in step S7.

The method of claim 1, wherein the S7 process comprises:

Determining whether any one besthit protein sequence stored in one selected besthit protein sequence file is used as the query of another selected besthit protein sequence file; and

Confirming whether a besthit protein sequence stored in common in the two files exists.

The method of claim 2, wherein, when any one besthit protein sequence stored in one file selected in step S7 is used as the query of the other selected file, the seed includes the protein sequence information common to both files and each of the query protein sequences used to generate the two selected files.

The method of claim 2, wherein, when no besthit protein sequence stored in one file selected in step S7 is used as the query of the other selected file, the seed includes the protein sequence information common to both files.

5. The method of claim 4, wherein the seed further includes each of the query protein sequences used to generate the two selected files and all the query protein sequences selected while the S3->S4->S5 process is performed one or more times, starting from one of those queries, until the other query is selected.

6. The method of claim 4, wherein, when one query protein sequence is identical to the first besthit protein sequence in the file generated by the other query protein, the seed includes each of the query protein sequences and all the query protein sequences selected while the S3->S4->S5 process is performed one or more times, starting from one of the queries, until the other query is selected.

The method of claim 1, wherein the homolog extraction (S8) comprises:

Storing the protein sequence information found in the S7 process together in one file, computed as a union (U) so that only one copy of any duplicated protein sequence information remains; and

Counting the microorganisms from which the protein sequence information stored in that file originates.

The method according to any one of claims 1 to 7, wherein, when the sequence similarity search program is BLAST, an e-value < 0.01 is adopted.

The method according to any one of claims 1 to 7, wherein, when FASTA is used as the sequence similarity search program, a Z-SCORE > 90 is adopted.

Comparing similarity against each of the databases other than the protein database of the organism containing the query (S3);

Finding, among the besthit protein sequence information stored in the files generated by steps S2 to S6 and the query protein sequence information used to generate that besthit information, at least three items of protein sequence information that belong to different databases and in which any one protein sequence has a besthit or query relationship with respect to the other two (S7);

A computer-readable recording medium storing a program comprising a series of instructions for executing an operation consisting of the above processes and a process (S8) of extracting homolog information present in the different protein databases using the protein sequence information found in the S7 process.

A computer system comprising the recording medium of claim 10.

12. The computer system of claim 11, wherein the recording medium is selected from the group of recording media consisting of a hard disk, optical memory, RAM (Random Access Memory), ROM (Read Only Memory) and flash memory.