KR20070115964A

KR20070115964A - Systems, methods, and computer programs for non-binary sequence comparisons

Info

Publication number: KR20070115964A
Application number: KR1020077021284A
Authority: KR
Inventors: 제프리 엠. 클라크
Original assignee: 바이오인포르마티카 엘엘씨
Priority date: 2005-03-18
Filing date: 2006-03-20
Publication date: 2007-12-06
Also published as: US20080040048A1; EP1859268A2; US20100094889A1; CA2601890A1; US20130297640A1; CN101142479A; US20060223095A1; US20070129900A1; US7734427B2; EP2031533A1; WO2006102128A3; WO2006102128A2; AU2006227410A1; JP2008533619A; US7805254B2; EP1859268A4; US8483971B2; US7263444B2

Abstract

Systems and methods for performing non-binary comparison of biological sequences according to the present invention include a new measure, Cω₀, a non-binary counting measure used in a stand-alone module called VaSSA-I. The measure yields substantially more information about sequences, and about the comparisons between them, than is gathered by conventional bioinformatics techniques.

Description

System, method and computer program for non-binary sequence comparison

This application claims priority to US Provisional Application No. 60/662,943, filed March 18, 2005, the entirety of which is incorporated herein by reference.

The present invention relates generally to bioinformatics and, more particularly, to methods for measuring the degree of similarity and difference between genetic sequences.

The DNA sequences of the entire genomes of different species are being determined at a rapid pace. Understanding the structural variation and function of these genomes is the task of the bioinformatics community. Moreover, some finished versions of genomic data contain gaps where data could not be obtained. These various genomic sequence data sets may consist of data whose relative order and orientation are difficult to determine. Dealing with such incomplete data places new demands on integrated system tools, especially when two or more genomes are compared. The bioinformatics community needs to be able to handle these gaps more effectively.

In conventional methods, handling comparisons between genomes is a major problem. For extremely similar sequences, there are so-called "greedy" alignment methods that compute an optimal alignment. These algorithms allow gaps in the alignment and are very efficient, but they work well only for very simple alignment scoring schemes. For richer scoring (as is involved in comparing large stretches of a single genome and in comparing multiple genomes), the greedy methods lose their efficiency advantage over dynamic programming.

Conventional alignment methods for three or more sequences are geared almost entirely toward the comparison of protein sequences based on putative codons, a codon being a set of three nucleic acid bases that encodes a single amino acid. This may be because few examples exist of genomic sequence data from several similar species. In addition, sequence comparison and homology analysis are performed on a binary basis. This conserves computing resources but ignores biochemical information.

There is a need for an improved solution that overcomes the shortcomings of conventional sequence alignment, similarity, and genetic sequence comparison tools.

<Summary of the Invention>

A system for sequence analysis comprises an analysis module adapted to calculate a non-binary similarity score between a first nucleotide sequence and a second nucleotide sequence; a file management module; and a plot module.

In one embodiment, the system further comprises a report module, a user options module, and/or a user support module.

In another embodiment, the file management module comprises a sequence load module adapted to load at least one sequence file; an active sequence flush module adapted to flush a sequence file from memory; and a loaded sequence flush module adapted to flush loaded sequence files from memory.

In another embodiment, the sequence load module comprises a loaded sequence display module adapted to generate and display a summary report notebook page when sequences are loaded, wherein the summary report notebook page is adapted to display the sequence file name and the number of sequences.

In another embodiment, the report module is adapted to generate and display a sequence summary, a listing of the content of each loaded sequence, and/or statistical information for each loaded sequence.

In another embodiment, the analysis module comprises a sequence alignment module adapted to align a target sequence to a base sequence and display an alignment report; an ω₀ module adapted to calculate an ω₀ score for a sequence and display the ω₀ score; a query repeat module adapted to locate multiple occurrences of a target sequence within a base sequence and display the multiple occurrences; a query omega repeat module adapted to determine when repeated nucleotides are duplicates; a slope calculation module adapted to calculate a slope for each nucleotide position within a base sequence and display a slope report; and a sequence comparison module adapted to compare a target sequence to a base sequence and display a similarity report.

In another embodiment, the plot module comprises a spectral array module adapted to plot alignment coefficients for a base sequence and a target sequence; a single strand module adapted to plot single strands for a base sequence and a target sequence; a slope module adapted to calculate a slope for each nucleotide position within a base sequence and display a plot of the slopes; and an ω_N module adapted to calculate ω_N for a base sequence and display a plot of ω_N.

Another aspect of the invention relates to a method for sequence analysis. The method comprises reading a sequence file; selecting a target sequence and a base sequence from the file; performing a non-binary comparison between the target and base sequences, wherein the non-binary comparison produces a comparison value; and determining the similarity between the target and base sequences based on the comparison value.

In one embodiment, the method further comprises writing the aligned sequences to a sequence file and calculating an alignment ratio.

In another embodiment, the method further comprises generating at least one of a two-dimensional spectral array plot or a two-dimensional single strand plot.

In another embodiment, performing the non-binary comparison comprises using a look-up table that contains non-binary similarity score values for a plurality of possible comparisons between two sequence elements.
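The look-up-table idea can be illustrated with a minimal Python sketch. The score values, the purine/pyrimidine grouping, and the names `pair_score`, `LOOKUP`, and `similarity` are all assumptions made for illustration; the text here does not disclose the actual table, and the real table need not be symmetric.

```python
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}  # assumption: group bases by purine/pyrimidine class


def pair_score(a: str, b: str) -> float:
    """Hypothetical graded score for one pair of bases (not the patent's values)."""
    if a == b:
        return 1.0  # identical bases
    if (a in PURINES) == (b in PURINES):
        return 0.5  # same chemical class (purine/purine or pyrimidine/pyrimidine)
    return 0.0  # different chemical class


# Look-up table covering every ordered pair of bases.
LOOKUP = {(a, b): pair_score(a, b) for a, b in product(BASES, repeat=2)}


def similarity(target: str, base: str) -> float:
    """Mean look-up score over the positions of two equal-length sequences."""
    if len(target) != len(base):
        raise ValueError("sequences must be aligned to equal length")
    if not target:
        return 0.0
    return sum(LOOKUP[(t, b)] for t, b in zip(target, base)) / len(target)
```

Under these placeholder values, `similarity("ACGT", "GCGT")` is 0.875 while `similarity("ACGT", "CCGT")` is 0.75, although both pairs have the same binary identity (three matches out of four). That is the kind of distinction a graded table can make and a match/mismatch count cannot.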

These and other features and advantages of the invention will be apparent from the following, more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings, in which like reference numerals generally indicate identical, functionally similar, and/or structurally similar elements.

FIG. 1 shows a flowchart of an exemplary method according to the present invention.

FIG. 2 shows an exemplary embodiment of the sub-modules of a DNA analysis module according to the present invention.

FIG. 3 shows an exemplary embodiment of the GUI main window in the Variation Sequence Software Application (hereinafter "VaSSA").

FIG. 4 shows an exemplary embodiment of the File menu window in VaSSA.

FIG. 5 shows an exemplary embodiment of the Notebook Viewer window in VaSSA.

FIG. 6 shows an exemplary embodiment of the Sequence Summary report window in VaSSA.

FIG. 7 shows an exemplary embodiment of the Sequence View report window in VaSSA.

FIG. 8 shows an exemplary embodiment of the Sequence View STAT window in VaSSA.

FIG. 9 shows an exemplary embodiment of the Sequence Alignment menu window in VaSSA.

FIG. 10 shows an exemplary embodiment of the Aligned Sequence report window in VaSSA.

FIG. 11 shows an exemplary embodiment of the Query Repeat window in VaSSA.

FIG. 12 shows an exemplary embodiment of the Query Repeat report window in VaSSA.

FIG. 13 shows an exemplary embodiment of the Omega Subzero window in VaSSA.

FIG. 14 shows an exemplary embodiment of the Omega Subzero report window in VaSSA.

FIG. 15 shows an exemplary embodiment of the Query Omega Repeat menu window in VaSSA.

FIG. 16 shows an exemplary embodiment of the Query Omega Repeat report in VaSSA.

FIG. 17 shows an exemplary embodiment of the Slope Calculation window in VaSSA.

FIG. 18 shows an exemplary embodiment of the Slope Calculation report in VaSSA.

FIG. 19 shows an exemplary embodiment of the Sequence Comparison window in VaSSA.

FIG. 20 shows an exemplary embodiment of the Sequence Comparison report window in VaSSA.

FIG. 21 shows an exemplary embodiment of the Spectral Array window in VaSSA.

FIG. 22 shows an exemplary embodiment of the Spectral Array plot window in VaSSA.

FIG. 23 is a picture of the spectral array formula.

FIG. 24 shows a schematic diagram of a spectral array formula example.

FIG. 25 is a picture of the spectral array triangular structure.

FIG. 26 shows an exemplary embodiment of the Single Strand window in VaSSA.

FIG. 27 shows an exemplary embodiment of the Single Strand plot report window in VaSSA, comparing at single-base resolution the spectral array plots of two 360-base sequences (top) and of the region from position 250 to position 295 of those sequences (bottom).

FIG. 28 shows an exemplary embodiment of a further Single Strand plot report window in VaSSA, showing a comparison between single strand sequences.

FIG. 29 shows an exemplary embodiment of the Plot Slope window in VaSSA.

FIG. 30 shows a slope plot for a single sequence.

FIG. 31 shows an exemplary embodiment of the Omega SubN window in VaSSA.

FIG. 32 shows an exemplary embodiment of the Omega SubN plot window in VaSSA.

FIG. 33 shows the chemical structures of the four nucleic acid bases guanine, cytosine, adenine, and thymine, and of uracil, which replaces thymine in RNA.

FIG. 34A shows a picture of the different elements involved in an A\G comparison.

FIG. 34B shows a picture of the different elements involved in a G\A comparison.

FIG. 34C shows a picture of the different elements involved in an A\C comparison.

FIG. 35 shows an exemplary embodiment of a DNA topological conjugacy module according to the present invention.

FIG. 36 shows an exemplary embodiment of a DNA approximation module according to the present invention.

FIG. 37 shows an exemplary embodiment of a DNA orbit module according to the present invention.

FIG. 38 shows an exemplary embodiment of a chaotic region classification module according to the present invention.

FIG. 39 shows an exemplary embodiment of a DNA bifurcation module according to the present invention.

FIG. 40 shows an exemplary embodiment of a DNA derivative module according to the present invention.

FIG. 41 shows an exemplary embodiment of a DNA analytical behavior profiler module according to the present invention.

FIG. 42 shows an exemplary embodiment of a structurally stable region module according to the present invention.

FIG. 43 shows an exemplary embodiment of an indecomposable region module according to the present invention.

FIG. 44 shows an exemplary embodiment of a DNA complexity base module according to the present invention.

FIG. 45 shows an exemplary embodiment of a DNA aligner module according to the present invention.

FIG. 46 shows an exemplary embodiment of a non-binary sequence comparison system according to the present invention.

Embodiments of the present invention provide an integrated system for analyzing and determining the structural behavior of sequences on a discrete topological space. The technology provides new and improved measurement methods, including in particular normalization, compression techniques, structural classification, and topological conjugacy methods. The combination of these analytical methods draws on computational mathematical techniques, biology, and chemistry to produce numerical governing properties and/or structural behavior patterns of genomic data.

The present invention can be used in a variety of bioinformatics applications. The integrated systems and methods of the invention provide single-sequence plots and other data for nucleotide sequences of essentially any length (for example, 50 to 2 million bases). Because of their efficient processing steps, the integrated systems and methods can provide comparison data for large numbers of sequences. For example, the system has been shown to run very quickly on 500 sequences of 500 bases each. Comparison of 1,000, 10,000, 100,000, 1,000,000, or more sequences is within the scope of the invention.

The system of the present invention uses a non-binary method that produces meaningful comparison information across the homology range from 0% (no identity) to 100% (complete identity). The non-binary method is far more discriminating than typical binary comparisons and can resolve degrees of sequence difference that are indistinguishable in a binary comparison.

The systems and methods of the invention are effective in comparing sequences despite the presence of insertions or deletions of any length. The alignment module provides both global and local optimization to permit meaningful comparison. Single strand plots and comparisons can be generated for coding (decomposable) regions and for non-coding (indecomposable) regions having chaotic sequences or omega repeats.

DNA bases (A, T, G, and C) are used in the description below. However, it should be understood that the systems and methods of the invention are applicable not only to DNA but to all nucleotides, including RNA (uracil in place of thymine), LNA, PNA, and other synthetic nucleotide variants.

The displays shown in the figures generally depict only nucleotide sequences. As will be apparent, for coding regions the amino acid sequence corresponding to the codons can also be displayed using conventional techniques well known to those skilled in the art.

The methods of the invention include analyzing, retrieving, and displaying genomic information. The systems and methods of the invention provide tools for collecting, storing, analyzing, and retrieving genomic, proteomic, and medical data; for data mining and data visualization and display; for sequence alignment and pattern recognition; and for structure prediction. For example, the systems and methods can be used for predictive biochemical models in in silico analysis, distributed computing, diagnostics, and the design of treatment regimens.

The system of the invention consists of one or more modules. The modules and systems may be implemented on an individually operated stand-alone computer or as part of a distributed computing "system" operated by several entities. The invention also encompasses various aspects of the system, for example hardware, software, subsystems, components of subsystems, and the structure of data created, edited, or collected using the system. In addition, the invention includes analytical instrumentation associated with the methods and apparatus for gathering, generating, and displaying the relevant data, and methods of operating and using such instrumentation. Business methods that use the systems and methods of the invention, for example selling subscriptions to sequence analysis tools, are also contemplated.

Unless otherwise indicated, the practice of the embodiments described in more detail below will employ conventional methods of biology, microbiology, molecular biology, and immunology within the skill of the art. Such techniques are explained fully in the literature. All publications, patents, and patent applications cited herein, above or below, are incorporated herein by reference in their entirety.

Definitions

In describing the present invention, the following terms will be used and are intended to be defined as indicated below.

"VaSSA" stands for Variation Sequence Software Application.

"Computer" means any device capable of accepting a structured input, processing the structured input according to prescribed rules, and producing the results of the processing as output. A computer may include, for example, any device that accepts data, processes the data in accordance with one or more stored software programs, and generates results, and it typically includes input, output, storage, arithmetic, logic, and control units. Examples of a computer include: a computer; a general purpose computer; a supercomputer; a mainframe; a super mini-computer; a mini-computer; a workstation; a micro-computer; a server; an interactive television; a web appliance; a telecommunications device with Internet access; a hybrid combination of a computer and an interactive television; a portable computer; a personal digital assistant (PDA); a portable telephone; and application-specific hardware and/or software that emulates a computer, for example a programmable gate array (PGA) or a programmed digital signal processor (DSP). A computer may be stationary or portable. A computer may have a single processor or multiple processors, which may operate in parallel and/or not in parallel. A computer also refers to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer is a distributed computer system for processing information via computers linked by a network.

"Machine-accessible medium" means any storage device used to store data accessible by a computer. Examples of a machine-readable medium include: a magnetic hard disk; a floppy disk; an optical disk, such as a CD-ROM or a DVD; magnetic tape; a memory chip; and a carrier wave used to carry machine-readable electronic data, such as those used in transmitting and receiving e-mail or in accessing a network.

"Software" means prescribed rules for operating a computer. Examples of software include: software; code segments; instructions; software programs; computer programs; and programmed logic.

"Computer system" means a system having a computer, where the computer comprises a machine-readable medium embodying software to operate the computer.

"Information storage device" means an article of manufacture used to store information. An information storage device takes different forms, for example paper form and electronic form. In paper form, the information storage device comprises paper printed with the information. In electronic form, the information storage device comprises a machine-readable medium that stores the information as software, for example as data.

The following terms are not standard terms in genetics and bioinformatics.

A "string" is a sequence of characters. A sequence can be regarded as an n x 1 matrix, known as an n-tuple of objects (characters). In the case of a nucleotide sequence, for example DNA, RNA, or a synthetic or other variant, each nucleotide element has a unique position within the string, which is a discrete set.

Example: AGCAATATAGGA is a string of characters of length 12.

A "subsequence" of a string S means a sequence of characters of S that need not be contiguous in S but that retain the same order as given in S.
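As a minimal sketch of this definition, the Python function below tests whether one string is a subsequence of another by scanning S once and consuming characters in order. The function name `is_subsequence` is an illustrative choice, not a name used by VaSSA.

```python
def is_subsequence(candidate: str, s: str) -> bool:
    """Return True if the characters of candidate occur in s in the same order.

    The characters need not be contiguous in s. A single iterator over s is
    shared across all membership tests, so each match resumes the scan where
    the previous one stopped, which enforces the ordering requirement.
    """
    remaining = iter(s)
    return all(ch in remaining for ch in candidate)
```

For instance, `is_subsequence("ACG", "ACTCGT")` evaluates to `True`, whereas `is_subsequence("GCA", "ACTCGT")` evaluates to `False` because the order is not preserved.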

Example: ACG is a subsequence of ACTCGT.

"f(n) = O(g(n))": Let f(n) and g(n) be functions. Then f(n) = O(g(n)) if and only if there exists a constant c such that |f(n)| ≤ c·g(n) for all sufficiently large n.

"S₄" is the set of DNA sequences over the four nucleotides A, C, G, and T.

σ_L: S₄ → S₄, where σ_{k,L}(s₀s₁s₂···s_n···) = s₁s₂···s_n··· (here k = 1, indicating a shift by one position, and L indicates movement from left to right). Thus σ_L is a continuous DNA-valued function defined on S₄. One way to visualize the map is that it simply "forgets" the first entry of a sequence and keeps all the other entries to its right (that is, the portion s₁s₂···s_n···, which is underlined in the original). The intuitive notion of DNA continuity can be described by stating that the asymptotic linguistic variation on a small neighborhood of a DNA subsequence at any position in S₄ changes only slightly from that position. The variation can be made as small or as large as desired by decreasing or increasing the size of the neighborhood.

σ_{t,R} is the analogous map to the above that shifts to the left by t units and reads from the right. The continuity of these maps allows them to be composed.
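The left-reading shift map can be sketched on finite strings as follows. This is an illustration under assumptions: the map is defined on infinite sequences, so a finite Python string is only a truncated stand-in, and the name `shift_left` is hypothetical. The right-reading map σ_{t,R} is not sketched because the text does not pin down its action precisely.

```python
def shift_left(seq: str, k: int = 1) -> str:
    """Finite-string stand-in for sigma_{k,L}: forget the first k entries.

    Assumption: a finite string approximates the infinite sequence, so
    repeated shifting eventually yields the empty string.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return seq[k:]
```

Composing two single shifts gives the same result as one shift by two positions: `shift_left(shift_left("AGCAAT"))` and `shift_left("AGCAAT", 2)` both yield `"CAAT"`.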

Forward and backward orbits of a subsequence: The forward orbit of a subsequence z is the set of points z, σ_L(z), σ²_L(z), σ³_L(z), ···, and is denoted O⁺(z). The backward orbit of a subsequence z is the set of points z, σ_R(z), σ²_R(z), σ³_R(z), ···, and is denoted O⁻(z).
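A forward orbit can be enumerated on a finite string by applying the one-position left shift repeatedly. The sketch below is illustrative only: it assumes a finite string as a truncated stand-in for an infinite sequence, so the orbit stops when the string is exhausted, and the name `forward_orbit` is hypothetical.

```python
from typing import Iterator


def forward_orbit(z: str) -> Iterator[str]:
    """Yield z, sigma_L(z), sigma_L^2(z), ... for a finite string z.

    Each step drops the first character; iteration stops once nothing is left.
    """
    while z:
        yield z
        z = z[1:]  # one application of the left shift
```

For example, `list(forward_orbit("ACG"))` gives `["ACG", "CG", "G"]`.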

Fixed and periodic subsequences: A DNA subsequence s is a fixed subsequence for σ_L if σ_L(s) = s. A DNA subsequence s is a periodic subsequence of period n if σⁿ_L(s) = s. The least positive n is called the prime period of s. The set of all iterates of a periodic point forms a periodic orbit.

Eventual periodicity: A DNA subsequence s is eventually periodic of period n if s is not periodic but there exists m > 0 such that σⁿ⁺ⁱ_L(s) = σⁱ_L(s) for all i ≥ m. That is, σⁱ_L(s) is periodic for i ≥ m.
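The prime period can be computed when a periodic sequence is represented by one repeating unit. The sketch below assumes that representation (an infinite sequence written as its unit repeated forever), under which shifting by n positions corresponds to rotating the unit by n. The names `prime_period` and `is_fixed` are hypothetical.

```python
def prime_period(unit: str) -> int:
    """Least n > 0 such that shifting the infinite repetition of unit by n
    positions returns the same sequence.

    Assumption: the periodic sequence is given by one repeating unit, so a
    shift by n is modelled as a rotation of that unit by n characters.
    """
    if not unit:
        raise ValueError("unit must be non-empty")
    for n in range(1, len(unit) + 1):
        if unit[n:] + unit[:n] == unit:
            return n
    return len(unit)  # not reached: rotating by len(unit) always matches


def is_fixed(unit: str) -> bool:
    """A sequence is fixed under the shift exactly when its prime period is 1."""
    return prime_period(unit) == 1
```

With this representation, `prime_period("ATAT")` is 2, `prime_period("ACG")` is 3, and `is_fixed("AAAA")` is `True`.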

Forward asymptotic: Let s be a DNA subsequence that is periodic of period n. A subsequence x is forward asymptotic to s if lim_{i→∞} σⁱⁿ_L(x) = s. The stable set of s, denoted S^s(s), consists of all subsequences that are forward asymptotic to s.

"Aligner" is one version of multiple sequence alignment analysis.

"Omega Comparator" is a single- and multiple-sequence search based on the ω₀ measure.

A "spectral array" is a series of calculations that allows all the nucleotides within multiple strings to be compared, generating its own unique structure with respect to the ω₀ measure, from which the optimal linguistic behavior can be found.

The "DNA ω₀ Genetic Code Viewer" is a more refined classification of the genetic code using the measure ω₀.

A "stable analytical profiler" is a technique that defines the set of all subsequences that are forward asymptotic to a target subsequence.

An "unstable analytical profiler" is a technique that defines the set of all subsequences that are backward asymptotic to a target subsequence.

Chaos: σ_L(z) is said to be chaotic if (1) σ_L(z) has sensitive dependence with respect to the target subsequence; (2) σ_L(z) is topologically transitive; and (3) the periodic subsequences are dense with respect to the string or data set.

A "symbolic DNA orbit" is the asymptotic symbolic behavior of a target subsequence within a sequence under an iterative process.

An "analytical DNA orbit" is the asymptotic linguistic behavior of a target subsequence within a sequence.

"DNA approximation analysis" is a set of techniques that provide accurate structural behavior for low-complexity subsequences.

"Chaotic region classification" is a technique that uniquely partitions subsequence targets into three categories: (1) targets that depend sensitively on initial conditions, (2) topologically transitive targets, and (3) periodic subsequences that are dense in their DNA sequences.

The "DNA derivative" is a measurement that allows the change from one nucleotide to the next in a DNA sequence to be observed qualitatively.

"DNA bifurcation" is a technique for observing changes in subsequences under different parameters.

"DNA topological conjugacy" is a technique that shows when different mappings of σ_L(z) are completely equivalent.

A "confidence score" is a measure that ranks a family of sequences from closest to farthest from the target sequence. The omega similarity score, or ω₀ measure, is defined by an equation (not reproduced in this text) in which s_i/t_i is a non-binary function, examples of which are defined in Tables 1 and 2, and N is the number of nucleotides in the shorter of the two sequences being compared. The omega similarity score is a non-binary comparison of any two nucleotide strings, s and t, at base position i, with the value of each comparison given in a lookup table.

Embodiments of the present invention are discussed in detail below. Although specific exemplary embodiments are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations can be used without departing from the spirit and scope of the invention.

FIG. 1 shows an exemplary embodiment. The method (100) of the present invention may comprise the steps of reading a sequence file (101); selecting a target sequence and a base sequence from the file (103); comparing the target sequence to the base sequence(s) using a non-binary comparison (105); generating a similarity score (107); and writing the aligned sequences to a file (109). Optionally, the method (100) may further comprise generating a visual representation of the comparison (111); calculating an alignment ratio; and/or generating a two-dimensional single-strand plot or spectral array plot (113), a multi-strand report (115), or another plot (117).

The sequence file may be a machine-readable file containing one or more genetic sequences. Various formats for DNA sequences are acceptable. The EMBL format is acceptable; in this format a sequence file may contain multiple sequences. A sequence entry begins with an identifier line ("ID"), followed by further annotation lines. The start of the sequence is marked by a line beginning with "SQ", and the end of the sequence is marked by two slashes ("//"). The FASTA format is also acceptable. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. Many other formats, such as GCG, GenBank, and IG, may also be acceptable.
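The FASTA layout described above (a description line starting with ">" in the first column, followed by lines of sequence data) can be read with a few lines of code. This is a minimal sketch under stated assumptions, not the loader used by the invention: the function name `parse_fasta` is hypothetical, the whole description line is used as the record name, and a later record with the same description replaces an earlier one.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {description: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            # Description line: '>' must be in the first column.
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            # Sequence data may span several lines; join them later.
            records[name].append(line.strip().upper())
    return {n: "".join(parts) for n, parts in records.items()}


example = ">seq1 demo\nACGT\nacg\n>seq2\nTTGA\n"
print(parse_fasta(example))
```

Sequence lines are upper-cased so that later comparisons do not depend on letter case; that normalization is a design choice of this sketch rather than something the text requires.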

The sequence data may be in text form, for example ASCII, or in some other representation readable by the computer that executes the method of the present invention. Reading the sequence file may include typing the sequence in directly, reading it from disk, or accessing the public domain using a well-known interface such as Entrez. A file can be saved and then analyzed, or analyzed "on the fly." The user can choose to read a single file, multiple files, or an entire database, or any subsequence of any length within a file, multiple files, or an entire database.

The target is a subsequence of any length. The user can choose to run the analysis against a database, or against a file so that structural behavior can be observed. Targets are distinguished from one another in two stages. The first biological association is the alphabet that makes up the subsequence target. The second association is the omega-zero biological association.

In one embodiment, generating the spectral array plot comprises calculating ω_N; performing a radial comparison; extracting alignment coefficients; and plotting the alignment coefficients.

In another embodiment, generating the spectral array plot further comprises reversing either the base or the target, and reversing the mode (mod).

In another embodiment, performing the non-binary comparison comprises using a lookup table that contains non-binary similarity score values for a plurality of possible comparisons between two sequence elements.

In yet another embodiment, the method of the present invention comprises comparing the molecular structure of a first nucleotide to that of a second nucleotide; determining a first non-binary similarity score based on the comparison; entering the similarity score for each nucleotide in a lookup table; and using the lookup table to calculate a second non-binary similarity score that compares a target sequence (t) of nucleotides to a base sequence (s) of nucleotides.

FIG. 46 shows an embodiment of the non-binary sequence comparison system (10) of the present invention. The system (10) comprises an analysis module (200) adapted to calculate a non-binary similarity score between a first nucleotide sequence and a second nucleotide sequence, a file management module (300), a plot module (400) and, optionally, a report module (500), a user options module (600), and/or a user support module (700).

The file management module (300) of the non-binary sequence comparison system (10) manages sequence files. In one embodiment, the file management module (300) comprises a sequence load module (310) adapted to load at least one sequence file; an active sequence flush module (320) adapted to flush a sequence file from memory; and a loaded sequence flush module (330) adapted to flush loaded sequence files from memory. In another embodiment, the sequence load module (310) comprises a loaded sequence display module (312) adapted to generate and display a summary report notebook page when sequences are loaded. The summary report notebook page is adapted to display the sequence file name and the number of sequences.

In another embodiment, the plot module (400) of the non-binary sequence comparison system (10) comprises a spectral array module (410) adapted to plot alignment coefficients for a base sequence and a target sequence; a single strand module (420) adapted to plot a single strand for a base sequence and a target sequence; a slope module (430) adapted to calculate the slope for each nucleotide position in the base sequence and display a plot of the slopes; and an ω_N module (440) adapted to calculate ω_N for the base sequence and display a plot of ω_N. In a preferred embodiment, the spectral array module (410) is further adapted to calculate ω_N values for a radial comparison and to extract alignment coefficients. In another preferred embodiment, the single strand module (420) is adapted to calculate ω_N values for the base sequence and the target sequence.

In another embodiment, the report module (500) of the non-binary sequence comparison system (10) of the present invention is adapted to generate and display a sequence summary, a listing of the contents of each loaded sequence, and/or statistical information for each loaded sequence.

In yet another embodiment, the analysis module (200) of the non-binary sequence comparison system (10) comprises a sequence alignment module (201) adapted to align a target sequence to a base sequence and display an alignment report; an ω₀ module (203) adapted to calculate an ω₀ score for a sequence and display the ω₀ score; a query repeat module (205) adapted to locate multiple occurrences of the target sequence within the base sequence and display the multiple occurrences; a query omega repeat module (207) adapted to determine when repeated nucleotides are duplicates; a slope calculation module (209) adapted to calculate the slope for each nucleotide position within the base sequence and display a slope report; and a sequence comparison module (211) adapted to compare the target sequence to the base sequence and display a similarity report.
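The task attributed to the query repeat module (205), locating multiple occurrences of a target within a base sequence, can be sketched as follows. This is only an illustration under an assumption the text does not confirm: occurrences are taken to be exact matches, whereas the module may instead use an ω-based criterion. The function name `find_occurrences` is hypothetical.

```python
def find_occurrences(base: str, target: str) -> list[int]:
    """Return the 0-based start positions of every exact occurrence of target in base."""
    if not target:
        raise ValueError("target must be non-empty")
    positions = []
    start = base.find(target)
    while start != -1:
        positions.append(start)
        # Resume one position later so overlapping occurrences are also found.
        start = base.find(target, start + 1)
    return positions


print(find_occurrences("ATATATGAT", "AT"))
```

Advancing by one position rather than by the target length keeps overlapping repeats such as "ATA" within "ATATA", which seems the safer default when repeats are the object of study.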

In a preferred embodiment, the sequence alignment module (201) is further adapted to reverse the base sequence, reverse the mode, align the base and target to the shortest length, calculate an alignment ratio, and/or calculate an omega similarity score.

In another preferred embodiment, the sequence comparison module (211) is further adapted to reverse the base sequence, reverse the target sequence, reverse the mode, calculate ω_N values for each of the base and target sequences, convert the base and target sequences to binary, calculate the distance between the base sequence and the target sequence, and determine whether the distance exceeds a bound.

FIG. 2 shows the layout of a preferred module decomposition of the DNA analysis portion of the VaSSA architecture. The modules in the decomposition are discussed in more detail below. The submodules are shown in flowchart form in FIGS. 35 to 45.

Module decomposition of the VaSSA architecture

DNA Analysis Module Group 200

SSDA (Single-Strand DNA Analysis) Module Group 210

MSDA (Multi-Strand DNA Analysis) Module Group 240

SSDA (Single-Strand DNA Analysis) (FIG. 2)

DNA Approximation Module 212

Chaotic Region Classification Module 214

DNA Derivative Module 216

DNA Bifurcation Module 218

DNA Orbit Module 220

Analytical Behavior Profiler Module 222

DNA Topological Conjugacy Module 224

Structurally Stable Region Module 226

Indecomposable Region Module 228

DNA Complexity Base Module 230

DNA Aligner Module 232

MSDA (Multi-Strand DNA Analysis) (FIG. 2)

DNA Approximation Module 242

Chaotic Region Classification Module 244

DNA Derivative Module 246

DNA Bifurcation Module 248

DNA Orbit Module 250

Analytical Behavior Profiler Module 252

DNA Topological Conjugacy Module 254

Structurally Stable Region Module 256

Indecomposable Region Module 258

DNA Complexity Base Module 260

DNA Aligner Module 262

DNA Topological Conjugacy Modules 224 and 254 (FIG. 35)

a. Analytical Profiler Module 3501

b. Analytical Mapper Module (generation of analytical mappings) 3503

c. Conjugacy Comparison Module 3505

d. First Iteration Analysis Module 3507

e. Phase Portrait Generator Module 3511

DNA Approximation Modules 212 and 242 (FIG. 36)

a. Holomorphic Form Generator Module 3601

b. Approximation Constructor Module 3603

c. P&Q Coefficient Calculator Module 3605

d. JC-DNA Curve Generator Module 3607

e. Low Complexity Generator Module 3609

f. Target Classifier Module 3611

g. Symbolic DNA Orbit Module (also a child of SSDA and MSDA) 3613

h. Analytical DNA Orbit Module (also a child of SSDA and MSDA) 3615

DNA Orbit Modules 220 and 250 (Analytical DNA Orbit Module, FIG. 37)

Symbolic DNA Orbit Module 3701

a. Symbolic Flow Generator Module 3703

b. Row Difference Generator Module 3705

c. Orbit Generator Module 3707

Analytical DNA Orbit Module 3709

a. Analytical Forward Profiler Module 3711

b. Analytical Backward Profiler Module 3713

c. DNA Attractor Generator Module 3715

d. DNA Repeller Generator Module 3717

Chaotic Region Classification Modules 214 and 244 (FIG. 38)

Chaotic Region Classifier 3801

a. DNA Sensitivity Generator Module 3803

b. DNA Transitivity Generator Module 3805

c. Dense Periodic Sequence Generator Module 3807

DNA Bifurcation Modules 218 and 248 (FIG. 39)

Splitter Classifier 3901

a. DNA Transitive Splitter Profiler Module 3903

b. DNA Dense Splitter Profiler Module 3905

DNA Derivative Modules 216 and 246 (FIG. 40)

Derivative Generator Module 4001

Monotonic Generator Module 4003

a. Positive Measure Module 4005

b. Negative Measure Module 4007

Analytical Behavior Profiler Modules 222 and 252 (FIG. 41)

DNA Approximation Module 4101

Chaotic Region Classification Module 4103

DNA Derivative Module 4105

DNA Bifurcation Module 4107

DNA Orbit Module 4109

Analytical Behavior Profiler Module 4111

DNA Topological Conjugacy Module 4113

Structurally Stable Region Module 4115

Indecomposable Region Module 4117

DNA Complexity Base Module 4119

DNA Aligner Module 4121

Algebraic Structure Generator Module 4123

a. Group Generator Module 4125

b. Semi-Group Generator Module 4127

c. Ring Generator Module 4129

d. Analytic Set Generator Module 4131

Homomorphism Generator Module 4133

Isomorphism Generator Module 4135

Structurally Stable Region Modules 226 and 256 (FIG. 42)

Repeat Generator Module 4201

Forward Asymptotic Module 4203

Stability Profiler Module 4205

Indecomposable Region Modules 228 and 258 (FIG. 43)

DNA Orbit Analysis Module 4301

Non-Repeat Generator Module 4303

Indecomposable Profiler Module 4305

DNA Complexity Base Modules 230 and 260 (FIG. 44)

Repeat Generator Module 4401

Universal DNA Base Generator Module 4403

Density Generator Module 4405

DNA Aligner Modules 232 and 262 (FIG. 45)

Symbolic Aligner Module 4501

a. Single-Strand Generator Module 4503

b. Multi-Single-Strand Generator Module 4505

Omega Comparison Aligner Module 4507

a. Omega Single-Strand Generator Module 4509

b. Multi-Single-Strand Generator Module 4511

Description of the main modules of VaSSA

DNA Approximation Module 212 or 242: This module reduces the polynomial-type constructions within VaSSA. It shows that not all coefficients of f are required to perform a calculation. The approximations also generate data that can be used to visualize the linguistic structural behavior of low-complexity subsequences. The procedure is performed without losing any biological information. The approximations are of lower order, which gives faster and more accurate analysis, and the calculation gives a better fit to the original function.

Chaotic Region Classification Module 214 or 244: This module has three components: unpredictability, an element of regularity, and elements that cannot be broken down into smaller subsequences.

DNA Derivative Module 216 or 246: This module creates an environment in which monotonic changes in content can be observed as a DNA string is read from left to right and/or from right to left. When the DNA derivative is positive, the information conveyed increases. When the DNA derivative is negative, the information conveyed decreases. When the DNA derivative is zero, the information conveyed is constant.

DNA Bifurcation Module 218 or 248: This module analyzes the changes in DNA maps as they undergo parameter changes. These changes often involve periodic subsequences of the DNA, but may involve other changes as well.

DNA Orbit Module 220 or 250: The analysis of DNA sequences is mathematical in nature, but this module creates an environment that answers somewhat non-mathematical questions, such as where subsequences go and what they do when they get there. The module involves a geometric process that takes one subsequence to another, assuming that the DNA sequence is a discrete set.

Analytical Behavior Profiler Module 222 or 252: This module takes all of its child modules into account and then links them through algebraic-function methods that do not lose biological content. It then further refines the information by partitioning the dynamic information from the child modules into algebraic equivalence classes.

DNA Topological Conjugacy Module 224 or 254: This module relates data set to data set, DNA sequence to DNA sequence, and multiple DNA sequences to a DNA sequence. It creates an environment for classifying sequences that are completely equivalent and those that are not.

Structurally Stable Region Module 226 or 256: This module is concerned with understanding all orbits and identifying the set of orbits that are periodic, eventually periodic, asymptotic, and so on. It applies qualitative and/or geometric techniques to understand a given data set.

Indecomposable Region Module 228 or 258: This module is concerned with understanding all non-orbits and identifying the set of non-orbits that are not periodic, eventually periodic, or asymptotic. It applies qualitative and/or geometric techniques to understand a given data set.

DNA Complexity Base Module 230 or 260: This module generates a universal DNA set in which one can observe how a non-periodic subsequence comes arbitrarily close to another sequence. The module creates an environment in which linguistic behavior coincides in very many places, which produces linguistically dense orbits. Such an orbit is called topologically transitive.

DNA Aligner Module 232 or 262: This module is VaSSA's version of a tool-kit system for analyzing sequence alignments. The module can also be enhanced with additional biological information modules, such as symbolic DNA orbits.

FIGS. 3 to 28 show exemplary embodiments of the graphical user interface (GUI) during execution of VaSSA.

The aligned sequences can then be written back to the sequence file, or to a different file. The alignment ratio can then be calculated, which shows the proportion of the two sequences that is in alignment.

The omega similarity score (ω₀) can also be calculated. The algebraic structure of ω₀ is defined by an equation that is not reproduced in this text. The omega similarity score, or ω₀ measure, is a non-binary comparison of any two nucleotide strings, s and t. It can easily be adapted to the analysis of a single string by replacing s_i/t_i with s_i/s_{i+1} in that equation.
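The single-string variant, in which each nucleotide is compared with its successor (s_i/s_{i+1}), can be sketched at the level of the per-position values. This sketch makes two assumptions: it uses the 0/1/2 values that the first embodiment of this description assigns (identical base, same ring class, different ring class), and it stops at the list of per-position values because the equation that combines them into ω₀ is not available here. The names `compare_bases` and `adjacent_values` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ring_class(base: str) -> str:
    """Classify a DNA base as a purine or a pyrimidine."""
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"unsupported base: {base!r}")


def compare_bases(a: str, b: str) -> int:
    """0 if identical, 1 if different but same ring class, 2 if the ring class changes."""
    class_a, class_b = ring_class(a), ring_class(b)
    if a == b:
        return 0
    return 1 if class_a == class_b else 2


def adjacent_values(s: str) -> list[int]:
    """Per-position values s_i/s_{i+1} along a single string."""
    s = s.upper()
    return [compare_bases(s[i], s[i + 1]) for i in range(len(s) - 1)]


print(adjacent_values("AAGCT"))
```

A string of length n yields n - 1 values, since the last nucleotide has no successor to compare against.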

The omega similarity score can be calculated in several ways. The value of the s_i/t_i comparison is based on the chemical structure of the nucleotides of DNA. There are four possible bases in DNA: adenine (A), cytosine (C), guanine (G), and thymine (T). In RNA, thymine is replaced by uracil (U). The structures of these bases are shown in FIG. 33. The purines, adenine and guanine, have a two-ring structure, and the pyrimidines, cytosine, thymine, and uracil, have a single-ring structure. The value s_i/t_i represents the difference in structure between the various bases. Within the purine base structure there are two rings, which can be regarded as a large 6-membered ring and a small 5-membered ring. The pyrimidine structure has only one ring. The measurements can be classified into four categories: purine\purine, pyrimidine\pyrimidine, purine\pyrimidine, and pyrimidine\purine.

Traditional methods of comparing DNA sequences work by comparing the bases in a binary fashion, that is, by simply evaluating whether the bases are identical or different. In one aspect, the present invention relates to a method of comparing DNA sequences that considers not only whether the bases differ but also measures the magnitude of the difference. The present invention therefore includes non-binary methods of comparing DNA sequences.

In the first embodiment, steric considerations are primarily taken into account. In this embodiment, a value of 0 is assigned if the bases are identical; a value of 1 is assigned for the purine\purine and pyrimidine\pyrimidine arrangements, that is, when the bases differ but the ring size does not change; and a value of 2 is assigned for purine\pyrimidine and pyrimidine\purine, when the ring size of the base changes. Thus ω₀ reflects not only differences in base identity but also the degree of difference between the chemical structures of purines and pyrimidines.

The first embodiment is illustrated in Table 1, which gives the value of s/t for each base s (rows) and base t (columns):

s \ t    A    G    C    T
A        0    1    2    2
G        1    0    2    2
C        2    2    0    1
T        2    2    1    0
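Table 1 can be used directly as a lookup table when two strings are compared position by position. The following is a minimal sketch, assuming the table values given above; the names `TABLE_1` and `positionwise_values` are hypothetical, and the sketch deliberately returns only the per-position s_i/t_i values over the shorter of the two strings, since the equation that combines them into ω₀ is not available here.

```python
# Table 1 as a nested dictionary: TABLE_1[s][t] is the value of s/t.
TABLE_1 = {
    "A": {"A": 0, "G": 1, "C": 2, "T": 2},
    "G": {"A": 1, "G": 0, "C": 2, "T": 2},
    "C": {"A": 2, "G": 2, "C": 0, "T": 1},
    "T": {"A": 2, "G": 2, "C": 1, "T": 0},
}


def positionwise_values(s: str, t: str) -> list[int]:
    """Look up s_i/t_i for i = 0 .. N-1, where N is the length of the shorter string."""
    s, t = s.upper(), t.upper()
    n = min(len(s), len(t))
    return [TABLE_1[s[i]][t[i]] for i in range(n)]


# A binary comparison would only report match/mismatch; here a mismatch that
# changes the ring class (2) is distinguished from one that keeps it (1).
print(positionwise_values("GATTACA", "GACTAT"))
```

A base outside A, G, C, T raises a `KeyError`, which keeps the sketch from silently scoring symbols the table does not define.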

The second embodiment of the present invention further considers the number of elements in base s_i that are not present in base t_i at each position of the molecular structure. The purine\purine measurement compares both the large ring and the small ring. This is the case in which the molecular arrangements are most similar and the two purine molecules behave similarly with respect to the size and arrangement of their chemical elements. The measurement, referred to herein as ω₀, is calculated in one embodiment by counting the number of atoms that are present in the first sequence and not present in the second sequence. For example, if a first sequence s has a guanine ("G") nucleotide at position i and a second sequence t has an adenine ("A") nucleotide at the corresponding position, the ω₀ measure at position i (referred to herein as s_i/t_i) is calculated by determining the number of atoms in s_i that are not present in t_i and/or are present at a different position. Referring now to FIG. 33, the oxygen atom (1), the hydrogen atom (2), and the atoms of the NH₂ group (3, 4, 5) attached to the large ring of the guanine molecule, and the hydrogen (6) and carbon (7) atoms in the small ring opposite the double-bonded carbon atoms, are absent from the adenine molecule or are present at different positions. Thus s_i/t_i = 7, where s_i = G and t_i = A. Thus ω₀ reflects the degree of difference and similarity in the chemical structures of the purines. These differences and similarities are assumed to have biological significance in the coding and non-coding regions of nucleotide sequences. In other embodiments, the calculation of ω₀ may be modified to use more precise information at the bond level for each chemical element.

In calculating the omega measure, when the omega measure is identically zero, the chemistry is identical. When the omega measure is not identically zero, it provides a number that represents the number of differing chemical elements. A complete analysis for the four nucleotides is displayed in Table 2 below. In a pyrimidine\pyrimidine analysis, the s_i/t_i value is determined in a manner similar to the purine\purine measure, except that only a single ring is considered. In a purine\pyrimidine or pyrimidine\purine measurement, the large ring of the purine is compared to the ring of the pyrimidine, but the comparison is performed counterclockwise on the large ring of the purine and clockwise on the pyrimidine ring (or vice versa). The structures of the molecules are shown in FIG. 33. However, because the structure of the nucleotide components changes with respect to two rings versus one ring, the measure value does not change.

Using the second embodiment of the invention, a matrix can be generated to determine the values of s_i/t_i, as shown in Table 2:

Table 2

s/t    A    G    C    T
A      0    7    4    9
G      6    0    7    7
C      6   10    0    6
T      9    8    4    0

(Columns correspond to s_i and rows to t_i; for example, s_i = G and t_i = A gives 7.)

FIGS. 34A-34C display the results of the omega counts and some examples of the chemical elements involved. The figures demonstrate graphically why A/G is more similar than A/C and A/T, why G/A is more similar than G/C and G/T, and so on. The omega measure produces the same number for G/A and G/T, but the chemical elements involved are different. The redundancy of entries in the table is made apparent by the figures, which show the elements involved. The real-world significance of these similarities or differences is the ability to describe how similar or how different a set of sequences is without losing the integrity of traditional biological validity in the sequence alignment search of the invention. Other difference matrices may be used, based on other chemical comparisons between bases.

Given this specification, those skilled in the art will be able to construct corresponding tables for RNA and proteins.

In one embodiment, two alternative sequences t and r are compared to a native sequence s:

t = AAGCC

r = AAGAC

s = ATAGC

It is observed that r and t each differ from s in three bases. Neither is identical to s, however, and the question considered is which of r and t is more similar to s.

Using traditional methods, quantities S(s,t) and S(s,r) can be defined to compare t and r, respectively, to s. Using the general BLAST system, in which

$$S(x_i, y_j) = s(x_i, y_j) = \begin{cases} +1, & x_i = y_j \\ -\mu, & x_i \neq y_j \end{cases}$$

(where μ is a constant), the similarity scores for s and t, and for s and r, are as follows:

S(s,t) = 2 − 3μ

S(s,r) = 2 − 3μ

No obvious difference is observed.
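By way of illustration only, the match/mismatch scoring just described can be sketched as follows. The function name `match_mismatch_score` and the value chosen for μ are assumptions made for this example; the sketch scores a fixed, ungapped alignment and is not intended to reproduce any actual BLAST implementation.

```python
def match_mismatch_score(x: str, y: str, mu: float) -> float:
    """Score +1 for each identical position and -mu for each differing position."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(x, y))
    mismatches = len(x) - matches
    return matches - mu * mismatches


s, t, r = "ATAGC", "AAGCC", "AAGAC"
mu = 1.0  # arbitrary illustrative constant (assumption)

score_st = match_mismatch_score(s, t, mu)  # 2 matches, 3 mismatches
score_sr = match_mismatch_score(s, r, mu)  # 2 matches, 3 mismatches
```

Because both comparisons contain two matches and three mismatches, the two scores coincide for any value of μ, which is the point made in the text.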

Using the first embodiment of the invention as described above in connection with Table 1, the values of ω₀(s,r) and ω₀(s,t) are determined as follows:

ω₀(s,r) = (0+2+1+1+0) = 4

ω₀(s,t) = (0+2+1+2+0) = 5.

Thus, the inventors observe that a difference exists.
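The first-embodiment calculation above may be sketched as follows. This is a minimal illustration under the rule stated for Table 1 (0 for identical bases, 1 for a change within the same ring class, 2 for a change of ring class); the names `omega_steric` and `omega_sum_steric` are hypothetical and do not come from the VaSSA software.

```python
PURINES = frozenset("AG")      # two-ring bases
PYRIMIDINES = frozenset("CT")  # one-ring bases


def omega_steric(s_base: str, t_base: str) -> int:
    """Table 1 weight: 0 if identical, 1 if same ring class, 2 otherwise."""
    for base in (s_base, t_base):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unrecognized base: {base!r}")
    if s_base == t_base:
        return 0
    same_ring_class = (s_base in PURINES) == (t_base in PURINES)
    return 1 if same_ring_class else 2


def omega_sum_steric(s: str, t: str) -> int:
    """Sum the Table 1 weights position by position."""
    return sum(omega_steric(a, b) for a, b in zip(s, t))


s, t, r = "ATAGC", "AAGCC", "AAGAC"
omega_sr = omega_sum_steric(s, r)  # 0 + 2 + 1 + 1 + 0
omega_st = omega_sum_steric(s, t)  # 0 + 2 + 1 + 2 + 0
```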

Using the second embodiment of the invention as described above, the values of ω₀(s,r) and ω₀(s,t) are determined using equation (1) below (where N denotes the length of the shorter of the two sequences being compared) as follows:

(1)

ω₀(s,r) = (0+9+6+7+0)/80 = 22/80 = 0.275

ω₀(s,t) = (0+9+6+10+0)/80 = 25/80 = 0.3125

Segment r is more similar to s than t is.
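The second-embodiment arithmetic above may be reproduced with the following sketch. Two assumptions are made and should be noted: the orientation of Table 2 (columns as s_i, rows as t_i) is inferred from the stated value G/A = 7, and the denominator 80 is copied directly from the worked example rather than derived from equation (1). All names are hypothetical.

```python
from fractions import Fraction

# Table 2 as printed: one row per t_i, columns in the order A, G, C, T.
# Assumption: columns are s_i and rows are t_i (consistent with G/A = 7).
TABLE_2_ROWS = {
    "A": (0, 7, 4, 9),
    "G": (6, 0, 7, 7),
    "C": (6, 10, 0, 6),
    "T": (9, 8, 4, 0),
}
# OMEGA_2[(s_i, t_i)] is the value written s_i/t_i in the text.
OMEGA_2 = {
    (s_base, t_base): row[col]
    for t_base, row in TABLE_2_ROWS.items()
    for col, s_base in enumerate("AGCT")
}


def omega_sum(s: str, t: str) -> int:
    """Sum s_i/t_i over the positions shared by both sequences."""
    return sum(OMEGA_2[(a, b)] for a, b in zip(s, t))


s, t, r = "ATAGC", "AAGCC", "AAGAC"
DENOMINATOR = 80  # taken from the worked example, not derived here

omega_sr = Fraction(omega_sum(s, r), DENOMINATOR)
omega_st = Fraction(omega_sum(s, t), DENOMINATOR)
```

Exact fractions are used here so that the comparison of the two values does not depend on floating-point rounding.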

Because of the redundancy of integer values in the second embodiment, examining the chemistry involved in the count makes it possible to capture sequence pairs that are very different but have the same value, for example A/G versus A/C. This is an indication of the way in which the molecules communicate differently and do not convey the same information.

For sequences of whole genomes, a normalization technique is used, which is presented in equation (2) below. In a DNA sequence, each nucleotide position thus represents a unique address in the string. For short strands, a denominator is used to measure the strength of the difference. For longer strands, the normalization technique discussed below in connection with equation (2), which eliminates the exponential growth of the denominator, is used. This allows VaSSA to plot each position with respect to its unique address. The omega measure with respect to this unique position yields not only the unique structural behavior of each nucleotide but also the way in which it is profiled with respect to the strand in which it resides.

Computer program product

In an exemplary embodiment, the method of the invention may be embodied on a machine-readable medium that, when read by a machine, for example a computer, causes the machine to execute the method described above. In addition, this embodiment of the invention may provide a graphical user interface (GUI) that allows a user to compare sequences of genetic material and to further analyze the sequences and the comparison results.

For example, as shown in FIG. 3, the GUI may provide modules for file management, reporting, analysis, plotting, setting user options, and user support.

As shown in FIG. 4, the file management module (300) may further include a load sequences module that can load one or more sequence files. A file may contain a single sequence or multiple sequences. The sequences may be read from disk, CD, and the like. The sequences need not be stored and may be analyzed "on the fly" as they are received. The sequence files may be FASTA formatted or in any other format. When loaded, each sequence may be assigned a unique reference number and may be checked to ensure that all characters are valid.
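The loading step described above (reading FASTA text, assigning reference numbers, and checking characters) might be sketched as follows. The function `load_fasta`, the record layout, and the choice of A, C, G, and T as the valid alphabet are assumptions made for illustration; the specification does not state how VaSSA performs these steps.

```python
VALID_LETTERS = frozenset("ACGT")  # assumption: plain DNA alphabet


def load_fasta(text: str) -> list:
    """Parse FASTA text into records numbered 1, 2, 3, ... in file order."""
    records = []
    for block in text.split(">")[1:]:  # text before the first ">" is ignored
        lines = block.splitlines()
        if not lines:
            continue
        header = lines[0].strip()
        sequence = "".join(line.strip() for line in lines[1:]).upper()
        records.append({
            "ref": len(records) + 1,  # unique reference number
            "header": header,
            "sequence": sequence,
            "invalid": sorted(set(sequence) - VALID_LETTERS),
        })
    return records


example = ">seq1\nagca\natat\n>seq2\nACGX\n"
loaded = load_fasta(example)
```

Because the parser works on a string, the same function can be applied to text received from a file, a network stream, or any other source, in keeping with the "on the fly" analysis mentioned in the text.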

The file management module (300) may also include a flush active sequences module that can remove, or "flush", the active sequence files from memory. When flushed, the reference numbers for the sequences are preserved. The file management module (300) may also include a module that flushes loaded sequences from memory. An active sequence is a sequence on which analysis is being performed, whereas a loaded sequence is a sequence that is also in memory but on which no analysis is currently being performed.

The load sequences module may include a display loaded sequences module that can generate and display a summary report notebook page when sequences are loaded. As shown in FIG. 5, the summary report notebook page may display the sequence file name and the number of sequences.

The report module (500) may generate and display a sequence summary of all loaded sequences, including the unique reference number, sequence header, and sequence length (FIG. 6); a listing of the content of each loaded sequence, including the unique reference number and the sequence content in FASTA format (FIG. 7); and/or statistical information about each loaded sequence, including the unique reference number, sequence header, and a count of each standard sequence character (FIG. 8). If a sequence character is not recognized, the report module generates an error signal that is listed in the "Error" column of the statistical information for each loaded sequence (FIG. 8).
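The per-sequence statistics described above could be tabulated along the following lines. This is only a sketch: the function `sequence_statistics`, the row layout, and the decision to report unrecognized characters as a count in the "Error" field are assumptions, since the specification states only that an error signal is listed in that column.

```python
from collections import Counter

STANDARD_LETTERS = "ACGT"  # assumption: standard DNA characters


def sequence_statistics(ref: int, header: str, sequence: str) -> dict:
    """Count each standard letter; tally everything else under 'Error'."""
    counts = Counter(sequence.upper())
    row = {"ref": ref, "header": header}
    for letter in STANDARD_LETTERS:
        row[letter] = counts.get(letter, 0)
    row["Error"] = sum(
        n for letter, n in counts.items() if letter not in STANDARD_LETTERS
    )
    return row


# SEQ ID NO: 1 from the sequence listing (12 bases).
stats = sequence_statistics(1, "Synthetic oligonucleotide", "agcaatatagga")
```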

The analysis module (200) may include a number of submodules. For example, a sequence alignment submodule may align a target sequence to a base sequence and display an alignment report (FIG. 9). The sequence alignment submodule may also reverse the base sequence, reverse the mode, align the base and target to the shortest length, calculate an alignment ratio, or calculate an omega similarity score (FIG. 10). The omega similarity score can be used to determine whether, and to what degree, the target is similar to the base. If the omega similarity score is less than 1/2ⁿ (where n is the maximum length of the two sequences s and t), the two sequences are said to be similar. If the omega similarity score exceeds 1/2ⁿ, the sequences are said to be dissimilar.
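The stated decision rule can be expressed as a short check. The function name `is_similar` is hypothetical, and the treatment of a score exactly equal to 1/2ⁿ (classified here as not similar) is an assumption, since the text addresses only the strictly smaller and strictly larger cases.

```python
from fractions import Fraction


def is_similar(score, s: str, t: str) -> bool:
    """Return True when score < 1/2**n, with n the length of the longer sequence."""
    n = max(len(s), len(t))
    threshold = Fraction(1, 2 ** n)  # exact, so long sequences do not underflow
    return Fraction(score) < threshold


s, r = "ATAGC", "AAGAC"
identical_case = is_similar(0, s, s)                  # score 0 is below 1/32
worked_example = is_similar(Fraction(22, 80), s, r)   # 0.275 is not below 1/32
```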

The tasks of the menu options in the VaSSA analysis menu include, but are not limited to, the following:

1. Reverse base

Under the analysis menu of VaSSA there is a reverse base option. One function of reverse base is to allow the user to reverse the direction of a sequence. For example, if the sequence is in the 5' to 3' direction, the reverse base function reads it in the 3' to 5' direction (but not in the direction of the complementary strand).

2. Reverse mode

The function of the reverse mode option is to allow the mode calculation to be reversed. "Reversing the mode calculation" means changing s_i/t_i to t_i/s_i. This is important because, by definition, ω₀ is not a symmetric operation.

3. Align base and target sequences to shortest length

The base and the target are two sequence strings of different or equal lengths. If the strings are of different lengths, the first part of the analysis is to align them and stop at the end of the shortest sequence. If they are the same length, the sequence analysis is performed to the end of each string.

4. Calculate alpha number alignment ratio and omega similarity score

Alpha number alignment is an alignment that provides a ratio: the total number of aligned nucleotides over the total number of nucleotides. As shown in FIG. 13, an omega sub-zero (ω₀) module may calculate ω₀ scores for the sequences and display the ω₀ scores. One base, or all loaded sequences, may be selected. The report may be sorted by reference number, length, or omega score (FIG. 14). The base sequence and the mode may each be reversed.
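The first three menu tasks listed above may be illustrated with the following sketch. The function names are hypothetical, and the orientation of Table 2 (columns as s_i, rows as t_i) is an assumption inferred from the stated value G/A = 7.

```python
# Table 2 as printed: one row per t_i, columns in the order A, G, C, T.
TABLE_2_ROWS = {
    "A": (0, 7, 4, 9),
    "G": (6, 0, 7, 7),
    "C": (6, 10, 0, 6),
    "T": (9, 8, 4, 0),
}
OMEGA_2 = {
    (s_base, t_base): row[col]
    for t_base, row in TABLE_2_ROWS.items()
    for col, s_base in enumerate("AGCT")
}


def reverse_base(sequence: str) -> str:
    """Read the sequence in the opposite direction (no complementing)."""
    return sequence[::-1]


def align_to_shortest(base: str, target: str) -> tuple:
    """Truncate both strings at the end of the shorter one."""
    n = min(len(base), len(target))
    return base[:n], target[:n]


def position_values(base: str, target: str, reverse_mode: bool = False) -> list:
    """Per-position s_i/t_i values, or t_i/s_i when the mode is reversed."""
    base, target = align_to_shortest(base, target)
    pairs = zip(target, base) if reverse_mode else zip(base, target)
    return [OMEGA_2[pair] for pair in pairs]


forward = position_values("GA", "AG")
reversed_mode = position_values("GA", "AG", reverse_mode=True)
```

The forward and reversed results differ for the same pair of strings, which illustrates why the text notes that ω₀ is not a symmetric operation.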

ω₀ values may also be calculated by a single strand module for the base sequence and the target sequence. Consider the following single-strand equation, which is a simplified version of equation 6 (the multiple-strand form of the equation is discussed below):

(2)

where z_l denotes a single strand; that is, z_l = s₀s₁···s_k···, where each s_k is A, G, C, or T.

z_l^{λ_i} corresponds to the nucleotides in the λ_i-th position and the (λ_i+1)-th position (where i is a number in the index set l = 1, 2, 3, ...).

The coefficient c_{λ_i} = s_i/s_{i+1} is for the λ_i-th position and the (i+1)-th position (where i is a number in the index set l = 1, 2, 3, ...).

Thus, for an exemplary four-nucleotide strand z_l = ACGT, C_l(z_l) is an array of coefficients [c₀, c₁, c₂], where each coefficient is calculated by determining z_l^{λ_i}/z_l^{λ_i+1} for position i in the strand (except for the last position), which in this case equals [A/C, C/G, G/T] = [6, 7, 8]. These coefficients can be used to form a single-strand plot for strand z_l, in which the position within the strand (i.e., the value of l) is plotted on the x-axis and the value of the corresponding coefficient is plotted on the y-axis (an example of a single-strand plot for two strands is shown in FIG. 27).
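The coefficient array for the exemplary strand can be reproduced with the following sketch. The function name `single_strand_coefficients` is hypothetical, and the orientation of Table 2 (columns as s_i, rows as t_i) is an assumption that is consistent with the values [6, 7, 8] given in the text.

```python
# Table 2 as printed: one row per t_i, columns in the order A, G, C, T.
TABLE_2_ROWS = {
    "A": (0, 7, 4, 9),
    "G": (6, 0, 7, 7),
    "C": (6, 10, 0, 6),
    "T": (9, 8, 4, 0),
}
OMEGA_2 = {
    (s_base, t_base): row[col]
    for t_base, row in TABLE_2_ROWS.items()
    for col, s_base in enumerate("AGCT")
}


def single_strand_coefficients(strand: str) -> list:
    """Return s_i/s_{i+1} for every position except the last."""
    return [
        OMEGA_2[(strand[i], strand[i + 1])]
        for i in range(len(strand) - 1)
    ]


coefficients = single_strand_coefficients("ACGT")  # [A/C, C/G, G/T]
# For a single-strand plot, x would be the position index and y the coefficient.
plot_points = list(enumerate(coefficients))
```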

A query repeat module may locate multiple occurrences of a user-specified target sequence within a base sequence and display those occurrences. Multiple occurrences of a target sequence are referred to herein as repeats. VaSSA has two kinds of repeats: repeats and omega repeats. Repeats use only a shift function on the symbols, whereas omega repeats use a shift function on the measure of omega similarity. As shown in FIG. 11, the user may select a base sequence to search and a target sequence to search for. The user may specify a threshold to relax or tighten the search. The base or target sequence may also be reversed. The query repeat module may then generate sub-targets when the user specifies a threshold, and identify the positions within the base at which the target or the sub-targets occur. In one embodiment, if the target is AGCT, the query repeat module may generate the sub-targets AGC and GCT. As shown in FIG. 12, the repeat target and sub-targets are identified at the top of the GUI window page, together with the number of times each was detected. Occurrences of the target sequence are identified by a hat symbol 1201, and occurrences of a sub-target sequence are identified by an asterisk symbol 1202.
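A symbol-level repeat search of the kind described above might look like the following sketch. It is assumed here, purely for illustration, that the user threshold acts as a minimum sub-target length; the specification does not define the threshold precisely. The function names and the example base string are likewise hypothetical.

```python
def find_occurrences(base: str, pattern: str) -> list:
    """Return 0-based start positions of pattern in base, overlaps included."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    positions = []
    start = base.find(pattern)
    while start != -1:
        positions.append(start)
        start = base.find(pattern, start + 1)
    return positions


def sub_targets(target: str, min_length: int) -> list:
    """Contiguous substrings shorter than target, no shorter than min_length."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    subs = []
    for length in range(len(target) - 1, min_length - 1, -1):
        for i in range(len(target) - length + 1):
            candidate = target[i:i + length]
            if candidate not in subs:
                subs.append(candidate)
    return subs


base = "AGCTTAGCAGCT"  # made-up base sequence for the example
target = "AGCT"
report = {target: find_occurrences(base, target)}
for sub in sub_targets(target, min_length=3):
    report[sub] = find_occurrences(base, sub)
```

The number of detections reported for each target or sub-target is then simply the length of the corresponding position list.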

As shown in FIGS. 15 and 16, a query omega repeat module provides everything mentioned above with respect to the query repeat module. In addition, however, this module discovers how repeated nucleotides in one segment of a string may communicate differently (at least with respect to the omega measure) in another segment of the string. Thus, query omega repeat can discover when a repeat is a duplicate and when it is not.

As shown in FIGS. 17 and 18, a slope calculation module may calculate the slope for each nucleotide position within the base sequence and display a slope report. In an exemplary embodiment, the slope may be calculated using the following formula:

$$\Omega_k = s_k/s_{k+1} - s_{k-1}/s_k \tag{3}$$

where k denotes the unique position of a nucleotide in the DNA sequence, and ω_k = s_k/s_{k+1} (ω_k is the k-th term in the ω₀ series). This equation can be used to generate information on curvature in the 2-D profile. When Ω_k is positive, the information conveyed is increasing, and the bonds connecting the double strand are longer (and thus tend to be weaker than shorter ones). When Ω_k is negative, the information conveyed is decreasing, and the bonds connecting the double helix are shorter (and tend to be stronger). Thus, within the plot of positives and negatives there is a profile of the information flow from one position to the next in the sequence. The slope graph is a plot of the change in information flow. It shows where the change in information within the sequence is the same (zero on the sign chart) and where it differs. It also shows where the information is exactly the same but in the opposite direction. To generate the graph (an example of which is shown in FIG. 30), the nucleotide positions are plotted against the values of the slope. Thus, equation 3 generates the sign chart and the slope plot in VaSSA. In both cases, the unique position of the nucleotide in the strand corresponds to the x-axis and the value of Ω_k corresponds to the y-axis.

In one embodiment, for the sequence AGC, the change from A to G would be calculated as follows: A is at position k-1, G is at k, and C is at k+1. Thus, based on the values in Table 2, Ω_k is G/C − A/G = 10 − 6 = 4. The change from A to G is therefore positive and may be indicated by a "+" in the slope report.
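The slope of equation (3) and the sign shown in the slope report can be sketched as follows. The names `slopes` and `sign_symbol` are hypothetical, positions are 0-based in the code, and the orientation of Table 2 (columns as s_i, rows as t_i) is an assumption consistent with G/C = 10 and A/G = 6 in the example above.

```python
# Table 2 as printed: one row per t_i, columns in the order A, G, C, T.
TABLE_2_ROWS = {
    "A": (0, 7, 4, 9),
    "G": (6, 0, 7, 7),
    "C": (6, 10, 0, 6),
    "T": (9, 8, 4, 0),
}
OMEGA_2 = {
    (s_base, t_base): row[col]
    for t_base, row in TABLE_2_ROWS.items()
    for col, s_base in enumerate("AGCT")
}


def slopes(strand: str) -> dict:
    """Return {k: s_k/s_{k+1} - s_{k-1}/s_k} for each interior position k."""
    result = {}
    for k in range(1, len(strand) - 1):
        forward = OMEGA_2[(strand[k], strand[k + 1])]
        backward = OMEGA_2[(strand[k - 1], strand[k])]
        result[k] = forward - backward
    return result


def sign_symbol(value: int) -> str:
    """Symbol for the sign chart: '+', '-', or '0'."""
    if value > 0:
        return "+"
    if value < 0:
        return "-"
    return "0"


agc = slopes("AGC")  # G/C - A/G at k = 1
sign_chart = {k: sign_symbol(v) for k, v in agc.items()}
```

The first and last positions have no slope in this sketch because equation (3) requires both a preceding and a following nucleotide.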

As shown in FIGS. 19 and 20, a sequence comparison submodule may compare a target sequence to a base sequence and display a similarity report. The sequence comparison submodule may also reverse the base sequence, reverse the target sequence, reverse the mode, calculate ω_n values for each of the base and target sequences, convert the base and target sequences to binary, calculate the distance between the base sequence and the target sequence, and determine whether the distance exceeds a bound.

As shown in FIGS. 21-25, the plot module may include a number of plotting submodules. For example, a spectral array submodule may plot alignment coefficients for a base sequence and a target sequence. The spectral array submodule may also calculate ω_n values for radial comparison and extract the alignment coefficients. For the radial comparison, the spectral array submodule may use equation (4):

(4)

where

(5)

The above formula is for multiple sequences. It allows the generation of a unique spectral analysis, and the notation is that used for multiple sums with respect to l. These are the coefficients generated within each sequence with respect to ω₀ for their positions. The nucleotides at each sequence position are denoted Z_1^{λ_1l} Z_2^{λ_2l} ··· Z_n^{λ_nl}.

The formulation of equations 4 and 5 allows the generation of plots in VaSSA. The coefficient structure of the formula can be captured in the triangular structure shown in FIG. 25. The spectral structure is triangular and allows optimization to be observed without inserting or deleting spaces in the strands of DNA. FIG. 24 demonstrates how the coefficients are generated when the formula is applied using two strands. A single-strand plot has the same structure but different values. Because of the non-binary measure, it is possible to observe exactly where the plots are equal and where they differ. It is also possible to observe where there is periodicity. Because the function is analytic, it can be formulated as a shift without affecting the uniqueness of the nucleotide positions. One embodiment is shown in FIG. 27. The spectral array plot in VaSSA uses the coefficients directly below the center of the triangular structure in FIG. 25. An example of such a plot is FIG. 22. It carries information about where the strands align directly, because the graph is zero there. There are also spikes with particular heights. Similar information can be observed in a single-strand plot; here, however, the magnitude of the difference can be visualized in terms of the height of the spikes. In addition, by using pointers in the triangle, a phase description can be completed, which is a different approach from completing an optimization.

As shown in FIGS. 26-28, a single strand submodule may plot single strands for a base sequence and a target sequence. The single strand submodule may also calculate ω_n values for the base and target sequences. The single strand submodule may plot using equation (4).

(6)

Here, equation (6) is a simplified version of equation (5). However, this equation allows a single strand to be profiled.

As shown in FIGS. 29-30, a slope module may calculate the slope for each nucleotide position within the base sequence and display a plot of the slopes. An ω_n module may calculate ω_n for the base sequence and display a plot of ω_n. The ω_n module may use equation (6).

Generating a plot of the slopes produces a plot such as that of FIG. 30. The slope plot is a graph of the monotonicity of the information flow. The plot allows the user to determine the positions of local and global maxima and minima on the single-strand plot. It also allows the user to determine concavity in local and global regions of the single-strand plot.

While various embodiments of the invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the invention should not be limited by any of the above-described exemplary embodiments, but should instead be defined in accordance with the following claims and their equivalents.

SEQUENCE LISTING <110> CLARK, JEFFREY M. <120> SYSTEM, METHOD AND COMPUTER PROGRAM FOR NON-BINARY SEQUENCE COMPARISON <130> 5442-001 <140> 11/378,284 <141> 2006-03-20 <150> 60/662,943 <151> 2005-03-18 <160> 8 <170> PatentIn Ver. 3.3 <210> 1 <211> 12 <212> DNA <213> Artificial Sequence <220> <223> Description of Artificial Sequence: Synthetic oligonucleotide <400> 1 agcaatatag ga 12 <210> 2 <211> 288 <212> DNA <213> Homo sapiens <400> 2 ggccgggcgc ggtggctcac gcctgtaatc ccagcacttt gggaggccga ggcgggcgga 60 tcacgaggtc aggagatcga gaccatcctg gctaacacgg tgaaaccccg tctctactaa 120 aaatacaaaa attagccggg cgtggtggcg ggcgcctgta gtcccagcta ctcgggaggc 180 tgaggcagga gaatggcgtg aacccgggag gcggagcttg cagtgagccg agatcgcgcc 240 actgcactcc agcctgggcg acagagcgag actccgtctc aaaaaaaa 288 <210> 3 <211> 290 <212> DNA <213> Homo sapiens <400> 3 ggccgggcgc ggtggctcac gcctgtaatc ccagcacttt gggaggccga ggcgggcgga 60 tcacctgagg tcaggagttc gagaccagcc tggccaacat ggtgaaaccc cgtctctact 120 aaaaatacaa aaattagccg ggcgtggtgg cgcgcgcctg taatcccagc tactcgggag 180 gctgaggcag gagaatcgct tgaacccggg aggcggaggt tgcagtgagc cgagatcgcg 240 ccactgcact ccagcctggg cgacagagcg agactccgtc tcaaaaaaaa 290 <210> 4 <211> 291 <212> DNA <213> Homo sapiens <400> 4 ggccgggcgc ggtggctcac gcctgtaatc ccagcacttt gggaggccga ggcgggcgga 60 tcacctgagg tcgggagttc gagaccagcc tgaccaacat ggagaaaccc cgtctctact 120 aaaaatacaa aaattagccg ggcgtggtgg cgcatgcctg taatcccagc tactcgggag 180 gctgaggcag gagaatcgct tgaacccggg aggcggaggt tgcggtgagc cgagatcgcg 240 ccattgcact ccagcctggg caacaagagc gaaactccgt ctcaaaaaaa a 291 <210> 5 <211> 288 <212> DNA <213> Homo sapiens <400> 5 ggccgggcgc ggtggctcac gcctgtaatc ccagcacttt gggaggccga ggcgggcgga 60 tcacgaggtc aggagatcga gaccatcccg gctaaaacgg tgaaaccccg tctctactaa 120 aaatacaaaa attagccggg cgtagtggcg ggcgcctgta gtcccagcta cttgggaggc 180 tgaggcagga gaatggcgtg aacccgggag gcggagcttg cagtgagccg agatcccgcc 240 actgcactcc agcctgggcg acagagcgag actccgtctc aaaaaaaa 288 <210> 6 <211> 288 <212> DNA <213> Homo 
sapiens <400> 6 ggccgggcac ggtcgtcacg cccgtcatcc cagcacttcg agaggccgag gtgggcgcat 60 cacaggaggt cgggagttgg agaccagcct gagcaacatg gagagacacg gcgtgcctac 120 taaaaacaca aacatcagcc aagccaggcg tgggggtgcc tccctgtcaa tccccgctaa 180 tcgggaggct ggaggcagga gaagcgctcg aacccgggag gcggaaggtg cggtgagcca 240 agatcgcgcc attgcactct agccgtggaa acaagagtga aactctgt 288 <210> 7 <211> 360 <212> DNA <213> Homo sapiens <400> 7 ttctttcatg gggaagcaga tttgggtacc acccaagtat tgactcaccc atcaacaacc 60 gctatgtatt tcgtacatta ctgccagcca ccatgaatat tgtacagtac cataaatact 120 tgaccacctg tagtacataa aaacccaatc cacatcaaaa ccctcccctc atgcttacaa 180 gcaagtacag caatcaacct tcaactatca cacatcaact gcaactccaa agccacctct 240 cccccactag gataccaaca aagctaccca tccttaacag tacatagcac ataaagccat 300 ttaccgtaca tagcacatta cagtcaaatc ccttcctgtc cccatggatg acccccctca 360 <210> 8 <211> 360 <212> DNA <213> Homo sapiens <400> 8 ttctttcatg gggaagcaga tttgggtacc acccaagtat tgactcaccc atcaacaacc 60 gctatgtatt tcgtacatta ctgccagcca ccatgaatat tgtacggtac cataaatact 120 tgaccacctg tagtacataa aaacccaacc cacatcaaac cccccccccc atgcttacaa 180 gcaagtacag caatcaacct tcaactatca cacatcaact gcaactccaa agccacccct 240 cacccactag gataccaaca aacctaccca cccttaacag tacatagtac ataaagccat 300 ttaccgtaca tagcacatta cagtcaaatc ccttctcgtc cccatggatg acccccctca 360

Claims

1. A system for sequence analysis, comprising:

an analysis module adapted to calculate a non-binary similarity score between a first nucleotide sequence and a second nucleotide sequence; and

an output in communication with the analysis module for outputting the similarity score.

2. The system of claim 1, wherein the similarity score is based on a combination of similarity scores for each base pair.

3. The system of claim 2, wherein the similarity score for the base pairs is dependent on the similarity of the chemical structure of the base pairs.

4. The system of claim 3, wherein the similarity score for a base pair is a first value if the nucleotides of the base pair match, a second value if the nucleotides of the base pair do not match but have the same structure, and a third value if the nucleotides neither match nor have the same structure, and wherein the first, second, and third values are different.
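The three-valued scoring recited in this claim can be illustrated with a minimal Python sketch. Several points are assumptions made only for illustration and are not taken from the claims: "same structure" is read as both nucleotides being purines (a, g) or both being pyrimidines (c, t); the values 1.0, 0.5, and 0.0 are hypothetical; and averaging the per-position scores is only one possible way to combine them. The names `base_pair_score` and `similarity` are likewise hypothetical.

```python
# Hypothetical structural classes: purines (a, g) and pyrimidines (c, t).
PURINES = frozenset("ag")
PYRIMIDINES = frozenset("ct")

# Hypothetical score values; the claim only requires that the three differ.
MATCH, SAME_STRUCTURE, DIFFERENT = 1.0, 0.5, 0.0


def base_pair_score(a: str, b: str) -> float:
    """Return a three-valued (non-binary) score for one pair of nucleotides."""
    a, b = a.lower(), b.lower()
    if a == b:
        return MATCH
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return SAME_STRUCTURE
    return DIFFERENT


def similarity(first: str, second: str) -> float:
    """Combine per-position scores by averaging over the shorter length."""
    n = min(len(first), len(second))
    if n == 0:
        return 0.0
    return sum(base_pair_score(x, y) for x, y in zip(first, second)) / n
```

Under these assumed values, a binary comparison would score the pairs a/g and a/c identically as mismatches, whereas this sketch gives them different scores.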

5. The system of claim 3, wherein the similarity score for the base pairs is determined based on the relative position of the base pairs.

6. The system of claim 3, wherein the similarity score for the base pairs is based on the number of elements in the nucleotides of the first sequence that are not present in the nucleotides of the second sequence.

7. The system of claim 1, further comprising a report module, a file management module, and a plot module.

8. The system of claim 7, further comprising a user option module or a user help module, or both.

9. The system of claim 1, wherein the file management module comprises:

a sequence load module adapted to load one or more sequence files;

an active sequence flush module adapted to flush the sequence file from memory; and

a loaded sequence flush module adapted to flush a loaded sequence file from the memory.

10. The system of claim 9, wherein the sequence load module comprises a loaded sequence display module adapted to generate and display a summary report notebook page when a sequence is loaded, wherein the summary report notebook page is adapted to display a sequence file name and the number of sequences.

11. The system of claim 1, wherein the report module is adapted to generate and display one or more of a sequence summary, a list of the content of each loaded sequence, or statistical information for each loaded sequence.

12. The system of claim 1, wherein the analysis module comprises:

a sequence alignment module adapted to align the target sequence to the base sequence and display an alignment report;

a ω₀ module adapted to calculate a ω₀ score for the sequence and display the ω₀ score;

a query repeat module adapted to locate multiple occurrences of the target sequence within the base sequence and display the multiple occurrences;

a query omega repeat module adapted to determine when the repeated nucleotide is a duplicate;

a gradient calculation module adapted to calculate a slope for each nucleotide position within the base sequence and display a slope report; and

a sequence comparison module adapted to compare the target sequence with the base sequence and display a similarity report.
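As an illustration of what the query repeat module in this claim is described as doing, the following Python sketch locates every occurrence of a target within a base sequence. It assumes exact, case-insensitive matching and allows overlapping occurrences; the claims do not specify the matching rule, so these are assumptions, and the function name `find_occurrences` is hypothetical. The example sequence is SEQ ID NO: 1 from the sequence listing.

```python
from typing import List


def find_occurrences(base: str, target: str) -> List[int]:
    """Return 0-based start positions of every (possibly overlapping)
    exact occurrence of target within base."""
    base, target = base.lower(), target.lower()
    positions: List[int] = []
    if not target:
        return positions
    start = base.find(target)
    while start != -1:
        positions.append(start)
        # Advance by one position so overlapping occurrences are found too.
        start = base.find(target, start + 1)
    return positions


if __name__ == "__main__":
    seq1 = "agcaatatagga"  # SEQ ID NO: 1 from the sequence listing
    print(find_occurrences(seq1, "ata"))  # overlapping hits at 4 and 6
```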

13. The system of claim 12, wherein the sequence alignment module is further adapted to perform one or more of: reversing the base sequence, reverse mode, aligning the base and the target to the shortest length, calculating an alignment ratio, or calculating an omega similarity score.

14. The system of claim 12, wherein the sequence comparison module is further adapted to perform one or more of: reversing the base sequence; reversing the target sequence; reverse mode; or calculating a ω₀ value for each of said base sequence and said target sequence.

15. The system of claim 1, wherein the plot module comprises:

a spectral array module adapted to plot alignment coefficients against the base and target sequences;

a single stranded module adapted to plot a single strand against the base sequence and the target sequence;

a slope module adapted to calculate a slope for each nucleotide position within the base sequence and display a plot of the slope; and

a ω_N module adapted to calculate ω_N for the nucleotide sequence and display a plot of the ω_N.

16. The system of claim 15, wherein the spectral array module is further adapted to calculate ω_N values for radial comparison and extract alignment coefficients.

17. The system of claim 15, wherein the single stranded module is further adapted to calculate ω_N values for the base sequence and the target sequence.

18. The system of claim 1, wherein the analysis module comprises a single stranded DNA analysis module and a multiple stranded DNA analysis module.

19. The system of claim 18, wherein each of the single stranded DNA analysis module and the multiple stranded DNA analysis module comprises at least one module selected from the group consisting of a DNA approximation module, a chaotic region classification module, a DNA induction module, a DNA branching module, a DNA orbit module, an analytical behavior profiler module, a DNA topological conjugacy module, a structural stable region module, a non-degradable region module, a DNA complexity base module, and a DNA aligner module.

20. The system of claim 19, wherein the DNA approximation module further comprises at least one module selected from the group consisting of a holomorphic shape generator module, an approximation constructor module, a P & Q coefficient calculator module, a JC-DNA curve generator module, a low complexity generator module, a target classifier module, a symbolic DNA orbit module, and an analytical DNA orbit module.

21. The system of claim 19, wherein the chaotic region classification module further comprises one or more modules selected from the group consisting of a DNA sensitivity generator module, a DNA transition generator module, and a dense periodic sequence generator module.

22. The system of claim 19, wherein the DNA induction module further comprises one or more modules selected from the group consisting of an induction generator module and a monotonic generator module, wherein the forge generator module comprises a positive measure module and a negative measure module.

23. The system of claim 19, wherein said DNA branching module further comprises one or more modules selected from the group consisting of a DNA transitivity splitter profiler module and a DNA dense splitter profiler module.

24. The system of claim 19, wherein the DNA orbit module further comprises one or more modules selected from the group consisting of symbolic DNA orbit modules and analytical DNA orbit modules.

25. The system of claim 24, wherein the symbolic DNA orbit module comprises a symbolic flow generator module, a row difference generator module, and an orbit generator module, and wherein the analytical DNA orbit module comprises an analytical forward profiler module, an analytical backward profiler module, a DNA retractor generator module, and a DNA retractor generator module.

26. The system of claim 19, wherein the analytical behavior profiler module further comprises at least one module selected from the group consisting of an algebraic structure generator module, a homomorphism-generator module, and an isomorphism-generator module.

27. The system of claim 19, wherein the DNA topological conjugacy module further comprises one or more modules selected from the group consisting of an analytical profiler module, an analytic mapper module, a conjugate comparison module, a first iterative analysis module, and a phase portrait generator module.

28. The system of claim 19, wherein the structural stable region module further comprises one or more modules selected from the group consisting of a repeat generator module, a forward asymptotic module, and a stability profiler module.

29. The system of claim 19, wherein the non-degradable region module further comprises one or more modules selected from the group consisting of a DNA orbit analysis module, a non-repeat generator module, and a non-degradable profiler module.

30. The system of claim 19, wherein said DNA complexity base module further comprises one or more modules selected from the group consisting of a repeat generator module, a universal DNA based generator module, and a density generator module.

31. The system of claim 19, wherein the DNA aligner module further comprises one or more modules selected from the group consisting of a symbol aligner module and an omega compare aligner module.

32. A sequence analysis method, comprising:

reading a sequence file;

selecting a target sequence and a base sequence from the file;

performing a non-binary comparison between each base pair of the target and base sequences to generate a comparison value for each base pair; and

determining similarity between the target and the base sequence based on the comparison values.

33. The method of claim 32, further comprising recording the aligned sequence in the file, and calculating the alignment ratio.

34. The method of claim 32, further comprising generating one or more of a two-dimensional spectral array plot or a two-dimensional single strand plot.

35. The method of claim 34, wherein generating the spectral array plot includes calculating ω_N, performing a radial comparison, extracting alignment coefficients, and plotting the alignment coefficients.

36. The method of claim 35, further comprising reversing one of the base or the target, and reversing the calculation.

37. The method of claim 32, wherein performing the non-binary comparison comprises using a look-up table that includes non-binary similarity score values for a plurality of possible comparisons between two sequence components.
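A look-up table of the kind recited in this claim might be precomputed once for every ordered pair of sequence components and then indexed at each position, as in the Python sketch below. The four-letter alphabet, the purine/pyrimidine grouping, and the score values 1.0, 0.5, and 0.0 are assumptions chosen only for illustration; `build_table` and `compare` are hypothetical names, and the table actually used by the invention is not given in the claims.

```python
from itertools import product

BASES = "acgt"
PURINES = {"a", "g"}  # assumed grouping; c and t are treated as pyrimidines


def build_table(match=1.0, same_class=0.5, different=0.0):
    """Precompute a score for every ordered pair of bases (16 entries).
    The default values are hypothetical."""
    table = {}
    for x, y in product(BASES, repeat=2):
        if x == y:
            table[(x, y)] = match
        elif (x in PURINES) == (y in PURINES):
            table[(x, y)] = same_class
        else:
            table[(x, y)] = different
    return table


SCORE_TABLE = build_table()


def compare(target, base, table=SCORE_TABLE):
    """Look up one comparison value per aligned position."""
    return [table[(t, b)] for t, b in zip(target.lower(), base.lower())]
```

Keeping the scores in a table rather than in branching logic means a different scoring scheme can be swapped in by passing another table to `compare`.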

38. The method of claim 32, wherein similarity is

determined by.