TWI326431B

TWI326431B - Method and system of analyzing gene sequence

Info

Publication number: TWI326431B
Application number: TW96115338A
Authority: TW
Inventors: Hahn Ming Lee; Pai Ling Lo; Che Fu Yeh; Chiahsin Huang
Original assignee: Univ Nat Taiwan Science Tech
Priority date: 2007-04-30
Filing date: 2007-04-30
Publication date: 2010-06-21
Also published as: TW200842735A

Description

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates to gene sequence analysis, and more particularly to a method that analyzes the features of a gene sequence in order to find the binding regions on that sequence that are regulated by gene regulatory factors.

[Prior Art]

Interactions between gene regulatory factors and genes control important physiological functions such as development and the response to environmental change. Finding the fragments of a gene on which regulatory factors act is therefore an important research topic, both biologically and medically.

Traditionally, the fragments of a gene acted on by regulatory factors, that is, the binding sites of regulatory factors on the gene, have been sought with biological methods such as DNA footprinting. Footprinting uses chemical agents to cut the gene sequence under study into fragments, and each fragment is then analyzed to determine whether it is a regulatory region. This method is time-consuming and laborious.

In recent years, computer techniques have also been used to assist gene sequence analysis, for example the position weight matrix method. The position weight matrix method analyzes known regulatory regions, counts how many times each nucleotide occurs at each position to obtain a weight for that position, and assembles the weights into a matrix. The method assumes that all regulatory regions have a fixed length and that each position contributes to the binding affinity of the regulatory region independently of the others.

Client's Docket No. 0950067tw; TT's Docket No: 0912-0912-A50871TWf (submission version)/alicewu

From the standpoint that positions in a gene sequence influence one another, however, these assumptions are not correct. The characteristic sequences that bind regulatory factors are usually small fragments that appear in the upstream sequence of a gene; they may lie in any part of it, and their lengths may differ. A gene sequence analysis method that is fast and consistent with biological principles is therefore needed.

[Summary of the Invention]

An object of the invention is to provide a method that finds, by analyzing the features of a gene sequence, the binding regions on the gene sequence that are regulated by gene regulatory factors.

To achieve this object, the invention provides a gene sequence analysis system for finding regulatory factor binding regions (binding sites) in an analysis target fragment. A sub-fragment generator receives the analysis target fragment and cuts it to produce a plurality of analysis target sub-fragments. A feature vector generator produces the corresponding analysis target feature vectors from the sub-fragments. A clusterer clusters and groups the analysis target feature vectors to produce a corresponding analysis target feature vector distribution map, which shows the vector groups into which the feature vectors are divided and the representation amount of each vector group in the analysis target fragment. A mapper maps the distribution map against a preset background sequence feature vector and marks as over-represented vector groups those vector groups whose representation amount differs from the background sequence feature vector by more than a predetermined value. A comparison sorter scores and ranks the over-represented vector groups and, on that basis, selects the vector groups that contain regulatory factor binding regions.

The invention further provides a gene sequence analysis method for finding regulatory factor binding regions (binding sites) in an analysis target fragment. The analysis target fragment is received and cut to produce a plurality of analysis target sub-fragments. The corresponding analysis target feature vectors are produced from the sub-fragments. The feature vectors are clustered and grouped to produce a corresponding analysis target feature vector distribution map, which shows the vector groups into which the feature vectors are divided and the representation amount of each vector group in the analysis target fragment. The distribution map is mapped against a preset background sequence feature vector, and the vector groups whose representation amount differs from the background sequence feature vector by more than a predetermined value are marked as over-represented vector groups. The over-represented vector groups are scored and ranked, and the vector groups that contain regulatory factor binding regions are selected on that basis.

To make the above and other objects, features, and advantages of the invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

[Embodiments]

Preferred embodiments are described in detail below with reference to FIGS. 1 to 4. This specification provides different embodiments to illustrate the technical features of different implementations of the invention. The arrangement of the elements in the embodiments is for illustration only and is not intended to limit the invention. The partial repetition of reference numerals among the embodiments is for simplicity of description and does not imply any relationship between different embodiments.

FIG. 1 is a schematic diagram of a gene sequence analysis system according to an embodiment of the invention. The gene sequence analysis system 10 is used to find regulatory factor binding regions (binding sites) in an analysis target fragment. It includes an input interface 11, a background sequence generator 12, a sub-fragment generator 13, a feature vector generator 14, a clusterer 15, a mapper 16, a comparison sorter 17, and an output interface 18.

The input interface 11 receives analysis target fragment data 110 corresponding to the analysis target fragment. The analysis target fragment data 110 is gene sequence data corresponding to the gene sequence fragment to be analyzed.

The background sequence generator 12 generates background sequence fragment data 120 according to a mathematical model. The mathematical model may be a preset mathematical model or a mathematical model input by the user. Here, the preset mathematical model is a Markov model (as recited in the claims), which generates background sequence fragment data 120 that has the same number and length as the analysis target fragment data 110 and consists of random sequences. In the upstream sequence of a gene, usually only a small fragment is a regulatory factor binding region, and the rest can be regarded as noise for the purposes of the analysis. The randomly generated background sequence data is therefore used as a reference against which sequence fragments that stand out can be identified. In other words, the background sequence fragment data 120 is used to remove noise so that the over-represented portions of the analysis target fragment data 110 can be found.
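The background generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the document only says (in the claims) that the preset model is a Markov model producing random sequences with the same number and length as the target data. The first-order chain, the add-one smoothing, the choice to estimate transitions from the target sequences themselves, and the names `train_first_order` and `generate_background` are all hypothetical.

```python
import random

ALPHABET = "ACGT"


def train_first_order(sequences):
    """Count nucleotide transitions (current -> next) with add-one smoothing.

    Assumption: sequences contain only the uppercase letters A, C, G, T.
    """
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return counts


def generate_background(sequences, seed=0):
    """Return one random sequence per input sequence, each of the same length."""
    rng = random.Random(seed)
    counts = train_first_order(sequences)
    background = []
    for seq in sequences:
        if not seq:
            background.append("")
            continue
        out = [rng.choice(ALPHABET)]  # assumed: uniform choice of first base
        while len(out) < len(seq):
            weights = [counts[out[-1]][b] for b in ALPHABET]
            out.append(rng.choices(ALPHABET, weights=weights)[0])
        background.append("".join(out))
    return background


if __name__ == "__main__":
    targets = ["AAATTCG", "GGCATTACG"]
    for original, random_seq in zip(targets, generate_background(targets)):
        print(original, random_seq)
```

A user-supplied model, which the text also allows, would simply replace `train_first_order` with whatever transition table the user provides.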

The sub-fragment generator 13 cuts the analysis target fragment data 110 to produce a plurality of analysis target sub-fragment data 131, and cuts the background sequence fragment data 120 to produce a plurality of background sequence sub-fragment data 133. The sub-fragment generator 13 may use a sliding window to perform this cutting, with the size of the sliding window specified by the user.

The feature vector generator 14 produces the corresponding analysis target feature vectors 141 from the analysis target sub-fragment data 131, and processes the background sequence sub-fragment data 133 to produce the corresponding background sequence feature vectors 143. The feature vector generator 14 computes the term frequency of each K-mer and defines the analysis target feature vectors 141 and the background sequence feature vectors 143 from these K-mer term frequencies.

The clusterer 15 clusters and groups the analysis target feature vectors 141 to produce a corresponding analysis target feature vector distribution map 151, which shows the vector groups into which the analysis target feature vectors 141 are divided and the representation amount of each vector group in the analysis target fragment.
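The cutting and term-frequency steps performed by the sub-fragment generator 13 and the feature vector generator 14 can be sketched as follows. The function names, the `step` parameter, and the fixed `ACGT` ordering of the vector are illustrative assumptions; the document specifies only a user-defined window size and K-mer term frequencies.

```python
from collections import Counter
from itertools import product


def sliding_windows(sequence, width, step):
    """Cut `sequence` into sub-fragments of length `width`, moving `step` each time."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [sequence[i:i + width]
            for i in range(0, len(sequence) - width + 1, step)]


def kmer_counts(fragment, k):
    """Term frequency of every K-mer that occurs in `fragment`."""
    return Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))


def feature_vector(fragment, k, alphabet="ACGT"):
    """Fixed-length vector of K-mer term frequencies (0 for absent K-mers)."""
    counts = kmer_counts(fragment, k)
    return [counts["".join(term)] for term in product(alphabet, repeat=k)]


if __name__ == "__main__":
    print(sliding_windows("ABCDEFG", 3, 2))   # ['ABC', 'CDE', 'EFG']
    print(dict(kmer_counts("AAATTCG", 2)))    # AA: 2, others: 1
    print(feature_vector("AAATTCG", 1))       # [3, 1, 1, 2] for A, C, G, T
```

A vector built this way has 4^K entries, so in practice K stays small; concatenating the vectors for several values of K is one possible extension, but the document does not say whether this is done.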
The clusterer 15 uses the self-organizing map (SOM) clustering method, a neural network technique, to cluster and group the analysis target feature vectors. The analysis target feature vector distribution map 151 may be presented in two dimensions.

The mapper 16 maps the analysis target feature vector distribution map 151 against the background sequence feature vectors 143, marks as over-represented vector groups those vector groups in the distribution map 151 whose representation amount differs from the background sequence feature vectors 143 by more than a predetermined value, and sends the data of the over-represented vector groups 161 to the comparison sorter 17.
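A self-organizing map of the kind the clusterer relies on can be sketched in a few lines. This is a generic textbook SOM, not the specific configuration used in the patent: the grid size, the linear decay of the learning rate and neighbourhood radius, the Gaussian neighbourhood, and the reading of "representation amount" as the number of sub-fragment vectors assigned to each node are all assumptions, as are the function names.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Grid coordinate of the node whose weights are closest to `vector`."""
    return min(weights,
               key=lambda node: sum((w - x) ** 2
                                    for w, x in zip(weights[node], vector)))


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Initialise every node with a copy of a randomly chosen input vector.
    weights = {(r, c): list(rng.choice(vectors))
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighbourhood shrinks
        for vector in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
    return weights


def group_counts(weights, vectors):
    """Number of vectors assigned to each node (one vector group per node)."""
    counts = {node: 0 for node in weights}
    for vector in vectors:
        counts[best_matching_unit(weights, vector)] += 1
    return counts
```

Laying the counts from `group_counts` out on the `rows` by `cols` grid gives a two-dimensional picture comparable to the distribution map 151. Assigning the background vectors to the same trained map yields the per-group background counts needed for the comparison.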

The comparison sorter 17 scores and ranks the over-represented vector groups 161 and, on that basis, selects the vector groups 171 that contain regulatory factor binding regions. The comparison sorter 17 computes the relative representation amount of the over-represented vector groups 161 with respect to the background sequence feature vectors 143, computes the relative representation amount of the over-represented vector groups with respect to the analysis target feature vectors, and scores and ranks the over-represented vector groups 161 according to these results.

FIG. 2 is a flowchart of a gene sequence analysis method according to an embodiment of the invention.

First, analysis target fragment data 110 corresponding to the analysis target fragment is received. The analysis target fragment data 110 is gene sequence data, provided by the user, corresponding to the gene sequence fragment to be analyzed.

In step S209, background sequence fragment data 120 is generated according to a mathematical model. The mathematical model may be a preset mathematical model or a mathematical model input by the user. Here, the preset mathematical model generates background sequence fragment data 120 that has the same number and length as the analysis target fragment data and consists of random sequences. In the upstream sequence of a gene, usually only a small fragment is a regulatory factor binding region, and the rest can be regarded as noise for the purposes of the analysis. The randomly generated background sequence data is therefore used as a reference against which sequence fragments that stand out can be identified. In other words, the background sequence fragment data 120 is used to remove noise so that the over-represented portions of the analysis target fragment data 110 can be found.

In step S211, the analysis target fragment data 110 is cut to produce a plurality of analysis target sub-fragment data 131. In step S212, the background sequence fragment data 120 is cut to produce a plurality of background sequence sub-fragment data 133. A sliding window may be used for this cutting, with the size of the sliding window specified by the user.

FIG. 3 illustrates cutting with a sliding window of width L and sliding distance d. The first cut produces sub-fragment 31, which spans points A to C, the distance from point A to point C being L. The window then slides downstream by the distance d to produce sub-fragment 32, which spans points B to D, the distance from point B to point D being L. Likewise, the window slides downstream by the distance d again to produce sub-fragment 33, which extends to point E and again has length L.

In step S231, the corresponding analysis target feature vectors 141 are produced from the analysis target sub-fragment data 131. In step S232, the background sequence sub-fragment data 133 is processed to produce the corresponding background sequence feature vectors 143. These feature vectors may be produced by computing the term frequency of each K-mer and defining the analysis target feature vectors 141 and the background sequence feature vectors 143 from those term frequencies.

FIG. 4 illustrates the term frequency computation, using the fragment "AAATTCG" as an example.

When K is 1 (1-mer), the term 'A' appears 3 times, the term 'T' appears 2 times, and the terms 'C' and 'G' each appear once. The corresponding term frequencies are therefore 3 for 'A', 2 for 'T', and 1 each for 'C' and 'G'.

When K is 2 (2-mer), the term 'AA' appears 2 times, and the terms 'AT', 'TT', 'TC', and 'CG' each appear once. The corresponding term frequencies are therefore 2 for 'AA' and 1 each for 'AT', 'TT', 'TC', and 'CG'; every other 2-mer term has a term frequency of 0.

When K is 3 (3-mer), the terms 'AAA', 'AAT', 'ATT', 'TTC', and 'TCG' each appear once. The corresponding term frequencies are therefore 1 for each of these terms, and 0 for every other 3-mer term.

In step S24, the analysis target feature vectors 141 are clustered and grouped to produce a corresponding analysis target feature vector distribution map 151, which shows the vector groups into which the analysis target feature vectors 141 are divided and the representation amount of each vector group in the analysis target fragment. This step may be implemented with the self-organizing map (SOM) clustering method, a neural network technique, and the distribution map 151 may be presented in two dimensions.

In step S25, the distribution map 151 is mapped against the background sequence feature vectors 143, and the vector groups in the distribution map 151 whose representation amount differs from the background sequence feature vectors 143 by more than a predetermined value are marked as over-represented vector groups.

In step S26, the over-represented vector groups 161 are scored and ranked. This may be done by computing the relative representation amount of the over-represented vector groups 161 with respect to the background sequence feature vectors 143, computing the relative representation amount of the over-represented vector groups with respect to the analysis target feature vectors, and scoring and ranking the over-represented vector groups 161 according to these results.

In step S27, the vector groups 171 that contain regulatory factor binding regions are selected according to the scores and ranking.

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.
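The flagging, scoring, and ranking of steps S25 to S27 can be sketched as follows. The document does not give the scoring formula, so everything numeric here is an assumption: each group's representation amount is taken to be its share of the sub-fragment vectors, only groups that are more frequent in the target than in the background are flagged, and the score is simply the excess share. The function name `flag_and_rank` and the dictionary layout are hypothetical.

```python
def flag_and_rank(target_counts, background_counts, threshold):
    """Flag over-represented groups and rank them by excess share.

    target_counts / background_counts: {group_id: number of vectors in group}.
    threshold: the "predetermined value" the difference must exceed.
    """
    t_total = sum(target_counts.values())
    b_total = sum(background_counts.values())
    if t_total == 0:
        return []
    flagged = []
    for group, t_count in target_counts.items():
        t_share = t_count / t_total
        b_share = background_counts.get(group, 0) / b_total if b_total else 0.0
        excess = t_share - b_share  # assumed score: difference in share
        if excess > threshold:
            flagged.append({
                "group": group,
                "target_share": t_share,
                "background_share": b_share,
                "score": excess,
            })
    flagged.sort(key=lambda g: g["score"], reverse=True)
    return flagged


if __name__ == "__main__":
    target = {"a": 5, "b": 4, "c": 1}
    background = {"a": 1, "b": 3, "c": 6}
    for entry in flag_and_rank(target, background, threshold=0.05):
        print(entry["group"], round(entry["score"], 3))
```

Selecting the top few entries of the returned list corresponds to step S27; how many groups to keep is left open by the document.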

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram of a gene sequence analysis system according to an embodiment of the invention.
FIG. 2 is a flowchart of a gene sequence analysis method according to an embodiment of the invention.
FIG. 3 is a schematic diagram of cutting with a sliding window.
FIG. 4 is a schematic diagram of a feature vector generation method according to an embodiment of the invention.

[Description of Main Reference Numerals]

gene sequence analysis system 10; input interface 11; background sequence generator 12; sub-fragment generator 13; feature vector generator 14; clusterer 15; mapper 16; comparison sorter 17; output interface 18.


Claims

X. Scope of the Patent Application:

1. A gene sequence analysis system for finding regulatory factor binding regions (binding sites) in an analysis target fragment, comprising:
an input interface that receives analysis target fragment data corresponding to the analysis target fragment;
a sub-fragment generator that cuts the analysis target fragment data to produce a plurality of analysis target sub-fragment data;
a feature vector generator that produces corresponding analysis target feature vectors from the analysis target sub-fragment data;
a clusterer that clusters and groups the analysis target feature vectors to produce a corresponding analysis target feature vector distribution map, which represents the plurality of vector groups into which the analysis target feature vectors are divided and the representation amount of each vector group in the analysis target fragment;
a mapper that maps the analysis target feature vector distribution map against a preset background sequence feature vector and marks as over-represented vector groups those vector groups in the distribution map whose representation amount differs from the background sequence feature vector by more than a predetermined value; and
a comparison sorter that scores and ranks the over-represented vector groups and, on that basis, selects the vector groups that contain regulatory factor binding regions.

2. The gene sequence analysis system of claim 1, further comprising a background sequence generator that generates background sequence fragment data according to a mathematical model, wherein the sub-fragment generator cuts the background sequence fragment data to produce a plurality of background sequence sub-fragment data, and the feature vector generator processes the background sequence sub-fragment data to produce the corresponding background sequence feature vector.

3. The gene sequence analysis system of claim 2, wherein the mathematical model is a preset mathematical model or a mathematical model input by a user.

4. The gene sequence analysis system of claim 3, wherein the preset mathematical model is a Markov model, which generates background sequence fragment data that has the same number and length as the analysis target fragment data and consists of random sequences.

5. The gene sequence analysis system of claim 1, wherein the sub-fragment generator uses a sliding window to cut the analysis target fragment data to produce the plurality of analysis target sub-fragment data.

6. The gene sequence analysis system of claim 5, wherein the size of the sliding window is specified by a user.

7. The gene sequence analysis system of claim 1, wherein the feature vector generator computes the term frequency of K-mers and defines the analysis target feature vectors according to the K-mer term frequencies.

8. The gene sequence analysis system of claim 1, wherein the clusterer uses the self-organizing map (SOM) clustering method, a neural network technique, to cluster and group the analysis target feature vectors.

9. The gene sequence analysis system of claim 1, wherein the comparison sorter computes the relative representation amount of the over-represented vector groups with respect to the background sequence feature vector, computes the relative representation amount of the over-represented vector groups with respect to the analysis target feature vectors, and scores and ranks the over-represented vector groups according to these results.

10. A gene sequence analysis method for finding regulatory factor binding regions (binding sites) in an analysis target fragment, the method comprising:
receiving analysis target fragment data corresponding to the analysis target fragment;
cutting the analysis target fragment data to produce a plurality of analysis target sub-fragment data;
producing corresponding analysis target feature vectors from the analysis target sub-fragment data;
clustering and grouping the analysis target feature vectors to produce a corresponding analysis target feature vector distribution map, which represents the plurality of vector groups into which the analysis target feature vectors are divided and the representation amount of each vector group in the analysis target fragment;
mapping the analysis target feature vector distribution map against a preset background sequence feature vector, and marking as over-represented vector groups those vector groups in the distribution map whose representation amount differs from the background sequence feature vector by more than a predetermined value; and
scoring and ranking the over-represented vector groups and, on that basis, selecting the vector groups that contain regulatory factor binding regions.

11. The gene sequence analysis method of claim 10, further comprising generating background sequence fragment data according to a mathematical model, cutting the background sequence fragment data to produce a plurality of background sequence sub-fragment data, and processing the background sequence sub-fragment data to produce the corresponding background sequence feature vector.

12. The gene sequence analysis method of claim 11, wherein the mathematical model is a preset mathematical model or a mathematical model input by a user.

13. The gene sequence analysis method of claim 12, wherein the preset mathematical model is a Markov model, which generates background sequence fragment data that has the same number and length as the analysis target fragment data and consists of random sequences.

14. The gene sequence analysis method of claim 10, wherein the analysis target fragment data is cut using a sliding window to produce the plurality of analysis target sub-fragment data.

15. The gene sequence analysis method of claim 14, wherein the size of the sliding window is specified by a user.

16. The gene sequence analysis method of claim 10, wherein the term frequency of K-mers is computed and the analysis target feature vectors are defined according to the K-mer term frequencies.

17. The gene sequence analysis method of claim 10, wherein the analysis target feature vectors are clustered and grouped using the self-organizing map (SOM) clustering method, a neural network technique.

18. The gene sequence analysis method of claim 10, wherein the relative representation amount of the over-represented vector groups with respect to the background sequence feature vector is computed, the relative representation amount of the over-represented vector groups with respect to the analysis target feature vectors is computed, and the over-represented vector groups are scored and ranked according to these results.