
TWI326431B - Method and system of analyzing gene sequence - Google Patents

Method and system of analyzing gene sequence

Info

Publication number
TWI326431B
Authority
TW
Taiwan
Prior art keywords
sequence
analysis
feature vector
analysis object
segment
Prior art date
Application number
TW96115338A
Other languages
Chinese (zh)
Other versions
TW200842735A (en)
Inventor
Hahn Ming Lee
Pai Ling Lo
Che Fu Yeh
Chiahsin Huang
Original Assignee
Univ Nat Taiwan Science Tech
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Taiwan Science Tech
Priority to TW96115338A
Publication of TW200842735A
Application granted
Publication of TWI326431B


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

1326431
Client's Docket No. 0950067tw; TT's Docket No: 0912-0912-A50871TWf (submission version)/alicewu

IX. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a method for analyzing gene sequences, and more particularly to a method that analyzes the features of a gene sequence in order to locate the binding regions on that sequence that are regulated by gene regulatory factors.

[Prior Art]

Interactions between gene regulatory factors and genes control important physiological processes, such as development and the response to environmental change. Finding the segments of a gene on which regulatory factors act is therefore a critical research topic, both in biological theory and in medicine.

Traditionally, the segments of a gene acted on by regulatory factors, that is, the binding sites of regulatory factors on the gene, have been sought with biological methods such as DNA footprinting. The footprinting method uses chemical agents to cut the gene sequence under study into fragments and then analyzes each fragment to determine whether it is a regulatory region. This method is time-consuming and laborious.

In recent years, computer information technology has also been used alongside the traditional biological methods to assist gene sequence analysis. One example is gene sequence analysis using the position weight matrix method. The position weight matrix method analyzes regulatory regions that have already been discovered: it counts how many times each nucleotide occurs at each position, derives a weight for each position, and assembles the weights into a matrix. The method assumes that all regulatory regions have a fixed length and that each position contributes independently to the binding affinity of the regulatory region.

However, these assumptions are not correct from the viewpoint that the positions of a gene sequence influence one another. The characteristic sequences that bind to regulatory factors are usually short segments that appear in the upstream sequence of a gene and may lie in any part of that upstream sequence, so their lengths may also differ.

A gene sequence analysis method that is fast and consistent with biological principles is therefore needed.

[Summary of the Invention]

The object of the invention is to provide a method that analyzes the features of a gene sequence to find the binding regions on that sequence that are regulated by gene regulatory factors.

To achieve this object, the invention provides a gene sequence analysis system for finding regulatory factor binding sites in an analysis target segment. A sub-segment generator receives the analysis target segment and cuts it to produce a plurality of analysis target sub-segments. A feature vector generator produces corresponding analysis target feature vectors from the analysis target sub-segments. A clusterer clusters and groups the analysis target feature vectors to produce a corresponding analysis target feature vector distribution map, which represents the vector groups into which the analysis target feature vectors are divided and the representation level of each vector group in the analysis target segment. A mapper maps the distribution map against a preset background sequence feature vector and marks as over-represented vector groups those vector groups whose representation level differs from the background sequence feature vector by more than a predetermined value. A comparison sorter scores and ranks the over-represented vector groups and, on that basis, selects the vector groups that contain regulatory factor binding sites.

The invention further provides a gene sequence analysis method for finding regulatory factor binding sites in an analysis target segment. The analysis target segment is received and cut to produce a plurality of analysis target sub-segments. Corresponding analysis target feature vectors are produced from the analysis target sub-segments. The analysis target feature vectors are clustered and grouped to produce a corresponding analysis target feature vector distribution map, which represents the vector groups into which the analysis target feature vectors are divided and the representation level of each vector group in the analysis target segment. The distribution map is mapped against a preset background sequence feature vector, and those vector groups whose representation level differs from the background sequence feature vector by more than a predetermined value are marked as over-represented vector groups. The over-represented vector groups are scored and ranked, and the vector groups that contain regulatory factor binding sites are selected on that basis.

To make the above and other objects, features, and advantages of the invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

[Embodiments]

To make the objects, features, and advantages of the invention easier to understand, preferred embodiments are described in detail below with reference to FIGS. 1 to 4.

This specification provides different embodiments to illustrate the technical features of different implementations of the invention. The arrangement of the components in the embodiments is for illustration and is not intended to limit the invention. Reference numerals are partly repeated across the embodiments to simplify the description; the repetition does not imply any relationship between different embodiments.

FIG. 1 is a schematic diagram of a gene sequence analysis system according to an embodiment of the invention. The gene sequence analysis system 10 is used to find regulatory factor binding sites in an analysis target segment. It includes an input interface 11, a background sequence generator 12, a sub-segment generator 13, a feature vector generator 14, a clusterer 15, a mapper 16, a comparison sorter 17, and an output interface 18.

The input interface 11 receives analysis target segment data 110 corresponding to the analysis target segment. The analysis target segment data 110 is gene sequence data, provided by the user, that corresponds to the gene sequence segment to be analyzed.

The background sequence generator 12 generates background sequence segment data 120 according to a mathematical model. The mathematical model may be a preset mathematical model or a mathematical model input by the user. Here, the preset mathematical model is a Markov model, which generates background sequence segment data 120 that has the same number and length of segments as the analysis target segment data 110 and consists of random sequences. In the upstream sequence of a gene, usually only a small part is a regulatory factor binding site, and the rest can be treated as noise for the purposes of analysis. The background sequences generated randomly from the mathematical model can therefore be compared against the analysis target to reveal the sequence segments that stand out. In other words, the background sequence segment data 120 is used to remove noise so that the over-represented parts of the analysis target segment data 110 can be found.
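The background generation described above can be illustrated with a minimal sketch. The specification only states that a Markov model produces random background sequences matching the analysis target in number and length; the model order (first order here), the estimation of transition counts from the target sequences themselves, the add-one smoothing, and the names `train_first_order` and `generate_background` are all assumptions made for illustration, not details taken from the patent.

```python
import random

ALPHABET = "ACGT"  # assumption: sequences contain only these four uppercase letters

def train_first_order(sequences):
    """Count nucleotide-to-nucleotide transitions, with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def generate_background(sequences, seed=None):
    """Return one random sequence per input sequence, with matching lengths."""
    rng = random.Random(seed)
    counts = train_first_order(sequences)
    background = []
    for seq in sequences:
        if not seq:
            background.append("")
            continue
        current = rng.choice(ALPHABET)  # assumption: uniform choice of first base
        chars = [current]
        for _ in range(len(seq) - 1):
            weights = [counts[current][b] for b in ALPHABET]
            current = rng.choices(ALPHABET, weights=weights)[0]
            chars.append(current)
        background.append("".join(chars))
    return background

# Hypothetical analysis target segments, used only to exercise the sketch.
targets = ["AAATTCGGCTA", "TTCGAAATTCG", "GGCATTCGAAT"]
background = generate_background(targets, seed=0)
```

Estimating the transition counts from the target sequences keeps the base composition of the background close to that of the target, which is one plausible way for the background to stand in for noise; a user-supplied model could replace `train_first_order`.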

The sub-segment generator 13 cuts the analysis target segment data 110 to produce a plurality of analysis target sub-segment data 131, and cuts the background sequence segment data 120 to produce a plurality of background sequence sub-segment data 133. The sub-segment generator 13 may use a sliding window to perform the cutting, where the size of the sliding window is specified by the user.

The feature vector generator 14 produces the corresponding analysis target feature vectors 141 from the analysis target sub-segment data 131, and processes the background sequence sub-segment data 133 to produce the corresponding background sequence feature vectors 143. The feature vector generator 14 computes the term frequencies of K-mers and defines the analysis target feature vectors 141 and the background sequence feature vectors 143 according to those term frequencies.

The clusterer 15 clusters and groups the analysis target feature vectors 141 to produce a corresponding analysis target feature vector distribution map 151, which represents the vector groups into which the analysis target feature vectors 141 are divided and the representation level of each vector group in the analysis target segment. The clusterer 15 clusters and groups the analysis target feature vectors using the Self-Organizing Map clustering method of neural networks. The analysis target feature vector distribution map 151 may be presented in two dimensions.

The mapper 16 maps the analysis target feature vector distribution map 151 against the background sequence feature vectors 143, marks as over-represented vector groups those vector groups in the distribution map 151 whose representation level differs from the background sequence feature vectors 143 by more than a predetermined value, and transmits the data of the over-represented vector groups 161 to the comparison sorter 17.
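The clustering and mapping stages described above can be sketched roughly as follows. This is a deliberately small self-organizing map written with the standard library only; the grid size, learning-rate schedule, and neighbourhood function are illustrative choices, and the way the background is compared (projecting background feature vectors onto the same trained map and comparing per-node hit counts against a threshold) is an assumption about how the mapping could be done, not the procedure fixed by the specification. All function names are hypothetical.

```python
import math
import random

def best_node(nodes, v):
    """Grid position of the node whose weight vector is closest to v."""
    return min(nodes, key=lambda p: sum((a - b) ** 2 for a, b in zip(nodes[p], v)))

def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small rows x cols self-organizing map; returns {position: weights}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    total = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        for v in vectors:
            frac = step / total
            rate = 0.5 * (1.0 - frac)                          # learning rate decays
            radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5  # neighbourhood shrinks
            bmu = best_node(nodes, v)
            for pos, w in nodes.items():
                d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * h * (v[i] - w[i])
            step += 1
    return nodes

def group_counts(nodes, vectors):
    """Representation level of each vector group: number of vectors per node."""
    counts = {pos: 0 for pos in nodes}
    for v in vectors:
        counts[best_node(nodes, v)] += 1
    return counts

def over_represented(target_counts, background_counts, threshold):
    """Groups whose target count exceeds the background count by more than threshold."""
    return sorted(pos for pos in target_counts
                  if target_counts[pos] - background_counts.get(pos, 0) > threshold)

# Hypothetical 1-mer feature vectors (order A, C, G, T), for illustration only.
target_vecs = [[3, 1, 1, 2], [3, 2, 1, 1], [0, 0, 4, 3], [3, 1, 1, 2]]
background_vecs = [[2, 2, 2, 1], [1, 2, 2, 2], [2, 1, 2, 2], [2, 2, 1, 2]]
som = train_som(target_vecs, rows=2, cols=2)
t_counts = group_counts(som, target_vecs)
b_counts = group_counts(som, background_vecs)
flagged = over_represented(t_counts, b_counts, threshold=1)
```

The two-dimensional grid positions play the role of the two-dimensional distribution map: each position is one vector group, and its count is that group's representation level.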

The comparison sorter 17 scores and ranks the over-represented vector groups 161 and, on that basis, selects the vector groups 171 that contain regulatory factor binding sites. The comparison sorter 17 computes the relative representation level of the over-represented vector groups 161 with respect to the background sequence feature vectors 143, computes the relative representation level of the over-represented vector groups with respect to the analysis target feature vectors, and scores and ranks the over-represented vector groups 161 according to these results.

FIG. 2 is a flowchart of a gene sequence analysis method according to an embodiment of the invention.

In step S201, analysis target segment data 110 corresponding to the analysis target segment is received. The analysis target segment data 110 is gene sequence data, provided by the user, that corresponds to the gene sequence segment to be analyzed.

In step S209, background sequence segment data 120 is generated according to a mathematical model. The mathematical model may be a preset mathematical model or a mathematical model input by the user. Here, the preset mathematical model is a Markov model, which generates background sequence segment data 120 that has the same number and length of segments as the analysis target segment data and consists of random sequences. In the upstream sequence of a gene, usually only a small part is a regulatory factor binding site, and the rest can be treated as noise for the purposes of analysis, so the randomly generated background sequence data can be compared against the analysis target to reveal the sequence segments that stand out. In other words, the background sequence segment data 120 is used to remove noise so that the over-represented parts of the analysis target segment data 110 can be found.

In step S211, the analysis target segment data 110 is cut to produce a plurality of analysis target sub-segment data 131.

In step S212, the background sequence segment data 120 is cut to produce a plurality of background sequence sub-segment data 133.

A sliding window may be used to perform the cutting, where the size of the sliding window is specified by the user.

FIG. 3 is a schematic diagram of the sliding-window cutting. For a sliding window of width L and sliding distance d, as shown in FIG. 3, the sliding window first produces a sub-segment that spans from point A to point C, where the distance from point A to point C is L. The sliding window then slides a distance d downstream to produce sub-segment 32, which spans from point B to point D, where the distance from point B to point D is L and the distance from point A to point B is d. Likewise, the sliding window slides a further distance d downstream to produce sub-segment 33, which spans from point C to point E, where the distance from point C to point E is L and the distance from point B to point C is d.

In step S231, the corresponding analysis target feature vectors 141 are produced from the analysis target sub-segment data 131.

In step S232, the background sequence sub-segment data 133 is processed to produce the corresponding background sequence feature vectors 143.

Here, the feature vectors may be produced by computing the term frequencies of K-mers and defining the analysis target feature vectors 141 and the background sequence feature vectors 143 according to those term frequencies.

FIG. 4 is used as an example to explain how the K-mer term frequencies are computed in the feature vector generation method, taking the segment "AAATTCG" shown in FIG. 4. When K is 1 (1-mer), the 1-mer term 'A' appears 3 times in the segment "AAATTCG", the term 'T' appears 2 times, and the terms 'C' and 'G' each appear once. The corresponding term frequencies are therefore 3 for 'A', 2 for 'T', and 1 each for 'C' and 'G'. When K is 2 (2-mer), the 2-mer term 'AA' appears 2 times, and the terms 'AT', 'TT', 'TC', and 'CG' each appear once. The corresponding term frequencies are therefore 2 for 'AA' and 1 each for 'AT', 'TT', 'TC', and 'CG'; the term frequencies of all other 2-mer terms are 0. When K is 3 (3-mer), the 3-mer terms 'AAA', 'AAT', 'ATT', 'TTC', and 'TCG' each appear once. The corresponding term frequencies are therefore 1 each for 'AAA', 'AAT', 'ATT', 'TTC', and 'TCG'; the term frequencies of all other 3-mer terms are 0.

In step S24, the analysis target feature vectors 141 are clustered and grouped to produce a corresponding analysis target feature vector distribution map 151, which represents the vector groups into which the analysis target feature vectors 141 are divided and the representation level of each vector group in the analysis target segment. This step may be implemented with the Self-Organizing Map clustering method of neural networks, and the analysis target feature vector distribution map 151 may be presented in two dimensions.

In step S25, the analysis target feature vector distribution map 151 is mapped against the background sequence feature vectors 143, and those vector groups in the distribution map 151 whose representation level differs from the background sequence feature vectors 143 by more than a predetermined value are marked as over-represented vector groups.

In step S26, the over-represented vector groups 161 are scored and ranked. The comparison and ranking may be performed by computing the relative representation level of the over-represented vector groups 161 with respect to the background sequence feature vectors 143, computing the relative representation level of the over-represented vector groups with respect to the analysis target feature vectors, and scoring and ranking the over-represented vector groups 161 according to these results.

In step S27, the vector groups 171 that contain regulatory factor binding sites are selected according to the result of the scoring and ranking.

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.
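The sliding-window cutting of FIG. 3 and the K-mer term-frequency counting of FIG. 4 described above can be sketched as follows. The function names and the fixed A, C, G, T ordering of the feature vector are assumptions chosen for illustration; the specification does not prescribe an ordering or say how a trailing piece shorter than the window is handled (it is simply dropped here).

```python
from collections import Counter
from itertools import product

def sliding_windows(sequence, width, step):
    """Cut a sequence into sub-segments of length `width`, advancing by `step`."""
    return [sequence[i:i + width]
            for i in range(0, len(sequence) - width + 1, step)]

def kmer_counts(segment, k):
    """Term frequency of every K-mer that occurs in the segment."""
    return Counter(segment[i:i + k] for i in range(len(segment) - k + 1))

def feature_vector(segment, k, alphabet="ACGT"):
    """Fixed-order vector of K-mer term frequencies, with 0 for absent terms."""
    counts = kmer_counts(segment, k)
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]

# The worked example of FIG. 4.
one_mers = kmer_counts("AAATTCG", 1)    # A: 3, T: 2, C: 1, G: 1
two_mers = kmer_counts("AAATTCG", 2)    # AA: 2; AT, TT, TC, CG: 1 each
three_mers = kmer_counts("AAATTCG", 3)  # AAA, AAT, ATT, TTC, TCG: 1 each

# Window width L = 4 and sliding distance d = 2, as an illustrative choice.
windows = sliding_windows("AAATTCG", width=4, step=2)  # ['AAAT', 'ATTC']
vector = feature_vector("AAATTCG", 1)                  # [3, 1, 1, 2] for A, C, G, T
```

A feature vector for K = 2 has 16 entries and for K = 3 has 64, most of them zero for a short sub-segment, which matches the statement that all other terms have a term frequency of 0.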

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram of a gene sequence analysis system according to an embodiment of the invention.
FIG. 2 is a flowchart of a gene sequence analysis method according to an embodiment of the invention.
FIG. 3 is a schematic diagram of the sliding-window cutting.
FIG. 4 is a schematic diagram of a feature vector generation method according to an embodiment of the invention.

[Description of Main Reference Numerals]

gene sequence analysis system 10;
input interface 11;
background sequence generator 12;
sub-segment generator 13;
feature vector generator 14;
clusterer 15;
mapper 16;
comparison sorter 17;
output interface 18.


Claims (1)

十、申請專利範圍: 段中的調控因: ’其用以找尋-分析對象片 分析對象。ί料二、接收對應於該分析對象片段之-數分析對象=段資料刀;副該分析對象片段資料以產生複 生職據該分析對象子片段資料產 群,以㈣類、分 表現該分析對象特徵向量t狀態圖,用以 每一而旦我,★ J里d刀而成之複數向量群組,以及 6里群組在該分析對象片段中的表現量; 預設的::;序t:象特徵向量分佈狀態圖與- 量群組之該表現量與該背景序列特 值者部份標示為過度表現向量 排序了2排ΐ11’針對該過度表現向量群組加以評分及 f序亚據以師選出包含調控因子結合區域的該向量群 二利範圍第1項所述之基因序列分析系統, ^3月厅、序列產生器,其依據一數學模型,產生一背 景f列片段資料’·其中該子片段產生器切割該背景序列片 段貧料以產生複數背景序列子片段資料,其中該特徵向量 產生器處理該背景序列子片段資料,以產生對應之景 序列特徵向量。 Client's Docket No. 0950067tw TT's Docket No:0912-09I2-A50871TWfl:^/K$)/a,icewu 15 其中所,因序列分析系統, 模型或1用者輸入數學:該數學模型為-預設數學 其中3項所述之基因序列分析系統, 生與該分析對::ΐ ί ;L可夫模型(Mark。v modei),其產 序列的垓背景序列片段資料。 、有隧機 5.如ΐ請專利範圍第丨項所述之基 其中該子片段產生器 j刀析糸統, 切宝古玄公姑料备u 使用/月動視由(shd!ng window) 段資料資料以產生複數之該分析對象子片 其二„5項所述之基因序列分析系統, 〇乂 ’月動視®的大小係為使用者指定。 盆中請專利範圍第1項所述之基因序列分析系統, 厂中㈣徵向量產生器計算分段模數(K_mer)的項目頻率 term frequency),並依據該分段模數的項目 析對象特徵向量。 、午"疋該刀 ^如申請專利範圍第1項所述之基因序列分析系統, 其中該聚類器使用類神經網路(Neural Networks)之自我組 織圖分群(Self-〇rganizing Map clustedng)方法將該分析 對象特徵向量予以聚類、分群。 9·如申請專利範圍第1項所述之基因序列分析系統, 其中該比較排序器計算該過度表現向量群組和該背景序 列特徵向量的相對表現量,並計算該過度表現向量群組和 S亥分析對象特徵向量之相對表現量,並依據上述結果對該 過度表現向量群組評分及排序。 Client’s Docket No. 0950067tw TT’s Docket N〇:0912-0912-A50871TWf(送件版本Valicewu 1( 1326431 &1 〇.種基因序列分析方法,其用以找尋一分析對象 片1又中的调控因子結合區域(binding site),該方法包括: ;接收對應於該分析對象片段之一分析對象片段資 料’並切割該分析對象片段資料以產生複數分析對象子片 段資料; ^依據該分析對象子片段資料產生對應之分析對象特 徵向量; 將該刀析對象特徵向量予以聚類、分群,以產生對應 之刀^對象特徵向量分佈狀態圖,用以表現該分析對象 特徵向量劃分而成之複數向量群組,以及 該分析對象片段中的表現量; 拜、、在 旦將°亥77析對象特徵向量分佈狀態圖與一預設的一背 3列特徵向量做對映,將該分析對象特徵向量分佈狀態 m該向量群組之該表現量與該背景序列特徵向量之差 -大於一預定值者部份標示為過度表現向量群組;以及 十^該過度表現向量群組加以評分及排序,並據以筛 k出包含調控因子結合區域的該向量群組。 ^如中請專利範圍第1()項所述之基因序列分析方 並切ί二含匕依據一數學模型,產生-背景序列片段資料, 資^ ==旦列序片1資料以產生複數f景序列子片段 序列二ί “序列子片段資料,以產生對⑽^ 法」專利範圍第U項所述之基因序列分析方 ί數學數學模型為—預餘――使用者輸 專利範㈣12項所述之基因序列分析方 S Docket N〇:09I2-09,2-A50871TWf(^^)/alicewu ,7 1326431 法,其中該預設數學模型為馬可夫模型(Markov model), 其產生與該分析對象片段資料相同數量及長度,且具有一 隨機序列的該背景序列片段貢料。 14. 如申請專利範圍第10項所述之基因序列分析方 . 法,其使用滑動視窗(sliding window)切割該分析對象片 _ 段資料以產生複數之該分析對象子片段資料。 15. 
如申請專利範圍第14項所述之基因序列分析方 法,其中該滑動視窗的大小係為使用者指定。 • 16.如申請專利範圍第10項所述之基因序列分析方 ' 法,其計算分段模數(K-mer)的項目頻率(term • frequency),並依據該分段模數的項目頻率界定該分析對 象特徵向量。 17. 如申請專利範圍第10項所述之基因序列分析方 法,其使用類神經網路(Neural Networks)之自我組織圖分 群(Self-Organizing Map Clustering)方法將該分析對象特 徵向量予以聚類、分群。 18. 如申請專利範圍第10項所述之基因序列分析方 _ 法,其計算該過度表現向量群組和該背景序列特徵向量的 相對表現量,並計算該過度表現向量群組和該分析對象特 徵向量之相對表現量,並依據上述結果對該過度表現向量 群組評分及排序。 Client’s Docket No. 0950067tw TT’s Docket Mo:0912-0912-A50871 TWf(送件版本)/alicewu 18X. The scope of application for patents: The regulation and control factors in the paragraph: ‘It is used to find and analyze the object of analysis.二2. Receiving the number-analysis object corresponding to the analysis target segment=segment data knife; the sub-analysis object fragment data is used to generate the resurrection job data, and the analysis object sub-segment data group is represented by (4) Feature vector t state diagram, used for each and every one, ★ J multi-vector group formed by d-knife, and the performance of the group of 6 in the analysis object segment; preset::; : the eigenvector distribution state diagram and the amount of the group of the quantity group and the part of the background sequence value are marked as the over-expression vector. 2 rows ΐ 11' are scored for the over-expressed vector group and the f-order is used. The author selects the gene sequence analysis system described in item 1 of the vector group including the regulatory factor binding region, ^3 month hall, sequence generator, which generates a background f column fragment data according to a mathematical model. The sub-segment generator cuts the background sequence fragment to generate a plurality of background sequence sub-segment data, wherein the feature vector generator processes the background sequence sub-segment data to generate a corresponding sequence sequence Sign vector. Client's Docket No. 
0950067tw TT's Docket No:0912-09I2-A50871TWfl:^/K$)/a, icewu 15 Where, due to the sequence analysis system, model or 1 user input math: the mathematical model is - preset mathematics The gene sequencing system described in the three paragraphs, and the analysis pair:: ΐ ί ; Lkov model (Mark. v modei), the sequence of the sequence of the sequence of the sequence. There is a tunnel 5. For example, please refer to the basis of the patent scope 丨 其中 该 该 该 该 该 该 该 该 该 , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , The data of the segment is used to generate a plurality of the sub-slices of the analysis object, and the gene sequence analysis system described in the second item is 使用者. The size of the 动 '月动视® is specified by the user. In the gene sequence analysis system, the plant (4) eigenvector generator calculates the term frequency of the segmental modulus (K_mer), and analyzes the object feature vector according to the segmental modulus of the project. ^ The gene sequence analysis system according to claim 1, wherein the clusterer uses a Neural Networks-based Self-〇rganizing Map clustedng method to analyze the object eigenvector 9. The gene sequence analysis system according to claim 1, wherein the comparison sequencer calculates a relative expression amount of the excessive representation vector group and the background sequence feature vector, and calculates the The relative performance of the eigenvectors of the degree vector group and the S-analysis object, and the scoring and sorting of the over-performing vector group according to the above results. Client's Docket No. 0950067tw TT's Docket N〇:0912-0912-A50871TWf A version of Valicewu 1 ( 1326431 & 1 基因. 
Gene sequence analysis method for finding a binding factor binding site in an analysis object slice 1 , the method comprising: receiving a segment corresponding to the analysis object An analysis object fragment data 'cuts the analysis object fragment data to generate a plurality of analysis object sub-segment data; ^ generates a corresponding analysis object feature vector according to the analysis object sub-segment data; clusters and groups the knife analysis object feature vectors , to generate a corresponding knife ^ object feature vector distribution state map, used to represent the complex vector group divided by the analysis object feature vector, and the amount of expression in the analysis object segment; worship, in the will The object feature vector distribution state diagram is mapped with a preset back-column eigenvector, and the score is The object feature vector distribution state m is the difference between the representation amount of the vector group and the background sequence feature vector - the portion greater than a predetermined value is marked as an over-expression vector group; and the over-expressed vector group is scored and Sorting and sorting out the vector group containing the regulatory factor binding region. 
^ The gene sequence analysis method described in item 1 () of the patent scope is based on a mathematical model to generate - Background sequence fragment data, ^^ 旦 序 序 片 1 data to generate a complex f sequence sequence sub-sequence sequence 2 “ "sequence sub-segment data to generate a sequence analysis of the gene described in item (U) of the (10) method Fangί mathematical mathematical model is - pre-remaining - user input patent paradigm (four) 12 gene sequence analysis method S Docket N〇: 09I2-09, 2-A50871TWf (^^) / alicewu, 7 1326431 method, which The preset mathematical model is a Markov model, which generates the same number and length of the analysis target segment data, and has a random sequence of the background sequence segment tribute. 14. The gene sequence analysis method according to claim 10, wherein the analysis object slice data is cut using a sliding window to generate a plurality of analysis object sub-segment data. 15. The method of gene sequence analysis according to claim 14, wherein the size of the sliding window is specified by a user. • 16. The gene sequence analysis method described in claim 10, which calculates the project frequency (term • frequency) of the piecewise modulus (K-mer) and the project frequency according to the segmental modulus The analysis object feature vector is defined. 17. The method for analyzing a gene sequence according to claim 10, wherein the analysis object feature vector is clustered using a Neural Networks-based Self-Organizing Map Clustering method. Grouping. 18. The gene sequence analysis method according to claim 10, which calculates a relative expression amount of the over-expression vector group and the background sequence feature vector, and calculates the excessive performance vector group and the analysis object. The relative amount of performance of the feature vector, and the score and ranking of the over-expressed vector group according to the above results. Client’s Docket No. 
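The background generation of claims 11 to 13 and the over-expression marking of claim 10 can be sketched as follows. This is only an illustration under stated assumptions: the Markov model is taken to be first-order and estimated from the analysis sequences themselves, the group labels are assumed to come from a prior clustering step, the expression amount is taken to be the fraction of sub-segments in a group, and ranking by the raw difference is a stand-in for the scoring of claim 18, whose formula is not given here. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_first_order_markov(sequences):
    """Count symbol-to-symbol transitions (assumed first-order model)."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions


def generate_background(sequences, rng):
    """Generate one random sequence per input sequence, with equal length."""
    transitions = train_first_order_markov(sequences)
    alphabet = "ACGT"
    background = []
    for seq in sequences:
        out = [rng.choice(alphabet)] if seq else []
        while len(out) < len(seq):
            counts = transitions.get(out[-1])
            if counts:
                symbols = list(counts)
                weights = [counts[s] for s in symbols]
                out.append(rng.choices(symbols, weights=weights)[0])
            else:
                # Context never seen in training: fall back to uniform choice.
                out.append(rng.choice(alphabet))
        background.append("".join(out))
    return background


def over_expressed_groups(analysis_counts, background_counts, threshold):
    """Mark groups whose expression exceeds the background by > threshold.

    Returns (group, difference) pairs sorted by decreasing difference.
    """
    a_total = sum(analysis_counts.values()) or 1
    b_total = sum(background_counts.values()) or 1
    marked = []
    for group, count in analysis_counts.items():
        diff = count / a_total - background_counts.get(group, 0) / b_total
        if diff > threshold:
            marked.append((group, diff))
    return sorted(marked, key=lambda item: item[1], reverse=True)


rng = random.Random(0)
background = generate_background(["ACGTACGT", "GGCC"], rng)
ranked = over_expressed_groups({"g1": 6, "g2": 2, "g3": 2},
                               {"g1": 2, "g2": 4, "g3": 4},
                               threshold=0.1)
```

In this toy example only group "g1" is marked, because it holds 60% of the analysis sub-segments but only 20% of the background sub-segments.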
TW96115338A 2007-04-30 2007-04-30 Method and system of analyzing gene sequence TWI326431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW96115338A TWI326431B (en) 2007-04-30 2007-04-30 Method and system of analyzing gene sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW96115338A TWI326431B (en) 2007-04-30 2007-04-30 Method and system of analyzing gene sequence

Publications (2)

Publication Number Publication Date
TW200842735A TW200842735A (en) 2008-11-01
TWI326431B TWI326431B (en) 2010-06-21

Family

ID=44822099

Family Applications (1)

Application Number Title Priority Date Filing Date
TW96115338A TWI326431B (en) 2007-04-30 2007-04-30 Method and system of analyzing gene sequence

Country Status (1)

Country Link
TW (1) TWI326431B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI420007B (en) * 2011-03-04 2013-12-21 Hsueh Ting Chu System and method of assembling dna reads
TWI582631B (en) * 2015-11-20 2017-05-11 財團法人資訊工業策進會 Dna sequence analyzing system for analyzing bacterial species and method thereof

Also Published As

Publication number Publication date
TW200842735A (en) 2008-11-01

Similar Documents

Publication Publication Date Title
Roshan et al. Probalign: multiple sequence alignment using partition function posterior probabilities
Shahmuradov et al. bTSSfinder: a novel tool for the prediction of promoters in cyanobacteria and Escherichia coli
Medvedev et al. Error correction of high-throughput sequencing datasets with non-uniform coverage
AU2014340461B2 (en) Systems and methods for using paired-end data in directed acyclic structure
Degroeve et al. SpliceMachine: predicting splice sites from high-dimensional local context representations
Katoh et al. Recent developments in the MAFFT multiple sequence alignment program
Ramírez-Sánchez et al. Plant proteins are smaller because they are encoded by fewer exons than animal proteins
US9779205B2 (en) Systems and methods for rational selection of context sequences and sequence templates
Delport et al. Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology
CN104584022B (en) A kind of system and method generating biomarker signature
US20090076735A1 (en) Method, system and software arrangement for comparative analysis and phylogeny with whole-genome optical maps
AU2014340461A1 (en) Systems and methods for using paired-end data in directed acyclic structure
WO2017120128A1 (en) Systems and methods for adaptive local alignment for graph genomes
Morgado et al. Computational tools for plant small RNA detection and categorization
Ma et al. DNA sequence classification via an expectation maximization algorithm and neural networks: a case study
Oh et al. Landscape of gene transposition–duplication within the Brassicaceae family
Ritchie et al. Mireval: a web tool for simple microRNA prediction in genome sequences
Bruneau et al. A clustering package for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Model
Oğul et al. A discriminative method for remote homology detection based on n-peptide compositions with reduced amino acid alphabets
TWI326431B (en) Method and system of analyzing gene sequence
Opperdoes Phylogenetic analysis using
Middleton et al. NoFold: RNA structure clustering without folding or alignment
KR20200102182A (en) Method and apparatus of the Classification of Species using Sequencing Clustering
De Clercq et al. Deep learning for classification of DNA functional sequences
US20210158896A1 (en) Information processing system, mutation detection system, storage medium, and information processing method

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees