CN116705159A - Screening method for methylation markers, method and device for identifying methylation features - Google Patents

Info

Publication number
CN116705159A
Authority
CN
China
Prior art keywords
methylation
model
differentially methylated
reads
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310713016.6A
Other languages
Chinese (zh)
Inventor
相学平
郭丽华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Buping Medical Laboratory Co ltd
Original Assignee
Hangzhou Buping Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Buping Medical Laboratory Co ltd
Priority to CN202310713016.6A
Publication of CN116705159A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Analytical Chemistry (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application provides a screening method for methylation markers and a method and device for identifying methylation features, and relates to the technical field of biomarkers. The screening method includes: (a) obtaining the sequencing reads of each candidate differentially methylated region of a characteristic sample and a control sample, and calculating the alpha value of each read, where the alpha value is the ratio of the number of methylated cytosines in the read to the number of all methylation sites in the read; and (b) obtaining the alpha distribution model of each candidate differentially methylated region and, if the alpha distribution model meets the set criteria, taking the candidate differentially methylated region and/or at least one methylation site it contains as a methylation marker. The application addresses the problem that the specificity and sensitivity of prior-art methylation markers need improvement.

Description

Screening method for methylation markers, and method and device for identifying methylation features

Technical field

The present invention relates to the technical field of biomarkers, and in particular to a screening method for methylation markers and a method and device for identifying methylation features.

Background

In recent years, many studies have shown that methylation, an important aspect of epigenetics, is closely related to the occurrence and development of tumors and can serve as a marker for early tumor diagnosis. Traditionally, studies of methylation specificity between tumor and normal tissue screen for differentially methylated positions (DMPs) and differentially methylated regions (DMRs). These approaches rely on the methylation beta value: at a single CpG site, beta = number of methylated cytosines / (number of methylated cytosines + number of unmethylated cytosines). Studies have shown that on CpG islands the beta value of tumor tissue is higher than that of normal tissue, so tumor methylation markers can be obtained from differences in beta value; see, for example, Chinese patent application No. CN202010326209.2, entitled "Breast cancer methylation biomarkers based on the TCGA database and screening method thereof". However, when the tumor cell content is low, for example when the sample comes from blood or urine, these methods are not sensitive to the tumor signal and therefore cannot screen for tumors well. How to improve the ability of methylation markers to detect tumors is thus a problem that remains to be solved.

In view of this, the present invention is proposed.

Summary of the invention

The first object of the present invention is to provide a screening method for methylation markers that yields markers with higher sensitivity and specificity. Based on this screening method, the present invention also provides bladder cancer methylation markers, a method for building a model that identifies methylation features, a method for identifying methylation features, and applications thereof, so as to improve the ability of methylated regions and/or sites to recognize and predict tumor signals.

To solve the above technical problem, the present invention adopts the following technical solutions:

According to one aspect of the present invention, a screening method for methylation markers is provided, comprising:

(a) obtaining the sequencing reads of each candidate differentially methylated region of a characteristic sample and a control sample, and calculating the alpha value of each read, the alpha value being the ratio of the number of methylated cytosines in the read to the number of all methylation sites in the read;

(b) obtaining the alpha distribution model of each candidate differentially methylated region; if the alpha distribution model meets the following criteria, the candidate differentially methylated region, and/or at least one methylation site contained in it, is taken as a methylation marker:

at least 80% of the reads of the control sample have an alpha value ≤ the low-methylation threshold;

at least 30% of the reads of the characteristic sample have an alpha value ≥ the high-methylation threshold;

high-methylation threshold - low-methylation threshold ≥ N, where N is at least 0.5;

the high-methylation threshold and/or low-methylation threshold in the criteria may be the same or different across candidate differentially methylated regions.

According to one aspect of the present invention, a bladder cancer methylation marker is provided, comprising at least one of items 1 to 13 in the following table, where the marker corresponding to each item in the table includes the corresponding methylation site and/or a region containing the corresponding methylation site:

Coordinates are based on the hg19 reference genome sequence.

According to one aspect of the present invention, a method for building a model that identifies methylation features is also provided, comprising: taking the differentially methylated regions obtained in step (b) of the above screening method as input data, building the model with a one-dimensional convolutional neural network, and using a deep learning framework for model training and prediction.

According to one aspect of the present invention, a method for identifying methylation features is also provided, comprising using the model built by the above model-building method to identify the methylation features of the sequencing reads of a sample to be tested.

According to one aspect of the present invention, a methylation feature identification device is also provided. The device contains a prediction module, which is used to input the methylation patterns of the reads of the differentially methylated regions into the model built by the above model-building method to obtain the methylation features of the sample to be tested.

According to one aspect of the present invention, a computer-readable medium is also provided, which stores a computer program that, when executed, implements the model built by the above model-building method.

According to one aspect of the present invention, a system for predicting, diagnosing, or assisting in the diagnosis of cancer is also provided, comprising at least one of the following:

(a) reagents and/or equipment for detecting the methylation markers obtained by the above screening method;

(b) reagents and/or equipment for detecting the above bladder cancer methylation markers;

(c) the above methylation feature identification device;

(d) the above computer-readable medium.

Compared with the prior art, the present invention has the following beneficial effects:

The screening method for methylation markers provided by the present invention screens differentially methylated regions twice. The first screening uses a conventional method in the art, for example screening based on microarray data. The second screening operates on deep reads from high-throughput targeted methylation sequencing and uses an alpha density distribution model, so that a model with different parameters can be built for each differentially methylated region. This reduces the misjudgment rate and improves the detection of samples with a specific biological state.

The present invention takes advantage of high-throughput sequencing: once differential sites have been determined, the candidate differentially methylated regions are subjected to next-generation sequencing, and the alpha value of each read, i.e. the proportion of methylated C among all CpG sites on that read, is calculated at the single-molecule level. The alpha value captures tumor signals more sensitively than the beta value while eliminating noise interference. Mathematical modeling of single-molecule methylation signals from cancer tissue and normal tissue allows methylation markers to be identified with higher sensitivity and specificity. Compared with traditional methods of identifying methylation markers, cancer prediction based on single reads has higher resolution and can detect fainter tumor signals in blood and urine.
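The sensitivity argument can be illustrated with a small synthetic example. The sketch below is not taken from the patent: the read counts, the 10 CpG sites per read, the background pattern, and the helper names `make_reads`, `pooled_beta`, and `fraction_high_alpha` are all assumptions chosen for illustration, and beta is pooled over the whole region as a simplification of the per-site definition.

```python
# Synthetic illustration (assumed data): a low tumor fraction barely moves
# the pooled beta value but is clearly visible in per-read alpha values.

def make_reads(n_background, n_tumor, n_sites=10):
    """Build reads as lists of 0/1 calls (1 = methylated CpG).

    Background reads carry one sporadically methylated site (alpha = 0.1);
    tumor-derived reads are fully methylated (alpha = 1.0).
    """
    background = [[1] + [0] * (n_sites - 1) for _ in range(n_background)]
    tumor = [[1] * n_sites for _ in range(n_tumor)]
    return background + tumor

def pooled_beta(reads):
    """Methylated calls divided by all calls, pooled over the region."""
    methylated = sum(sum(read) for read in reads)
    total = sum(len(read) for read in reads)
    return methylated / total

def fraction_high_alpha(reads, high_threshold=0.9):
    """Fraction of reads whose per-read alpha reaches the high threshold."""
    alphas = [sum(read) / len(read) for read in reads]
    return sum(1 for a in alphas if a >= high_threshold) / len(alphas)

normal = make_reads(n_background=1000, n_tumor=0)
mixed = make_reads(n_background=980, n_tumor=20)  # 2% tumor-derived reads

print(pooled_beta(normal), pooled_beta(mixed))
print(fraction_high_alpha(normal), fraction_high_alpha(mixed))
```

In this toy setting the pooled beta shifts only from 0.1 to 0.118, whereas the fraction of high-alpha reads goes from 0 to 0.02, which is the kind of read-level contrast the paragraph describes.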

Description of the drawings

To explain the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a flowchart of the scheme of Example 1 of the present invention;

Figure 2 shows the methylation distribution of each read of sample RD20230117-Twist15M-1-54525A1 in the region OTX1_200_chr2:63282292-63283121. Panel A shows the sequencing depth, with a maximum depth of 220×. In panel B, the X axis shows the chr2 coordinate and the Y axis shows the number of reads; each read is colored by its state at each methylation site, yellow for unmethylated and black for methylated. The Y axis of panel C coincides with that of panel B, each position representing one read, and the X axis shows the alpha value of that read;

Figure 3 shows the alpha distribution of normal and cancer samples in different hypermethylated regions, with normal samples on the left and cancer samples on the right; the X axis (horizontal) shows the number of reads and the Y axis (vertical) shows the alpha value of each read;

Figure 4 is the training evaluation plot of the convolutional network model built in an embodiment of the present invention. The horizontal axis shows the epoch (the number of times the neural network has traversed the entire training dataset) and the vertical axis shows accuracy. The horizontal axis runs from 0 to 38, indicating that the network was trained for 38 epochs; the vertical axis runs from 0.65 to 1.0, with accuracy starting at 0.65 and gradually rising to 0.98.

Detailed description

The technical solutions of the present invention are described clearly and completely below with reference to the embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.

It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes those combinations.

In this document, "reads" and "read" are used interchangeably; "reads" may denote multiple reads, a class of reads, or a single read.

In this document, the terms "cancer", "carcinoma", and "tumor" are used interchangeably and refer to a disease, disorder, or condition in which cells exhibit relatively abnormal, uncontrolled, and/or autonomous growth, so that they show an abnormally elevated proliferation rate and/or an abnormal growth phenotype.

According to one aspect of the present invention, a screening method for methylation markers is provided. A methylation marker screened by this method is one whose methylation level is related to a specific biological state (for example, a disease) of the tested individual. The present invention does not limit the specific biological state; in an optional embodiment, the specific biological state includes cancer.

The screening method comprises:

(a) Obtaining the sequencing reads of each candidate differentially methylated region of a characteristic sample and a control sample, and calculating the alpha value of each read. The alpha value is the ratio of the number of methylated cytosines in the read to the number of all methylation sites in the read.
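The alpha value defined in step (a) can be sketched as follows. The patent does not specify how per-read methylation calls are stored; this sketch assumes a Bismark-style call string in which `Z` marks a methylated cytosine in CpG context and `z` an unmethylated one, and the function name `read_alpha` is hypothetical.

```python
def read_alpha(call_string):
    """Per-read alpha: methylated CpGs divided by all CpGs on the read.

    Assumes a Bismark-style call string ('Z' = methylated CpG,
    'z' = unmethylated CpG, other characters ignored).
    Returns None for a read that covers no CpG site.
    """
    methylated = call_string.count("Z")
    unmethylated = call_string.count("z")
    total = methylated + unmethylated
    if total == 0:
        return None  # alpha is undefined without any CpG site
    return methylated / total

# Two of the three CpG sites on this read are methylated.
print(read_alpha("..Z...z....Z."))
```

Reads without any covered CpG site are returned as `None` here so that a caller can drop them before building the alpha distribution; how such reads are actually handled is not stated in the source.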

Each candidate differentially methylated region contains at least one known methylation site. The candidate regions can be obtained by conventional methods in the art, for example from differentially methylated regions recorded in published textbooks, references, process manuals, product descriptions, databases, and standards documents, or from regions obtained by the screening methods recorded there, such as regions screened out by beta value. The present invention places no limitation on this.

A characteristic sample is a sample that has the specific biological state described above. A control sample is a sample that does not have that state, or in which that state is below a threshold. Samples may be conventional biological samples in the art, including but not limited to tissues, cells, body fluids, and nucleotides.

(b) Obtaining the alpha distribution model of each candidate differentially methylated region. If the alpha distribution model meets the following criteria, the candidate differentially methylated region, and/or at least one methylation site contained in it, is taken as a methylation marker:

(Ⅰ) At least 80% of the reads of the control sample have an alpha value ≤ the low-methylation threshold, for example but not limited to at least 80%, 82%, 85%, 87%, 90%, 92%, or 95% of the reads. The more control-sample reads have an alpha value below the low-methylation threshold, the higher the specificity of the corresponding differentially methylated region as a marker.

(Ⅱ) At least 30% of the reads of the characteristic sample have an alpha value ≥ the high-methylation threshold, for example but not limited to at least 30%, 35%, 40%, 45%, or 50% of the reads. The more characteristic-sample reads have an alpha value above the high-methylation threshold, the higher the sensitivity of the corresponding differentially methylated region as a marker.

(Ⅲ) High-methylation threshold - low-methylation threshold ≥ N, where N is at least 0.5, for example but not limited to 0.5, 0.6, 0.7, or 0.8.

In addition, the high-methylation threshold and/or low-methylation threshold in the criteria may be the same or different across candidate differentially methylated regions. For example, if 30% of the characteristic-sample reads in one candidate region have alpha values greater than 0.9, the high-methylation threshold is 0.9; if at the same time 80% of the control-sample reads in that region have alpha values less than 0.3, the model parameters for that region are 0.3/0.9. If, in another candidate region, 30% of the characteristic-sample reads have alpha values greater than 0.85 and 80% of the control-sample reads have alpha values less than 0.2, the model parameters for that region are 0.2/0.85. Depending on the methylation characteristics of each region, the alpha distribution model parameters of different candidate regions may be identical, different, or partly identical; for example, at least two candidate regions may share the same low-methylation threshold but have different high-methylation thresholds. Each candidate differentially methylated region can thus have its own alpha distribution model for evaluating whether the region can serve as a methylation marker. Setting the parameters according to the methylation characteristics of each region reduces the misjudgment rate and improves the detection of samples with a specific biological state.
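Criteria (Ⅰ) to (Ⅲ) translate directly into a per-region check. The following is a minimal sketch under the assumption that per-read alpha values for the control and characteristic samples are already available as lists of floats; the function names and the example data are hypothetical, while the default fractions and minimum gap are the lower bounds stated in the text.

```python
def fraction(values, predicate):
    """Fraction of values for which predicate(value) is true."""
    return sum(1 for v in values if predicate(v)) / len(values)

def passes_alpha_criteria(control_alphas, feature_alphas, low, high,
                          min_control_frac=0.80, min_feature_frac=0.30,
                          min_gap=0.5):
    """Check criteria (I)-(III) for one candidate region.

    low / high are the region-specific low- and high-methylation thresholds.
    """
    control_ok = fraction(control_alphas, lambda a: a <= low) >= min_control_frac
    feature_ok = fraction(feature_alphas, lambda a: a >= high) >= min_feature_frac
    gap_ok = (high - low) >= min_gap
    return control_ok and feature_ok and gap_ok

# Assumed example data: 85% of control reads are unmethylated and
# 35% of characteristic-sample reads are fully methylated.
control = [0.0] * 85 + [0.5] * 15
feature = [1.0] * 35 + [0.1] * 65

print(passes_alpha_criteria(control, feature, low=0.3, high=0.9))  # 0.3/0.9 model
print(passes_alpha_criteria(control, feature, low=0.6, high=0.9))  # gap below 0.5
```

Because `low` and `high` are arguments rather than constants, each region can be evaluated with its own parameter pair, such as 0.3/0.9 for one region and 0.2/0.85 for another.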

The low-methylation and high-methylation thresholds can be set according to general rules in the art; both take values in the range 0 to 1.

In an optional embodiment, in step (b), if the differentially methylated region is present in at least three pairs of characteristic and control samples, the differentially methylated region and/or the methylation sites in it are judged to be usable as methylation markers.

In an optional embodiment, the methylation markers include cancer methylation markers, the characteristic sample is cancer tissue, and the control sample is normal tissue adjacent to the cancer. The cancer optionally includes bladder cancer.

In an optional embodiment, the candidate differentially methylated regions are obtained as follows: tumor sample data and normal sample data are obtained from the TCGA database; differential analysis by beta value yields the candidate differentially methylated regions; probes are designed for these regions to capture nucleic acid samples from cancer tissue and adjacent normal tissue; and sequencing then yields the reads of the candidate differentially methylated regions.

According to one aspect of the present invention, a bladder cancer methylation marker is provided. The bladder cancer methylation markers provided by the present invention are obtained with the above screening method and comprise at least one of items 1 to 13 in the following table, where the marker corresponding to each item in the table includes the corresponding methylation site and/or a region containing the corresponding methylation site:

Marker No.  Chromosome  Start position  End position
1           chr1        50886724        50886843
2           chr1        63785486        63785605
3           chr2        63282955        63283074
4           chr2        66667375        66667494
5           chr6        108485949       108486068
6           chr7        27205056        27205284
7           chr7        27205323        27205442
8           chr7        27205600        27205719
9           chr7        19157943        19158062
10          chr10       22518137        22518594
11          chr11       43602804        43602923
12          chr17       17685349        17685468
13          chr18       22929349        22929468

Coordinates are based on the hg19 reference genome sequence.

According to one aspect of the present invention, a method for building a model that identifies methylation features is also provided. The method is based on the differentially methylated regions screened out in step (b) of the above screening method, and comprises taking these regions as input data, building the model with a one-dimensional convolutional neural network, and using a deep learning framework for model training and prediction.

In an optional embodiment, for the differentially methylated regions selected in step (b), the methylation patterns of the reads are encoded and fed into a convolutional neural network to build the model. The model is written in Python and uses the TensorFlow deep learning framework for training and prediction.

In an optional embodiment, the methylation patterns of the reads are one-hot encoded, and a sequential model (Sequential model) is then created. Conv1D (one-dimensional convolution) layers are used to build the model: they convolve the input data and extract its feature information. Other types of layers are then added to increase the complexity and accuracy of the model. Finally, a fully connected layer maps the extracted feature information to the output, and the convolutional neural network model is compiled with a chosen loss function and optimization algorithm.
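The one-hot encoding step can be sketched as follows. The source does not state which categories are encoded or how reads of different length are handled; the three states (unmethylated, methylated, missing), the `0`/`1`/`-` input alphabet, the padding rule, and the name `one_hot_read` are assumptions made for illustration.

```python
# Assumed category order for the one-hot vectors.
STATE_INDEX = {"0": 0,   # unmethylated CpG
               "1": 1,   # methylated CpG
               "-": 2}   # CpG not covered by the read / padding

def one_hot_read(calls, length):
    """Encode one read's CpG calls as a length x 3 one-hot matrix.

    calls  : string over '0', '1', '-' (one character per CpG site)
    length : fixed number of CpG positions expected by the network;
             shorter reads are padded with '-', longer ones truncated.
    """
    padded = calls[:length].ljust(length, "-")
    encoded = []
    for call in padded:
        row = [0, 0, 0]
        row[STATE_INDEX[call]] = 1
        encoded.append(row)
    return encoded

print(one_hot_read("10", length=3))
```

The resulting matrix has shape (length, 3), i.e. positions by channels, which is the layout a one-dimensional convolution layer takes as a single input example.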

In an optional embodiment, the parameters of the Conv1D layer are:

filters=32, kernel_size=3, strides=1, padding="same", activation="relu".

In an optional embodiment, the other types of layers include a pooling layer and/or a Dropout layer.

In an optional embodiment, the parameters of the pooling layer are: pool_size=2, strides=2.

In an optional embodiment, the Dropout layer parameter is 0.5.
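
As a rough illustration of what these layer parameters imply for tensor shapes, the following Python sketch computes the sequence length after a Conv1D layer with padding="same" and after a pooling layer with pool_size=2, strides=2. The function names and the read length of 120 are hypothetical, and the pooling layer is assumed to use no padding because the source does not state it.

```python
import math

def conv1d_same_length(length: int, strides: int = 1) -> int:
    """Output length of a 1D convolution with padding="same"."""
    return math.ceil(length / strides)

def pool1d_valid_length(length: int, pool_size: int = 2, strides: int = 2) -> int:
    """Output length of 1D pooling without padding (an assumption)."""
    return (length - pool_size) // strides + 1

read_length = 120  # hypothetical length of an encoded read
after_conv = conv1d_same_length(read_length, strides=1)  # length is preserved
after_pool = pool1d_valid_length(after_conv, 2, 2)       # length is halved
print((after_conv, 32), (after_pool, 32))  # (steps, filters) after each layer
```

With strides=1 and "same" padding the convolution keeps every position of the read, so the pooling layer is the only step that shortens the sequence.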

In an optional embodiment, the loss function is a cross-entropy loss function.

In an optional embodiment, the optimization algorithm is stochastic gradient descent (SGD), preferably with the parameters lr=0.01, momentum=0.9, weight_decay=0.0001.
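
To make the role of the three SGD parameters concrete, here is a minimal Python sketch of one update step with momentum and weight decay. It uses a textbook formulation in which weight decay is added to the gradient as an L2 term; the exact rule applied by a given deep learning framework may differ, and the function name and toy values are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.0001):
    """Return updated (weights, velocity) after one SGD step."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2-style weight decay folded into the gradient
        v = momentum * v - lr * g  # momentum accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity

# Toy example: two weights, zero initial velocity.
w, v = sgd_momentum_step([1.0, -2.0], [0.5, -0.5], [0.0, 0.0])
```

On the first step the velocity is simply the scaled gradient; on later steps 90% of the previous velocity is carried over, which smooths the updates.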

In an optional embodiment, the model is then trained using at least 4 pairs of characteristic samples and control samples.

In an optional embodiment, the model is used to identify methylation features of cancer tissue, where the cancer optionally includes bladder cancer.

According to one aspect of the present invention, a method for identifying methylation features is also provided, comprising using the model established by the above model establishment method to identify the methylation features of sequencing reads from a sample to be tested.

In an optional embodiment, the methylation patterns of the reads in the differentially methylated regions of the sample to be tested are input into the model. These differentially methylated regions are the ones used to build the model, that is, the differentially methylated regions obtained in step (b) of the above screening method. The model outputs the methylation features of the reads of the sample to be tested.

According to one aspect of the present invention, a methylation feature identification device is also provided. The device contains a prediction module, which inputs the methylation patterns of the reads of the obtained markers into the model established by the above model establishment method to obtain the methylation features of the sample to be tested.

The prediction module may be stored in memory in the form of software or firmware, or built into the operating system (OS) of the methylation feature identification device. The device may optionally also include a storage module that stores the data, program code, and the like required to execute the above module.

The methylation feature identification device may include a memory, a processor, a bus, and a communication interface. The memory, processor, and communication interface are electrically connected to one another, directly or indirectly, to enable data transmission or interaction; for example, these components may be electrically connected through one or more buses or signal lines. The processor may process information and/or data related to target identification to perform one or more of the functions described in this application.

In practice, the methylation feature identification device may be a server, cloud platform, mobile phone, tablet computer, notebook computer, ultra-mobile personal computer (UMPC), handheld computer, netbook, personal digital assistant (PDA), wearable electronic device, virtual reality device, or similar device; the embodiments of this application therefore place no limitation on the type of device.

According to one aspect of the present invention, a computer-readable medium is also provided. It stores a computer program that, when executed by a processor, implements the model established by the above model establishment method. Computer-readable media include, but are not limited to, USB flash drives, removable hard disks, read-only memory, random access memory, magnetic disks, optical discs, and other media capable of storing program code.

According to one aspect of the present invention, a system for predicting, diagnosing, or assisting in the diagnosis of cancer is also provided, comprising at least one of the following:

(a) reagents and/or equipment for detecting the methylation markers obtained by the above screening method;

(b) reagents and/or equipment for detecting the above bladder cancer methylation markers;

(c) the above methylation feature identification device;

(d) the above computer-readable medium.

The present invention is further described below through specific examples. It should be understood, however, that these examples serve only to illustrate the invention in more detail and should not be construed as limiting it in any way.

Example 1

This example provides a method for identifying methylation features of bladder cancer. The overall scheme is shown in Figure 1 and includes the following steps:

1. Select differentially methylated sites according to beta values and design probes:

The Cancer Genome Atlas (TCGA) database contains 433 bladder cancer (Bladder Urothelial Carcinoma, BLCA) samples: 412 tumor samples (Primary Tumor) and 21 normal samples (Solid Tissue Normal). The downloaded 450K data contain 485,577 cg (methylation) sites in total. Differential analysis was performed with the ChAMP package using the thresholds logFC_t > 0.4 and P.Value_t < 0.001, yielding 217 differential cg sites, and probes were designed for the corresponding positions.
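
The thresholding in this step can be sketched in Python as a simple filter over a table of differential-analysis results. This is only an illustration: ChAMP itself is an R package, and the record layout, field names (`probe`, `logFC`, `p_value`), and probe identifiers below are hypothetical.

```python
def select_differential_sites(rows, min_logfc=0.4, max_p=0.001):
    """Keep probes with logFC > min_logfc and p-value < max_p."""
    return [
        row["probe"]
        for row in rows
        if row["logFC"] > min_logfc and row["p_value"] < max_p
    ]

# Hypothetical result rows, not data from the source.
rows = [
    {"probe": "cg_example_1", "logFC": 0.52, "p_value": 1e-6},
    {"probe": "cg_example_2", "logFC": 0.10, "p_value": 1e-8},  # effect too small
    {"probe": "cg_example_3", "logFC": 0.45, "p_value": 0.02},  # not significant
]
selected = select_differential_sites(rows)
```

Both conditions must hold at once, so a site with a very small p-value but a small effect is still excluded.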

2. Methylation library construction and probe capture:

DNA was extracted from 1 pair of bladder cancer and adjacent normal tissues and subjected to enzymatic digestion, end repair, and dA-tailing. Methylated adapters from the iGeneTech (艾吉泰康) double-stranded library construction kit were then ligated. After magnetic bead purification, methylation conversion was performed with a Zymo kit, followed by amplification, and finally capture with the probes designed in step 1.

3. Quality control and alignment:

After next-generation sequencing, the raw data were first quality-controlled with Trim Galore using the parameters: --illumina --paired --fastqc --length 20 -q 20 --clip_r1 10 --clip_r2 10 --three_prime_clip_r1 10 --three_prime_clip_r2 10. Alignment was performed with Bismark against the hg19 reference genome, with all other parameters left at their defaults. After deduplication with deduplicate_bismark, bismark_methylation_extractor was used to extract methylation positions and calls, and a script was then used to merge the CpG information from the OT and OB strands.
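
For readers who script this step, the following Python sketch assembles the Trim Galore command line from the parameters listed above without running it. The option names follow Trim Galore's documented capitalization (`--clip_R1` and so on), which the source writes in lowercase; the FASTQ file names are hypothetical, and the Bismark steps are left out because the source does not give their full command lines.

```python
import shlex

def trim_galore_command(read1, read2):
    """Build the argument list for paired-end trimming with the stated settings."""
    return [
        "trim_galore", "--illumina", "--paired", "--fastqc",
        "--length", "20", "-q", "20",
        "--clip_R1", "10", "--clip_R2", "10",
        "--three_prime_clip_R1", "10", "--three_prime_clip_R2", "10",
        read1, read2,
    ]

# Hypothetical input files.
cmd = trim_galore_command("sample_R1.fastq.gz", "sample_R2.fastq.gz")
command_line = shlex.join(cmd)  # printable form; pass cmd to subprocess.run to execute
```

Keeping the arguments as a list avoids shell quoting problems if the command is later executed with `subprocess.run(cmd, check=True)`.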

4. Identify the methylation patterns of reads:

The CpG methylation pattern is identified for each captured region, with methylated sites recorded as "1" and unmethylated sites as "0", and the number of CpGs and the alpha value are calculated for each read, as shown in Table 1. Taking sample RD20230117-Twist15M-1-54525A1 in the region OTX1_200_chr2:63282292-63283121 as an example, the methylation distribution of each read is shown in Figure 2.

Table 1
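
The per-read calculation can be sketched in Python as follows, taking the alpha value to be the number of methylated CpG sites in a read divided by the number of CpG sites it covers, as defined in claim 1. The function name and the example patterns are hypothetical.

```python
def read_alpha(pattern: str):
    """Return (number of CpGs, alpha) for a 0/1 methylation pattern string."""
    cpg_count = len(pattern)
    if cpg_count == 0:
        raise ValueError("read covers no CpG site")
    methylated = pattern.count("1")
    return cpg_count, methylated / cpg_count

# Hypothetical read patterns: "1" = methylated CpG, "0" = unmethylated CpG.
patterns = ["1111", "1101", "0000", "0100"]
summary = [(p, *read_alpha(p)) for p in patterns]
```

A fully methylated read therefore has alpha = 1, and a fully unmethylated read has alpha = 0.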

5. Alpha distribution model of tumor and normal samples:

As the proportion of tumor cells increases, the number of reads with high alpha values also increases. Methylated positions whose alpha distributions differ significantly between tumor and normal samples are extracted as candidate regions. Within a differentially methylated region, the alpha distribution must satisfy the following principles:

(1) To improve the specificity of the model, 80% of the reads from normal samples have an alpha value ≤ the low-methylation threshold;

(2) To improve the sensitivity of the model, 30% of the reads from tumor samples have an alpha value ≥ the high-methylation threshold;

(3) Within a candidate differentially methylated region, (high-methylation threshold) - (low-methylation threshold) ≥ 0.5; the larger the difference, the more robust the model.
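
The three principles above can be checked for one candidate region with a short Python sketch. It assumes the two thresholds are supplied by the user, since the source does not say how they are chosen, and reads the 80% and 30% figures as minimum fractions, as claim 1 does; the function name and alpha values are hypothetical.

```python
def passes_alpha_criteria(normal_alphas, tumor_alphas, low, high, min_gap=0.5):
    """Check the three alpha-distribution principles for one candidate region."""
    normal_frac = sum(a <= low for a in normal_alphas) / len(normal_alphas)
    tumor_frac = sum(a >= high for a in tumor_alphas) / len(tumor_alphas)
    # A tiny tolerance guards against floating-point error in (high - low).
    gap_ok = (high - low) >= min_gap - 1e-9
    return normal_frac >= 0.8 and tumor_frac >= 0.3 and gap_ok

# Hypothetical per-read alpha values for one region.
normal = [0.0, 0.0, 0.25, 0.0, 0.5]
tumor = [1.0, 0.75, 0.0, 0.25, 1.0]
accepted = passes_alpha_criteria(normal, tumor, low=0.25, high=0.75)
rejected = passes_alpha_criteria(normal, tumor, low=0.25, high=0.5)
```

In the second call the read fractions still pass, but the thresholds are only 0.25 apart, so the region fails principle (3).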

Methylated regions were further screened based on the above alpha distribution model; the results are shown in Figure 3.

6. Specific differentially methylated regions:

Using 4 pairs of bladder cancer and adjacent normal tissues, the corresponding differential regions were identified with the above method.

Normal (adjacent normal tissue):

50765B1_BLCA-N/58590C1_BLCA-N/54525D1_BLCA-N/56577C1_BLCA-N

Tumor (bladder cancer tissue):

50765C1_BLCA-T/58590A1_BLCA-T/54525A2_BLCA-T/56577A1_BLCA-T

The differentially methylated regions present in at least 3 pairs of samples were then selected as the differentially methylated regions for subsequent model training. The results are as follows:

Table 2

In the feature description column of the table above, A|B|C|D denote: A, the gene to which the methylation site maps; B, the length of the methylated region; C, the distance to the transcription start site (TSS); D, the region of the gene in which the methylated region lies. For example, "HOXA9|120|-451|Promoter" denotes gene HOXA9, region length 120, distance to the TSS of -451, promoter region. Likewise, "EBLN1|458|-19225|Exon" denotes gene EBLN1, region length 458, distance to the TSS of -19225, exon region.
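
Step 6 can be sketched in Python in two parts: keeping the regions found in at least 3 sample pairs, and parsing the A|B|C|D feature description. The region labels in the example, the function names, and the `RegionAnnotation` type are hypothetical; only the two annotation strings come from the text above.

```python
from collections import Counter
from typing import NamedTuple

class RegionAnnotation(NamedTuple):
    gene: str
    length: int
    tss_distance: int
    context: str

def recurrent_regions(regions_per_pair, min_pairs=3):
    """Regions detected in at least min_pairs tumor/normal pairs."""
    counts = Counter(r for regions in regions_per_pair for r in set(regions))
    return sorted(r for r, n in counts.items() if n >= min_pairs)

def parse_annotation(text):
    """Split an 'A|B|C|D' feature description into typed fields."""
    gene, length, distance, context = text.split("|")
    return RegionAnnotation(gene, int(length), int(distance), context)

# Hypothetical differential regions found in each of 4 sample pairs.
pairs = [
    ["region_A", "region_B", "region_C"],
    ["region_A", "region_B"],
    ["region_A", "region_C", "region_C"],
    ["region_B", "region_D"],
]
kept = recurrent_regions(pairs)
hoxa9 = parse_annotation("HOXA9|120|-451|Promoter")
```

Each pair is converted to a set before counting, so a region reported twice within one pair is counted only once.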

7. Deep learning model:

Based on the selected alpha-based differentially methylated regions, the methylation patterns of the reads are encoded: A, T, G, and C are encoded as 0, 1, 2, and 3 respectively, and methylated C is encoded as 4. The encoded matrix serves as the input to a model built with a convolutional neural network (Conv1D). The model is built in Python, and the TensorFlow deep learning framework is used for training and prediction. The steps are as follows:

(1) Extract the reads in the differentially methylated regions of each sample and encode the corresponding A/T/unmethylC/G/methylC symbols, so that the subsequent model can process the reads obtained by next-generation sequencing.

(2) Apply one-hot encoding. Each position in a read takes one of 5 values (A/T/unmethylC/G/methylC), which one-hot encoding converts into a 5-dimensional vector in which exactly one element is 1 and the rest are 0. This vector is then fed into Conv1D as part of the input tensor for training.

(3) Create a Sequential model and use a Conv1D layer to build it, performing convolution on the above data to extract feature information. Parameters: filters=32, kernel_size=3, strides=1, padding="same", activation="relu". Other types of layers (pooling layer/Dropout layer) are then added to increase the complexity and accuracy of the model. Pooling layer parameters: pool_size=2, strides=2; Dropout layer parameter: 0.5.

Dense: activation='relu', kernel_regularizer=None, bias_regularizer=None

Finally, a Dense layer (relu) serves as the fully connected layer, mapping the extracted feature information to the output.

(4) The Conv1D model built above is compiled with a loss function (cross-entropy) and an optimization algorithm (stochastic gradient descent, SGD). The SGD parameters were chosen by comparison across multiple experiments: lr=0.01, momentum=0.9, weight_decay=0.0001.

(5) Train the model: 4 pairs of bladder cancer and adjacent normal tissues are used to train the model:

Normal: 50765B1_BLCA-N/58590C1_BLCA-N/54525D1_BLCA-N/56577C1_BLCA-N;

Tumor: 50765C1_BLCA-T/58590A1_BLCA-T/54525A2_BLCA-T/56577A1_BLCA-T.

The training and evaluation curves of the convolutional network model are shown in Figure 4.
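
The data flow of steps (2) to (4) can be traced without any framework in the following Python sketch. It is not the TensorFlow model of this example: it encodes one read, applies a single hand-written Conv1D filter (the source uses 32 learned filters) with padding="same" and ReLU, max-pools the result, and evaluates a softmax cross-entropy loss. The symbol "M" for methylated C, the filter weights, and the logits are hypothetical.

```python
import math

# Integer codes from the text: A, T, G, C -> 0..3 and methylated C -> 4.
CODES = {"A": 0, "T": 1, "G": 2, "C": 3, "M": 4}

def one_hot_read(read):
    """Turn a read into a (length x 5) one-hot matrix."""
    matrix = []
    for base in read:
        row = [0.0] * len(CODES)
        row[CODES[base]] = 1.0
        matrix.append(row)
    return matrix

def conv1d_same_relu(x, kernel, bias=0.0):
    """One Conv1D filter with strides=1 and padding="same", followed by ReLU."""
    pad = len(kernel) // 2
    zero = [0.0] * len(x[0])
    padded = [zero] * pad + x + [zero] * pad
    out = []
    for i in range(len(x)):
        total = bias
        for j, weights in enumerate(kernel):
            total += sum(a * w for a, w in zip(padded[i + j], weights))
        out.append(max(0.0, total))
    return out

def max_pool1d(values, pool_size=2, strides=2):
    """Max pooling without padding."""
    last_start = len(values) - pool_size
    return [max(values[i:i + pool_size]) for i in range(0, last_start + 1, strides)]

def cross_entropy(logits, true_class):
    """Softmax cross-entropy for a single example."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[true_class]

# Hypothetical filter that counts methylated C in each 3-base window.
kernel = [[0.0, 0.0, 0.0, 0.0, 1.0]] * 3
encoded = one_hot_read("AMGM")
features = conv1d_same_relu(encoded, kernel)
pooled = max_pool1d(features)
loss = cross_entropy([0.0, 0.0], true_class=1)  # uninformative two-class logits
```

In the trained model the filter weights are learned rather than fixed, and the pooled features pass through the Dropout and Dense layers before the loss is computed.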

8. Model prediction and performance evaluation:

Fifteen bladder samples, comprising 10 bladder cancer samples and 5 adjacent normal samples, were used for model validation. The model was trained and tested on the training set (60%) and test set (20%) and gave good classification results. Comparing the predicted labels with the actual labels, the model reached an accuracy of 97.8% on the test set, which adequately meets the needs of classification prediction.

The model established in this example has good classification and prediction ability, along with some capacity for learning and self-adjustment. It was then used to predict 16 blood samples. In only one sample (result_RD20230421-CFD-nova-1598) were 7% of the reads classified as containing a tumor component. The results are shown in the following table:

Table 3
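
The per-sample figure reported here, the share of reads classified as tumor, can be computed as in the following Python sketch. It assumes the model yields one label per read; the label strings and the example counts are hypothetical and chosen only to reproduce a 7% fraction.

```python
def tumor_read_fraction(predicted_labels):
    """Fraction of reads in one sample that the model labels as tumor."""
    if not predicted_labels:
        raise ValueError("sample has no classified reads")
    tumor_reads = sum(1 for label in predicted_labels if label == "tumor")
    return tumor_reads / len(predicted_labels)

# Hypothetical per-read predictions for one blood sample.
labels = ["tumor"] * 7 + ["normal"] * 93
fraction = tumor_read_fraction(labels)
```

A sample-level call would then need a cutoff on this fraction, which the source does not specify.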

Finally, it should be noted that the above embodiments serve only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and some or all of their technical features may be replaced with equivalents, without the essence of the corresponding technical solutions departing from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for screening methylation markers, characterized by comprising:
(a) obtaining sequencing reads of each candidate differentially methylated region from characteristic samples and control samples, and calculating the alpha value of each read, the alpha value being the ratio of the number of methylated cytosines in each read to the number of all methylation sites;
(b) obtaining the alpha distribution model of each candidate differentially methylated region; if the alpha distribution model satisfies the following principles, the candidate differentially methylated region, and/or at least one methylation site contained in the candidate differentially methylated region, is taken as a methylation marker:
at least 80% of the reads of the control sample have an alpha value ≤ the low-methylation threshold;
at least 30% of the reads of the characteristic sample have an alpha value ≥ the high-methylation threshold;
(high-methylation threshold) - (low-methylation threshold) ≥ N, where N is at least 0.5;
the high-methylation thresholds and/or low-methylation thresholds in the principles satisfied by the alpha distribution models of the individual candidate differentially methylated regions are the same or different.

2. The screening method according to claim 1, characterized in that, in step (b), the differentially methylated region is present in at least three pairs of characteristic samples and control samples;
preferably, the methylation markers include methylation markers of cancer, the characteristic sample is cancer tissue, and the control sample is adjacent normal tissue;
preferably, the cancer includes bladder cancer;
preferably, tumor sample data and normal sample data are obtained from the TCGA database, differential analysis is performed according to beta values to obtain candidate differentially methylated regions, probes are designed according to the differentially methylated regions, nucleic acid samples from cancer tissue and adjacent normal tissue are captured, and the sequencing reads of the candidate differentially methylated regions are obtained after sequencing.

3. Bladder cancer methylation markers, characterized by comprising at least one of items 1 to 13 in the following table, wherein the marker corresponding to each item in the table includes the corresponding methylation site and/or a region containing the corresponding methylation site:

Marker No.   Chromosome   Start position   End position
1            chr1         50886724         50886843
2            chr1         63785486         63785605
3            chr2         63282955         63283074
4            chr2         66667375         66667494
5            chr6         108485949        108486068
6            chr7         27205056         27205284
7            chr7         27205323         27205442
8            chr7         27205600         27205719
9            chr7         19157943         19158062
10           chr10        22518137         22518594
11           chr11        43602804         43602923
12           chr17        17685349         17685468
13           chr18        22929349         22929468

Positions are based on the hg19 reference genome sequence.

4. A method for establishing a model for identifying methylation features, characterized by comprising: using the differentially methylated regions obtained in step (b) of the screening method according to claim 1 or 2 as input data, building a model with a one-dimensional convolutional neural network, and using a deep learning framework for model training and prediction.

5. The model establishment method according to claim 4, characterized in that the methylation patterns of the reads are encoded according to the differentially methylated regions selected in step (b) and input into a convolutional neural network to build the model; the model is built in Python, and the TensorFlow deep learning framework is used for model training and prediction;
preferably, the methylation patterns of the reads are encoded using one-hot encoding; a Sequential model is then created and a one-dimensional convolutional neural network layer is used to build the model, performing convolution on the input data to extract feature information; other types of layers are then added to increase the complexity and accuracy of the model; finally, a fully connected layer maps the extracted feature information to the output, and the resulting convolutional neural network model is compiled with a selected loss function and optimization algorithm;
preferably, the model is trained using at least 4 pairs of characteristic samples and control samples;
preferably, the other types of layers include a pooling layer and/or a Dropout layer;
preferably, the loss function is a cross-entropy loss function;
preferably, the optimization algorithm is stochastic gradient descent, preferably with the parameters lr=0.01, momentum=0.9, weight_decay=0.0001.

6. The model establishment method according to claim 4 or 5, characterized in that the model is used to identify methylation features of cancer tissue;
preferably, the cancer includes bladder cancer.

7. A method for identifying methylation features, characterized by using the model established by the establishment method according to any one of claims 4 to 6 to identify the methylation features of sequencing reads from a sample to be tested;
preferably, the methylation patterns of the reads in the differentially methylated regions of the sample to be tested are input into the model, the differentially methylated regions being those used to build the model, and the model outputs the methylation features of the reads of the sample to be tested.

8. A methylation feature identification device, characterized by containing a prediction module, the prediction module being used to input the methylation patterns of the reads in the differentially methylated regions into the model established by the establishment method according to any one of claims 4 to 6 to obtain the methylation features of the sample to be tested.

9. A computer-readable medium, characterized by storing a computer program that, when executed by a processor, implements the model according to any one of claims 4 to 6.

10. A system for predicting, diagnosing, or assisting in the diagnosis of cancer, characterized by comprising at least one of the following:
(a) reagents and/or equipment for detecting the methylation markers obtained by the screening method according to claim 1 or 2;
(b) reagents and/or equipment for detecting the bladder cancer methylation markers according to claim 3;
(c) the methylation feature identification device according to claim 8;
(d) the computer-readable medium according to claim 9.
CN202310713016.6A 2023-06-15 2023-06-15 Screening method for methylation markers, method and device for identifying methylation features Pending CN116705159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310713016.6A CN116705159A (en) 2023-06-15 2023-06-15 Screening method for methylation markers, method and device for identifying methylation features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310713016.6A CN116705159A (en) 2023-06-15 2023-06-15 Screening method for methylation markers, method and device for identifying methylation features

Publications (1)

Publication Number Publication Date
CN116705159A true CN116705159A (en) 2023-09-05

Family

ID=87823546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310713016.6A Pending CN116705159A (en) 2023-06-15 2023-06-15 Screening method for methylation markers, method and device for identifying methylation features

Country Status (1)

Country Link
CN (1) CN116705159A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016041010A1 (en) * 2014-09-15 2016-03-24 Garvan Institute Of Medical Research Methods for diagnosis, prognosis and monitoring of breast cancer and reagents therefor
WO2018158589A1 (en) * 2017-03-02 2018-09-07 Ucl Business Plc Diagnostic and prognostic methods
CN109686414A (en) * 2018-12-28 2019-04-26 陈洪亮 It is only used for the choosing method of the special DNA methylation assay Sites Combination of Hepatocarcinoma screening
US20210156863A1 (en) * 2017-11-03 2021-05-27 University Health Network Cancer detection, classification, prognostication, therapy prediction and therapy monitoring using methylome analysis
CN112941180A (en) * 2021-02-25 2021-06-11 浙江大学医学院附属妇产科医院 Group of lung cancer DNA methylation molecular markers and application thereof in preparation of lung cancer early diagnosis kit

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAQI LI et al.: "DISMIR: Deep learning-based noninvasive cancer detection by integrating DNA sequence and methylation information of individual cell-free DNA reads", BRIEF BIOINFORM, vol. 22, no. 6, 5 November 2021 (2021-11-05) *
LEIHONG DENG et al.: "A novel and sensitive DNA methylation marker for the urine-based detection of bladder cancer", BMC CANCER, vol. 22, no. 1, 6 May 2022 (2022-05-06) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination