
WO2025194306A1 - Methylation data processing and analysis method and platform, storage medium, and program product - Google Patents

Methylation data processing and analysis method and platform, storage medium, and program product

Info

Publication number
WO2025194306A1
Authority
WO
WIPO (PCT)
Prior art keywords
methylation
analysis
data
sample
statistics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/082205
Other languages
English (en)
Chinese (zh)
Inventor
宋阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd and Beijing BOE Technology Development Co Ltd
Priority to PCT/CN2024/082205
Publication of WO2025194306A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Definitions

  • the embodiments of the present disclosure relate to, but are not limited to, the field of biotechnology, and in particular to a methylation data processing and analysis method and platform, a storage medium, and a program product.
  • DNA methylation is a type of genomic epigenetic modification. There is increasing evidence that methylation is involved in the development and progression of diseases and the regulation of related biological pathways. Methylation sequencing is a type of high-throughput sequencing and is the original source of information explaining the involvement of DNA methylation in biological functions. In order to effectively and easily perform in-depth data analysis and display results on methylation, the role of a one-stop platform that integrates methylation data processing, analysis, and visualization is particularly important.
  • the present disclosure provides a method for processing and analyzing methylation data, including:
  • The multiple sample information includes sample cohort information, cohort comparison information, and sample methylation sequencing data, wherein the sample cohort information includes the samples and the cohorts to which the samples belong, and the cohort comparison information includes one or more groups of cohorts to be compared, wherein each group includes two cohorts;
  • the data processing includes sequencing data quality control, sequencing data alignment, and alignment result statistics.
  • the data analysis includes sample methylation analysis, inter-group methylation analysis, and differential methylation analysis.
  • the sample methylation analysis is used to analyze the methylation level of each sample.
  • the inter-group methylation analysis is used to compare the methylation levels between samples of each different cohort.
  • the differential methylation analysis is used to identify differentially methylated regions between each different cohort and perform multi-dimensional analysis on them.
  • An embodiment of the present disclosure also provides a methylation data processing and analysis platform, comprising a memory; and a processor connected to the memory, wherein the memory is used to store instructions, and the processor is configured to execute the steps of the methylation data processing and analysis method described in any embodiment of the present disclosure based on the instructions stored in the memory.
  • the embodiments of the present disclosure further provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the methylation data processing and analysis method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a program product comprising instructions which, when executed, perform the methylation data processing and analysis method described in any embodiment of the present disclosure.
  • the present disclosure also provides a methylation data processing and analysis platform, including: a task generation module, a data processing and analysis module, a visualization module, and a report generation module, wherein:
  • The task generation module is configured to receive multiple sample information and generate a task script based on the received sample information, wherein the multiple sample information includes sample cohort information, cohort comparison information, and sample methylation sequencing data, wherein the sample cohort information includes the samples and the cohorts to which the samples belong, and the cohort comparison information includes one or more groups of cohorts to be compared, wherein each group includes two cohorts;
  • the data processing and analysis module is configured to perform data processing and data analysis according to the generated task script.
  • the data processing includes sequencing data quality control, sequencing data alignment, and alignment result statistics performed in sequence.
  • the data analysis includes sample methylation analysis, inter-group methylation analysis, and differential methylation analysis.
  • the sample methylation analysis is used to analyze the methylation level of each sample.
  • the inter-group methylation analysis is used to compare the methylation levels between samples of each different cohort.
  • the differential methylation analysis is used to identify differentially methylated regions between each different cohort and perform multi-dimensional analysis on them.
  • the visualization module is configured to generate graphics and/or tables based on the results of data processing and data analysis by the data analysis module;
  • the report generation module is configured to output an interactive report based on the graphs and/or tables generated by the visualization module.
  • FIG. 1 is a schematic flow chart of a methylation data processing and analysis method provided by an exemplary embodiment of the present disclosure.
  • FIG. 2 is a schematic flow chart of another methylation data processing and analysis method provided by an exemplary embodiment of the present disclosure.
  • FIG. 3 is a schematic structural diagram of a methylation data processing and analysis platform provided by an exemplary embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of the input of a task generation module provided by an exemplary embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of the processing flow included in a data processing and analysis module provided by an exemplary embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of the processing flow included in a visualization module provided by an exemplary embodiment of the present disclosure.
  • FIG. 7 is a schematic structural diagram of a report generation module provided by an exemplary embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of another methylation data processing and analysis platform provided by an exemplary embodiment of the present disclosure.
  • the technical or scientific terms used in the embodiments of the present disclosure should have the ordinary meaning understood by people with ordinary skills in the field to which the present disclosure belongs.
  • the words “first”, “second” and similar words used in the embodiments of the present disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components.
  • the words “include” or “comprising” and similar words mean that the elements or objects preceding the word include the elements or objects listed after the word and their equivalents, without excluding other elements or objects.
  • Step 101: Receive multiple sample information and generate a task script based on the received sample information.
  • The multiple sample information includes sample cohort information, cohort comparison information, and sample methylation sequencing data.
  • The sample cohort information includes the samples and the cohorts to which the samples belong.
  • The cohort comparison information includes one or more groups of cohorts to be compared.
  • Each group includes two cohorts.
  • Step 102: Perform data processing and data analysis according to the generated task script.
  • Data processing includes sequencing data quality control, sequencing data alignment, and alignment result statistics.
  • Data analysis includes sample methylation analysis, inter-group methylation analysis, and differential methylation analysis. Sample methylation analysis is used to analyze the methylation level of each sample. Inter-group methylation analysis is used to compare the methylation levels between samples in different cohorts. Differential methylation analysis is used to identify differentially methylated regions between different cohorts and perform multi-dimensional analysis on them.
  • Step 103: Generate graphs and/or tables based on the results of data processing and data analysis.
  • Step 104: Output an interactive report based on the generated graphs and/or tables.
  • the methylation data processing and analysis method provided by the present disclosure integrates methylation data processing, analysis, and visualization (i.e., graphical and/or tabular display). By simply inputting multiple sample information, a complete interactive report for interpreting the methylation sequencing data analysis results can be produced in one stop. It has a high degree of automation, is convenient for non-professionals to operate, and can produce an in-depth analysis report for sample methylation sequencing data in a short period of time. It supports multi-group difference analysis based on the provided group information.
  • The interactive report output by the embodiment of the present disclosure can accommodate a large number of samples and includes interactive graphs and tables, which makes it easy for readers to view and understand. Content can be inserted into or removed from the report to extend or trim it.
  • sample cohort information is used to determine the cohort to which a sample belongs, and cohort comparison information is used to provide any one or more cohorts for which a difference comparison is required.
  • Each comparison group contains two cohorts, and each cohort contains multiple samples.
  • The sample methylation sequencing data is used as the source for downstream data processing and data analysis.
  • the sample number, cohort number, and comparison scheme number (i.e., the two cohort numbers being compared)
  • The one or more groups of cohorts to be compared include at least one of the following:
  • A disease remission cohort and a disease non-remission cohort, among which the disease remission cohort includes a complete remission cohort and a partial remission cohort, and the disease non-remission cohort includes a stable disease cohort and a disease progression cohort;
  • This disclosure takes the methylation data processing and analysis of the prognosis of clinical tumor patients as an example.
  • the patients can be divided into four types: CR (complete remission), PR (partial remission), SD (stable), and PD (progression).
  • Each patient can be numbered in the form "type-01", "type-02", and so on.
  • For example, patient No. 1 of the CR type is "CR-01".
  • The numbering may alternatively start at 001 or 0001.
  • The type of the patient is the cohort information of the patient.
  • For example, the cohort information of patient "CR-01" and patient "CR-02" is "CR" in both cases.
  • Each patient has corresponding sample methylation sequencing data. When receiving the sample methylation sequencing data, the user enters the sample methylation sequencing data path.
  • the present disclosure can customize the mining of potential biomarkers between cohorts with obvious or vague clinical characteristics through the cohort comparison setting of inter-group differences.
  • The user can set the cohort comparison information according to the differences between two cohorts that they want to observe. For example, two comparison scheme numbers can be set, "CR_vs_PR" and "SD_vs_PD", where a comparison scheme number takes the form "cohort number 1_vs_cohort number 2". More comparison schemes can be set according to the specific situation.
  • A comparison scheme is usually set for differences in clinical phenotypes, traits, and the like. When the biological information is vague (for example, when it is unclear whether a certain feature affects the development of the disease), a comparison scheme can also be set to determine whether the two cohorts differ, in order to mine new potential biomarkers.
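  • The naming conventions above ("type-01" sample numbers and "cohort number 1_vs_cohort number 2" comparison scheme numbers) can be sketched in code. The following is a minimal illustration only: the function names `cohort_of`, `parse_comparison`, and `samples_for` are hypothetical and are not taken from the disclosure, and the parsing rules assume IDs are formatted exactly as in the examples.

```python
def cohort_of(sample_id: str) -> str:
    """Return the cohort part of an ID such as 'CR-01' (text before the last '-')."""
    cohort, sep, number = sample_id.rpartition("-")
    if not sep or not cohort or not number.isdigit():
        raise ValueError(f"unexpected sample ID: {sample_id!r}")
    return cohort


def parse_comparison(scheme: str) -> tuple[str, str]:
    """Split a scheme number such as 'CR_vs_PR' into its two cohort numbers."""
    first, sep, second = scheme.partition("_vs_")
    if not sep or not first or not second:
        raise ValueError(f"unexpected comparison scheme: {scheme!r}")
    return first, second


def samples_for(scheme: str, sample_ids: list[str]) -> dict[str, list[str]]:
    """Group the samples that take part in one comparison scheme by cohort."""
    first, second = parse_comparison(scheme)
    groups: dict[str, list[str]] = {first: [], second: []}
    for sample_id in sample_ids:
        cohort = cohort_of(sample_id)
        if cohort in groups:
            groups[cohort].append(sample_id)
    return groups


samples = ["CR-01", "CR-02", "PR-01", "SD-01", "PD-01"]
print(samples_for("CR_vs_PR", samples))
```

  • Samples from cohorts that are not named in the scheme (here SD and PD) are simply left out of that comparison.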
  • The one or more groups of cohorts to be compared may further include:
  • a first patient cohort and a second patient cohort, wherein all samples in the first patient cohort have the first feature, and all samples in the second patient cohort do not have the first feature.
  • the first feature can be set as needed. For example, if the user wants to study the effect of smoking on the efficacy of anti-tumor treatment, then the first feature can be set to smoking, that is, the samples in the first patient cohort are all smokers, and the samples in the second patient cohort are all non-smokers.
  • the method further includes: receiving background information, wherein the background information includes at least one of the following: file dependency information, software dependency information, custom script information, and parameter information, the file dependency information includes information about dependent files required for data processing and data analysis, the software dependency information includes information about software required for data processing and data analysis, the custom script information includes intermediate data connection scripts and data statistics scripts required for data processing and data analysis, and the parameter information includes parameters required for data processing and data analysis.
  • In this embodiment, the background information includes file dependency information, software dependency information, custom script information, and parameter information.
  • The file dependency information includes, but is not limited to, the reference genome fasta file, the genome chromosome length file, the genome annotation file (gff3), the genome gene interval bed file, the methylation CpG island file, the gene transcription start site file, and the result output path.
  • The software dependency information mainly includes, but is not limited to, sequencing data quality control software, methylation sequencing data alignment software, differentially methylated region identification software, and the Python and R programming environments.
  • The custom script information includes, but is not limited to, intermediate data connection scripts and data statistics scripts.
  • The parameter information includes the customizable parameters of the data analysis process, including but not limited to the number of cores used to run the process and the difference identification threshold.
  • the hg19 or hg38 version of the fasta file is used as the reference genome.
  • The reference genome file can be downloaded from websites such as UCSC, NCBI, and GENCODE.
  • the fasta file is indexed using samtools, and the reference genome chromosome length file can be generated based on the constructed fasta.fai file.
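  • As an illustration of the step above: `samtools faidx genome.fa` writes a tab-separated `genome.fa.fai` index whose first two columns are the sequence name and its length, so a chromosome length file can be derived by keeping those two columns. The sketch below assumes that layout; the function name `chrom_sizes_from_fai` and the byte offsets in the sample lines are illustrative, not taken from the disclosure.

```python
def chrom_sizes_from_fai(fai_lines):
    """Extract (name, length) pairs from the lines of a samtools .fai index.

    Each .fai line is tab-separated: name, length, offset, linebases, linewidth.
    Only the first two columns are needed for a chromosome length file.
    """
    sizes = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        sizes.append((fields[0], int(fields[1])))
    return sizes


# Two illustrative index lines (the offsets are made up for the example).
example_fai = [
    "chr1\t248956422\t6\t60\t61\n",
    "chrM\t16569\t3151738000\t60\t61\n",
]

for name, length in chrom_sizes_from_fai(example_fai):
    print(f"{name}\t{length}")
```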
  • the human genome annotation file gff3, the human genome gene interval bed file, the human methylation CpG island file, and the human gene transcription start site file can also be downloaded from one or more of the above websites.
  • Software dependency information can be dynamically adjusted according to different application scenarios.
  • For example, quality detection can be performed using fastqc.
  • Data cleaning can be performed using trim_galore.
  • Alignment can be performed using bismark.
  • Alignment statistics and the statistics of multiple indicators can be computed using software such as picard, samtools, seqkit, and bedtools.
  • Difference identification can be performed using metilene.
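  • To make the idea of a generated task script concrete, the sketch below assembles the per-sample quality control, cleaning, and alignment commands into a shell script. It is an assumption-laden illustration, not the script of the disclosure: the function names, paths, and paired-end layout are hypothetical, and the command-line options shown (`fastqc -o`, `trim_galore --paired --basename -o`, `bismark --genome -1 -2 -o`) should be checked against the installed tool versions before use.

```python
import shlex


def build_commands(sample, r1, r2, genome_dir, out_dir):
    """Return the command lists for one paired-end sample (hypothetical layout)."""
    # With --basename, Trim Galore is expected to name paired outputs
    # <basename>_val_1.fq.gz and <basename>_val_2.fq.gz (assumption to verify).
    trimmed_r1 = f"{out_dir}/{sample}_val_1.fq.gz"
    trimmed_r2 = f"{out_dir}/{sample}_val_2.fq.gz"
    return [
        ["fastqc", "-o", out_dir, r1, r2],                      # quality detection
        ["trim_galore", "--paired", "--basename", sample,
         "-o", out_dir, r1, r2],                                # data cleaning
        ["fastqc", "-o", out_dir, trimmed_r1, trimmed_r2],      # quality detection again
        ["bismark", "--genome", genome_dir,
         "-1", trimmed_r1, "-2", trimmed_r2, "-o", out_dir],    # alignment
    ]


def render_script(commands):
    """Render command lists as a bash script with safe quoting."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
    lines += [shlex.join(cmd) for cmd in commands]
    return "\n".join(lines) + "\n"


cmds = build_commands("CR-01", "raw/CR-01_R1.fq.gz", "raw/CR-01_R2.fq.gz",
                      "ref/hg38", "out")
script = render_script(cmds)
print(script)
```

  • Building each command as a list and quoting it with `shlex.join` keeps sample names and paths containing spaces from breaking the generated script.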
  • Custom script information is mainly used for the format connection of input and output of software analysis results, as well as data statistics of various indicators. Custom scripts can be dynamically added according to specific needs.
  • Parameter information is used to adjust the relevant parameters dynamically during the analysis process, without modifying the underlying commands.
  • the parameter information may include: the number of server cores, the differential DMR identification threshold, the minimum number of CpG sites constituting a DMR region, and the minimum average methylation difference.
  • the number of server cores can be defined as 5, the differential DMR identification threshold as 0.05, the minimum number of CpG sites constituting a DMR region as 4, and the minimum average methylation difference as 10.
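  • The example parameter values above can be collected in a small, validated configuration object. This is a sketch under stated assumptions: the class and field names are hypothetical, the identification threshold is treated as a probability-like value between 0 and 1, and the minimum average methylation difference of 10 is assumed to be in percentage points because the source does not give a unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisParams:
    """Adjustable analysis parameters, defaulting to the example values in the text."""
    cores: int = 5                      # number of server cores
    dmr_threshold: float = 0.05         # differential DMR identification threshold
    min_cpg_per_dmr: int = 4            # minimum CpG sites constituting a DMR
    min_mean_meth_diff: float = 10.0    # minimum average methylation difference (unit assumed)

    def __post_init__(self):
        if self.cores < 1:
            raise ValueError("cores must be at least 1")
        if not 0 < self.dmr_threshold <= 1:
            raise ValueError("dmr_threshold must be in (0, 1]")
        if self.min_cpg_per_dmr < 1:
            raise ValueError("min_cpg_per_dmr must be at least 1")
        if self.min_mean_meth_diff < 0:
            raise ValueError("min_mean_meth_diff must be non-negative")


defaults = AnalysisParams()
stricter = AnalysisParams(dmr_threshold=0.01)
print(defaults)
print(stricter)
```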
  • the technical solution disclosed in the present invention is applicable to the data processing and analysis of the vast majority of species for which reference genomes exist.
  • When the present disclosure processes and analyzes methylation data for different species, it is only necessary to provide the multiple sample information and the file dependency information in the background information (including the reference genome, the genome annotation file gff3, the genome gene interval bed file, the methylation CpG island file, the gene transcription start site file, etc.), after which data processing, analysis, and visualization can be performed with one click.
  • When the file dependency information in the background information is the same, the file dependency information of a specific species only needs to be imported once.
  • the data processing and analysis process includes: sequencing data quality control process, sequencing data alignment process, alignment result statistics process, sample methylation analysis process, intergroup methylation analysis process, and differential methylation analysis process.
  • The original sample methylation sequencing data is input into the sequencing data quality control process to obtain the cleaned sample methylation sequencing data.
  • The cleaned sample methylation sequencing data is input into the sequencing data alignment process to obtain the alignment results.
  • The alignment results are statistically analyzed to obtain the methylation level of the sample.
  • sequencing data quality control includes: performing quality detection and data cleaning on sample methylation sequencing data, and performing quality detection again on the cleaned data; wherein, quality detection includes: sequence base quality detection, sequence repetition level detection, and sequence GC content detection, and data cleaning includes quality trimming, 3' end trimming, adapter sequence removal, polymorphic nucleotide removal, and short sequence filtering.
  • sequencing data quality control includes quality detection and data cleaning of sample methylation sequencing data.
  • the content of quality detection includes sequence base quality, sequence repetition level, sequence GC content, total number of reads, total number of bases, Q20, Q30 content, etc.
  • the content of data cleaning includes quality trimming (trimming off bases with lower quality), 3' end trimming (removing low-quality regions at the 3' end of sequencing reads), adapter sequence removal (removing adapter sequences in the data), short sequence filtering (filtering out shorter sequencing sequences), etc.
  • After data cleaning, quality detection is performed again on the cleaned data, with the same content as above. When the results of this second quality detection fail the requirements in multiple dimensions, it is recommended to repeat the experiment or re-sequence the sample, to ensure the reliability and stability of the sequences entering the subsequent data analysis.
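  • Several of the quality detection indicators listed above (total number of reads, total number of bases, GC content, Q20 and Q30 content) can be computed directly from FASTQ records. The sketch below is only an illustration of those definitions, not a substitute for a quality control tool; it assumes four-line FASTQ records with Phred+33 quality encoding, and the function name `fastq_metrics` is hypothetical.

```python
def fastq_metrics(lines, phred_offset=33):
    """Compute read count, base count, GC %, Q20 % and Q30 % from FASTQ lines."""
    reads = bases = gc = q20 = q30 = 0
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue                      # tolerate blank lines between records
        seq = next(it).strip()
        next(it)                          # '+' separator line
        qual = next(it).strip()
        reads += 1
        bases += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        scores = [ord(ch) - phred_offset for ch in qual]
        q20 += sum(1 for s in scores if s >= 20)
        q30 += sum(1 for s in scores if s >= 30)
    if bases == 0:
        raise ValueError("no bases found")
    return {
        "reads": reads,
        "bases": bases,
        "gc_pct": 100.0 * gc / bases,
        "q20_pct": 100.0 * q20 / bases,
        "q30_pct": 100.0 * q30 / bases,
    }


# Two toy records: 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
toy_fastq = ["@r1", "GATC", "+", "IIII", "@r2", "GGCC", "+", "5555"]
metrics = fastq_metrics(toy_fastq)
print(metrics)
```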
  • In sequencing data alignment, the cleaned sample methylation sequencing data is aligned to the human reference genome, and duplicate sequences are removed by detecting fragment start and length information. Sequence duplication is caused by the characteristics of PCR amplification or of the sequencing technology, and removing duplicates improves the accuracy of methylation level identification. In this embodiment, both sequencing data alignment and deduplication are performed using Bismark.
  • The alignment result statistics process includes alignment data statistics, methylation statistics, data deduplication statistics, methylation statistics after deduplication, M-deviation statistics, depth and uniformity statistics, etc.
  • The alignment data statistics include: the total number of sequence pairs analyzed, the number of paired-end alignments with a unique best alignment, the number of sequence pairs that did not align under any condition, and the number of sequences for which genomic sequence information could not be extracted;
  • the methylation statistics include: the total number of cytosine Cs analyzed, the number of methylated Cs in CpG, the number of methylated Cs in CHG, the number of methylated Cs in CHH, the number of methylated Cs that could not be determined, the number of unmethylated Cs in CpG, the number of unmethylated Cs in CHG, the number of unmethylated Cs in CHH, the number of unmethylated Cs that could not be determined, the percentage of methylated Cs in CpG, the percentage of methylated Cs in CHG, the percentage of methylated Cs in CHH, and the percentage of methylated Cs that could not be determined;
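  • The percentages in the methylation statistics follow from the counts: for each sequence context, the percentage of methylated Cs is the methylated count divided by the sum of methylated and unmethylated counts. A minimal sketch, with hypothetical counts chosen only for the example:

```python
def percent_methylated(methylated, unmethylated):
    """Percentage of methylated Cs in one context, or None if no Cs were observed."""
    total = methylated + unmethylated
    if total == 0:
        return None
    return 100.0 * methylated / total


# Hypothetical (methylated, unmethylated) counts per context.
counts = {"CpG": (800, 200), "CHG": (5, 995), "CHH": (10, 1990)}

for context, (meth, unmeth) in counts.items():
    print(context, percent_methylated(meth, unmeth))
```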
  • the data deduplication statistics include: the number of sequences considered for alignment during the analysis, the number of duplicate aligned sequences removed, the number of sequences with duplicate positions (coordinates) in the alignment results, the number of unique aligned sequences in the alignment results, the proportion of sequences remaining after deduplication, and the proportion of sequences deleted;
  • The M-deviation (M-bias) statistic is used to display, by position within the read, the deviation of the methylation levels of R1 and R2 in the sequencing results;
  • Depth and uniformity statistics include: 0.2x uniformity, the total number of sequenced bases that fall within the target region (the target region of the probe design), the percentage of bases that fall within the target region, the average number of times each base is read during sequencing, the total number of reads (sequencing fragments), the multiple by which sequencing coverage must be raised for 80% of bases to reach the expected coverage, and the percentage of bases in the target region with at least 1x coverage.
  • The alignment result statistics include alignment data statistics, methylation statistics, data deduplication statistics, methylation statistics after deduplication, M-deviation statistics, depth and uniformity statistics, etc., wherein the alignment data statistics, methylation statistics, data deduplication statistics, and methylation statistics after deduplication are performed serially, and the M-deviation statistics and the depth and uniformity statistics can be performed serially or in parallel.
  • the data deduplication statistics part can be personalized according to whether the sequencing data contains UMI, thereby further improving the accuracy of methylation level detection.
  • the platform can provide an option for the user to select whether the sequencing data contains UMI.
  • When the user selects that the sequencing data does not contain UMI, the data deduplication statistics process performs deduplication by detecting the fragment start and length information; when the user selects that the sequencing data contains UMI, the data deduplication statistics process performs deduplication by detecting the fragment start, the length information, and the UMI sequence.
  • With UMI, the conditions for calling two fragments duplicates are more stringent, which avoids the data loss and reduced analysis accuracy caused by excessive deduplication.
  • M-deviation statistics, depth and uniformity statistics further improve the multi-dimensional control of the comparison result file, which can assist the process to automatically determine whether the sample needs to be included in the subsequent differential methylation analysis process, and provide more stable analysis results.
  • Whether the M-deviation, depth, and uniformity reach preset thresholds determines whether the subsequent differential methylation analysis process is carried out for a sample. For example, it can be required that the 0.2x uniformity be greater than 95%, that the average number of times each base is read in sequencing be greater than 20x, that the multiple by which sequencing coverage must be raised for 80% of bases to reach the expected coverage be less than 2.0, and that the percentage of bases in the target region with at least 1x coverage be greater than 95%. Only when these conditions are met is the subsequent differential methylation analysis process carried out.
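  • The example thresholds above can be expressed as a simple gate that decides whether a sample proceeds to differential methylation analysis. This is a sketch only: the class and field names are hypothetical, the `coverage_multiple_80` field reflects one reading of the source's 80% coverage metric, and no M-deviation criterion is included because the text gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass
class CoverageMetrics:
    uniformity_0_2x_pct: float     # 0.2x uniformity, in percent
    mean_depth: float              # average number of times each base is read
    coverage_multiple_80: float    # multiple tied to 80% of expected coverage (assumed reading)
    pct_target_bases_1x: float     # percent of target bases with at least 1x coverage


def passes_gate(m: CoverageMetrics) -> bool:
    """Apply the example thresholds from the text; all four must hold."""
    return (
        m.uniformity_0_2x_pct > 95.0
        and m.mean_depth > 20.0
        and m.coverage_multiple_80 < 2.0
        and m.pct_target_bases_1x > 95.0
    )


good = CoverageMetrics(97.0, 35.0, 1.6, 99.0)   # hypothetical values
shallow = CoverageMetrics(97.0, 18.0, 1.6, 99.0)
print(passes_gate(good), passes_gate(shallow))
```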
  • sample methylation analysis, inter-group methylation analysis, and differential methylation analysis are performed in parallel to speed up methylation data processing and analysis.
  • the sample methylation analysis includes at least one of the following: genomic chromosome methylation level statistics, genomic functional region methylation level statistics, and gene body methylation level statistics, wherein the genomic chromosome methylation level statistics use preset windows to divide the genome and count the genomic GC content, gene content, and methylation level under each window; the genomic functional region methylation level statistics are performed for different genomic elements according to CpGi, enhancers, exons, introns, promoters, 3'UTRs, and 5'UTRs; the gene body methylation level statistics are performed for the preset regions upstream of the transcription start sites of all genes, exons, and introns, and each region of each gene is divided into equal parts according to length, and the CpG methylation level of each equally divided region is calculated.
  • the sample methylation analysis process includes genome chromosome methylation level statistics, genome functional region methylation level statistics, and gene body methylation level statistics.
  • Genome chromosome methylation level statistics are represented by a Circos plot: the genome is divided into 500 kb windows, and the genome GC content, gene content, and methylation level (CpG) are counted in each window.
  • From outside to inside, the Circos plot shows the chromosome number and length, a GC content histogram, a gene content histogram, and a methylation level heat map, wherein the heat map color runs from light to dark as the methylation level goes from low to high.
  • Genome functional region methylation level statistics are represented by violin diagram, and for different genome elements: CpG methylation level statistics are performed according to CpGi, enhancer, exon, intron, promoter, 3'UTR, and 5'UTR respectively. Gene body methylation levels are presented using a scatter plot. CpG methylation levels are calculated for all genes within 2 kbp upstream of the transcription start site, in exons, and introns. Each region of each gene is divided into 20 bins by length, and the CpG methylation level is calculated for each bin of each region of each gene.
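  • The binning step above (each region divided into 20 bins by length, with a CpG methylation level per bin) can be sketched as follows. The function name and input layout are hypothetical, and the per-bin level is assumed here to be pooled methylated reads over total reads; the disclosure does not state whether levels are pooled or averaged per site, and strand orientation is ignored.

```python
def bin_methylation(region_start, region_end, cpgs, n_bins=20):
    """Return the CpG methylation level of each equal-length bin of a region.

    cpgs is an iterable of (position, methylated_reads, total_reads).
    The region is half-open: region_start <= position < region_end.
    Bins without covered CpG sites are reported as None.
    """
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")
    meth = [0] * n_bins
    total = [0] * n_bins
    for pos, m, t in cpgs:
        if region_start <= pos < region_end:
            idx = (pos - region_start) * n_bins // length
            meth[idx] += m
            total[idx] += t
    return [m / t if t else None for m, t in zip(meth, total)]


# Toy example: a 2 kb upstream region with three covered CpG sites.
levels = bin_methylation(0, 2000, [(10, 8, 10), (50, 2, 10), (1990, 5, 5)])
print(levels[0], levels[19])
```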
  • The inter-group methylation analysis includes at least one of the following: inter-group genomic functional region methylation level comparison, genomic functional region methylation level correlation analysis, and genomic functional region methylation principal component cluster analysis, wherein the inter-group genomic functional region methylation level comparison performs CpG methylation level statistics on all enhancers, preset regions upstream of gene transcription start sites, promoters, 5'UTRs, exons, introns, 3'UTRs, and CpGi regions of different groups, divides each region of each gene into equal parts according to length, and calculates the CpG methylation level of each equally divided region; the genomic functional region methylation level correlation analysis draws a correlation heat map for the functional elements of different samples; and the genomic functional region methylation principal component cluster analysis performs principal component cluster analysis on the functional elements of different samples.
  • the inter-group methylation analysis process includes inter-group genomic functional region methylation level comparison, genomic functional region methylation level correlation analysis, and genomic functional region methylation principal component cluster analysis.
  • the inter-group genomic functional region methylation level comparison is represented by a scatter plot, and CpG methylation level statistics are performed on all enhancers (Enhancer), 2Kbp (up2k) upstream of the gene transcription start site, promoters (Promoter), 5'UTR, exons (Exon), introns (Intron), 3'UTR, and CpGi regions of different groups, and each region of each gene is divided into 20 bins according to length, and the CpG methylation level of each bin in each region of each gene is calculated.
  • the genomic functional region methylation level correlation analysis is represented by a cluster heat map, and a correlation heat map is drawn for the functional elements of different samples, wherein the darker the block color, the higher the correlation.
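  • The matrix behind such a correlation heat map can be sketched as pairwise correlations between the samples' methylation profiles over the same functional elements. The text does not name the correlation measure, so the use of the Pearson coefficient here is an assumption, as are the function names and the toy profiles.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(profiles):
    """Map each pair of sample names to the correlation of their profiles."""
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}


# Hypothetical mean methylation levels of three functional elements per sample.
profiles = {
    "CR-01": [0.1, 0.5, 0.9],
    "CR-02": [0.2, 0.6, 1.0],
    "PD-01": [0.9, 0.5, 0.1],
}
matrix = correlation_matrix(profiles)
print(round(matrix["CR-01"]["CR-02"], 3), round(matrix["CR-01"]["PD-01"], 3))
```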
  • Principal component cluster analysis of methylation in functional genomic regions is represented by a two-dimensional PCA scatter plot.
  • the first principal component (PC1) is the direction with the largest variance
  • the second principal component (PC2) is perpendicular to PC1 and has the second-largest variance. Points that lie closer together in the PC1-PC2 plane represent samples with higher correlation.
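  • The two-dimensional PCA projection described above can be sketched as follows. This is a minimal illustration, assuming NumPy is available and that each sample is summarized by a vector of mean methylation levels over functional elements; the function name `pca_2d` and the example values are hypothetical and not taken from the disclosure.

```python
import numpy as np

def pca_2d(levels):
    """Project samples (rows) onto the first two principal components.

    `levels` is a samples x features array of methylation levels.
    Returns (coords, ratio): the PC1/PC2 coordinates of every sample and
    the fraction of total variance carried by PC1 and PC2.
    """
    x = np.asarray(levels, dtype=float)
    centered = x - x.mean(axis=0)  # centre each feature at zero
    # SVD of the centred matrix: the rows of vt are the principal
    # directions, ordered by decreasing variance and mutually orthogonal.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T   # sample coordinates on PC1 and PC2
    variance = s ** 2
    ratio = variance[:2] / variance.sum()
    return coords, ratio

# Hypothetical methylation levels of four samples over three elements.
levels = [
    [0.82, 0.10, 0.55],
    [0.80, 0.12, 0.57],
    [0.40, 0.35, 0.20],
    [0.42, 0.33, 0.22],
]
coords, ratio = pca_2d(levels)
```

  • In this toy input the first two samples resemble each other, as do the last two, so the two pairs end up on opposite sides of PC1 and each pair stays close together in the plot.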
  • the differential methylation analysis includes at least one of the following: identification of differentially methylated regions (DMRs), significant distribution of identification of hypermethylated and hypomethylated regions, distribution of DMR lengths and levels, clustering of DMR regional methylation levels, DMR annotation, distribution of DMR coverage regions, and enrichment analysis of DMR-associated genes.
  • the differential methylation analysis process of the embodiment of the present disclosure mainly performs a comprehensive analysis of DMRs for the configured differential comparison schemes. After the analysis process is completed, table files of the DMRs and of their enrichment analysis can be provided for download. The differentially methylated region (DMR) identification step reports, for each cohort comparison scheme, the number of DMRs, the average DMR length, the average number of CpG sites covered by a DMR, the number of DMRs that are hypermethylated in the latter cohort relative to the former cohort of the comparison scheme ID, and the number of DMRs that are hypomethylated in the latter relative to the former. For each DMR under each comparison scheme it also reports the chromosome number, DMR start site, DMR end site, P value, differential methylation level, number of CpG sites included, average methylation rate of the latter cohort, and average methylation rate of the former cohort.
  • the genome was divided into 500kb windows, and the Hyper DMR significance, genomic GC content, gene content, and Hypo DMR significance within each window were statistically analyzed to create a circos diagram.
  • the circles in the circos diagram represent, from the outside to the inside, chromosome number and length, Hyper DMR significance (the more outward, the stronger the significance), genomic GC content heat map, gene content heat map, and Hypo DMR significance (the more inward, the stronger the significance).
  • Heat map colors range from light to dark, indicating correlation values from low to high.
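  • The 500 kb windowing behind the circos diagram can be sketched as below. The disclosure does not say how significance is aggregated within a window; this sketch assumes each DMR is assigned to the window containing its midpoint and that the window score is the largest -log10(P) observed, separately for Hyper and Hypo DMRs. The function name and tuple layout are hypothetical.

```python
import math
from collections import defaultdict

WINDOW = 500_000  # 500 kb window size, as stated in the text

def window_significance(dmrs, window=WINDOW):
    """Return {(chrom, window_index): {"hyper": score, "hypo": score}}.

    Each DMR is (chrom, start, end, p_value, meth_diff) with p_value > 0.
    Assumption: a positive meth_diff marks a Hyper DMR, a negative one a
    Hypo DMR, and the score is the largest -log10(p) in the window.
    """
    scores = defaultdict(lambda: {"hyper": 0.0, "hypo": 0.0})
    for chrom, start, end, p_value, meth_diff in dmrs:
        index = ((start + end) // 2) // window  # window holding the midpoint
        kind = "hyper" if meth_diff > 0 else "hypo"
        bucket = scores[(chrom, index)]
        bucket[kind] = max(bucket[kind], -math.log10(p_value))
    return dict(scores)

# Hypothetical DMRs on chr1.
dmrs = [
    ("chr1", 100_000, 100_400, 1e-4, 0.25),
    ("chr1", 450_000, 450_300, 1e-2, -0.30),
    ("chr1", 620_000, 620_900, 1e-6, 0.40),
]
result = window_significance(dmrs)
```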
  • the DMR length distribution was displayed using a bar chart, with the horizontal axis representing DMR length and the vertical axis representing the number of DMRs at a specific length.
  • the methylation levels of DMR regions between different groups were displayed using violin plots, which were divided into hypermethylated DMRs and hypomethylated DMRs.
  • Clustering of DMR region methylation levels was performed by creating a cluster heat map based on the average methylation level between groups, demonstrating differential methylation between groups.
  • DMR annotation software was used to obtain information such as the location of the gene closest to the DMR, the positive and negative strands on which it is located, the gene length, gene ID, the distance of the DMR from the transcription start site (TSS), and the gene name.
  • a circular plot was used to calculate the percentage of DMRs covering different genomic elements.
  • DMR-associated gene enrichment analysis included GO and KEGG enrichment analysis, and separate GO and KEGG analyses were performed for promoter regions with more significant biological significance for methylation. Results included enriched pathway identifiers, pathway descriptions, the proportion of genes with pathway function in the entire background gene set, P-values, and the number of DMR-associated genes enriched in that pathway.
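  • The text does not state which statistical test produces the enrichment P-values. A one-sided hypergeometric test is a common choice for GO and KEGG enrichment and is assumed in the sketch below; the function name and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(background, in_pathway, selected, overlap):
    """Probability of seeing at least `overlap` pathway genes by chance.

    background: number of genes in the background gene set
    in_pathway: background genes annotated to the pathway
    selected:   number of DMR-associated genes
    overlap:    DMR-associated genes that belong to the pathway
    """
    total = comb(background, selected)
    upper = min(in_pathway, selected)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(in_pathway, k) * comb(background - in_pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

  • For example, with 10 background genes, 5 of them in the pathway, and 5 DMR-associated genes that all fall in the pathway, the P-value is 1/252.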
  • the disclosed embodiments use a differential methylation analysis process to identify differentially methylated regions between each group of different cohorts.
  • the identification method is cohort-to-cohort and many-to-many. There is no need to repeatedly set up differential comparison schemes. Differential analysis can be specified between different cohorts with less obvious clinical significance, and customized mining of potential biomarkers can be carried out.
  • visualization includes a graph generation process and a table generation process to draw graphs and generate tables for the data generated in the data processing and data analysis process, wherein the graph drawing includes but is not limited to box plots, line graphs, curve graphs, pie charts, ring graphs, bar graphs, circos graphs, violin graphs, cluster heat maps, scatter plots, etc.
  • the table is generated based on the provided sample cohort information and cohort comparison information, and is output in sequence with sample name or cohort name, which facilitates comparison and understanding of samples and cohorts.
  • outputting an interactive report based on the generated graph and/or table includes:
  • the interactive report of the embodiment of the present disclosure includes a multi-level browsing directory index.
  • when the user clicks on a multi-level title, the report jumps directly to the corresponding content; the text content can be dynamically updated based on the analysis results, and can also be updated in real time based on user input; the user can perform interactive operations such as dragging, zooming, selecting, searching, sorting, and downloading graphics and tables.
  • the method further comprises:
  • At least one of the following operations is performed on the graph and/or table: dragging, zooming, searching, sorting, selecting, unselecting, and downloading.
  • the method further comprises:
  • the classification model is constructed and multiple sample information is split into training sets and test sets;
  • the sample data of two different cohorts can be split into training sets and test sets using a hierarchical splitting method to ensure that the ratio of label columns in the data sets before and after the split is consistent.
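  • A stratified split of this kind can be sketched with the standard library alone. The function name, the 25% test fraction, and the fixed seed are illustrative assumptions; the disclosure only requires that the label ratio be the same before and after the split.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices so each label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])   # the same fraction from every label
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Two hypothetical cohorts of 20 samples each.
labels = ["CR"] * 20 + ["PR"] * 20
train, test = stratified_split(labels)
```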
  • the training set is used to traverse most classification algorithms to automatically fit the model, and the model with the highest accuracy on the test set after modeling is selected and saved.
  • the classification algorithm can include logistic regression (Logistic Regression), multi-layer perceptron (MLP Classifier), K nearest neighbors (K neighbors Classifier), support vector machine (SVC), Gaussian process (Gaussian Process Classifier), decision tree (Decision Tree Classifier), Gaussian naive Bayes (Gaussian NB), random forest (Random Forest Classifier), discriminant analysis (Discriminant Analysis), etc.
  • it is recommended that each cohort contain at least 20 samples.
  • when inputting cohort comparison information, "_ML" can be appended to a comparison scheme ID of the form "XX_vs_XX" for the two cohorts between which a classification model is to be built, giving "XX_vs_XX_ML". For example, "CR_vs_PR_ML" and "SD_vs_PD_ML" cause classification models to be built between CR and PR and between SD and PD, respectively.
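  • The naming convention above lends itself to a small parser. The following sketch is illustrative only; the function name `parse_scheme` is hypothetical, and a cohort literally named "ML" would be ambiguous under this convention.

```python
def parse_scheme(scheme_id):
    """Split 'A_vs_B' or 'A_vs_B_ML' into (former, latter, build_model)."""
    build_model = scheme_id.endswith("_ML")
    core = scheme_id[:-3] if build_model else scheme_id  # drop the "_ML" suffix
    former, sep, latter = core.partition("_vs_")
    if not sep or not former or not latter:
        raise ValueError(f"not a comparison scheme ID: {scheme_id!r}")
    return former, latter, build_model
```

  • With this helper, "CR_vs_PR_ML" yields the cohorts CR and PR with model building switched on, while "SD_vs_PD" yields SD and PD with model building switched off.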
  • the embodiment of the present disclosure further provides a methylation data processing and analysis platform, including a task generation module 301, a data processing and analysis module 302, a visualization module 303 and a report generation module 304, wherein:
  • the task generation module 301 is configured to receive multiple sample information and generate a task script based on the received sample information, wherein the multiple sample information includes: sample cohort information, cohort comparison information, and sample methylation sequencing data, wherein the sample cohort information includes the sample and the cohort to which the sample belongs, and the cohort comparison information includes one or more groups of cohorts to be compared, where each group includes two cohorts;
  • the data processing and analysis module 302 is configured to perform data processing and data analysis according to the generated task script, wherein the data processing includes sequencing data quality control, sequencing data alignment, and alignment result statistics, and the data analysis includes sample methylation analysis, inter-group methylation analysis, and differential methylation analysis.
  • the sample methylation analysis is used to analyze the methylation level of each sample
  • the inter-group methylation analysis is used to compare the methylation levels between samples of different cohorts
  • the differential methylation analysis is used to identify differentially methylated regions between different cohorts and perform multi-dimensional analysis on them.
  • a visualization module 303 is configured to generate graphs and/or tables based on the results of data processing and data analysis
  • the report generation module 304 is configured to output an interactive report based on the generated graphs and/or tables.
  • the methylation data processing and analysis platform of the embodiment of the present disclosure provides a comprehensive platform for integrated methylation data processing, analysis and visualization by highly integrating modules such as task generation, data processing and analysis, visualization, and report generation. It only requires inputting multiple sample information to output a complete interactive report on the interpretation of the methylation sequencing data analysis results in one stop. It has a high degree of automation, is convenient for non-professionals to operate, and can produce in-depth analysis reports on sample methylation sequencing data in a short time, and supports multi-group difference analysis based on the provided group information.
  • the present disclosure provides sample methylation analysis, inter-group methylation analysis, differential methylation analysis and other content, which is more comprehensive in function.
  • the interactive report output by the embodiment of the present disclosure can accommodate a large number of samples, includes interactive graphics and tables, is easy for readers to view and understand, and can be extended or trimmed by inserting and removing content.
  • sample cohort information is used to determine the cohort to which a sample belongs, and cohort comparison information is used to provide any one or more groups of cohorts for which a difference comparison is required.
  • Each comparison group includes two cohorts, and each cohort includes multiple samples.
  • sample methylation sequencing data which is used for downstream data processing and data analysis.
  • the task generation module 301 is further configured to: receive background information, wherein the background information includes at least one of the following: file dependency information, software dependency information, custom script information and parameter information, the file dependency information includes information about dependent files required for data processing and data analysis, the software dependency information includes information about software required for data processing and data analysis, the custom script information includes intermediate data connection scripts and data statistics scripts required for data processing and data analysis, and the parameter information includes parameters required for data processing and data analysis.
  • the background information includes file dependency information, software dependency information, custom script information and parameter information.
  • the file dependency information includes but is not limited to the reference genome fasta file, the genome chromosome length file, the genome annotation file gff3, the genome gene interval bed file, the methylation CpG island file, the gene transcription start site file, and the result output path;
  • the software dependency information mainly includes but is not limited to sequencing data quality control software, methylation sequencing data alignment software, differential methylation region identification software, and the Python and R programming environments;
  • the custom script information includes but is not limited to intermediate data connection scripts and data statistics scripts;
  • the parameter information includes various parameters in the customizable data analysis process, including but not limited to the number of process running cores, the difference identification threshold, etc.
  • the data analysis module includes the following processes: sequencing data quality control process, sequencing data alignment process, alignment result statistics process, sample methylation analysis process, inter-group methylation analysis process, and differential methylation analysis process.
  • sequencing data quality control, sequencing data alignment, and alignment result statistics are performed sequentially, each step building on the previous one, whereas sample methylation analysis, inter-group methylation analysis, and differential methylation analysis can be performed in series or in parallel.
  • the sequencing data quality control process, sequencing data alignment process, and alignment result statistics process of a single sample are connected in series, and these processes are run in parallel across multiple samples. Subsequently, tasks such as sample methylation analysis, inter-group methylation analysis, and differential methylation analysis are performed in parallel, which can speed up the processing and analysis of methylation data.
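  • The scheduling described above (stages in series within a sample, samples in parallel) can be sketched with the standard library. The three stage functions below are hypothetical placeholders that only tag a string; a real task script would invoke the quality control, alignment, and statistics tools instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stages (hypothetical): each one stands in for an external tool.
def quality_control(sample):
    return f"{sample}.clean"

def align(cleaned):
    return f"{cleaned}.aligned"

def alignment_statistics(aligned):
    return f"{aligned}.stats"

def process_sample(sample):
    """Run the three data-processing stages of one sample in series."""
    return alignment_statistics(align(quality_control(sample)))

def process_all(samples, workers=5):
    """Process several samples in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_sample, samples))

results = process_all(["CR-01", "CR-02", "PR-01"])
```

  • A thread pool is sufficient when each stage mostly waits on an external program; the worker count plays the role of the configurable number of cores.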
  • the raw sample methylation sequencing data is input into the sequencing data quality control process to obtain cleaned sample methylation sequencing data.
  • the cleaned sample methylation sequencing data is then input into the sequencing data alignment process to obtain alignment results.
  • the alignment results are statistically analyzed to determine the methylation level of the sample.
  • Sequencing data quality control is performed on both the raw sample methylation sequencing data and the filtered sample methylation sequencing data.
  • Quality control includes quality testing of the sample methylation sequencing data, including but not limited to sequence base quality, sequence duplication level, and sequence GC content. Both the raw sample methylation sequencing data and the filtered sample methylation sequencing data are subject to these tests.
  • Filtering of the raw sample methylation sequencing data includes quality trimming, 3'-end trimming, adapter sequence removal, removal of poly-N (undetermined-base) stretches, and short sequence filtering.
  • the alignment result statistics process includes alignment data statistics, methylation statistics, data deduplication statistics, methylation statistics after deduplication, M-bias statistics, depth and uniformity statistics, etc.
  • the sample methylation analysis process includes statistics on methylation levels of genomic chromosomes, methylation levels of functional genomic regions, and methylation levels of gene bodies.
  • the intergroup methylation analysis process includes intergroup comparison of methylation levels in functional genomic regions, correlation analysis of methylation levels in functional genomic regions, and principal component cluster analysis of methylation in functional genomic regions.
  • the differential methylation analysis process includes, but is not limited to, identification of differentially methylated regions (DMRs), significant distribution of hypermethylated and hypomethylated regions, DMR length and level distribution, clustering of methylation levels in DMR regions, DMR annotation, distribution of DMR coverage regions, and enrichment analysis of DMR-associated genes.
  • visualization module 303 generates graphs and tables for the data generated during data analysis. These graphs include, but are not limited to, boxplots, line graphs, curve graphs, pie charts, donut charts, bar charts, circos charts, violin plots, cluster heat maps, and scatter plots. Tables are output sequentially numbered by sample or cohort name based on the provided sample cohort information and cohort comparison information, facilitating comparison and understanding of samples and cohorts.
  • the report generation module 304 includes a directory module, a text content module, a graphic interaction module, and a table interaction module.
  • the directory module allows users to quickly jump to the corresponding content by clicking on multi-level titles; the text content module can be dynamically updated based on analysis results; the graphic interaction module allows for interactive operations such as dragging, zooming, and selecting; and the table interaction module allows for basic functions such as searching and sorting directly on the interactive report, and supports one-click downloading for tables with large rows and columns.
  • the methylation data processing and analysis platform of the disclosed embodiment includes: a task generation module, a data processing and analysis module, a visualization module, and a report generation module, wherein: the task generation module receives multiple sample information and generates a task script based on the received sample information; the data processing and analysis module performs data processing and analysis based on the generated task script; the visualization module generates graphs and/or tables based on the results of the data processing and analysis; and the report generation module outputs an interactive report based on the generated graphs and/or tables.
  • the plurality of sample information includes sample cohort information, cohort comparison information, and sample methylation sequencing data.
  • Sample cohort information: according to the change in the size of the target lesions after treatment, patients can be divided into four types: CR (complete remission), PR (partial remission), SD (stable), and PD (progression).
  • the number of patients of each type is 2, and the patients are numbered "Type-01" and "Type-02". For example, patient No. 1 of the CR type is "CR-01". The same scheme applies when there are more patients: with more than 100 or more than 1000 patients, the numbering starts at 001 or 0001, respectively.
  • the type of a patient is the cohort information of that patient. For example, the cohort number of both the "CR-01" and "CR-02" patients is "CR".
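  • The numbering convention can be sketched as two small helpers. The function names are hypothetical; the padding rule (at least two digits, widened to fit the largest number) is an interpretation of the 01 / 001 / 0001 examples in the text.

```python
def sample_ids(cohort, count):
    """Number samples as 'Cohort-01', widening the padding for large cohorts."""
    width = max(2, len(str(count)))  # 2 digits by default, more if needed
    return [f"{cohort}-{i:0{width}d}" for i in range(1, count + 1)]

def cohort_of(sample_id):
    """The cohort number is the part of the sample ID before the last hyphen."""
    return sample_id.rpartition("-")[0]
```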
  • the sample methylation sequencing data path of each sample is attached.
  • Cohort comparison information: the user manually sets the cohort comparison schemes of interest, for example two comparison scheme IDs, "CR_vs_PR" and "SD_vs_PD".
  • the comparison scheme ID takes the form "Cohort Number 1_vs_Cohort Number 2".
  • Comparison schemes are typically specified based on clinical phenotypes, traits, and other differences. However, when biological information is ambiguous, you can also set a comparison scheme to determine whether there are differences between any two cohorts. This can be used to identify differences between two cohorts with unclear clinical characteristics, enabling in-depth biomarker discovery.
  • when processing and analyzing methylation data from different species, the task generation module 301 receives the multiple sample information and also receives the corresponding background information.
  • the background information includes at least one of the following: file dependency information, software dependency information, custom script information, and parameter information.
  • Reference genome files can be downloaded from websites such as UCSC, NCBI, and GENCODE. samtools is used to index the fasta file, and the reference genome chromosome length file can be generated from the resulting fasta.fai file.
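  • Deriving the chromosome length file from a fasta.fai index amounts to keeping the first two columns of each line. The sketch below works on lines already read from the index; the function name is hypothetical, and the offsets in the example lines are illustrative values rather than output copied from a real index.

```python
def chromosome_lengths(fai_lines):
    """Extract (name, length) pairs from the lines of a .fai index.

    Each .fai line is tab-separated: sequence name, sequence length,
    byte offset, bases per line, bytes per line.
    """
    lengths = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        lengths.append((name, int(length)))
    return lengths

# Illustrative index lines.
fai = [
    "chr1\t248956422\t112\t70\t71\n",
    "chr2\t242193529\t252513167\t70\t71\n",
]
lengths = chromosome_lengths(fai)
```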
  • the human genome annotation file gff3, human genome gene interval bed file, human methylation CpG island file, and human gene transcription start site file can also be downloaded from one or more of the above websites.
  • fastqc is used for quality detection
  • trim_galore is used for data cleaning
  • bismark, picard, and samtools are used for alignment and for the statistics of multiple indicators
  • metilene software is used for difference identification.
  • Custom script information is used for the format connection of input and output of software analysis results, as well as content statistics of various indicators, graph drawing, table generation, etc. Custom scripts can be dynamically added according to specific needs.
  • the parameter information is used to dynamically adjust relevant parameters during the analysis process without requiring modification from the original command side.
  • the number of CPU cores is defined as 5
  • the differential DMR identification threshold is set to 0.05
  • the minimum number of CpG sites constituting a DMR region is set to 4
  • the minimum average methylation difference is set to 10.
  • the disclosed embodiments are applicable to data analysis of most species for which reference genomes exist. For different species, it is only necessary to modify the multiple sample information and corresponding background information received by the task generation module 301. Once the above content has been modified, one-click data processing, analysis, and visualization can be performed. When analyzing the same species, since the file dependency information in the background information used is the same, the file dependency information for a specific species only needs to be introduced once.
  • data processing includes sequencing data quality control, sequencing data alignment, and alignment result statistics; data analysis includes sample methylation analysis, intergroup methylation analysis, and differential methylation analysis. Sequencing data quality control, sequencing data alignment, and alignment result statistics are logically linked in a sequential and progressive manner, and sample methylation analysis, intergroup methylation analysis, and differential methylation analysis can be performed in tandem or in parallel.
  • Sequencing data quality control includes quality testing and data cleaning of sample methylation sequencing data.
  • Data quality testing covers sequence base quality, sequence duplication level, sequence GC content, total number of reads, total base number, Q20 and Q30 content, etc.
  • Data cleaning includes trimming low-quality bases, removing low-quality regions at the 3' end of sequencing reads, removing adapter sequences from the data, and filtering out short sequencing reads. After data cleaning, the cleaned data is re-tested with the same quality testing content as above to ensure the reliability and stability of the sequences used in subsequent data analysis.
  • Sequencing data alignment utilizes cleaned sample methylation sequencing data to align to the human reference genome, and sequence duplication is removed by detecting fragment start and length information. Sequence duplication is caused by the characteristics of PCR amplification or sequencing technology. Removing sequence duplication improves the accuracy of methylation level identification. Sequencing data alignment and deduplication in this example were both performed by Bismark.
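  • The idea of removing duplicates by fragment start and length can be sketched as below. This is not Bismark's implementation; it is a simplified illustration that additionally assumes the chromosome and strand are part of the duplicate key and that the first fragment seen is the one kept. The dictionary keys are hypothetical.

```python
def deduplicate(fragments):
    """Keep the first fragment for each (chrom, strand, start, length) key."""
    seen = set()
    kept = []
    for frag in fragments:
        key = (frag["chrom"], frag["strand"], frag["start"], frag["length"])
        if key not in seen:  # a repeated key is treated as a duplicate
            seen.add(key)
            kept.append(frag)
    return kept

# Hypothetical aligned fragments; the second one duplicates the first.
fragments = [
    {"chrom": "chr1", "strand": "+", "start": 1000, "length": 150},
    {"chrom": "chr1", "strand": "+", "start": 1000, "length": 150},
    {"chrom": "chr1", "strand": "+", "start": 1000, "length": 148},
]
kept = deduplicate(fragments)
retained = len(kept) / len(fragments)  # proportion remaining after deduplication
```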
  • the alignment result statistics include alignment data statistics, methylation statistics, data deduplication statistics, methylation statistics after deduplication, M-bias statistics, and depth and uniformity statistics. Of these, the alignment data statistics, methylation statistics, data deduplication statistics, and methylation statistics after deduplication are run in series, and the M-bias statistics and the depth and uniformity statistics can also be run in series.
  • Alignment statistics include: the total number of sequence pairs analyzed, the number of paired-end alignments with unique optimal alignments, the number of paired sequences that were not aligned under any conditions, and the number of sequences for which genomic sequence information could not be extracted.
  • Methylation statistics include: the total number of cytosines (C) analyzed, the number of methylated Cs in CpGs, the number of methylated Cs in CHGs, the number of methylated Cs in CHHs, the number of undetermined methylated Cs, the number of unmethylated Cs in CpGs, the number of unmethylated Cs in CHGs, the number of unmethylated Cs in CHHs, the number of undetermined unmethylated Cs, the percentage of methylated Cs in CpGs, the percentage of methylated Cs in CHGs, the percentage of methylated Cs in CHHs, and the percentage of undetermined methylated Cs.
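  • The percentage figures in this list follow from the counts: for each context, the percentage of methylated Cs is the methylated count divided by the sum of the methylated and unmethylated counts. A minimal sketch with hypothetical counts:

```python
def methylation_percentage(methylated, unmethylated):
    """Percentage of methylated cytosines in one context (CpG, CHG or CHH)."""
    total = methylated + unmethylated
    if total == 0:
        return None  # no cytosines observed in this context
    return 100.0 * methylated / total

# Hypothetical (methylated, unmethylated) counts per context.
counts = {
    "CpG": (750, 250),
    "CHG": (5, 995),
    "CHH": (0, 0),
}
percentages = {ctx: methylation_percentage(m, u) for ctx, (m, u) in counts.items()}
```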
  • Data deduplication statistics include: the number of aligned sequences considered during the analysis, the number of duplicated sequences removed, the number of sequences with duplicate positions (coordinates) in the alignment results, the number of uniquely aligned sequences in the alignment results, the proportion of sequences remaining after deduplication, and the proportion of sequences deleted.
  • the methylation statistics (after deduplication) are the same as those in the methylation statistics section.
  • the M-bias statistic displays the deviation in methylation levels by sequence position between R1 and R2 in the sequencing results.
  • Depth and uniformity statistics include: 0.2x uniformity, the total number of sequencing bases falling within the target region (the probe design target), the percentage of bases falling within the target region, the average number of reads per base during sequencing, the number of total reads (sequencing fragments), the number of bases with sequencing data coverage less than 80% of the expected coverage, and the percentage of bases in the target region with at least 1x coverage.
  • Sample methylation analysis includes statistics on chromosomal methylation, functional genomic regions, and gene body methylation. Chromosome methylation statistics are presented using circos plots, which divide the genome into 500kb windows and calculate the GC content, gene content, and CpG methylation level within each window. The circles in the circos plot, from the outermost to the innermost circles, represent chromosome number and length, GC content histogram, gene content histogram, and methylation heatmap. The methylation heatmap colors range from light to dark, indicating low to high methylation levels.
  • CpG methylation statistics are presented using violin plots, with CpG methylation statistics calculated for different genomic elements: CpGi, enhancer, exon, intron, promoter, 3'UTR, and 5'UTR.
  • Gene body methylation levels are presented using a scatter plot.
  • CpG methylation levels are calculated for all genes within the 2K bp upstream of the transcription start site, in exons, and introns. Each region of each gene is divided into 20 bins by length, and the CpG methylation level is calculated for each bin of each region of each gene.
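  • Dividing a region into 20 equal-length bins and summarizing CpG methylation per bin can be sketched as follows. The sketch assumes the bin level is the mean of the per-site methylation levels, which the text does not specify; the function name and input layout are hypothetical.

```python
def bin_methylation(region_start, region_end, cpgs, n_bins=20):
    """Mean CpG methylation level in each of n_bins equal-length bins.

    cpgs is a list of (position, methylation_level) pairs; positions
    outside [region_start, region_end) are ignored. Empty bins give None.
    """
    length = region_end - region_start
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for position, level in cpgs:
        if not region_start <= position < region_end:
            continue
        # Integer arithmetic keeps the index within 0 .. n_bins - 1.
        index = (position - region_start) * n_bins // length
        sums[index] += level
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Hypothetical 2 kb upstream region with three CpG sites.
bins = bin_methylation(1000, 3000, [(1000, 0.8), (1050, 0.6), (2999, 0.1)])
```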
  • Intergroup methylation analysis included comparison of methylation levels in functional genomic regions between groups, correlation analysis of methylation levels in functional genomic regions, and principal component cluster analysis of methylation in functional genomic regions. The comparison of methylation levels in functional genomic regions between groups was presented using a scatter plot. CpG methylation levels were statistically analyzed for all enhancers, 2K bp upstream of the gene transcription start site (up2k), promoters, 5'UTRs, exons, introns, 3'UTRs, and CpGi regions in different groups. Each region of each gene was divided into 20 bins according to its length, and the CpG methylation levels of each bin in each region of each gene were calculated.
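  • Principal component cluster analysis of methylation in functional genomic regions is presented as a two-dimensional PCA scatter plot. The first principal component (PC1) is the direction with the largest variance, and the second principal component (PC2) is perpendicular to PC1 and has the second-largest variance.
  • Points that lie closer together in the PC1-PC2 plane represent samples with higher correlation.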
  • Differential methylation analysis includes differential methylation region (DMR) identification, significant distribution of high and low methylation region identification, DMR length and level distribution, DMR region methylation level clustering, DMR annotation, DMR coverage region distribution, and DMR-related gene enrichment analysis.
  • the differentially methylated region (DMR) identification step reports, for each cohort comparison scheme, the number of DMRs, the average DMR length, the average number of CpG sites covered by a DMR, the number of DMRs that are hypermethylated in the latter cohort relative to the former cohort of the comparison scheme ID, and the number of DMRs that are hypomethylated in the latter relative to the former. For each DMR under each comparison scheme it also reports the chromosome number, DMR start site, DMR end site, P value, differential methylation level, number of CpG sites included, average methylation rate of the latter cohort, and average methylation rate of the former cohort.
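  • The per-scheme summary figures can be derived from a DMR table as sketched below. The dictionary keys are hypothetical, and a DMR is counted as hypermethylated when the latter cohort's average methylation rate exceeds the former's, following the "latter relative to the former" convention in the text. The input must contain at least one DMR.

```python
from statistics import mean

def summarize_dmrs(dmrs):
    """Summary statistics for the DMRs of one comparison scheme.

    Each DMR is a dict with 'start', 'end', 'n_cpg', and the average
    methylation rates 'former' and 'latter' of the two cohorts.
    """
    return {
        "count": len(dmrs),
        "mean_length": mean(d["end"] - d["start"] for d in dmrs),
        "mean_cpgs": mean(d["n_cpg"] for d in dmrs),
        "hyper": sum(1 for d in dmrs if d["latter"] > d["former"]),
        "hypo": sum(1 for d in dmrs if d["latter"] < d["former"]),
    }

# Hypothetical DMRs for a scheme such as "CR_vs_PR".
dmrs = [
    {"start": 100, "end": 500, "n_cpg": 8, "former": 0.20, "latter": 0.60},
    {"start": 900, "end": 1500, "n_cpg": 12, "former": 0.75, "latter": 0.30},
]
summary = summarize_dmrs(dmrs)
```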
  • the significance distribution of hypermethylated and hypomethylated regions was identified using 500 kb windows. Within each window, the Hyper DMR significance, genomic GC content, gene content, and Hypo DMR significance were statistically analyzed. Circos diagrams were created. The circles in the circos diagram represent, from the outermost to the innermost, chromosome number and length, Hyper DMR significance (the more outward, the stronger the significance), genomic GC content heatmap, gene content heatmap, and Hypo DMR significance (the more inward, the stronger the significance). Heatmap colors range from light to dark, indicating the correlation value from low to high. DMR length distribution was presented using a bar chart, with the horizontal axis representing DMR length and the vertical axis representing the number of DMRs of a specific length.
  • Methylation levels of DMRs in the different groups are visualized using violin plots, separately for hypermethylated and hypomethylated DMRs. DMR methylation levels are clustered and shown as heatmaps based on the average methylation level of each group, illustrating the methylation differences between groups. DMR annotation software is used to obtain, for the gene closest to each DMR, its location, strand (positive or negative), gene length, gene ID, and gene name, together with the distance from the DMR to the transcription start site (TSS). A circular plot shows the percentage of DMRs covering each type of genomic element. DMR-associated gene enrichment analysis includes GO and KEGG enrichment analysis, with separate GO and KEGG analyses performed for DMRs in promoter regions, where methylation has a clear biological significance. Results include enriched pathway identifiers, pathway descriptions, the proportion of genes with the pathway function within the entire background gene set, P values, and the number of DMR-associated genes enriched in that pathway.
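The per-comparison DMR statistics listed above (number of DMRs, average length, average CpG count, hypermethylated and hypomethylated counts) can be illustrated with a short sketch. The record fields (`start`, `end`, `n_cpg`, `diff`) and the convention that a positive `diff` means higher methylation in the latter cohort are assumptions made for the example, not details from the disclosure.

```python
from statistics import mean


def summarize_dmrs(dmrs):
    """Summarize the DMRs of one comparison scheme.

    Each DMR is a dict with hypothetical keys:
    start, end (half-open coordinates), n_cpg, diff (latter minus former).
    """
    lengths = [d["end"] - d["start"] for d in dmrs]
    return {
        "n_dmrs": len(dmrs),
        "mean_length": mean(lengths),
        "mean_cpgs": mean(d["n_cpg"] for d in dmrs),
        "n_hyper": sum(1 for d in dmrs if d["diff"] > 0),  # higher in latter cohort
        "n_hypo": sum(1 for d in dmrs if d["diff"] < 0),   # lower in latter cohort
    }
```

Calling `summarize_dmrs` once per comparison scheme would yield one row of the summary table.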
  • the visualization module generates graphs and/or tables based on the results of data processing and data analysis.
  • Visualization includes a graphics generation process and a table generation process. Because the report generation module cannot directly display the graphics and tables produced by the data processing and analysis modules, the visualization module links those graphics and tables to the report generation module so that it can correctly read a variety of complex pictures and tables.
  • The picture and table generation process organizes the picture and table paths under each section by the corresponding sample or comparison scheme and labels the results, so that picture results for multiple samples or comparison schemes can be viewed as a carousel and table results can be viewed by scrolling and paging.
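One way to picture the path organization described above is a nested manifest keyed by section and then by sample or comparison scheme. The sketch assumes a hypothetical directory layout `<section>/<sample-or-comparison>/<file>` and treats `.png` and `.svg` files as figures and everything else as tables; neither convention comes from the disclosure.

```python
from collections import defaultdict
from pathlib import PurePosixPath

FIGURE_SUFFIXES = {".png", ".svg"}  # assumed figure formats


def build_manifest(paths):
    """Group result files by section, then by sample or comparison label."""
    manifest = defaultdict(
        lambda: defaultdict(lambda: {"figures": [], "tables": []})
    )
    for raw in paths:
        path = PurePosixPath(raw)
        if len(path.parts) < 3:
            raise ValueError(f"expected <section>/<label>/<file>, got {raw!r}")
        section, label = path.parts[-3], path.parts[-2]
        kind = "figures" if path.suffix in FIGURE_SUFFIXES else "tables"
        manifest[section][label][kind].append(str(path))
    return manifest
```

A report front end could then iterate over `manifest[section]` to build one carousel slide per label for figures and one paged view per label for tables.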
  • the report generation module includes: directory module, text content module, graphic interaction module, and table interaction module.
  • Directory module: to make the report more interactive, a directory index is built by categorizing the results generated by data analysis. Clicking a multi-level title jumps directly to the corresponding content.
  • Text content module: to avoid manually rewriting the text description of each result section for every report, the text content is passed in as a parameter so that the text of each report is modified dynamically.
  • Graphic interaction module: to keep graphics that display a large amount of content understandable, zoom, box-select-and-magnify, selection, and undo functions are implemented.
  • Table interaction module: the number of rows displayed per page can be adjusted interactively according to the number of rows in the table, and a search box is provided to locate results quickly. When a result has many rows and columns, clicking a button downloads the table in csv or xlsx format for external viewing.
  • the report generation module of the embodiment of the present disclosure sets a directory module, a text content module, a graphic interaction module, and a table interaction module. Compared with static PDF or Word reports, it can use an interactive visual interface to perform customized exploration and analysis of methylation data by setting specific conditions.
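The table interactions described above (adjustable rows per page, search, and csv download) reduce to a few small operations on the table rows. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and xlsx export is left out because it would need a third-party library.

```python
import csv
import io


def search_rows(rows, query):
    """Return rows in which any cell contains the query (case-insensitive)."""
    needle = query.lower()
    return [r for r in rows if any(needle in str(v).lower() for v in r.values())]


def get_page(rows, page_number, rows_per_page):
    """Return the rows of a 1-based page for the chosen page size."""
    start = (page_number - 1) * rows_per_page
    return rows[start:start + rows_per_page]


def to_csv(rows):
    """Serialize rows (a non-empty list of dicts) to csv text for download."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In an interactive report these operations would normally run in the browser; the sketch only shows the logic.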
  • the disclosed embodiments provide a comprehensive platform for integrated methylation data processing, analysis, and visualization, comprising the following modules: a task generation module, a data analysis module, a visualization module, and a reporting module.
  • the task generation module generates task scripts based on the provided sample and background information.
  • the data analysis module performs processes based on the generated task scripts, including but not limited to sequencing data quality control, sequencing data alignment, alignment result statistics, sample methylation analysis, intergroup methylation analysis, and differential methylation analysis.
  • the visualization module generates graphs and tables based on the results generated by the data analysis module.
  • the reporting module outputs the graphs, tables, and text content generated by the visualization module.
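As a rough illustration of how a task generation module might turn sample and comparison information into an ordered task list, consider the sketch below. It assumes that quality control, alignment, alignment statistics, and sample methylation analysis run once per sample, and that intergroup and differential methylation analysis run once per comparison scheme named `<former>_vs_<latter>`. The step names and the tuple-based task format are hypothetical; the disclosure does not specify the script format.

```python
PER_SAMPLE_STEPS = (
    "quality_control",
    "alignment",
    "alignment_statistics",
    "sample_methylation_analysis",
)
PER_COMPARISON_STEPS = (
    "intergroup_methylation_analysis",
    "differential_methylation_analysis",
)


def generate_tasks(sample_to_cohort, comparisons):
    """Build an ordered list of (step, target) tasks.

    sample_to_cohort: dict mapping sample name -> cohort name
    comparisons: list of (former_cohort, latter_cohort) pairs
    """
    known_cohorts = set(sample_to_cohort.values())
    tasks = []
    for sample in sample_to_cohort:
        tasks.extend((step, sample) for step in PER_SAMPLE_STEPS)
    for former, latter in comparisons:
        missing = {former, latter} - known_cohorts
        if missing:
            raise ValueError(f"unknown cohort(s): {sorted(missing)}")
        scheme = f"{former}_vs_{latter}"
        tasks.extend((step, scheme) for step in PER_COMPARISON_STEPS)
    return tasks
```

A real task script would additionally carry file paths and tool parameters for each step.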
  • Because the methylation data processing and analysis platform of the embodiment of the present disclosure builds the underlying framework for each process in a modular manner, it supports the insertion and deletion of modules and thereby conveniently supports customized expansion.
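The idea of inserting and deleting modules in a modular framework can be illustrated with a small ordered registry. The `Pipeline` class below is a hypothetical design for illustration only and is not taken from the disclosure.

```python
class Pipeline:
    """Ordered collection of named processing modules (illustrative design)."""

    def __init__(self):
        self._modules = []  # (name, callable) pairs in execution order

    def insert(self, name, func, after=None):
        """Add a module at the end, or directly after the module named `after`."""
        names = self.names()
        if name in names:
            raise ValueError(f"module {name!r} is already registered")
        if after is None:
            self._modules.append((name, func))
        else:
            # list.index raises ValueError if `after` is not registered
            self._modules.insert(names.index(after) + 1, (name, func))

    def remove(self, name):
        """Delete a module by name; the order of the others is unchanged."""
        self._modules = [(n, f) for n, f in self._modules if n != name]

    def names(self):
        return [name for name, _ in self._modules]

    def run(self, data):
        """Pass `data` through every module in order and return the result."""
        for _, func in self._modules:
            data = func(data)
        return data


pipeline = Pipeline()
pipeline.insert("quality_control", lambda log: log + ["quality_control"])
pipeline.insert("alignment", lambda log: log + ["alignment"])
pipeline.insert("differential_methylation", lambda log: log + ["differential_methylation"])
pipeline.insert("ml_modeling", lambda log: log + ["ml_modeling"], after="alignment")
pipeline.remove("differential_methylation")
```

Here each module simply appends its own name to a log, which makes the effect of inserting and removing modules on the execution order easy to inspect.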
  • This embodiment still takes the aforementioned clinical tumor sample classification machine learning model construction content as an example, and adds machine learning to the methylation data processing and analysis platform.
  • the machine learning modeling module is used to expand the scope of intergroup comparative analysis. It should be noted that because sample size determines the lower limit of machine learning model performance, when using the machine learning modeling module, the number of samples in the two cohorts to be compared should be as large as possible, and the recommended sample size for each cohort is no less than 20.
  • The machine learning modeling module may be enabled as follows: when cohort comparison information is input, if the provided comparison scheme number includes a preset machine learning keyword (for example, "_ML" appended to a string in the "XX_vs_XX" numbering format, giving "XX_vs_XX_ML"), the machine learning modeling module is called to build and display a model alongside the aforementioned data processing and analysis.
  • The machine learning modeling module then builds a separate classification model for each of the two cohort pairs: CR versus PR, and SD versus PD.
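The keyword check on the comparison scheme number can be sketched as a small parser. The "_ML" suffix and the "XX_vs_XX" format come from the text; the function name, the return shape, and the assumption that cohort names do not themselves contain "_vs_" are illustrative.

```python
ML_KEYWORD = "_ML"  # preset machine learning keyword from the text


def parse_comparison_scheme(scheme_id):
    """Split 'A_vs_B' or 'A_vs_B_ML' into (former, latter, use_ml)."""
    use_ml = scheme_id.endswith(ML_KEYWORD)
    core = scheme_id[: -len(ML_KEYWORD)] if use_ml else scheme_id
    former, separator, latter = core.partition("_vs_")
    if not separator or not former or not latter:
        raise ValueError(f"expected 'XX_vs_XX' or 'XX_vs_XX_ML', got {scheme_id!r}")
    return former, latter, use_ml
```

For instance, a scheme number such as `CR_vs_PR_ML` would enable modeling for the CR and PR cohorts, while `SD_vs_PD` would run only the standard analysis.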
  • the machine learning modeling module constructs a classification model based on the generated task script, and splits multiple sample information into training sets and test sets; the constructed classification model is trained using the training set, and multiple classification algorithms are traversed during the training process and the accuracy of different classification algorithms is tested on the test set, and the classification model corresponding to the classification algorithm with the highest accuracy on the test set is selected and saved.
  • The machine learning modeling module uses a stratified splitting method to split the sample data of the two cohorts into a training set and a test set, ensuring that the proportions of the labels in the label column are consistent before and after the split. It then automatically fits models on the training set across most classification algorithms, and selects and saves the model with the highest accuracy on the test set.
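A split that preserves the label proportions, as described above, can be sketched in plain Python as follows. The test fraction of 0.25 and the fixed seed are assumptions for the example. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument serves the same purpose; the disclosure does not state which implementation is used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with label ratios preserved.

    Each label contributes roughly `test_fraction` of its samples
    (at least one) to the test set.
    """
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_indices, test_indices = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)
```

With 20 samples in one cohort and 12 in the other, a test fraction of 0.25 places 5 and 3 samples in the test set, so both sets keep the 5:3 label ratio.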
  • The classification algorithms can include logistic regression, multi-layer perceptron (MLP Classifier), K-nearest neighbors (K Neighbors Classifier), support vector machine (SVC), Gaussian process (Gaussian Process Classifier), decision tree (Decision Tree Classifier), Gaussian naive Bayes (Gaussian NB), random forest (Random Forest Classifier), discriminant analysis (Discriminant Analysis), etc.
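The selection loop described above (fit every candidate on the training set, score each on the test set, keep the most accurate) can be sketched independently of any particular library. To keep the example self-contained, two toy classifiers stand in for the listed algorithms; they are placeholders, not part of the disclosure. Any object with `fit(X, y)` and `predict(X)` methods, such as a scikit-learn estimator, could be passed in the same way.

```python
from collections import Counter


class ToyMajorityClassifier:
    """Always predicts the most frequent training label (baseline stand-in)."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyNearestCentroidClassifier:
    """Predicts the label whose mean feature vector is closest (stand-in)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        self.centroids_ = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: squared_distance(row, self.centroids_[lab]))
            for row in X
        ]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best_model(candidates, X_train, y_train, X_test, y_test):
    """Fit each candidate and return (name, model, test accuracy) of the best."""
    best = (None, None, -1.0)
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        score = accuracy(y_test, model.predict(X_test))
        if score > best[2]:  # on ties, the earlier candidate is kept
            best = (name, model, score)
    return best
```

The returned model could then be persisted, for example with Python's `pickle` module, although the disclosure does not say how the selected model is saved.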
  • the report generation module can display the ROC curve and PR curve of the constructed machine learning model, as well as the specificity, sensitivity, accuracy, and AUC result tables of the model on the training set and test set.
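The tabulated metrics have standard definitions, which the sketch below spells out for a binary classifier. The 0.5 decision threshold and the function names are assumptions for illustration. AUC is computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half, which equals the area under the ROC curve.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy for labels in {0, 1}.

    Assumes both classes are present in y_true.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predictions) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predictions) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, predictions) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, predictions) if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Computing these once on the training set and once on the test set gives the two rows of the result table mentioned above.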
  • An embodiment of the present disclosure also provides a methylation data processing and analysis platform, comprising a memory; and a processor connected to the memory, wherein the memory is used to store instructions, and the processor is configured to execute the steps of the methylation data processing and analysis method as described in any embodiment of the present disclosure based on the instructions stored in the memory.
  • a methylation data processing and analysis platform may include: a processor 810, a memory 820, a bus system 830, and a transceiver 840.
  • the processor 810, the memory 820, and the transceiver 840 are connected via the bus system 830.
  • The memory 820 is configured to store instructions, and the processor 810 is configured to execute the instructions stored in the memory 820 to control the transceiver 840 to transmit and receive signals.
  • the transceiver 840 may receive multiple sample information.
  • the processor 810 generates a task script based on the received sample information and performs data processing and data analysis based on the generated task script.
  • the data processing includes sequentially performing sequencing data quality control, sequencing data alignment, and alignment result statistics.
  • the data analysis includes at least one of the following: sample methylation analysis, inter-group methylation analysis, and differential methylation analysis.
  • Graphs and/or tables are generated based on the results of the data processing and analysis.
  • An interactive report is output based on the generated graphs and/or tables.
  • processor 810 may be a central processing unit (CPU), or may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the memory 820 may include a read-only memory and a random access memory, and provides instructions and data to the processor 810. A portion of the memory 820 may also include a non-volatile random access memory. For example, the memory 820 may also store information about the device type.
  • bus system 830 may also include a power bus, a control bus, a status signal bus, etc.
  • various buses are labeled as the bus system 830 in FIG.
  • the processing performed by the processing device can be completed by hardware integrated logic circuits in the processor 810 or by instructions in the form of software. That is, the method steps of the embodiment of the present disclosure can be embodied as being executed by a hardware processor, or by a combination of hardware and software modules in the processor.
  • the software module can be located in a storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable memory, a register, etc.
  • the storage medium is located in the memory 820, and the processor 810 reads the information in the memory 820 and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • the present disclosure also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the methylation data processing and analysis method described in any of the embodiments of the present disclosure.
  • the methylation data processing and analysis method driven by the execution of executable instructions is substantially identical to the methylation data processing and analysis method described in the aforementioned embodiments of the present disclosure and is not further described here.
  • various aspects of the methylation data processing and analysis method provided by the present disclosure may also be implemented in the form of a program product, which includes program code.
  • When the program product is run on a computer device, the program code causes the computer device to execute the steps of the methylation data processing and analysis method according to the various exemplary embodiments of the present disclosure described above in this specification.
  • the computer device may execute the methylation data processing and analysis method described in the embodiments of the present disclosure.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or component, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
  • Such software may be distributed on a computer-readable medium, which may include a computer storage medium (or non-transitory medium) and a communication medium (or temporary medium).
  • a computer storage medium includes volatile and non-volatile, removable, and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data).
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • communication media generally embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.


Abstract

The invention relates to a methylation data processing and analysis method and platform, a storage medium, and a program product. The method comprises: receiving a plurality of items of sample information and generating a task script based on the received sample information, wherein the sample information comprises sample cohort information, cohort comparison information, and sample methylation sequencing data, the sample cohort information comprises a sample and the cohort to which the sample belongs, the cohort comparison information comprises one or more cohort groups to be compared, and each cohort group comprises two cohorts; performing data processing and data analysis based on the generated task script, wherein the data processing comprises sequencing data quality control, sequencing data alignment, and alignment result statistics, and the data analysis comprises sample methylation analysis, inter-group methylation analysis, and differential methylation analysis; generating a graph and/or a table based on the results of the data processing and data analysis; and outputting an interactive report based on the generated graph and/or table.
PCT/CN2024/082205 2024-03-18 2024-03-18 Methylation data processing and analysis method and platform, storage medium, and program product Pending WO2025194306A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2024/082205 WO2025194306A1 (fr) 2024-03-18 2024-03-18 Methylation data processing and analysis method and platform, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2024/082205 WO2025194306A1 (fr) 2024-03-18 2024-03-18 Methylation data processing and analysis method and platform, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2025194306A1 true WO2025194306A1 (fr) 2025-09-25

Family

ID=97138289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/082205 Pending WO2025194306A1 (fr) 2024-03-18 2024-03-18 Methylation data processing and analysis method and platform, storage medium, and program product

Country Status (1)

Country Link
WO (1) WO2025194306A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273663A (zh) * 2017-05-22 2017-10-20 人和未来生物科技(长沙)有限公司 一种dna甲基化测序数据计算解读方法
CN107563152A (zh) * 2017-08-03 2018-01-09 北京百迈客生物科技有限公司 基于生物云平台的甲基化数据分析应用系统
US20180051343A1 (en) * 2014-08-08 2018-02-22 Ait Austrian Institute Of Technology Gmbh Thyroid cancer diagnosis by dna methylation analysis
CN111261229A (zh) * 2020-01-17 2020-06-09 广州基迪奥生物科技有限公司 一种MeRIP-seq高通量测序数据的生物分析流程
CN112201302A (zh) * 2019-07-08 2021-01-08 广州基迪奥科技服务有限公司 一种转录组和dna甲基化数据关联分析方法及系统



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24929996

Country of ref document: EP

Kind code of ref document: A1