WO2025133869A2 - Method for obtaining data for predicting the risk of metabolic disease - Google Patents
Method for obtaining data for predicting the risk of metabolic disease
- Publication number
- WO2025133869A2 (application PCT/IB2024/062681)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- sub
- sequences
- microbiota
- dna data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/60—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to nutrition control, e.g. diets
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Definitions
- This disclosure relates to systems, methods, and computer-readable media with instructions for analysis, risk prediction, decision-making, and/or disease monitoring, and for identifying markers, patterns, and relationships among relevant biological data.
- It relates to computer-implemented methods and systems for processing data using machine learning and artificial intelligence to obtain disease risk prediction data.
- Prediction and early detection of disease allow for early intervention. For many diseases, early detection increases the likelihood of successful treatment and provides patients with the best range of options for making quality-of-life decisions. It also allows for increased preventive care and timely diagnoses.
- Predictive analytics is an approach to disease prediction and early detection that uses data and algorithms to identify the likelihood of future outcomes and provide disease risk data.
- AI techniques such as case-based reasoning and data-driven machine learning (ML) algorithms have been used to support decision-making processes in complex tasks. This is used, for example, to assist medical professionals in making clinical decisions by providing disease risk data and predicted prognoses from machine learning (ML) models.
- US10347368B2 discloses a method for characterizing a microbiome-related condition and determining therapeutic measures, as well as for evaluating, diagnosing, and treating at least one cardiovascular disease in at least one individual.
- the method comprises receiving a collection of biological samples from a population of individuals; generating at least one set of information on microbiome composition and another on microbiome functional diversity for said population; developing a description of the cardiovascular disease condition based on features obtained from the data on microbiome composition and functional diversity; based on this description, creating a therapy model designed to address the cardiovascular disease condition.
- The set of data on microbiome characteristics encompasses at least one component selected from among the taxonomic characteristics, compositional diversity, functional diversity, and functional characteristics of the microbiome.
- the document discloses an output device linked to an individual to promote therapy based on a description and therapy model.
- document US10347368B2 discloses that it can transform the complementary data set and the features extracted from the microbiome composition data set and the microbiome functional diversity data set into a cardiovascular disease condition characterization model, for which it uses computational methods (for example, statistical methods, machine learning methods, artificial intelligence methods, bioinformatics methods, etc.) to characterize a subject that presents characteristics of a group of subjects with cardiovascular disease.
- document CN116525105B discloses a system for anticipating and predicting the prognosis of cardiogenic shock, along with a device and a storable tool.
- the device includes several units, such as an acquisition unit, a feature extraction unit, a diagnosis unit, a second acquisition unit, and a decision-making unit.
- the acquisition unit collects genetic information from a sample of peripheral blood mononuclear cells and a label indicating whether or not a treatment is applied.
- the feature extraction unit extracts genetic information to obtain genetic characteristics, while the diagnosis unit determines the need for therapy based on a genetic signature.
- the second acquisition unit obtains the diagnosis result of the cardiogenic shock stage, and the decision-making unit selects a treatment plan based on the diagnosis result in stages.
- the system developed by the application predicts the effect of the treatment based on the patient's genetic information.
- Document CN116525105B also discloses a specific method for detecting genetic features that includes differential expression analysis, loop feature detection, and correlation analysis of important genes. The method selects 21 genes and applies a machine learning algorithm to identify them and obtain the optimal biomarker.
- CN116525105B discloses that the system uses a machine learning model with several classification algorithms, such as KNN, decision tree, random forest, SVM, logistic regression, GBDT, XGBoost, and incorporates a random forest-based negative feedback mechanism.
- Document US20190108912A1 presents a method for a machine learning system that analyzes clinical data to identify latent patterns that predict disease.
- the data sets used as training data for the learning system include information about test results, phenotypes, environment, demographics, geography, genetics, clinical data, insurance claims, and treatments.
- the machine learning system can discover sequences and combinations of events that would not be apparent to a human reviewer but are nevertheless reliable predictors of medically important outcomes.
- US20190108912A1 explains that the machine learning system generates reports for healthcare professionals, for example, reports on a particular patient's risk of fibromyalgia. This predictive report enables healthcare professionals to perform additional testing and begin treatment interventions much earlier than would otherwise be possible.
- the machine learning algorithm incorporates a neural network. However, it notes that it may employ any appropriate machine learning system, such as one or more of a random forest, a grid search, or a support vector machine.
- the aforementioned documents do not disclose the combination of input data sources, such as genomic DNA data, microbiota data, and clinical data, as inputs for the learning methods. This significantly affects the classification performance of the learning method or model.
- the present disclosure describes a computer-implemented method for obtaining disease risk prediction data comprising the steps: a) obtaining clinical data, microbiota DNA data, and genomic DNA data from a database; b) generating preprocessed clinical data, preprocessed genomic DNA data, preprocessed microbiota DNA data from the clinical data, microbiota DNA data, and genomic DNA data obtained in step a) by data preprocessing; c) applying a plurality of supervised learning methods to the preprocessed data in step b); d) obtaining disease risk prediction data by each of the supervised learning methods that make up the plurality of learning methods in step c); e) selecting the supervised learning method with the highest metric from the plurality of supervised learning methods in step d); f) selecting the disease risk prediction data from the supervised learning method selected in step e) and storing it in a database.
- The preprocessed genomic DNA data is generated by a method comprising the sub-steps: a2) evaluating the quality of the sequencing data of the genomic DNA data from step a); b2) processing, by the operations of trimming, cleaning, filtering and removing unwanted sequences, the sequencing data of the genomic DNA data evaluated in sub-step a2); c2) performing the alignment of the sequences of the genomic DNA data processed in sub-step b2); d2) detecting genetic variants, recalibrating the quality of the sequencing bases and filtering variants of the genomic DNA data aligned in sub-step c2); e2) performing the annotation of the function, in its correlation with a disease, of the genetic variants of the genomic DNA data sequences obtained in sub-step d2) and storing the annotation in a database; f2) filtering variants of genes associated with metabolic disorders from the variants annotated in sub-step e2); g2) encoding mutations using the One Hot Encoder technique for the variants of the genomic DNA data sequences filtered in sub-step f2).
- An additional step of dimension reduction is performed on the preprocessed genomic DNA data after step g2), in which mutations are encoded using the One Hot Encoder technique for the sequence variants of the DNA data filtered in sub-step f2).
- The pre-processed microbiota DNA data is generated by a method comprising the sub-steps: a3) assessing the quality of the microbiota DNA data from step a); b3) processing, by the operations of trimming, cleaning, filtering and removing unwanted sequences, the sequencing data of the microbiota DNA data assessed in sub-step a3); c3) identifying and removing cross-contamination sequences of human origin from the microbiota DNA data processed in sub-step b3); d3) identifying gene functions and categorizing the genomic sequences, separating the genomic sequences into sets, and finding the taxonomic composition of the microbial community, from the microbiota DNA data after removing cross-contamination sequences in sub-step c3); e3) trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data after applying sub-step d3); f3) assigning potential biological functions in the microbiota DNA data sequences after the application of sub-step e3); g
- An additional step of dimension reduction is performed on the pre-processed microbiota DNA data subsequent to step j3).
- FIG. 1 illustrates a block diagram of a general description of the computer-implemented method of obtaining disease risk prediction data of the present disclosure.
- FIG. 2 illustrates a flowchart of an overview of an exemplary embodiment of the computer-implemented method of obtaining disease risk prediction data of the present disclosure.
- the present disclosure describes a system, a computer-readable storage medium, and a computer-implemented method of obtaining disease risk prediction data.
- the computer-implemented method of obtaining a disease risk prediction data of the present disclosure comprises the steps of: a) obtaining clinical data, microbiota DNA data, and genomic DNA data from a database; b) generating preprocessed clinical data, preprocessed genomic DNA data, preprocessed microbiota DNA data, from the clinical data, microbiota DNA data, and genomic DNA data obtained in step a) by data preprocessing; c) applying a plurality of supervised learning methods to the preprocessed data in step b); d) obtaining a disease risk prediction data by each of the supervised learning methods that make up the plurality of learning methods in step c); e) selecting the supervised learning method with the highest metric from the plurality of supervised learning methods in step d).
- f) selecting the disease risk prediction data produced by the supervised learning method selected in step e) and storing it in a database.
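- Steps c) to f) can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the two toy learners, the accuracy metric, the feature rows and the labels are all hypothetical stand-ins for the plurality of supervised learning methods and the preprocessed data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]
Predictor = Callable[[Row], int]
Trainer = Callable[[List[Row], List[int]], Predictor]


def train_majority(X: List[Row], y: List[int]) -> Predictor:
    """Toy learner: always predicts the most frequent training label."""
    majority = max(set(y), key=y.count)
    return lambda row: majority


def train_nearest_neighbour(X: List[Row], y: List[int]) -> Predictor:
    """Toy learner: predicts the label of the closest training row."""
    def predict(row: Row) -> int:
        distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for x in X]
        return y[distances.index(min(distances))]
    return predict


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best_method(
    trainers: Dict[str, Trainer],
    X_train: List[Row], y_train: List[int],
    X_val: List[Row], y_val: List[int],
) -> Tuple[str, List[int], float]:
    """Fit every method, score it, and return the best one with its predictions."""
    results = {}
    for name, trainer in trainers.items():
        predictor = trainer(X_train, y_train)                # step c)
        predictions = [predictor(row) for row in X_val]      # step d)
        results[name] = (accuracy(y_val, predictions), predictions)
    best = max(results, key=lambda name: results[name][0])   # step e)
    return best, results[best][1], results[best][0]          # step f) would store these


# Hypothetical preprocessed features (one row per subject) and risk labels.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.15, 0.15]]
y_train = [0, 0, 1, 1, 0]
X_val = [[0.12, 0.18], [0.85, 0.9]]
y_val = [0, 1]

trainers = {"majority": train_majority, "nearest_neighbour": train_nearest_neighbour}
best_name, best_predictions, best_score = select_best_method(
    trainers, X_train, y_train, X_val, y_val
)
```

- Passing the learners as a dictionary keeps the selection logic independent of which supervised learning methods are compared; the metric could equally be replaced by another score.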
- the computer-implemented method of obtaining disease risk prediction data of the present disclosure allows filtering out low-quality data and data that may be subject to technical noise due to their low representativeness.
- Technical noise is understood to be data with a low count, for which it cannot be established whether a trend in the data universe is due to the sensitivity of the technique or has biological significance.
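- A low-count filter of this kind might look like the sketch below. The thresholds `min_total` and `min_samples`, the function name and the taxon counts are assumptions made for illustration; the disclosure does not state specific values.

```python
from typing import Dict, List


def filter_low_count_features(
    counts: Dict[str, List[int]], min_total: int = 10, min_samples: int = 2
) -> Dict[str, List[int]]:
    """Keep features whose total count and prevalence clear the given thresholds."""
    kept = {}
    for feature, per_sample in counts.items():
        total = sum(per_sample)
        present_in = sum(1 for c in per_sample if c > 0)
        if total >= min_total and present_in >= min_samples:
            kept[feature] = per_sample
    return kept


# Hypothetical read counts per taxon across three samples.
counts = {"taxon_A": [120, 98, 143], "taxon_B": [1, 0, 0], "taxon_C": [0, 7, 6]}
filtered = filter_low_count_features(counts)
```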
- the computer-implemented method for obtaining disease risk prediction data of the present disclosure allows obtaining disease risk prediction data, which can be modified in an additional stage of intervention in a user through changes, for example, in lifestyle, diet, exercise or through the intake of a consumable that includes ingredients that, for example, are selected from the group of antioxidants, probiotics, prebiotics, postbiotics, vitamin supplements, minerals, fibers, natural extracts, immunonutrients, amino acids, plant proteins, animal proteins, which alone or in combined formulas, can modulate metabolic pathways, molecular pathways, cellular pathways, immunological pathways, vascular pathways, pathways involved in sleep management, mood, memory, oxidative stress, aging, caloric expenditure, nutrient absorption, gastrointestinal status, associated with the function and/or modulation of the microbiome or clinical markers involved in some pathology.
- Other interventions may include fermented food matrices, functional foods, and vegan, vegetarian, carnivorous, ketogenic, paleo, Mediterranean, and low-calorie diets, which can be evaluated over different time frames.
- The preprocessed clinical data, preprocessed genomic DNA data and preprocessed microbiota DNA data are generated by preprocessing performed serially, independently, or in parallel on the clinical data, microbiota DNA data and genomic DNA data obtained in step a).
- In step c) of the computer-implemented method of obtaining disease risk prediction data of the present disclosure, the plurality of supervised learning methods are applied to the pre-processed data of step b) serially, independently, or in parallel.
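- The difference between serial and parallel execution can be illustrated with the sketch below. The preprocessing functions are placeholders invented for the example and do not reproduce the pipelines of the disclosure; only the scheduling pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def preprocess_clinical(records: List[dict]) -> List[dict]:
    """Placeholder: drop clinical records that contain a missing value."""
    return [r for r in records if None not in r.values()]


def preprocess_reads(reads: List[str]) -> List[str]:
    """Placeholder for the DNA pipelines: upper-case reads and drop empty ones."""
    return [read.upper() for read in reads if read]


def run_serial(tasks: Dict[str, Callable], data: Dict[str, list]) -> Dict[str, list]:
    """Run each preprocessing task one after another."""
    return {name: func(data[name]) for name, func in tasks.items()}


def run_parallel(tasks: Dict[str, Callable], data: Dict[str, list]) -> Dict[str, list]:
    """Run the independent preprocessing tasks concurrently."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(func, data[name]) for name, func in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical inputs for the three data sources.
data = {
    "clinical": [{"age": 54, "glucose": 101}, {"age": 61, "glucose": None}],
    "genomic": ["acgt", "", "ttga"],
    "microbiota": ["ggcc", "atat"],
}
tasks = {
    "clinical": preprocess_clinical,
    "genomic": preprocess_reads,
    "microbiota": preprocess_reads,
}
serial_result = run_serial(tasks, data)
parallel_result = run_parallel(tasks, data)
```

- Because the three data sources do not depend on one another during preprocessing, both schedules produce the same result.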
- The pre-processed genomic DNA data is generated by a method comprising the sub-steps: a2) evaluating the sequencing data quality of the genomic DNA data of step a); b2) processing, by the operations of trimming, cleaning, filtering and removing unwanted sequences, the sequencing data of the genomic DNA data evaluated in sub-step a2); c2) performing sequence alignment of the genomic DNA data processed in sub-step b2); d2) detecting genetic variants, recalibrating the sequencing base quality and filtering variants of the genomic DNA data aligned in sub-step c2); e2) performing the annotation of the function, in its correlation with a disease, of the genetic variants of the sequences of the genomic DNA data obtained in sub-step d2) and storing the annotation in a database; f2) filtering variants of genes associated with metabolic disorders from the variants annotated in sub-step e2);
- The quality parameters evaluated include: GC content per base (with defined normal-distribution behavior); GC content per sequence (normal behavior of the %GC value); adapter content (for example, no adapters); over-represented sequences (for example, no over-represented sequences at the ends of the sequences); and sequence length distribution.
- After compliance with these parameters is checked, the sequence reads are submitted, for example, to a cleaning tool (such as Trimmomatic), which normalizes the values if possible; otherwise, the sequence reads are removed from the study. If the quality criterion is not met, for example, in at least 80% of the sequences, the entire set is rejected.
- Base quality is measured by the PHRED score, which is a logarithmic value indicating the probability of a base being incorrect.
- a PHRED score of 30 corresponds to a probability of error of 1 in 1,000. Thus, for example, reads with high PHRED scores indicate higher quality.
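- The relationship between a PHRED score Q and the error probability P is Q = -10 log10(P). The sketch below converts in both directions; the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """PHRED score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


p30 = phred_to_error_probability(30)   # 1 error in 1,000
p20 = phred_to_error_probability(20)   # 1 error in 100
```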
- Genomic DNA sequencing data quality assessment can be performed using tools such as FastQC and MultiQC. Aspects such as base quality, the presence of adapters, and length distribution are analyzed. Detailed reports are optionally generated for each sample.
- The quality cutoff is PHRED 30, with a length distribution of at least 80% of the read size, GC content per base with defined normal-distribution behavior, and GC content per sequence with normal behavior of the %GC value; adapter content and over-represented sequences at the ends are removed completely.
- Sequence length distribution is checked for uniformity. After compliance with these parameters is checked, the sequence reads are submitted, for example, to a cleaning tool (such as Trimmomatic), which normalizes the values if possible; if not, the sequence reads are removed from the study. If the quality criterion is not met, for example, in at least 80% of the sequences, the entire set is rejected.
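- One possible reading of these cutoffs is sketched below. It assumes that a read passes when its mean PHRED score is at least 30 and its length is at least 80% of the nominal read size, and that the set is accepted only when at least 80% of its reads pass. That interpretation, the function names and the example reads are assumptions; the GC-content and adapter checks are omitted.

```python
from typing import List, Sequence


def read_passes(
    qualities: Sequence[int],
    nominal_length: int,
    min_phred: float = 30.0,
    min_length_fraction: float = 0.8,
) -> bool:
    """A read passes if it is long enough and its mean PHRED score is high enough."""
    if len(qualities) < min_length_fraction * nominal_length:
        return False
    return sum(qualities) / len(qualities) >= min_phred


def accept_read_set(
    reads: List[Sequence[int]], nominal_length: int, min_pass_fraction: float = 0.8
) -> bool:
    """Accept the whole set only if enough of its reads pass the per-read check."""
    passing = sum(read_passes(q, nominal_length) for q in reads)
    return passing / len(reads) >= min_pass_fraction


# Hypothetical per-base PHRED scores for reads with a nominal length of 10.
good = [35] * 10
shorter_but_ok = [32] * 9
low_quality = [20] * 10
too_short = [35] * 5

rejected_set = accept_read_set([good, shorter_but_ok, low_quality, too_short], 10)
accepted_set = accept_read_set([good, good, good, shorter_but_ok, low_quality], 10)
```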
- In sub-step b2), the method that generates the pre-processed genomic DNA data processes the sequencing data of the DNA data evaluated in sub-step a2), by the operations of trimming, cleaning, filtering and removing unwanted sequences, through the following sub-steps: i. reading the genomic DNA data evaluated in sub-step a2); ii. removing low-quality bases at the beginning and end of each read, for example, with PHRED 30, and PHRED 25 as the minimum value; iii. performing sliding window trimming to remove low-quality regions, for example, requiring that every 3 bases have on average a value of PHRED 30 or at least PHRED 25; iv.
- low-quality bases refers to nucleotides whose reads have a low probability of being correct.
- Each base in a sequenced DNA sequence is represented by a letter (A, C, G, or T), and each base is assigned a quality score.
- The quality of a base is usually expressed as a PHRED value, defined as the negative base-10 logarithm of the error probability multiplied by ten (Q = -10 log10 P).
- A PHRED score of 20 means there is a 1 in 100 chance that the base is incorrect.
- bases whose PHRED score falls below a specified threshold are considered low-quality. These bases are removed to improve the overall quality of the sequence and to prevent the inclusion of erroneous information in subsequent analysis.
- Trimming, cleaning, filtering, and removing unwanted sequences from genomic DNA sequencing data can be performed using tools such as Trimmomatic, a tool that performs multiple preprocessing operations on sequencing data to improve the quality and usefulness of reads.
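- End trimming and sliding window trimming can be illustrated with the simplified sketch below. It is not Trimmomatic's implementation: the window is cut at its first position as soon as its mean quality falls below the threshold, and the thresholds, function names and example read are assumptions.

```python
from typing import List, Tuple


def trim_ends(seq: str, quals: List[int], min_q: int = 25) -> Tuple[str, List[int]]:
    """Remove bases below min_q from the beginning and end of a read."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def sliding_window_trim(
    seq: str, quals: List[int], window: int = 3, min_mean_q: float = 25.0
) -> Tuple[str, List[int]]:
    """Cut the read at the first window whose mean quality is below min_mean_q."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean_q:
            return seq[:i], quals[:i]
    return seq, quals


# Hypothetical read with poor quality at both ends.
ends_trimmed = trim_ends("ACGTACGTAC", [10, 32, 35, 36, 34, 33, 30, 12, 8, 5])

# Hypothetical read whose quality collapses in the middle.
window_trimmed = sliding_window_trim("AAACCG", [35, 35, 35, 20, 10, 35])
```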
- a FASTQ file typically uses four lines per sequence.
- Line 1 begins with an "@" character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 contains the nucleotides that were sequenced.
- Line 3 begins with a "+" character, optionally followed by the same sequence identifier.
- Line 4 encodes the quality values for the sequence in line 2 as ASCII (American Standard Code for Information Interchange) characters corresponding to the PHRED values, and must contain the same number of symbols as there are letters in the sequence.
- a FASTQ file has the following structure:
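- The four-line layout can be illustrated with a small parser. The record below is a hypothetical example, not data from the disclosure, and the quality offset of 33 (the common Phred+33 encoding) is an assumption about how the files are encoded.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def parse_fastq(text: str, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines, decoding qualities to PHRED scores."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


# Hypothetical single-record FASTQ content.
example = """@read_1 hypothetical example
GATTACA
+
IIIII5?
"""
records = list(parse_fastq(example))
```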
- In sub-step c2), the method that generates the pre-processed genomic DNA data aligns the sequences of the DNA data processed in sub-step b2) through the sub-steps of: i. building an index from a reference sequence; ii. aligning the sequences of the DNA data processed in sub-step b2) with the reference index; iii. converting the files between formats, for example, from SAM to BAM format; iv. sorting the file, for example, the BAM file.
- the reference index refers to an efficient representation of the genome sequence. This index is used to quickly and efficiently search for matches between DNA sequences and the reference genome sequence.
- the reference sequence is the known and annotated DNA sequence of a specific organism. For example, for the human genome, the reference sequence would be the version of the human genome that has been assembled and comprehensively annotated.
- Building a reference index is a process in which data structures are created from the reference sequence. These structures allow for searches during the alignment process.
- the alignment process seeks to identify where in the reference genome the sequences align with the genomic DNA data processed in sub-stage b2).
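- The idea of indexing a reference and then looking reads up in it can be shown with a toy k-mer index. Production aligners use far more compact index structures and more tolerant matching; the reference string, the value of k and the mismatch limit below are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_read(
    read: str, reference: str, index: Dict[str, List[int]], k: int,
    max_mismatches: int = 1,
) -> Optional[int]:
    """Seed with the read's first k-mer, then verify the full read at each hit."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]


# Hypothetical reference sequence and reads.
reference = "ACGTTGCATGCCGATAGCTA"
k = 4
index = build_kmer_index(reference, k)

exact_hit = align_read("GCATGCCG", reference, index, k)
one_mismatch_hit = align_read("GCATGACG", reference, index, k)
no_hit = align_read("TTTTTTTT", reference, index, k)
```

- Building the index once and reusing it for every read is what makes the lookup fast compared with scanning the whole reference for each read.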
- In sub-step d2), the method that generates the pre-processed genomic DNA data detects genetic variants, recalibrates the quality of the sequencing bases and filters variants of the genomic DNA data aligned in sub-step c2) through the sub-steps of: i. sorting files, e.g., SAM/BAM files, by ascending coordinates, facilitating efficient access to specific genome regions; ii. identifying and flagging duplicates in the file, e.g., BAM files, to avoid biased results in variant analysis; iii. providing detailed statistics for the file, e.g., BAM files, including the number of mapped and unmapped reads, the number of duplicates, and so on.
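- Sub-steps i to iii can be mimicked on in-memory records as sketched below. This does not read SAM/BAM files and is not how SAMtools or Picard work internally: the `Alignment` record, the duplicate rule (same chromosome, position and strand) and the lexicographic chromosome ordering are simplifying assumptions.

```python
from typing import Dict, List, NamedTuple, Optional


class Alignment(NamedTuple):
    name: str
    chrom: Optional[str]  # None marks an unmapped read
    pos: int
    strand: str
    duplicate: bool = False


def sort_by_coordinate(alignments: List[Alignment]) -> List[Alignment]:
    """Sort by chromosome name, then ascending position; unmapped reads go last."""
    return sorted(alignments, key=lambda a: (a.chrom is None, a.chrom or "", a.pos))


def flag_duplicates(sorted_alignments: List[Alignment]) -> List[Alignment]:
    """Flag every mapped read that repeats an already seen (chrom, pos, strand)."""
    seen = set()
    flagged = []
    for a in sorted_alignments:
        key = (a.chrom, a.pos, a.strand)
        is_duplicate = a.chrom is not None and key in seen
        seen.add(key)
        flagged.append(a._replace(duplicate=is_duplicate))
    return flagged


def summarize(alignments: List[Alignment]) -> Dict[str, int]:
    """Count total, mapped, unmapped and duplicate reads."""
    mapped = sum(1 for a in alignments if a.chrom is not None)
    return {
        "total": len(alignments),
        "mapped": mapped,
        "unmapped": len(alignments) - mapped,
        "duplicates": sum(1 for a in alignments if a.duplicate),
    }


# Hypothetical alignment records.
reads = [
    Alignment("r1", "chr2", 500, "+"),
    Alignment("r2", "chr1", 100, "+"),
    Alignment("r3", "chr1", 100, "+"),
    Alignment("r4", None, 0, "+"),
    Alignment("r5", "chr1", 50, "-"),
]
processed = flag_duplicates(sort_by_coordinate(reads))
stats = summarize(processed)
```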
- An example of the statistics of this sub-step is provided using the SAMtools flagstat tool:
- Picard tool options can be used to establish a criterion for the quality of read mapping to the reference: v. Identify sequencing errors and recalibrate base quality scores to improve variant calling accuracy; vi. Apply base recalibration, e.g., of BAM files, by adjusting base quality scores; vii. Analyze covariates used in the base recalibration process to assess quality; viii. Detect somatic variants in cancer sequencing data compared to normal sequences; ix. Calculate pileup (stack) coverage statistics for variants in the context of somatic variant analysis; x. Calculate sample contamination by the normal sample in somatic variant analysis; xi. Filter variants based on orientation biases observed during sequencing; xii. Apply additional filters to somatic variant calls.
- results are obtained for genome interpretation.
- genomic DNA data processed in substep (b2) are organized, followed by duplicate removal to minimize potential bias in subsequent analyses.
- the statistics obtained in step (ix) provide an overview of the dataset, including information on mapped, unmapped, and duplicate reads. Base recalibration and the application of these adjustments improve the accuracy of variant calling by addressing potential sequencing errors. Covariate analysis contributes to assessing the quality of this recalibration process.
- the detection of somatic variants reveals cancer-specific mutations compared to normal sequences.
- the generation of coverage statistics and the assessment of contamination provide essential insights for the accurate interpretation of somatic variants.
- HG38 is taken as the reference genome and default values are used.
- In sub-step e2), the method that generates the preprocessed genomic DNA data annotates the function, in its correlation with a disease, of the genetic variants of the genomic DNA data sequences obtained in sub-step d2) and stores the annotation in a database through the sub-steps of: i. reading the genomic DNA data sequences obtained in sub-step d2); ii. calling variants, using for example tools such as SnpEff; iii. configuring an annotation database. In one example, ClinVar is configured as the database for the annotation of the variants. ClinVar is a public database that stores information on genetic variants and their relationship with human diseases, and it is considered a valuable source for the annotation of genetic variants.
- the result of applying these sub-steps will be the genomic DNA data annotated, for example, in an annotated VCF file, or in a database.
- In the case of coding consequences (those that affect the amino acid sequence of a protein), the transcript with the best transcript support level (TSL), or the canonical transcript, should be placed first; iv. locating the variant in the genome at the genomic coordinates of the feature; v. comparing the feature IDs alphabetically, even if the ID is a number.
- To execute sub-step e2), that is, to annotate the function, in its correlation with a disease, of the genetic variants of the genomic DNA data sequences obtained in sub-step d2), tools such as SnpEff can be used, and the annotation can be stored in a database such as dbNSFP v4, which is a complete database of transcript-specific functional annotations for human single nucleotide variants (SNVs).
- SnpEff is a tool used in bioinformatics and genomics for the analysis of genetic variants, e.g. single nucleotide polymorphisms (SNPs) and insertion/deletion variants (indels), in genomic sequences. It is used, for example, to predict the functional effects of genetic variants, i.e., how the variants may affect proteins and genes, and to annotate the biological consequences of these variants, among other functions.
- In sub-step f2), the method that generates the preprocessed genomic DNA data filters variants of genes associated with metabolic disorders from the variants annotated in sub-step e2) through the sub-steps of: i. filtering the variants to eliminate those with low quality (for example, quality less than PHRED 25); ii. identifying genes associated with metabolic disorders; iii. filtering the variants to retain only those that are within the metabolic genes identified in the previous step; iv. filtering according to criteria such as allele frequency and prediction of functional effect.
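- The four filters can be chained as in the sketch below. The gene names, the allele-frequency ceiling, the accepted impact categories and the dictionary layout of a variant are hypothetical; the disclosure names the criteria but not their values.

```python
from typing import Dict, FrozenSet, List, Set

Variant = Dict[str, object]


def filter_variants(
    variants: List[Variant],
    metabolic_genes: Set[str],
    min_quality: float = 25.0,
    max_allele_frequency: float = 0.01,
    allowed_impacts: FrozenSet[str] = frozenset({"HIGH", "MODERATE"}),
) -> List[Variant]:
    """Apply quality, gene, allele-frequency and functional-effect filters in turn."""
    kept = []
    for v in variants:
        if v["quality"] < min_quality:                  # i. low-quality variants
            continue
        if v["gene"] not in metabolic_genes:            # ii./iii. metabolic genes only
            continue
        if v["allele_frequency"] > max_allele_frequency:  # iv. allele frequency
            continue
        if v["impact"] not in allowed_impacts:          # iv. predicted functional effect
            continue
        kept.append(v)
    return kept


# Hypothetical annotated variants and gene list.
metabolic_genes = {"GENE_A", "GENE_B"}
variants = [
    {"id": "v1", "gene": "GENE_A", "quality": 50, "allele_frequency": 0.005, "impact": "HIGH"},
    {"id": "v2", "gene": "GENE_A", "quality": 10, "allele_frequency": 0.005, "impact": "HIGH"},
    {"id": "v3", "gene": "GENE_X", "quality": 60, "allele_frequency": 0.001, "impact": "HIGH"},
    {"id": "v4", "gene": "GENE_B", "quality": 40, "allele_frequency": 0.2, "impact": "MODERATE"},
    {"id": "v5", "gene": "GENE_B", "quality": 40, "allele_frequency": 0.001, "impact": "LOW"},
]
kept_ids = [v["id"] for v in filter_variants(variants, metabolic_genes)]
```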
- OMIM (Online Mendelian Inheritance in Man)
- VCFtools is a set of command-line tools designed to work with VCF (Variant Call Format) files
- GATK Variant Call Format
- ANNOVAR is a widely used bioinformatics tool for the annotation of genetic variants in VCF files.
- Variant annotation involves adding functional and contextual information to genetic variants identified from genomic data.
- In sub-step g2), the method that generates the pre-processed genomic DNA data encodes mutations using the One Hot Encoder technique for the sequence variants of the genomic DNA data filtered in sub-step f2) through the sub-steps of: i. reading the sequence variants from the genomic DNA data filtered in sub-step f2), for example, using a library like pandas to read a VCF file and load the genetic variants; ii. performing relevant variable extraction, for example, extracting the relevant columns containing the genetic variant information (chromosome, position, reference and alternative alleles); iii. encoding categorical variables (variants) without a notion of closeness with One Hot Encoder.
- In sub-step iii, the One Hot Encoder technique is applied to categorical variables whose categories have no notion of closeness. This technique consists of converting each category of the variable into a separate variable, with values of 1 and 0 that indicate the presence or absence of the category in the respective sample.
- the Ordinal Encoder technique was used, which consists of assigning each unique category of the categorical variable a unique integer based on its natural order or hierarchy.
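- The two encodings can be contrasted with the plain-Python sketch below. The variant labels and the ordered impact categories are hypothetical examples; the disclosure does not state which variables were encoded ordinally.

```python
from typing import List, Sequence, Tuple


def one_hot_encode(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Turn each category into its own 0/1 column, one row per sample."""
    categories = sorted(set(values))
    matrix = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, matrix


def ordinal_encode(values: Sequence[str], order: Sequence[str]) -> List[int]:
    """Replace each category by its rank in a meaningful, caller-supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


# Hypothetical variant observed in three samples (no natural order between alleles).
variant_per_sample = ["chr1:100:A>G", "chr1:100:A>T", "chr1:100:A>G"]
categories, one_hot = one_hot_encode(variant_per_sample)

# Hypothetical ordered variable, where the categories do have a hierarchy.
impact_per_sample = ["HIGH", "LOW", "MODERATE"]
ordinal = ordinal_encode(impact_per_sample, order=["LOW", "MODERATE", "HIGH"])
```

- One-hot columns avoid implying that one allele is "greater" than another, whereas the ordinal codes deliberately preserve the hierarchy of the categories.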
- In the method that generates the preprocessed genomic DNA data, an additional step of reducing the dimension of the data is performed after sub-step g2).
- The step of reducing the dimension of the data is performed using Pearson correlations: variables that are directly proportional to one another (correlation equal to 1) are identified, and from among them the variable is selected that, due to its importance or relationship with the study, is associated with related genes, in the case of the preprocessed genomic DNA data after step g2), or with microbiota-related species or taxa associated with decreased metabolic risk.
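- A correlation-based reduction of this kind could be sketched as follows. The interpretation is an assumption: among variables whose Pearson correlation is (numerically) 1, only the one listed first in a caller-supplied priority order is kept. The column values and the priority order are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_redundant_features(
    features: Dict[str, Sequence[float]],
    priority: List[str],
    threshold: float = 0.999999,
) -> List[str]:
    """Walk the features in priority order, skipping any that duplicates a kept one."""
    kept: List[str] = []
    for name in priority:
        if all(pearson(features[name], features[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical encoded variables; "b" is exactly twice "a".
features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
kept = drop_redundant_features(features, priority=["a", "b", "c"])
```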
- Relevant variables are defined through two considerations. The first is that the variables correspond to each of the variants in the case of genomic DNA data and each of the taxa or OTUs (Operational Taxonomic Units); and the second consideration is that the analysis of the relevance of the variables is carried out by calculating the importance and relative importance. In an example, this can be done with a Python function such as "feature importances," which allows calculating the decrease in data impurity (i.e., the probability of misclassifying a randomly chosen element in a set), thus selecting the variables that contribute most to the discrimination of the groups without the presence of misclassified values.
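- The decrease in impurity mentioned above can be computed by hand for a single split, as sketched below using the Gini impurity. This is a simplified stand-in for the importance scores of tree-based learners (for example, the `feature_importances_` attribute of scikit-learn tree ensembles, named here as an assumption about the function the text refers to). The binary variables and labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Probability of misclassifying a random element labelled by class frequency."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Gini decrease obtained by splitting the samples on a binary (0/1) feature."""
    left = [l for f, l in zip(feature, labels) if f == 0]
    right = [l for f, l in zip(feature, labels) if f == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def rank_features(
    features: Dict[str, Sequence[int]], labels: Sequence[int]
) -> List[Tuple[str, float]]:
    """Return (name, relative importance) pairs, most important first."""
    scores = {name: impurity_decrease(col, labels) for name, col in features.items()}
    total = sum(scores.values())
    relative = {name: (s / total if total else 0.0) for name, s in scores.items()}
    return sorted(relative.items(), key=lambda item: item[1], reverse=True)


# Hypothetical one-hot encoded variants and group labels for six subjects.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "variant_1": [1, 1, 1, 0, 0, 0],  # separates the groups perfectly
    "variant_2": [1, 0, 1, 0, 1, 0],  # only weakly informative
}
ranking = rank_features(features, labels)
```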
- the pre-processed microbiota DNA data is generated by a method comprising the sub-steps: a3) assessing the quality of the microbiota DNA data from step a); b3) processing by the operations of trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the microbiota DNA data assessed in sub-step a3); c3) identifying and removing cross-contamination sequences originating from human from the microbiota DNA data processed in sub-step b3); d3) identifying gene functions and categorizing the genomic sequences, separating the genomic sequences into sets, finding the taxonomic composition of the microbial community, from the microbiota DNA data after removing cross-contamination sequences in sub-step c3); e3) trim adapters and unwanted sequences from sequencing reads the microbiota DNA data after the application of sub-steps:
- l3) store the operational taxonomic units obtained in step k3) in a database; m3) filter, using operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, the operational taxonomic units stored in step j3); n3) repeat steps a3) to j3) after a time T2 has elapsed and store in a database the operational taxonomic units obtained in step n after time T2 has elapsed; o3) filter, using operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, the operational taxonomic units stored in stage k3); p3) compare the data stored in stage k3) with the data stored in stage n3) after a time T2.
- the method that generates the pre-processed microbiota DNA data in sub-step a3) evaluates the quality of the sequencing data of the microbiota DNA data from step a); evaluating the accuracy and reliability of the information obtained through DNA sequencing techniques, the quality is evaluated by various parameters and metrics that provide information on the reliability of the generated sequence reads.
- the quality cut-off threshold is PHRED 30, with a read length of at least 80% of the read size, a per-base GC content with a defined normal distribution, and a per-sequence %GC value that also follows a normal distribution; adapter content and over-represented sequences at the read ends are removed completely.
- sequence reads are submitted, for example, to a cleaning tool (such as Trimmomatic), which normalizes the values if possible. If not, the sequence reads are removed from the study. If the quality criterion is not met, for example, in at least 80% of the sequences, the entire set is rejected.
- in the method that generates the pre-processed microbiota DNA data, an additional step of providing a consumable to a subject is performed between sub-step m3) and sub-step n3), where the consumable is selected, for example, from the group of antioxidants, probiotics, prebiotics, postbiotics, vitamin supplements, minerals, fibers, natural extracts, immunonutrients, amino acids, vegetable proteins and animal proteins, which, alone or in formulas combining them with each other and with other compounds, can modulate metabolic pathways associated with the function of the microbiome or clinical markers involved in some pathology or disease; this modulation can be evaluated by means of the disease risk prediction data obtained after time T2 has elapsed.
- the method that generates the pre-processed microbiota DNA data in sub-step n3) and clinical data derived from blood biochemistry analysis can generate information to define metabolic profiles and predictive disease risk profiles, which can be used to define recommendations that improve health such as: nutritional recommendations, diet, exercise, supplementation, portion management and distribution of macronutrients on the plate, hydration, lifestyle changes, food identification according to the season and geographic location, follow-up through monitoring of glycemia and ketone bodies or any other variable that can modulate the risk of disease, health status and nutritional status.
- an intervention can be carried out that modifies habits, lifestyle or diet, which can generate measurable changes through microbiota or clinical data, to a subject, with the possibility of generating recommendations that improve their health and reduce the risk of disease.
- in the method generating the preprocessed microbiota DNA data, sub-step a3), assessing the sequencing data quality of the microbiota DNA data from step a), is performed by the following sub-steps: vi. optionally, general statistics on the reads can be generated after applying the different preprocessing steps, providing an overview of the quality and quantity of the remaining data, removing low-quality bases at the beginning and end of each read, with PHRED 30 as the threshold and PHRED 25 as a minimum value; vii. performing sliding-window trimming to eliminate low-quality regions, for example requiring every window of 3 bases to have an average PHRED value of 30, or a minimum of PHRED 25; viii.
- Microbiota DNA data quality assessment can be performed using tools such as FastQC, MetaWRAP and MultiQC. Aspects such as base quality, presence of adapters and length distribution are analysed, and detailed reports are optionally generated for each sample.
- in the method that generates pre-processed microbiota DNA data, sub-step b3), processing by the operations of trimming, cleaning, filtering, and removing unwanted sequences from the sequencing data of the microbiota DNA data evaluated in sub-step a3), is performed by the following sub-steps: i. reading the microbiota DNA data evaluated in sub-step a3); ii. removing low-quality bases (PHRED less than 30) at the beginning and end of each read; iii. performing sliding-window trimming to remove low-quality regions, performing quality filtering operations, removing specific adapters, etc., according to the project requirements.
- low-quality bases are nucleotides whose base calls have a low probability of being correct.
- Each base in a sequenced DNA sequence is represented by a letter (A, C, G, or T), and each base is assigned a quality score.
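- the quality of a base is usually expressed as a PHRED value, which is a negative logarithmic transform of the probability that the base call is wrong (Q = -10·log10 P).
- a PHRED score of 20 means there is a 1 in 100 chance that the base call is incorrect; a score of 30 means a 1 in 1,000 chance.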
- Sequence quality is determined according to the parameters described here. These bases are removed to improve the overall quality of the sequence and to prevent the inclusion of erroneous information in subsequent analysis.
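The relationship between PHRED scores and error probability, and the sliding-window trimming mentioned in the sub-steps above, can be sketched as follows. This is a simplified illustration and not the algorithm of any specific tool: the PHRED+33 quality encoding, the function names and the rule "cut the read at the first window whose mean quality falls below the threshold" are assumptions, while the window of 3 bases and the threshold of 30 follow the example values given in the text.

```python
def phred_to_error_prob(q):
    """Error probability for a PHRED score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the PHRED+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq, quals, window=3, min_mean=30):
    """Cut the read at the first window whose mean quality is below min_mean."""
    for start in range(0, len(seq) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read: 'I' encodes 40, '5' encodes 20, '+' encodes 10, '#' encodes 2.
quals = decode_phred33("IIIIII5+#")
trimmed_seq, trimmed_quals = sliding_window_trim("ACGTACGTA", quals)

print(phred_to_error_prob(20))  # probability of a wrong call at PHRED 20
print(trimmed_seq)              # read after trimming the low-quality tail
```

In this toy read the window starting at the sixth base has a mean quality of about 23.3, so the read is cut there and only the first five bases are kept.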
- the operations of trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the microbiota DNA data evaluated in sub-step a3) can be performed using tools such as Trimmomatic, which is a tool for performing multiple preprocessing operations on sequencing data to improve the quality and usefulness of the reads.
- the method that generates pre-processed microbiota DNA data in sub-step c3), identifying and removing cross-contamination sequences originating from humans from the microbiota DNA data processed in sub-step b3) is carried out by means of the following sub-steps: i. having a reference database to identify and filter contaminating sequences; ii. indexing the reference database; iii. searching for matches between the sequences of the microbiota DNA data processed in sub-step b3) and the indexed reference database; iv. identifying those sequences that correspond to the human genome, considering them as cross-contamination; v. Sequences considered cross-contaminated are filtered from the microbiota dataset and saved in an output file; vi. Generate an output file containing only the sequences that were not identified as originating from the human genome, according to the reference database.
- the result of this sub-stage is a filtered dataset free of unwanted sequences from cross-contamination.
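The index, search and filter sequence of this sub-step can be illustrated with a toy k-mer index. This is only a sketch of the idea, not how BMTagger or any other tool is implemented: the function names, the k-mer size, the 50% shared-k-mer cut-off and the short stand-in "reference" sequence are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference_seqs, k):
    """Index the reference database as a set of k-mers."""
    index = set()
    for ref in reference_seqs:
        index |= kmers(ref, k)
    return index


def split_reads(reads, index, k, min_fraction=0.5):
    """Separate reads that match the reference (contaminants) from the rest."""
    kept, removed = [], []
    for read in reads:
        read_kmers = kmers(read, k)
        shared = len(read_kmers & index) / len(read_kmers) if read_kmers else 0.0
        (removed if shared >= min_fraction else kept).append(read)
    return kept, removed


human_reference = ["ACGTACGTTAGCCGATAGCTAGGCTA"]   # toy stand-in for a human reference
reads = ["ACGTTAGCCGATAG", "TTTTGGGGCCCCAAAA"]     # first read is a substring of it

index = build_index(human_reference, k=8)
kept, removed = split_reads(reads, index, k=8)
print("kept:", kept)
print("removed as human:", removed)
```

The `kept` list plays the role of the output file containing only the sequences that were not identified as originating from the human genome.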
- identifying and removing cross-contamination sequences originating from human sources from the microbiota DNA data processed in sub-step b3) can be done using tools such as BMTagger “Biome-specific Metagenome Tagger” which is a tool designed to address the problem of cross-contamination in metagenome sequencing data, understood as data sets originating from microbial communities.
- Cross-contamination can occur when sequences from unwanted organisms are introduced during the sequencing process.
- in the method that generates pre-processed microbiota DNA data, sub-step d3), identifying gene functions and categorizing the genomic sequences, separating the genomic sequences into sets, and finding the taxonomic composition of the microbial community from the microbiota DNA data after the elimination of cross-contamination sequences in sub-step c3), is performed by the following sub-steps: i. performing assembly of the microbial genomes from the microbiota DNA data after the elimination of cross-contamination sequences in sub-step c3) using a genome sequence assembly tool such as MEGAHIT, which is a tool for rapid and efficient genome assembly from next-generation sequencing (NGS) data; ii.
- NGS next-generation sequencing
- BIOM Biological Observation Matrix
- steps can be carried out such as, for example, the identification of metabolic pathways, the comparison of samples, and the determination of OTUs that are found differentially, that is, those that present statistically significant differences.
- identifying and removing cross-contamination sequences originating from human sources from the microbiota DNA data processed in sub-step b3) can be done using tools such as MetaWRAP that facilitate the processing, pre-processing and analysis of metagenome sequencing data.
- in the method that generates pre-processed microbiota DNA data, sub-step e3), trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data after applying sub-step d3), is performed by the following sub-steps: i. trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data after applying sub-step d3), using the sequences of the reference human genome HG38 and the adapters that were used during sequencing; ii. optionally verifying that the adapters have been correctly trimmed and assessing the quality of the reads after trimming, for example, using tools such as FastQC;
- trimming adapters and unwanted sequences from sequencing reads of microbiota DNA data after applying sub-step d3) can be done using tools such as Cutadapt, which allows trimming adapters and unwanted sequences from DNA or RNA sequencing reads.
- the method generating pre-processed microbiota DNA data in sub-step f3) for assigning potential biological functions to the microbiota DNA data sequences subsequent to applying sub-step e3) is performed by the following sub-steps: i. having a specific database to carry out the function assignment according to the KEGG database using the protein sequences inferred from the structural annotation of the assembly; ii. comparing the microbiota DNA data sequences subsequent to applying sub-step e3) with the sequences in the databases; iii. Assign potential biological functions according to the KEGG database based on the comparison from the previous substage. iv. Save the assignment results from substage iii;
- the results of the comparison in sub-step ii may include information on the genetic functions present, the active metabolic pathways, and the relative abundance of different organisms in the microbial community.
- potential biological functions refer to the activities or roles that the identified genes and sequences can perform in the microbial community. These functions can encompass a variety of biological activities, including metabolic processes, specific cellular functions, and the production of certain chemicals.
- HUMAnN3 HMP Unified Metabolic Analysis Network 3
- in the method that generates pre-processed microbiota DNA data, sub-step g3), classifying the microbiota DNA data sequences subsequent to the application of sub-step e3), is carried out by means of the following sub-steps: i. having a database that contains the reference genomic sequences of microorganisms; ii. dividing the microbiota DNA data sequences subsequent to the application of sub-step e3) into short fragments (k-mers); iii. comparing the k-mers from sub-step ii with the k-mers present in the database of sub-step i; iv. assigning each sequence the taxon it most resembles, based on the k-mer comparison of sub-step iii;
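The k-mer comparison in sub-steps i to iv can be sketched with a simple voting scheme. This is a deliberately reduced illustration: real classifiers use far more elaborate indexing and tie-breaking, and the taxon names, toy genomes, k-mer size and majority-vote rule below are assumptions rather than details from the disclosure.

```python
from collections import Counter


def kmers(seq, k):
    """Return all substrings of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_db(reference_genomes, k):
    """Sub-step i: map every reference k-mer to the taxa that contain it."""
    db = {}
    for taxon, genome in reference_genomes.items():
        for km in kmers(genome, k):
            db.setdefault(km, set()).add(taxon)
    return db


def classify(read, db, k):
    """Sub-steps ii to iv: split the read into k-mers, look them up, vote."""
    votes = Counter()
    for km in kmers(read, k):
        for taxon in db.get(km, ()):
            votes[taxon] += 1
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


# Hypothetical reference genomes, far shorter than real ones.
reference = {
    "Taxon_A": "ATGGCGTACGTTAGCATCGA",
    "Taxon_B": "TTGACCGGTAAACCGGTTCA",
}
db = build_kmer_db(reference, k=6)

for read in ["GCGTACGTTAGC", "CCGGTAAACCGG", "AAAAAAAAAAAA"]:
    print(read, "->", classify(read, db, k=6))
```

The first read shares all its 6-mers with `Taxon_A`, the second with `Taxon_B`, and the third matches nothing and is left unclassified.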
- classifying the microbiota DNA data sequences after the application of sub-stage e3) can be done using tools such as Kraken 2, which classifies DNA sequences from metagenomes, i.e., genomic data sets of microbial communities, according to their taxonomic origin.
- the classified taxonomic data of the microbiota DNA sequence data after the application of sub-step e3) can be visualized and graphed by using tools such as KRONA, allowing the visualization to be explored and manipulated to obtain details about specific taxonomic groups.
- the method that generates pre-processed microbiota DNA data in sub-step h3) to identify associations between microbiota variables and metadata in the microbiota DNA data sequences after applying sub-step g3) is performed by the following sub-steps: i. reading microbiota DNA data pre-processed in sub-step h3) specifying the tables of taxon abundances and metadata; ii. defining the linear model to be fitted and specifying the independent variables (microbiota taxa) and the dependent variable (metadata); iii. performing statistical tests to evaluate the association between microbiota taxa and metadata variables; iv.
- the linear model defined in sub-stage ii can be selected from the group comprising Simple Linear Model: to explore the association between a microbiota variable and a metadata at a time; Multiple Linear Model: if you have multiple metadata that could influence microbiota abundances; Interaction Model: optionally, interactions between variables can be incorporated into the linear model to consider the effect of one variable depending on the value of another.
- the linear model is fitted with a correction for multiple tests (for example, a Bonferroni adjustment) applied across the repetitions of sub-steps i to iv, in order to control type I errors.
- identifying associations between microbiota variables and metadata in the microbiota DNA data sequences after the application of sub-stage g3) can be done using tools such as, for example, MaAsLin2 (Multivariate Association with Linear Models 2), which allows performing multivariate association analysis.
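A simple linear model per taxon, followed by a Bonferroni adjustment, can be sketched in Python as follows. This is not MaAsLin2, which is an R package with its own normalization and modelling options; it only illustrates the "one taxon, one metadata variable at a time" case with `scipy.stats.linregress`. The synthetic abundances, the taxon names and the 0.05 significance level are assumptions.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
n = 60

# Hypothetical metadata variable, e.g. a standardized clinical score.
metadata = rng.normal(size=n)

# Hypothetical taxon abundances: taxon_A is built to track the metadata.
abundances = {
    "taxon_A": 2.0 * metadata + rng.normal(scale=0.5, size=n),
    "taxon_B": rng.normal(size=n),
    "taxon_C": rng.normal(size=n),
}

alpha = 0.05
n_tests = len(abundances)
results = {}
for taxon, abundance in abundances.items():
    # Independent variable: taxon abundance; dependent variable: metadata.
    fit = linregress(abundance, metadata)
    p_adjusted = min(1.0, fit.pvalue * n_tests)   # Bonferroni adjustment
    results[taxon] = (fit.slope, fit.pvalue, p_adjusted)

significant = [t for t, (_, _, p_adj) in results.items() if p_adj < alpha]
print(significant)
```

Multiplying each p-value by the number of tests is the most conservative of the common corrections; it keeps the family-wise type I error rate at or below `alpha`.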
- MaAsLin2 Multivariate Association with Linear Models 2
- in the method that generates pre-processed microbiota DNA data, sub-step i3), filtering by taxonomic and functional statistical criteria the microbiota DNA data sequences after applying sub-step h3), is performed by the following sub-steps: i. performing statistical filtering; o Filtering by abundance: eliminates taxa or functions with a low abundance. o Filtering by statistical significance: eliminates results that are not statistically significant. ii. performing taxonomic filtering; o Identifying and eliminating taxa that could be contaminants. iii. performing filtering by taxonomic levels; o Defining the taxonomic levels of interest and filtering out sequences that do not meet those criteria. iv. performing functional filtering; o Eliminating irrelevant functions. v. performing filtering by functional levels; vi. performing literature-based filtering; o Eliminating taxa or functions that are not supported by the relevant literature.
- steps can be added to apply cross-validation techniques to ensure that the filtering criteria are unbiased and applicable to different data sets.
- exploratory data review steps can be added before and after filtering to assess the impact of filtering decisions.
- filtering by taxonomic and functional statistical criteria the microbiota DNA data sequences after applying sub-step h3) can be done using statistical tools such as R, or Python with libraries such as pandas or scipy.
- in the method that generates pre-processed microbiota DNA data, sub-step j3), applying the normalization, correlation and dimension reduction processes to the microbiota DNA data after applying sub-step i3), is performed by the following sub-steps: vii. applying normalization to the microbiota DNA data after applying sub-step i3) using techniques such as z-score normalization or min-max scaling, for example with libraries such as scikit-learn in Python; viii. calculating the correlation matrix between variables; ix. applying dimension reduction techniques, for example Principal Component Analysis (PCA), which is a commonly used technique and can be implemented, for example, with libraries such as scikit-learn.
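The three sub-steps above (normalization, correlation matrix, dimension reduction) can be sketched with pandas and scikit-learn. The abundance table is synthetic, the taxon names are hypothetical, and the choice of two principal components is an assumption made only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)

# Hypothetical abundance table: 30 samples x 4 taxa.
columns = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
abundance = pd.DataFrame(
    rng.poisson(lam=[50, 5, 200, 20], size=(30, 4)).astype(float),
    columns=columns,
)

# Normalization: min-max scaling maps each taxon to the range 0 to 1,
# z-score normalization gives each taxon zero mean and unit variance.
minmax = pd.DataFrame(MinMaxScaler().fit_transform(abundance), columns=columns)
zscore = pd.DataFrame(StandardScaler().fit_transform(abundance), columns=columns)

# Correlation matrix between variables (Pearson).
corr = minmax.corr(method="pearson")

# Dimension reduction with PCA on the z-scored data.
pca = PCA(n_components=2)
components = pca.fit_transform(zscore)

print(corr.round(2))
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Z-scored data is used for PCA here because PCA is sensitive to the scale of each variable; min-max scaled data could be used instead if a bounded range is preferred downstream.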
- PCA Principal Component Analysis
- after sub-stage j3) of the method that generates pre-processed microbiota DNA data, an additional step of reducing the dimension of the data is performed. In one example, this step is performed by means of Pearson correlations: among variables that are directly proportional (correlation equal to 1), one variable is kept as representative because of its importance or relationship with the study, for example related genes in the case of the microbiota DNA data after the application of sub-stage j3), and microbiota species or taxa associated with a decrease in metabolic risk.
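The Pearson-based reduction can be sketched with pandas as below. Two simplifications are assumptions of this sketch: a tolerance of 0.999 stands in for "correlation equal to 1" to absorb floating-point error, and the first variable encountered in each correlated group is kept as the representative, whereas the text chooses the representative by its importance or relationship with the study. The function name and the data are hypothetical.

```python
import pandas as pd


def drop_perfectly_correlated(df, threshold=0.999):
    """Keep one representative from each group of directly proportional variables."""
    corr = df.corr(method="pearson")
    keep = []
    for col in df.columns:
        # Keep the column only if it is not (near-)perfectly correlated
        # with any column that has already been kept.
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return df[keep]


data = pd.DataFrame({
    "taxon_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "taxon_B": [2.0, 4.0, 6.0, 8.0, 10.0],   # exactly 2 x taxon_A
    "taxon_C": [5.0, 3.0, 4.0, 1.0, 2.0],
})

reduced = drop_perfectly_correlated(data)
print(list(reduced.columns))
```

Here `taxon_B` is dropped because it is directly proportional to `taxon_A`, while `taxon_C` (correlation of -0.8 with `taxon_A`) is retained.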
- Relevant variables are defined through two considerations. The first is that the variables correspond to each of the variants in the case of microbiota DNA data and each of the taxa or OTUs (Operational Taxonomic Units); and the second consideration is that the analysis of the relevance of the variables is carried out by calculating the importance and relative importance. In an example, this can be done with a Python function such as "feature importances," which allows calculating the decrease in data impurity (the probability of misclassifying a randomly chosen element in a set), thus selecting the variables that contribute most to the discrimination of the groups without the presence of misclassified values.
- the normalization of the data serves to reduce the influence of atypical data, which arise from using a population of individuals whose controlled variables may not be sufficient for the individuals to respond similarly, so that the variables subsequently have the same scale and weight in the machine learning methods.
- normalization is performed by applying the min-max normalization technique, which adjusts the values of each variable within a range of 0 to 1.
- applying the normalization, correlation and dimension reduction processes to the microbiota DNA data after applying sub-step i3) can be done, for example, in a programming environment such as Python (using libraries such as Pandas for data manipulation).
- in the method generating pre-processed microbiota DNA data, sub-step k3), clustering the microbiota DNA data subsequent to applying sub-step i3) into operational taxonomic units by a predefined threshold greater than 80% gene sequence similarity, is performed by the following sub-steps: i. defining a gene sequence similarity threshold that will determine when two sequences will cluster into the same OTU. This threshold could be, for example, X% to Y% similarity; ii. clustering the sequences into OTUs based on the defined similarity threshold, for example using a clustering algorithm such as hierarchical clustering, the single or complete linkage method, or k-means clustering; iii. assigning taxonomic information to each OTU. This can be done using reference databases containing information on the taxonomy of different gene sequences, for example, using tools such as Kraken2 and Kaiju;
- clustering the microbiota DNA data after applying sub-step i3) into operational taxonomic units using a predefined threshold greater than 80% gene sequence similarity can be done using tools such as, for example, BLAST (Basic Local Alignment Search Tool) or other gene sequence alignment tools.
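Threshold-based grouping into OTUs can be illustrated with a greedy centroid scheme. Note that this is a swapped-in technique chosen for brevity, not the hierarchical or k-means clustering named in the text, and it does not use BLAST. It also assumes equal-length, already aligned sequences so that identity can be computed position by position. The function names and toy sequences are hypothetical; the "greater than 80%" threshold comes from the text.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(sequences, threshold=0.8):
    """Greedy clustering: join the first OTU whose centroid is similar enough."""
    otus = []  # list of (centroid, members)
    for seq in sequences:
        for centroid, members in otus:
            if identity(seq, centroid) > threshold:
                members.append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            otus.append((seq, [seq]))
    return otus


sequences = [
    "ACGTACGTAC",
    "ACGTACGTAA",   # 90% identical to the first sequence
    "TTTTGGGGCC",   # 30% identical to the first sequence
]
otus = cluster_otus(sequences)
for number, (centroid, members) in enumerate(otus, start=1):
    print(f"OTU_{number}: centroid={centroid}, size={len(members)}")
```

The result of a greedy scheme depends on input order, which is one reason production pipelines typically sort sequences by abundance or length before clustering.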
- BLAST Basic Local Alignment Search Tool
- the method that generates pre-processed microbiota DNA data in sub-step m3), storing the operational taxonomic units obtained in step k3) in a database is performed by the following sub-steps:
- in the method that generates pre-processed microbiota DNA data, sub-step n3), filtering, by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, the operational taxonomic units stored in step l3), is carried out by means of the following sub-steps: v. having a taxonomic database that assigns taxonomic identifications to the OTUs; vi. assigning taxonomic identifications to the OTUs using the database provided in the previous step; vii. filtering by criteria; viii. creating a metadata table with information on the presence or absence of the disease for each sample; ix.
- validation steps can be added, such as validating the results using cross-validation techniques, among other techniques.
- a visualization stage can also be generated, such as bar charts or heatmaps, etc., to visually represent the associations between the filtered OTUs and the disease.
- filtering, by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, the operational taxonomic units stored in step l3) can be done using tools such as Kraken 2 or Kaiju to preprocess sequencing data and assign taxonomic identifications to OTUs using a database that includes eukaryotic and prokaryotic microorganisms.
- Kaiju is a bioinformatics program used for the taxonomic classification of DNA or protein sequences.
- in the method that generates pre-processed microbiota DNA data, sub-step o3), repeating steps a3) to k3) after a time T2 has elapsed and storing in a database the operational taxonomic units obtained in step n after time T2 has elapsed, is carried out by means of the following sub-steps:
- in the method that generates pre-processed microbiota DNA data, sub-step p3), filtering, by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, the operational taxonomic units stored in step l3), is carried out by the following sub-steps: xii. having a taxonomic database that assigns taxonomic identifications to the OTUs; xiii. assigning taxonomic identifications to the OTUs using the database provided in the previous step; xiv. filtering by criteria such as a minimum absolute abundance of 100; xv. creating a metadata table with information on the presence or absence of the disease for each sample; xvi.
- xvii Define a model that relates the abundances of the filtered OTUs to the presence or absence of disease.
- xviii Select results based on statistical analysis to identify OTUs that are significantly associated with disease presence.
- statistical models are used, specifically logistic regression models in the context of binary data (e.g., presence/absence of disease). For example, clinical data appendixes.
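The abundance filter and the logistic regression model relating OTU abundances to disease status can be sketched with scikit-learn. Several choices here are assumptions: the "minimum absolute abundance of 100" is interpreted as the total count of an OTU across all samples, abundances are log-transformed before fitting, and the OTU table and labels are synthetic. scikit-learn's `LogisticRegression` reports coefficients but not p-values, so the significance-based selection described in the text would need an additional statistical package.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 80

# Hypothetical metadata: presence (1) or absence (0) of the disease per sample.
disease = np.array([0, 1] * (n // 2))

# Hypothetical OTU count table.
otu_table = pd.DataFrame({
    "OTU_1": rng.poisson(20, n) + 30 * disease,   # enriched in disease samples
    "OTU_2": rng.poisson(25, n),                  # unrelated to disease
    "OTU_3": rng.poisson(0.5, n),                 # rare OTU
})

# Filter: keep OTUs whose total count across samples is at least 100 (assumed reading).
min_total = 100
filtered = otu_table.loc[:, otu_table.sum(axis=0) >= min_total]

# Model: logistic regression on log-transformed abundances.
X = np.log1p(filtered)
model = LogisticRegression(max_iter=1000).fit(X, disease)
coefficients = dict(zip(filtered.columns, model.coef_[0]))

print(list(filtered.columns))
print(coefficients)
```

A positive coefficient indicates that higher abundance of the OTU is associated with higher odds of disease in this toy data set.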
- validation steps can be added, such as validating the results using cross-validation techniques, among other techniques.
- a visualization stage can also be generated, such as bar charts or heatmaps, etc., to visually represent the associations between the filtered OTUs and the disease.
- filtering, by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, the operational taxonomic units stored in step l3) can be done using tools such as Kraken2 and Kaiju to preprocess sequencing data and assign taxonomic identifications to OTUs using a taxonomic database; the databases for eukaryotic and prokaryotic microorganisms are taken from NCBI nr.
- in the method that generates pre-processed microbiota DNA data, sub-step p3) compares the data stored in step l3) with the data stored in step o3) after a time T2 has elapsed.
- the clinical data of the present disclosure in step a) of the computer-implemented method for obtaining disease risk prediction data may represent a wide range of measures, from metabolic indicators such as glucose and insulin to biomarkers such as CRP and levels of various vitamins and minerals. Their inclusion in a clinical analysis could provide a comprehensive view of the metabolic, nutritional, and endocrine health of the individuals under study.
- the clinical data of the present disclosure in step a) of the computer-implemented method for obtaining disease risk prediction data are selected from the group comprising age, Basal Glucose, Calcium, Iron, CRP, AP01, AP02, APOB, Thiamine, Riboflavin, Niacin, Pyridoxine, Ac Asc, Potassium, Chlorine, Zinc, TNFa, TSH, T4T, Glucagon, HGH, Serum creat, Non-HDL chol, and i HOMA and combinations of the above.
- pre-processed clinical data are generated by applying regression models for missing data imputation and normalization to the clinical data from step a);
- pre-processed clinical data is generated by a method comprising the sub-steps: i. reading the clinical data from step a); ii. Examine the distribution of your clinical variables to assess normality, for example, using statistical tests such as the Shapiro-Wilk normality test; iii. If the clinical data do not follow a normal distribution, bring them closer to normal, for example, by applying transformations such as logarithm, square root, replacement of outliers by the mean, among others; iv. Perform missing data imputation; v.
- Missing data imputation is applied to replace missing values in the clinical dataset from step a) using techniques such as mean, median, regression imputation, or more advanced methods such as MICE (Multiple Imputation by Chained Equations), among others.
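The clinical preprocessing sub-steps (normality check, transformation, imputation) can be sketched with scipy and pandas. The variable names, the synthetic values, the 0.05 cut-off for the Shapiro-Wilk test, the choice of a logarithmic transformation and the use of median imputation are illustrative assumptions; the text lists these only as options among others.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(3)

# Hypothetical clinical table with two variables and two missing values.
clinical = pd.DataFrame({
    "age": rng.normal(45, 10, 50),
    "basal_glucose": rng.lognormal(mean=4.5, sigma=0.6, size=50),
})
clinical.loc[[4, 17], "basal_glucose"] = np.nan

processed = clinical.copy()
for column in processed.columns:
    observed = processed[column].dropna()

    # Shapiro-Wilk test: a small p-value suggests departure from normality.
    _, p_value = shapiro(observed)

    # Log-transform only strictly positive variables that fail the test.
    if p_value < 0.05 and (observed > 0).all():
        processed[column] = np.log(processed[column])

    # Simple imputation: replace missing values with the column median.
    processed[column] = processed[column].fillna(processed[column].median())

print(processed.isna().sum())
```

For regression-based or chained-equation imputation, the median step would be replaced by a model fitted on the remaining clinical variables.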
- PCA principal component analysis
- MICE Multiple Imputation by Chained Equations
- Preprocessed clinical data can be obtained using tools available in Python such as scipy.stats, statsmodels, pandas, scikit-learn, fancyimpute and numpy.
- the plurality of supervised learning methods are selected from the group comprising: classifier using Ridge regression with cross-validation for classification tasks (RidgeClassifierCV), classifier based on Ridge regression (RidgeClassifier), Perceptron (Perceptron), ensemble algorithm combining multiple weak classifiers to improve accuracy (AdaBoostClassifier), classifier using LightGBM, with gradient boosting (LGBMClassifier), Naive Bayes classifier (BernoulliNB), stochastic gradient descent classifier (SGDClassifier), Bagging technique classifier (BaggingClassifier), random tree-based classifier (ExtraTreeClassifier), random decision tree collection classifier (ExtraTreesClassifier), XGBoost-based gradient boosting classifier (XGBClassifier), Support vector machine (SVC), a classifier that calibrates the probabilities of the underlying classifiers using cross-validation (Calib)
- step e) of the computer-implemented method of obtaining disease risk prediction data of the present disclosure, selecting the supervised learning method with the highest metric from the plurality of supervised learning methods of step d) to obtain disease risk prediction data, comprises the sub-steps of: a) importing and loading the labeled training data set, taken from the union of the preprocessed genomic DNA data with the preprocessed microbiota DNA data and with the preprocessed clinical data of step b) of the computer-implemented method of obtaining disease risk prediction data of the present disclosure.
- in selecting the supervised learning method with the highest metric from the plurality of supervised learning methods in step d), sub-step g), initiating a loop for evaluating the plurality of learning methods, comprises the sub-steps of: a. training the first supervised learning method of the plurality of methods using the training data set; b. making predictions using the trained model on the validation data set; c. evaluating the performance metric of the model of the first supervised learning method by calculating the accuracy on the validation set; d. recording the accuracy obtained for the model of the first supervised learning method; e. repeating sub-steps a. to d. for all of the supervised learning methods of the plurality of methods.
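The evaluation loop in sub-steps a. to e. can be sketched with scikit-learn. Only a subset of the listed classifiers is used so that the example depends on scikit-learn alone; the synthetic data set, the 75/25 train/validation split and the dictionary of candidates are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
from sklearn.linear_model import Perceptron, RidgeClassifier, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Synthetic stand-in for the labeled training data set.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "RidgeClassifier": RidgeClassifier(),
    "Perceptron": Perceptron(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "BernoulliNB": BernoulliNB(),
    "SGDClassifier": SGDClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "SVC": SVC(),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                         # a. train
    predictions = model.predict(X_val)                  # b. predict on validation set
    scores[name] = accuracy_score(y_val, predictions)   # c. and d. evaluate and record

best_name = max(scores, key=scores.get)                 # method with the highest metric
print(best_name, scores[best_name])
```

Accuracy is used because it is the metric named in the sub-steps; for imbalanced disease labels, a metric such as balanced accuracy could be substituted in the same loop.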
- the computer-implemented method of obtaining disease risk prediction data of the present disclosure may be implemented on one or more computing systems, wherein the computing system may be implemented at least in part on networks, the cloud, and/or as a machine or set of machines (e.g., computing machine, server, mobile computing device, computer cluster, etc.) configured to receive a computer-readable medium that stores computer-readable instructions and is capable of storing instructions for the computer-implemented method of obtaining disease risk prediction data of the present disclosure.
- the system that performs the method is a supercomputer (ASIMOV) with characteristics such as: 254 TB of storage, InfiniBand connectivity of 8 Gbps, and 624 CPU cores with 4.8 TB of RAM, for a double-precision calculation capacity of 17 TFLOPS.
- ASIMOV supercomputer
- in step b) of the computer-implemented method for obtaining disease risk prediction data of the present disclosure, the system performing the method is the TAYRA supercomputer, which has 532 TB of storage, 52 Gbps InfiniBand connectivity and 1168 CPU cores with 8 TB of RAM; its double-precision computing capacity is calculated at 102.96 TFLOPS, and its GPU nodes contain 59904 available cores.
- the method that generates the pre-processed genomic DNA data in sub-step b2) processes the genomic DNA data evaluated in sub-step a2) by means of the operations of trimming, cleaning, filtering and removing unwanted sequences from the sequencing data.
- the method that generates pre-processed microbiota DNA data in sub-step e3) trims adapters and unwanted sequences from sequencing reads, the microbiota DNA data after applying sub-step d3). obtains the following adapter sequence.
- the method that generates pre-processed microbiota DNA data in sub-step i3) filters by taxonomic and functional statistical criteria and the microbiota DNA data sequences after applying sub-step h3) eliminates taxa or functions from the following table.
- ADABOOSTCLASS Relevant Variables for the best (ADABOOSTCLASS) that obtains a disease risk prediction data where the disease is dyslipidemia from preprocessed human genome DNA data
- X_38286942_A_T; retinitis_pigmentosa_GTPase_regulator
- 19_12847888_G_T; microtubule associated serine/threonine kinase 1
- 4_69200706_T_C; UDP_glucuronosyltransferase_family_2_member_B11
- 5_21751766_C_A; cadherin_12
- LGBMClassifier that obtains a disease risk prediction data where the disease is diabetes from preprocessed genomic DNA data
- TXID Relevant Variables
- LGBMClassifier: relevant variables for the best classifier (LGBMClassifier), which obtains disease risk prediction data where the disease is diabetes, from preprocessed microbiota DNA data
- The identification of metabolic profiles that affect health and disease risk may relate to, but is not limited to, the following: energy metabolism, food allergies, influence of diet on metabolic status, oxidative stress, detoxification, bone metabolism, carbohydrate metabolism, vascular health, cognitive health, behavioral disorders, satiety and appetite pathways, response to exercise, metabolic pathways associated with absorption, monitoring and effectiveness of changes in lifestyle, diet and supplementation, and diseases such as chronic inflammation, atherosclerosis, stroke, multiple sclerosis, Alzheimer's, arthritis, inflammatory bowel disease, Crohn's disease, ulcerative colitis, celiac disease, pernicious anemia, sinusitis, obesity, non-alcoholic fatty liver disease, chronic kidney disease, dyslipidemia and eating disorders, among others.
- Disease risk prediction data Computational data that corresponds to a global measure of a person's risk of developing a disease relative to the general population, based on a genetic, clinical, biological or other type of marker.
- This data can take various forms, such as the nucleotide sequence represented as a string of letters (A, T, G, C), genomic annotations indicating gene structure and location, genetic variants such as single nucleotide polymorphisms (SNPs), and expression data reflecting the relative abundance of messenger RNA under different conditions. It also includes information on epigenetic modifications, sequencing quality, and other relevant attributes.
- This data can be stored in specific formats such as FASTQ, FASTA, VCF, BAM, or expression matrices.
- Microbiota DNA data refers to the genetic information obtained through DNA sequencing of microorganisms present in a biological sample, specifically in the context of the microbiota, for example, from the gut.
- the microbiota is the community of microorganisms, such as bacteria, fungi, viruses, and other microbes, that coexist in a particular environment, such as the gut, skin, mouth, or other sites of the human body or other organisms.
- This data can be stored in specific formats such as FASTQ, FASTA, QIIME, BIOM, SRA, VCF, or BAM.
- Categorical variables These are attributes that classify variants into discrete, non-numerical categories. These categories describe specific characteristics of variants, such as their type (SNP, indel), their functional impact (synonymous, non-synonymous), their location in the genome (exon, intron), and other relevant aspects. These variables allow variants to be organized and characterized in a meaningful way.
- SAM Sequence Alignment/Map
- BAM Binary Alignment/Map
- VCF (Variant Call Format) file A standard file format used to represent information about genetic variants, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and other types of variants, in genomic sequencing data. The VCF format was designed to store and structure detailed information about genetic variants efficiently.
- SNPs single nucleotide polymorphisms
- FASTQ file This is a file format used in bioinformatics to store sequencing data from DNA, RNA, or other types of biological molecules. This format is commonly used to represent reads obtained from next-generation sequencing (NGS) technologies.
- NGS next-generation sequencing
- TAXID An abbreviation for "Taxonomy ID." This term is commonly used in the context of biological and genomic databases, especially in relation to the taxonomic system that organizes and classifies organisms. TAXID is a unique numerical identifier associated with a specific node in the biological taxonomic hierarchy and is used to uniquely identify different organisms in biological databases and resources.
- OTUs are created using clustering techniques for similar genomic sequences, and the degree of similarity required to group sequences into an OTU is set by a predefined threshold. This threshold can vary depending on the study and the technique used.
- Metabolic profiling refers to the comprehensive characterization of the metabolic processes occurring in an organism. These profiles provide detailed information on how nutrients are metabolized, how energy is generated and utilized, and how different biochemical components interact within the body. Metabolic profiles can include data on enzyme activity, metabolite levels, and other biochemical indicators that help understand an individual's metabolic status. These profiles are valuable in medical research, personalized nutrition, and understanding the biological basis of various health conditions, as they offer detailed insight into the underlying biochemical processes in the body.
- Predictive Disease Risk Profiles are assessments that combine clinical, biomedical, and sometimes genetic data to identify and quantify risk factors that may increase the likelihood of developing a specific disease. These profiles seek to predict an individual's risk for specific health conditions, such as heart disease, diabetes, cancer, or other conditions. The information collected may include medical history, lifestyle habits, medical test results, and relevant biological markers. Applying predictive risk profiles allows healthcare professionals to customize preventive and early intervention strategies, thus facilitating a more proactive approach to health and wellness.
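The OTU definition above says that similar sequences are grouped once they reach a predefined similarity threshold. A minimal sketch of that idea follows, assuming a greedy scheme in which each sequence joins the first OTU whose representative it resembles closely enough. The names `identity` and `cluster_otus`, the position-by-position similarity measure and the toy reads are illustrative assumptions, not taken from the patent; real pipelines normally compute identity from an alignment.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence.

    Simplifying assumption: sequences are compared position by position,
    without any alignment step.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_otus(sequences: list[str], threshold: float) -> list[list[str]]:
    """Greedy clustering: a sequence joins the first OTU whose representative
    (its first member) is at least `threshold` similar; otherwise it starts
    a new OTU."""
    otus: list[list[str]] = []
    for seq in sequences:
        for otu in otus:
            if identity(seq, otu[0]) >= threshold:
                otu.append(seq)
                break
        else:
            otus.append([seq])
    return otus


# Toy reads (hypothetical), clustered with an illustrative 80% threshold.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGTAC"]
otus = cluster_otus(reads, threshold=0.8)
for number, members in enumerate(otus, start=1):
    print(f"OTU {number}: {members}")
```

Because the scheme is greedy, the result depends on input order; the threshold is the study-dependent parameter mentioned in the definition.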
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Epidemiology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Primary Health Care (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioethics (AREA)
- Biophysics (AREA)
- Pathology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Nutrition Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Description
METHOD FOR OBTAINING DATA FOR PREDICTING METABOLIC DISEASE RISK
TECHNICAL FIELD
This disclosure relates to systems, methods, and computer-readable media with instructions for analysis, risk prediction, decision-making, and/or disease monitoring, and for identifying markers, patterns, and relationships among relevant biological data. In particular, it relates to computer-implemented methods and systems that process data using machine learning and artificial intelligence to predict disease risk data.
DESCRIPTION OF THE STATE OF THE ART
Prediction and early detection of disease allow for early intervention. For many diseases, early detection increases the likelihood of successful treatment and gives patients the best range of options for making quality-of-life decisions. It also promotes preventive care and timely diagnosis.
Early prediction and detection give healthcare professionals the ability to initiate further testing, guide diagnoses, and detect conditions they might not otherwise have been alerted to, which can aid in disease treatment and prevention.
Predictive analytics is an approach to disease prediction and early detection that uses data and algorithms to identify the likelihood of future outcomes and provide disease risk data. Artificial intelligence (AI) has had a significant impact on predictive analytics, where AI techniques such as case-based reasoning and data-driven machine learning (ML) algorithms have been used to support decision-making in complex tasks. This is used, for example, to assist medical professionals in making clinical decisions by providing disease risk data and predicted prognoses from ML models.
Different methods and systems for obtaining disease risk data are known in the state of the art.
The paper by Curry, K. D., Nute, M. G., & Treangen, T. J. (2021), It takes guts to learn machine learning techniques for disease detection from the gut microbiome, Emerging Topics in Life Sciences, 5(6), 815-827, reports that associations between the human gut microbiome and host disease expression have been observed in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases such as liver cirrhosis and irritable bowel disease, but have lacked effectiveness in predicting other diseases.
US10347368B2 discloses a method for characterizing a microbiome-related condition and determining therapeutic measures, as well as for evaluating, diagnosing, and treating at least one cardiovascular disease in at least one individual. The method comprises receiving a collection of biological samples from a population of individuals; generating at least one data set on microbiome composition and another on microbiome functional diversity for said population; developing a description of the cardiovascular disease condition based on features obtained from the data on microbiome composition and functional diversity; and, based on this description, creating a therapy model designed to address the cardiovascular disease condition. The data set on microbiome characteristics encompasses at least one component selected from among the taxonomic characteristics, compositional diversity, functional diversity, and functional characteristics of the microbiome. The document also discloses an output device linked to an individual to promote a therapy based on the description and the therapy model.
Similarly, document US10347368B2 discloses that the complementary data set and the features extracted from the microbiome composition data set and the microbiome functional diversity data set can be transformed into a characterization model of the cardiovascular disease condition. To that end it uses computational methods (for example, statistical methods, machine learning methods, artificial intelligence methods, bioinformatics methods, etc.) to characterize a subject who presents characteristics of a group of subjects with cardiovascular disease.
On the other hand, document CN116525105B discloses a system for anticipating and predicting the prognosis of cardiogenic shock, along with a device and a storable tool. The device includes several units: an acquisition unit, a feature extraction unit, a diagnosis unit, a second acquisition unit, and a decision-making unit. The acquisition unit collects genetic information from a sample of peripheral blood mononuclear cells and a label indicating whether or not a treatment is applied. The feature extraction unit extracts genetic features from the genetic information, while the diagnosis unit determines the need for therapy based on a genetic signature. The second acquisition unit obtains the staged diagnosis result for cardiogenic shock, and the decision-making unit selects a treatment plan according to that staged diagnosis result. The system developed by the application predicts the effect of the treatment based on the patient's genetic information.
Document CN116525105B also discloses a specific method for detecting genetic features that includes differential expression analysis, loop feature detection, and correlation analysis of important genes. The method selects 21 genes and applies a machine learning algorithm to identify them and obtain the optimal biomarker.
Finally, CN116525105B discloses that the system uses a machine learning model with several classification algorithms, such as KNN, decision tree, random forest, SVM, logistic regression, GBDT and XGBoost, and incorporates a random forest-based negative feedback mechanism. In each iteration round, the weights of the training samples are adjusted to improve the probability of correctly classifying, in the next iteration, the samples that were previously misclassified.
Document US20190108912A1 presents a method and a machine learning system that analyze clinical data to identify latent patterns that predict disease. The data sets used as training data for the learning system include information about test results, phenotypes, environment, demographics, geography, genetics, clinical data, insurance claims, and treatments. The machine learning system can discover sequences and combinations of events that would not be apparent to a human reviewer but are nevertheless reliable predictors of medically important outcomes.
Furthermore, US20190108912A1 explains that the machine learning system generates reports for healthcare professionals, for example, reports on a particular patient's risk of fibromyalgia. This predictive report enables healthcare professionals to perform additional testing and begin treatment interventions much earlier than would otherwise be possible. It also mentions that, in certain implementations, the machine learning algorithm incorporates a neural network, while noting that any appropriate machine learning system may be employed, such as one or more of a random forest, a grid search, or a support vector machine. However, the aforementioned documents do not disclose the combination of input data sources, namely genomic DNA data together with microbiota data and clinical data, as inputs for the learning methods. The absence of this combination significantly affects the classification performance of the learning method or model.
Nor do they describe an automatic process that selects a machine learning method, based on metrics, from a plurality of machine learning methods.
On the other hand, methods for characterizing certain health conditions and managing them through nutritional solutions (for example, products associated with precision probiotics, natural ingredients, plant extracts, antioxidants, vitamins and minerals that, alone or in combination, can modify metabolic pathways identified from individualized biological data or from specific population groups) have not been viable to implement, owing to the limitations of current studies and the difficulty of integrating data. Those studies focus on a single type of information, whether clinical data, microbiota DNA data or genomic DNA data, whereas a disease may be reflected not only in the phenomenon of dysbiosis but also in the genetic potential of the individual and in the interaction of that genetic potential with the individual's particular environment, which cannot be addressed by clinical variables alone, such as laboratory or even behavioral tests.
Additionally, omics data present a dimensionality problem: a large number of features and a small sample size. For example, there may be 12,000 OTUs (Operational Taxonomic Units) and only 50 to 100 individuals. New species of microorganisms or new variants may be present that cannot be validated, both because of the low number of individuals participating in the study and because of fluctuations in the abundances of the microorganisms, which make each sampling a unique event and a static representation of the individual's state at that moment.
There is a need for a method for obtaining disease risk prediction data, for example, to help physicians make timely and accurate diagnoses, characterize certain health conditions, assist in the design of nutritional solutions, predict treatment adherence, design treatments, and appropriately counsel and treat patients.
BRIEF DESCRIPTION OF THE DISCLOSURE
The present disclosure describes a computer-implemented method for obtaining disease risk prediction data comprising the steps: a) obtaining clinical data, microbiota DNA data, and genomic DNA data from a database; b) generating preprocessed clinical data, preprocessed genomic DNA data, and preprocessed microbiota DNA data from the clinical data, microbiota DNA data, and genomic DNA data obtained in step a) by data preprocessing; c) applying a plurality of supervised learning methods to the data preprocessed in step b); d) obtaining disease risk prediction data from each of the supervised learning methods that make up the plurality of learning methods of step c); e) selecting the supervised learning method with the highest metric from the plurality of supervised learning methods of step d); f) selecting the disease risk prediction data of the supervised learning method selected in step e) and storing it in a database.
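Steps c) to f) can be sketched as a small selection loop. The following is a minimal sketch under stated assumptions: the two toy classifiers (`MajorityClass`, `NearestNeighbor`), the accuracy metric, the two-feature toy data, the subject identifier and the in-memory dictionary standing in for a database are all illustrative and not part of the disclosure, which names classifiers such as AdaBoost and LGBMClassifier elsewhere.

```python
from collections import Counter


class MajorityClass:
    """Baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class NearestNeighbor:
    """1-nearest-neighbor classifier using squared Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            self.y[min(range(len(self.X)), key=lambda i: dist(row, self.X[i]))]
            for row in X
        ]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best(models, X_train, y_train, X_val, y_val, metric=accuracy):
    """Steps c) to e): fit every method, score it, keep the highest metric."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = metric(y_val, model.predict(X_val))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical preprocessed features (two per subject) and risk labels.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.15, 0.25]]
y_train = [0, 0, 1, 1, 0]
X_val = [[0.85, 0.9], [0.1, 0.1], [0.95, 0.7]]
y_val = [1, 0, 1]

models = {"majority": MajorityClass(), "nearest_neighbor": NearestNeighbor()}
best, scores = select_best(models, X_train, y_train, X_val, y_val)

# Step f): keep the prediction of the selected method (a dict stands in
# for the database here).
risk_store = {"subject_001": models[best].predict([[0.7, 0.75]])[0]}
print(best, scores, risk_store)
```

Any metric with the signature `metric(y_true, y_pred)` can be passed in place of accuracy; the selection logic does not change.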
In some embodiments, in step b) the preprocessed genomic DNA data is generated by a method comprising the sub-steps: a2) evaluating the quality of the sequencing data of the genomic DNA data from step a); b2) trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the genomic DNA data evaluated in sub-step a2); c2) aligning the sequences of the DNA data processed in sub-step b2); d2) detecting genetic variants, recalibrating the quality of the sequencing bases and filtering variants in the DNA data aligned in sub-step c2); e2) annotating the function, in its correlation with a disease, of the genetic variants of the DNA data sequences obtained in sub-step d2), and storing the annotation in a database; f2) filtering the variants annotated in sub-step e2) for variants of genes associated with metabolic disorders; g2) encoding mutations using the One Hot Encoder technique for the variants of the DNA data sequences filtered in sub-step f2).
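Sub-step g2) turns categorical variant calls into numeric columns. A minimal sketch of one-hot encoding follows, assuming each sample is a mapping from a variant identifier to a genotype string. The function name `one_hot_encode`, the variant identifiers and the genotype values are hypothetical and serve only to illustrate the technique.

```python
def one_hot_encode(samples: list[dict[str, str]]) -> tuple[list[str], list[list[int]]]:
    """Build one binary column per (variant, genotype) pair seen in the data."""
    categories = sorted({(variant, genotype)
                         for sample in samples
                         for variant, genotype in sample.items()})
    index = {category: i for i, category in enumerate(categories)}
    matrix = []
    for sample in samples:
        row = [0] * len(categories)
        for variant, genotype in sample.items():
            row[index[(variant, genotype)]] = 1
        matrix.append(row)
    columns = [f"{variant}={genotype}" for variant, genotype in categories]
    return columns, matrix


# Hypothetical filtered variants for two samples.
samples = [
    {"chr1_1000_A_G": "0/1", "chr2_2000_C_T": "0/0"},
    {"chr1_1000_A_G": "1/1", "chr2_2000_C_T": "0/0"},
]
columns, matrix = one_hot_encode(samples)
print(columns)
print(matrix)
```

Each row then holds exactly one 1 per variant present in the sample, which is the form supervised learning methods typically expect for categorical inputs.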
In some embodiments, an additional dimension-reduction step is performed on the preprocessed genomic DNA data after sub-step g2), in which mutations are encoded using the One Hot Encoder technique for the variants of the DNA data sequences filtered in sub-step f2).
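The text does not say which dimension-reduction technique is used. As one simple possibility, a minimal sketch that drops encoded columns whose variance does not exceed a threshold is shown below; the variance filter itself, the function name `drop_low_variance` and the toy matrix are assumptions chosen for illustration.

```python
from statistics import pvariance


def drop_low_variance(matrix, columns, min_variance=0.0):
    """Keep only the columns whose population variance exceeds `min_variance`.

    A column that is constant across all samples carries no information
    for a classifier, so removing it reduces dimensionality at no cost.
    """
    keep = [
        j for j in range(len(columns))
        if pvariance([row[j] for row in matrix]) > min_variance
    ]
    reduced = [[row[j] for j in keep] for row in matrix]
    return [columns[j] for j in keep], reduced


# Hypothetical one-hot matrix: the third column is identical for all samples.
encoded = [[1, 0, 1], [0, 1, 1], [1, 0, 1]]
names = ["variant_a", "variant_b", "variant_c"]
kept_names, reduced = drop_low_variance(encoded, names)
print(kept_names, reduced)
```

Other choices, such as principal component analysis or correlation-based filtering, would slot into the same place in the pipeline.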
In some embodiments, in step b) the preprocessed microbiota DNA data are generated by a method comprising the sub-steps:
a3) assessing the quality of the microbiota DNA data from step a);
b3) trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the microbiota DNA data assessed in sub-step a3);
c3) identifying and removing cross-contamination sequences of human origin from the microbiota DNA data processed in sub-step b3);
d3) identifying gene functions, categorizing the genomic sequences, separating the genomic sequences into sets, and determining the taxonomic composition of the microbial community, from the microbiota DNA data remaining after the removal of cross-contamination sequences in sub-step c3);
e3) trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data obtained after sub-step d3);
f3) assigning potential biological functions to the sequences of the microbiota DNA data obtained after sub-step e3);
g3) classifying the sequences of the microbiota DNA data obtained after sub-step e3);
h3) identifying associations between microbiota variables and metadata in the sequences of the microbiota DNA data obtained after sub-step g3);
i3) filtering, by taxonomic and functional statistical criteria, the sequences of the microbiota DNA data obtained after sub-step h3);
j3) applying normalization, correlation and dimensionality reduction to the microbiota DNA data obtained after sub-step i3);
k3) clustering the microbiota DNA data obtained after sub-step i3) into operational taxonomic units using a predefined threshold of greater than 80% gene sequence similarity;
l3) storing the operational taxonomic units obtained in sub-step k3) in a database;
m3) filtering the operational taxonomic units stored in step j3), using operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease;
n3) repeating sub-steps a3) to j3) after a time T2 has elapsed, and storing the resulting operational taxonomic units in a database;
o3) filtering the operational taxonomic units stored in step k3), using operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease;
p3) comparing the data stored in step k3) with the data stored in step n3) after the time T2 has elapsed.
In some embodiments, an additional dimensionality-reduction step is applied to the preprocessed microbiota DNA data after sub-step j3).
In some embodiments, an additional step of providing a consumable to a subject is added between sub-step m3) and sub-step n3).
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 illustrates a block diagram giving an overview of the computer-implemented method of obtaining disease risk prediction data of the present disclosure.
FIG. 2 illustrates a flowchart giving an overview of an example embodiment of the computer-implemented method of obtaining disease risk prediction data of the present disclosure.
DETAILED DESCRIPTION
The present disclosure describes a system, a computer-readable storage medium, and a computer-implemented method of obtaining disease risk prediction data.
Referring to FIG. 1, the computer-implemented method of obtaining disease risk prediction data of the present disclosure comprises the steps of:
a) obtaining clinical data, microbiota DNA data and genomic DNA data from a database;
b) generating preprocessed clinical data, preprocessed genomic DNA data and preprocessed microbiota DNA data from the clinical data, microbiota DNA data and genomic DNA data obtained in step a) by data preprocessing;
c) applying a plurality of supervised learning methods to the data preprocessed in step b);
d) obtaining disease risk prediction data from each of the supervised learning methods that make up the plurality of supervised learning methods of step c);
e) selecting, from the plurality of supervised learning methods of step d), the supervised learning method with the highest metric;
f) selecting the disease risk prediction data of the supervised learning method selected in step e) and storing it in a database.
The computer-implemented method of obtaining disease risk prediction data of the present disclosure makes it possible to filter out low-quality data and data that may be subject to technical noise because of their low representativeness.
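By way of non-limiting illustration, steps d) to f) can be sketched in Python as follows. The method names, metric values and risk values are hypothetical placeholders rather than data from the present disclosure, the in-memory dictionary merely stands in for the database, and a higher metric is assumed to be better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodResult:
    name: str      # hypothetical label of a supervised learning method
    metric: float  # evaluation metric; higher is assumed to be better
    risk: float    # disease risk prediction produced by this method


def select_best(results):
    """Steps e) and f): return the result whose metric is highest."""
    if not results:
        raise ValueError("at least one method result is required")
    return max(results, key=lambda r: r.metric)


# Step d): one prediction per supervised learning method (placeholder values).
results = [
    MethodResult("logistic_regression", 0.81, 0.34),
    MethodResult("random_forest", 0.87, 0.41),
    MethodResult("gradient_boosting", 0.85, 0.38),
]

best = select_best(results)

# Step f): store the selected prediction (a dict stands in for the database).
database = {"method": best.name, "risk_prediction": best.risk}
```

The choice of metric (for example accuracy or area under the ROC curve) is left open here, since the steps above only require that the methods be comparable on a single metric.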
For the purposes of the present disclosure, technical noise means data with a low count, for which it cannot be established whether an observed trend (or its absence) across the data set is due to the sensitivity of the technique or has biological significance.
The computer-implemented method of obtaining disease risk prediction data of the present disclosure yields disease risk prediction data that can be modified in an additional step of intervention on a user, through changes in, for example, lifestyle, diet or exercise, or through the intake of a consumable. The consumable includes ingredients selected, for example, from the group of antioxidants, probiotics, prebiotics, postbiotics, vitamin supplements, minerals, fibers, natural extracts, immunonutrients, amino acids, plant proteins and animal proteins which, alone or in combined formulas, can modulate metabolic, molecular, cellular, immunological and vascular pathways, as well as pathways involved in sleep management, mood, memory, oxidative stress, aging, caloric expenditure, nutrient absorption and gastrointestinal status, associated with the function and/or modulation of the microbiome or with clinical markers involved in a pathology. Other interventions may include fermented food matrices, functional foods, and vegan, vegetarian, carnivorous, ketogenic, paleo, Mediterranean or low-calorie diets, which can be evaluated over different time ranges.
In some embodiments of the present disclosure, in step b) of the computer-implemented method of obtaining disease risk prediction data, the preprocessed clinical data, preprocessed genomic DNA data and preprocessed microbiota DNA data are generated from the clinical data, microbiota DNA data and genomic DNA data obtained in step a) by preprocessing performed serially, independently or in parallel. In some embodiments of the present disclosure, in step c) of the computer-implemented method of obtaining disease risk prediction data, the plurality of supervised learning methods is applied to the data preprocessed in step b) serially, independently or in parallel.
Referring to FIG. 2, in some embodiments of the present disclosure, in step b) of the computer-implemented method of obtaining disease risk prediction data, the preprocessed genomic DNA data are generated by a method comprising the sub-steps:
a2) assessing the quality of the sequencing data of the genomic DNA data of step a);
b2) trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the genomic DNA data assessed in sub-step a2);
c2) aligning the sequences of the genomic DNA data processed in sub-step b2);
d2) detecting genetic variants, recalibrating sequencing base quality and filtering variants in the genomic DNA data aligned in sub-step c2);
e2) annotating the function, in its correlation with a disease, of the genetic variants of the sequences of the genomic DNA data obtained in sub-step d2), and storing the annotation in a database;
f2) filtering, from the variants annotated in sub-step e2), variants of genes associated with metabolic disorders;
g2) encoding mutations using the One Hot Encoder technique for the variants of the sequences of the genomic DNA data filtered in sub-step f2).
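By way of non-limiting illustration, the one-hot encoding of sub-step g2) can be sketched as follows. The variant identifiers (`var_A`, `var_B`) and the genotype labels are hypothetical; the disclosure does not specify which categories are encoded, so this sketch assumes one binary column per observed (variant, genotype) pair.

```python
def one_hot_encode(samples):
    """Return (columns, rows) with one 0/1 column per (variant, genotype) pair.

    `samples` is a list of dicts mapping a variant identifier to the
    genotype observed for that variant in one subject.
    """
    # Collect every (variant, genotype) category seen in the data.
    categories = sorted({(v, g) for s in samples for v, g in s.items()})
    columns = [f"{v}={g}" for v, g in categories]
    rows = [[1 if s.get(v) == g else 0 for v, g in categories] for s in samples]
    return columns, rows


# Hypothetical filtered variants for three subjects.
samples = [
    {"var_A": "0/1", "var_B": "0/0"},
    {"var_A": "1/1", "var_B": "0/0"},
    {"var_A": "0/1", "var_B": "0/1"},
]

columns, rows = one_hot_encode(samples)
for row in rows:
    print(dict(zip(columns, row)))
```

Each subject becomes a fixed-length binary vector, which is the form a supervised learning method typically expects; the width of that vector grows with the number of categories, which is one reason a subsequent dimensionality-reduction step may be useful.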
In some embodiments of the present disclosure, in sub-step a2) the method that generates the preprocessed genomic DNA data assesses the quality of the sequencing data of the genomic DNA data of step a) by evaluating the accuracy and reliability of the information obtained through genomic DNA sequencing techniques. Quality is assessed through various parameters and metrics that indicate how reliable the generated sequence reads are. In the present description this is done, for example, with the following parameters: base quality (for example, a PHRED value of 30); length distribution (for example, at least 80% of the read size); per-base GC content following a defined normal distribution; per-sequence GC content (normal behaviour of the %GC value); adapter content (for example, no adapters); over-represented sequences (for example, no over-represented sequences at the ends of the sequences); and sequence length distribution. After compliance with these parameters has been checked, the reads are passed, for example, through a cleaning tool (such as Trimmomatic), which normalizes the values where possible; otherwise, the sequence reads are removed from the study. If the quality criterion is not met in, for example, at least 80% of the sequences, the entire set is rejected.
Base quality is measured by the PHRED score, a logarithmic value that indicates the probability that a base call is incorrect. A PHRED score of 30 corresponds to an error probability of 1 in 1,000. Reads with high PHRED scores therefore indicate higher quality.
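For illustration, the standard relationship between a PHRED score Q and the error probability P, namely Q = -10 log10(P), can be written as a short Python sketch. The function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q):
    """Error probability P for a PHRED score Q, using P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """PHRED score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"PHRED {q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```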
Additionally, the distribution of base quality values along the reads is examined. A uniform, high distribution indicates good quality, whereas unexpected peaks or drops may indicate problems.
The quality of genomic DNA sequencing data can be assessed with tools such as FASTQC and MultiQC. Aspects such as base quality, the presence of adapters and the length distribution are analysed, and detailed reports are optionally generated for each sample. The quality cutoff is PHRED 30, with a length distribution of at least 80% of the read size, per-base GC content following a defined normal distribution, and per-sequence GC content showing normal behaviour of the %GC value; adapter content and sequences over-represented at the ends are removed completely. Finally, after all these filters, the sequence length distribution is checked for uniformity. After compliance with these parameters has been checked, the reads are passed, for example, through a cleaning tool (such as Trimmomatic), which normalizes the values where possible; otherwise, the sequence reads are removed from the study. If the quality criterion is not met in, for example, at least 80% of the sequences, the entire set is rejected.
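As a minimal sketch of the set-level rejection rule, the following Python example accepts a set of reads only if at least 80% of them meet a per-read criterion. The per-read criterion used here (mean PHRED of at least 30 and length of at least 80% of the read size) is an assumption made for the example; the disclosure lists further parameters, such as GC content and adapter content, that are not modelled. The quality values are invented for illustration.

```python
def read_passes(quals, read_size, min_mean_q=30, min_length_fraction=0.8):
    """Assumed per-read criterion: long enough and mean PHRED >= min_mean_q."""
    if not quals:
        return False
    long_enough = len(quals) >= min_length_fraction * read_size
    return long_enough and sum(quals) / len(quals) >= min_mean_q


def accept_set(reads, read_size, min_pass_fraction=0.8):
    """Accept the whole set only if enough reads meet the per-read criterion."""
    if not reads:
        return False
    passed = sum(read_passes(q, read_size) for q in reads)
    return passed / len(reads) >= min_pass_fraction


# Hypothetical per-base PHRED values for six reads of nominal size 10.
reads = [[35] * 10, [32] * 9, [31] * 10, [36] * 9, [33] * 10, [20] * 10]

print(accept_set(reads, read_size=10))                     # 5 of 6 reads pass
print(accept_set(reads + [[10] * 10] * 2, read_size=10))   # 5 of 8 reads pass
```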
In some embodiments of the present disclosure, in sub-step b2) the method that generates the preprocessed genomic DNA data trims, cleans, filters and removes unwanted sequences from the sequencing data of the DNA data assessed in sub-step a2) by means of the following sub-steps:
i. reading the genomic DNA data assessed in sub-step a2);
ii. removing low-quality bases at the beginning and end of each read, for example with PHRED 30, and PHRED 25 as the minimum value;
iii. performing sliding-window trimming to remove low-quality regions, for example so that every 3 bases have an average value of PHRED 30, or at least PHRED 25;
iv. filtering by quality using additional quality values, such as a length distribution of at least 80% of the read size, with adapter content and sequences over-represented at the ends removed completely;
v. checking that the sequence length distribution is uniform.
Thus, for example, if the quality criteria stated above are not met in at least 80% of the sequences, the entire set is rejected.
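Sub-steps ii. to iv. can be sketched in Python as follows. This is a simplified illustration of end trimming and sliding-window trimming in the spirit of tools such as Trimmomatic, not a reproduction of any tool's exact algorithm. The thresholds follow the examples given above (minimum PHRED 25 at the ends, a window of 3 bases with mean PHRED 30, minimum length of 80% of the read size); the read and its quality values are invented.

```python
def trim_ends(quals, min_q=25):
    """Sub-step ii: drop leading and trailing bases whose PHRED is below min_q."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return start, end


def sliding_window_cut(quals, window=3, min_mean=30):
    """Sub-step iii: index of the first window whose mean PHRED is below min_mean."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean:
            return i
    return len(quals)


def trim_read(seq, quals, min_q=25, window=3, min_mean=30):
    start, end = trim_ends(quals, min_q)
    keep = sliding_window_cut(quals[start:end], window, min_mean)
    return seq[start:start + keep], quals[start:start + keep]


seq = "ACGTACGTAC"
quals = [12, 31, 35, 36, 34, 26, 27, 28, 33, 9]

trimmed_seq, trimmed_quals = trim_read(seq, quals)
# Sub-step iv: keep the read only if it retains at least 80% of its original size.
long_enough = len(trimmed_seq) >= 0.8 * len(seq)
print(trimmed_seq, trimmed_quals, long_enough)
```

In this example the read is cut where the 3-base window mean first drops below 30, and the remaining fragment is too short to satisfy the 80% length criterion, so the read would be removed.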
For the purposes of the present disclosure, the term "low-quality bases" refers to nucleotides whose base calls have a low probability of being correct. Each base in a sequenced DNA sequence is represented by a letter (A, C, G or T), and each base is assigned a quality score. The quality of a base is generally expressed as a PHRED value, which is the negative base-10 logarithm of the error probability, multiplied by 10.
The higher the PHRED score, the better the quality of the base. For example, a PHRED score of 20 means that there is a 1 in 100 probability that the base is incorrect.
In the context of sequence trimming, low-quality bases are those whose PHRED score falls below a specified threshold. These bases are removed to improve the overall quality of the sequence and to avoid including erroneous information in subsequent analysis.
The operations of trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the genomic DNA data can be performed with tools such as Trimmomatic, a tool that performs multiple preprocessing operations on sequencing data to improve the quality and usefulness of the reads.
Table: PHRED score and base-calling accuracy.
A FASTQ file typically uses four lines per sequence.
• Line 1: begins with an "@" character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2: the nucleotides that were sequenced.
• Line 3: begins with a "+" character.
• Line 4: encodes the quality values for the sequence in line 2 and must contain the same number of symbols as there are letters in the sequence; each symbol corresponds to a PHRED value in ASCII (American Standard Code for Information Interchange) code.
A FASTQ file has the following structure:
@Sequence 1
CGCTAACTGAGACGCATGAATAGGATCAGCTTACAATCGTCTTTGAACGGACAATCTATTATGACTTCTG
+
EEEEGDGFGGGDCEGGGGFFFCFFFFFFFFFFACGGFGGGFGFFFFG4;53>EFFFFBFFFFFA?FF<:*
And the PHRED quality values correspond to their ASCII equivalents as follows:
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
Quality score:    0         10        20        30        40
For this reason, taking the threshold value into account in a tool such as Trimmomatic, the bases that would be excluded from subsequent analysis in this example are the final three (CTG), whose quality symbols are <, : and *:
@Sequence 1
CGCTAACTGAGACGCATGAATAGGATCAGCTTACAATCGTCTTTGAACGGACAATCTATTATGACTTCTG
+
EEEEGDGFGGGDCEGGGGFFFCFFFFFFFFFFACGGFGGGFGFFFFG4;53>EFFFFBFFFFFA?FF<:*
This is because, under the encoding shown above, these ASCII symbols correspond to PHRED values of 27 (<), 25 (:) and 9 (*), all of which are below the PHRED threshold of 30 (?).
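The decoding used in this example can be checked with a short Python sketch. It assumes the common Phred+33 encoding, in which the PHRED value is the ASCII code of the symbol minus 33 (so that "!" is 0 and "?" is 30), and it applies only trailing-end trimming against the threshold of 30; the function names are chosen for this example.

```python
def decode_phred33(quality_line):
    """Convert a FASTQ quality line to PHRED values, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_line]


def trim_trailing(sequence, quality_line, threshold=30):
    """Remove bases from the 3' end while their PHRED value is below threshold."""
    scores = decode_phred33(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return sequence[:end], quality_line[:end]


sequence = "CGCTAACTGAGACGCATGAATAGGATCAGCTTACAATCGTCTTTGAACGGACAATCTATTATGACTTCTG"
quality = "EEEEGDGFGGGDCEGGGGFFFCFFFFFFFFFFACGGFGGGFGFFFFG4;53>EFFFFBFFFFFA?FF<:*"

kept_seq, kept_qual = trim_trailing(sequence, quality)
removed_bases = sequence[len(kept_seq):]
removed_symbols = quality[len(kept_qual):]
print(removed_bases, removed_symbols, decode_phred33(removed_symbols))
```

Low-quality symbols in the interior of the read (for example "4" or "3") are not affected by trailing-end trimming; they would be handled by a separate step such as sliding-window trimming.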
In some embodiments of the present disclosure, in the method that generates the preprocessed genomic DNA data, sub-step c2) of aligning the sequences of the DNA data processed in sub-step b2) is performed by the sub-steps of:
i. building an index from a reference sequence;
ii. aligning the sequences of the DNA data processed in sub-step b2) against the reference index;
iii. converting the file format, for example from SAM format to BAM format;
iv. sorting the file, for example the BAM file.
In the context of genomic sequence alignment, the reference index is an efficient representation of the genome sequence. This index is used to search quickly and efficiently for matches between the DNA sequences and the reference genome sequence. The reference sequence is the known, annotated DNA sequence of a specific organism. For the human genome, for example, the reference sequence would be the version of the human genome that has been assembled and comprehensively annotated.
Building a reference index is a process in which data structures are created from the reference sequence. These structures make it possible to perform searches during the alignment process.
The alignment process seeks to identify where in the reference genome the sequences of the genomic DNA data processed in sub-step b2) align.
Tools such as Bowtie 2 or BWA can be used to align the sequences of the genomic DNA data processed in sub-step b2).
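As a hedged sketch, the four alignment sub-steps (index, align, convert SAM to BAM, sort) could be expressed as command lines for BWA and SAMtools, assembled here in Python without being executed. The file names are hypothetical, paired-end reads are assumed, and the `-o` output option of `bwa mem` is assumed to be available in the installed BWA version; Bowtie 2 would use different commands.

```python
def alignment_commands(reference, reads_1, reads_2, prefix):
    """Return one argument list per sub-step: index, align, convert, sort."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["bwa", "index", reference],                             # i. build index
        ["bwa", "mem", "-o", sam, reference, reads_1, reads_2],  # ii. align reads
        ["samtools", "view", "-b", "-o", bam, sam],              # iii. SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],             # iv. sort BAM
    ]


# Hypothetical file names, for illustration only.
commands = alignment_commands("reference.fa", "reads_R1.fastq", "reads_R2.fastq", "sample")
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)` on a system where the tools are installed.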
En algunas realizaciones de la presente divulgación el método que genera los datos de ADN genómico preprocesados en la subetapa d2) detectar variantes genéticas y recalibrar la calidad de bases de secuenciación y filtrar variantes) de los datos de ADN genómico alineados en la subetapa c2) se realiza mediante las subetapas de: i. ordena los archivos; por ejemplo, archivo SAM/BAM, por coordenadas ascendentes, lo que facilita el acceso eficiente a regiones específicas del genoma; ii. identifica y marca duplicados en el archivo, por ejemplo, archivos BAM, para evitar resultados sesgados en el análisis de variantes; iii. proporcionar estadísticas detalladas del archivo, por ejemplo, archivos BAM, incluyendo el número de lecturas mapeadas y no mapeadas, el número de duplicados, entre otros; un ejemplo de las estadísticas de la subetapa se proporciona usando de la herramienta SAMtools Flagstat: In some embodiments of the present disclosure the method that generates the pre-processed genomic DNA data in sub-step d2) detecting genetic variants and recalibrating the quality of sequencing bases and filtering variants) of the aligned genomic DNA data in sub-step c2) is performed by the sub-steps of: i. sorts files, e.g., SAM/BAM files, by ascending coordinates, facilitating efficient access to specific genome regions; ii. identifies and flags duplicates in the file, e.g., BAM files, to avoid biased results in variant analysis; iii. provides detailed statistics for the file, e.g., BAM files, including the number of mapped and unmapped reads, the number of duplicates, and so on. An example of substage statistics is provided using the SAMtools Flagstat tool:
7417232 + 0 in total (QC-passed reads + QC-failed reads) (Good mapping)
287618 + 0 duplicates (Number of duplicate sequences)
4534962 + 0 mapped (61.14% : -nan%) (Total mapped sequences)
7417232 + 0 paired in sequencing (Total sequences in pairs)
3708616 + 0 read1 (pair 1)
3708616 + 0 read2 (pair 2)
4528278 + 0 properly paired (61.05% : -nan%) (Paired sequences without duplication)
4534962 + 0 with itself and mate mapped
iv. marking duplicates and generating detailed statistics; for example, options of the Picard tool can be used to establish a criterion for the quality of the mapping of the reads to the reference;
v. identifying sequencing errors and recalibrating the base quality scores to improve the accuracy of variant calls;
vi. applying base recalibration, for example to BAM files, by adjusting the base quality scores;
vii. analyzing the covariates used in the base recalibration process to assess quality;
viii. detecting somatic variants in cancer sequencing data compared to normal sequences;
ix. calculating pileup coverage statistics for variants in the context of somatic variant analysis;
x. calculating the contamination of the sample by the normal sample in a somatic variant analysis;
xi. filtering variants based on orientation biases observed during sequencing;
xii. applying additional filters to the somatic variant calls.
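The flagstat statistics of sub-step iii can be read programmatically to check, for example, the fraction of mapped reads. The following is a minimal sketch, assuming the plain-text layout shown above without the explanatory labels; the function names `parse_flagstat` and `mapped_fraction` are hypothetical and are not part of SAMtools.

```python
import re

# Matches lines such as "4534962 + 0 mapped (61.14% : -nan%)".
# Groups: QC-passed count, QC-failed count, category label.
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(.*\))?$")

# Values taken from the example in the text.
EXAMPLE = """\
7417232 + 0 in total (QC-passed reads + QC-failed reads)
287618 + 0 duplicates
4534962 + 0 mapped (61.14% : -nan%)
7417232 + 0 paired in sequencing
3708616 + 0 read1
3708616 + 0 read2
4528278 + 0 properly paired (61.05% : -nan%)
4534962 + 0 with itself and mate mapped
"""


def parse_flagstat(text):
    """Return {category: (qc_passed, qc_failed)} from flagstat-style text."""
    stats = {}
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            passed, failed, label = match.groups()
            stats[label] = (int(passed), int(failed))
    return stats


def mapped_fraction(stats):
    """Fraction of QC-passed reads that were mapped to the reference."""
    return stats["mapped"][0] / stats["in total"][0]


stats = parse_flagstat(EXAMPLE)
print(f"mapped: {100 * mapped_fraction(stats):.2f}%")
```

A threshold on this fraction could serve as a mapping-quality criterion, although the text does not specify one.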
After completing the series of sub-steps i to xii on the genomic DNA data, for example with GATK (Genome Analysis Toolkit), results are obtained for an interpretation of the genome. Initially, the genomic DNA data processed in sub-step b2) are organized, and duplicates are then removed to minimize possible distortions in subsequent analyses. The statistics obtained in step ix provide an overview of the data set, including information on mapped, unmapped and duplicate reads. Base recalibration and the application of these adjustments improve the accuracy of the variant calls by addressing possible sequencing errors. The covariate analysis helps to assess the quality of this recalibration process. The detection of somatic variants reveals cancer-specific mutations compared to normal sequences. The coverage statistics and the assessment of contamination provide essential information for the accurate interpretation of the somatic variants. The application of additional filters ensures that high-quality variants are retained. Ultimately, the output of the analysis is a final file of somatic variants, for example in VCF format, and a file, for example BAM, that is sorted and free of duplicates, together with detailed reports that provide cleaned data from the genomic DNA data input to sub-step i. For example, for the GATK analysis, HG38 is taken as the reference genome and default values are used.
In some embodiments of the present disclosure, in the method that generates the preprocessed genomic DNA data, sub-step e2) (annotating the function, in its correlation with a disease, of the genetic variants of the sequences of the genomic DNA data obtained in sub-step d2), and storing the annotation in a database) is performed by the sub-steps of:
i. reading the sequences of the genomic DNA data obtained in sub-step d2);
ii. calling variants, using for example tools such as SnpEff;
iii. configuring an annotation database; in one example, ClinVar db is configured as the database for the annotation of the variants. ClinVar db is a public database that stores information on genetic variants and their relationship with human diseases, and it is considered a valuable source for the annotation of genetic variants.
The result of applying these sub-steps is the annotated genomic DNA data, for example in an annotated VCF file or in a database.
For the understanding of the present disclosure, variant annotation refers to the process of associating biological and functional information with the genetic variants identified in a genome. Genetic variants may include, among others, nucleotide substitutions, insertions, deletions and structural variants. Annotation seeks to understand the potential impact of these variants in terms of gene function and health. In a particular embodiment, multiple "effect / consequence" annotations are separated by commas. Optionally, the annotations are sorted, for example by the following sub-steps:
i. annotating the effects and consequences of the genomic variants, separated by commas if multiple consequences are associated with a variant;
ii. estimating harmfulness: when multiple consequences are predicted, they are compared using the "most deleterious" criterion to evaluate which of the effects is considered most harmful;
iii. coding consequence: in the case of coding consequences (those that affect the amino acid sequence of a protein), the best transcript support level (TSL) or the canonical transcript is placed first;
iv. locating the variant in the genome at the genomic coordinates of the feature;
v. comparing the feature IDs alphabetically, even if the ID is a number.
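The ordering rules of sub-steps ii to v can be expressed as a single sort key. The following is a minimal sketch under stated assumptions: the dictionary fields (`impact`, `canonical`, `tsl`, `start`, `feature_id`), the four impact categories used as the measure of harmfulness, and the precedence of the canonical flag over the TSL value are illustrative choices, not requirements taken from the text. For simplicity the key is applied to all consequences, not only to coding ones.

```python
# Assumed ranking of harmfulness: a lower rank means more deleterious.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def annotation_sort_key(ann):
    """Sort key following sub-steps ii to v (most deleterious first)."""
    return (
        IMPACT_RANK[ann["impact"]],    # ii. most deleterious first
        0 if ann["canonical"] else 1,  # iii. canonical transcript first
        ann["tsl"],                    # iii. lower TSL number = better support
        ann["start"],                  # iv. genomic coordinate of the feature
        str(ann["feature_id"]),        # v. IDs compared as text, even if numeric
    )


def ordered_effects(annotations):
    """Return the effects of one variant as a comma-separated, sorted string."""
    ordered = sorted(annotations, key=annotation_sort_key)
    return ",".join(ann["effect"] for ann in ordered)


# Hypothetical annotations for a single variant.
annotations = [
    {"effect": "intron_variant", "impact": "MODIFIER", "canonical": False,
     "tsl": 3, "start": 100, "feature_id": "T3"},
    {"effect": "missense_variant", "impact": "MODERATE", "canonical": True,
     "tsl": 1, "start": 100, "feature_id": "T1"},
    {"effect": "stop_gained", "impact": "HIGH", "canonical": False,
     "tsl": 2, "start": 100, "feature_id": "T2"},
]
print(ordered_effects(annotations))
```

Because the feature IDs are compared as text, an ID of 10 sorts before an ID of 9, which is the behaviour described in sub-step v.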
To execute sub-step e2) (annotating the function, in its correlation with a disease, of the genetic variants of the sequences of the genomic DNA data obtained in sub-step d2)), tools such as SnpEff can be used, and the annotation can be stored in a database such as dbNSFP v4, which is a comprehensive database of transcript-specific functional annotations for human single nucleotide variants (SNVs).
SnpEff is a tool used in bioinformatics and genomics for the analysis of genetic variants in genomic sequences, for example single nucleotide polymorphisms (SNPs) and insertion/deletion variants (indels). It is used, for example, to predict the functional effects of genetic variants, that is, how the variants may affect proteins and genes, and to annotate the biological consequences of these variants, among other functions.
In some embodiments of the present disclosure, in the method that generates the preprocessed genomic DNA data, sub-step f2) (filtering variants of genes associated with metabolic disorders from the variants annotated in sub-step e2)) is performed by the sub-steps of:
i. filtering the variants to eliminate those of low quality (for example, quality lower than PHRED 25);
ii. identifying genes associated with metabolic disorders;
iii. filtering the variants to retain only those located within the metabolic genes identified in the previous step;
iv. filtering according to criteria such as allele frequency and the predicted functional effect.
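The four filtering sub-steps above can be sketched as successive conditions on each annotated variant. The following is a minimal sketch; only the quality threshold of 25 comes from the text. The record fields, the gene names, the allele-frequency cut-off of 0.01 and the retained impact categories are hypothetical values chosen for illustration.

```python
def filter_variants(variants, metabolic_genes, min_qual=25.0,
                    max_allele_freq=0.01, kept_impacts=("HIGH", "MODERATE")):
    """Keep variants that pass the filters of sub-steps i, iii and iv.

    `metabolic_genes` is the gene set identified in sub-step ii.
    """
    kept = []
    for variant in variants:
        if variant["qual"] < min_qual:            # i. low-quality variants
            continue
        if variant["gene"] not in metabolic_genes:  # iii. outside the gene set
            continue
        if variant["af"] > max_allele_freq:       # iv. allele frequency (assumed cut-off)
            continue
        if variant["impact"] not in kept_impacts:  # iv. predicted effect (assumed categories)
            continue
        kept.append(variant)
    return kept


# Hypothetical gene set and variants.
metabolic_genes = {"GENE_A", "GENE_B"}
variants = [
    {"id": "v1", "qual": 40.0, "gene": "GENE_A", "af": 0.001, "impact": "HIGH"},
    {"id": "v2", "qual": 10.0, "gene": "GENE_A", "af": 0.001, "impact": "HIGH"},
    {"id": "v3", "qual": 40.0, "gene": "GENE_X", "af": 0.001, "impact": "HIGH"},
    {"id": "v4", "qual": 40.0, "gene": "GENE_B", "af": 0.200, "impact": "HIGH"},
    {"id": "v5", "qual": 40.0, "gene": "GENE_B", "af": 0.001, "impact": "MODIFIER"},
]
print([v["id"] for v in filter_variants(variants, metabolic_genes)])
```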
In a particular example of the present disclosure, to identify the genes associated with metabolic disorders in sub-step ii, databases such as OMIM (Online Mendelian Inheritance in Man), an online database that provides information on human genetic diseases and the genes associated with them, ClinVar and the scientific literature are consulted, in order to search for variants related to genes associated with, for example, cardiometabolic disease.
Sub-step f2) can be executed, for example, with tools such as VCFtools, which is a set of command-line tools designed to work with VCF (Variant Call Format) files, GATK, or ANNOVAR, which is a widely used bioinformatics tool for the annotation of genetic variants in VCF files. Variant annotation involves adding functional and contextual information to the genetic variants identified from genomic data.
In some embodiments of the present disclosure, in the method that generates the preprocessed genomic DNA data, sub-step g2) (encoding mutations using the One Hot Encoder technique for the variants of the sequences of the genomic DNA data filtered in sub-step f2)) is performed by the sub-steps of:
i. reading the variants of the sequences of the genomic DNA data filtered in sub-step f2), for example using a library such as pandas to read a VCF file and load the genetic variants;
ii. extracting the relevant variables, for example the columns that contain the information of the genetic variants (chromosome, position, reference and alternate allele);
iii. encoding the categorical variables (variants) that have no notion of closeness with One Hot Encoder.
The genomic DNA data are encoded into categorical variables using the One Hot Encoder technique so that each possible genetic change is a variable, for example with the following structure: chromosome, position, mutation.
In sub-step iii, the One Hot Encoder technique is applied to the categorical variables whose categories have no notion of closeness. This technique converts each category of the variable into a separate variable, with values of 1 and 0 that indicate the presence or absence of the category in the respective sample. On the other hand, in sub-step iv, to encode the categorical variables in which a notion of closeness does exist, the Ordinal Encoder technique is used, which assigns each unique category of the categorical variable a unique integer based on its natural order or hierarchy.
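The two encodings described above can be illustrated without any external library. The following is a minimal sketch, assuming that each variant is labelled with the chromosome, position and mutation structure mentioned in the text; the label format, the function names and the sample data are hypothetical.

```python
def variant_label(chrom, pos, ref, alt):
    """Build one categorical label per genetic change (assumed format)."""
    return f"{chrom}_{pos}_{ref}>{alt}"


def one_hot(samples):
    """One-hot encode variants.

    `samples` maps a sample ID to the set of variant labels it carries.
    Returns the column names and, per sample, a row of 1/0 values that
    indicate presence or absence of each variant.
    """
    columns = sorted({label for labels in samples.values() for label in labels})
    rows = {
        sample_id: [1 if column in labels else 0 for column in columns]
        for sample_id, labels in samples.items()
    }
    return columns, rows


def ordinal_encode(values, ordered_categories):
    """Map each category to an integer that follows its natural order."""
    rank = {category: index for index, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


# Hypothetical samples and variants.
samples = {
    "s1": {variant_label("chr1", 100, "A", "G")},
    "s2": {variant_label("chr1", 100, "A", "G"),
           variant_label("chr2", 200, "C", "T")},
}
columns, rows = one_hot(samples)
print(columns)
print(rows)
print(ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"]))
```

One-hot encoding avoids implying an order between unrelated variants, whereas the ordinal encoding preserves an order that the categories already have.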
In some embodiments of the present disclosure, in the method that generates the preprocessed genomic DNA data, an additional step of reducing the dimension of the data is performed after sub-step g2). In one example, this step is performed using Pearson correlations: the variables that are directly proportional to each other (correlation equal to 1) are selected, and a single variable is kept as their representative. The representative is chosen for its importance or relationship with the study, that is, a variable associated with related genes in the case of the preprocessed genomic DNA data after sub-step g2), or with species or taxa in the case of microbiota, that are associated with a decrease in metabolic risk.
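The reduction by Pearson correlation can be sketched as follows. This is a minimal illustration, assuming that the variables are supplied in order of importance so that the first variable of each perfectly correlated group is kept as its representative; the tolerance used to compare the coefficient with 1 and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant variable: correlation undefined, treated as 0 here
    return cov / (norm_x * norm_y)


def drop_perfectly_correlated(columns, tol=1e-9):
    """Keep one representative per group of variables with correlation 1.

    `columns` maps a variable name to its values; insertion order is
    taken as the order of importance (an assumption of this sketch).
    """
    kept = []
    for name, values in columns.items():
        if all(pearson(values, columns[other]) < 1 - tol for other in kept):
            kept.append(name)
    return kept


# Hypothetical variables: "b" is directly proportional to "a".
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 1, 2],
}
print(drop_perfectly_correlated(columns))
```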
The relevant variables are defined through two considerations. The first is that the variables correspond to each of the variants, in the case of the genomic DNA data, and to each of the taxa or OTUs (Operational Taxonomic Units). The second is that the relevance of the variables is analyzed by calculating their importance and relative importance. In one example, this can be done with a Python function such as "feature importances", which calculates the decrease in the impurity of the data (that is, the probability of misclassifying a randomly chosen element of a set), thereby selecting the variables that contribute most to discriminating between the groups without misclassified values.
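The impurity decrease on which this kind of feature importance is based can be computed directly for a single variable. The following is a minimal sketch, assuming Gini impurity as the impurity measure and a binary (0/1) variable such as a one-hot encoded variant; the labels and the function names are hypothetical, and tree-based libraries aggregate such decreases over many splits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: probability of misclassifying a random element."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(labels, feature):
    """Decrease in Gini impurity obtained by splitting on a 0/1 feature."""
    absent = [label for label, value in zip(labels, feature) if value == 0]
    present = [label for label, value in zip(labels, feature) if value == 1]
    n = len(labels)
    weighted = (len(absent) / n) * gini(absent) + (len(present) / n) * gini(present)
    return gini(labels) - weighted


# Hypothetical groups and two candidate variables.
labels = ["risk", "risk", "healthy", "healthy"]
informative = [1, 1, 0, 0]    # separates the two groups perfectly
uninformative = [1, 0, 1, 0]  # carries no information about the group

print(impurity_decrease(labels, informative))
print(impurity_decrease(labels, uninformative))
```

Variables with a larger decrease contribute more to discriminating between the groups and would be ranked higher.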
On the other hand, referring to FIG. 2, in some embodiments of the present disclosure, in step b) of the computer-implemented method for obtaining disease risk prediction data, the preprocessed microbiota DNA data are generated by a method comprising the sub-steps of:
a3) evaluating the quality of the microbiota DNA data of step a);
b3) processing, by the operations of trimming, cleaning, filtering and removing unwanted sequences, the sequencing data of the microbiota DNA data evaluated in sub-step a3);
c3) identifying and removing cross-contamination sequences of human origin from the microbiota DNA data processed in sub-step b3);
d3) identifying gene functions and categorizing the genomic sequences, separating the genomic sequences into sets, and finding the taxonomic composition of the microbial community, from the microbiota DNA data after the removal of cross-contamination sequences in sub-step c3);
e3) trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data after the application of sub-step d3);
f3) assigning potential biological functions to the sequences of the microbiota DNA data after the application of sub-step e3);
g3) classifying the sequences of the microbiota DNA data after the application of sub-step e3);
h3) identifying associations between microbiota variables and metadata in the sequences of the microbiota DNA data after the application of sub-step g3);
i3) filtering, by taxonomic and functional statistical criteria, the sequences of the microbiota DNA data after the application of sub-step h3);
j3) applying the processes of normalization, correlation and dimension reduction to the microbiota DNA data after the application of sub-step i3);
k3) grouping the microbiota DNA data after the application of sub-step i3) into operational taxonomic units using a predefined threshold greater than 80% similarity to the gene sequence;
l3) storing the operational taxonomic units obtained in step k3) in a database;
m3) filtering the operational taxonomic units stored in step j3) using operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease;
n3) repeating steps a3) to j3) after a time T2 has elapsed and storing the operational taxonomic units obtained in step n3) in a database;
o3) filtering the operational taxonomic units stored in step k3) using operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease;
p3) comparing the data stored in step k3) with the data stored in step n3) after the time T2 has elapsed.
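The comparison of sub-step p3) can be sketched as the change in relative abundance of each operational taxonomic unit between the first measurement and the measurement taken after the time T2. The following is a minimal sketch; the use of relative abundance as the compared quantity, the OTU identifiers and the function names are assumptions made for illustration and are not specified in the text.

```python
def relative_abundance(counts):
    """Convert raw OTU counts into fractions of the total count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {otu: count / total for otu, count in counts.items()}


def compare_time_points(counts_t1, counts_t2):
    """Change in relative abundance per OTU between two time points.

    OTUs missing at one time point are treated as having abundance 0.
    """
    before = relative_abundance(counts_t1)
    after = relative_abundance(counts_t2)
    return {
        otu: after.get(otu, 0.0) - before.get(otu, 0.0)
        for otu in sorted(set(before) | set(after))
    }


# Hypothetical OTU counts stored at the first measurement and after T2.
counts_t1 = {"OTU_1": 50, "OTU_2": 50}
counts_t2 = {"OTU_1": 25, "OTU_3": 75}
print(compare_time_points(counts_t1, counts_t2))
```

A positive value indicates an OTU that became relatively more abundant after T2, and a negative value one that decreased.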
In some embodiments of the present disclosure, in the method that generates the preprocessed microbiota DNA data, sub-step a3) evaluates the quality of the sequencing data of the microbiota DNA data of step a), that is, the accuracy and reliability of the information obtained through DNA sequencing techniques. Quality is evaluated using various parameters and metrics that provide information on the reliability of the generated sequence reads. In one example of the present description, the following parameters are used: a quality cut-off threshold of PHRED 30; a length distribution with a minimum of 80% of the read size; a per-base GC content with a defined normal distribution behavior; a per-sequence GC content with normal behavior of the %GC value; and complete removal of adapter content and of over-represented sequences at the ends. Finally, it is checked that the distribution of sequence lengths is uniform after these filters. After compliance with these parameters has been checked, the reads are submitted, for example, to a cleaning tool (such as Trimmomatic), which normalizes the values if possible; otherwise, the sequence reads are removed from the study. If the quality criterion is not met in, for example, at least 80% of the sequences, the entire set is rejected.
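The quality and length criteria above can be sketched as a simple acceptance check on the quality strings of a read set. The following is a minimal sketch under stated assumptions: the quality strings use the Phred+33 encoding, the PHRED 30 threshold is applied to the mean quality of each read, and the set is accepted only when at least 80% of its reads pass. The text does not specify these details, and the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into scores."""
    return [ord(char) - offset for char in quality_string]


def read_passes(quality_string, expected_length,
                min_phred=30, min_length_fraction=0.8):
    """Check the length and mean-quality criteria for a single read."""
    if not quality_string:
        return False
    if len(quality_string) < min_length_fraction * expected_length:
        return False  # shorter than 80% of the read size
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_phred


def accept_read_set(quality_strings, expected_length, min_pass_fraction=0.8):
    """Accept the set only if enough reads meet the quality criterion."""
    if not quality_strings:
        return False
    passed = sum(read_passes(q, expected_length) for q in quality_strings)
    return passed / len(quality_strings) >= min_pass_fraction


# Hypothetical reads of length 10: "I" encodes Phred 40, "#" encodes Phred 2.
good, bad = "I" * 10, "#" * 10
print(accept_read_set([good] * 4 + [bad], expected_length=10))
print(accept_read_set([good] * 3 + [bad] * 2, expected_length=10))
```

The GC-content and adapter checks mentioned in the text are not covered by this sketch.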
In some embodiments of the present disclosure, in the method that generates the preprocessed microbiota DNA data, an additional step of providing a consumable to a subject is performed between sub-step m3) and sub-step n3). The consumable is selected, for example, from the group of antioxidants, probiotics, prebiotics, postbiotics, vitamin supplements, minerals, fibers, natural extracts, immunonutrients, amino acids, vegetable proteins and animal proteins which, alone or in formulas combined with each other and with other compounds, can modulate metabolic pathways associated with the function of the microbiome or clinical markers involved in a pathology or disease. This modulation can be evaluated by means of disease risk prediction data obtained after the time T2 has elapsed. For this purpose, clinical data, microbiota DNA data and genomic DNA data of the subject are taken and entered into the database to be read in step a) of the computer-implemented method for obtaining disease risk prediction data of the present disclosure, the steps following step n3) are applied, and the computer-implemented method for obtaining disease risk prediction data of the present disclosure is applied.
Furthermore, in the present disclosure, the pre-processed microbiota DNA data generated in sub-step n3), together with clinical data derived from blood biochemistry analysis, can provide information to define metabolic profiles and predictive disease risk profiles. These profiles can be used to define recommendations that improve health, such as: nutritional recommendations, diet, exercise, supplementation, portion management and distribution of macronutrients on the plate, hydration, lifestyle changes, identification of foods according to the season and geographic location, and follow-up through monitoring of glycemia and ketone bodies or any other variable that can modulate the risk of disease, health status and nutritional status.
Likewise, an intervention that modifies a subject's habits, lifestyle or diet can be carried out, which can produce changes measurable through microbiota or clinical data, with the possibility of generating recommendations that improve the subject's health and reduce the risk of disease.
In some embodiments of the present disclosure, in the method that generates the pre-processed microbiota DNA data, sub-step a3), assessing the quality of the sequencing data of the microbiota DNA data from step a), is performed by the following sub-steps:
vi. optionally generating general statistics on the reads after applying the different preprocessing steps, providing an overview of the quality and quantity of the remaining data, and removing low-quality bases at the beginning and end of each read, with PHRED 30, and PHRED 25 as a minimum value;
vii. performing sliding-window trimming to remove low-quality regions, for example requiring that every 3 bases have an average PHRED value of 30, or a minimum of PHRED 25;
viii. performing quality filtering with additional quality criteria, such as a length distribution of at least 80% of the read size, and completely removing adapter content and over-represented sequences at the ends;
ix. finally, checking that after all these filters the sequence length distribution is uniform. For example, if the quality criteria stated above are not met in at least 80% of the sequences, the entire set is rejected.
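The acceptance rule in sub-step ix can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: the function names, the use of the mean PHRED score per read as the quality criterion, and the default thresholds (PHRED 25, 80% of the expected read length, 80% of reads passing) are assumptions chosen to mirror the values quoted above.

```python
def read_passes(qualities, expected_length, min_quality=25, min_length_fraction=0.8):
    """Return True if one read meets the (assumed) quality criteria.

    qualities: list of per-base PHRED scores for the read after trimming.
    expected_length: nominal read length produced by the sequencer.
    """
    if not qualities:
        return False
    long_enough = len(qualities) >= min_length_fraction * expected_length
    mean_quality = sum(qualities) / len(qualities)
    return long_enough and mean_quality >= min_quality


def accept_read_set(reads, expected_length, min_pass_fraction=0.8):
    """Accept the whole set only if enough reads pass; otherwise reject it."""
    if not reads:
        return False
    passed = sum(read_passes(q, expected_length) for q in reads)
    return passed / len(reads) >= min_pass_fraction


# Hypothetical example: four good reads and one poor read out of five.
example_reads = [[30] * 100] * 4 + [[10] * 100]
set_is_accepted = accept_read_set(example_reads, expected_length=100)
```

In practice the per-read criteria would come from the reports of tools such as FastQC; the sketch only shows how a set-level 80% rule can be applied once each read has been judged.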
Microbiota DNA data quality assessment can be performed using tools such as FastQC, MetaWRAP and MultiQC. Aspects such as base quality, presence of adapters and length distribution are analysed, and detailed reports are optionally generated for each sample.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step b3), processing by the operations of trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the microbiota DNA data evaluated in sub-step a3), is performed by the following sub-steps:
i. reading the microbiota DNA data evaluated in sub-step a3);
ii. removing low-quality bases, that is, bases with a PHRED score below 30, at the beginning and end of each read;
iii. performing sliding-window trimming to remove low-quality regions, and performing quality filtering, removal of specific adapters and similar operations according to the requirements of the project.
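Sub-steps ii and iii can be illustrated with a short sketch. It is a simplified stand-in for what a dedicated trimmer does, not a reproduction of any tool's exact algorithm: the function names are hypothetical, and the window rule used here (cut the read at the start of the first window whose mean quality falls below the threshold) is an assumption.

```python
def trim_ends(seq, quals, min_quality=30):
    """Remove bases below min_quality from the start and end of a read."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end], quals[start:end]


def sliding_window_trim(seq, quals, window=3, min_mean_quality=30):
    """Cut the read where the mean quality of a window first drops too low."""
    for i in range(len(quals) - window + 1):
        win = quals[i:i + window]
        if sum(win) / window < min_mean_quality:
            return seq[:i], quals[:i]
    return seq, quals


# Hypothetical read: one poor base at the start and three at the end.
trimmed_seq, trimmed_quals = trim_ends("ACGTACGT", [10, 35, 35, 35, 35, 12, 11, 5])
```

The window size of 3 and the threshold of 30 follow the example values given for sub-step a3); a real pipeline would normally also discard reads that become too short after trimming.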
For the purposes of this disclosure, the term "low-quality bases" refers to nucleotides whose base calls have a low probability of being correct. Each base in a sequenced DNA read is represented by a letter (A, C, G or T), and each base is assigned a quality score. The quality of a base is usually expressed as a PHRED value, which is the negative base-10 logarithm of the error probability multiplied by 10: Q = -10·log10(P), where P is the estimated probability that the base call is wrong.
The higher the PHRED score, the better the quality of the base. For example, a PHRED score of 20 means there is a 1 in 100 chance that the base is incorrect.
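The relationship between a PHRED score and the error probability can be checked with a few lines of code. The sketch below uses the standard definition Q = -10·log10(P); the function names are illustrative, and the Phred+33 ASCII offset used for decoding FASTQ quality strings is an assumption about the input encoding.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """PHRED score for an error probability p, with 0 < p <= 1."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into integer PHRED scores."""
    return [ord(ch) - offset for ch in quality_string]
```

With these definitions, PHRED 20 corresponds to an error probability of 0.01 and PHRED 30 to 0.001, which is why the thresholds of 25 and 30 used in this disclosure retain only bases with a small chance of being miscalled.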
In the context of sequence trimming, bases whose PHRED score falls below a specified threshold are considered low-quality, with sequence quality determined according to the parameters described herein. These bases are removed to improve the overall quality of the sequence and to prevent the inclusion of erroneous information in subsequent analysis.
The operations of trimming, cleaning, filtering and removing unwanted sequences from the sequencing data of the microbiota DNA data evaluated in sub-step a3) can be performed using tools such as Trimmomatic, which is a tool for performing multiple preprocessing operations on sequencing data to improve the quality and usefulness of the reads.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step c3), identifying and removing cross-contamination sequences of human origin from the microbiota DNA data processed in sub-step b3), is performed by the following sub-steps:
i. having a reference database to identify and filter contaminating sequences;
ii. indexing the reference database;
iii. searching for matches between the sequences of the microbiota DNA data processed in sub-step b3) and the indexed reference database;
iv. identifying those sequences that correspond to the human genome, considering them as cross-contamination;
v. filtering the sequences considered as cross-contamination out of the microbiota dataset and saving them in an output file;
vi. generating an output file containing only the sequences that were not identified as originating from the human genome, according to the reference database.
The result of this sub-step is a filtered dataset free of unwanted sequences from cross-contamination.
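The index, match and split logic of sub-step c3) can be sketched in a toy form. This is an illustration only: it uses exact k-mer matching against an in-memory set, which is far simpler than what dedicated host-removal tools do, and the function names, the k-mer length and the 50% shared-k-mer threshold are all assumptions.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference_sequences, k):
    """Index the reference (e.g. human) sequences as a set of k-mers."""
    index = set()
    for ref in reference_sequences:
        index |= kmers(ref.upper(), k)
    return index


def split_reads(reads, index, k, min_shared_fraction=0.5):
    """Split reads into (kept, contaminant) dictionaries keyed by read name."""
    kept, contaminant = {}, {}
    for name, seq in reads.items():
        read_kmers = kmers(seq.upper(), k)
        shared = len(read_kmers & index) / len(read_kmers) if read_kmers else 0.0
        target = contaminant if shared >= min_shared_fraction else kept
        target[name] = seq
    return kept, contaminant


# Hypothetical data: a tiny "human" reference and two reads.
index = build_index(["ACGTACGTTGCA"], k=4)
kept, contaminant = split_reads(
    {"read_1": "ACGTACGT", "read_2": "GGGGCCCCAAAA"}, index, k=4
)
```

The two returned dictionaries correspond to the two output files of sub-steps v and vi: the reads flagged as human, and the reads retained for downstream microbiota analysis.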
Sub-step c3), identifying and removing cross-contamination sequences of human origin from the microbiota DNA data processed in sub-step b3), can be carried out using tools such as BMTagger (Best Match Tagger), which is a tool designed to address the problem of cross-contamination in metagenome sequencing data, understood as data sets originating from microbial communities. Cross-contamination can occur when sequences from unwanted organisms are introduced during the sequencing process.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step d3), identifying gene functions and categorizing the genomic sequences, separating the genomic sequences into sets, and finding the taxonomic composition of the microbial community from the microbiota DNA data obtained after the removal of cross-contamination sequences in sub-step c3), is performed by the following sub-steps:
i. assembling the microbial genomes from the microbiota DNA data obtained after the removal of cross-contamination sequences in sub-step c3), using a genome sequence assembly tool such as MEGAHIT, which is a genome assembly tool for the rapid and efficient construction of genomes from next-generation sequencing (NGS) data;
ii. performing functional annotation of the genes present in the genomes assembled in the previous sub-step;
iii. categorizing the genomic sequences obtained in the previous sub-step into functional sets using tools such as HUMAnN3;
iv. determining the taxonomic composition of the microbial community using tools such as Kraken2;
v. saving the results to a memory unit or database, for example in a BIOM format file, for subsequent statistical analyses. The BIOM (Biological Observation Matrix) format is a standard format designed to represent and share biological sample-by-observation tables.
After sub-step v, further steps can be carried out, such as the identification of metabolic pathways, the comparison of samples, and the determination of differentially present OTUs, that is, those showing statistically significant differences.
The functional annotation of the genes in sub-step ii makes it possible to identify the genetic functions present in the assembled genomes.
Sub-step d3), identifying gene functions, categorizing the genomic sequences and finding the taxonomic composition of the microbial community from the microbiota DNA data obtained after sub-step c3), can be carried out using tools such as MetaWRAP, which facilitates the processing, pre-processing and analysis of metagenome sequencing data.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step e3), trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data obtained after sub-step d3), is performed by the following sub-steps:
i. trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data obtained after sub-step d3), using the sequences of the reference human genome HG38 and of the adapters that were used during sequencing;
ii. optionally verifying that the adapters have been correctly trimmed and assessing the quality of the reads after trimming, for example using tools such as FastQC.
Sub-step e3), trimming adapters and unwanted sequences from the sequencing reads of the microbiota DNA data obtained after sub-step d3), can be carried out using tools such as Cutadapt, which allows the trimming of adapters and unwanted sequences from DNA or RNA sequencing reads.
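The idea behind 3' adapter trimming can be shown with a small sketch. It is deliberately simplified: it only handles exact matches, whereas dedicated tools also tolerate sequencing errors, and the function name, the `min_overlap` parameter and the example adapter are illustrative assumptions.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (full, or partial at the read end) by exact match."""
    idx = read.find(adapter)
    if idx != -1:
        # Full adapter found: keep only what precedes it.
        return read[:idx]
    # Otherwise look for the longest adapter prefix that ends the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


# Hypothetical adapter sequence used only for this example.
ADAPTER = "AGATCGGAAGAGC"
full_case = trim_adapter("ACGTAGATCGGAAGAGCTTT", ADAPTER)
partial_case = trim_adapter("ACGTACGTAGATCGGAAG", ADAPTER)
```

The `min_overlap` guard avoids removing one or two trailing bases that match the adapter start purely by chance; the appropriate value depends on read length and is a tuning choice rather than something fixed by the disclosure.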
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step f3), assigning potential biological functions to the sequences of the microbiota DNA data obtained after sub-step e3), is performed by the following sub-steps:
i. having a specific database to carry out the function assignment according to the KEGG database, using the protein sequences inferred from the structural annotation of the assembly;
ii. comparing the sequences of the microbiota DNA data obtained after sub-step e3) with the sequences in the databases;
iii. assigning potential biological functions according to the KEGG database, based on the comparison of the previous sub-step;
iv. saving the results of the assignment of sub-step iii.
The results of the comparison in sub-step ii may include information on the genetic functions present, the active metabolic pathways and the relative abundance of different organisms in the microbial community.
For the purposes of this disclosure, potential biological functions refer to the activities or roles that the genes and sequences identified in the microbial community can perform. These functions can encompass a variety of biological activities, including metabolic processes, specific cellular functions and the production of certain chemicals.
Sub-step f3), assigning potential biological functions to the sequences of the microbiota DNA data obtained after sub-step e3), can be carried out using tools such as HUMAnN3 (HMP Unified Metabolic Analysis Network 3), which is designed for the functional and metagenomic analysis of human microbiome data.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step g3), classifying the sequences of the microbiota DNA data obtained after sub-step e3), is performed by the following sub-steps:
i. having a database that contains information on the reference genomic sequences of microorganisms;
ii. dividing the sequences of the microbiota DNA data obtained after sub-step e3) into small fragments (k-mers);
iii. comparing the k-mers from sub-step ii with the k-mers present in the database of sub-step i;
iv. assigning to each sequence the taxon it most resembles, based on the k-mer comparison of sub-step iii.
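The four sub-steps above can be sketched as a toy k-mer classifier. This is an illustration of the principle only: real classifiers use compact indexes and resolve ambiguous k-mers through the taxonomy, whereas this sketch simply counts exact k-mer votes per taxon. All names, the value of k and the example sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the list of k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_database(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    database = {}
    for taxon, genome in references.items():
        for kmer in set(kmers(genome, k)):
            database.setdefault(kmer, set()).add(taxon)
    return database


def classify(read, database, k):
    """Assign the taxon with the most k-mer hits, or None if nothing matches."""
    votes = Counter()
    for kmer in kmers(read, k):
        for taxon in database.get(kmer, ()):
            votes[taxon] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical reference sequences for two taxa.
database = build_kmer_database(
    {"taxon_A": "AAAACCCCAAAA", "taxon_B": "GGGGTTTTGGGG"}, k=4
)
assigned = classify("AACCCC", database, k=4)
```

Reads with no matching k-mer are left unclassified (`None`) rather than forced into a taxon, which keeps unknown organisms from inflating the abundance of known ones.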
Sub-step g3), classifying the sequences of the microbiota DNA data obtained after sub-step e3), can be carried out using tools such as Kraken 2, which classifies DNA sequences and metagenome sequences, that is, genomic data sets from microbial communities, according to their taxonomic origin.
Optionally, the classified taxonomic data of the sequences of the microbiota DNA data obtained after sub-step e3) can be visualized and represented graphically using tools such as Krona, which allow the visualization to be explored and manipulated to obtain details about specific taxonomic groups.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step h3), identifying associations between microbiota variables and metadata in the sequences of the microbiota DNA data obtained after sub-step g3), is performed by the following sub-steps:
i. reading the microbiota DNA data pre-processed in sub-step h3), specifying the tables of taxon abundances and metadata;
ii. defining the linear model to be fitted and specifying the independent variables (microbiota taxa) and the dependent variable (metadata);
iii. performing statistical tests to evaluate the association between microbiota taxa and metadata variables;
iv. identifying associations between specific taxa and metadata variables.
Optionally, a step of validating the identified associations against known biology and the relevant literature may be performed. The linear model defined in sub-step ii can be selected from the group comprising: a simple linear model, to explore the association between one microbiota variable and one metadata variable at a time; a multiple linear model, when multiple metadata variables could influence microbiota abundances; and a model with interaction, in which interactions between variables are optionally incorporated into the linear model to account for the effect of one variable depending on the value of another.
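The simplest of the model types listed above, a simple linear model with one taxon and one metadata variable, can be sketched with ordinary least squares. This is a minimal illustration that only estimates the slope and intercept; it does not compute the statistical tests of sub-step iii, and the function name and example values are hypothetical.

```python
def fit_simple_linear_model(x, y):
    """Least-squares fit of y = intercept + slope * x.

    x: abundances of one taxon across samples (independent variable).
    y: one metadata variable across the same samples (dependent variable).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x is constant; the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical abundances and a metadata value measured in four samples.
slope, intercept = fit_simple_linear_model([0.1, 0.2, 0.3, 0.4], [1.2, 1.4, 1.6, 1.8])
```

A multiple linear model or a model with interaction extends the same idea to several predictors and their products, and is normally fitted with a statistics library rather than by hand.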
Optionally, the linear model is adjusted with a correction for multiple testing (for example, a Bonferroni adjustment) when repeating the process of sub-steps i to iv, in order to control type I errors.
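The Bonferroni adjustment mentioned above multiplies each p-value by the number of tests performed and caps the result at 1. The sketch below assumes the p-values have already been obtained from the association tests; the function names, the example p-values and the significance level of 0.05 are illustrative assumptions.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values (each multiplied by the test count)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant_associations(p_values_by_taxon, alpha=0.05):
    """Keep only taxa whose adjusted p-value is at or below alpha."""
    names = list(p_values_by_taxon)
    adjusted = bonferroni_adjust([p_values_by_taxon[n] for n in names])
    return {n: adj for n, adj in zip(names, adjusted) if adj <= alpha}


# Hypothetical raw p-values for three taxa.
selected = significant_associations({"taxon_A": 0.001, "taxon_B": 0.02, "taxon_C": 0.6})
```

In this example taxon_B has a raw p-value below 0.05 but is no longer significant after adjustment (0.02 × 3 = 0.06), which is exactly the kind of false positive the correction is meant to guard against.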
Sub-step h3), identifying associations between microbiota variables and metadata in the sequences of the microbiota DNA data obtained after sub-step g3), can be carried out using tools such as MaAsLin2 (Multivariate Association with Linear Models 2), which allows multivariate association analysis to be performed.
In some embodiments of the present disclosure, in the method that generates pre-processed microbiota DNA data, sub-step i3), filtering by statistical, taxonomic and functional criteria the sequences of the microbiota DNA data obtained after sub-step h3), is performed by the following sub-steps:
i. performing statistical filtering: filtering by abundance, which eliminates taxa or functions with low abundance, and filtering by statistical significance, which eliminates results that are not statistically significant;
ii. performing taxonomic filtering: identifying and eliminating taxa that could be contaminants;
iii. performing filtering by taxonomic level: defining the taxonomic levels of interest and filtering out the sequences that do not meet those criteria;
iv. performing functional filtering: eliminating functions that are not relevant;
v. performing filtering by functional level;
vi. performing literature-based filtering: eliminating taxa or functions that are not supported by the relevant literature.
Optionally, steps can be added to apply cross-validation techniques to ensure that the filtering criteria are unbiased and applicable to different data sets. Likewise, exploratory data review steps can optionally be added before and after filtering to assess the impact of the filtering decisions.
Sub-step i3), filtering by statistical, taxonomic and functional criteria the sequences of the microbiota DNA data obtained after sub-step h3), can be carried out using statistical tools such as R, or Python with libraries such as pandas or scipy, to perform the statistical filtering. Criteria such as the following can be applied: filtering by abundance, which eliminates taxa or functions with low abundance; and filtering by statistical significance, which eliminates results that are not statistically significant.
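The two statistical filters named above can be sketched as follows. The example uses plain Python dictionaries rather than pandas so that it stays self-contained; the function names, the use of mean relative abundance as the abundance criterion, and the thresholds (0.001 and 0.05) are assumptions made for illustration.

```python
def filter_by_abundance(abundances, min_mean_abundance=0.001):
    """Keep taxa or functions whose mean relative abundance is high enough.

    abundances: {feature name: [relative abundance in each sample]}.
    """
    return {
        name: values
        for name, values in abundances.items()
        if values and sum(values) / len(values) >= min_mean_abundance
    }


def filter_by_significance(p_values, alpha=0.05):
    """Keep only results whose p-value is below the significance level."""
    return {name: p for name, p in p_values.items() if p < alpha}


# Hypothetical abundance table and test results.
abundant = filter_by_abundance(
    {"taxon_A": [0.20, 0.30], "taxon_B": [0.0001, 0.0003], "taxon_C": [0.05, 0.0]}
)
significant = filter_by_significance({"taxon_A": 0.01, "taxon_C": 0.40})
```

Applying the abundance filter before the significance tests reduces the number of hypotheses tested, which in turn makes any multiple-testing correction less severe.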
In some embodiments of the present disclosure, sub-step j3) of the method that generates pre-processed microbiota DNA data, namely applying normalization, correlation and dimension reduction processes to the microbiota DNA data obtained after sub-step i3), is performed by the following sub-steps:
vii. applying normalization to the microbiota DNA data obtained after sub-step i3) using techniques such as z-score normalization or min-max scaling, for example with libraries such as scikit-learn in Python;
viii. calculating the correlation matrix between the variables;
ix. applying dimension reduction techniques, for example Principal Component Analysis (PCA), which is a commonly used technique and can be implemented, for example, with libraries such as scikit-learn.
In some embodiments of the present disclosure, the method that generates pre-processed microbiota DNA data performs, after sub-step j3), an additional step of reducing the dimension of the data. In one example, this step is performed using Pearson correlations: variables that are directly proportional to one another (correlation equal to 1) are identified, and a single variable is kept as the representative of each such group. The representative is chosen for its importance or relationship with the study, for example genes (in the case of the microbiota DNA data obtained after sub-step j3)) and microbiota-related species or taxa that are associated with a decrease in metabolic risk.
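The Pearson-based reduction described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the variable names, the priority list that encodes which variable is considered most relevant to the study, and the numerical tolerance used to decide that a correlation equals 1 are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant variable: correlation undefined, treated as 0
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(columns, priority, tol=1e-9):
    """Keep one representative per group of perfectly correlated variables.

    `priority` lists variable names from most to least relevant (assumed to
    come from domain knowledge); the most relevant member of a group is kept.
    """
    kept = []
    for name in sorted(columns, key=priority.index):
        if all(pearson(columns[name], columns[k]) < 1 - tol for k in kept):
            kept.append(name)
    return kept


columns = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.0, 4.0, 6.0, 8.0],   # directly proportional to gene_1
    "taxon_X": [4.0, 1.0, 3.0, 2.0],
}
priority = ["gene_2", "gene_1", "taxon_X"]  # hypothetical relevance ranking

representatives = drop_redundant(columns, priority)
print(representatives)
```

Here gene_1 is dropped because it is perfectly correlated with gene_2, which ranks higher in the assumed priority list.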
The relevant variables are defined through two considerations. The first is that the variables correspond to each of the variants, in the case of the microbiota DNA data, and to each of the taxa or OTUs (Operational Taxonomic Units). The second is that the relevance of the variables is analyzed by calculating their importance and relative importance. In one example, this can be done in Python with a feature such as "feature importances" (in scikit-learn, the feature_importances_ attribute of tree-based estimators), which measures the decrease in the impurity of the data (the probability of misclassifying a randomly chosen element of a set). In this way, the variables that contribute most to discriminating between the groups without misclassified values are selected.
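To make the notion of impurity decrease concrete, the following sketch computes the Gini impurity of a set of labels and the decrease obtained by splitting on one variable at a threshold. It illustrates the quantity that impurity-based importances aggregate and is not a reimplementation of any library. The labels, OTU values and thresholds are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Probability of misclassifying a randomly chosen element of `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Gini decrease obtained by splitting the samples on `values <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


labels = ["risk", "risk", "no_risk", "no_risk"]
otu_a = [0.9, 0.8, 0.1, 0.2]   # separates the two groups perfectly
otu_b = [0.5, 0.1, 0.5, 0.1]   # carries no information about the groups

decrease_a = impurity_decrease(otu_a, labels, threshold=0.5)
decrease_b = impurity_decrease(otu_b, labels, threshold=0.3)
print(decrease_a, decrease_b)
```

A variable such as otu_a, whose split removes all impurity, would rank as more important than otu_b, whose split removes none.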
Normalizing the data ensures that atypical values, which arise from using a population of individuals whose controlled variables may not be sufficient for every individual to respond similarly, subsequently have the same scale and weight in the machine learning methods. In a particular example of the present description, normalization is performed by applying the min-max normalization technique, which adjusts the values of each variable to a range of 0 to 1.
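Min-max normalization maps each value x of a variable to (x - min) / (max - min), so the smallest observed value becomes 0 and the largest becomes 1. A minimal sketch follows; the sample values are hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: mapped to 0 (assumption)
    return [(v - lo) / (hi - lo) for v in values]


glucose = [80.0, 100.0, 120.0, 180.0]   # hypothetical measurements
scaled = min_max_scale(glucose)
print(scaled)
```

Because the scaling depends on the observed minimum and maximum, a single extreme value compresses the remaining values toward 0, which is worth checking during exploratory review.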
Sub-step j3), applying the normalization, correlation and dimension reduction processes to the microbiota DNA data obtained after sub-step i3), can be executed, for example, in a programming environment such as Python (using libraries such as pandas for data manipulation).
In some embodiments of the present disclosure, sub-step k3) of the method that generates pre-processed microbiota DNA data, namely clustering the microbiota DNA data obtained after sub-step i3) into operational taxonomic units using a predefined threshold greater than 80% gene sequence similarity, is performed by the following sub-steps:
i. defining a gene sequence similarity threshold that determines when two sequences are clustered into the same OTU. This threshold could be, for example, X% to Y% similarity;
ii. clustering the sequences into OTUs based on the defined similarity threshold using a clustering algorithm, for example hierarchical clustering, the single or complete linkage method, or k-means clustering;
iii. assigning taxonomic information to each OTU. This can be done using reference databases containing information on the taxonomy of different gene sequences, for example with tools such as Kraken2 and Kaiju.
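Sub-steps i and ii above can be sketched with a simple greedy, seed-based clustering. This is a swapped-in simplification and not one of the algorithms named above: it assumes the sequences are already aligned to the same length, measures similarity as the fraction of identical positions, and uses toy reads. The 0.80 threshold mirrors the "greater than 80%" condition.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned reads."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(sequences, threshold=0.80):
    """Greedy clustering: a read joins the first OTU whose seed it resembles."""
    otus = []  # each OTU is a list whose first element is its seed sequence
    for seq in sequences:
        for otu in otus:
            if identity(seq, otu[0]) > threshold:  # strictly greater than 80%
                otu.append(seq)
                break
        else:
            otus.append([seq])  # no similar seed found: start a new OTU
    return otus


reads = [
    "ACGTACGTAC",
    "ACGTACGTAA",   # 9 of 10 positions match the first read
    "TTTTGGGGCC",   # 3 of 10 positions match the first read
]
otus = cluster_otus(reads)
print(otus)
```

With real data, similarity would normally come from a sequence alignment tool, and the result of greedy clustering depends on the order in which reads are processed.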
Sub-step k3), clustering the microbiota DNA data obtained after sub-step i3) into operational taxonomic units using a predefined threshold greater than 80% gene sequence similarity, can be executed with tools such as BLAST (Basic Local Alignment Search Tool) or other gene sequence alignment tools.
In some embodiments of the present disclosure, sub-step m3) of the method that generates pre-processed microbiota DNA data, namely storing the operational taxonomic units obtained in step k3) in a database, is performed by the following sub-steps:
In some embodiments of the present disclosure, sub-step n3) of the method that generates pre-processed microbiota DNA data, namely filtering the operational taxonomic units stored in step 13) by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, is performed by the following sub-steps:
v. providing a taxonomic database that assigns taxonomic identifications to the OTUs;
vi. assigning taxonomic identifications to the OTUs using the database provided in the previous step;
vii. filtering by criteria;
viii. creating a metadata table with information on the presence or absence of the disease for each sample (or is it sequence?);
ix. performing differential abundance analysis to evaluate the association between the filtered OTUs and the presence of the disease, for example with a tool such as MaAsLin2;
x. defining a model that relates the abundances of the filtered OTUs to the presence or absence of the disease;
xi. selecting results according to the statistical analysis to identify the OTUs that are significantly associated with the presence of the disease.
Optionally, validation steps can be added, such as validating the results using cross-validation techniques, among other techniques. A visualization stage, such as bar charts or heatmaps, can also optionally be generated to represent visually the associations between the filtered OTUs and the disease.
Sub-step n3), filtering the operational taxonomic units stored in step 13) by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, can be executed with tools such as Kraken 2 or Kaiju, which preprocess the sequencing data and assign taxonomic identifications to the OTUs using a database that includes eukaryotic and prokaryotic microorganisms. Kaiju is a bioinformatics program used for the taxonomic classification of DNA or protein sequences.
In some embodiments of the present disclosure, sub-step o3) of the method that generates pre-processed microbiota DNA data, namely repeating steps a3) to k3) after a time T2 has elapsed and storing in a database the operational taxonomic units obtained in step n after the time T2 has elapsed, is performed by the following sub-steps:
In some embodiments of the present disclosure, sub-step p3) of the method that generates pre-processed microbiota DNA data, namely filtering the operational taxonomic units stored in step 13 by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, is performed by the following sub-steps:
xii. providing a taxonomic database that assigns taxonomic identifications to the OTUs;
xiii. assigning taxonomic identifications to the OTUs using the database provided in the previous step;
xiv. filtering by criteria such as a minimum absolute abundance of 100;
xv. creating a metadata table with information on the presence or absence of the disease for each sample;
xvi. performing differential abundance analysis to evaluate the association between the filtered OTUs and the presence of the disease, for example with a tool such as MaAsLin2;
xvii. defining a model that relates the abundances of the filtered OTUs to the presence or absence of the disease;
xviii. selecting results according to the statistical analysis to identify the OTUs that are significantly associated with the presence of the disease.
To define the model of sub-step xvii, which relates the abundances of the filtered OTUs to the presence or absence of the disease, statistical models are used, in particular logistic regression models, which suit binary data (for example, presence/absence of the disease). An example is the clinical data annex.
Optionally, validation steps can be added, such as validating the results using cross-validation techniques, among other techniques.
A visualization stage, such as bar charts or heatmaps, can also optionally be generated to represent visually the associations between the filtered OTUs and the disease.
Sub-step p3), filtering the operational taxonomic units stored in step 13 by operational taxonomic unit criteria previously stored in a database and related to the prediction of a disease, can be executed with tools such as Kraken2 and Kaiju, which preprocess the sequencing data and assign taxonomic identifications to the OTUs using a taxonomic database. Databases for eukaryotic and prokaryotic microorganisms derived from the NCBI nr database are used.
In some embodiments of the present disclosure, in sub-step p3) of the method that generates pre-processed microbiota DNA data, the data stored in step 13) are compared with the data stored in step o3) after a time T2 has elapsed.
The clinical data of step a) of the computer-implemented method for obtaining disease risk prediction data of the present disclosure may represent a wide range of measures, from metabolic indicators such as glucose and insulin to biomarkers such as CRP and the levels of various vitamins and minerals. Including them in a clinical analysis could provide a comprehensive view of the metabolic, nutritional and endocrine health of the individuals under study.
The clinical data of step a) of the computer-implemented method for obtaining disease risk prediction data of the present disclosure are selected from the group comprising age, Basal Glucose, Calcium, Iron, CRP, AP01, AP02, APOB, Thiamine, Riboflavin, Niacin, Pyridoxine, Ac Asc, Potassium, Chlorine, Zinc, TNFa, TSH, T4T, Glucagon, HGH, Serum creat, Non-HDL chol, i HOMA, and combinations of the above.
In some embodiments of the present disclosure, in step b) of the computer-implemented method for obtaining disease risk prediction data, the pre-processed clinical data are generated by applying regression models for missing data imputation and normalization to the clinical data from step a).
In other embodiments of the present disclosure, in step b) of the computer-implemented method for obtaining disease risk prediction data, the pre-processed clinical data are generated by a method comprising the sub-steps:
i. reading the clinical data from step a);
ii. examining the distribution of the clinical variables to assess normality, for example using statistical tests such as the Shapiro-Wilk normality test;
iii. if the clinical data do not follow a normal distribution, bringing them closer to normality, for example by applying transformations such as logarithm or square root, or by replacing outliers with the mean, among others;
iv. performing missing data imputation;
v. applying dimension reduction techniques such as principal component analysis (PCA) or feature selection methods to reduce the dimensionality of the data while retaining the relevant information.
Missing data imputation is applied to replace the missing values in the clinical data set from step a) using techniques such as mean, median or regression imputation, or more advanced methods such as MICE (Multiple Imputation by Chained Equations), among others.
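Of the imputation techniques listed above, median imputation is the simplest to illustrate. The sketch below uses only the Python standard library; the variable name and values are hypothetical, and missing values are assumed to be encoded as None.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


tsh = [1.2, None, 2.0, 3.4, None]   # hypothetical measurements with gaps
tsh_imputed = impute_median(tsh)
print(tsh_imputed)
```

Median imputation is robust to outliers but ignores relationships between variables, which is the gap that regression-based imputation and MICE are meant to address.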
The pre-processed clinical data can be obtained using tools available in Python such as scipy.stats, statsmodels, pandas, scikit-learn, fancyimpute and numpy.
In some embodiments of the present disclosure, in step c) of the computer-implemented method for obtaining disease risk prediction data, the plurality of supervised learning methods is selected from the group comprising: a classifier using Ridge regression with cross-validation for classification tasks (RidgeClassifierCV); a classifier based on Ridge regression (RidgeClassifier); a perceptron (Perceptron); an ensemble algorithm that combines multiple weak classifiers to improve accuracy (AdaBoostClassifier); a gradient boosting classifier using LightGBM (LGBMClassifier); a Naive Bayes classifier (BernoulliNB); a stochastic gradient descent classifier (SGDClassifier); a bagging classifier (BaggingClassifier); a random tree-based classifier (ExtraTreeClassifier); a classifier built from a collection of random decision trees (ExtraTreesClassifier); a gradient boosting classifier based on XGBoost (XGBClassifier); a support vector machine (SVC); a classifier that calibrates the probabilities of the underlying classifiers using cross-validation (CalibratedClassifierCV); a random decision tree classifier (RandomForestClassifier); a support vector machine (SVM) that uses a parameter nu to control the number of support vectors (NuSVC); a classifier that determines the class based on the nearest centroid (NearestCentroid); a reference classifier that makes decisions based on simple strategies (DummyClassifier); a linear discriminant analysis algorithm that seeks to maximize the separation between classes (LinearDiscriminantAnalysis); a label spreading algorithm used in semi-supervised classification tasks (LabelSpreading); a label propagation algorithm for partially labeled data (LabelPropagation); a classifier based on proximity to the nearest neighbors (KNeighborsClassifier); a Naive Bayes classifier (GaussianNB); a classifier based on decision trees (DecisionTreeClassifier); a classifier that updates its parameters (PassiveAggressiveClassifier); a linear support vector machine (SVM) with a linear decision function (LinearSVC); and a quadratic discriminant analysis algorithm (QuadraticDiscriminantAnalysis).
In some embodiments of the present disclosure, in step e) of the computer-implemented method for obtaining disease risk prediction data, selecting the supervised learning method with the highest metric from the plurality of supervised learning methods of step d) to obtain the disease risk prediction data comprises the sub-steps of:
a) importing and loading the labeled training data set, taken from the union of the pre-processed genomic DNA data, the pre-processed microbiota DNA data and the pre-processed clinical data of step b) of the computer-implemented method for obtaining disease risk prediction data of the present disclosure;
b) dividing the data set of sub-step a) into two parts, marked as training data and validation data, for example taking 80% of the data as training data and the remaining 20% as validation data;
c) starting a loop to evaluate the plurality of learning methods, in which a performance metric is calculated for each learning method of the plurality in order to determine which models provide the best accuracy on the genomic variant, metagenome and clinical data of sub-step a);
d) selecting the learning method with the best performance metric to predict the disease risk data.
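The split, loop and selection of sub-steps b) to d) can be sketched as follows. This is a minimal illustration under stated assumptions: two toy classifiers written for this example stand in for the library estimators a real pipeline would loop over, the data set is synthetic, and the random seed, class names and helper names are hypothetical.

```python
import random


class MajorityClass:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyNearestCentroid:
    """Toy classifier: predicts the class whose feature centroid is closest."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids_[label] = [sum(c) / len(rows) for c in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids_, key=lambda c: dist(x, self.centroids_[c]))
                for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best(models, X, y, train_fraction=0.8, seed=0):
    """Split 80/20, score every model on the validation part, keep the best."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    train, valid = idx[:cut], idx[cut:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    X_va, y_va = [X[i] for i in valid], [y[i] for i in valid]
    scores = {}
    for name, model in models.items():   # loop over the plurality of methods
        model.fit(X_tr, y_tr)
        scores[name] = accuracy(y_va, model.predict(X_va))
    return max(scores, key=scores.get), scores


# Synthetic, well-separated data: class 0 near (0, 0), class 1 near (5, 5).
X = [[i * 0.1, i * 0.1] for i in range(10)] + \
    [[5 + i * 0.1, 5 + i * 0.1] for i in range(10)]
y = [0] * 10 + [1] * 10

models = {"majority_class": MajorityClass(),
          "nearest_centroid": ToyNearestCentroid()}
best_name, scores = select_best(models, X, y)
print(best_name, scores)
```

A single hold-out split gives a noisy estimate on small data sets, which is one reason the description elsewhere mentions cross-validation as an optional validation step.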
In some embodiments of the present disclosure, in selecting the supervised learning method with the highest metric from the plurality of supervised learning methods of step d), sub-step g), starting a loop to evaluate the plurality of learning methods, comprises the sub-steps of:
a. training the first supervised learning method of the plurality of methods using the training data set;
b. making predictions with the trained model on the validation data set;
c. evaluating the performance metric of the model of the first supervised learning method by calculating the accuracy on the validation set;
d. recording the accuracy obtained for the model of the first supervised learning method;
e. repeating sub-steps a. to d. for all the supervised learning methods of the plurality of methods.
Hardware Architecture:
The computer-implemented method for obtaining disease risk prediction data of the present disclosure may be implemented on one or more computing systems, wherein the computing system may be implemented at least in part on networks, in the cloud, and/or as a machine or set of machines (e.g., a computing machine, server, mobile computing device, computer cluster, etc.) configured to receive a computer-readable medium that stores computer-readable instructions and is capable of storing the instructions for the method.
Two hardware and network architectures for implementing the computer-implemented method for obtaining disease risk prediction data of the present disclosure are presented below.
In one embodiment of the present disclosure, in step b) of the computer-implemented method for obtaining disease risk prediction data, the system that performs the method is a supercomputer (ASIMOV) with the following characteristics: 254 TB of storage, InfiniBand connectivity of 8 GBbps, and 624 CPU cores with 4.8 TB of RAM, for a double-precision computing capacity of 17 TFLOPS.
In another embodiment of the present disclosure, in step b) of the computer-implemented method for obtaining disease risk prediction data, the system that performs the method is the TAYRA supercomputer, which has 532 TB of storage, InfiniBand connectivity of 52 Gbps, and 1168 CPU cores with 8 TB of RAM. The double-precision computing capacity of TAYRA is calculated at 102.96 TFLOPS, with GPU nodes containing 59904 available cores.
Example: In an example of the present disclosure, the method that generates the preprocessed genomic DNA data in sub-step b2) processes the genomic DNA data evaluated in sub-step a2) by trimming, cleaning, filtering and removing unwanted sequences from the sequencing data. Likewise, the method that generates preprocessed microbiota DNA data in sub-step e3), which trims adapters and unwanted sequences from the sequencing reads of the microbiota DNA data resulting from sub-step d3), obtains the following adapter sequences.
>Illumina Single End Adapter 1
ACACTCTTTCCCTACACGACGCTGTTCCATCT
>Illumina Single End Adapter 2
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>Illumina Single End PCR Primer 1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Single End PCR Primer 2
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>Illumina Single End Sequencing Primer
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Paired End Adapter 1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Paired End Adapter 2
CTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina Paired End PCR Primer 1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Paired End PCR Primer 2
CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>Illumina Paired End Sequencing Primer 1
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Paired End Sequencing Primer 2
CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT
>Illumina DpnII expression Adapter 1
ACAGGTTCAGAGTTCTACAGTCCGAC
>Illumina DpnII expression Adapter 2
CAAGCAGAAGACGGCATACGA
>Illumina DpnII expression PCR Primer 1
CAAGCAGAAGACGGCATACGA
>Illumina DpnII expression PCR Primer 2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina DpnII expression Sequencing Primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
>Illumina NlaIII expression Adapter 1
ACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina NlaIII expression Adapter 2
CAAGCAGAAGACGGCATACGA
>Illumina NlaIII expression PCR Primer 1
CAAGCAGAAGACGGCATACGA
>Illumina NlaIII expression PCR Primer 2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina NlaIII expression Sequencing Primer
CCGACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina Small RNA Adapter 1
GTTCAGAGTTCTACAGTCCGACGATC
>Illumina Small RNA Adapter 2
TCGTATGCCGTCTTCTGCTTGT
>Illumina Small RNA RT Primer
CAAGCAGAAGACGGCATACGA
>Illumina Small RNA PCR Primer 1
CAAGCAGAAGACGGCATACGA
>Illumina Small RNA PCR Primer 2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina Small RNA Sequencing Primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
>Illumina Multiplexing Adapter 1
GATCGGAAGAGCACACGTCT
>Illumina Multiplexing Adapter 2
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Multiplexing PCR Primer 1.01
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Multiplexing PCR Primer 2.01
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>Illumina Multiplexing Read1 Sequencing Primer
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Illumina Multiplexing Index Sequencing Primer
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>Illumina Multiplexing Read2 Sequencing Primer
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>Illumina PCR Primer Index 1
CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTC
>Illumina PCR Primer Index 2
CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTC
>Illumina PCR Primer Index 3
CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTC
>Illumina PCR Primer Index 4
CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTC
>Illumina PCR Primer Index 5
CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTC
>Illumina PCR Primer Index 6
CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTC
>Illumina PCR Primer Index 7
CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTC
>Illumina PCR Primer Index 8
CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTC
>Illumina PCR Primer Index 9
CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTC
>Illumina PCR Primer Index 10
CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTC
>Illumina PCR Primer Index 11
CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTC
>Illumina PCR Primer Index 12
CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTC
>Illumina DpnII Gex Adapter 1
GATCGTCGGACTGTAGAACTCTGAAC
>Illumina DpnII Gex Adapter 1.01
ACAGGTTCAGAGTTCTACAGTCCGAC
>Illumina DpnII Gex Adapter 2
CAAGCAGAAGACGGCATACGA
>Illumina DpnII Gex Adapter 2.01
TCGTATGCCGTCTTCTGCTTG
>Illumina DpnII Gex PCR Primer 1
CAAGCAGAAGACGGCATACGA
>Illumina DpnII Gex PCR Primer 2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina DpnII Gex Sequencing Primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
>Illumina NlaIII Gex Adapter 1.01
TCGGACTGTAGAACTCTGAAC
>Illumina NlaIII Gex Adapter 1.02
ACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina NlaIII Gex Adapter 2.01
CAAGCAGAAGACGGCATACGA
>Illumina NlaIII Gex Adapter 2.02
TCGTATGCCGTCTTCTGCTTG
>Illumina NlaIII Gex PCR Primer 1
CAAGCAGAAGACGGCATACGA
>Illumina NlaIII Gex PCR Primer 2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina NlaIII Gex Sequencing Primer
CCGACAGGTTCAGAGTTCTACAGTCCGACATG
>Illumina Small RNA RT Primer
CAAGCAGAAGACGGCATACGA
>Illumina 5p RNA Adapter
GTTCAGAGTTCTACAGTCCGACGATC
>Illumina RNA Adapter 1
TCGTATGCCGTCTTCTGCTTGT
>Illumina Small RNA 3p Adapter 1
ATCTCGTATGCCGTCTTCTGCTTG
>Illumina Small RNA PCR Primer 1
CAAGCAGAAGACGGCATACGA
>Illumina Small RNA PCR Primer 2
AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA
>Illumina Small RNA Sequencing Primer
CGACAGGTTCAGAGTTCTACAGTCCGACGATC
>TruSeq Universal Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeq Adapter, Index 1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 3
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 4
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 5
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 7
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 8
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 9
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 10
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 11
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG
>TruSeq Adapter, Index 12
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
>Illumina RNA RT Primer
GCCTTGGCACCCGAGAATTCCA
>Illumina RNA PCR Primer
AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA
>RNA PCR Primer, Index 1
CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 2
CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 3
CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 4
CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 5
CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 6
CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 7
CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 8
CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 9
CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 10
CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 11
CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 12
CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 13
CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 14
CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 15
CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 16
CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 17
CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 18
CAAGCAGAAGACGGCATACGAGATGCGGACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 19
CAAGCAGAAGACGGCATACGAGATTTTCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 20
CAAGCAGAAGACGGCATACGAGATGGCCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 21
CAAGCAGAAGACGGCATACGAGATCGAAACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 22
CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 23
CAAGCAGAAGACGGCATACGAGATCCACTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 24
CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 25
CAAGCAGAAGACGGCATACGAGATATCAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 26
CAAGCAGAAGACGGCATACGAGATGCTCATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 27
CAAGCAGAAGACGGCATACGAGATAGGAATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 28
CAAGCAGAAGACGGCATACGAGATCTTTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 29
CAAGCAGAAGACGGCATACGAGATTAGTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 30
CAAGCAGAAGACGGCATACGAGATCCGGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 31
CAAGCAGAAGACGGCATACGAGATATCGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 32
CAAGCAGAAGACGGCATACGAGATTGAGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 33
CAAGCAGAAGACGGCATACGAGATCGCCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 34
CAAGCAGAAGACGGCATACGAGATGCCATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 35
CAAGCAGAAGACGGCATACGAGATAAAATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 36
CAAGCAGAAGACGGCATACGAGATTGTTGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 37
CAAGCAGAAGACGGCATACGAGATATTCCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 38
CAAGCAGAAGACGGCATACGAGATAGCTAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 39
CAAGCAGAAGACGGCATACGAGATGTATAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 40
CAAGCAGAAGACGGCATACGAGATTCTGAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 41
CAAGCAGAAGACGGCATACGAGATGTCGTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 42
CAAGCAGAAGACGGCATACGAGATCGATTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 43
CAAGCAGAAGACGGCATACGAGATGCTGTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 44
CAAGCAGAAGACGGCATACGAGATATTATAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 45
CAAGCAGAAGACGGCATACGAGATGAATGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 46
CAAGCAGAAGACGGCATACGAGATTCGGGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 47
CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RNA PCR Primer, Index 48
CAAGCAGAAGACGGCATACGAGATTGCCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>ABI Dynabead EcoP Oligo
CTGATCTAGAGGTACCGGATCCCAGCAGT
>ABI Solid3 Adapter A
CTGCCCCGGGTTCCTCATTCTCTCAGCAGCATG
>ABI Solid3 Adapter B
CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT
>ABI Solid3 5' AMP Primer
CCACTACGCCTCCGCTTTCCTCTCTATG
>ABI Solid3 3' AMP Primer
CTGCCCCGGGTTCCTCATTCT
>ABI Solid3 EF1 alpha Sense Primer
CATGTGTGTTGAGAGCTTC
>ABI Solid3 EF1 alpha Antisense Primer
GAAAACCAAAGTGGTCCAC
>ABI Solid3 GAPDH Forward Primer
TTAGCACCCCTGGCCAAGG
>ABI Solid3 GAPDH Reverse Primer
CTTACTCCTTGGAGGCCATG
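To illustrate how an adapter list like the one above might be used, the following is a minimal sketch of 3' adapter trimming in Python. It is not the trimming tool of the disclosure, which is not named in this passage: the function names, the `min_overlap` parameter and the example read are assumptions, and only exact matches are handled (real trimmers also tolerate mismatches and use base qualities).

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line.replace(" ", "")
    return records


def trim_3prime_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read (exact matching only).

    A full copy of the adapter anywhere in the read removes it and everything
    after it; otherwise a prefix of the adapter of at least `min_overlap`
    bases at the very end of the read is removed.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# One entry taken from the list above; the read itself is made up.
adapters = parse_fasta(">Illumina Multiplexing Adapter 1\nGATCGGAAGAGCACACGTCT\n")
adapter = adapters["Illumina Multiplexing Adapter 1"]
fragment = "ACGTTGCAAGGCTTAC"  # hypothetical biological insert

full_hit = trim_3prime_adapter(fragment + adapter + "AAAA", adapter)
partial_hit = trim_3prime_adapter(fragment + adapter[:8], adapter)
no_hit = trim_3prime_adapter(fragment, adapter)
print(full_hit, partial_hit, no_hit)
```

The `min_overlap` threshold trades sensitivity against the risk of clipping genuine bases that happen to match the first few bases of an adapter.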
In an example of the present disclosure, the method that generates preprocessed microbiota DNA data in sub-step i3), which filters by taxonomic and functional statistical criteria the microbiota DNA data sequences resulting from sub-step h3), removes the taxa or functions of the following table.
In an example of the present disclosure, the selection of the learning method with the best performance metric for predicting disease risk data is presented in Table 1, which compares the best models trained for a particular data set; the performance metric is accuracy.
For the purposes of the present disclosure, the performance metric is accuracy, defined as the percentage of true positives. In the example of Table 1, the best model for predicting disease risk data where the disease is cardio-metabolic disease is the one trained on clinical data from a population of 60 patients; this model provides an accuracy of 76.08% and corresponds to an LGBMClassifier. The best model for predicting disease risk data where the disease is diabetes, on the other hand, is the one trained on preprocessed genomic DNA data.
With data from a population of, for example, 60 patients, this model provides an accuracy of 88.88% and corresponds to a DecisionTreeClassifier. To find the risk probability, the joint probability of the probabilities provided by the two models mentioned above is therefore computed, so that the method delivers the probability that the patient belongs to each of the following categories: prediabetes without dyslipidemia, prediabetes with dyslipidemia, diabetes without dyslipidemia, and diabetes with dyslipidemia.
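The combination of the two model outputs into the four categories can be sketched as follows. The text does not say how the joint probability is computed, so this sketch makes two labeled assumptions: the two models are treated as statistically independent (so the joint probability is a product), and each model is read as returning one probability, `p_diabetes` (diabetes rather than prediabetes) and `p_dyslipidemia` (dyslipidemia present). The function name and the input values are hypothetical.

```python
def joint_risk(p_diabetes, p_dyslipidemia):
    """Combine two model probabilities into four category probabilities.

    Assumes independence between the two models' outputs, which is an
    illustrative simplification and not something the text states.
    """
    for p in (p_diabetes, p_dyslipidemia):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return {
        "prediabetes without dyslipidemia": (1 - p_diabetes) * (1 - p_dyslipidemia),
        "prediabetes with dyslipidemia": (1 - p_diabetes) * p_dyslipidemia,
        "diabetes without dyslipidemia": p_diabetes * (1 - p_dyslipidemia),
        "diabetes with dyslipidemia": p_diabetes * p_dyslipidemia,
    }


# Made-up model outputs for a single patient.
risk = joint_risk(p_diabetes=0.5, p_dyslipidemia=0.5)
for category, probability in risk.items():
    print(f"{category}: {probability:.2f}")
```

Under the independence assumption the four probabilities always sum to 1, so they can be reported directly as a distribution over the categories.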
In this particular example, in step d) the learning method with the best performance metric for disease risk data where the disease is cardio-metabolic disease is a gradient-boosting classifier that uses LightGBM (LGBMClassifier), and the learning method with the best performance metric for disease risk data where the disease is diabetes is a classifier based on decision trees (DecisionTreeClassifier).
Una vez obtenidos los modelos de mayor precisión proporcionan, se procede a seleccionar de estos modelos las variables les da relevancia, con el fin de reducir más el número de variables a partir de las cuales se hará la predicción, obteniendo los siguientes resultados: Once the most accurate models have been obtained, the variables that are considered relevant are selected from these models, in order to further reduce the number of variables from which the prediction will be made, obtaining the following results:
Relevant variables for the best DecisionTreeClassifier, which obtains disease risk prediction data where the disease is dyslipidemia, from preprocessed clinical data:
- Age
- Basal glucose
- Calcium
- Iron
- CRP (C-reactive protein; "PCR" in the Spanish original)
- APO1
- APO2
- APOB
- Thiamine
- Riboflavin
- Niacin
- Pyridoxine
- Ascorbic acid (Ac Asc)
- Potassium
- Chloride
- Zinc
- TNFa
- TSH
- T4T
- Glucagon
- HGH
- Serum creatinine
- Non-HDL cholesterol
- HOMA index
Relevant variables for the best LEVEARS VC model, which obtains disease risk prediction data where the disease is diabetes, from preprocessed clinical data:
- Age
- Basal glucose
- Calcium
- Iron
- CRP (C-reactive protein; "PCR" in the Spanish original)
- APO1
- APO2
- APOB
- Thiamine
- Riboflavin
- Niacin
- Pyridoxine
- Ascorbic acid (Ac Asc)
- Potassium
- Chloride
- Zinc
- TNFa
- TSH
- T4T
- Glucagon
- HGH
- Serum creatinine
- Non-HDL cholesterol
- HOMA index
Relevant variables for the best model (ADABOOSTCLASS), which obtains disease risk prediction data where the disease is dyslipidemia, from preprocessed human genome DNA data:
11_47625686_A_G; mitochondrial_carrier_2
17_44352133_G_A; granulin_precursor
X_1403432_C_T; acetylserotonin_O-methyltransferase_like
22_35393515_C_T; heme_oxygenase_1
3_15645186_G_C; biotinidase
1_152311665_C_T; filaggrin
17_5033729_G_C; solute_carrier_family_52_member_1
12_98537591_C_G; thymopoietin
2_15308262_G_A; NBAS_subunit_of_NRZ_tethering_complex
X_38286942_A_T; retinitis_pigmentosa_GTPase_regulator
19_12847888_G_T; microtubule_associated_serine/threonine_kinase_1
4_69200706_T_C; UDP_glucuronosyltransferase_family_2_member_B11
5_21751766_C_A; cadherin_12
17_78459117_C_T; dynein_axonemal_heavy_chain_17
16_634579_C_T; methyltransferase_like_26
16_55826162_T_A; carboxylesterase_1
15_40775037_C_A; DnaJ_heat_shock_protein_family
14_20144294_C_G; olfactory_receptor_family_4_subfamily_N_member_5
Relevant variables for the best LGBMClassifier, which obtains disease risk prediction data where the disease is diabetes, from preprocessed genomic DNA data:
21_36225611_C_G; DOP1_leucine_zipper_like_protein_B
14_37841528_A_G; tetratricopeptide_repeat_domain_6
16_634579_C_T; methyltransferase_like_26
21_39779236_T_C; immunoglobulin_superfamily_member_5
9_83311898_A_T; FERM_domain_containing_3
11_94104605_C_A; hephaestin_like_1
2_217893477_T_G; tensin_1
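The genomic variables listed above appear to follow a `CHROM_POS_REF_ALT; gene_name` layout. Under that assumption, a small parser can split each entry into its parts. The names `VariantFeature` and `parse_variant_feature` are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class VariantFeature(NamedTuple):
    chrom: str  # chromosome label, e.g. "17" or "X"
    pos: int    # position on the chromosome
    ref: str    # reference allele
    alt: str    # alternate allele
    gene: str   # gene description that follows the semicolon


def parse_variant_feature(entry: str) -> VariantFeature:
    """Parse 'CHROM_POS_REF_ALT; gene_name' into its components.

    Raises ValueError if the variant identifier does not have exactly
    four underscore-separated parts or the position is not an integer.
    """
    variant_id, _, gene = entry.partition(";")
    chrom, pos, ref, alt = variant_id.strip().split("_")
    return VariantFeature(chrom, int(pos), ref, alt, gene.strip())


feature = parse_variant_feature("17_44352133_G_A; granulin_precursor")
print(feature.chrom, feature.pos, feature.ref, feature.alt, feature.gene)
```

Splitting only the identifier on underscores, and not the gene description, matters here because the gene descriptions themselves contain underscores.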
Relevant variables (TAXID) for the best model (logistic regression), or for the DecisionTreeClassifier, which obtains disease risk prediction data where the disease is dyslipidemia, from preprocessed microbiota DNA data:
28127,28111,28119,214856,1970189,2893885,2654843,2602769,1016,1017,1018,1855336,
1908341,1888915,111500,1383885,516051,2749995,1714849,480520,702745,59600,
76594,2058134,2908210,1736674,1850246,2905121,398743,762954,2675331,2487064,
2594269,250,2852098,2820270,28251,336810,2932251,2933777,2496028,400092,
2745197,2520506,2735870,388413,2907623,28454,423351,1234841,84567,2714940,984,
2029983,1813871,1868325,1379270,833,1160721,438033,2834348,301301,2479767,
1737424,2763667,2068655,1501,94869,36845,1734049,2811778,1348613,863643,31899,
712528,907,2217832,2682541,2499213,483913,2936682,2072025,284581,2567941,
450367,2837508,2932257,2932256,2303505,33936,2925845,562959,128574,2950604,
29379,308354,45972,2892440,1293,1461582,407035,2213202,2233542,1598147,76853,
417367,33987,360911,340146,29330,1637,1844999,1624,2099789,2304606,1612,1590,
60520,57037,2767885,53444,51664,76860,150055,1366,2816912,2420313,417368,
128827,2899121,100886,162289,1287640,1871025,2483401,104336,57043,2614639,683042,
2781962,110932,150123,1804990,2499157,2079227,2879621,37928,2211210,156980,
2590774,257984,2735316,1667168,2805590,1331736,571913,2714931,1276,482462,
1773,486698,85693,39695,1716,169292,1223514,1231000,1697,1718,85025,1824,
2567884,2885078,1570939,2907624,334542,2053,1004901,322509,2751189,2072505,
1188315,2913412,2930049,2782004,206662,2563602,1892,1077946,1920,2710756,68570,
2686304,146923,67282,66892,53451,1690221,1437453,1940,1886,2126346,2792977,
78259,2017486,2898796,2763008,2712223,2040,1871034,700274,103731,946334,
113562,2081702,59505,280236,53522,2820884,573600,2933797,281472,2844380,1944646,
49319,2751170,59932,155977,77021,752201,577489,1173026,1150,1155739,2974039,
2596745,1394889,33071,1763363,2967302,51365,171281,2094,2112,2115,2923352,
2132,216937,47834,229545,2124,2098,45363,502394,274,56957,1839801,244366,2587529,
590,1330547,1505597,2945587,1199245,1920114,2681307,2675791,2675778,
2153385,565,2906475,2872648,82987,61651,40576,584,102862,2218628,204042,2108399,
2866807,1028989,2804761,2049589,2895473,2815936,658642,2870860,118613,658630,
2871095,2837969,2745519,2895474,1173283,76758,46257,1190415,1499686,364197,
1788301,2518644,1306993,47886,2859001,472181,2304594,2072412,2925843,
2904253,69,346,343,56459,29447,2911538,231455,1542730,2010829,2021234,2806008,
404011,256839,2490635,227,206042,1777491,2686359,2937286,2746231,2730360,2733487,
2883106,2854257,2497861,1897729,1081866,44935,33074,204286,2758724,2014542,
1094342,1917421,2819101,663,300876,265668,664643,190897,552386,170651,
680026,51366,1908198,2714110,1646498,2905879,471,1871111,1789224,1148157,476,
1699623,45610,330922,2819280,2182432,1046,1227,1049,521689,1763998,2778066,
650,648,73010,43948,1903694,75984,2679994,716,2591606,435905,2909669,2601894,
630749,254246,1249552,465721,2908648,986106,1561924,2183911,356,2020312,399,
1335061,28105,375549,1390132,1325111,44255,722472,475937,1395974,2592814,
68287,2744521,675281,31998,2282523,
408,388408,2759660,85701,45401,717785,279,2419844,1701758,533,708113,444444,
2928472,2972485,2898433,2219696,304378,152682,2878545,2780074,164608,205844,
2026624,2867233,450378,361183,2338327,2055955,2862331,82367,2867026,2785912,
2867015,2293862,302485,92945,1579316,265959,257438,2561924,2752515,293,
75,770,142058,89586,54526,2829597,2494234,488731,488729,57975,2735433,311230,
2839983,68895,93218,2770234,556054,1819725,1827195,2968475,2966554,12916,1858609,
80869,1484693,1649468,2769491,2045208,279058,1644131,72557,1697043,463014,
1416806,507,2953809,2975441,215580,490,492,2917790,1499392,748247,2815343,
1565605,35798,1231,63745,453161,233181,872,57320,2910984,2794998,2813578,
161492,56,345632,2897342,890,897,65555,181663,2358,213849,199,1031542,28898,1448857,
2808963,28196,1355374,1032072,39766,202747,1591088,135569,1581011,194424,
191291,1921087,1796921,2527964,2527996,2483368,85991,136,53419,221027,252967,
44449,1287055,171,174,856,285729,712368,187101,1617967,940615,940614,2703788,
81462,108007,81468,188709,1643949,1184387,2093824,1936990,180,28262,228745,
118000,1295609,171695,936456,9606,2235,869886,2931977,203135,2961595,2961571,
62320,69525,1853699,1017351,699433,2210,2215,33865,2202,2224,2731220,
59277,83171,155863,172049,342948,49899,2261,121277,312539,229980,1459637,50339,
1673428,74968,1211417,2772059,2948922,2734126,2842979,2843722,2955842,2733124,
2732595,10986,2843875,624186,1186051,2686210,2571156,2683193,669357,2448483,
1206777,2587597,718,876364,2709685,2883236,1779135,35817,2862870,537874,
1914850,2846100,1508644,2282309,145262,2732782,2844165,2039639,893,327511,
10,1105106,2955533,2842820,331278,2560124,10242,60919,701045,1177214,2935858,
2745482,1918522,2842586,2844166,2560370,2734137,1513458,83442,1437364,1982151,
2845430,10317,10298,154334,1105171,2816909,2006134,2560269,10385,97195,
2501923,1338689,1655644,1918012,150830,1921705,2732689,346883,2845136,129727,
2912629,2843418,1923237,1188795,1981162,1623289,10381,1125677,2956144,
398041,1873778,1920779,1720498,1245890,1921119,1972683,2845481.
Relevant variables (TAXID) for the best model (LGBMClassifier), which obtains disease risk prediction data where the disease is diabetes, from preprocessed microbiota DNA data:
171549,28127,815,818,371601,47678,820,817,46506,2650157,357276,387090,2093856,
28118,2585118,328814,328813,375288,46503,328812,216851,2929491,2929493,2929494,
2929492,1160721,2831966,2564099,2093857,1550024,301301,418240,45851,751585,
46228,2763672,39485,437897,39778,543,547,881260,570,729,901,239935.
Once the most relevant variables for each of the respective models have been obtained, in an additional stage the supervised learning methods (in this example, LGBMClassifier and DecisionTreeClassifier) are trained again using only the relevant variables, in order to check whether the accuracy provided by each model remains the same, increases, or decreases.
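The bookkeeping around this additional stage can be sketched without committing to a particular learner: restrict each sample to the relevant variables, retrain with the chosen classifier, and compare accuracy before and after. The helpers below (`select_relevant_columns`, `accuracy`, `accuracy_change`) are hypothetical names for illustration, and the training step itself is deliberately left out.

```python
def select_relevant_columns(rows, relevant):
    """Keep only the relevant variables of each sample (a dict of name -> value).

    Raises KeyError if a sample lacks one of the relevant variables.
    """
    return [{name: row[name] for name in relevant} for row in rows]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_change(acc_all, acc_relevant, tol=1e-9):
    """Report how accuracy moved after retraining on the reduced variable set."""
    if abs(acc_relevant - acc_all) <= tol:
        return "unchanged"
    return "increased" if acc_relevant > acc_all else "decreased"
```

In use, `select_relevant_columns` would be applied to both the training and the test samples before the classifier is fitted again, and `accuracy_change` would be called with the accuracy of the full-variable model and that of the reduced-variable model.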
In any of the embodiments of this document, the identification of metabolic profiles that affect health and disease risk may relate to, without being limited to, the following: energy metabolism; food allergies; the influence of diet on metabolic status; oxidative stress; detoxification; bone metabolism; carbohydrate metabolism; vascular health; cognitive health; behavioral disorders; satiety and appetite pathways; response to exercise; metabolic pathways associated with absorption; monitoring and efficacy of changes in lifestyle, diet, and supplementation; and diseases such as chronic inflammation, atherosclerosis, stroke, multiple sclerosis, Alzheimer's disease, arthritis, inflammatory bowel disease, Crohn's disease, ulcerative colitis, celiac disease, pernicious anemia, sinusitis, obesity, non-alcoholic fatty liver disease, chronic kidney disease, dyslipidemia, and eating disorders, among others.
GLOSSARY:
Disease risk prediction data: computational data corresponding to an overall measure of a person's risk of developing a disease relative to the general population, based on a genetic, clinical, biological, or other type of marker.
Genomic DNA data:
Digital information that describes specific aspects of an organism's genetic makeup. This data can take various forms, such as a nucleotide sequence represented as a string of letters (A, T, G, C), genomic annotations indicating the structure and location of genes, genetic variants such as single nucleotide polymorphisms (SNPs), and expression data reflecting the relative abundance of messenger RNA under different conditions. It also includes information on epigenetic modifications, sequencing quality, and other relevant attributes. This data can be stored in specific formats such as FASTQ, FASTA, VCF, BAM, or expression matrices.
Microbiota DNA data: the genetic information obtained by sequencing the DNA of the microorganisms present in a biological sample, specifically in the context of the microbiota, for example from the gut. The microbiota is the community of microorganisms, such as bacteria, fungi, viruses, and other microbes, that coexist in a particular environment, such as the gut, skin, mouth, or other sites of the human body or of other organisms. This data can be stored in specific formats such as FASTQ, FASTA, QIIME, BIOM, SRA, VCF, or BAM.
Categorical variables: attributes that classify variants into discrete, non-numerical categories. These categories describe specific characteristics of variants, such as their type (SNP, indel), their functional impact (synonymous, non-synonymous), their location in the genome (exon, intron), and other relevant aspects. These variables allow variants to be organized and characterized in a meaningful way.
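Because most supervised learning methods expect numeric inputs, categorical variables such as these are commonly converted to indicator columns before training. The following is a minimal sketch of one-hot encoding; the source does not state which encoding the method uses, so the technique and the function name `one_hot_encode` are assumptions made for illustration.

```python
def one_hot_encode(records, columns):
    """One-hot encode the given categorical columns of a list of dict records.

    Returns (feature_names, matrix), with one 0/1 column per (column, category)
    pair; categories are ordered alphabetically within each column.
    """
    categories = {c: sorted({r[c] for r in records}) for c in columns}
    feature_names = [f"{c}={v}" for c in columns for v in categories[c]]
    matrix = [
        [1 if r[c] == v else 0 for c in columns for v in categories[c]]
        for r in records
    ]
    return feature_names, matrix


# Illustrative variant annotations using the categories named in the glossary.
variants = [
    {"type": "SNP", "location": "exon"},
    {"type": "indel", "location": "intron"},
]
names, matrix = one_hot_encode(variants, ["type", "location"])
print(names)
print(matrix)
```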
SAM/BAM format:
SAM (Sequence Alignment/Map): a file format used to represent information about the alignment of DNA sequences against a reference genome. It contains detailed information about each read, including its sequence, base quality, alignment position, and more.
BAM (Binary Alignment/Map): the binary version of the SAM format. Whereas SAM is plain text and human-readable, BAM is more efficient in terms of storage and processing because it is binary.
VCF (Variant Call Format) file: a standard file format used to represent information about genetic variants, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and other types of variants, in genomic sequencing data. The VCF format was designed to store detailed information about genetic variants in an efficient, structured way.
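To make the structure concrete, the sketch below parses the eight fixed, tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and builds an identifier in the `CHROM_POS_REF_ALT` style. The helper names are hypothetical, and the example line, including its QUAL and FILTER values, is an illustrative placeholder rather than data from this document.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 tab-separated columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if vid == "." else vid,      # "." marks a missing value
        "ref": ref,
        "alt": alt.split(","),                  # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info,
    }


def variant_key(record):
    """Build a CHROM_POS_REF_ALT key from the first ALT allele."""
    return f"{record['chrom']}_{record['pos']}_{record['ref']}_{record['alt'][0]}"


record = parse_vcf_line("17\t44352133\t.\tG\tA\t50\tPASS\t.")
print(variant_key(record))
```

Header lines, which start with `#`, are not handled here and would need to be skipped before calling `parse_vcf_line`.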
FASTQ file: a file format used in bioinformatics to store sequencing data from DNA, RNA, or other types of biological molecules. It is commonly used to represent the reads obtained from next-generation sequencing (NGS) technologies.
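As an illustration of the format, the sketch below reads FASTQ records assuming the common four-line layout (header starting with `@`, sequence, separator starting with `+`, quality string) and decodes qualities under the Phred+33 convention. Both the layout restriction and the Phred+33 offset are assumptions of this sketch, and the function names are hypothetical.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes four lines per record (no wrapped sequences).
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError(f"truncated FASTQ record near {header!r}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred33(quality):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


for read_id, seq, qual in read_fastq(["@read1\n", "ACGT\n", "+\n", "IIII\n"]):
    print(read_id, seq, phred33(qual))
```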
TAXID: an abbreviation of "Taxonomy ID". The term is commonly used in the context of biological and genomic databases, especially in relation to the taxonomy system that organizes and classifies organisms. A TAXID is a unique numerical identifier associated with a specific node in the hierarchy of the biological taxonomy, and it is used to uniquely identify different organisms in biological databases and resources.
OTUs: "Operational Taxonomic Units". In the context of microbiology and DNA sequencing, OTUs are a way of grouping similar gene sequences, usually ribosomal gene sequences, into categories that represent taxonomic units at a specific level, such as species or genus. This approach is commonly used in microbiome and metagenomics studies to analyze the diversity and composition of microbial communities.
OTUs are created by clustering similar genomic sequences, and the degree of similarity required to group sequences into an OTU is set by a predefined threshold. This threshold can vary depending on the study and the technique used.
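The threshold-based grouping can be illustrated with a simple greedy scheme: each sequence joins the first OTU whose representative it matches at or above the threshold, and otherwise founds a new OTU. This is a toy sketch under strong assumptions, not the clustering used in the source: it compares equal-length sequences position by position instead of aligning them, and the default threshold of 0.97 is a commonly used convention rather than a value given in this document.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch compares non-empty, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(sequences, threshold=0.97):
    """Greedily assign sequences to OTUs by identity to each OTU representative.

    Returns a list of (representative, members) pairs in order of creation.
    """
    otus = []
    for seq in sequences:
        for representative, members in otus:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No existing OTU is similar enough, so this sequence founds one.
            otus.append((seq, [seq]))
    return otus


otus = cluster_otus(["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"], threshold=0.9)
print(len(otus))
```

Greedy clustering depends on input order, since the first sequence of each group becomes its representative; production tools typically sort sequences, for example by abundance, before clustering.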
It should be understood that the present disclosure is not limited to the embodiments described and illustrated, since, as will be evident to a person skilled in the art, there are possible variations and modifications that do not depart from the spirit of the disclosure, which is defined by the following claims.
Metabolic profiles: the comprehensive characterization of the metabolic processes that take place in an organism. These profiles provide detailed information on how nutrients are metabolized, how energy is generated and used, and how the different biochemical components interact within the body. Metabolic profiles can include data on enzyme activity, metabolite levels, and other biochemical indicators that help in understanding an individual's metabolic status. They are valuable in medical research, personalized nutrition, and understanding the biological basis of various health conditions, since they offer a detailed view of the underlying biochemical processes in the body.
Predictive disease risk profiles: assessments that combine clinical, biomedical, and sometimes genetic data to identify and quantify the risk factors that may increase the likelihood of developing a specific disease. These profiles seek to predict an individual's risk for particular health conditions, such as heart disease, diabetes, cancer, or other conditions. The information collected may include medical history, lifestyle habits, medical test results, and relevant biological markers. Applying predictive risk profiles allows healthcare professionals to personalize preventive and early-intervention strategies, facilitating a more proactive approach to health and well-being.
Claims
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CONC2023/0017721A CO2023017721A1 (en) | 2023-12-18 | 2023-12-18 | Method for obtaining data for predicting the risk of metabolic disease |
| CONC2023/0017721 | 2023-12-18 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025133869A2 (en) | 2025-06-26 |
Family
ID=96138901
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2024/062681 Pending WO2025133869A2 (en) | 2023-12-18 | 2024-12-16 | Method for obtaining data for predicting the risk of metabolic disease |
Country Status (2)
| Country | Link |
|---|---|
| CO (1) | CO2023017721A1 (en) |
| WO (1) | WO2025133869A2 (en) |
- 2023-12-18: CO application CONC2023/0017721A, published as CO2023017721A1 (en); status unknown
- 2024-12-16: WO application PCT/IB2024/062681, published as WO2025133869A2 (en); status active, pending
Also Published As
| Publication number | Publication date |
|---|---|
| CO2023017721A1 (en) | 2025-06-26 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Hao et al. | Dictionary learning for integrative, multimodal and scalable single-cell analysis | |
| US20210327534A1 (en) | Cancer classification using patch convolutional neural networks | |
| US10402748B2 (en) | Machine learning methods and systems for identifying patterns in data | |
| Boileau et al. | Exploring high-dimensional biological data with sparse contrastive principal component analysis | |
| Heiser et al. | Automated quality control and cell identification of droplet-based single-cell data using dropkick | |
| CN111913999B (en) | Statistical analysis method, system and storage medium based on multi-omics and clinical data | |
| CN107980162A (en) | Research proposal system and method based on combination | |
| Zare et al. | Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis | |
| CA3204451A1 (en) | Systems and methods for joint low-coverage whole genome sequencing and whole exome sequencing inference of copy number variation for clinical diagnostics | |
| Marcos-Zambrano et al. | A toolbox of machine learning software to support microbiome analysis | |
| JP7041614B2 (en) | Multi-level architecture for pattern recognition in biometric data | |
| Thibodeau et al. | A neural network based model effectively predicts enhancers from clinical ATAC-seq samples | |
| Doostparast Torshizi et al. | Graph-based semi-supervised learning with genomic data integration using condition-responsive genes applied to phenotype classification | |
| Zappia et al. | Feature selection methods affect the performance of scRNA-seq data integration and querying | |
| Wang et al. | Benchmarking automated cell type annotation tools for single-cell ATAC-seq data | |
| WO2023212563A1 (en) | Two competing guilds as core microbiome signature for human diseases | |
| KR20210145539A (en) | Providing method for health information based on microbiome and analysis apparatus | |
| US20240273359A1 (en) | Apparatus and method for discovering biomarkers of health outcomes using machine learning | |
| Ando et al. | Classification of gene expression profile using combinatory method of evolutionary computation and machine learning | |
| Krause et al. | Understanding the role of (advanced) machine learning in metagenomic workflows | |
| WO2025133869A2 (en) | Method for obtaining data for predicting the risk of metabolic disease | |
| Arulanandham et al. | Role of Data Science in Healthcare | |
| Patel et al. | Big data analytics of genomic and clinical data for diagnosis and prognosis of cancer | |
| Sauvé et al. | Baseline Acute Myeloid Leukemia Prognosis Models using Transcriptomic and Clinical Profiles by Studying the Impacts of Dimensionality Reductions and Gene Signatures on Cox-Proportional Hazard | |
| Liu et al. | scRCA: A Siamese network-based pipeline for annotating cell types using noisy single-cell RNA-seq reference data | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24906681; Country of ref document: EP; Kind code of ref document: A2 |