
WO2024209074A1 - Method for the calculation of the adverse prognoses risk score in respiratory viral disease infections, using host genomics - Google Patents

Method for the calculation of the adverse prognoses risk score in respiratory viral disease infections, using host genomics

Info

Publication number
WO2024209074A1
Authority
WO
WIPO (PCT)
Prior art keywords
score
risk
subject
variant
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2024/059367
Other languages
French (fr)
Inventor
Fabio Grandi
Valentina FAVALLI
Stefano CASALI
Antonio Novelli
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Bases SA
Bambino Gesu Childrens Hospital
Original Assignee
4Bases SA
Bambino Gesu Childrens Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Bases SA, Bambino Gesu Childrens Hospital
Publication of WO2024209074A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2537/00: Reactions characterised by the reaction format or use of a specific feature
    • C12Q2537/10: Reactions characterised by the reaction format or use of a specific feature the purpose or use of
    • C12Q2537/165: Mathematical modelling, e.g. logarithm, ratio
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/158: Expression markers

Definitions

  • the present invention provides methods and kits relating to the calculation of the risk score of adverse prognoses in SARS-CoV-2 infections or in other respiratory viral disease infections.
  • Figure 1 Algorithm for the calculation of the risk score.
  • Figure 2 Shows the Machine Learning modelling pipeline (from top to bottom).
  • SNP single nucleotide polymorphism
  • prognoses refers to the ability to discriminate subjects/patients with good/poor prognosis.
  • Variant score refers to the weight of genetic variants in the SNP panel, calculated as the number of alleles observed over the total number of alleles expected.
  • Gene score refers to the weight of genetic variants in each gene, calculated using a machine learning algorithm and using the genetic variants identified from the sequenced panel of entire genes.
  • Genotype score refers to the weighted average of the Gene scores calculated.
  • the inventors have identified a direct relation between the genetic basis of a patient and the patient's susceptibility to COVID-19 infection with disease severity, by evaluating the genetic profile of the subject infected by the virus SARS-CoV-2 and affected by a respiratory disease infection (COVID-19).
  • the inventors have developed a new method based on the calculation of the innate risk for the host (intended as person infected by a virus) to develop a severe response to the infection, up to hospitalization in intensive care unit (ICU) and death.
  • ICU intensive care unit
  • the method of the present invention is based on the evaluation of the genetic profile of the subject, through sequencing of a panel of 74 SNPs (Single Nucleotide Polymorphisms), identified from GWAS studies, and sequencing the entire sequence of 83 specific genes involved in the immune response. Those information are then integrated with basic information of the subject (age at infection and gender) in order to calculate the risk profile of a severe outcome, by using a machine learning algorithm.
  • SNPs Single Nucleotide Polymorphisms
  • the method is based on the analysis of specific genetic markers, named SNPs (single nucleotide polymorphisms), and of specific genes involved in the immune response and by an analysis algorithm specifically designed to integrate the basic information of a subject and therefore to calculate the risk profile of a severe/asymptomatic outcome of a respiratory disease patient infection.
  • SNPs single nucleotide polymorphisms
  • the advantage of the presented method is the possibility to create a genetic risk score specific for the subject, associated with the response to the viral infection.
  • the genetic markers used are not specific for the type of infection but for the host, as referred to the genetic profile of the subject, and therefore applicable to different types of respiratory viral infectious diseases, such as COVID-19, MERS, SARS or influenza.
  • the method of the present invention is therefore the first method able to identify a genetic score by considering the genotype of the patient and the genetic predisposition of the same.
  • An embodiment of the present invention is therefore a method in-vitro or ex-vivo for calculating the risk score of adverse prognosis in a respiratory viral infectious disease, comprising the following steps: a) extracting the genomic DNA from a biological sample of a subject; b) sequencing a panel of 74 single nucleotide polymorphisms (SNPs) reported in Table 1 and the entire coding region of the 83 genes reported in Table 3 from said DNA; c) identify the gender and the age of said subject; d) calculating the risk score of said subject by using a machine learning algorithm.
  • P is a real number between 0 and 1, obtained by the machine learning algorithm, in which 1 is the total predisposition of a subject to a severe outcome and 0 is the predisposition of a subject to be asymptomatic.
  • the term “P” stands for the predisposition of the subject to have an adverse prognosis in a respiratory viral infectious disease.
  • P is calculated by the machine learning algorithm with the following steps:
  • the method of the present invention further comprises step e) wherein the subject is classified at high, medium and low risk of developing a severe infectious disease, based on the calculated risk score.
  • the subject affected by a respiratory infectious disease is classified as: subject at high risk of developing a severe disease, subject at medium risk of developing a severe disease and subject at a low risk of developing a severe disease and/or asymptomatic subject.
  • a subject with a risk score greater than 70% is considered at high risk of developing a severe disease
  • a subject with a risk score between 30% and 70% is considered at a medium risk of developing a severe disease
  • a subject with a risk score below 30% is considered at a low risk of developing a severe disease and/or asymptomatic.
  • the panel of 74 SNPs disclosed in Table 1 was herein specifically selected in silico by the inventors after the evaluation of different sources:
  • the panel of 83 genes disclosed in Table 3 was selected by the inventors through an in silico and a clinical evaluation on clinical exome data of genes involved in virus replication, innate immune response, IFN pathway and membrane receptors (19).
  • the genomic DNA isolated in step a) can be extracted with methods and protocols for DNA extraction well known in the art. Said methods are preferably selected from QIAamp DNA Blood Kits, QIAamp DNA Mini Kits, Maxwell RSC Blood Kit, Maxwell® CSC Genomic DNA Kit.
  • the sequencing of the 74 SNPs of step b) is made by using the primers listed in Table 2.
  • the sequencing of the 74 SNPs of step b) is made by using the primers having the sequence SEQ ID N. 1-114.
  • the sequencing of the 83 genes of step b) is made by using specific probes suitable for sequencing the entire coding region of said genes.
  • the sequencing of said 83 genes is made by using probes that cover the gene target regions listed in Table 4.
  • the methods for DNA sequencing disclosed in step b) are selected from the next-generation sequencing methods and instruments known in the art, such as Illumina platforms, Ion-Torrent, Nanopore, GeneRead, PacBio or MGI.
  • said methods of sequencing the DNA are selected from amplicon method, capture method, enrichment method, pyrosequencing, incorporation of nucleotides, semiconductor technologies, nanopore real time reading or sequencing methods without PCR.
  • the sequencing of the 74 SNPs and of the 83 genes is made at the same time in a single platform or at two different moments with different platforms.
  • the respiratory viral infectious disease is selected from COVID-19, influenza, SARS or MERS, preferably COVID-19 or other infections caused by a RNA or a DNA virus, preferably by a RNA virus.
  • the subject tested with the method of the present invention can be asymptomatic or he may present symptoms of a respiratory infectious disease.
  • the symptoms which the subject may present with are symptoms of a pulmonary disease (e.g. cough, breathing difficulty).
  • the subject of this embodiment may further present symptoms of an infectious disease, such as fever, nausea, headache, sore throat, runny nose, rash and/or muscle soreness.
  • the biological sample used in the present method is a blood sample, a tissue, saliva or a buccal swab.
  • the algorithm is composed of two parts: the first part is related to the so-called variant calling, that is, the identification of genetic variants after sequencing, and the calculation of the Variant score and Genotype score.
  • the second part is related to the calculation of the risk score, through the usage of a machine learning method.
  • the steps used in the first part are:
  • Variant calling, for example using samtools mpileup version 1.16 or later, with default parameters, a specific target file for SNPs and a region bed file for genes;
  • the steps used in the second part are:
  • the entire workflow is designed to identify the genetic profile of the subject through the DNA preparation using a library preparation for NGS sequencing (Next generation sequencing), and the sequencing of the resulting DNA library on an NGS platform (irrespective of the platform used, such as Illumina, Ion Torrent, GeneRead, MGI or Oxford Nanopore).
  • the Raw data obtained (FASTQ files) are analyzed through a specific analysis pipeline which integrates the machine learning trained algorithm, in order to obtain a risk score.
  • the machine learning model was trained using data from an observational study and literature data. The trained model is used to calculate the risk score, and the class of risk is predicted on the basis of the risk score obtained. According to the present invention, the prediction of the method was selected as the best balance between sensitivity and specificity, with the specific intent to create a screening test. In other words, a higher sensitivity was prioritized.
  • the method of the present invention has a sensitivity of about 95%.
  • the method of the present invention has a specificity of about 72%.
  • a further embodiment of the present invention relates to a computer program product comprising a computer-usable medium having computer-readable program codes or instructions embodied thereon for enabling a processor to carry out the analysis and correlating functions as described above.
  • a further embodiment of the present invention is a computer medium comprising instructions which, when executed by a computer, cause the computer to carry out the following steps:
  • a further embodiment of the present invention is a kit for calculating the risk score of adverse prognosis in a respiratory viral infectious disease, comprising a library for sequencing the panel of 74 single nucleotide polymorphisms (SNPs) reported in Table 1 and the 83 genes reported in Table 3, according to the method of the present invention and, optionally, a multi-well plate and a microarray.
  • SNPs single nucleotide polymorphisms
  • said respiratory viral infectious disease is selected from COVID-19, influenza, SARS, MERS, preferably COVID-19 or other infections caused by a RNA or a DNA virus, preferably by a RNA virus.
  • SNPs single nucleotide polymorphisms
  • Table 1 reports in detail the ID numbers (RSID) of the 74 SNPs and their locus/gene.
  • DNA libraries for NGS sequencing were prepared for sequencing using both amplicon based and capture based panels.
  • Nextseq Illumina platform was used for sequencing.
  • the pipeline for data analysis was developed and integrated in the 4eVAR (https://4evar.4bases.ch/) cloud-based platform.
  • the primers with sequences SEQ ID N: 1 to 57 are forward primers, while the primers with SEQ ID N: 58 to 114 are reverse primers.
  • SNPs are amplified by the same primer pairs as reported in Table 3.
  • Table 5 reports the coordinates of the probes used to cover the target regions in the gene panel for sequencing the 83 genes reported in Table 4.
  • the Machine Learning approach used in the application is optimized for a supervised binary classification task; hence, samples should be classified into two different and mutually exclusive populations or categories, ideally ‘severe disease’ vs ‘asymptomatic’.
  • the Machine Learning pipeline can be flexibly adapted to different, non-overlapping populations as well (e.g., ‘mild disease’ vs ‘asymptomatic’).
  • A tabular dataset was built starting from multi-sample .vcf files derived from the variant calling analysis step. For each variant in the gene panel, a group of characteristics (or features) and scores specific to the variant were identified in order to calculate its weight in the gene score. Those features are identified using the so-called variant annotation step.
  • Variants identified in the panel of entire genes were annotated (SNPeff v5.1) and classified using Varsome API v11.1.6.
  • Gene score represents the weight of the genotype (specific group of variants) of the specific gene in the predisposition to severe infection, and is represented as a number from 0 to 1. Then a “Genotype score” was calculated as the mean of the gene scores for the subject.
  • the final dataset for the machine learning model of predisposition prediction is composed of age, gender, the variant score calculated from the SNP panel and the genotype score calculated from the gene panel.
  • the target feature i.e. the final output that we wanted to predict
  • the target feature is generally represented by the severity of disease.
  • the minority class is oversampled with SMOTE (Synthetic Minority Oversampling Technique).
  • the Machine Learning modelling pipeline (from top to bottom) is shown in Figure 2.
  • The SHAP library was used for feature importance estimation and selection for the final machine learning model: the concept of “Shapley value”, a well-established method in cooperative game theory for estimating the marginal contribution of individual players, can be applied as a “model-agnostic” method to calculate feature importance.
  • DNA samples were collected in Ospedale Tor Vergata and Ospedale Pediatrico Bambin Gesu’, during the period March-Sept 2020. DNA samples were used to prepare libraries for both Amplicon based sequencing using a panel of 74 SNPs and capture-based sequencing with a panel covering 83 entire genes. NGS instruments were used for sequencing of the obtained libraries (Illumina).
  • the final dataset was composed of 124 samples for training and 40 samples for testing.
  • Random Forest provided the highest values for both accuracy and recall (see Table 8) and was therefore selected as the best model.
  • Table 8 Comparison between different machine learning algorithms, calculated using genetic data and subject age and gender.
  • Figure 4 represents the division into the 3 classes in the test set.
  • the method can be used to calculate the predisposition of a healthy subject to develop a severe outcome facing a respiratory infection. This information can be crucial in hospital management or organization of vaccination campaigns.
  • the method can be extended to other respiratory infections, such as influenza, or DNA or RNA viruses, thanks to the usage of a genotype involving the immune response pathways, not specific only for SARS-CoV-2 infection.
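
The trade-off between sensitivity and specificity mentioned in the list above (a screening test that favours sensitivity) can be illustrated with a minimal Python sketch. The labels, the P values and the two cut-offs below are invented toy numbers for illustration only; they are not data or thresholds from the study, and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = severe, 0 = not severe."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def predict_with_cutoff(p_values, cutoff):
    """Call a subject 'severe' (1) when the predisposition P reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in p_values]


# Toy example (assumed values, not study data).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_p = [0.9, 0.8, 0.6, 0.35, 0.5, 0.4, 0.2, 0.1]

# A low cutoff catches every severe case at the cost of more false positives.
low_cutoff = sensitivity_specificity(toy_labels, predict_with_cutoff(toy_p, 0.3))
# A high cutoff removes false positives but misses severe cases.
high_cutoff = sensitivity_specificity(toy_labels, predict_with_cutoff(toy_p, 0.7))
```

In this toy setting the lower cutoff trades specificity for sensitivity, which is the direction a screening test would normally prefer.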

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides methods and kits relating to the calculation of the risk score of adverse prognoses in SARS-CoV-2 infections or in other respiratory viral disease infections.

Description

Method for the calculation of the adverse prognoses risk score in respiratory viral disease infections, using host genomics
The present invention provides methods and kits relating to the calculation of the risk score of adverse prognoses in SARS-CoV-2 infections or in other respiratory viral disease infections.
Background of the invention
Numerous studies of human genetics were started during the SARS-CoV-2 pandemic (1-7). From the first weeks of the pandemic, it was clear to clinicians that different patients who contracted the infection had different outcomes. Sometimes members of the same family shared the same outcome regardless of age, gender or comorbidities, suggesting the possible presence of common germline genetic factors involved in the host response to infection.
Two different approaches have been used in the identification of a possible host genotype, explaining the severe or asymptomatic outcome in different subjects.
Common variants have been extensively studied using genome-wide association studies (GWAS) (8-10), revealing alleles of increased susceptibility and/or partial resistance to SARS-CoV-2 infection. The most important initiative during the SARS-CoV-2 pandemic was the COVID-19 Host Genetics Initiative (https://www.covid19hg.org/, last accessed on 12th January 2023, (11-13)), which sought to identify genetic variants that account for the variability in individuals’ susceptibility to COVID-19, as well as in the severity of the disease.
The initiative was based on the study of 49,562 individuals with COVID-19 infection (including 13,641 individuals who were hospitalized and, of those, 6,179 with critical conditions) and around 2 million control individuals without known infection. This comparison identified 13 locations in the genome (loci): variants in 4 of these loci are associated with COVID-19 susceptibility, whereas variants in the other 9 loci are associated with disease severity.
Other groups focused the analysis on rare variants in genes involved in the immune response. The International Consortium COVID Human Genetic Effort (www.COVIDhge.com, accessed on 12 January 2023) found that 3% of patients with critical COVID-19 pneumonia had congenital errors of immunity that compromise innate and interferon-mediated immunity as a result of mutations in the genes TLR3, TLR7, and IRF7 (especially in patients < 65 years) (14-16).
In addition, the same consortium found the presence of auto-antibodies (auto-Abs) neutralizing interferon type I in at least 10% of other patients with severe disease. Overall, this research has shown that 20% of severe COVID-19 cases have a defect in the interferon circuit [18]. This proportion is unprecedented among infectious diseases, but is well known and well-studied in cancer (17,18). However, the genetic basis of the remaining 80% of patients is still unknown.
Direct sequencing of whole exomes or whole genomes in patients affected by severe respiratory disease would be the most complete solution for clearly identifying a correlation between host genotypes and the infection.
However, this approach is too complex, time-consuming, expensive and laborious, as it requires the sequencing of an enormous quantity of genes and the processing of a huge amount of data in order to find a possible relation between the host genotype and the severe or asymptomatic outcome of an infectious disease in different subjects. As a matter of fact, the studies reported above have not been able to identify a sensitive and reliable method that is able to calculate a patient risk score and therefore a direct relation between the host genotype and the susceptibility to a respiratory disease infection, such as COVID-19, with a severe prognosis.
Direct sequencing of precise target genes involved in the pathogenesis of the infection, the immune response and the recognition of viral RNA, and the integration of that information with information from GWAS studies, could be the key to a predictive score of pathogenicity. But the genotype alone is not sufficient, because it should be formulated in a method able to learn how to translate genetic data into a risk score.
It is therefore necessary to develop a reliable, precise and sensitive method that can be used for the calculation of the risk score of adverse prognoses in SARS-CoV-2 infections and also for other respiratory disease infections based on DNA or RNA viruses, such as influenza, SARS or MERS.
Brief description of the figures
Figure 1. Algorithm for the calculation of the risk score.
Figure 2. Shows the Machine Learning modelling pipeline (from top to bottom).
Figure 3. Performance of the final machine learning algorithm, with and without adding age and gender of the subject.
Figure 4. Thresholds used to calculate the risk classes, based on the risk score calculated.
Definitions
Unless otherwise defined, all terms of art, notations and other scientific terminology used herein are intended to have the meanings commonly understood by those persons skilled in the art to which this disclosure pertains. In some cases, terms with commonly understood meanings are defined herein for clarity and/or for ready reference; thus, the inclusion of such definitions herein should not be construed to represent a substantial difference over what is generally understood in the art.
The terms “approximately” and “about” herein refer to the range of the experimental error, which may occur in a measurement.
The terms “comprising”, “having”, “including” and “containing” are to be construed as open-ended terms (i.e. meaning “including, but not limited to”) and are to be considered as providing support also for terms such as “consist essentially of”, “consisting essentially of”, “consist of” or “consisting of”.
The terms “consist essentially of”, “consisting essentially of” are to be construed as semi-closed terms, meaning that no other ingredients which materially affect the basic and novel characteristics of the invention are included (optional excipients may thus be included).
The terms “consists of”, “consisting of” are to be construed as closed terms.
A "single nucleotide polymorphism (SNP)" in the context of this invention refers to the presence of an alternative base in specific positions of a DNA sequence, chosen to characterize the DNA genotype of an individual with respect to another.
The term “prognoses” herein refers to the ability to discriminate subjects/patients with good/poor prognosis.
For the purpose of the present invention the term “Variant score” refers to the weight of genetic variants in the SNP panel, calculated as the number of alleles observed over the total number of alleles expected. The term “Gene score” refers to the weight of genetic variants in each gene, calculated using a machine learning algorithm and using the genetic variants identified from the sequenced panel of entire genes.
The term “Genotype score” refers to the weighted average of the Gene scores calculated.
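
As a minimal sketch of the two ratio-style definitions above, the Python below computes a variant score and a genotype score. It makes several assumptions that the text does not state: that "alleles expected" means ploidy times the number of SNPs in the panel, that the per-SNP input is a count of observed variant alleles, and that the weights of the weighted average default to equal weights. The gene names and all function names are hypothetical placeholders; the per-gene scores themselves come from a machine learning step that is not reproduced here.

```python
def variant_score(variant_allele_counts, ploidy=2):
    """Observed variant alleles divided by the alleles expected for the panel.

    variant_allele_counts: one integer per SNP, from 0 to `ploidy`.
    Assumption: expected alleles = ploidy * number of SNPs.
    """
    if not variant_allele_counts:
        raise ValueError("at least one SNP genotype is required")
    if any(c < 0 or c > ploidy for c in variant_allele_counts):
        raise ValueError("allele counts must lie between 0 and the ploidy")
    observed = sum(variant_allele_counts)
    expected = ploidy * len(variant_allele_counts)
    return observed / expected


def genotype_score(gene_scores, weights=None):
    """Weighted average of per-gene scores (each assumed to lie in [0, 1]).

    gene_scores: mapping gene name -> gene score.
    weights: optional mapping gene name -> weight; equal weights if omitted.
    """
    if not gene_scores:
        raise ValueError("at least one gene score is required")
    if weights is None:
        weights = {gene: 1.0 for gene in gene_scores}
    total_weight = sum(weights[gene] for gene in gene_scores)
    weighted_sum = sum(score * weights[gene] for gene, score in gene_scores.items())
    return weighted_sum / total_weight


# Toy inputs (illustrative only): four SNPs and two placeholder genes.
example_variant_score = variant_score([0, 1, 2, 1])
example_genotype_score = genotype_score({"GENE_A": 0.2, "GENE_B": 0.8})
```

With equal weights the genotype score reduces to the plain mean of the gene scores, which matches the simpler wording "average of gene scores" used elsewhere in the description.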
Description of the invention
The inventors have identified a direct relation between the genetic basis of a patient and the patient's susceptibility to COVID-19 infection with disease severity, by evaluating the genetic profile of the subject infected by the virus SARS-CoV-2 and affected by a respiratory disease infection (COVID-19).
In particular, the inventors have developed a new method based on the calculation of the innate risk for the host (intended as person infected by a virus) to develop a severe response to the infection, up to hospitalization in intensive care unit (ICU) and death.
The method of the present invention is based on the evaluation of the genetic profile of the subject, through sequencing of a panel of 74 SNPs (Single Nucleotide Polymorphisms), identified from GWAS studies, and sequencing the entire sequence of 83 specific genes involved in the immune response. This information is then integrated with basic information about the subject (age at infection and gender) in order to calculate the risk profile of a severe outcome, by using a machine learning algorithm.
In particular, the method is based on the analysis of specific genetic markers, named SNPs (single nucleotide polymorphisms), and of specific genes involved in the immune response and by an analysis algorithm specifically designed to integrate the basic information of a subject and therefore to calculate the risk profile of a severe/asymptomatic outcome of a respiratory disease patient infection.
The advantage of the presented method is the possibility to create a genetic risk score specific for the subject, associated with the response to the viral infection. The genetic markers used are not specific for the type of infection but for the host, as referred to the genetic profile of the subject, and are therefore applicable to different types of respiratory viral infectious diseases, such as COVID-19, MERS, SARS or influenza.
Other risk scores have been calculated during the pandemic (https://www.mdcalc.com/covid-19, last accessed January 20th 2023), but all those are based on the comorbidities and clinical situation of the affected patient. The method of the present invention is therefore the first method able to identify a genetic score by considering the genotype of the patient and the genetic predisposition of the same.
An embodiment of the present invention is therefore a method in-vitro or ex-vivo for calculating the risk score of adverse prognosis in a respiratory viral infectious disease, comprising the following steps:
a) extracting the genomic DNA from a biological sample of a subject;
b) sequencing a panel of 74 single nucleotide polymorphisms (SNPs) reported in Table 1 and the entire coding region of the 83 genes reported in Table 3 from said DNA;
c) identifying the gender and the age of said subject;
d) calculating the risk score of said subject by using a machine learning algorithm.
According to a preferred embodiment, in the method of the present invention the risk score is calculated in step d) by the formula risk score = P x 100, wherein P is a number between 0 and 1.
P (Predisposition) is a real number between 0 and 1, obtained by the machine learning algorithm, in which 1 is the total predisposition of a subject to a severe outcome and 0 is the predisposition of a subject to be asymptomatic.
According to the present invention the term “P” stands for the predisposition of the subject to have an adverse prognosis in a respiratory viral infectious disease.
According to a preferred embodiment, in the method of the present invention, P is calculated by the machine learning algorithm with the following steps:
- preprocess and prepare raw data for variant calling;
- calculate variant score and gene score;
- calculate genotype score as average of gene scores;
- integrate genotype score, variant score, age at infection and gender of the subject.
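
The integration step and the formula risk score = P x 100 can be sketched as follows. This is only an illustration under stated assumptions: the feature order, the numeric encoding of gender and the function names are hypothetical, and `toy_model` is a stand-in callable used so the sketch runs; it is not the trained machine learning model described in this document.

```python
def build_features(genotype_score, variant_score, age_at_infection, gender):
    """Assemble the four inputs named in the text into one feature row.

    Assumption: gender is encoded as 0/1; the source does not specify an encoding.
    """
    gender_code = {"female": 0.0, "male": 1.0}[gender]
    return [float(genotype_score), float(variant_score), float(age_at_infection), gender_code]


def risk_score(predict_p, features):
    """Apply risk score = P x 100, where P in [0, 1] comes from a model callable."""
    p = predict_p(features)
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    return p * 100.0


def toy_model(features):
    """Placeholder for the trained model: averages the two genetic scores only."""
    return 0.5 * features[0] + 0.5 * features[1]


# Toy subject (illustrative values only).
example_features = build_features(0.6, 0.4, 52, "male")
example_risk = risk_score(toy_model, example_features)
```

Passing the model as a callable keeps the scoring formula separate from whichever classifier supplies P.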
According to a preferred embodiment, the method of the present invention further comprises step e) wherein the subject is classified at high, medium and low risk of developing a severe infectious disease, based on the calculated risk score.
According to the method of the present invention the subject affected by a respiratory infectious disease is classified as: subject at high risk of developing a severe disease, subject at medium risk of developing a severe disease, or subject at low risk of developing a severe disease and/or asymptomatic subject. Preferably, a subject with a risk score greater than 70% is considered at high risk of developing a severe disease, a subject with a risk score between 30% and 70% is considered at medium risk of developing a severe disease, and a subject with a risk score below 30% is considered at low risk of developing a severe disease and/or asymptomatic.
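
The three risk classes can be expressed as a small Python function. The thresholds come from the paragraph above; treating scores of exactly 30% and exactly 70% as medium risk is an assumption, since the text does not say which class the boundary values fall into. The function name is hypothetical.

```python
def classify_risk(score_percent):
    """Map a risk score (0 to 100) to 'high', 'medium' or 'low'.

    Assumption: the boundary values 30 and 70 are counted as medium risk.
    """
    if not 0 <= score_percent <= 100:
        raise ValueError("risk score must lie between 0 and 100")
    if score_percent > 70:
        return "high"
    if score_percent >= 30:
        return "medium"
    return "low"


# Toy scores (illustrative only).
example_classes = [classify_risk(s) for s in (85, 70, 30, 29.9)]
```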
For the purpose of the present invention, the panel of 74 SNPs disclosed in Table 1 was herein specifically selected in silico by the inventors after the evaluation of different sources:
- Published literature on COVID-19 GWAS;
- Published literature on MERS and SARS GWAS;
- Meta-analysis of observational studies (GWAS, whole exome, whole genome) of infected individuals vs. the general population;
- SNPs extracted from a panel of genes specifically designed to evaluate the immune response of the subject.
For the purpose of the present invention, the panel of 83 genes disclosed in Table 3 was selected by the inventors through an in silico and a clinical evaluation on clinical exome data of genes involved in virus replication, innate immune response, IFN pathway and membrane receptors (19).
According to a preferred embodiment, the genomic DNA of step a) can be extracted with methods and protocols for DNA extraction well known in the art. Said methods are preferably selected from QIAamp DNA Blood Kits, QIAamp DNA Mini Kits, Maxwell RSC Blood Kit, Maxwell® CSC Genomic DNA Kit. In a further preferred embodiment, the sequencing of the 74 SNPs of step b) is performed by using the primers listed in Table 2.
In a further preferred embodiment, the sequencing of the 74 SNPs of step b) is performed by using the primers having the sequence SEQ ID N. 1-114.
In a further preferred embodiment, the sequencing of the 83 genes of step b) is made by using specific probes suitable for sequencing the entire coding region of said genes.
Preferably, the sequencing of said 83 genes is made by using probes that cover the gene target regions listed in Table 4.
According to the present invention, the methods for DNA sequencing disclosed in step b) are selected from the next-generation sequencing methods and instruments known in the art, such as Illumina platforms, Ion Torrent, Nanopore, GeneRead, PacBio or MGI.
Preferably, said methods of sequencing the DNA are selected from amplicon method, capture method, enrichment method, pyrosequencing, incorporation of nucleotides, semiconductor technologies, nanopore real time reading or sequencing methods without PCR.
According to the method of the present invention, the sequencing of the 74 SNPs and of the 83 genes is made at the same time in a single platform or at two different moments with different platforms.
According to a preferred embodiment, the respiratory viral infectious disease is selected from COVID-19, influenza, SARS or MERS, preferably COVID-19 or other infections caused by an RNA or a DNA virus, preferably by an RNA virus. In a further preferred embodiment, the subject tested with the method of the present invention can be asymptomatic or may present symptoms of a respiratory infectious disease.
In another preferred embodiment, the symptoms which the subject may present with are symptoms of a pulmonary disease (e.g. cough, breathing difficulty). The subject of this embodiment may further present symptoms of an infectious disease, such as fever, nausea, headache, sore throat, runny nose, rash and/or muscle soreness.
According to a preferred embodiment the biological sample used in the present method is a blood sample, a tissue, saliva or a buccal swab.
The algorithm is composed of two parts. The first part covers the so-called variant calling, that is, the identification of genetic variants after sequencing, and the calculation of the Variant score and the Genotype score.
The second part covers the calculation of the risk score through the use of a machine learning method.
The steps used in the first part are:
- Alignment of raw data (for example in FASTQ format) and preprocessing of the data, using for example BWA and samtools algorithms respectively;
- Variant calling (for example using samtools mpileup version 1.16 or later), with default parameters and a specific target file for SNPs and a region bed file for genes;
- Calculation of the Variant score according to the presence or absence of SNPs, and allele frequency;
- Annotation of the variants identified in the genes, and calculation of a Gene score for each gene in the panel, considering specific features of each variant using a machine learning algorithm (for example the Random Forest machine learning algorithm) trained to identify the weight of each feature of the variant and create a score;
- Calculation of the Genotype score as the average of the Gene scores.
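The alignment and pileup steps listed above can be sketched as a small Python helper that only assembles command lines and does not run them. The file names, the paired-end layout and the helper name are assumptions made for illustration; the tools are used with default parameters plus a reference and a target file, and the conversion of pileup output into variant calls is not shown because the text does not detail it.

```python
from typing import List


def build_commands(reference: str, fastq_r1: str, fastq_r2: str,
                   targets_bed: str, prefix: str) -> List[List[str]]:
    """Return argument lists for each stage; nothing is executed here.

    Assumption: paired-end reads and hypothetical output file names.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    pileup = f"{prefix}.pileup"
    return [
        # Align the raw FASTQ reads to the reference genome.
        ["bwa", "mem", "-o", sam, reference, fastq_r1, fastq_r2],
        # Preprocess: coordinate-sort the alignments and index the result.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Pile up reads restricted to the SNP targets or gene regions.
        ["samtools", "mpileup", "-f", reference, "-l", targets_bed,
         "-o", pileup, bam],
    ]
```

Each returned list could be passed to `subprocess.run` in turn, once with the SNP target file and once with the region bed file for the genes.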
The steps used in the second part are:
- Creation of a single set of features for each subject comprising age, gender, Variant score and Genotype score;
- Calculation of the risk score using a machine learning algorithm specifically trained to identify as 1 the predisposition to develop a severe outcome.
- Calculation of the risk score percentage as predisposition x 100, and identification of a risk class as Severe, Mild or Asymptomatic, according to the thresholds identified.
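The step from per-gene scores to a single feature set per subject can be sketched as follows. This is a minimal illustration: the class and function names, the numeric encoding of gender and the gene labels in the test values are hypothetical and are not taken from the text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


def genotype_score(gene_scores: Dict[str, float]) -> float:
    """Genotype score as the plain average of the per-gene scores."""
    if not gene_scores:
        raise ValueError("at least one gene score is required")
    return mean(list(gene_scores.values()))


@dataclass
class SubjectFeatures:
    age: float
    gender: int            # assumption: 0 = female, 1 = male
    variant_score: float
    genotype_score: float

    def as_vector(self) -> List[float]:
        """Single feature vector handed to the risk model."""
        return [self.age, float(self.gender),
                self.variant_score, self.genotype_score]
```

The resulting vector is what a trained classifier would receive in order to output the predisposition P.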
A scheme of the entire algorithm is represented in Figure 1.
The entire workflow is designed to identify the genetic profile of the subject through the DNA preparation using a library preparation for NGS sequencing (Next generation sequencing), and the sequencing of the resulting DNA library on an NGS platform (irrespective of the platform used, such as Illumina, Ion Torrent, GeneRead, MGI or Oxford Nanopore).
The raw data obtained (FASTQ files) are analyzed through a specific analysis pipeline which integrates the trained machine learning algorithm, in order to obtain a risk score.
The machine learning model was trained using data from an observational study and literature data. The trained model is used to calculate the risk score, and the class of risk is predicted on the basis of the risk score obtained. According to the present invention, the prediction of the method was selected as the best balance between sensitivity and specificity, with the specific intent to create a screening test. In other words, a higher sensitivity was prioritized.
According to a preferred embodiment, the method of the present invention has a sensitivity of about 95%.
Preferably, the method of the present invention has a specificity of about 72%.
A further embodiment of the present invention relates to a computer program product comprising a computer-usable medium having computer-readable program codes or instructions embodied thereon for enabling a processor to carry out the analysis and correlating functions as described above.
A further embodiment of the present invention is a computer medium comprising instructions which, when executed by a computer, cause the computer to carry out the following steps:
(i) receiving raw data from sequences obtained from the method of claim 1;
(ii) integrating said sequencing data with the gender and the age at infection of the subject;
(iii) preprocessing and preparing raw data for variant calling;
(iv) calculating variant score and gene score;
(v) calculating genotype score as average of gene scores;
(vi) integrating genotype score, variant score, age and gender;
(vii) calculating the risk score by the following formula: P x 100, wherein P is a number between 0 and 1.
A further embodiment of the present invention is a kit for calculating the risk score of adverse prognosis in a respiratory viral infectious disease, comprising a library for sequencing the panel of 74 single nucleotide polymorphisms (SNPs) reported in Table 1 and the 83 genes reported in Table 3, according to the method of the present invention and, optionally, a multi-well plate and a microarray.
According to a preferred embodiment, said respiratory viral infectious disease is selected from COVID-19, influenza, SARS, MERS, preferably COVID-19 or other infections caused by an RNA or a DNA virus, preferably by an RNA virus.
Additional benefits and/or uses of the methods/kit of the present disclosure will be readily apparent to one of skill in the art.
EXPERIMENTAL DATA
Materials & Methods
We selected the specific 74 single nucleotide polymorphisms (SNPs) from GWAS studies reported in the literature and classified each of them, according to GWAS, international databases and scientific research, into one of the following categories of association with disease outcome:
- Severe disease;
- Severe disease (respiratory failure);
- Mild-to-moderate disease requiring hospitalization;
- Associated with increased susceptibility;
- Higher risk of infection for blood group A vs. non-A and lower risk of infection for blood group O vs. non-O;
- Associated with increased resistance.
Table 1 reports in detail the ID numbers (RSID) of the 74 SNPs and their locus/gene.
[Table 1: images imgf000015_0001 to imgf000018_0001]
* SNP identification number
** role of the gene identified in (12,20)
DNA libraries for NGS sequencing were prepared using both amplicon-based and capture-based panels. An Illumina NextSeq platform was used for sequencing. The pipelines for data analysis were developed and integrated in the 4eVAR (https://4evar.4bases.ch/) cloud-based platform.
Table 2 reports the primers specifically designed and used in the present application for sequencing the 74 SNPs.
Table 2.
[Table 2: images imgf000018_0002 to imgf000021_0001]
In particular, the primers with sequences SEQ ID N: 1 to 57 are forward primers, while the primers with SEQ ID N: 58 to 114 are reverse primers.
Some SNPs are amplified by the same primer pairs as reported in Table 3.
Table 3
Figure imgf000021_0002
Figure imgf000022_0001
The panel of the 83 entire genes in the capture-based panel is reported in Table 4.
Table 4.
Figure imgf000022_0002
Figure imgf000023_0001
Table 5 reports the coordinates of the probes used to cover the target regions in the gene panel for sequencing the 83 genes reported in Table 4.
Table 5.
[Table 5: images imgf000023_0002 to imgf000043_0001]
Machine learning model design
The Machine Learning approach used in the application is optimized for a supervised binary classification task; hence, samples should be classified into two different and mutually exclusive populations or categories, ideally ‘severe disease’ vs ‘asymptomatic’. However, the Machine Learning pipeline can be flexibly adapted to different, non-overlapping populations as well (e.g., ‘mild disease’ vs ‘asymptomatic’).
The full Machine Learning model pipeline is summarized in Figure 2.
Data preparation
Data analysis of NGS sequencing raw files was done using well-established pipelines for alignment (BWA-mem v0.7.17) and variant calling (samtools mpileup v1.16).
First, each of the 74 SNPs was identified as present or absent (0), and each SNP present was categorized as homozygous (2) or heterozygous (1). For each subject a “Variant Score” was calculated as Num variant for class/GT, where GT was the total number of alleles carrying a variant and Num variant for class was the sum of alleles (counting homozygous as 2, heterozygous as 1, and 0 for a SNP that is not present or is wild type).
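The genotype coding and the ratio described above can be sketched in Python under one possible reading: "class" is taken to mean the association category assigned to each SNP, and the score is computed per class as the class allele count divided by the total number of variant alleles of the subject. This interpretation, the function names and the SNP identifiers used in any example are assumptions rather than the documented implementation.

```python
from collections import defaultdict
from typing import Dict


def allele_count(gt: str) -> int:
    """Code a VCF-style diploid genotype: 0 absent, 1 heterozygous, 2 homozygous."""
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))


def variant_scores(genotypes: Dict[str, str],
                   snp_class: Dict[str, str]) -> Dict[str, float]:
    """Per-class allele count divided by the total variant allele count.

    Assumption: snp_class maps each SNP to its association category.
    """
    per_class: Dict[str, int] = defaultdict(int)
    for snp_id, gt in genotypes.items():
        per_class[snp_class[snp_id]] += allele_count(gt)
    total = sum(per_class.values())
    if total == 0:
        # No variant allele at any SNP: every class scores zero.
        return {c: 0.0 for c in per_class}
    return {c: n / total for c, n in per_class.items()}
```

With one homozygous SNP in a "severe" class and one heterozygous SNP in a "resistance" class, the subject carries three variant alleles, two of which fall in the "severe" class.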
A tabular dataset was built starting from the multi-sample .vcf files derived from the variant calling analysis step. For each variant in the gene panel, a group of characteristics (or features) and scores specific to the variant were identified in order to calculate its weight in the gene score. Those features are identified in the so-called variant annotation step.
Variants identified in the panel of entire genes were annotated (SnpEff v5.1) and classified using Varsome API v11.1.6.
The list of features identified for each variant is reported in Table 6.
Table 6.
Figure imgf000045_0001
For each of the 83 genes in the panel, we calculated a “Gene score” using a random forest algorithm. The Gene score represents the weight of the genotype (specific group of variants) of the specific gene in the predisposition to severe infection, and is represented as a number from 0 to 1. Then a “Genotype score” was calculated as the mean of the gene scores for the subject.
The final dataset for the machine learning model of predisposition prediction is composed of age, gender, the variant score calculated from the SNP panel and the genotype score calculated from the gene panel.
The target feature (i.e. the final output that we wanted to predict) is generally represented by the severity of disease. We decided to train the model to identify the extreme outcomes: ‘severe disease’ vs ‘asymptomatic’. In case of class imbalance, the minority class is oversampled with SMOTE (Synthetic Minority Oversampling Technique).
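The core idea of SMOTE, creating a synthetic minority sample on the segment between a minority sample and one of its nearest minority neighbours, can be illustrated with a simplified standard-library sketch. It is not the library implementation used in practice; the function name, the default of 5 neighbours and the seeding are assumptions.

```python
import random
from typing import List, Sequence

Vector = List[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority: Sequence[Vector], n_new: int,
               k: int = 5, seed: int = 0) -> List[Vector]:
    """Generate n_new synthetic samples by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    samples = list(minority)
    synthetic: List[Vector] = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest minority neighbours of the chosen sample (itself excluded).
        neighbours = sorted((m for m in samples if m is not base),
                            key=lambda m: _sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

The synthetic samples are appended to the minority class of the training folds only, so that validation data remain untouched by oversampling.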
Model selection
Five different supervised classification algorithms were compared:
• Logistic Regression
• Bernoulli Naive Bayes
• Decision Tree
• Random Forest
• XGBoost
As a validation procedure, the dataset is iteratively split into train and validation sets, using Repeated Stratified K-Fold Cross Validation (with k = 10, number of repetitions = 3). In order to compare the performance of different models, we mainly focused on two metrics: average accuracy and recall. Accuracy is a commonly used metric for assessing model performance; we also chose to take recall (or sensitivity) into consideration, as it is directly affected by the false negative rate.
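The splitting scheme can be illustrated with a standard-library sketch of repeated stratified k-fold splitting. It is a simplified re-implementation for illustration only (the function name and the round-robin assignment are assumptions); in a Scikit-Learn environment the library's own splitter would normally be used.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def repeated_stratified_kfold(
    labels: Sequence[int], k: int = 10, repeats: int = 3, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for k folds x repeats."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(repeats):
        folds: List[List[int]] = [[] for _ in range(k)]
        # Shuffle each class separately, then deal its samples round-robin
        # so every fold keeps roughly the overall class proportions.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % k].append(idx)
        for i in range(k):
            validation = sorted(folds[i])
            train = sorted(j for f, fold in enumerate(folds)
                           if f != i for j in fold)
            yield train, validation
```

With k = 10 and 3 repetitions, each model is trained and validated 30 times, and the reported accuracy and recall are averages over those 30 splits.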
Statistical comparison can then be performed using both parametric (one-way ANOVA, post-hoc Tukey-HSD) and non-parametric (Kruskal-Wallis, post-hoc Dunn’s test) methods.
Finally, hyperparameters tuning and optimization for the best model can be performed using Grid Search.
Several relevant metrics are recorded automatically: accuracy, area under the ROC curve (ROC-AUC score), precision and recall.
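The metrics listed above follow standard definitions, which the sketch below spells out with the standard library only. The function names are hypothetical, class 1 is assumed to denote ‘severe disease’, and the ROC-AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # sensitivity: penalises false negatives
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Recall is the metric that drops when severe cases are missed, which is why it is tracked alongside accuracy for a screening-oriented test.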
The Machine Learning modelling pipeline (from top to bottom) is shown in Figure 2.
The SHAP library was used for feature importance estimation and feature selection for the final machine learning model: the concept of the “Shapley value”, a well-established method in cooperative game theory for estimating the marginal contribution of individual players, can be applied as a “model-agnostic” method to calculate feature importance.
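The Shapley value itself can be illustrated on a toy cooperative game with an exact computation over all player orderings. This sketch does not use the SHAP library, which relies on faster approximations; the function name and any value function supplied to it are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Average marginal contribution of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for player in order:
            extended = coalition | {player}
            # Marginal contribution of this player when joining the coalition.
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {p: t / len(orderings) for p, t in totals.items()}
```

In the feature importance setting, the players are the input features and the value of a coalition is the model output obtained when only those features are known; the values always sum to the difference between the full and the empty coalition.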
Software
Programming environment: Python 3.8, Scikit-Learn and SHAP libraries for Machine Learning
We tested the machine learning model over two - partially overlapping - types of data set: genetic features and genetic features combined with subject info, such as age and gender.
Example 1
A Use Case
Samples
A total of 200 DNA samples were collected at Ospedale Tor Vergata and Ospedale Pediatrico Bambino Gesù during the period March-September 2020. DNA samples were used to prepare libraries for both amplicon-based sequencing using a panel of 74 SNPs and capture-based sequencing with a panel covering 83 entire genes. NGS instruments (Illumina) were used for sequencing of the obtained libraries.
We tested our Machine Learning pipeline on a dataset comprising 124 samples, including 100 ‘severe disease’ cases and 24 ‘asymptomatic’ cases. Both groups tested positive for COVID-19; while subjects belonging to the ‘asymptomatic’ group did not display any specific symptom, subjects from the ‘severe disease’ group were all hospitalized in Intensive Care Units.
For all but one subject, an ‘asymptomatic’ one, age and gender data were also available and therefore included in the second step of the predictive analysis. Given the unequal distribution of the output classes (‘severe disease’ vs ‘asymptomatic’ ratio ≈ 4:1), we chose to oversample the minority class.
It should be noted that subjects from the ‘severe disease’ group were significantly older than those from the ‘asymptomatic’ group: 67.6 ± 13.6 years for ‘severe disease’ and 44.8 ± 10.9 years (mean, s.d.) for ‘asymptomatic’; similarly, gender distribution was imbalanced too: 22% of males for the ‘asymptomatic’ group against 78% of males for the ‘severe disease’ group.
The final dataset was composed of 124 samples for training and 40 samples for test.
Model training
Following parametric and non-parametric statistical analysis, we did not observe any significant difference among the algorithms, either for accuracy or for recall; we chose to select Random Forest as the best model because the associated recall score was higher than for the other algorithms (see Table 7).
Table 7. Comparison between different models used to calculate the final predisposition score using only genetic data.
Figure imgf000048_0001
We performed hyperparameter optimization for Random Forest accuracy through Grid Search and Stratified K-Fold Cross-validation (k=10); we selected the following hyperparameters:
• Number of estimators (from 10 to 130, with a step of 20)
• Max number of features per estimator (from 3 to 8)
• Bootstrap (True or False)
The optimal configuration was achieved with 50 estimators, 4 features per estimator and the Bootstrap parameter set to False: average accuracy equal to 0.78 ± 0.07.
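The size of the search space described above can be made explicit by enumerating the grid. In this sketch the dictionary keys follow the Random Forest parameter names used by Scikit-Learn (`n_estimators`, `max_features`, `bootstrap`); that mapping and the function name are assumptions, as the text only gives the descriptive names.

```python
from itertools import product
from typing import Dict, List, Union

N_ESTIMATORS = list(range(10, 131, 20))  # 10, 30, ..., 130
MAX_FEATURES = list(range(3, 9))         # 3 to 8 inclusive
BOOTSTRAP = [True, False]


def hyperparameter_grid() -> List[Dict[str, Union[int, bool]]]:
    """Enumerate every combination explored by an exhaustive grid search."""
    return [
        {"n_estimators": n, "max_features": m, "bootstrap": b}
        for n, m, b in product(N_ESTIMATORS, MAX_FEATURES, BOOTSTRAP)
    ]
```

The grid holds 7 × 6 × 2 = 84 configurations, so a 10-fold cross-validation fits 840 models in total.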
Genetic Data and subject information
We then tested the effect of adding age and gender to the genotype information.
Following parametric analysis, the average accuracy of Naive Bayes was significantly lower than that of Random Forest and XGBoost. We observed more significant differences when comparing the recall of the algorithms: Naive Bayes had lower recall than both Random Forest and XGBoost; Random Forest also significantly outperformed Logistic Regression. Non-parametric analysis highlighted the same significant comparisons; in addition, the difference between Logistic Regression and XGBoost was also significant.
Random Forest provided highest values for both accuracy and recall (see Table 8) and was therefore selected as the best model.
Table 8. Comparison between different machine learning algorithms, calculated using genetic data and subject age and gender.
Figure imgf000049_0001
Figure imgf000050_0001
We performed Grid Search optimization over the same hyperparameters described in the previous section; here the optimal configuration was achieved with 70 estimators, 7 features per estimator and the Bootstrap parameter set to False: average accuracy equal to 0.89 ± 0.08.
Given the better results for the Random Forest also in this second test, we finally decided to use also age and gender in the final prediction.
Model test
ROC analysis using the final model on the test set is represented in Figure 3.
The best threshold between the severe and asymptomatic groups is difficult to assign; however, as expected, subjects with a mild response are well represented in the grey zone of the image. This suggests the possibility of dividing subjects into 3 classes based on the prediction score: severe with risk score >70%, mild with risk score between 30% and 70%, and asymptomatic with risk score <30%.
Figure 4 shows the division into the 3 classes in the test set.
Conclusions
We developed a genetic based method to calculate a risk score and to predict the genetic predisposition of a subject to develop a severe outcome facing a respiratory infection. The method can be used to calculate the predisposition of a healthy subject to develop a severe outcome facing a respiratory infection. This information can be crucial in hospital management or organization of vaccination campaigns.
The method can be extended to other respiratory infections, such as influenza, or to other DNA or RNA viruses, thanks to the use of a genotype involving the immune response pathways, which is not specific only to SARS-CoV-2 infection.
References
1. Perfetto L, Micarelli E, Iannuccelli M, Lo Surdo P, Giuliani G, Latini S, et al. A Resource for the Network Representation of Cell Perturbations Caused by SARS-CoV-2 Infection. Genes. 2021 Mar 22;12(3):450.
2. Anaclerio F, Ferrante R, Mandatori D, Antonucci I, Capanna M, Damiani V, et al. Different Strategies for the Identification of SARS-CoV-2 Variants in the Laboratory Practice. Genes. 2021 Sep 16;12(9):1428.
3. Huang SW, Miller SO, Yen CH, Wang SF. Impact of Genetic Variability in ACE2 Expression on the Evolutionary Dynamics of SARS-CoV-2 Spike D614G Mutation. Genes. 2020 Dec 24;12(1):16.
4. Mbarek H, Cocca M, Al-Sarraj Y, Saad C, Mezzavilla M, AlMuftah W, et al. Poking COVID-19: Insights on Genomic Constraints among Immune-Related Genes between Qatari and Italian Populations. Genes. 2021 Nov 22;12(11):1842.
5. Monticelli M, Hay Mele B, Benetti E, Fallerini C, Baldassarri M, Furini S, et al. Protective Role of a TMPRSS2 Variant on Severe COVID-19 Outcome in Young Males and Elderly Women. Genes. 2021 Apr 19;12(4):596.
6. Russo R, Andolfo I, Lasorsa VA, Cantalupo S, Marra R, Frisso G, et al. The TNFRSF13C H159Y Variant Is Associated with Severe COVID-19: A Retrospective Study of 500 Patients from Southern Italy. Genes. 2021 Jun 8;12(6):881.
7. Colona VL, Vasiliou V, Watt J, Novelli G, Reichardt JKV. Update on human genetic susceptibility to COVID-19: susceptibility to virus and response. Hum Genomics. 2021 Dec;15(1):57, s40246-021-00356-x.
8. Novelli G, Biancolella M, Mehrian-Shai R, Colona VL, Brito AF, Grubaugh ND, et al. COVID-19 one year into the pandemic: from genetics and genomics to therapy, vaccination, and policy. Hum Genomics. 2021 Dec;15(1):27.
9. Regan JA, Abdulrahim JW, Bihlmeyer NA, Haynes C, Kwee LC, Patel MR, et al. Phenome-Wide Association Study of Severe COVID-19 Genetic Risk Variants. J Am Heart Assoc. 2022 Mar;11(5):e024004.
10. Suh S, Lee S, Gym H, Yoon S, Park S, Cha J, et al. A systematic review on papers that study on Single Nucleotide Polymorphism that affects coronavirus 2019 severity. BMC Infect Dis. 2022 Dec;22(1):47.
11. Anastassopoulou C, Gkizarioti Z, Patrinos GP, Tsakris A. Human genetic factors associated with susceptibility to SARS-CoV-2 infection and COVID-19 disease severity. Hum Genomics. 2020 Dec;14(1):40.
12. The GenOMICC Investigators, The ISARIC4C Investigators, The COVID-19 Human Genetics Initiative, 23andMe Investigators, BRACOVID Investigators, Gen-COVID Investigators, et al. Genetic mechanisms of critical illness in COVID-19. Nature. 2021 Mar 4;591(7848):92-8.
13. The Severe Covid-19 GWAS Group. Genomewide Association Study of Severe Covid-19 with Respiratory Failure. N Engl J Med. 2020 Oct 15;383(16):1522-34.
14. Zhang Q, Bastard P, COVID Human Genetic Effort, Karbuz A, Gervais A, Tayoun AA, et al. Human genetic and immunological determinants of critical COVID-19 pneumonia. Nature. 2022 Mar 24;603(7902):587-98.
15. Zhang Q, Bastard P, Liu Z, Le Pen J, Moncada-Velez M, Chen J, et al. Inborn errors of type I IFN immunity in patients with life-threatening COVID-19. Science. 2020 Oct 23;370(6515):eabd4570.
16. Bastard P, Rosen LB, Zhang Q, Michailidis E, Hoffmann HH, Zhang Y, et al. Autoantibodies against type I IFNs in patients with life-threatening COVID-19. Science. 2020 Oct 23;370(6515):eabd4585.
17. Bastard P, Zhang Q, Zhang SY, Jouanguy E, Casanova JL. Type I interferons and SARS-CoV-2: from cells to organisms. Curr Opin Immunol. 2022 Feb;74:172-82.
18. Borden EC. Interferons α and β in cancer: therapeutic opportunities from new insights. Nat Rev Drug Discov. 2019 Mar;18(3):219-34.
19. Velavan TP, Pallerla SR, Ruter J, Augustin Y, Kremsner PG, Krishna S, et al. Host genetic factors determining COVID-19 susceptibility and severity. eBioMedicine. 2021 Oct;72:103629.
20. COVID-19 Host Genetics Initiative, COVID-19 Host Genetics Initiative Leadership, Niemi MEK, Karjalainen J, Liao RG, Neale BM, et al. Mapping the human genetic architecture of COVID-19. Nature. 2021 Dec 16;600(7889):472-7.

Claims

1. An in vitro or ex vivo method for calculating the risk score of adverse prognosis in a respiratory viral infectious disease, comprising the following steps: a) extracting the genomic DNA from a biological sample of a subject; b) sequencing a panel of 74 single nucleotide polymorphisms (SNPs) reported in Table 1 and the entire coding region of 83 genes reported in Table 4 from said DNA; c) identifying the gender and the age of said subject; d) calculating the risk score of said subject by using a machine learning algorithm.
2. A method according to claim 1, characterized in that the risk score is calculated in step d) by the formula Risk score = P x 100, wherein P is a number between 0 and 1.
3. A method according to claim 2, characterized in that P is calculated by the machine learning algorithm with the following steps:
- pre-processing and preparing raw data for variant calling;
- calculate variant score and gene score;
- calculate genotype score as average of gene scores;
- integrate genotype score, variant score, age at infection and gender of the subject.
4. A method according to any of the previous claims further comprising step e) wherein the subject is classified at high, medium and low risk of developing a severe infectious viral disease, based on the calculated risk score.
5. A method according to claim 4, wherein a subject with a risk score greater than 70% is considered at high risk of developing a severe disease, a subject with a risk score between 30% and 70% is considered at a medium risk of developing a severe disease, and a subject with a risk score below 30% is considered at a low risk of developing a severe disease and/or asymptomatic.
6. A method according to any of the previous claims, wherein the sequencing of the 74 SNPs at point b) is made by using the primers listed in Table 2.
7. A method according to any of the previous claims, wherein the sequencing of the 83 genes of step b) is made by using specific probes suitable for sequencing the entire coding region of said genes.
8. A method according to any of the previous claims, wherein the respiratory viral infectious disease is selected from COVID-19, influenza, SARS or MERS, preferably COVID-19 or from other infections caused by an RNA or a DNA virus, preferably by an RNA virus.
9. A method according to any of the previous claims wherein the biological sample is a blood sample, a tissue, saliva or a buccal swab.
10. A method according to any of the previous claims, wherein the methods of sequencing the DNA at step b) are selected from amplicon method, capture method, enrichment method, pyrosequencing, incorporation of nucleotides, semiconductor technologies, nanopore real time reading or sequencing methods without PCR.
11. A method according to any of the previous claims, wherein the sequencing of the 74 SNPs and of the 83 genes of step b) is made at the same time in a single platform or at two different moments.
12. A method according to any of the previous claims, wherein the algorithm is composed by two parts, the first part including the following steps:
- alignment of raw data and preprocessing of data;
- variant calling, with default parameters and specific target file for SNPs and a region bed file for genes;
- calculation of the variant score according to the presence or absence of SNPs, and allele frequency;
- annotation of the variants identified in the genes, and calculation of a gene score for each gene in the panel, considering specific features of each variant using a machine learning algorithm trained to identify the weight of each feature of the variant and create a score;
- calculation of the genetic score as average of each gene score;
the second part including the following steps:
- creation of a single set of features for each subject comprising age, gender, variant score and genotype score;
- calculation of the risk score using a machine learning algorithm specifically trained to identify as 1 the predisposition to develop a severe outcome;
- calculation of the risk score percentage as predisposition X100, and identification of a risk class as severe, mild, asymptomatic, according to thresholds identified.
13. A method according to claim 12, wherein the alignment of raw data is in FASTQ format.
14. A method according to claim 12, wherein preprocessing of data is done using BWA and samtools algorithms.
15. A method according to claim 12, wherein variant calling is done with samtools mpileup.
16. A method according to claim 12, wherein the machine learning algorithm is Random Forest.
17. A kit for calculating the risk score of adverse prognosis in a respiratory viral infectious disease, comprising a library for sequencing the panel of 74 single nucleotide polymorphisms (SNPs) reported in Table 1 and the 83 genes reported in Table 4, according to the method of any one of claims 1 to 16.
18. A kit according to claim 17 also comprising a multi-well plate and a microarray.
19. A computer medium comprising instructions which, when executed by a computer, cause the computer to carry out the following steps:
(i) receiving raw data from sequences obtained from the method of any one of claims 1 to 16;
(ii) integrating said sequencing data with the gender and the age at infection of the subject;
(iii) preprocessing and preparing raw data for variant calling;
(iv) calculating variant score and gene score;
(v) calculating genotype score as average of gene scores;
(vi) integrating genotype score, variant score, age and gender;
(vii) calculating the risk score by the following formula: P x 100, wherein P is a number between 0 and 1.
PCT/EP2024/059367 2023-04-06 2024-04-05 Method for the calculation of the adverse prognoses risk score in respiratory viral disease infections, using host genomics Pending WO2024209074A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT202300006801 2023-04-06
IT102023000006801 2023-04-06

Publications (1)

Publication Number Publication Date
WO2024209074A1 (en) 2024-10-10

Family

ID=87036646

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/059367 Pending WO2024209074A1 (en) 2023-04-06 2024-04-05 Method for the calculation of the adverse prognoses risk score in respiratory viral disease infections, using host genomics

Country Status (1)

Country Link
WO (1) WO2024209074A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119851972A (en) * 2025-03-21 2025-04-18 中国人民解放军总医院 Computer readable storage medium and data processing device for grouping novel coronavirus infected persons
CN120527028A (en) * 2025-07-24 2025-08-22 四川国际旅行卫生保健中心(成都海关口岸门诊部) Method and device for dividing influenza epidemic intensity threshold, storage medium and electronic device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023284768A1 (en) * 2021-07-13 2023-01-19 北京爱普益生物科技有限公司 Fusion primer direct amplification method-based human mitochondrial whole genome high-throughput sequencing kit

Non-Patent Citations (22)

* Cited by examiner, † Cited by third party
Title
ANACLERIO F, FERRANTE R, MANDATORI D, ANTONUCCI I, CAPANNA M, DAMIANI V ET AL.: "Different Strategies for the Identification of SARS-CoV-2 Variants in the Laboratory Practice", GENES, vol. 12, no. 9, 16 September 2021 (2021-09-16), pages 1428
ANASTASSOPOULOU C, GKIZARIOTI Z, PATRINOS GP, TSAKRIS A: "Human genetic factors associated with susceptibility to SARS-CoV-2 infection and COVID-19 disease severity", HUM GENOMICS, vol. 14, no. 1, December 2020 (2020-12-01), pages 40
BASTARD P, ROSEN LB, ZHANG Q, MICHAILIDIS E, HOFFMANN HH, ZHANG Y ET AL.: "Autoantibodies against type I IFNs in patients with life-threatening COVID-19", SCIENCE, vol. 370, no. 6515, 23 October 2020 (2020-10-23), XP055768940, DOI: 10.1126/science.abd4585
BASTARD P, ZHANG Q, ZHANG SY, JOUANGUY E, CASANOVA JL: "Type I interferons and SARS-CoV-2: from cells to organisms", CURR OPIN IMMUNOL., vol. 74, February 2022 (2022-02-01), pages 172 - 82
BORDEN EC: "Interferons α and β in cancer: therapeutic opportunities from new insights", NAT REV DRUG DISCOV, vol. 18, no. 3, March 2019 (2019-03-01), pages 219 - 34, XP036715679, DOI: 10.1038/s41573-018-0011-2
COLONA VL, VASILIOU V, WATT J, NOVELLI G, REICHARDT JKV: "Update on human genetic susceptibility to COVID-19: susceptibility to virus and response", HUM GENOMICS, vol. 15, no. 1, December 2021 (2021-12-01), pages 57
DOMDOM MARIE-ANGELA ET AL: "A multifactorial score including autophagy for prognosis and care of COVID-19 patients", AUTOPHAGY, vol. 16, no. 12, 29 November 2020 (2020-11-29), US, pages 2276 - 2281, XP093089745, ISSN: 1554-8627, DOI: 10.1080/15548627.2020.1844433 *
HUANG SW, MILLER SO, YEN CH, WANG SF: "Impact of Genetic Variability in ACE2 Expression on the Evolutionary Dynamics of SARS-CoV-2 Spike D614G Mutation", GENES, vol. 12, no. 1, 24 December 2020 (2020-12-24), pages 16
INVESTIGATORS ET AL.: "Genetic mechanisms of critical illness in COVID-19", NATURE, vol. 591, no. 7848, 4 March 2021 (2021-03-04), pages 92 - 8, XP037386549, DOI: 10.1038/s41586-020-03065-y
MBAREK H, COCCA M, AL-SARRAJ Y, SAAD C, MEZZAVILLA M, ALMUFTAH W ET AL.: "Poking COVID-19: Insights on Genomic Constraints among Immune-Related Genes between Qatari and Italian Populations", GENES, vol. 12, no. 11, 22 November 2021 (2021-11-22), pages 1842
MONTICELLI M, HAY MELE B, BENETTI E, FALLERINI C, BALDASSARRI M, FURINI S ET AL.: "Protective Role of a TMPRSS2 Variant on Severe COVID-19 Outcome in Young Males and Elderly Women", GENES, vol. 12, no. 4, 19 April 2021 (2021-04-19), pages 596
NIEMI MEK, KARJALAINEN J, LIAO RG, NEALE BM ET AL.: "Mapping the human genetic architecture of COVID-19", NATURE, vol. 600, no. 7889, 16 December 2021 (2021-12-16), pages 472 - 7
NOVELLI G, BIANCOLELLA M, MEHRIAN-SHAI R, COLONA VL, BRITO AF, GRUBAUGH ND ET AL.: "COVID-19 one year into the pandemic: from genetics and genomics to therapy, vaccination, and policy", HUM GENOMICS, vol. 15, no. 1, December 2021 (2021-12-01), pages 27
PERFETTO L, MICARELLI E, IANNUCCELLI M, LO SURDO P, GIULIANI G, LATINI S ET AL.: "A Resource for the Network Representation of Cell Perturbations Caused by SARS-CoV-2 Infection", GENES, vol. 12, no. 3, 22 March 2021 (2021-03-22), pages 450
REGAN JA, ABDULRAHIM JW, BIHLMEYER NA, HAYNES C, KWEE LC, PATEL MR ET AL.: "Phenome-Wide Association Study of Severe COVID-19 Genetic Risk Variants", J AM HEART ASSOC, vol. 11, no. 5, March 2022 (2022-03-01)
RUSSO R, ANDOLFO I, LASORSA VA, CANTALUPO S, MARRA R, FRISSO G ET AL.: "TNFRSF13C H159Y Variant Is Associated with Severe COVID-19: A Retrospective Study of 500 Patients from Southern Italy", GENES, vol. 12, no. 6, 8 June 2021 (2021-06-08), pages 881, XP055863247, DOI: 10.3390/genes12060881
SUH S, LEE S, GYM H, YOON S, PARK S, CHA J ET AL.: "A systematic review on papers that study on Single Nucleotide Polymorphism that affects coronavirus 2019 severity", BMC INFECT DIS, vol. 22, no. 1, December 2022 (2022-12-01), pages 47
THE SEVERE COVID-19 GWAS GROUP: "Genomewide Association Study of Severe Covid-19 with Respiratory Failure", N ENGL J MED, vol. 383, no. 16, 15 October 2020 (2020-10-15), pages 1522 - 34
VELAVAN TP, PALLERLA SR, RUTER J, AUGUSTIN Y, KREMSNER PG, KRISHNA S ET AL.: "Host genetic factors determining COVID-19 susceptibility and severity", EBIOMEDICINE, vol. 72, October 2021 (2021-10-01), pages 103629
XIN GAO ET AL: "Genome-wide screening of SARS-CoV-2 infection-related genes based on the blood leukocytes sequencing data set of patients with COVID-19", JOURNAL OF MEDICAL VIROLOGY, JOHN WILEY & SONS, INC, US, vol. 93, no. 9, 28 May 2021 (2021-05-28), pages 5544 - 5554, XP071480845, ISSN: 0146-6615, DOI: 10.1002/JMV.27093 *
ZHANG Q, BASTARD P, COVID HUMAN GENETIC EFFORT, KARBUZ A, GERVAIS A, TAYOUN AA ET AL.: "Human genetic and immunological determinants of critical COVID-19 pneumonia", NATURE, vol. 603, no. 7902, 24 March 2022 (2022-03-24), pages 587 - 98, XP037768984, DOI: 10.1038/s41586-022-04447-0
ZHANG Q, BASTARD P, LIU Z, LE PEN J, MONCADA-VELEZ M, CHEN J ET AL.: "Inborn errors of type I IFN immunity in patients with life-threatening COVID-19", SCIENCE, vol. 370, no. 6515, 23 October 2020 (2020-10-23)

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24720420

Country of ref document: EP

Kind code of ref document: A1

WWE WIPO information: entry into national phase

Ref document number: 2024720420

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2024720420

Country of ref document: EP

Effective date: 20251106
