
WO2019169117A1 - Detection of variant alleles in complex repetitive sequences in whole genome sequencing data sets - Google Patents

Info

Publication number
WO2019169117A1
Authority
WO
WIPO (PCT)
Prior art keywords
reads
sequence
reference sequence
genome
remainder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2019/020024
Other languages
English (en)
Inventor
Scott C. Blanchard
Matthew M. PARKS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cornell University
Original Assignee
Cornell University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cornell University
Publication of WO2019169117A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1089 Design, preparation, screening or analysis of libraries using computer algorithms
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00 Libraries per se, e.g. arrays, mixtures
    • C40B40/04 Libraries containing only organic compounds
    • C40B40/06 Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • Repetitive sequences are patterns of nucleic acids (DNA or RNA) that occur in multiple copies throughout the genome. Repetitive sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed.
  • For ribosomal DNA regions in mammals, for instance - for which there may be 20 to 2000 distinct rDNA operon elements within each individual (mouse or human) that exhibit related, but unknown sequences - there is no existing product specifically designed for or capable of detecting rDNA sequence variants in the encoded operons or the ribosomal RNA (rRNA) expressed from them, or of estimating their copy number.
  • Human and mouse genomes encode hundreds of copies of the rDNA operon, which are arranged in tandem arrays on five chromosomes (chromosomes 13, 14, 15, 21, 22 in human; chromosomes 12, 15, 17, 18, 19 in mouse).
  • Each rDNA operon encodes a pre-rRNA that is post-transcriptionally processed into the 18S rRNA of the small ribosomal subunit and the 5.8S and 28S rRNAs of the large ribosomal subunit.
  • Human and mouse also possess tens to hundreds of copies of 5S rRNA, a third rRNA component of the large subunit, on chromosomes 1 and 8, respectively.
  • sequence variations in the rDNA promoter and externally transcribed spacer (ETS) regions are associated with tissue-specific transcriptional activities.
  • enhancer elements in the rDNA intergenic spacers determine heterochromatin formation and the silencing of distinct rDNA subtypes, contributing to cell cycle-dependent expression patterns and cellular differentiation.
  • IGS rDNA intergenic spacers
  • Ribosomopathies are diseases arising from aberrations in the assembly, composition, or function of the ribosome, the two-subunit RNA-protein complex responsible for cellular protein synthesis (K. L. McCann, et al., Science, (2013) 341, 849-850). Varied and tissue-specific disease phenotypes associated with the perturbation of ribosomal proteins have inspired hypotheses of functionally distinct ribosome sub-populations in the cell (S. Xue, et al., Nat Rev Mol Cell Biol., (2012) 13, 355-369; Z. Shi, et al., Annu Rev Cell Dev Biol., (2015) 31, 31-54; E. W. Mills, et al., Science., (2017)
  • rRNA ribosomal RNA
  • rDNA ribosomal DNA
  • Ribosome-associated proteins (the "ribo-interactome") have been implicated in ribosome-mediated impacts on gene expression (D. Simsek et al., Cell, (2017), 169, 1051-1065.e18). While physical heterogeneities arising from ribosomal protein mutations and ribosomal protein and ribosome-associated protein stoichiometry have recently drawn significant attention (M. L. Holland et al., Science, (2016), 353, 495-498; A. I.
  • Each rDNA operon encodes a 47S pre-rRNA that is post-transcriptionally processed into the 18S rRNA of the small ribosomal subunit and the 5.8S and 28S rRNAs of the large ribosomal subunit.
  • Human and mouse also possess tens to hundreds of copies of 5S rRNA, a third rRNA component of the large subunit, on chromosomes 1 and 8, respectively (FIG. 1A) (D. M. Stults, et al., Genome Res., (2008), 18, 13-18).
  • sequence variations in the rDNA promoter and externally transcribed spacer (ETS) regions are associated with tissue-specific transcriptional activities (J. G. Gibbons, et al, Proc Natl Acad Sci USA., (2015), 112, 2485-2490; B. A. Kuo, et al, Nucleic Acids Res., (1996),
  • enhancer elements in the rDNA intergenic spacers determine heterochromatin formation and the silencing of distinct rDNA subtypes, contributing to cell cycle-dependent expression patterns and cellular differentiation (S. Zhang, et al., PLoS ONE, (2007), 2, e902; R. Santoro, et al., EMBO Rep., (2010), 11, 52-58).
  • Evidence of altered rDNA operon expression patterns in the lifecycles of both bacteria and parasites (E. W. Mills, et al., Science., (2017) 358) argues that such control mechanisms may be evolutionarily conserved.
  • The notion that ribosomes are homogeneous and consequently passive players in the protein synthesis mechanism (K. L. McCann, et al., Science (2013) 341, 849-850; A.
  • An aspect of the present disclosure is directed to a processor programmed to perform:
  • each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity
  • mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism
  • the processor is further programmed to perform:
  • the processor is further programmed to perform
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
  • the processor is further programmed to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that used in generating the first reference sequence.
  • the present disclosure is directed to a computer-readable storage device, comprising instructions to perform:
  • each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity
  • mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the computer-readable storage device further comprises instructions to perform:
  • the present disclosure is directed to a computer-readable storage device, further comprising instructions to perform determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads
  • the computer-readable storage device further comprises instructions to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity, wherein the second reference sequence is generated from said organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that used in generating the first reference sequence
  • Another aspect of the disclosure is directed to a method comprising
  • each candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity:
  • mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
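The elimination step described above can be illustrated with a minimal sketch. It assumes reads and the reference genome are plain uppercase strings, considers only the forward strand, and holds all genome k-mers in an in-memory set; a real genome would call for an aligner or an on-disk index. The function names and the default `k = 25` (the minimum stretch length stated in this disclosure) are illustrative choices, not the disclosed implementation.

```python
def has_exact_stretch(read, target_kmers, k):
    """True if any k-length window of the read occurs verbatim in the target."""
    return any(read[i:i + k] in target_kmers for i in range(len(read) - k + 1))


def remainder_reads(candidates, reference_genome, k=25):
    """Drop candidate reads that share a perfect k-nt stretch with the reference genome.

    Hypothetical helper: what is left is the 'remainder' set, i.e. reads
    attributed to the multi-copy element rather than to the assembled genome.
    """
    genome_kmers = {
        reference_genome[i:i + k] for i in range(len(reference_genome) - k + 1)
    }
    return [r for r in candidates if not has_exact_stretch(r, genome_kmers, k)]


if __name__ == "__main__":
    toy_genome = "AAAAACCCCCGGGGGTTTTT"
    toy_candidates = ["ACCCCCG", "ACGTACGTAC"]
    # k is shortened to 5 only so the toy example fits on one line.
    print(remainder_reads(toy_candidates, toy_genome, k=5))
```

Reads shorter than `k` have no k-length window, so they can never match and are always kept by this sketch; a production pipeline would likely filter them out earlier.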
  • the method further comprises performing:
  • the method further comprises determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
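The copy-number comparison can be sketched as a simple ratio of observed to expected read counts. The text here does not spell out how the GC-content specific fragmentation rate analysis is computed, so the GC-binned rate table below (`rate_by_gc_bin`), the sliding-window model, and all function names are hypothetical stand-ins used only to make the arithmetic concrete.

```python
def gc_bin(window_seq):
    """Assign a window to one of 11 GC bins (0..10) using integer arithmetic."""
    gc_count = sum(base in "GC" for base in window_seq)
    return (gc_count * 10) // len(window_seq)


def expected_reads_single_copy(reference, window, rate_by_gc_bin):
    """Expected read count for ONE copy of the reference sequence.

    Assumption: each window start contributes a read-start rate that depends
    only on the GC bin of that window (a simplified, hypothetical model).
    """
    total = 0.0
    for i in range(len(reference) - window + 1):
        total += rate_by_gc_bin.get(gc_bin(reference[i:i + window]), 0.0)
    return total


def estimate_copy_number(n_remainder_reads, expected_single_copy):
    """Copy number as observed remainder reads over the single-copy expectation."""
    if expected_single_copy <= 0:
        raise ValueError("expected read count must be positive")
    return n_remainder_reads / expected_single_copy


if __name__ == "__main__":
    toy_rates = {10: 2.0, 7: 1.5, 5: 1.0, 2: 0.5, 0: 0.0}
    expected = expected_reads_single_copy("GGCCAATT", 4, toy_rates)
    print(expected, estimate_copy_number(1000, expected))
```

In practice the per-bin rates would have to be fitted from the sequencing data themselves; the toy values above carry no biological meaning.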
  • the set of high-throughput sequencing reads comprises DNA sequencing reads (e.g., DNA-seq).
  • RNA sequencing reads (e.g., RNA-seq).
  • the contiguous stretch of the reference sequence is at least 25 nucleotides long.
  • the contiguous stretch of the reference genome is at least 25 nucleotides long.
  • the first reference sequence is a ribosomal sequence.
  • the ribosomal sequence comprises a 47S/45S rDNA prototype sequence.
  • the first reference sequence is selected from the group consisting of NCBI reference sequences NR_003278.3, NR_003279.1, NR_003280.2, and NR_030686.1.
  • the first reference sequence is selected from the group consisting of NCBI reference sequences NR_003285.2, NR_003286.2, NR_003287.2, and NR_023379.1.
  • the first reference sequence is selected from the group consisting of GenBank IDs U13369.1 and X12811.1.
  • the first reference sequence is selected from the group consisting of BK000964.3 and Mus musculus chromosome 8 region [123538334, 123539354]
  • the second reference sequence comprises a sequence generated by sequencing ribosomes.
  • the organism is a prokaryote.
  • the organism is a mammal.
  • the mammal is selected from the group consisting of Homo sapiens, Mus musculus, and Rattus norvegicus.
  • FIGS. 1A - 1B rDNA operons in the human genome.
  • A Chromosomal locations of rDNA operons and organization of the rRNA genes within the 47S rDNA operon in the human genome. Together with the 5S rRNA, the 18S, 5.8S, and 28S rRNA form the RNA core of the ribosome
  • B Per-individual rDNA copy number estimation in humans, grouped by population.
  • FIGS. 2A - 2B High frequency genomic rRNA variants detected in human.
  • FIGS. 3A - 3D rRNA variants with population-stratified intra-individual AF.
  • (B)-(D). Tertiary structures of the ribosome showing strongly population-stratified variants (Vst>0.5) with high intra-individual AF (>20%) in any individual analyzed that are located (B) in expansion segments (ES6S and ESS) in the surface near ribosomal
  • Ribosome tertiary structures show rRNA (tan), ribosomal proteins (green), and key structural landmarks.
  • FIGS. 4A - 4C Evolutionarily conserved rRNA variants in functionally important centers of the human ribosome.
  • A 18S C543U variant (red) of helix h16 contacts DHX29 (yellow), a component of translation initiation.
  • B 18S G480A variant (red) of helix h5 occurs near contact points with residues 256-260 within Domain 2 (yellow) of eEF1A (blue).
  • C 28S G1764A variant (red) of helix H38 in the central protuberance.
  • the structural models shown are based on EMD-3056-3058(45), PDB IDs 5LVS (B E.
  • FIG. 5 Tissue-specific expression of rRNA variants.
  • rRNA variant expression heatmap and hierarchical clustering of the 26 variants detected to be differentially expressed among pairs of tissues.
  • Each row represents a biological replicate. Rows are grouped by tissue source (3 biological replicates, i.e., rows, per tissue source).
  • Each column represents an rRNA variant. Expression is normalized per rRNA variant (i.e., by column), across all replicates and tissues (i.e., 12 samples per each column). For example, the rRNA variant represented by the leftmost column has higher relative expression in brain, while the variant represented by the rightmost column has lowest relative expression in liver.
  • FIGS. 6A - 6I Bioinformatics strategy for rRNA copy number estimation and variant discovery.
  • A The rDNA prototype and RNA-seq reads from actively translating ribosomes are used to identify whole genome sequencing (WGS) reads putatively generated from rDNA regions by
  • B computational hybridization, wherein paired-end reads are selected if at least one of the mates contains a contiguous stretch of 30 nt of perfect identity to the prototype or any RNA-seq read.
  • C Candidate rRNA WGS reads are then aligned to the reference genome and the rDNA prototype, separately, and (D) only paired-end reads which do not have better alignments to the reference are retained.
  • FIG. 7 Block diagram of the system in accordance with the aspects of the disclosure.
  • CPU Central Processing Unit
  • FIG. 8 Flowchart of an embodiment identifying repetitive sequence reads among high throughput sequencing reads.
  • FIG. 9 Flowchart of embodiments for detecting rare sequence variants, and for determining the number of copies a repetitive sequence has in the genome.
  • base quality score refers to per-base estimates of error emitted by the sequencing machines, which express how confident the machine was that it called the correct base each time. For example, if the machine reads an A (adenosine) nucleotide and assigns a quality score of Q20 on the Phred scale, it is 99% sure it identified the base correctly. This means that one can expect it to be wrong in one case out of 100: so if there are several billion base calls (one gets ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases.
  • A Adenosine
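The arithmetic in the base quality score definition follows from the standard Phred relation, in which a score Q corresponds to an error probability of 10^(-Q/10). The short sketch below reproduces the Q20 example; the function names are illustrative only.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def expected_wrong_calls(n_bases, q):
    """Expected number of wrong calls if every one of n_bases has score q."""
    return n_bases * phred_to_error_prob(q)


if __name__ == "__main__":
    print(phred_to_error_prob(20))                    # error probability at Q20
    print(expected_wrong_calls(90_000_000_000, 20))   # ~90 billion calls at Q20
```

At Q20 the error probability is 1 in 100, so roughly 90 billion base calls yield on the order of 900 million wrong calls, matching the figure quoted in the definition.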
  • base quality score recalibration refers to a process in which machine learning is used to model the per-base estimates of error emitted by the sequencing machines empirically and to adjust the quality scores accordingly. For example, for a given run, it may be identified that whenever two A nucleotides are called in a row, the next base called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. Adjusting the quality scores is done over several different covariables (e.g., mainly sequence context and position in read, or cycle) in a way that is additive. As a result, the same base may have its quality score increased for one reason and decreased for another.
  • covariables e.g., mainly sequence context and position in read, or cycle
  • Base calls themselves are not corrected, i.e., one can't determine whether that low-quality A should actually have been a T, but recalibration allows the variant caller to judge more accurately how far it can trust that A.
  • computer readable medium refers to a computer readable storage device or a computer readable signal medium.
  • a computer readable storage device may be, for example, a magnetic, optical, electronic, electromagnetic, infrared, or
  • the computer readable storage device is not limited to these examples except a computer readable storage device excludes computer readable signal medium.
  • Additional examples of the computer readable storage device can include: a portable computer diskette, a hard disk, a magnetic storage device, a portable compact disc read-only memory (CD-ROM), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical storage device, or any appropriate combination of the foregoing; however, the computer readable storage device is also not limited to these examples. Any tangible medium that can contain, or store, a program for use by or in connection with an instruction execution system, apparatus, or device could be a computer readable storage device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, such as, but not limited to, in baseband or as part of a carrier wave.
  • a propagated signal may take any of a plurality of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium (exclusive of computer readable storage device) that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • GATK Best Practices refers to step-by-step recommendations from the Broad Institute for performing variant discovery analysis in high-throughput sequencing (HTS) data.
  • HTS high-throughput sequencing
  • the Best Practices documentation attempts to describe in detail the key principles of the processing and analysis steps required to go from raw reads coming off the sequencing machine, all the way to an appropriately filtered variant callset that can be used in downstream analyses.
  • "high homology" refers to a sequence that demonstrates a high level of sequence identity to another given sequence.
  • "high homology" refers to at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95% or at least 99% sequence identity.
  • two sequences with "high homology" are substantially complementary to each other.
  • nucleic acid fragment is capable of hybridizing to at least one nucleic acid strand or duplex even if less than all nucleobases do not base pair with a counterpart nucleobase.
  • a "substantially complementary" nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double stranded nucleic acid molecule during hybridization.
  • mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another.
  • sequence alignments of proteins the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages.
  • DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.
  • high throughput sequencing refers to sequencing technologies having increased throughput as compared to the traditional Sanger- and capillary electrophoresis-based approaches, for example with the ability to generate hundreds of thousands or millions of relatively short sequence reads at a time.
  • next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization.
  • next generation sequencing methods include, but are not limited to, pyrosequencing as used by the GS Junior and GS FLX Systems (454 Life Sciences, Branford, Conn.); sequencing by synthesis as used by the MiSeq and Solexa systems (Illumina, Inc., San Diego, Calif.); the SOLiD™ (Sequencing by Oligonucleotide Ligation and Detection) system and Ion Torrent Sequencing systems such as the Personal Genome Machine or the Proton Sequencer (Thermo Fisher Scientific, Waltham, Mass.), and nanopore sequencing systems (Oxford Nanopore Technologies, Oxford, United Kingdom).
  • the term "local indel realignment” refers to a process to locally realign reads such that the number of mismatching bases is minimized across all the reads.
  • a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual's genome with respect to the reference genome.
  • Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs.
  • Because read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such that mismatches are minimized across all reads.
  • Local realignment around indels allows correction of mapping errors made by genome aligners. Local realignment makes read alignments more consistent in regions that contain indels. Genome aligners can only consider each read independently, and the scoring strategies they use to align reads relative to the reference limit their ability to align reads well in the presence of indels. Depending on the variant event and its relative location within a read, the aligner may favor alignments with mismatches or soft-clips instead of opening a gap in either the read or the reference sequence. In addition, the aligner's scoring scheme may use arbitrary tie-breaking, leading to different, non-parsimonious representations of the event in different reads. In contrast, local realignment considers all reads spanning a given position. This makes it possible to achieve a high-scoring consensus that supports the presence of an indel event. It also produces a more parsimonious representation of the data in the region.
  • mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly.
  • the term "memory” as used herein comprises program memory and working memory.
  • the program memory may have one or more programs or software modules.
  • the working memory stores data or information used by the CPU in executing the functionality described herein.
  • processor may include a single core processor, a multi-core processor, multiple processors located in a single device, or multiple processors in wired or wireless communication with each other and distributed over a network of devices, the Internet, or the cloud.
  • functions, features or instructions performed or configured to be performed by a "processor" may include the performance of the functions, features or instructions by a single core processor, may include performance of the functions, features or instructions collectively or collaboratively by multiple cores of a multi-core processor, or may include performance of the functions, features or instructions collectively or collaboratively by multiple processors, where each processor or core is not required to perform every function, feature or instruction individually.
  • the processor may be a CPU (central processing unit).
  • the processor may comprise other types of processors such as a GPU (graphical processing unit).
  • the processor may be an ASIC (application-specific integrated circuit), analog circuit or other functional logic, such as a FPGA (field-programmable gate array), PAL (Phase
  • PLA programmable logic array
  • the CPU is configured to execute programs (also described herein as modules or instructions) stored in a program memory to perform the functionality described herein.
  • the memory may be, but is not limited to, RAM (random access memory), ROM (read only memory) and persistent storage.
  • the memory is any piece of hardware that is capable of storing information, such as, for example without limitation, data, programs, instructions, program code, and/or other suitable information, either on a temporary basis and/or a permanent basis.
  • terminal read quality reduction refers to reducing the base quality scores for the 8 terminal nucleotides of both ends of every read to Q0 (meaning 0% accuracy, 100% error rate).
  • variants refer to alternative forms of a gene, genetic locus or gene sequence. Each variant has a distinct nucleic acid sequence at the locus of interest. For example, the inventors detected 1,790 variant alleles at 1,662 positions of the 7,184 nucleotides (23%) in human 5S, 5.8S, 18S and 28S rDNA.
  • a variant of a gene or DNA sequence can show a sequence identity, e.g., 60%, 65%, 70%, 75%, 78%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 95%, 97%,
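Terminal read quality reduction can be sketched as a small transformation on a read's list of per-base Phred scores. The function name and the list-of-integers representation are assumptions made for illustration; the choice of 8 bases and Q0 comes from the definition above.

```python
def reduce_terminal_quality(quals, n_terminal=8):
    """Return a copy of the quality list with n_terminal bases at each end set to Q0.

    Reads shorter than 2 * n_terminal end up entirely at Q0 in this sketch.
    """
    out = list(quals)
    n = min(n_terminal, len(out))
    for i in range(n):
        out[i] = 0           # 5' end
        out[-1 - i] = 0      # 3' end
    return out


if __name__ == "__main__":
    print(reduce_terminal_quality([30] * 20))
```

Setting the ends to Q0 keeps the bases in the read while signalling to a quality-aware variant caller that they should carry no weight.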
  • Sequence identity refers to the percent of exact matches between the nucleic acids of two sequences which are being compared.
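As a concrete illustration of this definition, the sketch below computes percent identity for two sequences that are assumed to be already aligned, of equal length, and free of gaps. Handling of gaps and unequal lengths varies between tools and is deliberately left out; the function name is illustrative.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with exactly matching bases in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # one mismatch in ten bases
```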
  • the processor, the computer-readable storage device or the method of the present disclosure ("the technology described herein") are applied to discover sequence diversity, sequence variations and to estimate copy number of repetitive sequences within any multi-copy gene or genomic region.
  • the technology described herein entails the use of a "prototype" template sequence ("a reference sequence") for a specific gene type of interest that is then used to robustly and reliably quantify the subset of high-throughput whole genome sequencing (WGS) reads corresponding to this region of genomic DNA using a rigorous computational hybridization approach.
  • a reference sequence a "prototype” template sequence
  • WGS whole genome sequencing
  • the gene type of interest is ribosomal DNA.
  • multi-copy genes, genomic regions or repetitive sequences used in the technology described herein include ribosomal DNA, centromeric DNA, telomeric DNA, the human leukocyte antigen (HLA) complex of genes, the major histocompatibility complex (MHC) of genes, segmental duplications (Bailey, Jeffrey A., et al., Science, 297.5583 (2002): 1003-1007), and retrotransposons such as Alu, L1, B1, LINE, and SINE elements.
  • HLA human leukocyte antigen
  • MHC major histocompatibility complex
  • retrotransposons such as Alu, L1, B1, LINE, and SINE elements.
  • the technology described herein is used to rapidly assess the diversity of species present in a mixed biological sample by extracting and analyzing WGS reads from the mixture and computationally mapping these reads to a prototype (reference) sequence.
  • the technology described herein is applied to the assessment of the diversity of rDNA operons present within a microbiome sample to rapidly characterize the gut flora of a patient.
  • the technology described herein is used to assess the diversity of sequence variations in expressed RNA transcripts arising from RNA editing or post-transcriptional modifications - coming from a distinct region of the genome.
  • the rDNA sequence variants identified by the disclosed technology are used to distinguish local populations or ethnic groups and to predict the ancestry of an individual using sequencing data from a biological sample.
  • the discovery and identification of sequence variants in rDNA with the disclosed technology is used to screen, diagnose, or predict the onset, progression, severity, life expectancy, or general health of an individual.
  • the high-throughput sequencing method used is DNA sequencing (DNA-seq). In some embodiments the high-throughput sequencing used is RNA sequencing (RNA-seq).
  • Various aspects of the present disclosure may be embodied as a program, software, or computer instructions embodied or stored in a computer or machine usable or readable medium, or a group of media which causes the computer or machine to perform the steps of the method when executed on the computer, processor, and/or machine.
  • a program storage device readable by a machine, e.g., a computer readable medium, tangibly embodying a program of instructions executable by the machine to perform various functionalities and methods described in the present disclosure is also provided.
  • the present disclosure includes a system comprising a CPU, a display, a network interface, a user interface, a memory, a program memory and a working memory (FIG. 7), where the system is programmed to execute a program, software, or computer instructions directed to methods or processes of the instant disclosure.
  • the organism used in the present disclosure is selected from the kingdoms consisting of Monera (bacteria), Protista, Fungi, Plantae, and Animalia. In a specific embodiment, the organism is a member of the kingdom Animalia selected from the group consisting of a human (Homo sapiens), a macaque (Macaca fuscata), a horse (Equus caballus), a cow (Bos taurus), a sheep (Ovis aries), a dog (Canis lupus familiaris), a cat (Felis catus), a mouse (Mus musculus), and a rat (Rattus norvegicus). In some embodiments, the organism is a bacterium.
  • the reference genome used in the present disclosure is the genome of the organism on which the high-throughput sequencing was performed.
  • At least one sequence is designated as the predefined reference sequence (a.k.a. "prototype sequence") (labeled as "Input 2: First reference sequence" in FIG. 8).
  • the predefined reference sequence displays high homology with the putatively existing copies of the gene/allele/genetic element.
  • the sequence chosen to be the prototype can be any one of the alleles/copies of the gene or genetic element from any individual, or it can be any of the alleles/copies appearing in a reference genome or other genome assembly, or it can be a generalization such as a consensus sequence. As an increasing number of genome sequences become available and as the depth and quality of genome sequencing data for repetitive sequences improves, the prototype may be updated and improved as well.
  • the repetitive sequence is rDNA and the prototype comprises published rDNA sequences (e.g., Genbank ID U13369, or sequences in Gonzales et al., Genomics, 1995 May 20; 27(2): 320-8) or corresponding rRNA sequences.
  • the prototype comprises rDNA sequence reads generated from the sequencing of pure ribosomes or polysomes.
  • the organism is mouse (Mus musculus) and the rDNA reference sequence is selected from the group consisting of NCBI reference sequences NR_003278.3, NR_003279.1, NR_003280.2, and NR_030686.1.
  • the mouse rDNA reference sequence is selected from the group consisting of genomic region BK000964.3 and chromosome 8 region [123538334, 123539354].
  • the organism is human (Homo sapiens) and the rDNA reference sequences are selected from the group consisting of NCBI reference sequences NR_003285.2, NR_003286.2, NR_003287.2, and NR_023379.1.
  • the human rDNA reference sequence is selected from the group consisting of genomic region U13369.1 and X12811.1.
  • the process outlined in FIG. 8 begins by identifying reads in high-throughput sequencing data (which comprises a plurality of reads) that map to a predefined reference sequence (a "prototype", as described above) with a contiguous stretch of 100% sequence identity (See FIG. 8).
  • the contiguous stretch of 100% sequence identity comprises at least a stretch of 20 contiguous nucleotides, 21 nucleotides, 22 nucleotides, 23 nucleotides, 24 nucleotides, 25 nucleotides, 26 nucleotides, 27 nucleotides, 28 nucleotides, 29 nucleotides, 30 nucleotides, 35 nucleotides, or 40 nucleotides that map to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the identification of reads that map to a predefined reference sequence with a contiguous stretch of 100% sequence identity is achieved by using the command "-mumreference" from the software package "MUMmer" (Kurtz et al., Genome Biol. 2004; 5(2):R12).
  • sequence mapping is achieved by the sequence aligner "bowtie2" with the "--end-to-end" option. In some embodiments, mapping is achieved by the sequence aligner "BWA." In some embodiments, mapping is achieved by a sequence aligner selected from the group consisting of bowtie2, BWA, mrFast & mrsFast, and novoAlign.
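The candidate-read criterion above (a contiguous stretch of at least 20 nucleotides matching the prototype with 100% identity) can be illustrated with a minimal sketch. This is not the MUMmer algorithm; it is a plain dynamic-programming longest-common-substring check, and the function names `longest_exact_match` and `is_candidate` are hypothetical. It also assumes the read and prototype are given on the same strand.

```python
def longest_exact_match(read: str, reference: str) -> int:
    """Length of the longest substring of `read` found verbatim in `reference`.

    Simple O(len(read) * len(reference)) dynamic programme; real tools such as
    MUMmer use suffix-based indexes and are far faster.
    """
    best = 0
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(read) + 1):
        curr = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            if read[i - 1] == reference[j - 1]:
                # Extend the exact match ending at (i-1, j-1) by one base.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best


def is_candidate(read: str, reference: str, min_match: int = 20) -> bool:
    """True if the read shares an exact stretch of >= min_match nt with the prototype."""
    return longest_exact_match(read, reference) >= min_match
```

The threshold `min_match` corresponds to the 20 to 40 nucleotide values listed above; reverse-complement matching is deliberately left out of this sketch.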
  • some reads that map to the reference sequence may have been generated from non-repetitive regions of the genome. Some non-repetitive regions of the genome may display homology/sequence identity to subregions of the reference sequence (prototype) representing the repetitive region.
  • This procedure includes mapping reads to the non-repetitive regions of the published genome sequence of the organism and to the reference sequence, separately, and discarding a read if it maps to the non-repetitive regions of the genome with fewer errors than to the reference sequence (See FIG. 8). In some embodiments, a read is discarded if it maps to the non-repetitive regions of the genome with smaller edit distance.
  • the total edit distance of an aligned pair is calculated as the sum of the edit distance of each mate, and only concordant alignments are considered.
  • regions of the reference genome with significant homology to the reference sequence are masked by aligning contiguous subsequences of the reference sequence to the reference genome and masking regions which have > 75% homology to the reference sequence.
  • regions of the reference genome which have significant homology to the reference sequence are not considered when evaluating whether a read maps better to the non-repetitive regions of the genome than to the reference sequence.
  • read sequences that map to the reference sequence, and not to the non-repetitive regions of the genome, represent "remainder" reads (a.k.a. candidate reads).
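The filtering rule described above can be sketched as a small decision function. It assumes that each paired-end read has already been aligned separately to the prototype and to the (masked) non-repetitive genome, and that the per-mate edit distances of the best concordant alignment are available (for example from the SAM `NM` tag). The function names and the use of `None` for "no concordant alignment" are illustrative assumptions.

```python
from typing import Optional, Tuple

MateDistances = Optional[Tuple[int, int]]  # (mate 1, mate 2) or None if not concordant


def pair_edit_distance(mates: MateDistances) -> Optional[int]:
    """Total edit distance of an aligned pair: the sum over both mates."""
    if mates is None:
        return None
    return mates[0] + mates[1]


def keep_as_remainder(prototype: MateDistances, genome: MateDistances) -> bool:
    """Decide whether a read pair is kept as a 'remainder' read.

    A pair is discarded if it has no concordant alignment to the prototype, or
    if its concordant alignment to the non-repetitive genome has a strictly
    smaller total edit distance than its alignment to the prototype.
    """
    d_proto = pair_edit_distance(prototype)
    if d_proto is None:
        return False
    d_genome = pair_edit_distance(genome)
    if d_genome is None:
        return True
    return not (d_genome < d_proto)
```

Ties are kept in favor of the prototype, which follows the "strictly smaller edit distance" wording used for the worked examples later in the disclosure.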
  • a "local indel realignment" is performed to obtain a realignment of the remainder reads (See FIG. 9).
  • local indel realignment locally realigns reads such that the number of mismatching bases is minimized across all the reads.
  • indel realignment is performed using the Genome Analysis Toolkit (GATK) software package (DePristo et al., Nat Genet., 2011 May; 43(5):491-8).
  • the GATK software package is used to perform the "IndelRealigner" function.
  • the LoFreq software package is used for indel realignment.
  • a "base quality score recalibration" is performed after the realignment (See FIG. 9).
  • the base quality score recalibration involves using machine learning algorithms to model the per-base estimates of error emitted by the sequencing machines empirically and adjusting the quality scores accordingly.
  • base quality score recalibration is performed using the GATK software package, where the recalibration table is obtained by aligning reads to the reference genome and following the GATK Best Practices (DePristo et al., Nat Genet., 2011 May; 43(5):491-8).
  • base quality score recalibration is performed using the "PrintReads" function of the GATK software package.
  • the base quality score recalibration is achieved using software selected from the group consisting of GATK, BBMap and FreeBayes.
  • a "terminal read quality reduction" is performed after the base quality score recalibration (See FIG. 9).
  • the base quality scores of 6 terminal read nucleotides, at both ends of every read are reduced to Q0 (Phred scale).
  • the base quality scores of 7 terminal read nucleotides, at both ends of every read are reduced to Q0.
  • the base quality scores of 8 terminal read nucleotides, at both ends of every read are reduced to Q0.
  • the base quality scores of 9 terminal read nucleotides, at both ends of every read are reduced to Q0.
  • the base quality scores of 10 terminal read nucleotides, at both ends of every read are reduced to Q0.
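The terminal read quality reduction described above can be sketched on a FASTQ/SAM-style quality string. The sketch assumes Phred+33 encoding (so Q0 is the character `!`); the function name and the default of 8 terminal nucleotides are illustrative choices, with the number of terminal bases left as a parameter covering the 6 to 10 nucleotide embodiments.

```python
def reduce_terminal_qualities(qual: str, n_terminal: int = 8, offset: int = 33) -> str:
    """Set the base qualities of the first and last `n_terminal` bases to Q0.

    `qual` is a quality string with one character per base (Phred+offset).
    Reads no longer than 2 * n_terminal are reduced to Q0 over their full length.
    """
    if n_terminal <= 0:
        return qual
    q0 = chr(offset)  # Phred score 0 in the assumed encoding
    if len(qual) <= 2 * n_terminal:
        return q0 * len(qual)
    return q0 * n_terminal + qual[n_terminal:-n_terminal] + q0 * n_terminal
```

Because variant callers that weight evidence by base quality effectively ignore Q0 bases, this keeps the alignment intact while removing the contribution of read termini to variant calls.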
  • a "detection of rare sequence variants" is performed on the reads that have gone through terminal read quality reduction (See FIG. 9).
  • the detection of rare sequence variants is achieved by LoFreq (Wilm et al., Nucleic Acids Res., 2012 Dec; 40(22): 11189-201), software for detecting rare sequence variants from a heterogeneous cell population in non-repetitive regions of the genome.
  • the detection of rare sequence variants is achieved by variant calling algorithms selected from LoFreq, samtools, GATK, and SOAPsnp.
  • Another aspect of the disclosure includes determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with the number of reads expected to map to the reference sequence (See FIG. 9).
  • the number of reads expected to map to the reference sequence is determined by calculating GC-content specific fragmentation rates.
  • an estimation of insert size L is computed.
  • L refers to the median insert size estimated from all concordantly aligned paired-end reads.
  • L is estimated empirically from the median insert size of all properly paired paired-end reads mapped to the genome.
  • insert sizes are estimated from library fractionation protocol parameters.
  • the insert size is estimated by the median insert size as computed by the Picard tool "CollectInsertSizeMetrics."
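As a rough illustration of the empirical estimate of L, the sketch below takes the signed template lengths (the SAM `TLEN` field) of properly paired reads and returns their median. It is a simplified stand-in, not a reimplementation of Picard; the function name is hypothetical, and the sketch assumes that each pair contributes one positive and one negative `TLEN`, so only positive values are counted.

```python
from statistics import median
from typing import Iterable


def median_insert_size(template_lengths: Iterable[int]) -> float:
    """Median insert size L from SAM TLEN values of properly paired reads.

    Only positive TLEN values are used so that each pair is counted once;
    zero values (unavailable template length) are ignored.
    """
    sizes = [t for t in template_lengths if t > 0]
    if not sizes:
        raise ValueError("no properly paired reads with a positive template length")
    return median(sizes)
```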
  • known repetitive regions or regions that may have experienced structural variation should be excluded.
  • such regions can be identified as those regions with structural variation as reported by the 1000 Genomes Project or Mouse Genomes Project. Repeats such as segmental duplications (Bailey, Jeffrey A., et al., Science, 297.5583 (2002): 1003-1007) should also be excluded.
  • the fragmentation rate is calculated as follows.
  • the per-position GC-content of the hypothetical fragment of length L at p is defined to be the number of guanine (G) and cytosine (C) nucleotides in the region [p, p+L-1].
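The per-position GC-content defined above can be computed for every start position with a sliding window. The sketch below uses 0-based positions (so index p in the returned list corresponds to the window [p, p+L-1]); the function name is an illustrative choice.

```python
from typing import List


def gc_per_position(seq: str, L: int) -> List[int]:
    """Number of G and C bases in each window [p, p+L-1] of `seq` (0-based p)."""
    seq = seq.upper()
    if L <= 0 or L > len(seq):
        return []
    is_gc = [1 if base in "GC" else 0 for base in seq]
    window = sum(is_gc[:L])  # GC count of the first window
    counts = [window]
    for p in range(1, len(seq) - L + 1):
        # Slide by one base: add the base entering, drop the base leaving.
        window += is_gc[p + L - 1] - is_gc[p - 1]
        counts.append(window)
    return counts
```

For example, `gc_per_position("GCATGC", 3)` evaluates the windows GCA, CAT, ATG and TGC in turn.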
  • the estimation of a copy number is achieved by comparing the number of observed reads mapping to the prototype to the number of expected reads as estimated by the GC-content specific fragmentation rates.
  • the expected number of fragments generated at p (i.e., with 5' end at p)
  • G(p) gives the GC-content of the region [p, p+L-1].
  • the division by 2 is necessary for diploid organisms where the genome assembly aligned to is haploid.
  • using this formula, one can estimate the copy number of any region of the genome or the copy number of any repetitive element using a prototype. Namely, given a region or prototype R with observed fragment count X, then the copy number C of R is estimated as
  • the estimation of the copy number is achieved by comparing the total number of reads aligned to the reference sequence to the average coverage of reads across the genome. In some embodiments, the estimation of the copy number is achieved by comparing the total number of reads aligned to the reference sequence to an estimate of coverage derived from selected regions of the genome, such as the exons of single-copy genes.
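The formulas for the fragmentation rate and the copy number did not survive in this text, so the following is only one plausible reading of the surrounding description, not the disclosed formula. It assumes that the fragmentation rate for a GC-content value g is the number of observed fragments whose 5' end lies at a background position with GC-content g, divided by the number of such positions; that the expected number of fragments at p per haploid copy is that rate divided by 2; and that the copy number is the observed fragment count divided by the summed expectation over the region. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def fragmentation_rates(gc_at_position: Sequence[int],
                        fragment_starts: Sequence[int]) -> Dict[int, float]:
    """Assumed GC-specific rate: fragments starting at positions with GC-content g,
    per position with GC-content g, over a single-copy background region."""
    positions_per_gc = Counter(gc_at_position)
    fragments_per_gc = Counter(gc_at_position[p] for p in fragment_starts)
    return {g: fragments_per_gc[g] / n for g, n in positions_per_gc.items()}


def expected_fragments_per_copy(region_gc: Sequence[int],
                                rates: Dict[int, float],
                                ploidy: int = 2) -> float:
    """Expected fragments from ONE copy of the region (assumed: sum of rates / ploidy)."""
    return sum(rates.get(g, 0.0) for g in region_gc) / ploidy


def estimate_copy_number(observed_fragments: float,
                         region_gc: Sequence[int],
                         rates: Dict[int, float],
                         ploidy: int = 2) -> float:
    """Copy number as observed fragments divided by expected fragments per copy."""
    expected = expected_fragments_per_copy(region_gc, rates, ploidy)
    if expected == 0:
        raise ValueError("expected fragment count is zero for this region")
    return observed_fragments / expected
```

With `ploidy=2` the division mirrors the statement that a division by 2 is needed for diploid organisms aligned to a haploid assembly; GC-content values absent from the background are given a rate of 0 here, which is a simplification.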
  • the disclosure is directed to a processor programmed to detect rare sequence variants in repetitive sequences from high throughput sequencing reads.
  • a processor is programmed to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is a repetitive sequence present in multiple copies in the genome of the organism.
  • the processor is programmed to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the processor is further programmed to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and to identify the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the processor is further programmed to perform a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
  • the processor is further programmed to perform a base quality score recalibration on the remainder reads after the realignment.
  • the processor is further programmed to perform a terminal read quality reduction on the remainder reads after the base quality score recalibration.
  • the processor is further programmed to detect rare sequence variants on the reads that have gone through terminal read quality reduction.
  • the disclosure provides a processor programmed to detect rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence.
  • the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded.
  • Another aspect of the disclosure is directed to a processor programmed to detect the number of copies of a repetitive sequence in a genome.
  • a processor is programmed to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is present in multiple copies in the genome of the organism.
  • the processor is programmed to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the processor is further programmed to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and to identify the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the processor is further programmed to determine the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with the number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
  • the processor is further programmed to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity.
  • the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that used in generating the first reference sequence.
  • the disclosure is directed to a computer-readable storage device storing computer readable instructions, which when executed by a processor cause the processor to detect rare sequence variants in repetitive sequences.
  • a computer-readable storage device comprises instructions to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is a repetitive sequence present in multiple copies in the genome of the organism.
  • the computer-readable storage device comprises instructions to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the computer-readable storage device further comprises instructions to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and to identify the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the computer-readable storage device further comprises instructions to perform a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
  • the computer-readable storage device further comprises instructions to perform a base quality score recalibration on the remainder reads after the realignment.
  • the computer-readable storage device further comprises instructions to perform a terminal read quality reduction on the remainder reads after the base quality score recalibration.
  • the computer-readable storage device further comprises instructions to detect rare sequence variants on the reads that have gone through terminal read quality reduction.
  • the disclosure provides a computer-readable storage device comprising instructions to detect rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence.
  • the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded.
  • Another aspect of the disclosure is directed to a computer-readable storage device storing computer readable instructions, which when executed by a processor cause the processor to detect the number of copies of a repetitive sequence in a genome.
  • a computer-readable storage device comprises instructions to perform mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is present in multiple copies in the genome of the organism.
  • the computer-readable storage device comprises instructions to identify candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the computer-readable storage device further comprises instructions to map the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and to identify the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the computer-readable storage device further comprises instructions to determine the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with the number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
  • the computer-readable storage device further comprises instructions to perform mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity.
  • the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that used in generating the first reference sequence.
  • the disclosure is directed to a method for detecting rare sequence variants in repetitive sequences.
  • the method comprises mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is a repetitive sequence present in multiple copies in the genome of the organism.
  • the method comprises identifying candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the method further comprises mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the method further comprises performing a local indel realignment of the remainder reads to the reference sequence to obtain a realignment of the remainder reads.
  • the method further comprises performing a base quality score recalibration on the remainder reads after the realignment.
  • the method further comprises performing a terminal read quality reduction on the remainder reads after the base quality score recalibration.
  • the method further comprises detecting rare sequence variants on the reads that have gone through terminal read quality reduction.
  • the disclosure provides a method of detecting rare sequence variants in repetitive sequences from high throughput sequencing reads by excluding at least one process selected from the list consisting of indel realignment, base quality score recalibration, terminal read quality reduction, candidate rDNA or rRNA read identification by minimal perfect homology, and comparison between alignments to the non-rDNA reference genome and the reference sequence.
  • the exclusion of a process results in faster detection of rare sequence variants in repetitive sequences from high throughput sequencing reads as compared to instances where no process is excluded.
  • Another aspect of the disclosure is directed to a method of detecting the number of copies of a repetitive sequence in a genome.
  • the method comprises performing mapping a set of high-throughput sequencing reads generated from an organism to a first reference sequence to obtain an initial alignment.
  • the first reference sequence is present in multiple copies in the genome of the organism.
  • the method comprises identifying candidate reads within the set based on the initial alignment.
  • a candidate read comprises a sequence that maps to a contiguous stretch of the reference sequence with 100% sequence identity.
  • the contiguous stretch of the reference sequence with 100% sequence identity comprises at least 20 nucleotides, at least 21 nucleotides, at least 22 nucleotides, at least 23 nucleotides, at least 24 nucleotides, at least 25 nucleotides, at least 26 nucleotides, at least 27 nucleotides, at least 28 nucleotides, at least 29 nucleotides, at least 30 nucleotides, at least 35 nucleotides, or at least 40 nucleotides.
  • the method further comprises mapping the identified candidate reads to a reference genome to eliminate reads that comprise a sequence which maps to a contiguous stretch of the reference genome with 100% sequence identity, and identifying the remainder reads as reads that map to the multiple copies of the first reference sequence in the genome of the organism.
  • the method further comprises determining the number of copies of the first reference sequence in the genome of the organism by comparing the number of reads in the remainder reads with the number of reads expected to map to the reference sequence.
  • the number of reads expected to map to the reference sequence is calculated from a GC-content specific fragmentation rate analysis of the remainder reads.
  • the method further comprises mapping the remainder reads to a second reference sequence to eliminate reads that do not comprise a sequence which maps to a contiguous stretch of the second reference sequence with 100% sequence identity.
  • the second reference sequence is generated from the organism and is different from the first reference sequence in that the second reference sequence was generated using a different method than that used in generating the first reference sequence.
  • Mouse mammary epithelial cells were obtained from ATCC and were grown in T225 flasks (Corning) in high glucose (4.5 g/L) DMEM (Gibco) supplemented with 10% fetal bovine serum (Atlanta Biologicals), 1% penicillin/streptomycin (Gibco), and insulin (10 µg/mL). HEK293T cells were grown in the same manner and media with the exception of insulin. Both cell lines were routinely checked for mycoplasma using the MycoAlert™ Mycoplasma Detection Kit (Lonza). To stabilize polysomes prior to harvest, cells were treated with 350 mM cycloheximide (Sigma Aldrich) for 30 minutes at 37 °C. Cells were then released from the surface using 0.05% Trypsin-EDTA (Gibco), centrifuged at 750 rpm for 5 min.
  • lysis buffer (1 g cells/mL lysis buffer) containing 20 mM Tris-HCl (pH = 7.5), 5 mM MgCl2, 10 mM KCl, 1 mM DTT, 5 mM putrescine, 350 mM cycloheximide (Sigma Aldrich),
  • Clarified lysate was then loaded onto pre-cooled 10-50% sucrose gradients buffered in the same manner as the lysis buffer, but without RNase inhibitors, protease inhibitors, or DNase.
  • Sucrose gradients were spun at 30k rpm for 3 h at +4 °C in an SW32 rotor in an Optima™ XPN-100 ultracentrifuge (Beckman Coulter). Gradients were then fractionated using a BR-188 density gradient fractionation system (Brandel). Fractions corresponding to polysomes were collected and pelleted at 35.7k rpm for 18 h.
  • NMuMG polysomes, isolated as described above in biological triplicate, were denatured, reduced, and alkylated followed by proteolytic digestion with LysC (Wako Chemicals) and trypsin (Promega). Approximately 120 femtomol of each desalted (S. Shao et al., Cell, (2016), 167, 1229-1240.e15) sample was analyzed by a nano-LC-MS/MS system (Ultimate 3000 coupled to a QExactive Plus, Thermo Scientific). Peptides were separated using a 12 cm x 75 µm C18 column (3 µm particles, Nikkyo Technos Co., Ltd.
  • Prior to sequencing, the quantity and concentration of all RNA samples were determined using a NanoVue™ Spectrophotometer (GE Healthcare). High RNA integrity was verified using a 2100 Bioanalyzer (Agilent). Illumina-compatible libraries were prepared using a modified TruSeq RNA Sample Preparation method (Illumina) that omitted the poly-A enrichment step. Sequencing was performed using 100 nucleotide paired-end sequencing on a HiSeq 4000 (Illumina) in the Genomics Resources Core Facility at Weill Cornell Medicine.
  • Genbank IDs of the rRNA prototypes (reference sequences) used are:
  • Genbank IDs of rDNA prototypes used were: U13369.1 and X12811.1 (human), BK000964.3 and chromosome 8 region [123538334, 123539354] (mouse). This region of chromosome 8 was chosen as a representative 5S rRNA and operon for mouse, as used previously (D. M. Stults, et al., Genome Res., (2008), 18, 13-18), since no such sequence could be found in Genbank.
  • the whole genome reference sequence assemblies used were GRCm38 for mouse and GRCh38 for human, downloaded from Ensembl. dbSNP release b147 was used for known variant locations in the reference genome assemblies. Read libraries from the 1000 Genomes Project and Mouse Genomes Project were downloaded from NCBI SRA using the sra-toolkit.
  • BaseRecalibrator tool from GATK with mismatch context size of 3, and using the dbSNP and individual-/strain-specific variant sites for the "-knownSites" option, as well as the reference regions declared to be unannotated rRNAs, as defined above. Paired-end read insert size distributions were computed with the CollectInsertSizeMetrics tool of Picard.
  • a paired-end read was discarded if it did not map concordantly to the prototype rDNA, or if there existed a concordant alignment in the reference genome on one of the fully assembled chromosomes that did not overlap with any of the unannotated rRNA regions which had a strictly smaller edit distance than the edit distance of the paired-end read on the rDNA prototype.
  • the edit distance was computed as the sum of the edit distances for each mate.
  • the base qualities for the 8 nt on each end of each mate were manually reduced to Q0.
  • variants were called with LoFreq, restricted to the regions of the rDNA consisting of the transcribed rRNA.
  • the variant discovery strategy detected a variant allele defined by variant nucleotide v at position x in individual A.
  • let f be the fraction of reads for individual B overlapping position x which exhibit nucleotide v, and let n be the estimated rDNA copy number of individual B. Then, the variant AF in individual B is f if f*n > 1, and 0 otherwise.
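The rule for assigning a variant allele frequency in a second individual can be written as a one-line function. The function name is illustrative; the threshold reflects the requirement that the variant fraction correspond to more than one rDNA copy (f*n > 1).

```python
def variant_allele_frequency(f: float, n: float) -> float:
    """AF of a known variant in individual B.

    f: fraction of reads overlapping position x that carry nucleotide v.
    n: estimated rDNA copy number of individual B.
    Returns f when the variant would account for more than one rDNA copy
    (f * n > 1), and 0.0 otherwise.
    """
    return f if f * n > 1 else 0.0
```

For instance, with an estimated 200 rDNA copies, a read fraction of 0.02 corresponds to about 4 copies and is reported, while a fraction of 0.004 corresponds to less than one copy and is set to 0.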
  • Short-read sequencing data read depth distribution is biased according to the GC-content of the fragment that generates the read(s) (J. G. Gibbons, et al., Nat Commun., (2014), 5, 4850).
  • the median fragment length was computed with Picard "CollectInsertSizeMetrics".
  • RNA-seq reads were aligned against the rRNA prototypes with bowtie2 with parameters as described above ("Calling rRNA variants").
  • To prevent false positive variant calls due to alignment artifacts and poorly aligned read termini, reads were realigned with GATK and the quality scores of read termini were manually reduced to Q0 to preclude their contribution to variant calling, again as described above ("Calling rRNA variants").
  • LoFreq, which implements the strand bias test, accounts for quality scores, and has been proven to have an extremely low false positive rate, was then used to call sequence variants. Per-sample allele frequencies for detected variants were then computed from read alignment pileups.
  • each sample represents RNA-seq reads obtained from a single organ of a single mouse.
  • for each organ, there were three observations, one from each of the three mice analyzed.
  • Differential expression of variant allele frequencies between pairs of tissues was calculated using Limma, which uses a moderated t-statistic to account for common biological variation, with a false discovery rate (FDR) threshold of 0.05.
  • Rationalizing that the full diversity of rRNA variants is best captured by actively translating ribosomes, RNA-seq was performed on polysomes isolated from proliferating mouse and human cells to use as bait. Polysomes represent the cellular fraction of ribosomes actively engaged in translating single mRNAs (D. J. Adams, et al., Mamm Genome, (2015), 26, 403-412; M. Reschke et al., Cell Rep., (2013), 4, 1276-1287; H. Chasse, et al., Nucleic Acids Res., (2017), 45, e15; E.
  • Example 3 rDNA copy number varies widely across individuals in human and mouse
  • this variant discovery strategy detected the known rDNA sequence variants using WGS data from Escherichia coli K-12 with high specificity and sensitivity, where the estimated allele frequency (AF) was highly correlated with the true AF (r>0.98, p < 1e-6).
  • Example 5 Humans exhibit pervasive inter- and intra-individual sequence variation in the rRNA genes
  • intra-individual rRNA variant AFs ranged over a broad spectrum, from a single copy to all rDNA copies in an individual's genome, with numerous variants having extreme penetrance-to-population relationships. At one extreme, 19 variants were observed occurring in >50% of humans with a maximum intra-individual AF <5%. In other words, these variants were found in over half of individuals tested, but within any single individual, at most 5% of the individual's rDNA operons contained the variant. For instance, the G928A 18S rRNA variant occurred with a maximum intra-individual AF of 3.7%.
  • 62 rRNA variants were found to occur with maximum intra-individual AF >75%, meaning that at least 75% of the rDNA operons within at least one individual contained the variant. Further, 22 variants in the 18S and 28S rRNA were found to occur at a minimum intra-individual AF >10% in more than 500 individuals. Such high intra-individual AFs suggest that the majority of expressed ribosomes in these respective individuals likely contain these variants.
  • Eukaryotic rRNA is post-transcriptionally modified in functional centers of the ribosome (I. B. Lomakin, et al., Nature, (2013), 500, 307-311; W. A. Decatur, et al., Trends Biochem Sci., (2002), 27, 344-351; J. Ofengand, et al., J Mol Biol, (1997), 266, 246-268).
  • Of the 1,790 rDNA variants identified in the 5S, 5.8S, 18S and 28S rRNA, 61 localized to 59 of the 256 (23%) positions known to be modified (I. B. Lomakin, e
  • the C462T variant in 18S rRNA, a post-transcriptionally modified position that is a known site of interaction with eukaryotic release factor 1 (eRF1) during translation termination and ribosome release (B. E. H. Maden, Progress in Nucleic Acid Research and Molecular Biology, (1990)), is found with intra-individual AF up to 61%.
  • the A1183G variant in 18S rRNA, found with intra-individual AF as high as 27%, occurs at a position located in intersubunit bridge eB14 near the decoding center of the ribosome that is 2'-O-methylated by the small-nucleolar RNA (snoRNA) snR41 (1.
  • rRNA variants may impact ribosome function
  • rRNA variants with high intra-individual AF (>20%) localized to intersubunit bridge elements or to the binding sites of ribosomal proteins that mediate them (B1a, B2b, B2c, B2e, B4, B7b, B7c, eB11, eB14, eB8b).
  • no variants were observed in intersubunit bridges B2a, B5, or B5b.
  • Intersubunit bridge B1a, often referred to as the A-site finger (ASF)
  • ASF A-site finger
  • Bridge B2c contributes to the central intersubunit bridge structure, bridge B2, that links the site of peptide-bond formation within the large subunit to the site responsible for aminoacyl-tRNA decoding in the small subunit decoding center (H.
  • Bridge eB13, located at the leading edge of the translating ribosome and established through ribosomal protein eL24, extends from the base of the large subunit to ribosomal protein eS6, a small subunit protein implicated in mTOR-dependent regulation of the translation mechanism (T. Budkevich et al., Mol Cell, (2011), 44, 214-224; A. C. Hsieh et al., Nature, (2012), 485, 55-61).
  • variants spanning bridge elements may impact the process of translation initiation by altering intersubunit bridge formation or the dynamic remodeling of intersubunit bridge elements that accompanies the mechanism of protein synthesis (H. Chasse, et al., Nucleic Acids Res., (2017), 45, e15; H. Khatter, et al., Nature, (2015), 520, 640-645; J. A. Dunkle et al., Science, (2011), 332, 981-984).
  • Variants on both sides of an intersubunit bridge may also template a mechanism for coordinating the pairing of specific 40S and 60S subunits comprised of distinct rRNA alleles in the cellular milieu.
  • Example 8 rRNA variants stratify by ancestry
  • The potential for rRNA variants to impact ribosome function may be greater if they occur on the same rRNA or in the same ribosome. Unambiguous determination of the occurrence of rRNA variants on the same operon is prohibited by the large number of highly homologous rDNA operons in the genome and the short insert size of the paired-end reads (R. Heilig, Nature, (2004), 431, 931-945). Therefore, correlations between intra-individual rRNA variant AFs were assessed to provide a first approximation of whether variants may be genetically linked. To do so, all pairwise intra-individual AF correlations were calculated for variants occurring in >75% of 1,887 unrelated individuals.
  • Correlations of this kind may exist if variants occurred within the same ancestral rDNA operon, which has since experienced extensive duplication and deletion rearrangements through meiotic recombination, or if direct or indirect functional linkages exist between these distal regions of the ribosome.
  • Example 10 rRNA variants are evolutionarily conserved
  • Table 1 Summary of rRNA variants detected in mouse strains from the Mouse Genomes Project.
  • the C543T variant in h16 of 18S rRNA, which has an intra-individual AF as high as 22% in humans, is thought to directly contact the mRNA helicase DHX29 (D. F. Gudbjartsson et al., Scientific Data, (2015), 2, 150011), an extra-ribosomal factor implicated in translation initiation (FIG. 4A).
  • the G480A variant in h5 of 18S rRNA, which has an intra-individual AF as high as 65% in humans, resides at a position thought to contribute to GTP hydrolysis during tRNA selection (O.
  • Example 10 rRNA variants are differentially expressed between organs
  • RNA-seq without rRNA depletion (Methods).
  • 70 distinct rRNA variants were reproducibly detected: 19 in the 18S, 1 in the 5.8S, and 50 in the 28S. Consistent with a direct relationship between rDNA variant copy number and expression level, 31 of these variants (44%) were also detected in the rDNA of the BALB/cJ mouse strain.
  • ES27: Half of the ten variants showing the largest tissue-specific expression differences were found in ES27 (H63), located on the back side of the 60S subunit.
  • This highly dynamic ES plays an important yet unknown role in stabilizing the ribosome-associated complex (RAC) and coordinating extra-ribosomal factors implicated in co-translational protein folding near the nascent peptide exit tunnel (G. Serin et al., Biochimie, (1996), 78, 530-538; R. Freitas, et al., Kinematic Self-Replicating Machines (Landes Bioscience, 2004)).
  • the full repertoire of ES27 contributions to the translation mechanism has, however, yet to be elucidated. Intriguingly, variants in these regions were also found in humans at high intra-individual AF, suggesting that their tissue-specific expression is conserved.
  • Example 11 rRNA variants are incorporated into actively translating ribosomes
  • 62 rRNA variants were detected, of which 30 were found in the rDNA of at least one of the mouse strains analyzed and another 19 coincided with known positions of rRNA modification.
  • Ribosomal DNA is currently "dark matter" that is missing from contemporary genome assemblies (I. L. Gonzalez, et al., Nucleic Acids Res., (1988), 16, 10213-10224). Consequently, the contribution of rRNA heterogeneities to ribosomopathies or functional distinctions in the pool of assembled ribosomes has yet to be explored.
  • the rigorous rDNA copy number estimation and rRNA sequence variant discovery strategies that are described here reveal pervasive intra- and inter-individual rRNA sequence variation in the human genome, with at least 1,662 of the 7,184 nt of rRNA (23%) exhibiting sequence variation in the global population, and variation in rDNA copy number between individuals that spans nearly two orders of magnitude.
  • Tissue-specific rRNA variant expression and extensive overlap between the discovered rRNA variants and positions of known functional importance in the assembled ribosome, including sites identified as being post-transcriptionally modified, were further observed (W. A.
  • rDNA operon-specific regulation or 2] tissue-to-tissue variation in the copy number of sub-classes of rDNA operons bearing specific variant alleles. While differences in total rDNA copy number between mouse tissues have been reported (H. F. Noller, Philos Trans R Soc Lond, B, Biol Sci., (2017), 372), it is unknown how the copy number of specific sub-classes of rDNA operons varies by tissue. Notably, larger variations in tissue-specific rRNA variant expression were observed than expected from rDNA copy variation alone.
  • chromatin structure and epigenetic modifications of the rDNA promoter contribute to the specificity of rDNA operon transcription (R. M. Voorhees, et al., Science, (2010), 330, 835-838; C. Leidig et al., Nat Struct Mol Biol, (2013), 20, 23-28; R. Santoro, et al., Mol Cell, (2001), 8, 719-725).
  • the expression of rDNA is also correlated with nucleotide variants within the IGS region, rDNA promoter, and the 5’ ETS region upstream of the rRNA genes (J. W. Briggs et al, Mol Cell, (2017) 67, 3-4; B. A.
  • tissue-specific rRNA variant expression is likely also regulated by rDNA promoter, IGS, and ETS sequence heterogeneities.
  • IGS regions contain variable-length tandem repeats, transposable Alu repeat elements that are highly abundant in the human genome, simple sequence microsatellites (B. Xu et al., PLoS Genet., (2017), 13, e1006771), and exhibit substantial structural variation (P. Grozdanov, et al., Genomics, (2003), 82, 637-643; R.
  • compositions, a feature that has been recently connected to the translation of distinct mRNAs (S. Caburet et al., Genome Res., (2005), 15, 1079-1085).
  • rDNA copy number and rRNA sequence variation may template specialized functions that enable the ribosome to actively participate in determining the cellular proteome (M. Barna, FASEB J (2017); J. W. Briggs, et al., Mol Cell, (2017) 67, 3-4).
  • the present analysis likely represents a lower bound on the true diversity of rDNA sequences in the mouse and human genomes and thus sequence variation within the assembled ribosome.
  • Read coverage is low in regions of extremely high GC-content (35), which applies to rRNA in general and especially to regions of the 28S (>80% GC) for which we observed nearly zero read depth. These regions include those identified as inter-species variable domains that are expressed and present in actively translating ribosomes (J. G. Gibbons, et al., Proc Natl Acad Sci USA., (2015), 112, 2485-2490; H. Tseng et al., PLoS ONE, (2008), 3, e1843; H.
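
The allele-frequency rule stated at the top of this list (a variant keeps its frequency f in individual B only if f*n > 1, where n is B's estimated rDNA copy number, and is otherwise set to 0) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function name `thresholded_af` and the input checks are hypothetical and do not appear in the source.

```python
def thresholded_af(f: float, n: float) -> float:
    """Apply the rule 'AF is f if f*n > 1, and 0 otherwise'.

    f: variant allele frequency, as a fraction between 0 and 1.
    n: estimated rDNA copy number of the individual.
    The name and the validation below are illustrative assumptions.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # f * n is the number of rDNA copies expected to carry the variant;
    # the frequency is kept only when that exceeds one copy.
    return f if f * n > 1.0 else 0.0


if __name__ == "__main__":
    # 5% of 200 copies is about 10 copies, so the AF is kept.
    print(thresholded_af(0.05, 200))
    # 0.2% of 200 copies is below one copy, so the AF is set to 0.
    print(thresholded_af(0.002, 200))
```

Read this way, the threshold discards frequencies that would correspond to less than one physical rDNA copy in the individual.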

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medicinal Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Theoretical Computer Science (AREA)
  • Plant Pathology (AREA)
  • Microbiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present invention relates to processors, computer-readable media, and methods for detecting rare sequence variants in repetitive sequences in a genome. The present invention also relates to processors, computer-readable media, and methods for determining the copy number of a given repetitive sequence in a genome.
PCT/US2019/020024 2018-02-28 2019-02-28 Detection of variant alleles in complex repetitive sequences in whole genome sequencing data sets Ceased WO2019169117A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862636466P 2018-02-28 2018-02-28
US62/636,466 2018-02-28

Publications (1)

Publication Number Publication Date
WO2019169117A1 true WO2019169117A1 (fr) 2019-09-06

Family

ID=67805916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/020024 Ceased WO2019169117A1 (fr) 2018-02-28 2019-02-28 Detection of variant alleles in complex repetitive sequences in whole genome sequencing data sets

Country Status (1)

Country Link
WO (1) WO2019169117A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113308540A (zh) * 2020-02-27 2021-08-27 上海鹍远生物技术有限公司 Thyroid nodule-associated rDNA methylation markers and uses thereof
CN113436683A (zh) * 2020-03-23 2021-09-24 北京合生基因科技有限公司 Method and system for screening candidate insert fragments
WO2021195594A1 (fr) * 2020-03-26 2021-09-30 San Diego State University (SDSU) Foundation, dba San Diego State University Research Foundation Compositions and methods for treating or ameliorating infections
CN113571131A (zh) * 2021-08-06 2021-10-29 广东省农业科学院水稻研究所 Pan-genome construction method and corresponding structural variation mining method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083004A1 (fr) * 2013-12-02 2015-06-11 Population Genetics Technologies Ltd. Method for evaluating minority variants in a sample

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015083004A1 (fr) * 2013-12-02 2015-06-11 Population Genetics Technologies Ltd. Method for evaluating minority variants in a sample

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BENJAMINI, Y ET AL.: "Summarizing and Correcting the GC Content Bias in High-Throughput Sequencing", NUCLEIC ACIDS RESEARCH, vol. 40, no. 10, 9 February 2012 (2012-02-09), pages 1 - 14, XP055162924, doi:10.1093/nar/gks001 *
GROZDANOV, P ET AL.: "Complete Sequence of the 45-Kb Mouse Ribosomal DNA Repeat: Analysis of the Intergenic Spacer", GENOMICS, vol. 82, no. 6, December 2003 (2003-12-01), pages 637 - 643, XP004472311, doi:10.1016/S0888-7543(03)00199-X *
PARKS, MM ET AL.: "Variant Ribosomal RNA Alleles Are Conserved And Exhibit Tissue-Specific Expression", SCIENCE ADVANCES, vol. 4, no. 2, 28 February 2018 (2018-02-28), pages 1 - 13, XP055635181 *
WILM, A ET AL.: "Lofreq: A Sequence-Quality Aware, Ultra-Sensitive Variant Caller For Uncovering Cell -Population Heterogeneity From High-Throughput Sequencing Datasets", NUCLEIC ACIDS RESEARCH, vol. 40, no. 22, 12 October 2012 (2012-10-12), pages 11189 - 11201 *
YU , S ET AL.: "A Portrait of Ribosomal DNA Contacts with Hi-C Reveals 5S and 45S rDNA Anchoring Points in the Folded Human Genome", GENOME BIOLOGY AND EVOLUTION, vol. 8, no. 11, 25 October 2016 (2016-10-25), pages 3545 - 3558, XP055635180 *

Similar Documents

Publication Publication Date Title
Lee et al. Simultaneous profiling of chromatin accessibility and methylation on human cell lines with nanopore sequencing
Swart et al. The Oxytricha trifallax macronuclear genome: a complex eukaryotic genome with 16,000 tiny chromosomes
Plassais et al. Whole genome sequencing of canids reveals genomic regions under selection and variants influencing morphology
Lowe et al. Transcriptomics technologies
Shibata et al. Extensive evolutionary changes in regulatory element activity during human origins are associated with altered gene expression and positive selection
Shukla et al. Comprehensive analysis of cancer-associated somatic mutations in class I HLA genes
Duan et al. Adaptation of A-to-I RNA editing in Drosophila
Bottomly et al. Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays
Alvarez-Cubero et al. Next generation sequencing: an application in forensic sciences?
Guttman et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs
Hoegg et al. Phylogeny and comparative substitution rates of frogs inferred from sequences of three nuclear genes
Llanes et al. The genome of Leishmania panamensis: insights into genomics of the L.(Viannia) subgenus.
Mudge et al. Functional transcriptomics in the post-ENCODE era
Hedges et al. Comparison of three targeted enrichment strategies on the SOLiD sequencing platform
Zhang et al. Isoform evolution in primates through independent combination of alternative RNA processing events
Bayega et al. Current and future methods for mRNA analysis: a drive toward single molecule sequencing
KR101832834B1 (ko) Method and system for variant detection based on multiple dot plot analysis
WO2018144782A1 (fr) Methods for detecting somatic and germline variants in impure tumors
JP2017537646A (ja) Sequencing controls
WO2019169117A1 (fr) Detection of variant alleles in complex repetitive sequences in whole genome sequencing data sets
Orozco et al. Intergenerational genomic DNA methylation patterns in mouse hybrid strains
Nowick et al. Gain, loss and divergence in primate zinc-finger genes: a rich resource for evolution of gene regulatory differences between species
Dharshini et al. Identifying suitable tools for variant detection and differential gene expression using RNA-seq data
US20220392570A1 (en) Method for screening ivf embryos
Bredemeyer et al. Rapid macrosatellite evolution promotes X-linked hybrid male sterility in a feline interspecies cross

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19761542

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19761542

Country of ref document: EP

Kind code of ref document: A1