HK40058475A - Systems and methods for identifying chromosomal abnormalities in an embryo - Google Patents

Info

Publication number
HK40058475A
Authority
HK
Hong Kong
Prior art keywords
sequence information
genomic sequence
sample
baseline
sample genomic
Prior art date
Application number
HK62022047635.7A
Other languages
Chinese (zh)
Inventor
约翰·布鲁克
迈克尔·J·拉奇
乔舒亚·布拉热克
Original Assignee
合作基因组公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 合作基因组公司 filed Critical 合作基因组公司
Publication of HK40058475A publication Critical patent/HK40058475A/en

Description

System and method for identifying chromosomal abnormalities in embryos
Technical Field
Embodiments disclosed herein are generally directed to systems and methods for identifying candidate embryos for implantation into the uterus. More specifically, there is a need for automated systems and methods for identifying chromosomal abnormalities in in vitro fertilized candidate embryos before implantation into an expectant mother.
Background
The purpose of in vitro fertilization (IVF) is to produce an embryo that is then implanted into the expectant mother. For a given embryo, it is important to check for defects that may prevent the successful birth of a healthy child; given multiple embryos, the best embryo must be selected for each IVF cycle to increase the likelihood of successful implantation.
In the past, clinical experts identified non-optimal embryos by microscopic examination of embryo morphology or of chromosomal banding patterns. These methods have poor resolution and are inconsistent because they rely on manual operations. Conventional karyotyping is limited to detecting features larger than 5 megabases (mb), whereas FISH analysis is limited to features smaller than 1 mb, and both are limited by a set of probes that must be designed for particular genetic loci. Microscopic examination of candidate embryos by human experts can also introduce clerical and examination errors and other uncertainties into the embryo screening process.
The availability of Next Generation Sequencing (NGS) provides genome wide coverage, requiring much less custom design effort than conventional karyotyping methods. Furthermore, assay costs can be controlled by sequencing depth, which can also be optimized for the required resolution, where deeper sequencing can achieve higher resolution.
However, NGS karyotyping does have problems with signal-to-noise ratio. In particular, because of confounding factors such as sample handling, amplification bias, guanine-cytosine (GC) content, and technical differences between genetic loci, similarly sized regions with the same copy number typically have very different sequence counts. The differences caused by these confounders are typically greater in magnitude than the differences caused by true variation in copy number. Therefore, accurate interpretation of NGS data requires methods that can effectively separate the copy number signal from confounding noise.
Furthermore, even given a denoised copy number signal, interpreting it as either a cytogenetic state (such as a known aneuploidy or segmental duplication/deletion) or a karyotype presents several challenges. The first problem is the volume of samples that the laboratory must handle. Another problem is the rate of artifacts (even in denoised data) that resemble copy number variation in genomic regions that are in fact normal (normal meaning that the copy number of an autosomal region is 2 and that there are 2 sex chromosomes, at least 1 of which is Chr X). Also, not every copy number change has the same clinical significance, so chromosomal abnormalities with serious consequences should be given greater weight. Finally, previous and current methods rely heavily on human inspection of graphs, which introduces uncertainty and errors arising from subjectivity, fatigue, insufficient training, and other causes.
Accordingly, there is a need for a method or system that can accurately/robustly identify chromosomal abnormalities in candidate embryos to allow selection of embryos that are most likely to result in successful pregnancy upon implantation.
Disclosure of Invention
In one aspect, a method of identifying a chromosomal abnormality in an embryo is disclosed. The method includes receiving sample genomic sequence information obtained from an embryo, wherein the sample genomic sequence information consists of a plurality of genomic sequence reads. The sample genomic sequence information is aligned to a reference genome. The sample genomic sequence information is normalized relative to baseline genomic sequence information to correct the sample genomic sequence information for site effects and generate a normalized sample genomic sequence information dataset. One or more correction factors derived from regression analysis of error factors are applied to the normalized sample genomic sequence information dataset to correct technical effects and generate a denoised sample genomic sequence information dataset. A copy number variation is identified in the denoised sample genomic sequence information dataset when the frequency of genomic sequence reads aligned to a chromosomal location on the reference genome deviates from a frequency threshold.
In another aspect, a system for identifying a chromosomal abnormality in an embryo is disclosed. The system includes a data storage unit, a computing device, and a display communicatively coupled to each other.
The data storage unit is configured to store sample genomic sequence information obtained from the embryo. The computing device hosts a data denoising engine and an interpretation engine. The data denoising engine is configured to receive sample genomic sequence information from the data store, normalize the sample genomic sequence information against the baseline genomic sequence information to correct the sample genomic sequence information for site effects, and apply one or more correction factors derived from regression analysis of error factors to correct technical effects and generate a denoised sample genomic sequence information dataset. The interpretation engine is configured to identify copy number variations in the denoised sample genomic sequence information dataset when a frequency of genome sequence reads aligned with a chromosomal location in the denoised sample genomic sequence information dataset deviates from a frequency threshold.
The display is configured to display a report containing the identified copy number variation.
In another aspect, a method for identifying aneuploidy in an embryo is disclosed. The method includes receiving sample genomic sequence information obtained from an embryo, wherein the sample genomic sequence information consists of a plurality of genomic sequence reads. The sample genomic sequence information is aligned to a reference genome. The sample genomic sequence information is normalized relative to baseline genomic sequence information to correct the sample genomic sequence information for site effects and generate a normalized sample genomic sequence information dataset. One or more correction factors derived from regression analysis of error factors are applied to the normalized sample genomic sequence information dataset to correct technical effects and generate a denoised sample genomic sequence information dataset. The denoised sample genomic sequence information dataset is then analyzed using a trained neural network to classify the aneuploidy status of the embryo.
Drawings
For a more complete understanding of the principles disclosed herein and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Figs. 1A-1E depict visualization graphs of embryos with normal and abnormal chromosome conditions, in accordance with various embodiments.
Fig. 2 is an exemplary flow diagram illustrating a method for identifying chromosomal abnormalities, in accordance with various embodiments.
Figure 3 illustrates how read counts are normalized for site effects, in accordance with various embodiments.
Fig. 4 is a diagram illustrating an evaluation of similarity between a target sample and a baseline sample, in accordance with various embodiments.
Fig. 5 is a depiction of how a baseline vector is constructed from multiple baseline samples in a baseline set, in accordance with various embodiments.
Fig. 6A is a graph illustrating bin effect normalization of embryo data according to various embodiments.
Fig. 6B is a diagram illustrating real-time sample effect correction, in accordance with various embodiments.
Fig. 7 is a depiction of how the LOWESS technique may be used for GC correction, in accordance with various embodiments.
Fig. 8A-8B are graphs illustrating the technical impact of GC on interval fractions, according to various embodiments.
Fig. 9 is a schematic diagram of a system for identifying a chromosomal abnormality in an embryo, in accordance with various embodiments.
FIG. 10 is a block diagram illustrating a computer system in accordance with various embodiments.
Fig. 11 is an exemplary flow diagram illustrating a method for identifying aneuploidy in an embryo, in accordance with various embodiments.
Figure 12 is a depiction of a Hidden Markov Model (HMM) finite state machine topology, in accordance with various embodiments.
Fig. 13A-13B are diagrams showing de-noising and normalization of a deletion on chromosome 15, according to various embodiments.
Fig. 14 is a diagram depicting a method of determining complex embryonic aneuploidies using chromosome clusters, in accordance with various embodiments.
Fig. 15 is a depiction of a neural network that uses normalized and denoised interval data to predict complex aneuploidies in an embryo, in accordance with various embodiments.
Fig. 16 is a depiction of a feed-forward network structure in accordance with various embodiments.
Fig. 17 is a plot of the net change in the various ploidy classifications between the improved system and method disclosed herein (PGTai) and a conventional subjective calling method (provided by commercial software), in accordance with various embodiments.
It should be understood that the drawings are not necessarily drawn to scale, nor are the objects in the drawings necessarily drawn to scale relative to each other. The accompanying drawings are included to provide a further understanding of the various embodiments of the apparatus, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be understood that the drawings are not intended to limit the scope of the present teachings in any way.
Detailed Description
The present specification describes various exemplary embodiments of systems and methods for identifying chromosomal abnormalities in candidate embryos for implantation in in vitro fertilization. However, the present disclosure is not limited to these exemplary embodiments and applications, nor to the manner in which the exemplary embodiments and applications operate or are described herein. Further, the figures may show simplified or partial views, and the sizes of elements in the figures may be exaggerated or otherwise not in proportion. In addition, as the terms "on," "attached to," "connected to," "coupled to," or similar words are used herein, one element (e.g., a material, a layer, a substrate, etc.) can be "on," "attached to," "connected to," or "coupled to" another element regardless of whether the one element is directly on, attached to, connected to, or coupled to the other element or whether one or more intervening elements are present between them. Further, where a list of elements (e.g., elements a, b, c) is referred to, such reference is intended to include any one of the listed elements by itself, any combination of fewer than all of the listed elements, and/or a combination of all of the listed elements. The division of the description into sections is for ease of review only and does not limit any combination of the elements discussed.
Unless defined otherwise, scientific and technical terms used in connection with the present teachings described herein shall have the meanings that are commonly understood by one of ordinary skill in the art. Furthermore, unless the context requires otherwise, singular terms shall include the plural, and plural terms shall include the singular. Generally, the terminology and techniques associated with cell and tissue culture, molecular biology, and protein and oligonucleotide or polynucleotide chemistry and hybridization described herein are well known and commonly used in the art. Standard techniques, for example for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid and oligonucleotide synthesis, are used. Enzymatic reactions and purification techniques are performed according to the manufacturer's instructions or as commonly done in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification. See, for example, Sambrook et al, Molecular Cloning: A Laboratory Manual (third edition, Cold Spring Harbor Laboratory Press, New York, Cold Spring Harbor, 2000). The terminology used in connection with the description herein, and the laboratory procedures and techniques, are well known and commonly used in the art.
DNA (deoxyribonucleic acid) is a chain of nucleotides consisting of four types of nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine). RNA (ribonucleic acid) is likewise composed of four types of nucleotides: A, U (uracil), G, and C. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (referred to as complementary base pairing). That is, adenine (A) pairs with thymine (T) (in the case of RNA, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides complementary to those in the first strand, the two strands bind to form a double strand. The human reference genome represents one of these strands (referred to herein as strand 1). As used herein, the reverse complement of strand 1 is referred to as strand 2. As used herein, "nucleic acid sequencing data," "nucleic acid sequencing information," "nucleic acid sequence," "genomic sequence," "genetic sequence," "fragment sequence," or "nucleic acid sequencing read" refers to any information or data representing the order of nucleotide bases in a molecule of DNA or RNA (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.). It should be understood that the present teachings contemplate sequence information obtained using all of the various available techniques, platforms, or technologies, including but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide recognition systems, pyrosequencing, ion- or pH-based detection systems, electronic signature-based systems, and the like.
"Polynucleotide", "nucleic acid" or "oligonucleotide" refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. Typically, a polynucleotide comprises at least three nucleosides. Typically, oligonucleotides range in size from a few monomeric units, e.g., 3-4, to several hundred monomeric units. Whenever a polynucleotide (such as an oligonucleotide) is represented by a sequence of letters (e.g., "ATGCCTG"), it is understood that the nucleotides are in 5'->3' order from left to right and that, unless otherwise specified, "A" represents deoxyadenosine, "C" represents deoxycytidine, "G" represents deoxyguanosine, and "T" represents thymidine. The letters A, C, G and T can be used to refer to the base itself, to a nucleoside, or to a nucleotide comprising a base, as is standard in the art.
The phrase "next generation sequencing" (NGS) refers to sequencing techniques that have a higher throughput than traditional Sanger and capillary electrophoresis based methods, e.g., the ability to generate thousands of relatively small sequence reads at a time. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. More specifically, the MiSeq, HiSeq and NextSeq systems of Illumina, and the Personal Genome Machine (PGM) and SOLiD sequencing systems of Life Technologies Corp., are examples of such systems. The SOLiD System and related workflows, protocols, chemistries, etc. are described in more detail in PCT Application Publication No. WO 2006/084132, entitled "Reagents, Methods, and Libraries for Bead-Based Sequencing," international filing date February 1, 2006; U.S. patent application No. 12/873,190, entitled "Low-Volume Sequencing System and Method of Use," filed August 31, 2010; and U.S. patent application No. 12/873,132, entitled "Fast-Indexing Filter Wheel and Method of Use," filed August 31, 2010, the entire contents of each of which are incorporated herein by reference.
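The strand 1 / strand 2 relationship and the 5'->3' letter convention described above can be illustrated with a short sketch. The function name `reverse_complement` is hypothetical and is not taken from the present disclosure; the sketch handles DNA letters only and simply applies the A-T and C-G pairing rules to a sequence read in the opposite direction.

```python
# Complementary base pairing for DNA, as described above: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand_1: str) -> str:
    """Return strand 2, the reverse complement of strand 1.

    Both the input and the output are written in 5'->3' order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand_1.upper()))


strand_1 = "ATGCCTG"                      # example sequence from the text
strand_2 = reverse_complement(strand_1)   # "CAGGCAT"
```

Applying the function twice returns the original sequence, which is a convenient sanity check.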
The phrase "sequencing run" refers to any step or portion of a sequencing experiment performed to determine certain information about at least one biomolecule (e.g., a nucleic acid molecule).
As used herein, the phrase "genomic signature" may refer to a genomic region with some annotated function (e.g., a gene, protein coding sequence, mRNA, tRNA, rRNA, repeat sequence, inverted repeat, miRNA, siRNA, etc.) or to a genetic/genomic variant (e.g., single nucleotide polymorphism/variant, insertion/deletion sequence, copy number variation, inversion, etc.) that represents a single gene or a group of genes (DNA or RNA) that has undergone a change, due to mutation, recombination/crossover, or genetic drift, relative to a particular species or to a subpopulation within a particular species.
Genomic variants can be identified using a variety of techniques, including but not limited to: array-based methods (e.g., DNA microarrays, etc.), real-time/digital/quantitative PCR instrument methods, and complete or targeted nucleic acid sequencing systems (e.g., NGS systems, capillary electrophoresis systems, etc.). By nucleic acid sequencing, coverage data can be obtained at single base resolution.
The phrase "fragment library" refers to a collection of nucleic acid fragments, wherein one or more fragments serve as a sequencing template. A fragment library can be generated, for example, by cleaving or shearing larger nucleic acids into smaller fragments. Fragment libraries may be generated from naturally occurring nucleic acids, such as mammalian or bacterial nucleic acids. Libraries containing synthetic nucleic acid sequences of similar size can also be generated to create synthetic fragment libraries.
The phrase "chromosomal abnormality" or "chromosomal abnormalities" refers to structural (e.g., deletions, duplications, translocations, inversions, insertions, etc.) and numerical (i.e., aneuploidy) chromosomal disorders.
The phrase "chimeric embryo" (i.e., a mosaic embryo) refers to an embryo that contains two or more cytogenetically distinct cell lines. For example, a chimeric embryo may contain cell lines with different types of aneuploidy, or a mixture of euploid cells and genetically abnormal cells containing DNA with genetic variations that may be detrimental to the viability of the embryo during pregnancy.
In various embodiments, the sequence alignment method can align a fragment sequence with a reference sequence or with another fragment sequence. Fragment sequences may be obtained from a fragment library, a paired-end library, a mate-pair library, a concatenated fragment library, or another type of library that may be reflected or represented by nucleic acid sequence information, including, for example, RNA, DNA, and protein-based sequence information.
In general, the length of a fragment sequence may be substantially less than the length of a reference sequence. The fragment sequence and the reference sequence may each comprise a sequence of symbols. The alignment of the fragment sequence and the reference sequence may include a limited number of mismatches between the symbols of the fragment sequence and the symbols of the reference sequence. In general, a fragment sequence can be aligned with a portion of a reference sequence to minimize the number of mismatches between the fragment sequence and the reference sequence.
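As a minimal sketch of aligning a fragment to the portion of a reference that minimizes mismatches, the following brute-force scan counts symbol mismatches at every offset. It is illustrative only: the function name `best_alignment` and the `max_mismatches` parameter are assumptions, gaps are not considered, and practical aligners use indexed data structures rather than an exhaustive scan.

```python
def best_alignment(fragment, reference, max_mismatches):
    """Return (start, mismatches) for the offset with the fewest mismatches.

    Returns None when no offset has at most `max_mismatches` mismatches.
    Ties are resolved in favour of the left-most offset.
    """
    best = None
    for start in range(len(reference) - len(fragment) + 1):
        window = reference[start:start + len(fragment)]
        # Count positions where the fragment and reference symbols differ.
        mismatches = sum(a != b for a, b in zip(fragment, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


reference = "ACGTACGTTAGC"
exact_hit = best_alignment("CGTT", reference, max_mismatches=1)
near_hit = best_alignment("CGTC", reference, max_mismatches=1)
```

Here the fragment is much shorter than the reference, and the mismatch limit bounds how imperfect an accepted alignment may be.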
In particular embodiments, the symbols of the fragment sequence and the reference sequence may represent the composition of the biomolecule. For example, these symbols may correspond to the identity of nucleotides in a nucleic acid, such as RNA or DNA, or to the identity of amino acids in a protein. In some embodiments, the symbols may have direct correlation with these subcomponents of the biomolecule. For example, each symbol may represent a single base of a polynucleotide. In other embodiments, each symbol may represent two or more adjacent subcomponents of a biomolecule, such as two adjacent bases of a polynucleotide. Additionally, the symbols may represent overlapping sets of adjacent sub-components or different sets of adjacent sub-components. For example, where each symbol represents two adjacent bases of a polynucleotide, two adjacent symbols representing overlapping sets may correspond to three bases of a polynucleotide sequence, while two adjacent symbols representing different sets may represent a sequence of four bases. Further, symbols may correspond directly to a subcomponent (e.g., nucleotide), or they may correspond to a color call or other indirect measure of a subcomponent. For example, the symbol may correspond to the incorporation or non-incorporation of a particular nucleotide stream.
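The difference between overlapping and non-overlapping two-base symbols can be sketched as follows. The function name `dibase_symbols` is hypothetical; the sketch only shows how many bases two adjacent symbols span under each scheme and does not model any particular sequencing chemistry.

```python
def dibase_symbols(sequence, overlapping):
    """Split a base sequence into symbols of two adjacent bases.

    With overlapping=True consecutive symbols share one base; with
    overlapping=False they cover disjoint pairs (a trailing unpaired
    base is dropped).
    """
    step = 1 if overlapping else 2
    return [sequence[i:i + 2] for i in range(0, len(sequence) - 1, step)]


overlapping = dibase_symbols("ACGT", overlapping=True)    # AC, CG, GT
disjoint = dibase_symbols("ACGT", overlapping=False)      # AC, GT
```

The first two overlapping symbols (AC, CG) together cover three bases, whereas the two disjoint symbols (AC, GT) cover four.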
In various embodiments, a computer program product may include instructions for selecting a contiguous portion of a fragment sequence, and instructions for mapping the contiguous portion of the fragment sequence to the reference sequence using an approximate string matching method that produces at least one match of the contiguous portion to the reference sequence.
In various embodiments, a system for nucleic acid sequence analysis may include a data analysis unit. The data analysis unit may be configured to obtain a fragment sequence from a sequencing instrument, obtain a reference sequence, select contiguous portions of the fragment sequence, and map the contiguous portions of the fragment sequence to the reference sequence using an approximate string matching method that produces at least one match of the contiguous portions to the reference sequence.
As used herein, "substantially" means sufficient for the intended purpose. Thus, the term "substantially" allows for minor, insignificant variations from absolute or perfect states, dimensions, measurements, results, etc., as would be expected by one of ordinary skill in the art, but without significantly affecting overall performance. "substantially" when used in reference to a value or parameter or characteristic that may be expressed as a numerical value means within ten percent.
The term "plurality" means more than one.
As used herein, the term "plurality" may be 2, 3, 4, 5, 6, 7, 8, 9, 10, or greater.
As used herein, the term "cell" is used interchangeably with the term "biological cell". Non-limiting examples of biological cells include eukaryotic cells, plant cells, animal cells, such as mammalian cells, reptile cells, avian cells, fish cells, and the like, prokaryotic cells, bacterial cells, fungal cells, protozoan cells, and the like, cells dissociated from tissues such as muscle, cartilage, fat, skin, liver, lung, neural tissue, and the like, immune cells such as T cells, B cells, natural killer cells, macrophages, and the like, embryos (e.g., fertilized eggs), oocytes, egg cells, sperm cells, hybridomas, cultured cells, cells from cell lines, cancer cells, infected cells, transfected and/or transformed cells, reporter cells, and the like. Mammalian cells can be, for example, from humans, mice, rats, horses, goats, sheep, cattle, primates, and the like.
Conventional methods for processing NGS data to identify chromosomal abnormalities
Many clinical pipelines that use NGS data follow a similar initial workflow. First, the raw sequences generated by the sequencer are demultiplexed: when many samples are sequenced simultaneously, sequences from different subjects are tagged with barcodes, which are removed after the sequences have been assigned to their subjects. Adapters and other artificial features are then removed from the generated sequences. Sequences are typically assigned to genetic loci by computer programs that align or match the bases of the generated sequences to a known genomic reference sequence, and PCR duplicates and low quality sequences are typically eliminated during or shortly after the alignment process. Sequences that have been processed and matched to a genetic locus are often referred to as aligned sequences or aligned reads. The number of sequences generated from each target sample is commonly referred to as the "sequencing depth".
Illumina provides a commercial implementation of the conventional method for copy number variant (CNV) calling; it also smooths the data by taking a median over a sliding window of k neighboring bins.
CNVs are genomic alterations that result in an abnormal copy number of one or more genes and can lead to disease. The software generates charts that allow the user to visualize, analyze, and interpret genetic abnormalities.
Embryos with a normal number of chromosomes are euploid embryos. As shown in Fig. 1A, a euploid embryo is visualized in the graph as having two copies (on the y-axis of the graph) of each numbered chromosome (1-22) shown on the x-axis of the graph. For the sex chromosomes, female embryos have two copies of the X chromosome and no copy of the Y chromosome (as shown in Fig. 1A), while male embryos have one copy of the X chromosome and one copy of the Y chromosome.
On the other hand, embryos with an abnormal number of chromosomes are aneuploid embryos. Chromosomes with a gained copy (3 copies instead of the normal 2) are called trisomies, while chromosomes with a lost copy (1 copy instead of the normal 2) are called monosomies. Fig. 1B depicts a male aneuploid embryo with a monosomy. Two copies of chromosomes 1-14 and 16-22 are visible, while only one copy of chromosome 15 is visible (monosomy). There is also one copy each of the X and Y chromosomes, indicating that the embryo is male.
When only a part of a chromosome is abnormally gained or lost, it is called a duplication or a deletion, respectively. Fig. 1C depicts a male embryo with a deletion on chromosome 5. Two copies of chromosomes 1-4 and 6-22 are visible, while a portion of chromosome 5 is deleted. There is also one copy each of the X and Y chromosomes, indicating that the embryo is male.
An embryo that has both normal and abnormal cells for a particular chromosome is called a chimeric (mosaic) embryo. Visually, the chromosomal copy number of such an embryo lies between the normal copy number (2 copies) and the abnormal copy number (1 copy or 3 copies, depending on whether the abnormality is a monosomy or a trisomy). Fig. 1D depicts a male embryo that is chimeric for chromosome 16. Two copies of chromosomes 1-15 and 17-22 are visible, while chromosome 16 is chimeric (copy number 2.5). There is also one copy each of the X and Y chromosomes, indicating that the embryo is male.
There are significant limitations to the approach taken by this software. Interpretation of the data becomes more difficult as the noise (background) level of the data increases, which happens if the quality of the embryo biopsy is compromised, the DNA is degraded, or the library preparation itself is problematic. Higher noise levels make it difficult to determine which deviations from normal are true genetic abnormalities rather than problems with the DNA quality itself. As a result of these drawbacks, segmental or mosaic calls, or complex aneuploidy calls, must be made by a human technician by examining a graph of normalized interval scores. The subjectivity and uncertainty associated with human interpretation of images may lead to unnecessary variability in the analysis of embryos for chromosomal abnormalities. Fig. 1E depicts a male embryo with a high noise level, which makes it difficult for a human technician to interpret whether there is a true genetic abnormality in the embryo.
Automated machine interpretation method for processing NGS data to identify chromosomal abnormalities
Systems and methods are disclosed for automatically detecting chromosomal abnormalities, including segmental duplications/deletions, chimeric (mosaic) features, and complex aneuploidies. Conceptually, these systems and methods have two main stages: 1) denoising/normalization (removing noise from the raw sequence reads), and 2) interpretation (decoding the denoised/normalized signal into a karyotype graph and clinical aneuploidy calls).
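The barcode-based assignment step of this initial workflow can be sketched as follows. This is a simplified illustration under stated assumptions: barcodes are taken to be of equal length and located at the start of each read, the barcode-to-sample table is non-empty, and the names `demultiplex`, `embryo_1`, and `embryo_2` are hypothetical. Adapter trimming, alignment, and duplicate removal are not shown.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by leading barcode and strip the barcode.

    Assumes all barcodes share one length and sit at the 5' end of the
    read; reads with an unknown barcode are discarded.
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["ACGTTTGACC", "TGCAGGATCA", "ACGTCCCGGG", "NNNNAAAAAA"]
assigned = demultiplex(reads, {"ACGT": "embryo_1", "TGCA": "embryo_2"})
# Number of reads per sample, a crude stand-in for sequencing depth.
depth = {sample: len(seqs) for sample, seqs in assigned.items()}
```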
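A sliding-window median over k neighboring bins, as mentioned above, might be sketched as follows. This is not the commercial implementation; the function name `sliding_median`, the centered window, and the truncation of the window at the ends of the chromosome are all assumptions made for illustration.

```python
from statistics import median


def sliding_median(bin_scores, k):
    """Smooth per-bin scores with a centered median window of about k bins.

    Windows are truncated at the two ends of the input, so edge bins are
    smoothed over fewer neighbours.
    """
    half = k // 2
    smoothed = []
    for i in range(len(bin_scores)):
        lo = max(0, i - half)
        hi = min(len(bin_scores), i + half + 1)
        smoothed.append(median(bin_scores[lo:hi]))
    return smoothed


noisy = [2.0, 2.1, 5.0, 1.9, 2.0]   # one outlier bin at index 2
smooth = sliding_median(noisy, k=3)
```

The median suppresses the isolated outlier bin, but the same property can also blur short true copy number changes, which is one reason the window size matters.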
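The whole-chromosome categories described above for Figs. 1A-1D can be summarized in a small sketch. The `tolerance` value and the function name `describe_copy_number` are assumptions chosen for illustration and are not thresholds from the present disclosure; the sketch applies to chromosomes 1-22 only, since the normal copy numbers of X and Y depend on the sex of the embryo.

```python
def describe_copy_number(copy_number, tolerance=0.2):
    """Label an autosome by its estimated copy number.

    `tolerance` is a hypothetical margin around the integer states.
    """
    if abs(copy_number - 2) <= tolerance:
        return "euploid"
    if abs(copy_number - 1) <= tolerance:
        return "monosomy"
    if abs(copy_number - 3) <= tolerance:
        return "trisomy"
    if 1 < copy_number < 3:
        # Between the normal and abnormal states, as in Fig. 1D.
        return "mosaic"
    return "other"


labels = [describe_copy_number(cn) for cn in (2.0, 1.0, 3.0, 2.5)]
```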
Fig. 2 is an exemplary flow diagram illustrating a method 200 for automatically identifying a chromosomal abnormality in an embryo, in accordance with various embodiments. In step 202, sample genomic sequence information obtained from an embryo is received. The sample genomic information consists of a plurality of genomic sequence reads generated using various genomic sequencing techniques including NGS, PCR, and the like. In step 204, the sample genomic sequence information is aligned with the reference genome. In various embodiments, the reference genome is a human reference genome.
In step 206, the sample genomic sequence information is normalized relative to the baseline genomic sequence information to correct the sample genomic sequence information for site effects. A site effect is any property of a genomic position that correlates with changes in sequence coverage even when the copy number is unchanged. Examples of site effects include, but are not limited to: 1) the GC content within 50, 100, 150, etc. bases of the position, 2) the potential of the DNA around the genomic position to form secondary structures, 3) sequence similarity to other genomic positions, etc.
In various embodiments, normalizing the sample genomic sequence information for site effects involves first setting the interval size. In various embodiments, the interval size is set to 1 megabase (mb). However, it should be understood that the interval size may be set to any size, including 100 kb, 500 kb, or any other value between 100 and 20 million bases, as long as it does not exceed the length of the human genome. Next, the sample genomic sequence information and the baseline genomic sequence information are divided into a plurality of intervals based on the interval size. Then, the number of genomic sequence reads from the sample genomic sequence information that align to each of the plurality of sample genomic sequence information intervals is determined to generate a sample interval score for each of the plurality of sample genomic sequence information intervals.
Next, a number of genome sequence reads is determined from the baseline genome sequence information aligned with each of the plurality of baseline genome sequence information intervals to generate a baseline interval score for each of the plurality of baseline genome sequence information intervals. The sample interval score is then normalized to the baseline interval score to generate a normalized sample genomic sequence data set.
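The interval-scoring and normalization steps described above can be sketched for a single chromosome as follows. This is a simplified illustration: the function names are hypothetical, read positions are assumed to be 0-based alignment start coordinates, the interval score is assumed to be the fraction of reads falling in each interval, and normalization is assumed to be a simple ratio of sample score to baseline score.

```python
from collections import Counter


def interval_scores(read_positions, interval_size, n_intervals):
    """Fraction of reads whose start position falls in each interval."""
    if not read_positions:
        raise ValueError("no aligned reads")
    counts = Counter(pos // interval_size for pos in read_positions)
    total = len(read_positions)
    return [counts.get(i, 0) / total for i in range(n_intervals)]


def normalize_to_baseline(sample_scores, baseline_scores):
    """Divide each sample interval score by the matching baseline score.

    Intervals with a zero baseline score are reported as NaN.
    """
    return [s / b if b > 0 else float("nan")
            for s, b in zip(sample_scores, baseline_scores)]


sample = interval_scores([100, 1_500_000, 1_600_000, 2_100_000],
                         interval_size=1_000_000, n_intervals=3)
normalized = normalize_to_baseline(sample, [0.25, 0.25, 0.25])
```

In this toy input the middle interval holds twice the expected share of reads, so its normalized score is 2.0 while the flanking intervals remain at 1.0.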
In various embodiments, the baseline interval score is determined by first receiving a plurality of baseline genomic sequence information datasets obtained from euploid embryos. An interval score is then determined for each of the plurality of baseline genomic sequence information datasets. Next, a subset of the baseline genomic sequence information dataset having an interval score that exceeds a similarity threshold of the sample genomic sequence information is selected from the plurality of baseline genomic sequence information datasets. Finally, a baseline interval score is generated by determining a median of the interval scores in the selected subset of the baseline genomic information dataset.
In step 208, one or more correction factors derived from the regression analysis of the error factors are applied to correct the technical effect and generate a denoised sample genome sequence information dataset.
In step 210, a CNV is identified from the denoised sample genome sequence information dataset when the frequency of the genome sequence reads aligned to a chromosome position on the reference genome deviates from a frequency threshold.
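As a simple illustration of the thresholding idea in step 210, the sketch below labels each denoised interval score by comparing it with a lower and an upper bound. It assumes scores centered at 1.0 for two copies, so that one copy corresponds to about 0.5 and three copies to about 1.5; the bounds 0.75 and 1.25 and the function name are hypothetical and are not values taken from the disclosure.

```python
def flag_intervals(scores, lower=0.75, upper=1.25):
    """Label each interval score as 'loss', 'gain', or 'normal'.

    The bounds are illustrative midpoints between the expected scores for
    one, two, and three copies (0.5, 1.0, 1.5); they are assumptions.
    """
    labels = []
    for score in scores:
        if score < lower:
            labels.append("loss")
        elif score > upper:
            labels.append("gain")
        else:
            labels.append("normal")
    return labels
```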
Figs. 3-8B illustrate various aspects of the method 200. As shown in fig. 3, for each strand (strand 1 and strand 2 of the human genome as described above) and each interval, nx is defined as the interval count scaled by the total number of reads 302 aligned with the diploid chromosomes of the target sample on the same strand.
As shown in fig. 4, a first correction for gene locus (interval) effects can be made by normalizing the interval counts from the target sample relative to a baseline set of euploid samples. The interval size may first be set to 1 megabase 304. However, it should be understood that the interval size can be set to essentially any size, including 100 kb, 500 kb, or any other value between 1 and 20 million bases. Next, as shown in fig. 5, the sample genomic sequence information is segmented into multiple intervals, and then the best subset of baseline samples (rather than the entire baseline set) is selected to normalize for the interval effect, where the best subset is defined as the baseline samples whose nx is most similar to the target sample nx. The similarity measure is quantified as the correlation between nx for the baseline sample and nx for the target sample. In various embodiments, rank correlation may be used as the measure of similarity, although there are many other choices (such as MSE/squared residuals, Euclidean distance, or Mahalanobis distance).
Given the above method for calculating the similarity between the target sample and the baseline sample, the sample with the highest similarity to the target sample is selected from the baseline.
Given a set of similarity values s = {s1, s2, ..., sN} between each baseline sample and the target sample, where N is the number of baseline samples, the baseline samples with s > t are selected, where t is the g-th percentile of s. In various embodiments, the parameter g may be set to 90%, but may also be set to 10%, 30%, 50%, 80%, or any other number between 1 and 100. In addition to correcting for marginal interval effects on gene locus counts, this also corrects for distant intervals with correlated scores, where coverage in one interval is informative of coverage in another. After the best subset of baseline samples is selected, the interval scores of the target sample are normalized by the median normalized interval score of the baseline subset. Normalization can be performed by division, the result being a vector of interval scores centered at 1.0.
One of the benefits of these methods for correcting gene locus effects is that samples accumulate as they are run and the euploid samples inform future normalizations, making the normalized interval scores less noisy and the overall system more accurate over time.
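The baseline-subset selection described above can be sketched roughly as follows: compute a rank correlation between the target sample and every baseline sample, keep the baseline samples whose similarity exceeds the g-th percentile, and divide the target by the per-interval median of the kept samples. This is a minimal sketch; the function names, the linear-interpolation percentile, and the fallback used when no sample exceeds the threshold are assumptions made for illustration.

```python
from statistics import median


def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def rank_correlation(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def percentile(values, g):
    """g-th percentile by linear interpolation between order statistics."""
    v = sorted(values)
    pos = (len(v) - 1) * g / 100
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)


def normalize_with_best_baselines(target, baselines, g=90):
    """Normalize target interval scores by the median of the most similar baselines."""
    s = [rank_correlation(target, b) for b in baselines]
    t = percentile(s, g)
    chosen = [b for b, sim in zip(baselines, s) if sim > t]
    if not chosen:  # assumption: fall back to the most similar sample(s)
        best = max(s)
        chosen = [b for b, sim in zip(baselines, s) if sim == best]
    reference = [median(column) for column in zip(*chosen)]
    return [x / r for x, r in zip(target, reference)], chosen
```

Other similarity measures mentioned in the text, such as Euclidean distance, could be substituted for `rank_correlation` without changing the rest of the sketch.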
Biological processes that are specific to the state of the target sample at the time it is sequenced (i.e., real-time sample effects), such as gene expression or regulation, can also potentially affect the availability of the genome during the sequencing process, but they can be corrected. One result of these real-time effects is signal attenuation on individual strands. A locally weighted scatterplot smoothing (LOWESS) estimator can be used to derive a correction to the interval signal for a particular strand as a function of r, the ratio of interval counts on the forward strand. The interval score for a particular strand can then be normalized (divided) by this correction factor. As shown in figs. 6A and 6B, LOWESS calculates a correction factor 602 at each value of r by evaluating a low-order polynomial fit centered on r that uses only the subset of data points (r, interval_score) with values closest to r.
As described above, gene locus-specific concentrations of C and G bases and other technical effects (e.g., amplification bias, secondary structure, nucleosome density, miRNA blockade, gene expression, etc.) can affect sequence counts in an interval; however, the correction of gene locus effects described above does not take into account the differential response of each sample to these technical effects. Many technical effects are relevant to this sample-interaction correction. As shown in fig. 7, LOWESS can also be used to correct for the effects of GC content. LOWESS can be used to define a correction for each level of a technical effect, and the interval score is normalized by this factor (by subtraction). As shown in figs. 8A and 8B, LOWESS calculates the correction for each value p of the GC percentage by evaluating a low-order polynomial fit centered on p that uses only the subset of data points (gc, interval_score) with GC values closest to p.
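The local fitting idea behind the GC correction can be illustrated with the sketch below. It is a simplified single-pass local linear regression with tricube weights, without the robustness iterations of a full LOWESS implementation, and it is offered only as an approximation of the technique. The function names, the `frac` parameter, and the choice to define the correction as the fitted value minus the median score (so that subtraction preserves the centering) are assumptions.

```python
from statistics import median


def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_fit(x, y, frac=0.5):
    """Fitted value at each x[i] from a weighted line through its nearest points."""
    n = len(x)
    k = max(2, min(n, round(frac * n)))  # number of neighbours used per fit
    fitted = []
    for x0 in x:
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        h = max(abs(x[i] - x0) for i in nearest) or 1.0  # local bandwidth
        w = [tricube((x[i] - x0) / h) for i in nearest]
        sw = sum(w)  # always positive: the point x0 itself has weight 1
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(scores, gc_fraction, frac=0.5):
    """Subtract the GC-dependent trend from the interval scores."""
    fit = local_linear_fit(gc_fraction, scores, frac)
    centre = median(scores)
    return [s - (f - centre) for s, f in zip(scores, fit)]
```

The strand correction described above would follow the same pattern with r in place of the GC fraction and division in place of subtraction.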
Fig. 9 is a schematic diagram of a system for identifying a chromosomal abnormality in an embryo, in accordance with various embodiments. The system 900 includes a sequencer 902, a computing device/analytics server 904, and a display 912.
The sequencer 902 is communicatively connected to a computing device/analytics server 904. In various embodiments, the computing device 904 may be communicatively connected to the genome sequencer 902 via a network connection, which may be a "hardwired" physical network connection (e.g., the internet, a LAN, a WAN, a VPN, etc.) or a wireless network connection (e.g., Wi-Fi, WLAN, etc.). In various embodiments, the computing device 904 may be a workstation, a mainframe computer, a distributed computing node ("cloud computing" or part of a distributed networked system), a personal computer, a mobile device, and so forth. In various embodiments, the genome sequencer 902 can be a nucleic acid sequencer (e.g., NGS, capillary electrophoresis system, etc.), a real-time/digital/quantitative PCR instrument, a microarray scanner, or the like. However, it should be understood that the genome sequencer 902 can be essentially any type of instrument that can generate nucleic acid sequence data from a sample containing genomic fragments.
One skilled in the art will recognize that various embodiments of the genome sequencer 902 can be used to practice a variety of sequencing methods, including ligation-based methods, sequencing by synthesis, single molecule methods, nanopore sequencing, and other sequencing techniques. Ligation sequencing may involve a single ligation technique, or a variation of a ligation technique in which multiple ligations are performed sequentially on a single primary nucleic acid sequence strand. Sequencing by synthesis may include incorporation of dye-labeled nucleotides, chain termination, ion/proton sequencing, pyrophosphate sequencing, and the like. Single molecule techniques may include continuous sequencing, wherein the identity of the nucleotide is determined during incorporation without interrupting or delaying the sequencing reaction, or staggered sequencing, wherein the sequencing reaction is paused to determine the identity of the incorporated nucleotide.
In various embodiments, the genome sequencer 902 can determine the sequence of a nucleic acid (such as a polynucleotide or oligonucleotide). Nucleic acids may comprise DNA or RNA, and may be single-stranded (such as ssDNA and RNA) or double-stranded (e.g., dsDNA or RNA/cDNA pairs). In various embodiments, the nucleic acid may include or be derived from a fragment library, a mate pair library, chromatin immunoprecipitation (ChIP) fragments, and the like. In particular embodiments, the genome sequencer 902 can obtain sequence information from a single nucleic acid molecule or from a set of substantially identical nucleic acid molecules.
In various embodiments, the genome sequencer 902 may output data (genome sequence information) of nucleic acid sequencing reads in a variety of different output data file types/formats, including but not limited to: *.fasta, *.csfasta, *.xsq, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.
In various embodiments, the sequencer 902 also includes a data store configured to store sample genomic sequencing information generated by the sequencer 902 during a sample run.
Computing device/analytics server 904 may be configured to host a data denoising engine 906, an Artificial Intelligence (AI)/Machine Learning (ML) driven interpretation engine 908, and an AI/ML driven aneuploidy recognition engine 910.
The data denoising engine 906 can be configured to receive sample genomic sequence information from the sequencer 902 (or a data store associated with the sequencer 902), normalize the sample genomic sequence information relative to baseline genomic sequence information to correct the sample genomic sequence information for gene locus effects, and apply one or more correction factors derived from regression analysis of error factors to correct for technical effects and generate a denoised sample genomic sequence information dataset.
The AI/ML driven interpretation engine 908 can be configured to identify copy number variations in the denoised sample genomic sequence information dataset when the gene sequence read frequencies aligned with the chromosomal locations in the denoised sample genomic sequence information deviate from a frequency threshold.
The AI/ML driven sex aneuploidy recognition engine 910 may be configured to utilize a trained neural network to analyze the denoised sample genomic sequence information dataset and classify the sex aneuploidy status of the embryo.
After the chromosomal abnormality has been identified, the results may be displayed on a display or client terminal 912 communicatively connected to the computing device 904. In various embodiments, the client terminal 912 may be a thin client computing device. In various embodiments, client terminal 912 may be a web browser (e.g., INTERNET EXPLORER™, FIREFOX™, SAFARI™, etc.) that may be used to control the data denoising engine 906, the Artificial Intelligence (AI)/Machine Learning (ML) driven interpretation engine 908, and/or the AI/ML driven sex aneuploidy recognition engine 910.
Interpretation
When interval normalization and denoising are complete, the interval scores will be centered at 1.0 (representing copy number state 2). The gene locus scores can then be interpreted (or decoded) into karyotype charts and clinical aneuploidy calls using machine learning and "artificial intelligence" methods.
As shown in fig. 12, Hidden Markov Models (HMMs) are a family of machine learning techniques that are common in speech recognition and signal processing. For each chromosome, a finite state machine is constructed whose emission and transition probabilities are parameterized by the characteristics of the input data and the resolution required by the user.
At each chromosome location j, the model has a plurality of states, each state representing a fractional copy number. The initial states all have the same probability, and as the model proceeds to the next genomic interval, the transitions between states are defined by duration modeling in which regions average greater than or equal to 3 megabases (a configurable parameter), so that at a 1-megabase interval size the probability of remaining in a non-2.0 copy number state is 1/3 and all other transitions are equally probable. The scores emitted by each state follow a normal distribution (other distributions are within the scope of the invention), where the standard deviation is estimated from the interval scores and the mean for copy number state k is (k × res)/2.0, where res is the defined resolution (0.01 by default). The process of assigning intervals to copy numbers given an HMM is called decoding, which is performed using the forward-backward algorithm, a standard method for assigning to each observation its probability of membership in each state. Other decoding algorithms (e.g., Viterbi) may also be used. The initial decoding by the forward-backward algorithm defines the probability that each interval is in each state, and each interval is thereby assigned to one copy number state.
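The decoding step can be illustrated with the following sketch of a scaled forward-backward pass over Gaussian emissions. It is a deliberately reduced example, not the disclosed model: it uses only three copy number states instead of states spaced at the resolution res, a single standard deviation `sd` supplied by the caller, and a hypothetical self-transition probability `stay` that is not a value taken from the text. The emission mean for a copy number is that copy number divided by 2.0, as described above.

```python
import math


def gaussian(x, mean, sd):
    """Normal density, floored to avoid exact zeros for extreme scores."""
    density = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return max(density, 1e-300)


def forward_backward(scores, means, sd, stay=0.9):
    """Posterior probability of each state at each interval."""
    n, k = len(scores), len(means)
    move = (1 - stay) / (k - 1)  # all other transitions equally probable
    trans = [[stay if i == j else move for j in range(k)] for i in range(k)]
    emit = [[gaussian(x, m, sd) for m in means] for x in scores]

    fwd = []
    for t in range(n):
        if t == 0:
            cur = [emit[0][j] / k for j in range(k)]  # uniform initial states
        else:
            cur = [
                emit[t][j] * sum(fwd[-1][i] * trans[i][j] for i in range(k))
                for j in range(k)
            ]
        z = sum(cur)
        fwd.append([c / z for c in cur])  # rescale to avoid underflow

    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        cur = [
            sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(k))
            for i in range(k)
        ]
        z = sum(cur)
        bwd[t] = [c / z for c in cur]

    posteriors = []
    for f, b in zip(fwd, bwd):
        joint = [fi * bi for fi, bi in zip(f, b)]
        z = sum(joint)
        posteriors.append([p / z for p in joint])
    return posteriors


def decode(scores, copy_numbers=(1.0, 2.0, 3.0), sd=0.1, stay=0.9):
    """Assign each interval the copy number with the highest posterior."""
    means = [cn / 2.0 for cn in copy_numbers]
    posteriors = forward_backward(scores, means, sd, stay)
    return [copy_numbers[max(range(len(p)), key=p.__getitem__)] for p in posteriors]
```

In practice `sd` would be estimated from the interval scores of the target sample, which is what allows lower-variance samples to be resolved more finely.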
In various embodiments, the systems and methods disclosed herein may accommodate heterogeneity in the data. In the "BLUE FUSE" method described above, all samples at all loci are assumed to have a constant variance (default value of 0.33). As disclosed herein, by default the HMM is parameterized by the dynamically calculated variance of the target sample, which allows samples with lower variance (typically samples with higher sequencing depth or DNA quality) to be resolved at higher resolution and controls the number of false positive non-diploid assignments for more variable samples (typically samples with lower sequencing depth or DNA quality).
In various embodiments, the systems and methods disclosed herein use machine learning to assign copy numbers to gene loci so that heterogeneity and heteroscedasticity in the data can be taken into account. For example, as shown in figs. 13A to 13B, although the normalized and denoised interval scores have a constant center, they have different spreads, or standard deviations. In particular, fig. 13A depicts a karyotype chart showing a deletion on chromosome 15. The denoised and normalized interval scores 1306 are more tightly distributed around the decoded copy number line 1302. Fig. 13B depicts a karyotype chart showing the non-constant variance of the normalized interval scores 1304 relative to the non-normalized interval scores 1308 for a subset of baseline-normalized embryo samples. HMMs can be operated in a heteroscedastic fashion to accommodate gene locus-specific variation.
Various other non-HMM methods exist, such as circular binary segmentation, greedy algorithms, and other methods that may be used to assign copy number states, and these remain within the scope of this disclosure.
In various embodiments, the systems and methods disclosed herein have the ability to accurately determine the presence of a complex aneuploidy in an embryo. Methods such as those discussed above cannot automatically provide complex aneuploidy calls such as 47:XXY (sex aneuploidy), 47:XXX (sex aneuploidy), 69:XXY (triploidy), or 69:XYY (triploidy).
Fig. 14 is a diagram depicting a method of determining complex embryonic aneuploidies using chromosome clusters, in accordance with various embodiments. This method assigns aneuploidy states using a machine learning classification method, such as a k-nearest neighbor algorithm with Mahalanobis statistical distance, applied to a vector comprising: {ratio of sequences aligned with X, interval-normalized chromosome X score, ratio of sequences aligned with Y, interval-normalized chromosome Y score}.
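A k-nearest neighbor classification with Mahalanobis distance can be sketched as follows. To keep the example self-contained, it uses only two features per embryo (a chromosome X score and a chromosome Y score) so that the covariance matrix can be inverted in closed form; the four-element vector described above would require a general matrix inverse. The function names, the pooled covariance estimate, and the shape of the training data are hypothetical.

```python
from collections import Counter


def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be positive (points not collinear)
    dx, dy = a[0] - b[0], a[1] - b[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def knn_classify(query, training, k=3):
    """Majority label among the k training points nearest to the query.

    `training` is a list of ((x_score, y_score), label) pairs.
    """
    cov = covariance_2d([features for features, _ in training])
    nearest = sorted(
        training, key=lambda item: mahalanobis_sq(query, item[0], cov)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Using Mahalanobis rather than Euclidean distance means that features with larger variance, or features that are correlated with each other, do not dominate the neighbor search.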
In various embodiments, the systems and methods disclosed herein may also utilize neural network methods and other "artificial intelligence" methods. That is, interval scores throughout the genome can be processed using a neural network multilayer perceptron approach to predict aneuploidy status.
In various embodiments, the neural network topology 1500 is a feed-forward network that takes as input all or some of the interval scores across the genome and includes two hidden layers comprising four nodes 1502 and two nodes 1504, respectively, and a complex aneuploidy result/call 1506, as shown in fig. 15. Back propagation can then be used to fit the neural network weights on a set of training data with known embryonic aneuploidy states.
Fig. 16 is an illustration of a feed-forward network structure in accordance with various embodiments. In various embodiments, the input to the network (input layer) is a subset of the normalized interval scores produced as described in the denoising and normalization description above, or by a similar process. By default, all normalized intervals on the X and Y chromosomes, as well as on all autosomes (chromosomes 1-22 of the human genome), are used. In various embodiments, subsets of chromosomes or of chromosome intervals may also be used, for example intervals determined by inspection, or estimated by a process, to be more important for sex determination.
The hidden layer of the network is located between the input and the output. In various embodiments, a neural network for identifying complexity aneuploidies in an embryo includes two hidden layers, where a first hidden layer includes four nodes, a second hidden layer includes two nodes, and each layer has an additional bias node. However, it should be understood that a different number of hidden layers with different nodes may be used, depending on the requirements of a particular application.
The final output layer has one node for each possible result (in this case, one node for each gender state).
The structure of each non-input node may be a standard perceptron, where the output is a non-linear "activation function" of the inputs. By default, the activation function may be a rectified linear unit (ReLU), although ELU, sigmoid, arctan, step, softmax, and many other activation functions may be used within the scope of the present disclosure.
With ReLU activation, the output f for an input x at a given node is f(x) = max(0, x).
However, it should be understood that many other types of neural networks may be applied within the scope of the present disclosure, such as convolutional neural networks (with additional pooling and convolutional layers), recurrent neural networks (where nodes have connections to previous nodes), etc.
One of the unique advantages of the systems and methods disclosed herein is that previously run samples and interpretations can be accumulated to inform future decoding, which can help train the systems and methods to become more accurate over time. In various embodiments of the systems and methods disclosed herein, knowledge of features and/or translocations in the parental samples can also be incorporated into learning, allowing for the detection of small translocations.
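The forward pass of the feed-forward topology described above (two hidden layers of four and two nodes with ReLU activation, and one output node per possible result) might be sketched as follows. This is an illustrative sketch only: the weights are randomly initialized rather than trained by back propagation, and the softmax output, the function names, and the initialization range are assumptions.

```python
import math
import random


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer; each weight row feeds one output node."""
    out = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    return [activation(v) for v in out] if activation else out


def softmax(values):
    """Convert raw output scores into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer (untrained placeholder)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(interval_scores, layers):
    """Input -> 4-node hidden layer -> 2-node hidden layer -> output probabilities."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = dense(interval_scores, w1, b1, relu)
    h2 = dense(h1, w2, b2, relu)
    return softmax(dense(h2, w3, b3))
```

A network for n input intervals and m possible results would be built with `init_layer(n, 4, rng)`, `init_layer(4, 2, rng)`, and `init_layer(2, m, rng)`, with the bias terms playing the role of the additional bias node in each layer.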
Fig. 11 is an exemplary flow diagram illustrating a method 1100 for identifying a sex aneuploidy in an embryo, in accordance with various embodiments.
In step 1102, sample genomic sequence information obtained from an embryo is received. The sample genomic information consists of a plurality of genomic sequence reads generated using various genomic sequencing techniques including NGS, PCR, and the like. In step 1104, the sample genomic sequence information is aligned to a reference genome. In various embodiments, the reference genome is a human reference genome.
In step 1106, the sample genomic sequence information is normalized with respect to the baseline genomic sequence information to correct the sample genomic sequence information for gene locus effects.
In various embodiments, normalizing the sample genomic sequence information for gene locus effects involves first setting an interval size. In various embodiments, the interval size is set to 1 megabase (mb). However, it should be understood that the interval size may be set to any size, including 100 kb, 500 kb, or any other value between 100 and 20 million bases, as long as it does not exceed the length of the human genome. Next, the sample genome sequence information and the baseline genome sequence information are partitioned into a plurality of intervals based on the selected interval size. Then, the number of genome sequence reads from the sample genome sequence information that align with each of the plurality of sample genome sequence information intervals is determined to generate a sample interval score for each of the plurality of sample genome sequence information intervals.
Next, a number of genome sequence reads is determined from the baseline genome sequence information aligned with each of the plurality of baseline genome sequence information intervals to generate a baseline interval score for each of the plurality of baseline genome sequence information intervals. The sample interval score is then normalized to the baseline interval score to generate a normalized sample genomic sequence data set.
In various embodiments, the baseline interval score is determined by first receiving a plurality of baseline genomic sequence information datasets obtained from euploid embryos. An interval score is then determined for each of the plurality of baseline genomic sequence information datasets. Next, a subset of the baseline genomic sequence information datasets having an interval score that exceeds a similarity threshold of the sample genomic sequence information is selected from the plurality of baseline genomic sequence information datasets. Finally, a baseline interval score is generated by determining a median of the interval scores in the selected subset of the baseline genomic information dataset.
In step 1108, one or more correction factors derived from the regression analysis of the error factors are applied to correct the technical effect and generate a denoised sample genome sequence information dataset.
In step 1110, the denoised sample genome sequence information dataset may be analyzed using a trained neural network algorithm/technique to classify the complex aneuploidy state of the embryo.
Computer-implemented system
In various embodiments, the method for identifying chromosomal abnormalities in an embryo may be implemented by computer software or hardware. That is, as shown in fig. 9, the method can be implemented on a computing device/system 904, the computing device/system 904 including a data denoising engine 906, an Artificial Intelligence (AI)/Machine Learning (ML) driven interpretation engine 908, and an AI/ML driven sex aneuploidy recognition engine 910. In various embodiments, the computing device/system 904 may be communicatively connected to the NGS sequencer 902 and the display device 912 via a direct connection or through an internet connection.
It should be understood that the various engines shown in FIG. 9 may be combined or collapsed into a single engine, component, or module depending on the requirements of a particular application or system architecture. Further, in various embodiments, the data denoising engine 906, the Artificial Intelligence (AI)/Machine Learning (ML) -driven interpretation engine 908, and the AI/ML-driven aneuploidy recognition engine 910 may include other engines or components as required by a particular application or system architecture.
FIG. 10 is a block diagram that illustrates a computer system 1000 upon which an embodiment of the present teachings may be implemented. In various embodiments of the present teachings, computer system 1000 may include a bus 1002 or other communication mechanism for communicating information, and a processor 1004 coupled with bus 1002 for processing information. In various embodiments, computer system 1000 may also include a memory, which may be a Random Access Memory (RAM) 1006 or other dynamic storage device, coupled to bus 1002 for storing instructions to be executed by processor 1004. Memory may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1004. In various embodiments, computer system 1000 may also include a Read Only Memory (ROM) 1008 or other static storage device coupled to bus 1002 for storing static information and instructions for processor 1004. A storage device 1010, such as a magnetic disk or optical disk, may be provided and coupled to bus 1002 for storing information and instructions.
In various embodiments, computer system 1000 may be coupled via bus 1002 to a display 1012, such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), for displaying information to a computer user. An input device 1014, including alphanumeric and other keys, may be coupled to bus 1002 for communicating information and command selections to processor 1004. Another type of user input device is cursor control 1016, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1004 and for controlling cursor movement on display 1012. This input device 1014 typically has two degrees of freedom in two axes, a first axis (i.e., x) and a second axis (i.e., y), that allows the device to specify positions in a plane. However, it should be understood that input devices 1014 that allow 3-dimensional (x, y, and z) cursor movement are also contemplated herein.
Consistent with certain embodiments of the present teachings, computer system 1000 may provide results in response to processor 1004 executing one or more sequences of one or more instructions contained in memory 1006. Such instructions may be read into memory 1006 from another computer-readable medium or computer-readable storage medium, such as storage device 1010. Execution of the sequences of instructions contained in memory 1006 may cause processor 1004 to perform processes described herein. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement the teachings. Thus, implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.
The term "computer-readable medium" (e.g., data store, etc.) or "computer-readable storage medium" as used herein refers to any medium that participates in providing instructions to processor 1004 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media may include, but are not limited to, optical, solid state, or magnetic disks, such as storage device 1010. Examples of volatile media may include, but are not limited to, dynamic memory, such as memory 1006. Examples of transmission media may include, but are not limited to, coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1002.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.
In addition to computer-readable media, instructions or data may also be provided as signals on transmission media included in a communication device or system to provide one or more sequences of instructions to the processor 1004 of the computer system 1000 for execution. For example, the communication device may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the disclosure herein. Representative examples of data communication transmission connections may include, but are not limited to, telephone modem connections, Wide Area Networks (WANs), Local Area Networks (LANs), infrared data connections, NFC connections, and the like.
It should be understood that the flow diagrams, figures, and accompanying disclosed methods described herein may be implemented using the computer system 1000 as a standalone device or over a distributed network of shared computer processing resources, such as a cloud computing network.
Results of the experiment
The improved systems and methods disclosed herein are compared to conventional methods of identifying chromosomal abnormalities in embryos in order to quantify the improvement in the overall accuracy of ploidy classification.
FIG. 17 is a plot depicting the net change in the various ploidy classifications when the improved systems and methods of the present disclosure (PGTai) are compared with conventional subjective calling methods (e.g., software provided by a third party). Approximately 20,000 embryos were analyzed and classified over a six-month period using the systems and methods described herein (i.e., PGTai). The classification rates were compared with those of a control embryo population interpreted by the conventional subjective method. The classification rates were then evaluated by relative comparison, noting the overall classification rate achieved by the novel systems and methods disclosed herein versus the classification rate achieved by conventional means. For example, if the novel systems and methods disclosed herein classify 46% of embryos as euploid, whereas the conventional methodology yields a euploid rate of 41% for a population of the same origin through conventional subjective interpretation, this is expressed as +5%. As previously mentioned, subjective interpretation, particularly in the presence of unattenuated noise, is prone to error. In particular, the presence of noise or an exceptionally low signal-to-noise ratio leads to over-interpretation. In this case, over-interpretation takes the form of false positive classification. As an example, in embryonic genetics, this can manifest as interpreting true euploids as mosaics, or true mosaics as aneuploids. As shown in fig. 17, when a total of about 40,000 embryos were analyzed (20,000 by the systems and methods disclosed herein, 20,000 by conventional subjective methods), a material decrease in the aneuploid and mosaic rates was observed, along with a material increase in the euploid classification rate.
Given that the samples were processed in the same laboratory and obtained from the same clinical centers, with only the method of data analysis differing, these results indicate that the improved denoising process described herein reduces false calls caused by over-interpretation of noise.
The methods described herein may be implemented in various ways depending on the application. For example, the methods may be implemented in hardware, firmware, software, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
In various embodiments, the methods of the present teachings may be implemented as firmware and/or software programs and applications written in conventional programming languages, such as C, C++, Python, and the like. If implemented as firmware and/or software, the embodiments described herein may be implemented on a non-transitory computer-readable medium having stored therein a program for causing a computer to perform the above-described methods. It should be understood that the various engines described herein can be provided on a computer system, such as computer system 1000, whereby the processor 1004 will perform the analysis and determination provided by these engines on instructions provided by any one or combination of the storage components 1006/1008/1010 or user input provided through the input device 1014.
While the present teachings are described in conjunction with various embodiments, there is no intent to limit the present teachings to such embodiments. On the contrary, it is intended that the present teachings encompass various alternatives, modifications, and equivalents as will be appreciated by those skilled in the art.
In describing various embodiments, the specification may have presented a method and/or process as a particular sequence of steps. To the extent, however, that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. As one of ordinary skill in the art will appreciate, other sequences of steps are possible. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Additionally, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the various embodiments.

Claims (28)

1. A method of identifying chromosomal abnormalities in an embryo, comprising:
receiving sample genomic sequence information obtained from an embryo, wherein the sample genomic sequence information comprises a plurality of genomic sequence reads;
aligning the sample genomic sequence information to a reference genome;
normalizing the sample genomic sequence information relative to baseline genomic sequence information to correct the sample genomic sequence information for site effects and generate a normalized sample genomic sequence information dataset;
applying one or more correction factors from regression analysis of error factors to the normalized sample genomic sequence information dataset to correct for technical effects and generate a denoised sample genomic sequence information dataset; and
identifying copy number variations in the denoised sample genomic sequence information dataset when the frequency of genomic sequence reads aligned to a chromosomal location on the reference genome deviates from a frequency threshold.
2. The method of claim 1, further comprising:
generating a karyotype map or molecular karyotype from the denoised sample genomic sequence information dataset.
3. The method of claim 1, wherein normalizing the sample genomic sequence information for site effects further comprises:
setting an interval size;
dividing the sample genomic sequence information and the baseline genomic sequence information into a plurality of intervals according to the interval size;
determining a number of genomic sequence reads from the sample genomic sequence information aligned with each of a plurality of sample genomic sequence information intervals to generate a sample interval score for each of the plurality of sample genomic sequence information intervals;
determining a number of genomic sequence reads from the baseline genomic sequence information aligned with each of a plurality of baseline genomic sequence information intervals to generate a baseline interval score for each of the plurality of baseline genomic sequence information intervals;
normalizing the sample interval score relative to the baseline interval score; and
generating the normalized sample genomic sequence information dataset.
4. The method of claim 3, further comprising:
receiving a plurality of baseline genomic sequence information datasets obtained from euploid embryos;
determining an interval score for each of the plurality of baseline genomic sequence information datasets;
selecting a subset of baseline genomic sequence information datasets from the plurality of baseline genomic sequence information datasets, the datasets in the subset having interval scores that exceed a threshold of similarity to the sample genomic sequence information; and
generating the baseline interval score by determining a median of the interval scores in the selected subset of baseline genomic sequence information datasets.
5. The method of claim 4, further comprising:
calculating a similarity value for each of the plurality of baseline genomic sequence information data sets, wherein the similarity value is a measure of the similarity of each baseline genomic sequence information data set to the sample genomic sequence information.
6. The method of claim 4, wherein the similarity value is determined using Euclidean distance analysis.
7. The method of claim 4, wherein the similarity value is determined using a Mahalanobis distance analysis.
8. The method of claim 4, wherein the similarity value is a percentage of similarity between the baseline genomic sequence information dataset and the sample genomic sequence information.
9. The method of claim 1, wherein correcting the sample genomic sequence information for sampling effects further comprises:
calculating the one or more correction factors using a locally weighted scatterplot smoothing (LOWESS) regression analysis.
10. The method of claim 1, wherein the error factor is related to GC content.
11. The method of claim 1, wherein the error factor is related to an amplification bias.
12. The method of claim 1, wherein the error factor is related to secondary structure.
13. The method of claim 1, wherein the error factor is related to nucleosome density.
14. The method of claim 1, wherein the error factor is associated with miRNA blocking.
15. The method of claim 1, wherein the error factor is associated with gene expression.
16. A system for identifying chromosomal abnormalities in an embryo, comprising:
a data storage unit configured to store sample genomic sequence information obtained from an embryo;
a computing device communicatively connected to the data storage unit, the computing device comprising:
a data de-noising engine configured to receive the sample genomic sequence information from a data store, normalize the sample genomic sequence information relative to baseline genomic sequence information to correct the sample genomic sequence information for site effects, and apply one or more correction factors derived from regression analysis of error factors to correct technical effects and generate a de-noised sample genomic sequence information dataset, and
an interpretation engine configured to identify copy number variations in the denoised sample genomic sequence information dataset when the frequency of genomic sequence reads aligned to a chromosomal location in the denoised sample genomic sequence information dataset deviates from a frequency threshold; and
a display communicatively connected to the computing device and configured to display a report containing the identified copy number variation.
17. The system of claim 16, wherein the error factor is related to GC content.
18. The system of claim 16, wherein the error factor is related to an amplification bias.
19. The system of claim 16, wherein the error factor is related to secondary structure.
20. The system of claim 16, wherein the error factor is related to nucleosome density.
21. The system of claim 16, wherein the error factor is associated with miRNA blocking.
22. The system of claim 16, wherein the error factor is associated with gene expression.
23. The system of claim 16, wherein the computing device further comprises:
a sex aneuploidy recognition engine configured to analyze the de-noised sample genomic sequence information dataset with a trained neural network to classify a sex aneuploidy status of the embryo.
24. A method of identifying embryonic aneuploidy, comprising:
receiving sample genomic sequence information obtained from an embryo, wherein the sample genomic sequence information comprises a plurality of genomic sequence reads;
aligning the sample genomic sequence information to a reference genome;
normalizing the sample genomic sequence information relative to baseline genomic sequence information to correct the sample genomic sequence information for site effects and generate a normalized sample genomic sequence information dataset;
applying one or more correction factors from regression analysis of error factors to the normalized sample genomic sequence information dataset to correct for technical effects and generate a denoised sample genomic sequence information dataset; and
analyzing the denoised sample genomic sequence information dataset using a trained neural network to classify a sex aneuploidy status of the embryo.
25. The method of claim 24, further comprising:
receiving a de-noised sample genomic information dataset obtained from a plurality of embryos having a known aneuploidy classification; and
updating a neural network using the denoised sample genomic information dataset to generate a trained neural network.
26. The method of claim 24, wherein the trained neural network comprises:
an input layer;
a first hidden layer consisting of four nodes;
a second hidden layer consisting of two nodes; and
an output layer having a plurality of nodes corresponding to different aneuploidy classifications.
27. The method of claim 25, wherein the neural network has a feed-forward neural network architecture.
28. The method of claim 25, further comprising applying a back propagation technique to train the neural network.
HK62022047635.7A 2018-10-05 2019-10-07 Systems and methods for identifying chromosomal abnormalities in an embryo HK40058475A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US62/742,211 2018-10-05

Publications (1)

Publication Number Publication Date
HK40058475A (en) 2022-04-22
