
WO2018137496A1 - Method and apparatus for determining the proportion of free nucleic acids of a predetermined source in a biological sample - Google Patents

Info

Publication number
WO2018137496A1
WO2018137496A1 PCT/CN2018/072045 CN2018072045W WO2018137496A1 WO 2018137496 A1 WO2018137496 A1 WO 2018137496A1 CN 2018072045 W CN2018072045 W CN 2018072045W WO 2018137496 A1 WO2018137496 A1 WO 2018137496A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
nucleic acid
predetermined
window
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/072045
Other languages
English (en)
French (fr)
Inventor
袁玉英
柴相花
王书元
陈丽娜
周丽君
刘强
张红云
王威
刘娜
尹烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Genomics Co Ltd
Original Assignee
BGI Genomics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Genomics Co Ltd
Priority to PL18743980.7T (patent PL3575407T3)
Priority to EP18743980.7A (patent EP3575407B1)
Priority to ES18743980T (patent ES2981092T3)
Priority to HRP20240709TT (patent HRP20240709T1)
Priority to RS20240650A (patent RS65618B1)
Priority to CN201880006995.9A (patent CN110191964B)
Publication of WO2018137496A1
Legal status: Ceased

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12M: APPARATUS FOR ENZYMOLOGY OR MICROBIOLOGY; APPARATUS FOR CULTURING MICROORGANISMS FOR PRODUCING BIOMASS, FOR GROWING CELLS OR FOR OBTAINING FERMENTATION OR METABOLIC PRODUCTS, i.e. BIOREACTORS OR FERMENTERS
    • C12M1/00: Apparatus for enzymology or microbiology
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • the present invention relates to the field of biotechnology, particularly non-invasive prenatal gene detection and tumor gene detection, and in particular to a method and apparatus for determining the proportion of free nucleic acids of a predetermined source in a biological sample.
  • the present invention aims to solve at least one of the technical problems existing in the prior art. To this end, it is an object of the present invention to provide a method for accurately and efficiently determining the proportion of free nucleic acids of a predetermined source in a biological sample.
  • Current approaches to estimating the proportion of free fetal DNA in maternal peripheral blood fall into four main categories: 1) exploiting the different methylation responses of specific markers in DNA from maternal peripheral blood mononuclear cells and in fetal free DNA fragments; 2) selecting multiple representative single nucleotide polymorphism (SNP) loci and exploiting their variability; 3) exploiting the difference in length between fetal and maternal DNA fragments in the maternal circulation; 4) using the Y chromosome to estimate the fetal concentration.
  • Method 1) requires a large amount of plasma.
  • Method 2) requires probe capture and high sequencing depth, or requires paternal information.
  • Method 3) requires the fragment lengths to be determined by sequencing, which increases sequencing cost, and the accuracy of its results is only moderate.
  • Method 4) is limited to pregnant women carrying a male fetus and cannot estimate the fetal concentration of a female fetus.
  • the inventors have developed a method for estimating fetal concentration that utilizes only the sequencing data currently detected by NIPT without the need to add additional sequencing data. That is, the method can accurately quantify peripheral blood fetal DNA concentration by low coverage sequencing data.
  • The method is based mainly on the discovery that, after the autosomes are divided into windows of a given size, the number of reads in each window correlates with the fetal concentration, so that the fetal concentration can be estimated from these counts. The fetal concentration of a female fetus can also be estimated, and the accuracy is high.
  • the method is widely applicable and can be applied to free DNA of different sources.
  • the method is also applicable to free tumor nucleic acids in peripheral blood of tumor patients, suspected tumor patients or tumor screeners.
  • the invention provides a method of determining the proportion of free nucleic acids of a predetermined source in a biological sample.
  • the method comprises: (1) performing nucleic acid sequencing on a biological sample containing free nucleic acid to obtain a sequencing result composed of a plurality of sequencing data; (2) aligning the sequencing result with a reference sequence to determine the number of sequencing data falling within a predetermined window; and (3) determining the proportion of free nucleic acids of a predetermined source in the biological sample based on the number of sequencing data falling within the predetermined window.
  • the method of the present invention can accurately and efficiently determine the proportion of free nucleic acids of a predetermined source in a biological sample, and is particularly suitable for determining the proportion of free fetal nucleic acid in the peripheral blood of pregnant women and the proportion of free tumor nucleic acid in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects.
  • the invention also provides an apparatus for determining the proportion of free nucleic acids of a predetermined source in a biological sample.
  • the apparatus includes: a sequencing device for performing nucleic acid sequencing on a biological sample containing free nucleic acid to obtain a sequencing result composed of a plurality of sequencing data; a counting device, connected to the sequencing device and configured to align the sequencing result with a reference sequence to determine the number of sequencing data falling within a predetermined window; and a free nucleic acid ratio determining device, connected to the counting device and configured to determine the proportion of free nucleic acids of a predetermined source in the biological sample based on the number of sequencing data falling within the predetermined window.
  • the apparatus of the invention is suitable for carrying out the method described above, so that the proportion of free nucleic acids of a predetermined source in a biological sample can be determined accurately and efficiently with it. The apparatus is particularly useful for determining the proportion of free fetal nucleic acid in the peripheral blood of pregnant women and the proportion of free tumor nucleic acid in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects.
  • FIG. 1 is a schematic flow diagram of a method of determining a proportion of free nucleic acids of a predetermined source in a biological sample, in accordance with one embodiment of the present invention
  • FIG. 2 is a flow diagram showing the determination of the number of sequencing data falling within a predetermined window, in accordance with one embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of an apparatus for determining a ratio of free nucleic acids of a predetermined source in a biological sample, in accordance with one embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a counting device 200 according to an embodiment of the present invention.
  • FIG. 5 is a weight distribution diagram obtained by using a ridge regression statistical model for weight estimation according to an embodiment of the present invention.
  • FIG. 6 is a correlation analysis diagram of a fetal concentration estimation result of a test set sample determined by a ridge regression statistical model and a fetal concentration estimation result using a Y chromosome according to an embodiment of the present invention
  • ChrY-based is an estimated fetal free nucleic acid ratio based on the Y chromosome
  • Ridge Regression is the fetal free nucleic acid ratio estimated using the ridge regression statistical model of the present invention.
  • FIG. 7 is a weight distribution diagram obtained by using a neural network model for weight estimation according to an embodiment of the present invention.
  • FIG. 8 is a correlation analysis diagram of a fetal concentration estimation result of a test set sample determined by a neural network model and a fetal concentration estimation result by using a Y chromosome, according to an embodiment of the present invention.
  • Male is a male sample
  • Female is a female sample
  • ChrY-based is a fetal free nucleic acid ratio estimation based on Y chromosome
  • FF-QuantSC is a neural network model based on the present invention for estimating fetal free nucleic acid ratio.
  • FIG. 9 is a correlation analysis diagram of the fetal concentration of female fetuses estimated using a ridge regression statistical model and a neural network model according to an embodiment of the present invention; in the figure, Ridge Regression is the fetal free nucleic acid ratio estimated using the ridge regression statistical model of the present invention, and FF-QuantSC is the fetal free nucleic acid ratio estimated using the neural network model of the present invention.
  • the invention provides a method of determining the proportion of free nucleic acids of a predetermined source in a biological sample.
  • the inventors have surprisingly found that the method of the present invention enables accurate and efficient determination of the proportion of free nucleic acids in a biological sample, and is particularly useful for determining fetal nucleic acid in peripheral blood of pregnant women and the proportion of tumor nucleic acid in peripheral blood of tumor patients.
  • the expression "the ratio of free nucleic acids of a predetermined source in a biological sample” as used herein refers to the ratio of the number of free nucleic acid molecules of a specific source to the total number of free nucleic acid molecules in a biological sample.
  • For example, when the biological sample is the peripheral blood of a pregnant woman and the free nucleic acid of the predetermined source is free fetal nucleic acid, the “proportion of free nucleic acid of a predetermined source in the biological sample” is the proportion of free fetal nucleic acid, that is, the ratio of the number of free fetal nucleic acid molecules in the peripheral blood of the pregnant woman to the total number of free nucleic acid molecules. It is sometimes also referred to as "the concentration of free fetal DNA in the peripheral blood of pregnant women" or the free fetal DNA ratio.
  • Likewise, when the biological sample is the peripheral blood of a tumor patient, a suspected tumor patient or a tumor screening subject and the free nucleic acid of the predetermined source is free tumor nucleic acid, the “proportion of free nucleic acid of a predetermined source in the biological sample” is the proportion of free tumor nucleic acid, that is, the ratio of the number of free tumor nucleic acid molecules in that peripheral blood to the total number of free nucleic acid molecules.
  • the method comprises:
  • the biological sample containing the free nucleic acid is subjected to nucleic acid sequencing to obtain a sequencing result composed of a plurality of sequencing data.
  • the biological sample is peripheral blood.
  • the free nucleic acid of the predetermined source is at least one selected from the group consisting of: free fetal nucleic acid in the peripheral blood of a pregnant woman; maternal-derived free nucleic acid in the peripheral blood of a pregnant woman; and free tumor nucleic acid or non-tumor-derived free nucleic acid in the peripheral blood of a tumor patient, a suspected tumor patient or a tumor screening subject.
  • the sequencing is double-end sequencing, single-end sequencing or single molecule sequencing.
  • the nucleic acid is DNA. It should be noted that the term "sequencing data" as used herein refers to the sequence reads corresponding to the sequenced nucleic acid molecules.
  • the sequencing result is aligned with a reference sequence to determine the number of sequencing data that falls within a predetermined window of the sequencing results.
  • the reference sequence is a reference genomic sequence, preferably hg19.
  • the predetermined window is obtained by successively dividing a predetermined chromosome of the reference genome sequence.
  • the predetermined chromosome comprises an autosome, preferably the autosome does not comprise at least one of chromosomes 13, 18 and 21.
  • the predetermined window has a length of 60K.
  • the division of the predetermined window needs to keep the readings in each window uniform, that is, to ensure the Reads uniformity within the window.
  • Reads uniformity in the window means that the number of reads in each window is substantially the same, that is, the variance is close to zero.
  • S200 further includes:
  • S210 Align the sequencing result with a reference genome. Specifically, the sequencing results are aligned with a reference genome to construct a unique alignment sequencing data set, each of the sequencing data in the unique alignment sequencing data set being only capable of matching one position of the reference genome;
  • S220 Determine a reference genome location. Specifically, determining a reference genomic location corresponding to each sequencing data in the unique alignment sequencing data set;
  • S230 Determine the number of sequencing data falling into the predetermined window. Specifically, the number of sequencing data falling into the predetermined window is determined.
  • the number of sequencing data falling within a predetermined window in the sequencing result can be easily determined, and the result is accurate and reliable, and the repeatability is good.
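Steps S210 to S230 can be sketched as follows. This is a minimal illustration only: the `(chrom, position, n_hits)` tuples are a hypothetical, simplified stand-in for real alignment records, positions are assumed to be 0-based, and the 60K window length is read as 60,000 bases.

```python
from collections import Counter

WINDOW_SIZE = 60_000  # assumed reading of the "60K" window length in the text


def count_reads_per_window(alignments, window_size=WINDOW_SIZE):
    """Count uniquely aligned reads per (chromosome, window index).

    `alignments` yields (chrom, position, n_hits) tuples, where n_hits is the
    number of reference locations the read matched (hypothetical format).
    """
    counts = Counter()
    for chrom, position, n_hits in alignments:
        if n_hits != 1:
            continue  # S210: keep only reads matching exactly one position
        window = position // window_size  # S220: locate the read on the reference
        counts[(chrom, window)] += 1      # S230: tally reads in that window
    return counts


reads = [
    ("chr1", 10, 1),
    ("chr1", 59_999, 1),
    ("chr1", 60_000, 1),
    ("chr1", 70_000, 3),   # matches several positions, so it is discarded
    ("chr2", 125_000, 1),
]
counts = count_reads_per_window(reads)
```

In practice the unique-alignment filter would come from the aligner's own output rather than an explicit hit count.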
  • the proportion of free nucleic acids of a predetermined source in the biological sample is determined based on the number of sequencing data falling within the predetermined window.
  • in step (3), the proportion of free nucleic acids of a predetermined source in the biological sample is determined using the weights of the respective predetermined windows.
  • the weight of each predetermined window is predetermined by using a training sample.
  • the results are accurate, reliable, and repeatable.
  • the weight is determined using at least one of a ridge regression statistical model and a neural network model.
  • the neural network model employs the TensorFlow learning system.
  • the parameters of the TensorFlow learning system include: the number of sequencing data in each autosomal window as the input layer; the fetal concentration as the output layer; ReLU as the neuron type; and an optimization algorithm that is at least one selected from Adam, SGD and Ftrl, preferably Ftrl.
  • the parameters of the TensorFlow learning system further include: a learning rate of 0.002; 1 hidden layer; and 200 neurons in the hidden layer.
  • the results are accurate and reliable.
  • weight as used herein is a relative concept for a certain indicator.
  • the weight of an indicator refers to the relative importance of the indicator in the overall evaluation.
  • the weight of a predetermined window refers to the relative importance of that window among all predetermined windows.
  • a “connection weight” refers to the relative importance of one connection between two different layers among all such connections.
  • the training sample is a maternal peripheral blood sample of known free fetal nucleic acid ratio.
  • the training sample is a peripheral blood sample of a pregnant woman with a normal male fetus that is known to have a free fetal nucleic acid ratio. Therefore, the determination result of the ratio of free fetal or maternal-derived free nucleic acid in the peripheral blood sample of the pregnant woman to be tested is more accurate and reliable.
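  • the weight is determined using a ridge regression statistical model, and the calculation formula of the ridge regression statistical model is as follows: $$\hat{y} = \beta_0 + \sum_j \beta_j x_j$$
  • where $\hat{y}$ is the estimated fetal concentration, $x_j$ is the number of reads in window $j$, $\beta_j$ is the weight of each window and $\beta_0$ is the bias; $\beta_j$ and $\beta_0$ are obtained when training the model.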
  • the weights are determined using a neural network model, the calculation formula of which is as follows: $$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
  • where $l$ is the index of the network layer: the first layer is the input layer, the last layer is the output layer (only one neuron), and the layers in between are hidden layers;
  • $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, and $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$;
  • $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$, and $f$ is the output function of the neuron.
  • determining the weight using the neural network model comprises calculating the values of the neurons layer by layer according to the above formula; the neuron value of the last layer is the fetal concentration predicted by the model.
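The layer-by-layer calculation can be sketched in plain Python as below. Each neuron takes the weighted sum of the previous layer's values, adds its bias, and applies ReLU, the neuron type named in the text. The function names and the toy weights are illustrative assumptions, not trained values, and applying ReLU to the output neuron as well is also an assumption.

```python
def relu(z):
    """Rectified linear unit, the neuron type named in the text."""
    return z if z > 0.0 else 0.0


def forward(inputs, layers):
    """Propagate `inputs` through the network layer by layer.

    `layers` is a list of (weights, biases) pairs, one per non-input layer;
    weights[j][k] links neuron k of the previous layer to neuron j of this one.
    """
    values = list(inputs)
    for weights, biases in layers:
        values = [
            relu(sum(w * v for w, v in zip(row, values)) + b)
            for row, b in zip(weights, biases)
        ]
    return values


# Toy network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
# All numbers are made up for illustration, not trained weights.
hidden = ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0])
output = ([[1.0, 0.25]], [0.1])
prediction = forward([1.0, 2.0], [hidden, output])[0]
```

With real data the input vector would hold one read count per autosomal window, and the single output value would be read as the predicted fetal concentration.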
  • Before step (3), the number of sequencing data falling into each predetermined window is GC-corrected to obtain the effective number of sequencing data falling into that window.
  • Thereby, the determined number of sequencing data falling within the predetermined window is accurate and reliable.
  • the GC correction is performed as follows:
  • the effective sequence number of the i-th window is denoted $ER_i$
  • the GC content of the reference in that window is denoted $GC_i$
  • the average number of effective sequences over all windows on the predetermined chromosome is also recorded.
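The correction formula itself is not reproduced in the text above. The sketch below shows one commonly used scheme as an assumption, not necessarily the patent's own formula: windows are grouped by GC content, and each count is scaled by the ratio of the mean count over all windows to the mean count of the windows in the same GC group. The function name and the `bin_width` parameter are hypothetical.

```python
from collections import defaultdict


def gc_correct(counts, gc_contents, bin_width=0.01):
    """Scale each window count by (global mean) / (mean of its GC bin).

    `counts[i]` is the read count of window i and `gc_contents[i]` its GC
    fraction. This binning scheme is an illustrative assumption.
    """
    global_mean = sum(counts) / len(counts)
    bins = defaultdict(list)
    for er, gc in zip(counts, gc_contents):
        bins[round(gc / bin_width)].append(er)
    bin_mean = {key: sum(vals) / len(vals) for key, vals in bins.items()}

    corrected = []
    for er, gc in zip(counts, gc_contents):
        mean_in_bin = bin_mean[round(gc / bin_width)]
        # A bin whose windows are all empty carries no signal to rescale.
        corrected.append(er * global_mean / mean_in_bin if mean_in_bin > 0 else 0.0)
    return corrected


raw_counts = [100, 120, 80, 60]
gc_values = [0.40, 0.40, 0.55, 0.55]
effective = gc_correct(raw_counts, gc_values)
```

After this scaling every GC group has the same mean count as the data set as a whole, which removes the GC-dependent trend while keeping window-to-window differences within a group.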
  • the sex of the fetus is predetermined before the step (3) is performed, preferably by the ratio of the sequencing data of the Y chromosome to the total sequencing data. Therefore, the determination result of the ratio of free fetal or maternal-derived free nucleic acid in the peripheral blood sample of the pregnant woman to be tested is more accurate and reliable.
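A minimal sketch of the sex call, assuming the single threshold a in the range [0.001, 0.003] that the text specifies later. The function name and the default of 0.002 are illustrative choices, and the ratio is compared with a directly, since the text does not say whether ER% is a fraction or a percentage.

```python
def call_fetal_sex(y_reads, total_reads, a=0.002):
    """Call fetal sex from the share of reads on the Y chromosome.

    `a` is the threshold; the text gives its range as [0.001, 0.003], and the
    default used here is an arbitrary value inside that range.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    y_ratio = y_reads / total_reads
    return "male" if y_ratio >= a else "female"


male_call = call_fetal_sex(30, 10_000)    # ratio 0.003, at or above a
female_call = call_fetal_sex(5, 10_000)   # ratio 0.0005, below a
```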
  • the specific steps of estimating the fetal concentration of the sequenced samples using the ridge regression statistical model and the neural network model of the TensorFlow library using the method of the present invention are as follows:
  • the reference (hg19) is divided into adjacent windows by a fixed length (this method uses 60K), the window in the N area is filtered out, and the GC content in the window is counted to obtain a reference window file hg19.gc;
  • the number of valid sequences in the i-th window is ER i
  • the GC content of the reference in the window is GC i (recorded in the hg19.gc file)
  • the autosomes are chromosomes 1 to 22
  • the sex of the fetus is determined from the ratio of the number of effective sequences on the Y chromosome to the total number of effective sequences (ER%): a threshold a is specified (a lies in the range [0.001, 0.003]); the fetus is male when the Y chromosome ER% is greater than or equal to a, and female when it is less than a;
  • weight estimation is performed for each window of the autosomes (excluding chromosomes 13, 18 and 21) using the ridge regression model (the weight corresponds to the regression coefficient β in the ridge regression model), giving the estimated weight distribution.
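The first of the steps above (dividing the reference into adjacent fixed-length windows, filtering out windows in N regions and recording GC content) can be sketched as follows. Two points are assumptions made for the illustration: any window containing an N base is dropped, and a trailing partial window is discarded. The function name is hypothetical and the tiny window size is only for the example.

```python
def build_windows(sequence, window_size=60_000):
    """Split one chromosome sequence into adjacent fixed-length windows.

    Returns (start, end, gc_fraction) for each full-length window that
    contains no N bases; positions are 0-based, end-exclusive.
    """
    windows = []
    for start in range(0, len(sequence) - window_size + 1, window_size):
        chunk = sequence[start:start + window_size].upper()
        if "N" in chunk:
            continue  # window lies in an N region of the reference
        gc = (chunk.count("G") + chunk.count("C")) / window_size
        windows.append((start, start + window_size, gc))
    return windows


# Toy sequence with a 4-base window so the result is easy to follow.
toy_windows = build_windows("ACGTNNNNGGCCAT", window_size=4)
```

Writing the resulting tuples to a file, one window per line, would give a table playing the role of the hg19.gc reference window file.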
  • the first five steps are exactly the same as the ridge regression statistical model method described above.
  • b) Construct a neural network with the number of effective sequences per window of relatively stable autosomes as the input layer, with a single neuron as the output layer (corresponding to the fetal concentration), and no hidden layer.
  • the neuron type is ReLU, and the optimization algorithm is Adam.
  • the neural network is used to predict the fetal concentration in the training set, and the learning rate is adjusted according to how the learning effect changes from round to round, so that the learning rate is as large as possible without the learning effect on the training set oscillating.
  • Ridge regression is an improvement on the least squares method that reduces overfitting of the model by adding a second-order (L2) regularization term.
  • the least squares method finds the $\beta_0, \beta_1, \beta_2, \dots$ that minimize the residual sum of squares: $$RSS = \sum_i \left(y_i - \beta_0 - \sum_j \beta_j x_{ij}\right)^2$$
  • RSS is the residual sum of squares
  • $y_i$ is the dependent variable
  • $x_{ij}$ is the independent variable
  • ridge regression instead minimizes $$RSS + \lambda \sum_j \beta_j^2$$ in which the value of $\lambda$ needs to be specified.
  • the usual approach is to try several values and use cross-validation to find the $\lambda$ that minimizes the objective function on the validation set.
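As a small worked illustration of ridge regression (least squares plus λ times the sum of squared coefficients), the sketch below solves the penalized normal equations in plain Python. Leaving the intercept unpenalized by centering the data is an assumption, the function names are hypothetical, and the data are made up. In practice λ would be chosen by cross-validation as described above; here two fixed values are compared to show the shrinkage.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ridge(X, y, lam):
    """Return (beta0, betas) minimizing RSS + lam * sum(beta_j ** 2)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Penalized normal equations: (Xc^T Xc + lam * I) beta = Xc^T yc
    gram = [
        [sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    betas = solve_linear_system(gram, rhs)
    beta0 = y_mean - sum(b * mu for b, mu in zip(betas, x_mean))
    return beta0, betas


def predict(beta0, betas, x):
    """Linear prediction: bias plus weighted sum of the inputs."""
    return beta0 + sum(b * v for b, v in zip(betas, x))


# Made-up data generated from y = 1 + 2*x1 - x2 (no noise).
X = [[1, 0], [0, 1], [1, 1], [2, 1], [0, 2]]
y = [3, 0, 2, 4, -1]
ols = fit_ridge(X, y, lam=0.0)     # lam = 0 reduces to ordinary least squares
ridge = fit_ridge(X, y, lam=10.0)  # a larger lam shrinks the coefficients
```

With real data each row of `X` would hold the per-window read counts of one training sample and `y` the fetal concentrations estimated from the Y chromosome.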
  • Artificial neural networks are a nonlinear machine learning method.
  • the basic building blocks are neurons; each neuron computes a weighted sum, plus a bias, of several inputs $x_j$: $$z = w_0 + \sum_j w_j x_j$$
  • $w_j$ is the weight
  • $w_0$ is the bias
  • the neuron outputs $f(z)$, computed from this weighted sum.
  • a number of neurons form a multi-layer network: the first layer (input layer) receives the independent variables of the data, the output of each layer is the input of the next layer, and so on, until the last layer (output layer), which has only one neuron with one output (the model prediction). Layers other than the input and output layers are called hidden layers.
  • the basic parameters of a neural network are the weights and biases of each layer, which are generally trained by backpropagation.
  • in addition, there are parameters such as the learning rate, neuron type, optimizer, number of hidden layers, number of neurons in each hidden layer and regularization coefficient, which are generally preset from experience and then adjusted repeatedly according to the training effect.
  • the invention also provides an apparatus for determining the proportion of free nucleic acids of a predetermined source in a biological sample.
  • the apparatus of the present invention is suitable for performing the method described above, so that the proportion of free nucleic acids of a predetermined source in a biological sample can be determined accurately and efficiently with it. The apparatus is particularly useful for determining the proportion of free fetal nucleic acid and of maternal-derived free nucleic acid in maternal peripheral blood, as well as the proportion of free tumor nucleic acid in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects.
  • the apparatus includes: a sequencing apparatus 100, a counting apparatus 200, and a free nucleic acid ratio determining apparatus 300.
  • the sequencing device 100 is configured to perform nucleic acid sequencing on a biological sample containing free nucleic acid to obtain a sequencing result composed of a plurality of sequencing data; the counting device 200 is connected to the sequencing device 100 and is configured to align the sequencing result with a reference sequence to determine the number of sequencing data falling within a predetermined window; and the free nucleic acid ratio determining device 300 is connected to the counting device 200 and is configured to determine the proportion of free nucleic acids of a predetermined source in the biological sample based on the number of sequencing data falling within the predetermined window.
  • the kind of the biological sample is not particularly limited.
  • the biological sample is peripheral blood.
  • the free nucleic acid of the predetermined source is at least one selected from the group consisting of: free fetal nucleic acid in the peripheral blood of a pregnant woman; maternal-derived free nucleic acid in the peripheral blood of a pregnant woman; and free tumor nucleic acid or non-tumor-derived free nucleic acid in the peripheral blood of a tumor patient, a suspected tumor patient or a tumor screening subject.
  • the ratio of free fetal nucleic acid or maternal-derived free nucleic acid in the peripheral blood of the pregnant woman, or the ratio of free tumor nucleic acid in the peripheral blood of the tumor patient, the suspected tumor patient or the tumor screening person can be easily determined.
  • the nucleic acid is DNA.
  • the sequencing is double-end sequencing, single-end sequencing or single molecule sequencing. Thereby, the subsequent steps are facilitated.
  • the reference sequence is a reference genomic sequence, preferably hg19.
  • the predetermined window is obtained by successively dividing a predetermined chromosome of the reference genome sequence.
  • the predetermined chromosome comprises an autosome, preferably the autosome does not comprise at least one of chromosomes 13, 18 and 21.
  • the predetermined window has a length of 60K.
  • the counting device 200 further includes: a comparison unit 210, a position determination unit 220, and a number determination unit 230.
  • the comparison unit 210 is configured to align the sequencing result with a reference genome to construct a uniquely aligned sequencing data set, each sequencing datum of which matches only one position of the reference genome;
  • the position determination unit 220 is connected to the comparison unit 210 and is configured to determine the reference genome position corresponding to each sequencing datum in the uniquely aligned sequencing data set;
  • the number determination unit 230 is connected to the position determination unit 220 and is configured to determine the number of sequencing data falling into the predetermined window. Thereby, the number of sequencing data falling within the predetermined window can be easily determined, and the result is accurate, reliable and repeatable.
  • the free nucleic acid ratio determining means 300 is adapted to determine the proportion of free nucleic acids of a predetermined source in the biological sample using the weights of the respective predetermined windows.
  • the weights of the respective predetermined windows are predetermined by using training samples. As a result, the results are accurate, reliable, and repeatable.
  • the weight is determined using at least one of a ridge regression statistical model and a neural network model.
  • the neural network model employs the TensorFlow learning system.
  • the parameters of the TensorFlow learning system include: the number of sequencing data in each autosomal window as the input layer; the fetal concentration as the output layer; ReLU as the neuron type; and an optimization algorithm that is at least one selected from Adam, SGD and Ftrl, preferably Ftrl.
  • the parameters of the TensorFlow learning system further include: a learning rate of 0.002; 1 hidden layer; and 200 neurons in the hidden layer.
  • the results are accurate and reliable.
  • the training sample is a maternal peripheral blood sample of known free fetal nucleic acid ratio.
  • the training sample is a peripheral blood sample of a pregnant woman with a normal male fetus that is known to have a free fetal nucleic acid ratio. Therefore, the determination of the proportion of free fetal or maternal-derived free nucleic acids in the peripheral blood samples of pregnant women is more accurate and reliable.
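  • the weight is determined using a ridge regression statistical model, and the calculation formula of the ridge regression statistical model is as follows: $$\hat{y} = \beta_0 + \sum_j \beta_j x_j$$
  • where $\hat{y}$ is the estimated fetal concentration, $x_j$ is the number of reads in window $j$, $\beta_j$ is the weight of each window and $\beta_0$ is the bias; $\beta_j$ and $\beta_0$ are obtained when training the model.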
  • the weights are determined using a neural network model, the calculation formula of which is as follows: $$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
  • where $l$ is the index of the network layer: the first layer is the input layer, the last layer is the output layer (only one neuron), and the layers in between are hidden layers;
  • $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, and $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$;
  • $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$, and $f$ is the output function of the neuron.
  • the determining the weight by using the neural network model comprises: calculating the value of the neuron layer by layer according to the calculation formula of the neural network model, wherein the neuron value of the last layer is the predicted value of the fetal concentration model.
  • the apparatus further includes a GC correction device (not shown), connected to the counting device 200 and to the free nucleic acid ratio determining device 300 and configured to GC-correct the number of sequencing data falling within the predetermined window, before the proportion of free nucleic acids of the predetermined source in the biological sample is determined, to obtain the effective number of sequencing data falling within the predetermined window.
  • the determined number of sequencing data falling within the predetermined window is accurate and reliable.
  • the GC correction device is adapted to perform the GC correction as follows:
  • the effective sequence number of the i-th window is denoted $ER_i$
  • the GC content of the reference in that window is denoted $GC_i$
  • the average number of effective sequences over all windows on the predetermined chromosome is also recorded.
  • a fetal sex determining device (not shown), connected to the cell-free nucleic acid proportion determining device 300, for determining the sex of the fetus in advance, preferably from
  • the proportion of Y-chromosome sequencing reads among the total sequencing reads. As a result, the proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test is determined more accurately and reliably.
  • "normal male fetus/female fetus/fetus" as used herein means that the chromosomes of the fetus are normal.
  • "normal male fetus" refers to a male fetus whose chromosomes are normal.
  • a "normal male fetus/female fetus/fetus" may be a singleton or twins.
  • a "normal male fetus" may be a normal singleton male fetus or normal male twins; "normal fetus" does not limit the sex of the fetus,
  • nor whether the pregnancy is a singleton or twins.
  • Fetal concentration was estimated for 1400 sequenced samples using the ridge regression statistical model. The specific steps are as follows:
  • the reference (hg19) is divided into consecutive adjacent windows of fixed length (60K in this embodiment), windows lying in N regions are filtered out, and the GC content of each window is computed to obtain the reference window file hg19.gc;
  • the number of effective reads in the $i$-th window is denoted $ER_i$
  • the GC content of the reference in that window is denoted $GC_i$ (recorded in the hg19.gc file)
  • the autosomes (chromosomes 1 to 22)
  • the threshold range is specified as [0.001, 0.003]: when the Y-chromosome ER% is greater than or equal to 0.003 the fetus is called male, and when it is less than 0.001 the fetus is called female;
  • $y_i$ is the fetal concentration estimated from the Y chromosome
  • $x_{ij}$ is the number of reads in the window
  • $\lambda$ is the coefficient of the second-order regularization term; the $\lambda$ that minimizes the objective function on the validation set is found by cross-validation.
  • the estimated weight distribution is shown in Figure 5.
  • $x_j$ is the number of reads in the window, $\hat{\beta}_j$ is the weight of each window, and $\hat{\beta}_0$ is the bias; $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained when training the model.
  • estimating the cell-free fetal nucleic acid concentration with the method of the invention gives accurate and reliable results.
  • Fetal concentration was estimated for 1400 sequenced samples using a neural network model built with the TensorFlow library. The specific steps are as follows:
  • the first five steps are identical to those of Example 1.
  • b) Construct a neural network with the number of effective reads in each window of the relatively stable autosomes as the input layer and a single neuron as the output layer (corresponding to the fetal concentration), with no hidden layer.
  • ReLU is selected as the neuron type and Adam as the optimization algorithm.
  • c) Use the neural network to predict fetal concentration on the training set (learning), adjusting the learning rate according to how the learning performance changes from round to round, so that the learning rate is as large as possible while the training-set performance does not oscillate.
  • learning proceeds by computing, from the number of reads in each window $a_j^{1}$, the value of each neuron $a_j^{l}$ layer by layer according to the formula $a_j^{l} = f\left(\sum_k w_{jk}^{l} a_k^{l-1} + b_j^{l}\right)$
  • $l$ is the index of the network layer
  • the first layer is the input layer
  • the last layer is the output layer (a single neuron)
  • the layers in between are hidden layers.
  • $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, and $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$
  • $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, and $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$.
  • i) Estimate the fetal concentration of the test-set samples using the trained neural network: from the number of reads in each window, compute the value of each neuron layer by layer according to the same formula; the neuron value of the last layer is the model-predicted fetal concentration.
  • estimating the cell-free fetal nucleic acid concentration with the method of the invention gives accurate and reliable results.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Microbiology (AREA)
  • Genetics & Genomics (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Immunology (AREA)
  • Molecular Biology (AREA)
  • Sustainable Development (AREA)
  • Biomedical Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

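The invention discloses a method and a device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample, comprising: (1) performing nucleic acid sequencing on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads; (2) aligning the sequencing result to a reference sequence, so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows; and (3) determining, based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample.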

Description

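Method and device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample
Technical field
The invention relates to the field of biotechnology, in particular to non-invasive prenatal genetic testing and tumor genetic testing, and specifically to a method and a device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample.
Background
Since 1977, researchers have successively discovered cancer-derived DNA in the peripheral blood of tumor patients and confirmed the presence of cell-free fetal DNA (cff-DNA) in the plasma of pregnant women. Detecting and estimating the cancer-derived DNA in the peripheral blood of tumor patients, and the proportion of cell-free fetal DNA in maternal plasma, that is, determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample, is of great significance.
However, current methods for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample still need improvement.
Summary of the invention
The invention aims to solve at least one of the technical problems in the prior art. To this end, one object of the invention is to provide a method capable of accurately and efficiently determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample.
It should be noted that the invention was completed on the basis of the following findings of the inventors:
At present, there are four main approaches to estimating the proportion of cell-free fetal DNA in peripheral blood: 1) estimation based on the different methylation responses of specific markers in maternal-derived and fetal-derived cell-free DNA fragments among maternal peripheral blood mononuclear cells; 2) estimation from several representative SNP sites, exploiting differences at single nucleotide polymorphism (SNP) sites; 3) estimation from differences between fetal and maternal DNA fragments in the maternal circulation; 4) estimation of the fetal concentration from the Y chromosome in women carrying a male fetus. All four have limitations: method 1) requires a relatively large plasma volume; method 2) requires probe capture and high sequencing depth, or paternal information; method 3) requires sequencing the fragment lengths, which adds sequencing cost, and its accuracy is mediocre; method 4) is limited to women carrying a male fetus and cannot estimate the fetal concentration for a female fetus.
To overcome these limitations, the inventors developed a method of fetal concentration estimation that uses only the sequencing data of current NIPT testing and requires no additional sequencing data. In other words, the method can accurately quantify the fetal DNA concentration in peripheral blood from low-coverage sequencing data. The method is based mainly on the finding that, after the autosomes are divided into windows of a certain size, the number of reads in each window shows some correlation with the fetal concentration, and this is used to estimate the fetal concentration. The method can estimate the fetal concentration for female fetuses as well, and its accuracy is high.
On further study the inventors found that the method is broadly applicable to cell-free DNA of different origins. For example, it is also suitable for determining the proportion of cell-free tumor nucleic acids, or of non-tumor-derived cell-free nucleic acids, in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects, and it likewise gives accurate and reliable results from low-coverage sequencing data.
Thus, according to one aspect of the invention, the invention provides a method for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample. According to embodiments of the invention, the method comprises: (1) performing nucleic acid sequencing on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads; (2) aligning the sequencing result to a reference sequence, so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows; and (3) determining, based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample.
The inventors surprisingly found that the method of the invention can accurately and efficiently determine the proportion of cell-free nucleic acids from a predetermined source in a biological sample, and is particularly suitable for determining the proportion of cell-free fetal nucleic acids in the peripheral blood of pregnant women and of cell-free tumor nucleic acids in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects.
According to another aspect of the invention, the invention further provides a device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample. According to embodiments of the invention, the device comprises: a sequencing device for performing nucleic acid sequencing on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads; a counting device, connected to the sequencing device, for aligning the sequencing result to a reference sequence so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows; and a cell-free nucleic acid proportion determining device, connected to the counting device, for determining, based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample.
According to embodiments of the invention, the device is suitable for carrying out the method described above, so that the device can accurately and efficiently determine the proportion of cell-free nucleic acids from a predetermined source in a biological sample, and is particularly suitable for determining the proportion of cell-free fetal nucleic acids in the peripheral blood of pregnant women and of cell-free tumor nucleic acids in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects.
Additional aspects and advantages of the invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the invention.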
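Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:
Figure 1 is a schematic flowchart of a method for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample according to one embodiment of the invention;
Figure 2 is a schematic flowchart of determining the number of sequencing reads falling within predetermined windows according to one embodiment of the invention;
Figure 3 is a schematic structural diagram of a device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample according to one embodiment of the invention;
Figure 4 is a schematic structural diagram of the counting device 200 according to one embodiment of the invention;
Figure 5 is the weight distribution obtained by weight estimation with the ridge regression statistical model according to one embodiment of the invention;
Figure 6 is a correlation plot, according to one embodiment of the invention, between the fetal concentration of the test-set samples estimated with the ridge regression statistical model and the fetal concentration estimated from the Y chromosome. In the figure, Male denotes male samples, Female denotes female samples, ChrY-based is the cell-free fetal nucleic acid proportion estimated from the Y chromosome, and Ridge Regression is the cell-free fetal nucleic acid proportion estimated with the ridge regression statistical model of the invention.
Figure 7 is the weight distribution obtained by weight estimation with the neural network model according to one embodiment of the invention;
Figure 8 is a correlation plot, according to one embodiment of the invention, between the fetal concentration of the test-set samples estimated with the neural network model and the fetal concentration estimated from the Y chromosome. In the figure, Male denotes male samples, Female denotes female samples, ChrY-based is the cell-free fetal nucleic acid proportion estimated from the Y chromosome, and FF-QuantSC is the cell-free fetal nucleic acid proportion estimated with the neural network model of the invention.
Figure 9 is a correlation plot, according to one embodiment of the invention, between the fetal concentrations estimated for female fetuses by the ridge regression statistical model and by the neural network model. In the figure, Ridge Regression is the cell-free fetal nucleic acid proportion estimated with the ridge regression statistical model of the invention, and FF-QuantSC is the cell-free fetal nucleic acid proportion estimated with the neural network model of the invention.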
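Detailed description
Embodiments of the invention are described in detail below. The embodiments described below are exemplary, serve only to explain the invention, and are not to be construed as limiting it.
Method for determining the proportion of cell-free nucleic acids in a biological sample
According to one aspect of the invention, the invention provides a method for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample. The inventors surprisingly found that the method can accurately and efficiently determine the proportion of cell-free nucleic acids in a biological sample, and is particularly suitable for determining the proportion of fetal nucleic acids in the peripheral blood of pregnant women and of tumor nucleic acids in the peripheral blood of tumor patients.
It should be noted that the expression "proportion of cell-free nucleic acids from a predetermined source in a biological sample" as used herein means the ratio of the number of cell-free nucleic acid molecules from a specific source to the total number of cell-free nucleic acid molecules in the biological sample. For example, when the biological sample is the peripheral blood of a pregnant woman and the cell-free nucleic acids from the predetermined source are cell-free fetal nucleic acids, this proportion is the cell-free fetal nucleic acid proportion, i.e. the ratio of the number of cell-free fetal nucleic acid molecules to the total number of cell-free nucleic acid molecules in the maternal peripheral blood; it is sometimes also called the "cell-free fetal DNA concentration in maternal peripheral blood" or the cell-free fetal DNA proportion. As a further example, when the biological sample is the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject and the cell-free nucleic acids from the predetermined source are cell-free tumor nucleic acids, this proportion is the cell-free tumor nucleic acid proportion, i.e. the ratio of the number of cell-free tumor nucleic acid molecules to the total number of cell-free nucleic acid molecules in that peripheral blood. According to embodiments of the invention, referring to Figure 1, the method comprises: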
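S100: Nucleic acid sequencing
Nucleic acid sequencing is performed on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads. According to embodiments of the invention, the biological sample is peripheral blood. According to embodiments of the invention, the cell-free nucleic acids from the predetermined source are at least one selected from: cell-free fetal nucleic acids in the peripheral blood of a pregnant woman; maternal-derived cell-free nucleic acids in the peripheral blood of a pregnant woman; and cell-free tumor nucleic acids or non-tumor-derived cell-free nucleic acids in the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject. The proportion of cell-free fetal or maternal-derived nucleic acids in maternal peripheral blood, or of cell-free tumor nucleic acids in the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject, can thus be determined easily. According to embodiments of the invention, the sequencing is paired-end sequencing, single-end sequencing or single-molecule sequencing. According to some specific examples of the invention, the nucleic acid is DNA. It should be noted that the term "sequencing data" as used herein means sequence reads, each corresponding to a sequenced nucleic acid molecule.
S200: Determining the number of sequencing reads falling within predetermined windows
The sequencing result is aligned to a reference sequence, so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows.
According to embodiments of the invention, the reference sequence is a reference genome sequence, preferably hg19.
According to embodiments of the invention, the predetermined windows are obtained by dividing predetermined chromosomes of the reference genome sequence into consecutive segments.
According to embodiments of the invention, the predetermined chromosomes include autosomes; preferably, the autosomes exclude at least one of chromosomes 13, 18 and 21.
According to embodiments of the invention, the length of the predetermined window is 60K.
The predetermined windows should be divided so that the reads in each window are uniform, that is, within-window read uniformity is ensured. "Within-window read uniformity" means that the number of reads in each window is essentially the same, i.e. the variance is close to 0.
According to embodiments of the invention, referring to Figure 2, S200 further comprises:
S210: Aligning the sequencing result to the reference genome. Specifically, the sequencing result is aligned to the reference genome so as to construct a uniquely aligned read set, each read of which matches only one position of the reference genome;
S220: Determining reference genome positions. Specifically, the reference genome position corresponding to each read in the uniquely aligned read set is determined; and
S230: Determining the number of sequencing reads falling within predetermined windows. Specifically, the number of sequencing reads falling within each predetermined window is determined.
In this way, the number of sequencing reads in the sequencing result that fall within the predetermined windows can be determined easily, with accurate, reliable and reproducible results.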
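Steps S210 to S230 can be illustrated with a minimal sketch. It assumes the aligner output has already been reduced to `(read_id, chrom, pos)` tuples, one per alignment hit; that tuple layout and the function name `count_reads_per_window` are hypothetical and not part of the original text. Only the 60K window length comes from the description.

```python
from collections import Counter

WINDOW_SIZE = 60_000  # 60K window length stated in the description


def count_reads_per_window(alignments, window_size=WINDOW_SIZE):
    """Count uniquely aligned reads per fixed-length window.

    alignments: iterable of (read_id, chrom, pos) tuples, one per alignment hit
    (hypothetical input layout, used here for illustration only).
    """
    alignments = list(alignments)
    # S210: a read is "uniquely aligned" if it has exactly one alignment hit.
    hits = Counter(read_id for read_id, _, _ in alignments)
    counts = Counter()
    for read_id, chrom, pos in alignments:
        if hits[read_id] == 1:
            # S220/S230: map the position to its window index and count it.
            counts[(chrom, pos // window_size)] += 1
    return counts
```

A read that maps to two places (for example the same `read_id` reported on both chr2 and chr3) contributes to no window, which mirrors the uniquely aligned read set of S210.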
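S300: Determining the cell-free nucleic acid proportion
Based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample is determined.
According to embodiments of the invention, in step (3) the proportion of cell-free nucleic acids from the predetermined source in the biological sample is determined using the weight of each predetermined window. According to some specific examples of the invention, in step (3) the weights of the predetermined windows are determined in advance using training samples. The results are therefore accurate, reliable and reproducible. According to embodiments of the invention, the weights are determined using at least one of a ridge regression statistical model and a neural network model. According to some embodiments of the invention, the neural network model uses the TensorFlow learning system. According to some specific examples of the invention, the parameters of the TensorFlow learning system include: the number of sequencing reads in each autosomal window as the input layer; the fetal concentration as the output layer; ReLU as the neuron type; and at least one of Adam, SGD and Ftrl as the optimization algorithm, preferably Ftrl. Preferably, the parameters of the TensorFlow learning system further include: a learning rate of 0.002; one hidden layer; and 200 neurons in the hidden layer. The results are therefore accurate and reliable. It should be noted that the term "weight" as used herein is a relative concept defined with respect to a given indicator: the weight of an indicator is its relative importance in the overall evaluation. For example, the "weight of a predetermined window" is the relative importance of that window among all predetermined windows, and a "connection weight" is the relative importance of one connection between two different layers among all such connections.
According to embodiments of the invention, the training samples are peripheral blood samples of pregnant women with a known cell-free fetal nucleic acid proportion. The proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test can thus be determined effectively. According to some specific examples of the invention, the training samples are peripheral blood samples of women pregnant with a normal male fetus and with a known cell-free fetal nucleic acid proportion. The proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test is therefore determined more accurately and reliably.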
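According to embodiments of the invention, the weights are determined using a ridge regression statistical model, whose calculation formula is as follows:
$$\hat{y} = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_j$$
where $\hat{y}$ is the model-predicted fetal concentration, $x_j$ is the number of reads in window $j$, $\hat{\beta}_j$ is the weight of each window, and $\hat{\beta}_0$ is the bias; $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained when training the model.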
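As a small illustration of how a trained ridge model is applied, the sketch below evaluates a prediction of the form bias plus the weighted sum of per-window read counts. The function name `predict_fetal_concentration` and the toy numbers are hypothetical; real weights would come from training.

```python
def predict_fetal_concentration(window_reads, weights, bias):
    """Return bias + sum_j weights[j] * window_reads[j] (linear ridge prediction)."""
    if len(window_reads) != len(weights):
        raise ValueError("one weight is required per window")
    return bias + sum(w * x for w, x in zip(weights, window_reads))


# Toy usage with made-up weights for two windows.
estimate = predict_fetal_concentration([100, 200], [0.001, -0.0002], bias=0.05)
```

Negative weights are allowed: a window whose read count tends to fall as the fetal concentration rises would carry one.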
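According to other embodiments of the invention, the weights are determined using a neural network model, whose calculation formula is as follows:
$$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
where $l$ is the index of the network layer; the first layer is the input layer, the last layer is the output layer (a single neuron), and the layers in between are hidden layers. $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$, $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, and $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$. The most common form of the function $f$ is the rectified linear unit, i.e. $f(x)=\max(0,x)$; $w$ and $b$ are obtained when training the model.
When the neural network model is applied, the neuron values are computed layer by layer according to the above formula, and the neuron value of the last layer is the model-predicted fetal concentration.
That is, according to embodiments of the invention, determining the weights using the neural network model comprises: computing the neuron values layer by layer according to the calculation formula of the neural network model, the neuron value of the last layer being the model-predicted fetal concentration.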
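The layer-by-layer computation can be sketched in plain Python. This is an illustrative forward pass only, assuming ReLU is applied at every layer (as the formula in the text suggests) and that each layer is given as a `(W, b)` pair with `W[j][k]` the weight from neuron `k` of the previous layer to neuron `j`; the names `relu` and `forward` and the toy weights are hypothetical.

```python
def relu(z):
    """Rectified linear unit, f(z) = max(0, z)."""
    return max(0.0, z)


def forward(reads, layers):
    """Compute neuron values layer by layer and return the single output value.

    reads:  input-layer values (per-window read counts, possibly standardized).
    layers: list of (W, b) pairs, one per non-input layer.
    """
    a = list(reads)
    for W, b in layers:
        a = [relu(sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j)
             for row, b_j in zip(W, b)]
    return a[0]  # the output layer has a single neuron


# Toy network: 2 inputs -> 2 hidden neurons -> 1 output.
toy_layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
              ([[2.0, 1.0]], [0.1])]
prediction = forward([3.0, 1.0], toy_layers)
```

In practice the weights and biases would be learned with a framework such as TensorFlow; the sketch only shows how a prediction follows from them.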
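According to embodiments of the invention, before step (3) is performed, GC correction is applied to the number of sequencing reads falling within the predetermined windows, so as to obtain the number of effective reads falling within the predetermined windows. The determined number of sequencing reads falling within the predetermined windows is therefore accurate and reliable.
Preferably, according to some embodiments of the invention, the GC correction is performed as follows:
a) For a given sample, denote the number of effective reads in the $i$-th window by $ER_i$, the GC content of the reference in that window by $GC_i$, and the mean number of effective reads over all windows on the predetermined chromosomes by $\overline{ER}$;
b) Fit the number of effective reads against the GC content over all windows of the predetermined chromosomes to obtain the relationship between the two: $ER=f(gc)$;
c) Correct the windows of all chromosomes:
$$ERA_i = ER_i \cdot \frac{\overline{ER}}{f(GC_i)}$$
where $ERA_i$ denotes the GC-corrected number of effective reads in the $i$-th window.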
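The three GC-correction steps can be sketched as follows. Two simplifications are assumptions of this sketch rather than the patent's procedure: the relationship f is estimated as the mean read count per GC bin instead of a fitted curve, and the same windows are used both for fitting and for correction. The correction is taken to scale each window by the overall mean divided by f at that window's GC content. The name `gc_correct` and the `bin_width` parameter are hypothetical.

```python
from collections import defaultdict


def gc_correct(er, gc, bin_width=0.01):
    """GC-correct per-window read counts.

    er: effective read count per window (ER_i).
    gc: GC content per window as a fraction in [0, 1] (GC_i).
    """
    mean_er = sum(er) / len(er)            # step a): mean over all windows
    bins = defaultdict(list)
    for e, g in zip(er, gc):               # step b): crude stand-in for ER = f(gc)
        bins[round(g / bin_width)].append(e)
    f = {b: sum(v) / len(v) for b, v in bins.items()}
    corrected = []
    for e, g in zip(er, gc):               # step c): ERA_i = ER_i * mean / f(GC_i)
        expected = f[round(g / bin_width)]
        corrected.append(e * mean_er / expected if expected > 0 else 0.0)
    return corrected
```

With two GC bins whose raw counts differ only because of GC content, the corrected counts collapse to the common mean, which is the intended effect of the correction.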
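According to embodiments of the invention, before step (3) is performed, the sex of the fetus is determined in advance, preferably from the proportion of Y-chromosome sequencing reads among the total sequencing reads. The proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test is therefore determined more accurately and reliably.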
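A minimal sketch of sex calling from the Y-chromosome read share is given below. It uses the two bounds quoted elsewhere in the description (0.001 and 0.003) and treats them as plain fractions, which is an assumption; the "undetermined" label for values between the bounds and the function name `call_fetal_sex` are also hypothetical.

```python
def call_fetal_sex(chr_y_reads, total_reads, low=0.001, high=0.003):
    """Call fetal sex from the share of effective reads mapping to chromosome Y."""
    ratio = chr_y_reads / total_reads
    if ratio >= high:
        return "male"
    if ratio < low:
        return "female"
    return "undetermined"  # between the two bounds: assumption of this sketch
```

Samples called male can then supply Y-chromosome-based fetal concentration values as training targets, which is how the description uses them.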
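In addition, according to some specific examples of the invention, the specific steps for estimating the fetal concentration of sequenced samples with the method of the invention, using the ridge regression statistical model and a neural network model from the TensorFlow library, are respectively as follows:
1. Estimating the fetal concentration of sequenced samples with the ridge regression statistical model:
1) The reference (hg19) is divided into consecutive adjacent windows of fixed length (60K in this method), windows lying in N regions are filtered out, and the GC content of each window is computed to obtain the reference window file hg19.gc;
2) Alignment. The reads (28 bp) obtained by SE sequencing on the CG platform are aligned (BWA V0.7.7-r441) to the reference (hg19);
3) Filtering and preliminary statistics. Based on the alignment results, uniquely and perfectly aligned reads are selected; duplicate reads and reads with base mismatches are removed to obtain the effective reads; the number of effective reads and the GC content of each window are then counted according to the windows in the hg19.gc file;
4) GC correction. The specific steps are as follows:
a) For a given sample, denote the number of effective reads in the $i$-th window by $ER_i$, the GC content of the reference in that window by $GC_i$ (recorded in the hg19.gc file), and the mean number of effective reads over all windows on the autosomes (chromosomes 1 to 22) by $\overline{ER}$;
b) Fit the number of effective reads against the GC content over all autosomal windows (a cubic spline fit is used in this method) to obtain the relationship between the two: $ER=f(gc)$;
c) Correct the windows of all chromosomes:
$$ERA_i = ER_i \cdot \frac{\overline{ER}}{f(GC_i)}$$
where $ERA_i$ denotes the GC-corrected number of effective reads in the $i$-th window.
5) Confirming fetal sex
Fetal sex is confirmed from the ratio of the number of effective reads on the Y chromosome to the total number of effective reads (ER%); a threshold a is specified (a takes a value in the range [0.001, 0.003]); the fetus is male when the Y-chromosome ER% is greater than or equal to a, and female when it is less than a;
6) Estimating the fetal concentration with the ridge regression model. The specific steps are as follows:
a) Select male-fetus samples as the training set; select a batch of samples as the test set (the numbers of male-fetus and female-fetus samples may be made equal).
b) Use the ridge regression model to estimate a weight for each window of the autosomes (excluding chromosomes 13, 18 and 21); in the ridge regression model this weight is equivalent to the regression coefficient β. This yields the estimated weight distribution.
c) Use the known weights to estimate the fetal concentration of the test-set samples.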
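Step 1) of the procedure above, building the window file, can be sketched as follows. The sketch works on an in-memory sequence string per chromosome and drops any window that contains an N base; that filtering rule is an assumption, since the text only says that windows in N regions are filtered out. The function name `build_windows` and the tuple layout are hypothetical.

```python
def build_windows(chrom, seq, window_size=60_000):
    """Split a chromosome sequence into consecutive windows with their GC content.

    Returns a list of (chrom, start, end, gc_fraction) tuples, skipping windows
    that contain N bases (assumed reading of "filter out windows in N regions").
    """
    windows = []
    for start in range(0, len(seq), window_size):
        chunk = seq[start:start + window_size].upper()
        if "N" in chunk:
            continue
        gc = (chunk.count("G") + chunk.count("C")) / len(chunk)
        windows.append((chrom, start, start + len(chunk), gc))
    return windows


# Toy usage with a tiny window size so the behaviour is easy to inspect.
toy = build_windows("chr1", "ACGTNNNNGGCC", window_size=4)
```

For a real genome the sequence would be streamed from a FASTA file rather than held as one string, and the result written out in whatever layout the downstream counting step expects.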
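2. Estimating the fetal concentration of sequenced samples with a neural network model from the TensorFlow library
The first 5 steps are exactly the same as in the ridge regression statistical model method described above.
6) Estimating the fetal concentration with a neural network model from the TensorFlow library. The specific steps are as follows:
a) Select male-fetus samples as the training set; select a batch of samples as the test set (the numbers of male-fetus and female-fetus samples may be made equal). Standardize all data, i.e. apply a linear transformation to each variable so that its mean over all samples is 0 and its standard deviation is 1.
b) Construct a neural network with the number of effective reads in each window of the relatively stable autosomes as the input layer and a single neuron as the output layer (corresponding to the fetal concentration), with no hidden layer. ReLU is selected as the neuron type and Adam as the optimization algorithm.
c) Use the neural network to predict the fetal concentration on the training set, adjusting the learning rate according to how the learning performance changes from round to round, so that the learning rate is as large as possible while the training-set performance does not oscillate.
d) Train for as many rounds as computing power allows, until the learning performance saturates.
e) Switch to other optimization algorithms (SGD, Ftrl, etc.), repeat steps b to d, and select the best optimization algorithm according to the learning performance.
f) Try adding a second-order regularization term to the neural network and adjusting its magnitude, observing the learning performance before and after adding it and as its magnitude is adjusted.
g) Add one hidden layer, adjust the number of neurons in the hidden layer, repeat steps b to f, and select the best hidden-layer architecture according to the learning performance.
h) Train the optimized neural network on the training set to obtain the optimal parameters and the distribution of the weights of each window (the mean weight from each input-layer neuron to the hidden layer).
i) Use the trained neural network to estimate the fetal concentration of the test-set samples.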
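The standardization in step a) of the neural network procedure (each variable rescaled to mean 0 and standard deviation 1 across samples) can be sketched with the standard library. Using the population standard deviation and mapping constant columns to 0 are assumptions of this sketch; the function name `standardize_columns` is hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column of a samples-by-windows table.

    rows: list of per-sample lists, one value per window.
    """
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
            for row in rows]


scaled = standardize_columns([[1.0, 10.0], [3.0, 10.0]])
```

When the model is later applied to new samples, the means and standard deviations learned here would normally be reused rather than recomputed.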
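For ease of understanding, the basic principles of the models described in the invention are briefly introduced as follows:
(1) Ridge regression statistical model
Ridge regression is an improvement on the least squares method that reduces overfitting of the model by adding a second-order regularization term.
Mathematically, the least squares method finds the β 0, β 1, β 2, ... that minimize the residual sum of squares:
$$RSS = \sum_i \left(y_i - \beta_0 - \sum_j \beta_j x_{ij}\right)^2$$
where RSS is the residual sum of squares, $y_i$ is the dependent variable and $x_{ij}$ are the independent variables.
If the above objective function is modified by adding a second-order regularization term, the result is ridge regression, i.e. finding the β 0, β 1, β 2, ... that minimize the following objective function:
$$\sum_i \left(y_i - \beta_0 - \sum_j \beta_j x_{ij}\right)^2 + \lambda \sum_j \beta_j^2$$
The value of λ has to be specified; the usual practice is to try several values and use cross-validation to find the λ that minimizes the objective function on the validation set.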
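To make the ridge objective concrete, here is a small self-contained sketch that fits a ridge model in closed form and picks λ from a list of candidates. Several choices are assumptions made for illustration: the intercept is left unpenalized, a single hold-out validation set stands in for full cross-validation, and validation error is measured as the residual sum of squares. All function names are hypothetical, and the pure-Python solver is only practical for a handful of features.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ridge(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); returns (beta0, [beta_1..beta_p])."""
    n, p = len(X), len(X[0])
    x_bar = [sum(row[j] for row in X) / n for j in range(p)]
    y_bar = sum(y) / n
    Xc = [[row[j] - x_bar[j] for j in range(p)] for row in X]  # centered inputs
    yc = [v - y_bar for v in y]
    A = [[sum(r[j] * r[k] for r in Xc) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(r[j] * t for r, t in zip(Xc, yc)) for j in range(p)]
    beta = solve(A, b)
    beta0 = y_bar - sum(bj * xj for bj, xj in zip(beta, x_bar))
    return beta0, beta


def choose_lambda(X, y, X_val, y_val, candidates):
    """Pick the candidate lambda with the smallest validation RSS."""
    def val_rss(lam):
        b0, b = fit_ridge(X, y, lam)
        return sum((t - b0 - sum(bj * xj for bj, xj in zip(b, row))) ** 2
                   for row, t in zip(X_val, y_val))
    return min(candidates, key=val_rss)
```

For genome-wide window counts the number of features is in the tens of thousands, so a numerical library would be used instead; the sketch is only meant to show what the objective and the λ search compute.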
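(2) Artificial neural network
An artificial neural network (i.e. the neural network model) is a nonlinear machine learning method. Its basic unit is the neuron; each neuron computes a weighted sum, plus a bias, of several inputs $x_j$:
$$z = \sum_j w_j x_j + w_0,$$
where $w_j$ are the weights and $w_0$ is the bias, and outputs $f(z)$ based on the result. The most commonly used functional form at present is the rectified linear unit, i.e. $f(z)=\max(0,z)$.
A number of neurons form a multi-layer network, in which the first layer (input layer) takes the independent variables of the data as input, and the output of each layer is the input of the next, and so on, until the last layer (output layer), which has only one neuron and one output (the model prediction). The layers other than the input and output layers are called hidden layers.
The basic parameters of a neural network are the weights and biases of each layer, which are generally trained by backpropagation. There are also parameters such as the learning rate, neuron type, optimization algorithm (optimizer), number of hidden layers, number of neurons in the hidden layers and regularization coefficient, which are generally preset from experience and then adjusted repeatedly in light of the training performance.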
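Device for determining the proportion of cell-free nucleic acids in a biological sample
According to another aspect of the invention, the invention further provides a device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample. The inventors surprisingly found that the device is suitable for carrying out the method of the invention described above, so that the device can accurately and efficiently determine the proportion of cell-free nucleic acids from a predetermined source in a biological sample, and is particularly suitable for determining the proportion of cell-free fetal nucleic acids and maternal-derived cell-free nucleic acids in the peripheral blood of pregnant women, and of cell-free tumor nucleic acids in the peripheral blood of tumor patients, suspected tumor patients or tumor screening subjects.
According to embodiments of the invention, referring to Figure 3, the device comprises: a sequencing device 100, a counting device 200 and a cell-free nucleic acid proportion determining device 300.
Specifically, the sequencing device 100 is used to perform nucleic acid sequencing on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads; the counting device 200 is connected to the sequencing device 100 and is used to align the sequencing result to a reference sequence, so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows; the cell-free nucleic acid proportion determining device 300 is connected to the counting device 200 and is used to determine, based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample.
According to embodiments of the invention, the type of biological sample is not particularly limited. According to specific examples of the invention, the biological sample is peripheral blood. According to embodiments of the invention, the cell-free nucleic acids from the predetermined source are at least one selected from: cell-free fetal nucleic acids in the peripheral blood of a pregnant woman; maternal-derived cell-free nucleic acids in the peripheral blood of a pregnant woman; and cell-free tumor nucleic acids or non-tumor-derived cell-free nucleic acids in the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject. The proportion of cell-free fetal or maternal-derived nucleic acids in maternal peripheral blood, or of cell-free tumor nucleic acids in the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject, can thus be determined easily.
According to embodiments of the invention, the nucleic acid is DNA.
According to embodiments of the invention, the sequencing is paired-end sequencing, single-end sequencing or single-molecule sequencing. This facilitates the subsequent steps.
According to embodiments of the invention, the reference sequence is a reference genome sequence, preferably hg19.
According to embodiments of the invention, the predetermined windows are obtained by dividing predetermined chromosomes of the reference genome sequence into consecutive segments. According to some specific examples of the invention, the predetermined chromosomes include autosomes; preferably, the autosomes exclude at least one of chromosomes 13, 18 and 21. According to some preferred embodiments of the invention, the length of the predetermined window is 60K.
According to embodiments of the invention, referring to Figure 4, the counting device 200 further comprises: an alignment unit 210, a position determining unit 220 and a number determining unit 230. Specifically, the alignment unit 210 is used to align the sequencing result to the reference genome so as to construct a uniquely aligned read set, each read of which matches only one position of the reference genome; the position determining unit 220 is connected to the alignment unit 210 and is used to determine the reference genome position corresponding to each read in the uniquely aligned read set; the number determining unit 230 is connected to the position determining unit 220 and is used to determine the number of sequencing reads falling within the predetermined windows. In this way, the number of sequencing reads falling within the predetermined windows can be determined easily, with accurate, reliable and reproducible results.
According to embodiments of the invention, the cell-free nucleic acid proportion determining device 300 is adapted to determine the proportion of cell-free nucleic acids from the predetermined source in the biological sample using the weight of each predetermined window. According to some specific examples of the invention, the weights of the predetermined windows are determined in advance using training samples. The results are therefore accurate, reliable and reproducible. According to embodiments of the invention, the weights are determined using at least one of a ridge regression statistical model and a neural network model. According to some embodiments of the invention, the neural network model uses the TensorFlow learning system. According to some specific examples of the invention, the parameters of the TensorFlow learning system include: the number of sequencing reads in each autosomal window as the input layer; the fetal concentration as the output layer; ReLU as the neuron type; and at least one of Adam, SGD and Ftrl as the optimization algorithm, preferably Ftrl. Preferably, the parameters of the TensorFlow learning system further include: a learning rate of 0.002; one hidden layer; and 200 neurons in the hidden layer. The results are therefore accurate and reliable.
According to embodiments of the invention, the training samples are peripheral blood samples of pregnant women with a known cell-free fetal nucleic acid proportion. The proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test can thus be determined effectively. According to some specific examples of the invention, the training samples are peripheral blood samples of women pregnant with a normal male fetus and with a known cell-free fetal nucleic acid proportion. The proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test is therefore determined more accurately and reliably.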
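According to embodiments of the invention, the weights are determined using a ridge regression statistical model, whose calculation formula is as follows:
$$\hat{y} = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_j$$
where $\hat{y}$ is the model-predicted fetal concentration, $x_j$ is the number of reads in window $j$, $\hat{\beta}_j$ is the weight of each window, and $\hat{\beta}_0$ is the bias; $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained when training the model.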
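According to other embodiments of the invention, the weights are determined using a neural network model, whose calculation formula is as follows:
$$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
where $l$ is the index of the network layer; the first layer is the input layer, the last layer is the output layer (a single neuron), and the layers in between are hidden layers. $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$, $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, and $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$. The most common form of the function $f$ is the rectified linear unit, i.e. $f(x)=\max(0,x)$; $w$ and $b$ are obtained when training the model.
According to embodiments of the invention, determining the weights using the neural network model comprises: computing the neuron values layer by layer according to the calculation formula of the neural network model, the neuron value of the last layer being the model-predicted fetal concentration.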
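According to embodiments of the invention, the device further comprises a GC correction device (not shown), connected to the counting device 200 and to the cell-free nucleic acid proportion determining device 300, respectively, for applying GC correction to the number of sequencing reads falling within the predetermined windows before the proportion of cell-free nucleic acids from the predetermined source in the biological sample is determined, so as to obtain the number of effective reads falling within the predetermined windows. The determined number of sequencing reads falling within the predetermined windows is therefore accurate and reliable.
Preferably, according to some embodiments of the invention, the GC correction device is adapted to perform the GC correction as follows:
a) For a given sample, denote the number of effective reads in the $i$-th window by $ER_i$, the GC content of the reference in that window by $GC_i$, and the mean number of effective reads over all windows on the predetermined chromosomes by $\overline{ER}$;
b) Fit the number of effective reads against the GC content over all windows of the predetermined chromosomes to obtain the relationship between the two: $ER=f(gc)$;
c) Correct the windows of all chromosomes:
$$ERA_i = ER_i \cdot \frac{\overline{ER}}{f(GC_i)}$$
where $ERA_i$ denotes the GC-corrected number of effective reads in the $i$-th window.
According to embodiments of the invention, the device further comprises a fetal sex determining device (not shown), connected to the cell-free nucleic acid proportion determining device 300, for determining the sex of the fetus in advance, preferably from the proportion of Y-chromosome sequencing reads among the total sequencing reads. The proportion of cell-free fetal or maternal-derived nucleic acids in the peripheral blood sample of the pregnant woman under test is therefore determined more accurately and reliably.
It should be noted that the expression "normal male fetus/female fetus/fetus" as used herein means that the chromosomes of the fetus are normal; for example, "normal male fetus" means a male fetus whose chromosomes are normal. Moreover, a "normal male fetus/female fetus/fetus" may be a singleton or twins; for example, a "normal male fetus" may be a normal singleton male fetus or normal male twins, while "normal fetus" limits neither the sex of the fetus nor whether the pregnancy is a singleton or twins.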
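The embodiments of the invention are described in detail below with reference to examples, but those skilled in the art will understand that the following examples serve only to illustrate the invention and should not be regarded as limiting its scope. Where specific conditions are not stated in the examples, conventional conditions or the conditions recommended by the manufacturer were used. Reagents or instruments whose manufacturer is not stated are conventional, commercially available products.
Example 1
Fetal concentration was estimated for 1400 sequenced samples using the ridge regression statistical model. The specific steps are as follows:
1) The reference (hg19) is divided into consecutive adjacent windows of fixed length (60K in this example), windows lying in N regions are filtered out, and the GC content of each window is computed to obtain the reference window file hg19.gc;
2) Alignment. The reads (28 bp) obtained by SE sequencing on the CG platform are aligned (BWA V0.7.7-r441) to the reference (hg19);
3) Filtering and preliminary statistics. Based on the alignment results, uniquely and perfectly aligned reads are selected; duplicate reads and reads with base mismatches are removed to obtain the effective reads; the number of effective reads and the GC content of each window are then counted according to the windows in the hg19.gc file;
4) GC correction. The specific steps are as follows:
a) For a given sample, denote the number of effective reads in the $i$-th window by $ER_i$, the GC content of the reference in that window by $GC_i$ (recorded in the hg19.gc file), and the mean number of effective reads over all windows on the autosomes (chromosomes 1 to 22) by $\overline{ER}$;
b) Fit the number of effective reads against the GC content over all autosomal windows (a cubic spline fit is used in this example) to obtain the relationship between the two: $ER=f(gc)$;
c) Correct the windows of all chromosomes:
$$ERA_i = ER_i \cdot \frac{\overline{ER}}{f(GC_i)}$$
where $ERA_i$ denotes the GC-corrected number of effective reads in the $i$-th window.
5) Confirming fetal sex.
Fetal sex is confirmed from the ratio of the number of effective reads on the Y chromosome to the total number of effective reads (ER%); the threshold range [0.001, 0.003] is specified; the fetus is male when the Y-chromosome ER% is greater than or equal to 0.003, and female when it is less than 0.001;
6) Estimating the fetal concentration with the ridge regression model. The specific steps are as follows:
a) Select male-fetus samples as the training set, 1000 samples in total; select a batch of samples as the test set, 400 samples in total, of which 200 are male-fetus and 200 are female-fetus samples.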
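b) Use the ridge regression model to estimate a weight for each window of the autosomes (excluding chromosomes 13, 18 and 21). The estimation finds the bias $\beta_0$ and the weights $\beta_j$ that minimize
$$RSS + \lambda \sum_j \beta_j^2,$$
where
$$RSS = \sum_i \left(y_i - \beta_0 - \sum_j \beta_j x_{ij}\right)^2,$$
$y_i$ is the fetal concentration estimated from the Y chromosome, $x_{ij}$ is the number of reads in the window, and $\lambda$ is the coefficient of the second-order regularization term; the $\lambda$ that minimizes the objective function on the validation set is found by cross-validation. The estimated weight distribution is shown in Figure 5.
c) Use the known weights to estimate the fetal concentration of the test-set samples. The estimate is
$$\hat{y} = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_j$$
where $\hat{y}$ is the model-predicted fetal concentration, $x_j$ is the number of reads in window $j$, $\hat{\beta}_j$ is the weight of each window, and $\hat{\beta}_0$ is the bias; $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained when training the model.
The correlation between the estimated fetal concentration and the fetal concentration estimated from the Y chromosome is shown in Figure 6. The correlation is very strong (r=0.92; p value < 1e-10), indicating that estimating the cell-free fetal nucleic acid concentration with the method of the invention gives accurate and reliable results.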
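The agreement between model-based and Y-chromosome-based estimates is summarized by a correlation coefficient r. The text does not say which coefficient was used; the sketch below assumes the Pearson correlation, and the function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Here `xs` would hold the model estimates for the male-fetus test samples and `ys` the matching Y-chromosome-based estimates; female-fetus samples have no Y-based value and cannot enter this comparison.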
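Example 2
Fetal concentration was estimated for 1400 sequenced samples using a neural network model from the TensorFlow library. The specific steps are as follows:
The first 5 steps are exactly the same as in Example 1.
6) Estimating the fetal concentration with a neural network model from the TensorFlow library. The specific steps are as follows:
a) Select male-fetus samples as the training set, 1000 samples in total; select a batch of samples as the test set, 400 samples in total, of which 200 are male-fetus and 200 are female-fetus samples. Standardize all data, i.e. apply a linear transformation to each variable so that its mean over all samples is 0 and its standard deviation is 1.
b) Construct a neural network with the number of effective reads in each window of the relatively stable autosomes as the input layer and a single neuron as the output layer (corresponding to the fetal concentration), with no hidden layer. ReLU is selected as the neuron type and Adam as the optimization algorithm.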
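c) Use the neural network to predict the fetal concentration on the training set (learning), adjusting the learning rate according to how the learning performance changes from round to round, so that the learning rate is as large as possible while the training-set performance does not oscillate. Learning proceeds as follows: from the number of reads in each window, $a_j^{1}$, the value of each neuron $a_j^{l}$ is computed layer by layer according to
$$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
where $l$ is the index of the network layer; the first layer is the input layer, the last layer is the output layer (a single neuron), and the layers in between are hidden layers. $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$, $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, and $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$. The most common form of the function $f$ is the rectified linear unit, i.e. $f(x)=\max(0,x)$.
For all samples $\{s\}$, the value of the output-layer neuron, $\hat{y}_s$, is compared with the fetal concentration $y_s$ estimated from the Y chromosome, and the weights $w_{jk}^{l}$ and biases $b_j^{l}$ of each layer are adjusted so as to minimize
$$\sum_s \left(\hat{y}_s - y_s\right)^2.$$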
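d) Train for as many rounds as computing power allows, until the learning performance saturates.
e) Switch to other optimization algorithms (SGD, Ftrl, etc.), repeat steps b to d, and select the best optimization algorithm according to the learning performance.
f) Try adding a second-order regularization term with coefficient $\lambda$ to the neural network and adjusting its magnitude, observing the learning performance before and after adding it and as its magnitude is adjusted. With the second-order regularization term, learning no longer seeks to minimize
$$\sum_s \left(\hat{y}_s - y_s\right)^2,$$
where $\hat{y}_s$ is the output-layer value for sample $s$ and $y_s$ is the fetal concentration estimated from the Y chromosome, but instead seeks to minimize
$$\sum_s \left(\hat{y}_s - y_s\right)^2 + \lambda \sum_{l,j,k} \left(w_{jk}^{l}\right)^2.$$
g) Add one hidden layer, adjust the number of neurons in the hidden layer, repeat steps b to f, and select the best hidden-layer architecture according to the learning performance.
h) Train the optimized neural network on the training set. The optimal parameters are given in Table 1 below. The distribution of the weights of each window (the mean weight from each input-layer neuron to the hidden layer) is shown in Figure 7.
Table 1. Optimal parameters of the neural network model from the TensorFlow library
[Table 1 is provided as an image in the original: Figure PCTCN2018072045-appb-000054]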
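i) Use the trained neural network to estimate the fetal concentration of the test-set samples. From the number of reads in each window, $a_j^{1}$, the value of each neuron $a_j^{l}$ is computed layer by layer according to
$$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
and the neuron value of the last layer is the model-predicted fetal concentration.
The correlation between the estimated fetal concentration and the fetal concentration estimated from the Y chromosome is shown in Figure 8. The correlation is very strong (r=0.982; p value < 1e-10), indicating that estimating the cell-free fetal nucleic acid concentration with the method of the invention gives accurate and reliable results.
Finally, the inventors compared the fetal concentrations estimated by the two models for female fetuses; the result is shown in Figure 9. The fetal concentration values obtained by the two models are clearly very strongly correlated (r=0.935; p value < 1e-10).
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the invention, the scope of which is defined by the claims and their equivalents.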

Claims (56)

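  1. A method for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample, characterized by comprising:
    (1) performing nucleic acid sequencing on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads;
    (2) aligning the sequencing result to a reference sequence, so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows; and
    (3) determining, based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample.
  2. The method according to claim 1, characterized in that the biological sample is peripheral blood.
  3. The method according to claim 2, characterized in that the cell-free nucleic acids from the predetermined source are at least one selected from:
    cell-free fetal nucleic acids in the peripheral blood of a pregnant woman;
    maternal-derived cell-free nucleic acids in the peripheral blood of a pregnant woman; and
    cell-free tumor nucleic acids or non-tumor-derived cell-free nucleic acids in the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject.
  4. The method according to claim 1, characterized in that the sequencing is paired-end sequencing, single-end sequencing or single-molecule sequencing.
  5. The method according to claim 1, characterized in that the nucleic acid is DNA.
  6. The method according to any one of claims 1 to 5, characterized in that the reference sequence is a reference genome sequence.
  7. The method according to claim 6, characterized in that the reference sequence is hg19.
  8. The method according to claim 6, characterized in that the predetermined windows are obtained by dividing predetermined chromosomes of the reference genome sequence into consecutive segments.
  9. The method according to claim 8, characterized in that the predetermined chromosomes include autosomes.
  10. The method according to claim 9, characterized in that the autosomes exclude at least one of chromosomes 13, 18 and 21.
  11. The method according to claim 8, characterized in that the length of the predetermined window is 60K.
  12. The method according to claim 8, characterized in that step (2) further comprises:
    (2-1) aligning the sequencing result to the reference genome so as to construct a uniquely aligned read set, each read of which matches only one position of the reference genome;
    (2-2) determining the reference genome position corresponding to each read in the uniquely aligned read set; and
    (2-3) determining the number of sequencing reads falling within the predetermined windows.
  13. The method according to claim 8, characterized in that, in step (3), the proportion of cell-free nucleic acids from the predetermined source in the biological sample is determined using the weight of each predetermined window.
  14. The method according to claim 13, characterized in that, in step (3), the weights of the predetermined windows are determined in advance using training samples.
  15. The method according to claim 13 or 14, characterized in that the weights are determined using at least one of a ridge regression statistical model and a neural network model.
  16. The method according to claim 15, characterized in that the neural network model uses the TensorFlow learning system.
  17. The method according to claim 16, characterized in that the parameters of the TensorFlow learning system include:
    the number of sequencing reads in each autosomal window as the input layer;
    the fetal concentration as the output layer;
    ReLU as the neuron type;
    at least one selected from Adam, SGD and Ftrl as the optimization algorithm.
  18. The method according to claim 17, characterized in that the optimization algorithm is Ftrl.
  19. The method according to claim 17, characterized in that the parameters of the TensorFlow learning system further include:
    a learning rate of 0.002;
    one hidden layer;
    200 neurons in the hidden layer.
  20. The method according to claim 14, characterized in that the training samples are peripheral blood samples of pregnant women with a known cell-free fetal nucleic acid proportion.
  21. The method according to claim 20, characterized in that the training samples are peripheral blood samples of women pregnant with a normal male fetus and with a known cell-free fetal nucleic acid proportion.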
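  22. The method according to claim 20, characterized in that the weights are determined using a ridge regression statistical model, whose calculation formula is as follows:
    $$\hat{y} = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_j$$
    where $\hat{y}$ is the model-predicted fetal concentration, $x_j$ is the number of reads in window $j$, $\hat{\beta}_j$ is the weight of each window, and $\hat{\beta}_0$ is the bias; $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained when training the model.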
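  23. The method according to claim 20, characterized in that the weights are determined using a neural network model, whose calculation formula is as follows:
    $$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
    where $l$ is the index of the network layer, $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$, $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, and $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$; $w$ and $b$ are obtained when training the model.
  24. The method according to claim 23, characterized in that determining the weights using the neural network model comprises:
    computing the neuron values layer by layer according to the calculation formula of the neural network model, the neuron value of the last layer being the model-predicted fetal concentration.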
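  25. The method according to claim 1, characterized in that, before step (3) is performed, GC correction is applied to the number of sequencing reads falling within the predetermined windows, so as to obtain the number of effective reads falling within the predetermined windows.
  26. The method according to claim 25, characterized in that the GC correction is performed as follows:
    a) for a given sample, denote the number of effective reads in the $i$-th window by $ER_i$, the GC content of the reference in that window by $GC_i$, and the mean number of effective reads over all windows on the predetermined chromosomes by $\overline{ER}$;
    b) fit the number of effective reads against the GC content over all windows of the predetermined chromosomes to obtain the relationship between the two: $ER=f(gc)$;
    c) correct the windows of all chromosomes:
    $$ERA_i = ER_i \cdot \frac{\overline{ER}}{f(GC_i)}$$
    where $ERA_i$ denotes the GC-corrected number of effective reads in the $i$-th window.
  27. The method according to claim 1, characterized in that, before step (3) is performed, the sex of the fetus is determined in advance.
  28. The method according to claim 27, characterized in that the sex of the fetus is determined from the proportion of Y-chromosome sequencing reads among the total sequencing reads.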
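  29. A device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample, characterized by comprising:
    a sequencing device for performing nucleic acid sequencing on a biological sample containing cell-free nucleic acids, so as to obtain a sequencing result consisting of a plurality of sequencing reads;
    a counting device, connected to the sequencing device, for aligning the sequencing result to a reference sequence so as to determine the number of sequencing reads in the sequencing result that fall within predetermined windows; and
    a cell-free nucleic acid proportion determining device, connected to the counting device, for determining, based on the number of sequencing reads falling within the predetermined windows, the proportion of cell-free nucleic acids from the predetermined source in the biological sample.
  30. The device according to claim 29, characterized in that the biological sample is peripheral blood.
  31. The device according to claim 30, characterized in that the cell-free nucleic acids from the predetermined source are at least one selected from:
    cell-free fetal nucleic acids in the peripheral blood of a pregnant woman;
    maternal-derived cell-free nucleic acids in the peripheral blood of a pregnant woman; and
    cell-free tumor nucleic acids or non-tumor-derived cell-free nucleic acids in the peripheral blood of a tumor patient, suspected tumor patient or tumor screening subject.
  32. The device according to claim 29, characterized in that the sequencing is paired-end sequencing, single-end sequencing or single-molecule sequencing.
  33. The device according to claim 29, characterized in that the nucleic acid is DNA.
  34. The device according to any one of claims 29 to 33, characterized in that the reference sequence is a reference genome sequence.
  35. The device according to claim 34, characterized in that the reference sequence is hg19.
  36. The device according to claim 34, characterized in that the predetermined windows are obtained by dividing predetermined chromosomes of the reference genome sequence into consecutive segments.
  37. The device according to claim 36, characterized in that the predetermined chromosomes include autosomes.
  38. The device according to claim 37, characterized in that the autosomes exclude at least one of chromosomes 13, 18 and 21.
  39. The device according to claim 36, characterized in that the length of the predetermined window is 60K.
  40. The device according to claim 36, characterized in that the counting device further comprises:
    an alignment unit for aligning the sequencing result to the reference genome so as to construct a uniquely aligned read set, each read of which matches only one position of the reference genome;
    a position determining unit, connected to the alignment unit, for determining the reference genome position corresponding to each read in the uniquely aligned read set; and
    a number determining unit, connected to the position determining unit, for determining the number of sequencing reads falling within the predetermined windows.
  41. The device according to claim 36, characterized in that the cell-free nucleic acid proportion determining device is adapted to determine the proportion of cell-free nucleic acids from the predetermined source in the biological sample using the weight of each predetermined window.
  42. The device according to claim 41, characterized in that the weights of the predetermined windows are determined in advance using training samples.
  43. The device according to claim 41 or 42, characterized in that the weights are determined using at least one of a ridge regression statistical model and a neural network model.
  44. The device according to claim 43, characterized in that the neural network model uses the TensorFlow learning system.
  45. The device according to claim 44, characterized in that the parameters of the TensorFlow learning system include:
    the number of sequencing reads in each autosomal window as the input layer;
    the fetal concentration as the output layer;
    ReLU as the neuron type;
    at least one selected from Adam, SGD and Ftrl as the optimization algorithm.
  46. The device according to claim 45, characterized in that the optimization algorithm is Ftrl.
  47. The device according to claim 46, characterized in that the parameters of the TensorFlow learning system further include:
    a learning rate of 0.002;
    one hidden layer;
    200 neurons in the hidden layer.
  48. The device according to claim 42, characterized in that the training samples are peripheral blood samples of pregnant women with a known cell-free fetal nucleic acid proportion.
  49. The device according to claim 48, characterized in that the training samples are peripheral blood samples of women pregnant with a normal male fetus and with a known cell-free fetal nucleic acid proportion.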
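  50. The device according to claim 41, characterized in that the weights are determined using a ridge regression statistical model, whose calculation formula is as follows:
    $$\hat{y} = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_j$$
    where $\hat{y}$ is the model-predicted fetal concentration, $x_j$ is the number of reads in window $j$, $\hat{\beta}_j$ is the weight of each window, and $\hat{\beta}_0$ is the bias; $\hat{\beta}_0$ and $\hat{\beta}_j$ are obtained when training the model.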
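  51. The device according to claim 41, characterized in that the weights are determined using a neural network model, whose calculation formula is as follows:
    $$a_j^{l} = f\left(\sum_k w_{jk}^{l}\, a_k^{l-1} + b_j^{l}\right)$$
    where $l$ is the index of the network layer, $a_j^{l}$ is the value of the $j$-th neuron in layer $l$, $a_k^{l-1}$ is the value of the $k$-th neuron in layer $l-1$, $w_{jk}^{l}$ is the weight of the connection from the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$, and $b_j^{l}$ is the input bias of the $j$-th neuron of layer $l$; $w$ and $b$ are obtained when training the model.
  52. The device according to claim 51, characterized in that determining the weights using the neural network model comprises:
    computing the neuron values layer by layer according to the calculation formula of the neural network model, the neuron value of the last layer being the model-predicted fetal concentration.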
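  53. The device according to claim 29, characterized by further comprising a GC correction device, connected to the counting device and to the cell-free nucleic acid proportion determining device, respectively, for applying GC correction to the number of sequencing reads falling within the predetermined windows before the proportion of cell-free nucleic acids from the predetermined source in the biological sample is determined, so as to obtain the number of effective reads falling within the predetermined windows.
  54. The device according to claim 53, characterized in that the GC correction device is adapted to perform the GC correction as follows:
    a) for a given sample, denote the number of effective reads in the $i$-th window by $ER_i$, the GC content of the reference in that window by $GC_i$, and the mean number of effective reads over all windows on the predetermined chromosomes by $\overline{ER}$;
    b) fit the number of effective reads against the GC content over all windows of the predetermined chromosomes to obtain the relationship between the two: $ER=f(gc)$;
    c) correct the windows of all chromosomes:
    $$ERA_i = ER_i \cdot \frac{\overline{ER}}{f(GC_i)}$$
    where $ERA_i$ denotes the GC-corrected number of effective reads in the $i$-th window.
  55. The device according to claim 29, characterized by further comprising a fetal sex determining device, connected to the cell-free nucleic acid proportion determining device, for determining the sex of the fetus in advance.
  56. The device according to claim 55, characterized in that the sex of the fetus is determined from the proportion of Y-chromosome sequencing reads among the total sequencing reads.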
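PCT/CN2018/072045 2017-01-24 2018-01-10 Method and device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample Ceased WO2018137496A1 (zh)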

Priority Applications (6)

Application Number Priority Date Filing Date Title
PL18743980.7T PL3575407T3 (pl) 2017-01-24 2018-01-10 Sposób oznaczania udziału frakcji bezkomórkowych kwasów nukleinowych z wcześniej określonego źródła w próbce biologicznej
EP18743980.7A EP3575407B1 (en) 2017-01-24 2018-01-10 Method for determining proportion of cell-free nucleic acids from predetermined source in biological sample
ES18743980T ES2981092T3 (es) 2017-01-24 2018-01-10 Procedimiento para determinar la proporción de ácidos nucleicos acelulares procedentes de una fuente predeterminada en una muestra biológica
HRP20240709TT HRP20240709T1 (hr) 2017-01-24 2018-01-10 Postupak za određivanje udjela nukleinskih kiselina bez stanica iz unaprijed određenog izvora u biološkom uzorku
RS20240650A RS65618B1 (sr) 2017-01-24 2018-01-10 Postupak za utvrđivanje frakcije nukleinskih kiselina bez ćelija iz prethodno određenog izvora u biološkom uzorku
CN201880006995.9A CN110191964B (zh) 2017-01-24 2018-01-10 确定生物样本中预定来源的游离核酸比例的方法及装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710055200 2017-01-24
CN201710055200.0 2017-01-24

Publications (1)

Publication Number Publication Date
WO2018137496A1 true WO2018137496A1 (zh) 2018-08-02

Family

ID=62978786

Family Applications (1)

Application Number Title Priority Date Filing Date
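PCT/CN2018/072045 Ceased WO2018137496A1 (zh) 2017-01-24 2018-01-10 Method and device for determining the proportion of cell-free nucleic acids from a predetermined source in a biological sample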

Country Status (8)

Country Link
EP (1) EP3575407B1 (zh)
CN (1) CN110191964B (zh)
ES (1) ES2981092T3 (zh)
HR (1) HRP20240709T1 (zh)
HU (1) HUE067465T2 (zh)
PL (1) PL3575407T3 (zh)
RS (1) RS65618B1 (zh)
WO (1) WO2018137496A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272046A (zh) * 2018-09-26 2019-01-25 北京科技大学 基于L2重新正则化Adam切换模拟回火SGD的深度学习方法
CN112669282A (zh) * 2020-12-29 2021-04-16 燕山大学 一种基于深度神经网络的脊柱定位方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882944B (zh) * 2022-06-22 2024-09-20 深圳微伴医学检验实验室 基于Metagenome测序的肠道微生物样品宿主性别鉴定方法、装置及应用
CN117106870B (zh) * 2022-12-30 2024-09-13 深圳市真迈生物科技有限公司 胎儿浓度的确定方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014043763A1 (en) * 2012-09-20 2014-03-27 The Chinese University Of Hong Kong Non-invasive determination of methylome of fetus or tumor from plasma
CN105296606A (zh) * 2014-07-25 2016-02-03 深圳华大基因股份有限公司 确定生物样本中游离核酸比例的方法、装置及其用途
CN105331606A (zh) * 2014-08-12 2016-02-17 焦少灼 应用于高通量测序的核酸分子定量方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102329876B (zh) * 2011-10-14 2014-04-02 深圳华大基因科技有限公司 一种测定待检测样本中疾病相关核酸分子的核苷酸序列的方法
EP3026124A1 (en) * 2012-10-31 2016-06-01 Genesupport SA Non-invasive method for detecting a fetal chromosomal aneuploidy
CN104169929B (zh) * 2013-09-10 2016-12-28 深圳华大基因股份有限公司 用于确定胎儿是否存在性染色体数目异常的系统和装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014043763A1 (en) * 2012-09-20 2014-03-27 The Chinese University Of Hong Kong Non-invasive determination of methylome of fetus or tumor from plasma
CN105296606A (zh) * 2014-07-25 2016-02-03 深圳华大基因股份有限公司 确定生物样本中游离核酸比例的方法、装置及其用途
CN105331606A (zh) * 2014-08-12 2016-02-17 焦少灼 应用于高通量测序的核酸分子定量方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3575407A4 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272046A (zh) * 2018-09-26 2019-01-25 北京科技大学 基于L2重新正则化Adam切换模拟回火SGD的深度学习方法
CN112669282A (zh) * 2020-12-29 2021-04-16 燕山大学 一种基于深度神经网络的脊柱定位方法
CN112669282B (zh) * 2020-12-29 2023-02-14 燕山大学 一种基于深度神经网络的脊柱定位方法

Also Published As

Publication number Publication date
HRP20240709T1 (hr) 2024-08-16
CN110191964B (zh) 2023-12-05
RS65618B1 (sr) 2024-07-31
ES2981092T3 (es) 2024-10-07
EP3575407A1 (en) 2019-12-04
CN110191964A (zh) 2019-08-30
EP3575407B1 (en) 2024-03-27
EP3575407A4 (en) 2020-03-04
PL3575407T3 (pl) 2024-07-29
HUE067465T2 (hu) 2024-10-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18743980

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018743980

Country of ref document: EP

Effective date: 20190826

WWE Wipo information: entry into national phase

Ref document number: P-2024/0650

Country of ref document: RS