CN114550818B

CN114550818B - A method for evaluating mixed signals in gene sequencing

Info

Publication number: CN114550818B
Application number: CN202210103664.5A
Authority: CN
Inventors: 黄家蔚; 冯濒啸; 周文雄
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2025-04-15
Anticipated expiration: 2042-01-28
Also published as: CN114550818A

Abstract

The present invention discloses a method for evaluating mixed signals in gene sequencing. Through data acquisition, dictionary construction, dictionary sparsification, sparse representation, and result filtering, the method determines whether the sequencing signal at a given site is the signal of a single fragment (monoclonal) or a mixture of signals from multiple fragments (polyclonal).

Description

Evaluation method of gene sequencing mixed signal

Technical Field

The invention relates to a method for evaluating a mixed signal of gene sequencing, belonging to the field of data processing of molecular sequencing.

Background

Gene sequencing is the process of determining a gene sequence. It has a very wide range of applications, including disease prediction and diagnosis and virus identification. In high-throughput DNA sequencing, the signal within each region delimited in the image data is, in principle, produced by copies of a single DNA molecule generated in an amplification reaction (i.e., a monoclonal signal); this is one of the foundations for decoding a DNA sequence from the optical signal generated by the reaction. In practice, because of process limitations, for example in loading the microspheres to which the reaction attaches and in injecting the DNA templates, several DNA templates inevitably enter the same observation region. The time-series signal observed in such a region is a mixed signal (also called a polyclonal signal) produced by the superposition of signals from several templates, and a DNA sequence derived from it by subsequent correction does not reflect the true composition of the DNA molecules being measured. A method is therefore needed to identify such signals and, further, to separate the original signals.

Ion Torrent identified two indicators that reflect the characteristics of the different signal types (mixed or not mixed): PPF (percent positive flows) and SSQ (sum of squares). Signals are classified by thresholding the values of these two indicators. PPF is the percentage of reagent flows that produce a positive signal; the higher the PPF, the more likely the signal is mixed. SSQ is the sum of squares of the differences between each flow's signal value and its nearest integer; the greater the SSQ, the more likely the signal is mixed. Ion Torrent allows the user to customize the range of flows used to estimate the PPF and SSQ thresholds, so that polyclonal screening in the analysis program can be tuned. However, Ion Torrent's method is not suitable for degenerate sequencing, mainly because the PPF indicator does not distinguish well between signal types in degenerate sequencing. A method that can effectively distinguish the signal types arising in degenerate sequencing is therefore needed.

Disclosure of Invention

The feature-combination classification used by Ion Torrent is an indirect means of predicting whether a signal is a mixed signal generated by several templates. In both non-degenerate and degenerate sequencing, a polyclonal signal is a linear combination of two or more monoclonal signals whose coefficients are determined by the proportion of each DNA clone (i.e., each source) in the population. The general characteristic of a polyclonal signal in high-throughput sequencing is therefore that it can be represented by a linear combination of several source signals but not by a single source signal, and the key to identifying a polyclonal signal is to determine whether a single-source representation of the signal exists.
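The linear mixing model described here can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `mix_signals`, the two clone sequences and the 70/30 proportions are hypothetical and are not taken from the patent.

```python
def mix_signals(sources, fractions):
    """Return the weighted sum of monoclonal signals.

    sources:   list of equal-length signal vectors, one per clone (source)
    fractions: proportion of each clone in the population (must sum to 1)
    """
    if len(sources) != len(fractions):
        raise ValueError("one fraction is needed per source")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    length = len(sources[0])
    return [sum(f * s[i] for f, s in zip(fractions, sources))
            for i in range(length)]


# Two hypothetical monoclonal polymer sequences observed at the same site.
clone_a = [2, 1, 1, 3, 3, 3, 1]
clone_b = [1, 2, 2, 1, 1, 2, 3]

# A site holding 70% clone A and 30% clone B yields non-integer values.
polyclonal = mix_signals([clone_a, clone_b], [0.7, 0.3])

# A site holding a single clone reproduces that clone's signal.
monoclonal = mix_signals([clone_a], [1.0])
```

The mixed vector can only be explained by combining both source vectors, whereas the monoclonal vector matches one source on its own, which is the property the method tests for.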

The invention discloses a method for evaluating mixed signals of gene sequencing, which particularly adopts a sparse coding technology to realize the idea, and can judge whether a sequencing signal of a certain site is a single fragment signal (monoclonal) or a mixture of a plurality of fragment signals (polyclonal).

Specifically, the invention discloses a method for evaluating a mixed signal of gene sequencing, which is characterized by comprising the following steps:

A. Data acquisition: sequencing is performed to obtain the sequencing signal intensity results corresponding to the bases of the sequence to be tested, and the intensity results are expressed as a degenerate polymer sequence or a homopolymer sequence, which serves as the sequence to be evaluated, wherein the degenerate polymer sequence is the sequence formed by the lengths of the successive degenerate polymers, and the homopolymer sequence is the sequence formed by the number of monomers in each successive homopolymer;

B. Dictionary construction: a reference sequence is determined, the reference sequence corresponding to the sequence to be tested is expressed as the ideal degenerate polymer sequence or ideal homopolymer sequence corresponding to the sequencing method, and sub-signals of length k are extracted position by position from the ideal degenerate polymer sequence or ideal homopolymer sequence to construct the dictionary;

C. Dictionary sparsification: subsequences are extracted from the dictionary, and those closest in distance to the sequence to be evaluated are selected as the final dictionary;

D. Sparse representation: an optimization algorithm is used to find a vector such that the sparsity of the vector and the distance between its product with the final dictionary matrix and the first k positions of the sequence to be evaluated are minimized simultaneously; the vector found is called the sparse vector;

E. Result filtering: the sparse vector is analyzed against a set sparsity threshold and a set mixedness threshold; when the sparsity is not higher than the sparsity threshold and the mixedness is higher than the mixedness threshold, the sequence to be evaluated corresponding to the sparse vector is judged to be a monoclonal signal; when the sparsity is not higher than the sparsity threshold and the mixedness is not higher than the mixedness threshold, the sequence to be evaluated is judged to be a polyclonal signal or mixed signal; and when the sparsity is higher than the sparsity threshold, the sparse representation has failed and the signal cannot be judged to be monoclonal or polyclonal.
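The three-way decision in step E can be written as a small function. This is a sketch only: the function name `classify_signal` and the return labels are hypothetical, and the default thresholds (3 and 0.9) are simply values picked from the preferred ranges stated in this document.

```python
def classify_signal(sparsity, mixedness,
                    sparsity_threshold=3, mixedness_threshold=0.9):
    """Apply the step E rule to one sparse vector's summary values.

    sparsity:  number of non-zero elements in the sparse vector
    mixedness: largest element of the normalized sparse vector
    """
    if sparsity > sparsity_threshold:
        # Sparse representation failed; no judgment is possible.
        return "undetermined"
    if mixedness > mixedness_threshold:
        return "monoclonal"
    return "polyclonal"


print(classify_signal(sparsity=1, mixedness=1.0))
print(classify_signal(sparsity=2, mixedness=0.6))
print(classify_signal(sparsity=5, mixedness=0.95))
```

Checking sparsity first mirrors the order given in the text: a vector that is not sparse enough is set aside before its mixedness is considered.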

According to a preferred embodiment, the sequencing comprises multiple base sequencing.

According to a preferred embodiment, the first k positions of the degenerate polymer sequence or homopolymer sequence can be selected as the sequence to be evaluated in step A.

According to a preferred embodiment, k has a value of preferably 8 to 20, more preferably 10 to 15.

According to a preferred embodiment, the length k sub-signals constitute each column of the dictionary.

According to a preferred embodiment, determining the final dictionary in step C comprises: for each sequence to be evaluated, extracting n consecutive positions (n < k) as the subsequence to be evaluated; comparing this subsequence, by traversal, with the first n positions of each subsequence in the dictionary; taking the dictionary items whose first n positions are the same as those of the actual signal as the candidate set; calculating, one by one within the candidate set, the distance between the first k positions of the sequence to be evaluated and each dictionary item; and taking the first m items in order of increasing distance as the final dictionary.

According to a preferred embodiment, the value of n may be 3 or 4 or 5, and the value of m is in the range of 20-300, preferably 50-150.

According to a preferred embodiment, the distance includes, but is not limited to, the Pearson correlation coefficient, the Spearman correlation coefficient, average mutual information, Euclidean distance, Hamming distance, Chebyshev distance, Mahalanobis distance, Manhattan distance, Minkowski distance, and the maximum or minimum of the absolute values of the differences between corresponding signals; and the optimization algorithm includes, but is not limited to, matching pursuit, orthogonal matching pursuit, weak matching pursuit, the thresholding method, basis pursuit, the IRLS algorithm, the lasso algorithm, and the weighted support vector machine algorithm.

According to a preferred embodiment, the sparsity threshold is in the range of 2-10, preferably 2-5, more preferably 2-3, and the mixedness threshold is in the range of 0.6-1, preferably 0.8-1, more preferably 0.9-1.

The invention also discloses a method for evaluating the mixed signal of gene sequencing, which is characterized by comprising the following steps:

1) Sequencing to obtain a result corresponding to the base of the sequence to be detected;

2) Compiling a reference sequence corresponding to the sequence to be tested into a theoretical result corresponding to a sequencing method;

3) Dividing the compiled theoretical result of the reference sequence into possible result clusters of 6-50 bases;

4) Comparing the sequencing result corresponding to the base of the sequence to be detected with the possible result cluster in the step 3).

According to a preferred embodiment, the polyclonal signal or the monoclonal signal is determined using the method of any of the preceding claims.

According to a preferred embodiment, the identified polyclonal or mixed signal is discarded, signals which cannot be determined to be monoclonal or polyclonal are discarded, and the identified monoclonal signals are isolated for subsequent data processing and analysis.

The beneficial effects of the invention are that

Compared with the prior art, the method has the following advantages:

1. The method of the invention can solve the problem of identifying polyclonal signals in degenerate sequencing, and the method of the prior art loses effectiveness in degenerate sequencing.

2. Compared with the method for classifying signals according to the characteristics in the prior art, the method has higher accuracy.

3. The method has wider application range, and can be used for evaluating non-degenerate sequencing signals besides being suitable for evaluating the degenerate sequencing signals.

4. The present method reflects the essential composition of the signal, independent of the step of finding, mining and reconstructing other features of the signal, which typically requires much specific domain knowledge and experience and consumes considerable effort.

Drawings

FIG. 1 is a schematic flow diagram of a final dictionary obtained by the method of the present invention.

FIG. 2 is a schematic flow chart for identifying sequencing signals using sparse representation.

Detailed Description

The following discussion is intended to enable a person skilled in the art to make and use the disclosed methods and is provided for the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosed technology. Thus, the disclosed methods are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.

In gene sequencing, each well (also called a site) typically contains one DNA fragment which, after amplification, forms a cluster of copies of that fragment. During second-generation sequencing, the same reaction then occurs throughout the site and every copy is extended by the same number of bases. In practice, a site may contain more than one DNA fragment at the time of sequencing; such a site is referred to as polyclonal. It is difficult to determine from the magnitude of the sequencing signal alone whether a signal is polyclonal or monoclonal. The presence of a polyclonal signal generally interferes with the acquisition of the sequencing signal and makes it difficult to interpret, so signals determined to be polyclonal are generally discarded. The invention provides a method for identifying polyclonal signals: the phase-corrected signals obtained during DNA sequencing are represented in a specific space so that mixed signals arising from the experimental process can be distinguished, with the aim of improving sequencing accuracy. The invention is applicable to sequencing methods that determine a DNA sequence by detecting multiples of a signal; the signal types include, but are not limited to, fluorescent, electrical and chemical signals. Provided these conditions are satisfied, the invention is not limited to degenerate sequencing and can also be used for non-degenerate sequencing.

Specifically, the first aspect of the invention discloses a method for evaluating a mixed signal of gene sequencing, which is characterized by comprising the following steps:

DNA sequencing methods can be classified in several ways: into multi-molecule and single-molecule sequencing according to the number of sample molecules involved in each sequencing reaction; into sequencing by synthesis, sequencing by ligation, sequencing by excision, etc. according to the type of sequencing reaction; and into single-base and multi-base sequencing according to the number of nucleotides that each reaction can detect, where multi-base sequencing refers to sequencing in which each reaction can detect one or more nucleotides. The common Illumina sequencing method is single-base sequencing: it uses 3'-end blocking and extends one base at a time, so its signal is 0 or 1. Note that the signals 0 and 1 here are relative values. In actual sequencing, a reaction is performed first, and the signal of the reaction is then acquired with an imaging device such as a CCD. However, because each data point (each cluster of sequences being sequenced) in high-throughput sequencing is very small, typically between 0.1 and 3 microns, a chemical reaction on such a small scale is difficult to measure accurately. Thus, for example, if in an image obtained with an adjusted CCD the brightness of a bright spot (a reaction spot) is 1000 and the brightness of an unreacted spot is 500 (an illustration, not an actual experiment), then a data point whose brightness is close to 1000 may be defined as 1 and a data point whose brightness is closer to 500 as 0.

In the present invention, the sequencing includes multi-base sequencing. Degenerate sequencing is a novel form of sequencing by synthesis: a gene sequencing method that does not block the 3' end and in which a degenerate substrate mixture containing at least two base substrates is introduced simultaneously in a single sequencing reaction, so that each reaction may extend several bases, e.g., 1, 2, 3, 4, 5, 6, etc. The subsequent data processing for multi-base sequencing differs significantly from that for methods that extend one base at a time. Because several bases may be extended in each reaction, the signal intensity may be 1000, 2000, 3000, 4000, 5000, 6000, etc., continuing the illustrative Illumina example above. The more finely the sequencing signals must be distinguished, the more difficult this becomes. The sequencing process is affected by various factors, such as incomplete reaction, insufficient reagent delivery, unstable or uneven light source intensity, and effects of the reaction itself, which may cause the reaction to run ahead or lag behind; these accumulated effects become more serious in the late stage of the reaction, making it even harder to define the signal value that corresponds to 1. Furthermore, for excision sequencing, in which a variable number of bases is excised in each sequencing reaction, the method of the invention can also be used to evaluate the sequencing signal.

In actual sequencing, as the sequencing reaction proceeds, molecules in the same DNA cluster may be extended to different lengths, and this effect accumulates as the read length increases. This makes the sequencing signal harder to interpret because it disrupts the integer-multiple character of the signal. For example, when a DNA fragment is sequenced, the signals obtained at the beginning are relatively close to integers after normalization, since the reaction is still proceeding well with little mixing of extension states; by the time sequencing reaches 50 bases (an illustration, not an actual experiment), the signals may no longer be close to integers. Signals that are not close to integers can affect the accuracy with which polyclonal signals are identified.

In step A of the method, a segment of gene sequence is first sequenced with a common multi-base sequencing method, and the sequencing signal corresponding to the gene sequence is obtained by reading the data from the sequencer. This sequencing signal is typically converted from image intensity values, current values, or the like. The processing of the sequencing signal is not the focus of the present invention; the applicant's earlier patents CN201510944878.5, CN202010061629.2 and CN201610899880.X, among others, describe in detail how to obtain more accurate sequencing signals and how to process them. Briefly, after the raw sequencing data are obtained, the data are first checked for compliance; for compliant data the sequencing signal intensity is extracted, the background noise is subtracted, and the result is normalized. The result is generally an integer or floating-point vector ordered by acquisition time. For example, if the signal for a single base is X and a sequencing signal of 2X is obtained, the sequence has probably been extended by 2 bases.

Taking 2+2 degenerate sequencing as an example of multi-base sequencing, two nucleotide substrates are introduced in each round of the sequencing reaction: one pair in all odd rounds and the other pair in all even rounds, for example A/C in odd rounds and G/T in even rounds. This ensures that every round produces base extension with no empty reactions, so the reaction is faster. The sequence obtained by degenerate sequencing is not an exact base sequence but a degenerate base sequence such as MMKMKKKMMMKKKM, where M represents A/C and K represents G/T. For degenerate sequencing, the signal intensity results are expressed as a degenerate polymer sequence, i.e., the sequence formed by the lengths of the successive degenerate polymers; for example, for the sequence MMKMKKKMMMKKKM the degenerate polymer sequence is (2,1,1,3,3,3,1).

Multi-base sequencing also includes methods such as that of Ion Torrent, in which only one nucleotide substrate is introduced in each sequencing reaction and one or more nucleotides may be extended (ignoring reactions in which no extension occurs). The extended stretch is a homopolymer, that is, a polymer formed from only one kind of monomer. A homopolymer sequence is the sequence formed by the number of monomers in each successive homopolymer. For example, if 4 A's are extended, the homopolymer is AAAA and the corresponding entry of the homopolymer sequence is 4. When only 1 nucleotide is extended, the product is not a homopolymer in the strict sense, but the present invention treats it as a homopolymer of length 1. Table 1 shows an example of the correspondence between a base sequence, its degenerate polymer sequence (taking MK sequencing as an example) and its homopolymer sequence: the degenerate polymer sequence is (4,5,3,3,5) and the homopolymer sequence is (3,1,3,2,2,1,3,3,2).

TABLE 1 Multiple sequence representations

Base sequence                 A  C | T  G | C  A | T | A  C
Degenerate base sequence       M   |  K   |  M   | K |  M
Degenerate polymer sequence    4   |  5   |  3   | 3 |  5
Homopolymer sequence          3  1 | 3  2 | 2  1 | 3 | 3  2
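The two encodings can be reproduced with a run-length computation. The sketch below is illustrative: the function names are hypothetical, and the 20-base string is an assumption, chosen so that its run lengths match the homopolymer sequence (3,1,3,2,2,1,3,3,2) quoted in the text; the M/K coding (M = A/C, K = G/T) is taken from the text.

```python
from itertools import groupby

# Degenerate code for MK sequencing, as stated in the text.
MK_CODE = {"A": "M", "C": "M", "G": "K", "T": "K"}


def run_lengths(symbols):
    """Length of each run of identical consecutive symbols."""
    return [len(list(group)) for _, group in groupby(symbols)]


def homopolymer_sequence(bases):
    """Number of monomers in each successive homopolymer."""
    return run_lengths(bases)


def degenerate_polymer_sequence(bases, code=MK_CODE):
    """Length of each successive degenerate polymer under the given code."""
    return run_lengths(code[base] for base in bases)


# Assumed base sequence whose runs follow the bases A C T G C A T A C.
bases = "AAACTTTGGCCATTTAAACC"
print(homopolymer_sequence(bases))         # [3, 1, 3, 2, 2, 1, 3, 3, 2]
print(degenerate_polymer_sequence(bases))  # [4, 5, 3, 3, 5]
print(run_lengths("MMKMKKKMMMKKKM"))       # [2, 1, 1, 3, 3, 3, 1]
```

This sketch ignores flows in which no extension occurs, in line with the simplification made in the text for homopolymer sequences.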

In the present invention, the sequencing signal used is a normalized sequencing signal (normalized signal for short), that is, the ratio of the measured signal to the signal of a single-base extension.
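A minimal sketch of such a normalization is given below. It assumes a constant background and a known single-base signal level; the function name `normalize_signal`, its parameters and the sample intensities are hypothetical, and a real pipeline (for example those of the earlier patents cited above) involves further corrections.

```python
def normalize_signal(raw, background, single_base_signal):
    """Express each raw intensity as a multiple of the single-base signal."""
    if single_base_signal <= 0:
        raise ValueError("single_base_signal must be positive")
    return [(value - background) / single_base_signal for value in raw]


# Hypothetical raw intensities from three reaction rounds.
raw = [2150, 1080, 3190]
normalized = normalize_signal(raw, background=100, single_base_signal=1000)

# Rounding gives the most likely number of bases extended in each round.
estimated_extension = [round(value) for value in normalized]
print(normalized)
print(estimated_extension)
```

Values close to integers, as here, are what the method expects from the early part of a monoclonal read.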

According to a preferred embodiment, when the normalized signal of the sequence to be tested is relatively long, not all of it needs to be used to distinguish monoclonal from polyclonal signals; only part of it needs to be selected as input to the method. Empirically, the first k consecutive normalized signals, which are closer to integer values, can be chosen as input, i.e., the first k positions of the degenerate polymer sequence or homopolymer sequence are chosen as the sequence to be evaluated. The value of k depends on the size of the reference sequence and the number of sequencing cycles, and is preferably 8-20, more preferably 10-15. The choice of window has little influence on the final distinction between monoclonal and polyclonal signals: a polyclonal signal has a well-defined character, being a mixture of several sequencing signals regardless of where the window is taken, so different positions lead to the same result.

The reference sequence corresponding to the sequence to be tested is usually easy to determine. For example, when the sequence to be tested is RNA from a mouse tissue, the reference sequence is the mouse transcriptome sequence; when it is DNA from a human tissue, the reference sequence is the human genome sequence. After the reference sequence is determined, it is expressed as the ideal degenerate polymer sequence or ideal homopolymer sequence corresponding to the sequencing method: when the sequence to be tested was obtained by degenerate sequencing, the reference sequence is expressed as the degenerate polymer sequence corresponding to that degenerate sequencing method, and when it was obtained by ordinary multi-base sequencing (i.e., only one nucleotide substrate is added per round of reaction), the reference sequence is expressed as the corresponding homopolymer sequence.

Sparse dictionary learning

Dictionary learning (Dictionary Learning) and sparse representation (Sparse Representation) are formally known together in academia as sparse dictionary learning (Sparse Dictionary Learning). The theory comprises two phases: a dictionary construction phase (dictionary generation) and a phase in which samples are sparsely represented using the dictionary (sparse coding with a precomputed dictionary). Sparse representation is essentially a signal representation method that selects as few basic signals as possible and expresses most or all of the original signal as a linear combination of these basic signals. Sparse representation gives a more concise representation of the signal, so that the information it contains can be obtained more easily and further processing is facilitated. The nature of a mixed signal, in turn, is identified by determining whether the signal can be approximated by a single column vector of the dictionary. The method of the invention uses sparse coding to implement this idea and thereby identify polyclonal signals.

Dictionary construction

The linear space used to represent the sequence to be evaluated is called a "dictionary". Space here is a linear-algebra concept; it is represented by an integer matrix and can be understood as the set of all potential signal classes. Constructing a suitable dictionary for the sequence to be evaluated is the key to the method. FIG. 1 is a schematic flow chart of constructing the final dictionary, which is a sparsified dictionary; the construction process is described in detail below.

In the method, the reference sequence corresponding to the sequence to be tested is expressed as the ideal degenerate polymer sequence or ideal homopolymer sequence corresponding to the sequencing method, and the dictionary is the set of sub-signals of length k extracted from it; each sub-signal of length k forms one column of the dictionary. The value of k depends on the size of the reference sequence and the number of sequencing cycles, and is preferably 8-20, more preferably 10-15. To facilitate subsequent queries, these strings of length k (more generally called k-mers) are stored in a data structure; suitable data structures include, but are not limited to, binary trees, red-black trees and inverted indexes. Because the set grows with the size of the genome, which increases the cost of the subsequent matrix computations, a screening procedure is applied during dictionary construction to narrow the set down to a subset that serves as the candidate space for the actual signal. This screening procedure, i.e., dictionary sparsification, is described in detail below.
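The construction just described, sliding a window of length k along the ideal sequence one position at a time and storing the resulting k-mers for later lookup, might look like the following sketch. The function names, the removal of duplicate k-mers, the choice of a prefix-keyed index, and the short example sequence with k = 4 are all assumptions made for illustration (the text prefers k of 8-20).

```python
from collections import defaultdict


def build_dictionary(ideal_sequence, k):
    """Extract every distinct length-k window; each one is a dictionary column."""
    seen = set()
    columns = []
    for start in range(len(ideal_sequence) - k + 1):
        kmer = tuple(ideal_sequence[start:start + k])
        if kmer not in seen:  # keep the dictionary a set of distinct k-mers
            seen.add(kmer)
            columns.append(kmer)
    return columns


def index_by_prefix(columns, n):
    """Inverted index from the first n values of a k-mer to the k-mers sharing them."""
    index = defaultdict(list)
    for column in columns:
        index[column[:n]].append(column)
    return index


# Hypothetical ideal degenerate polymer sequence of a tiny reference.
ideal = [2, 1, 1, 3, 3, 3, 1, 2, 1, 1, 3]
columns = build_dictionary(ideal, k=4)
prefix_index = index_by_prefix(columns, n=2)
print(len(columns))
print(prefix_index[(3, 3)])
```

Keying the index on the first n values anticipates the screening step, which starts by looking up dictionary items that share a prefix with the measured signal.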

For each sequence to be evaluated, n consecutive positions (n < k) are extracted as the subsequence to be evaluated and compared, by traversal, with the first n positions of each item in the dictionary. The dictionary items whose first n positions equal the rounded first n positions of the actual signal form the candidate set; if no dictionary item matches on the first n positions, the method proceeds directly to the next step. n may take the value 3, 4 or 5.

Within the candidate set, the distance between the first k positions of the sequence to be evaluated and each dictionary item is calculated one by one. The distance includes, but is not limited to, the Pearson correlation coefficient, the Spearman correlation coefficient, average mutual information, Euclidean distance, Hamming distance, Chebyshev distance, Mahalanobis distance, Manhattan distance, Minkowski distance, and the maximum or minimum of the absolute values of the differences between corresponding signals. The items are sorted by distance in ascending order and the first m are taken as the final dictionary. The choice of m balances the completeness of the possible signals against the amount of computation; depending on the reference sequence, m ranges from 20 to 300, preferably from 50 to 150.
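The two screening stages (prefix match on the rounded first n values, then the m nearest items by distance) can be sketched as below. Euclidean distance is used as one of the listed options. The function names, the toy dictionary and the behaviour when no prefix matches (here, ranking the whole dictionary) are assumptions, since the text does not spell that case out.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def sparsify_dictionary(columns, signal, n=3, m=100):
    """Return up to m dictionary columns closest to the first k signal values."""
    k = len(columns[0])
    target = list(signal[:k])
    prefix = tuple(round(value) for value in target[:n])
    candidates = [c for c in columns if tuple(c[:n]) == prefix]
    if not candidates:
        # Assumption: with no prefix match, rank the whole dictionary instead.
        candidates = list(columns)
    ranked = sorted(candidates, key=lambda c: euclidean(c, target))
    return ranked[:m]


# Toy dictionary of length-4 items and a hypothetical measured signal.
toy_columns = [(2, 1, 2, 1), (2, 1, 2, 3), (2, 1, 1, 1), (1, 3, 1, 2)]
measured = [2.1, 0.9, 1.8, 1.2]
final_dictionary = sparsify_dictionary(toy_columns, measured, n=2, m=2)
print(final_dictionary)
```

Limiting the final dictionary to m items keeps the matrix used in the sparse representation step small, which is the trade-off between completeness and computation mentioned in the text.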

Sparse representation (Sparse Representation)

Sparse representation is also known as sparse coding. A vector most of whose elements are zero may be called a sparse vector. Consider the linear equation x = D × α, where D is an underdetermined matrix called the dictionary and x is the signal of interest. Sparse representation means expressing the relationship between x and D with a vector α that is as sparse as possible. For a vector α of fixed dimension, the more zero elements it contains, the sparser it is. Mathematically, the sparse representation algorithm solves a two-objective optimization problem that simultaneously minimizes the sparsity of α (its number of non-zero elements) and the distance from D × α to x. As shown in FIG. 2, the present invention applies an appropriate optimization algorithm to find a vector that is as sparse as possible and whose product with the final dictionary matrix is as close as possible to the first k positions of the sequence to be evaluated; the vector found is called the sparse vector. The optimization algorithm includes, but is not limited to, matching pursuit, orthogonal matching pursuit, weak matching pursuit, thresholding, basis pursuit, the IRLS algorithm, the lasso algorithm, the weighted support vector machine algorithm, etc.
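As an illustration of finding α, the sketch below implements plain matching pursuit, one of the algorithms listed above, in pure Python. It is a simplified example rather than the implementation used by the inventors; the function name, the stopping parameters and the three-column toy dictionary are assumptions, and columns are assumed to be non-zero (polymer lengths are at least 1).

```python
def matching_pursuit(columns, x, max_iter=10, tol=1e-9):
    """Greedily approximate x as a weighted sum of a few dictionary columns.

    columns: list of equal-length sequences, one per dictionary column
    x:       signal to represent (first k values of the sequence to be evaluated)
    Returns alpha, one coefficient per column.
    """
    residual = [float(value) for value in x]
    alpha = [0.0] * len(columns)
    energy = [sum(v * v for v in column) for column in columns]
    for _ in range(max_iter):
        if sum(r * r for r in residual) <= tol:
            break  # x is already explained well enough
        # Correlation of every column with what is still unexplained.
        dots = [sum(r * v for r, v in zip(residual, column)) for column in columns]
        best = max(range(len(columns)),
                   key=lambda j: abs(dots[j]) / energy[j] ** 0.5)
        step = dots[best] / energy[best]
        alpha[best] += step
        residual = [r - step * v for r, v in zip(residual, columns[best])]
    return alpha


# Toy final dictionary; the signal equals the second column exactly.
toy_dictionary = [(2, 1, 2, 1), (1, 3, 1, 2), (3, 1, 1, 1)]
alpha_exact = matching_pursuit(toy_dictionary, (1, 3, 1, 2))
print(alpha_exact)
```

For a signal that coincides with one column, a single iteration puts all the weight on that column. For mixed signals built from correlated columns, plain matching pursuit may spread weight over extra columns, which is one reason variants such as orthogonal matching pursuit or lasso are also listed.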

For example, when the degenerate polymer sequence of a sequence to be tested is (2,1,2,1,1,2,1,1,1,2,2), the first k signals can be selected as the sequence to be evaluated; with k = 8 this is (2,1,2,1,1,2,1,1). In actual sequencing the normalized signals may not all be integers, and either the leading portion or a portion closer to integer values may be selected. Put simply, the normalized signals from the front of the read are selected as input. The first 5-50 consecutive normalized signals may be selected, preferably the first 6-45, more preferably the first 7-40, or the first 8, 9, 10, 11, 12, 13, 14 or 15. The starting point need not be precisely defined: the selection may start from the first sequencing signal or from the 2nd, 3rd or 4th.

For example, if the normalized signal for a given gene sequence to be tested is 2.3, 2.2, 2.5, 3, 1, 1, 2, 1, 3, 3.1, 3.5, 3.6, 1.3, 1.2, then the first 9 signals, or 8 signals starting from the third, can be selected. In theory the chosen position does not affect the final result, because a monoclonal signal differs markedly from a polyclonal signal, which is a mixture of two or more monoclonal signals. However, because the measured signal is affected by dephasing, attenuation and other factors, signals acquired later are generally of poorer quality and more likely to deviate from integer values, so in a preferred embodiment the sub-signal nearest the front should be selected as the signal to be evaluated. In the present invention this rule is simply described mathematically, and no additional meaning is introduced. This approach identifies both polyclonal and monoclonal signals with relatively high accuracy. Of course, when the signal is extremely poor the sequencing accuracy may fall, for example below 90%, but this is not caused by any inaccuracy of the method of the invention.

Filtering the result

The sparse vector α consists of non-zero and zero elements and contains many elements that are zero or close to zero. As shown in FIG. 2, for the optimized α obtained by sparse representation, the first step is to check its sparsity, i.e., the number of non-zero elements. The method considers only α vectors whose sparsity is not higher than a sparsity threshold; the threshold is in the range of 2-10, preferably 2-5, more preferably 2-3. All other cases are collectively called anomalies. An anomaly indicates that the sparse representation algorithm could not produce a vector of sufficiently low sparsity as designed, which means that the original signal differs substantially from the basic signals in the dictionary: the sequence may be of low quality, or it may not be present in the reference sequence. In such cases the method cannot judge the signal to be monoclonal or polyclonal. In the cases observed so far, a sparsity above the sparsity threshold is very rare.

Preferably, all elements of the sparse vector are normalized. The largest element of the sparse vector is the quantity of interest: it represents how close x is to a particular column vector in D, and is called the mixing degree. When the mixing degree of the sparse vector is close to 1, the sequencing signal is close to an ideal normalized signal of the reference sequence, and the sequencing signal is a monoclonal signal. Conversely, when the mixing degree is not close to 1, or when several elements have similar magnitudes, the sequencing signal is judged to be a polyclonal signal, or mixed signal. Accordingly, a threshold, referred to as the mixing degree threshold, is set for the mixing degree of the sparse vector: a value above the threshold is judged to be a monoclonal signal, and a value below it is judged to be a polyclonal or mixed signal. Empirically, the mixing degree threshold is set in the range 0.6-1, preferably 0.8-1, more preferably 0.9-1, and still more preferably 0.95-1.
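The two checks described above (sparsity first, then mixing degree) can be sketched as a small decision function. This is a minimal illustration rather than the patented implementation: the function name, the tolerance `eps` used to treat near-zero elements as zero, the default thresholds, and the handling of an all-zero vector are assumptions introduced here.

```python
def classify_sparse_vector(alpha, sparsity_threshold=3, mixing_threshold=0.6, eps=1e-6):
    """Label a sparse vector as 'monoclonal', 'polyclonal' or 'undetermined'.

    alpha is assumed to be already normalized as described in the text.
    eps (an assumption) decides which elements count as non-zero.
    """
    nonzero = [abs(a) for a in alpha if abs(a) > eps]
    sparsity = len(nonzero)

    # Anomaly: too many non-zero elements (or none at all, an added assumption).
    if sparsity == 0 or sparsity > sparsity_threshold:
        return "undetermined"

    # The mixing degree is the largest element of the sparse vector.
    mixing_degree = max(nonzero)
    if mixing_degree > mixing_threshold:
        return "monoclonal"
    return "polyclonal"


# Hypothetical vectors, for illustration only.
print(classify_sparse_vector([0.0, 0.9, 0.0, 0.0]))       # one dominant element
print(classify_sparse_vector([0.4, 0.0, 0.35, 0.0]))      # two comparable elements
print(classify_sparse_vector([0.2, 0.2, 0.2, 0.2, 0.2]))  # too many non-zero elements
```

Signals labelled "undetermined" correspond to the anomalies described above and would be discarded together with the polyclonal ones.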

Normally, after the polyclonal signals have been distinguished by the method of the present invention, the identified polyclonal or mixed signals are discarded, as are the signals that cannot be determined to be monoclonal or polyclonal, because the confounded information in a polyclonal signal is usually not usable. The identified monoclonal signals are retained for subsequent data processing and analysis, e.g. base calling.

Preferably, mathematical signs may be used to represent the different chemical reactions in multi-base sequencing. For example, if 2+2 sequencing performed in KM injection order yields the degenerate polymer sequence 2, 3, 1, 1, 3, 3, 2, 1, the K signals can be expressed as positive values and the M signals as negative values, i.e. 2, -3, 1, -1, 3, -3, 2, -1.
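The sign convention in this example can be sketched as follows. The function name and its parameters are hypothetical; the sketch assumes that the reagents alternate strictly in the given injection order.

```python
def signed_dpl(dpl, order="KM", positive="K"):
    """Attach a sign to each degenerate polymer length.

    Values produced by the reagent named in `positive` stay positive;
    values produced by the other reagent become negative.
    """
    signed = []
    for i, value in enumerate(dpl):
        reagent = order[i % len(order)]  # reagents alternate in injection order
        signed.append(value if reagent == positive else -value)
    return signed


print(signed_dpl([2, 3, 1, 1, 3, 3, 2, 1]))
# [2, -3, 1, -1, 3, -3, 2, -1]
```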

The invention adopts a mathematical judgment based on a dictionary and sparse representation, which expresses how close a vector, or a set of data, is to a reference. The mathematics adds no further limitation or judgment of its own: each number or calculation in the algorithm has the practical meaning of measuring how close the actual sequencing signal is to the reference sequence, and with a given threshold the sequencing signal can be judged to be a monoclonal or a polyclonal signal. The mathematical approach used is thus simply one specific implementation of the claimed solution of the invention.

D is a matrix of ideal normalized signals of possible fragments of the reference sequence, whose gene sequence is known. Each column of D is one possible ideal normalized signal. Each column is a run of consecutive signals and has the same dimension as x, where the dimension of x is the number of sequencing signals it contains. For example, take a reference genome of known sequence with a length of 500 bp. When the dimension of x is 10, D is the matrix whose column vectors are every run of 10 consecutive signals taken from the ideal normalized signal of the reference genome. In other words, all possible 10-signal cases are assembled into one matrix. Of course, in actual calculation the matrix can be composed more simply, depending on how much information is known, but the principle is the same.
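The construction of D as a collection of consecutive windows can be sketched as follows. The function name is hypothetical, and dropping duplicate columns is an assumption made here as one possible way of simplifying the matrix.

```python
def build_dictionary(ideal_signal, k):
    """Return the columns of D: every run of k consecutive ideal signals.

    Columns are returned as tuples. Duplicate columns are dropped while
    keeping the order of first appearance (an assumption of this sketch).
    """
    if k <= 0 or k > len(ideal_signal):
        raise ValueError("k must be between 1 and the length of the ideal signal")
    windows = [tuple(ideal_signal[i:i + k]) for i in range(len(ideal_signal) - k + 1)]
    return list(dict.fromkeys(windows))  # order-preserving de-duplication


# Signed ideal signal taken from the KM example in the text, with k = 3.
columns = build_dictionary([2, -3, 1, -1, 3, -3, 2, -1], 3)
for column in columns:
    print(column)
```

An ideal signal of length N yields at most N - k + 1 columns, each of the same dimension k as the signal x to be evaluated.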

In this patent, although a matrix calculation method is used, its practical meaning is to compare the sequencing result with the ideal sequencing result of the reference sequence. The simple mathematical approach does not introduce additional physical meaning or limitations.

α is a sparse vector, and the computation of sparse representations is common knowledge in mathematics. With a suitable choice of distance, solving for α becomes a convex quadratic programming problem. Many solvers exist for this type of problem, such as interior point methods, fixed-point continuation, the in-crowd algorithm, and the like. The computation of the sparse representation is not described in detail here, as it is not the focus of the present invention.
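As one illustration of such a convex formulation, choosing the squared Euclidean distance gives the standard lasso problem. The penalty weight λ and the split α = u - v are notation introduced here for illustration and are not taken from the source.

```latex
\hat{\alpha} \;=\; \arg\min_{\alpha}\; \tfrac{1}{2}\,\lVert x - D\alpha \rVert_2^2 \;+\; \lambda\,\lVert \alpha \rVert_1 , \qquad \lambda > 0

% Writing \alpha = u - v with u, v \ge 0 gives an equivalent convex quadratic program:
\min_{u,\,v \,\ge\, 0}\; \tfrac{1}{2}\,\lVert x - D(u - v) \rVert_2^2 \;+\; \lambda\,\mathbf{1}^{\top}(u + v)
```

The first term measures the distance between the signal x and its reconstruction Dα, and the second term penalizes non-zero elements, so a larger λ produces a sparser α.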

The second aspect of the invention discloses a method for evaluating a mixed signal of gene sequencing, characterized by comprising the following steps:

1) sequencing to obtain the results corresponding to the bases of the sequence to be tested;

2) compiling the reference sequence corresponding to the sequence to be tested into a theoretical result corresponding to the sequencing method;

3) dividing the compiled theoretical result of the reference sequence into possible result clusters of 6-50 bases; and

4) comparing the sequencing results corresponding to the bases of the sequence to be tested with the possible result clusters of step 3).

Each of the features discussed in the description of the first aspect of the invention is equally applicable to the second aspect of the invention. As indicated above, some of these features are not repeated here and should be considered incorporated by reference. Those of ordinary skill in the art will understand how features identified in these implementations can readily be combined with the basic feature sets identified in other implementations.

Example 1

Lambda phage was subjected to WS 2+2 degenerate sequencing. With k = 10 and m = 256, the sparse representation method was used to distinguish monoclonal from polyclonal signals.
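Here k is the length of each sub-signal and m is the number of columns kept in the final dictionary. The selection of those columns, as set out in claim 7, can be sketched as follows. The function name, the rounding of the first n positions of a noisy signal before comparison, and the use of the Euclidean distance are assumptions of this sketch.

```python
import math


def final_dictionary(dictionary, signal, n=3, m=256):
    """Select the final dictionary for one signal.

    1. Keep the columns whose first n positions equal the (rounded) first
       n positions of the signal.
    2. Sort the kept columns by Euclidean distance to the signal.
    3. Return the m closest columns.
    """
    prefix = [round(v) for v in signal[:n]]  # rounding is an assumption
    candidates = [col for col in dictionary if list(col[:n]) == prefix]
    candidates.sort(key=lambda col: math.dist(col, signal))
    return candidates[:m]


# Hypothetical toy dictionary with k = 4, for illustration only.
toy = [(2, -2, 2, -2), (2, -2, 2, -1), (2, -2, 3, -3), (1, -2, 2, -2)]
print(final_dictionary(toy, [2.1, -1.9, 2.0, -2.0], n=2, m=2))
```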

One signal determined to be monoclonal is:

2 -2 2 -2 2 -2 2 -2 2 -2

The dictionary used to represent the monoclonal signal (showing the 10 columns closest to the signal) is:

The sparse representation of the monoclonal signal is as follows. α is a column vector, and its transpose is shown for clarity:

A signal determined to be polyclonal is:

2 -1 3 -2 2 -1 2 -3 2 -2

The dictionary used to represent the polyclonal signal (showing the 10 columns closest to the signal) is:

The sparse representation of the polyclonal signal is:

Example 2

Identification of polyclonal signals in sequencing of a strand-specific reverse transcription library of Tobacco Mosaic Virus (TMV)

The genome of tobacco mosaic virus is a single-stranded RNA 6395 nt in length. When its cDNA is subjected to degenerate sequencing in KM injection order and repeated sequencing signals are merged, a matrix of 1509 columns and 12 rows is obtained directly:

As shown, each column is one of the possible sequencing signals in the reference genome, positive for the K signal and negative for the M signal.

Normalizing the matrix by columns to obtain a dictionary D:
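Column-wise normalization can be sketched as follows. The text does not state which norm is used, so the Euclidean (L2) norm here is an assumption, as is the function name.

```python
import math


def normalize_columns(columns):
    """Scale every column to unit Euclidean length (assumed normalization)."""
    normalized = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col))
        if norm == 0:
            raise ValueError("cannot normalize an all-zero column")
        normalized.append([v / norm for v in col])
    return normalized


# Hypothetical two-row columns, for illustration only.
print(normalize_columns([[3, -4], [1, -1]]))
```

After this step every column has the same length, so the coefficients in the sparse vector are comparable across columns.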

For the following four sets of sequencing signals:

1. Noise-free ideal monoclonal signal s_m:

4 -2 1 -1 3 -2 1 -1 6 -1 3 -2

2. Noisy observed monoclonal signal s'_m:

3.653 -2.095 1.221 -1.178 3.190 -1.637 1.429 -0.879 5.978 -1.264 2.479 -2.024

3. Noise-free ideal polyclonal signal s_p:

2.5 -2.5 2.5 -1.5 2 -1.5 2 -1.5 3.5 -1 4.5 -1.5

4. Noisy observed polyclonal signal s'_p:

2.444 -2.523 2.529 -1.435 1.957 -1.477 1.991 -1.422 3.472 -0.999 4.675 -1.546

The lasso method is used to optimize the following function:

The mixing degree threshold is set to 0.6.

For x = s_m, the mixing degree was 0.889, and the signal was determined to be monoclonal.

For x = s'_m, the mixing degree was 0.651, and the signal was determined to be monoclonal.

For x = s_p, the mixing degree was 0.367, and the signal was determined to be polyclonal.

For x = s'_p, the mixing degree was 0.366, and the signal was determined to be polyclonal.
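The workflow of this example can be imitated on a toy problem. The sketch below is not the computation used in the example: the three-column dictionary, the penalty weight `lam`, the L2 normalization of both the columns and the signal, and the ISTA solver (iterative soft-thresholding, one common way of solving a lasso problem) are all assumptions, and the TMV dictionary is not reproduced.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal step of the L1 penalty)."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def lasso_ista(columns, x, lam=0.05, n_iter=500):
    """Minimize 0.5*||x - D a||^2 + lam*||a||_1 by ISTA.

    `columns` holds the dictionary D as a list of column vectors.
    """
    m, k = len(columns), len(x)
    # The squared Frobenius norm of D bounds the largest eigenvalue of D^T D,
    # so its reciprocal is a safe step size.
    step = 1.0 / sum(c * c for col in columns for c in col)
    a = [0.0] * m
    for _ in range(n_iter):
        recon = [sum(columns[j][i] * a[j] for j in range(m)) for i in range(k)]
        resid = [recon[i] - x[i] for i in range(k)]
        grad = [sum(columns[j][i] * resid[i] for i in range(k)) for j in range(m)]
        a = [soft_threshold(a[j] - step * grad[j], step * lam) for j in range(m)]
    return a


def mixing_degree(alpha):
    """Largest element (in magnitude) of the sparse vector."""
    return max(abs(a) for a in alpha)


# Hypothetical signed sub-signals, not taken from the TMV reference.
raw = [[4, -1, 1, -1], [1, -1, 4, -1], [1, -4, 1, -1]]
D = [unit(col) for col in raw]

mono = unit(raw[0])                                         # one template
poly = unit([(p + q) / 2 for p, q in zip(raw[0], raw[1])])  # 50/50 mixture

for name, x in (("mono", mono), ("poly", poly)):
    alpha = lasso_ista(D, x)
    label = "monoclonal" if mixing_degree(alpha) > 0.6 else "polyclonal"
    print(name, [round(a, 3) for a in alpha], label)
```

In this toy setting the single-template signal is represented by one dominant coefficient, while the 50/50 mixture is split between two comparable coefficients and therefore falls below the 0.6 threshold.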

The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for evaluating a gene sequencing mixed signal, comprising the following steps:

A. Obtaining data: Sequencing to obtain the sequencing signal intensity results corresponding to the bases of the sequence to be tested, and expressing the intensity results as a degenerate polymer sequence or a homopolymer sequence as the sequence to be evaluated; the degenerate polymer sequence is a sequence composed of the number arrangement of degenerate polymers, and the homopolymer sequence is a sequence composed of the number arrangement of monomers constituting the homopolymer;

B. Constructing a dictionary: determining a reference sequence, representing the reference sequence corresponding to the sequence to be tested as an ideal degenerate polymer sequence or an ideal homopolymer sequence corresponding to the sequencing method, and extracting sub-signals of length k from the ideal degenerate polymer sequence or the ideal homopolymer sequence bit by bit to construct a dictionary;

C. Dictionary sparsification: extracting subsequences from the dictionary and selecting subsequences that are close to the sequence to be evaluated as the final dictionary;

D. Sparse representation: using an optimization algorithm to find a vector that can be left-multiplied by the final dictionary matrix so that the sparsity of the vector and the distance to the first k bits of the sequence to be evaluated are minimized at the same time, the vector found being called a sparse vector;

E. Filtering results: The sparse vector is analyzed according to the set sparsity threshold and mixedness threshold. When the sparsity is not higher than the sparsity threshold and the mixedness is higher than the mixedness threshold, the sequence to be evaluated corresponding to the sparse vector is determined to be a monoclonal signal; when the sparsity is not higher than the sparsity threshold and the mixedness is not higher than the mixedness threshold, the sequence to be evaluated corresponding to the sparse vector is determined to be a polyclonal signal or a mixed signal; when the sparsity is higher than the sparsity threshold, the sparse representation fails and the signal cannot be determined to be monoclonal or polyclonal.

2. The method according to claim 1, characterized in that the sequencing includes multi-base sequencing.

3. The method according to claim 1, characterized in that in step A, the first k positions of the degenerate polymer sequence or homopolymer sequence are selected as the sequence to be evaluated.

4. The method according to claim 1, characterized in that the value of k is 8-20.

5. The method according to claim 1, characterized in that the value of k is 10-15.

6. The method according to claim 1, wherein the sub-signals of length k constitute each column of the dictionary.

7. The method according to claim 1, characterized in that step C of determining the final dictionary comprises: extracting continuous n bits from each sequence to be evaluated as a subsequence to be evaluated, traversing and comparing the subsequence to be evaluated with the first n bits of each sub-signal in the aforementioned dictionary, and taking the items in the dictionary that are the same as the first n bits of the actual signal as an alternative set, wherein n<k; in the alternative set, calculating the distances between the first k bits of the sequence to be evaluated and the dictionary items one by one, and taking out the first m items of the set elements arranged in ascending order of distance as the final dictionary.

8. The method according to claim 7, characterized in that the value of n is 3, 4 or 5; and the value range of m is 20-300.

9. The method according to claim 7, characterized in that the value of n is 3, 4 or 5; and the value range of m is 50-150.

10. The method according to claim 1, characterized in that the distance includes the maximum or minimum value of the absolute value of the corresponding signal difference, the Pearson correlation coefficient, the Spearman correlation coefficient, the average mutual information, the Euclidean distance, the Hamming distance, the Chebyshev distance, the Mahalanobis distance, the Manhattan distance, the Minkowski distance, and the absolute value of the corresponding signal difference; the optimization algorithm includes matching pursuit, orthogonal matching pursuit, weak matching pursuit, threshold method, basis pursuit, IRLS algorithm, Lasso algorithm, and weighted support vector machine algorithm.

11. The method according to claim 1, characterized in that the sparsity threshold value ranges from 2 to 10; the mixing threshold value ranges from 0.6 to 1.

12. A method for evaluating a gene sequencing mixed signal, comprising the following steps:

1) Sequencing to obtain the results corresponding to the bases of the sequence to be tested;

2) compiling the reference sequence corresponding to the sequence to be tested into a theoretical result corresponding to the sequencing method;

3) dividing the compiled theoretical results of the reference sequence into possible result clusters of 6-50 bases;

4) comparing the sequencing results corresponding to the bases of the sequence to be tested with the possible result clusters in step 3);

The method of claim 1 is used to determine a polyclonal signal or a monoclonal signal.