Disclosure of Invention
In view of the above, an object of the present invention is to provide a method for identifying an antibody sequence based on de novo sequencing and homology tag theory.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
A method of identifying an antibody sequence based on de novo sequencing and homology tag theory, the method steps comprising:
(1) Cleaving the antibody to be identified at the amino acid level to obtain a peptide fragment solution;
(2) Respectively carrying out liquid phase separation and mass spectrometry on the obtained peptide fragment solution to obtain a mass spectrometry file;
(3) Carrying out pFind search on the obtained mass spectrum file to determine the light chain constant region peptide fragment sequence of the antibody to be identified and the heavy chain constant region peptide fragment sequence of the antibody to be identified, carrying out pNovo de novo sequencing on the obtained mass spectrum file to determine the variable region candidate peptide fragment sequence of the antibody to be identified;
(4) Respectively carrying out homologous search on the light chain constant region peptide fragment sequence of the antibody to be identified and the heavy chain constant region peptide fragment sequence of the antibody to be identified to obtain a complete light chain and a complete heavy chain with high homology with the light chain constant region and the heavy chain constant region;
(5) Selecting homology tags from the light chain variable region amino acid probability distribution table and the heavy chain variable region amino acid probability distribution table, respectively, wherein each homology tag consists of a sequence of 5 to 8 amino acids that comprises 2 or more amino acids with a probability greater than 60% and 3 amino acids ranked among the top 5 by probability;
(6) Dividing the high-reliability variable region peptide fragment sequences into small peptide fragment sequences (kmers) of length K, where K is 5 to 8; counting the number of occurrences and the probability of each kmer to obtain its assembly score; taking the kmer that satisfies a homology tag and has the highest score as the assembly starting point; searching toward both ends for kmers that overlap by K-1 amino acids; adding one amino acid at the end according to the assembly score; and repeating the assembly to obtain the light chain variable region peptide fragment sequence and the heavy chain variable region peptide fragment sequence;
(7) Performing secondary assembly of the light chain constant region peptide fragment sequence and the heavy chain constant region peptide fragment sequence from step (3) with the light chain variable region peptide fragment sequence and the heavy chain variable region peptide fragment sequence, respectively, to obtain the complete antibody light chain and heavy chain.
Further, in step (1), the antibody to be identified is cleaved at the amino acid level by one or more of a protease method, a microwave hydrolysis method and a microwave-assisted protease method, and peptide fragments obtained by different methods overlap by at least 4 amino acids.
Further, in step (1), the antibody to be identified is digested separately with specific proteases and non-specific proteases, the total number of specific and non-specific proteases being 3 or more.
Further, in step (2), the peptide fragment solution is separated by liquid chromatography with mobile phase gradient elution on a C18 reversed-phase column; during mass spectrometry, the top 20 parent ions in each primary spectrum are selected for secondary spectrum analysis and fragmented by higher-energy collisional dissociation (HCD).
Further, in step (3), the database searched by pFind is the complete Swiss-Prot library, and the parent ion deviation and fragment deviation are both ±20 ppm; from the search results, the amino acid sequences that are close in length to an antibody light chain constant region and an antibody heavy chain constant region and have the highest coverage are selected as the light chain constant region peptide fragment sequence and the heavy chain constant region peptide fragment sequence of the antibody to be identified.
Further, in the step (3), pNovo de novo sequencing is performed on the obtained mass spectrum file, and peptide fragments with parent ion mass deviation less than 10ppm are reserved as variable region candidate peptide fragment sequences of the antibodies to be identified.
Further, in step (4), an online antibody library abYsis is used for homology searching.
Further, in step (4), the homology search retains complete light chains and complete heavy chains that are highly homologous to the light chain constant region and the heavy chain constant region, with E values of less than 10^-5. The E value is the probability that, when two protein sequences of the same length have randomly arranged amino acid residues, the summed score of the aligned residue pairs under a scoring matrix reaches the observed total.
Further, in step (6), the assembly score of each small peptide fragment is S = R × 10^P, where R is the number of occurrences of the small peptide fragment and P is the amino acid probability of the small peptide fragment.
Further, in step (6), if a site is ambiguous during assembly, the amino acid with the highest probability in the probability distribution table is selected, provided that its probability is more than 3 times that of the second most probable amino acid.
Further, in step (6), during assembly the C-terminus of the variable region is extended by an additional 4 to 7 amino acids, which overlap the first 4 to 7 amino acids at the N-terminus of the constant region.
Further, the antibodies to be identified comprise a mixture of polyclonal antibodies with different specificities in serum, a pure monoclonal antibody, a monoclonal antibody secreted by hybridoma cells or a monoclonal antibody secreted by plasma cells.
Advantageous effects
The invention provides an antibody sequence identification method based on de novo sequencing and homology tag theory. The antibody to be identified is first cleaved, and the resulting peptide fragment solution is analyzed by liquid chromatography-mass spectrometry. The light chain and heavy chain constant region peptide fragment sequences are then determined by database search, and the variable region candidate peptide fragment sequences are determined by de novo sequencing. A dynamic window method, combined with a species-specific antibody homology library, then effectively identifies de novo sequencing errors and distinguishes the isomers leucine and isoleucine, improving the accuracy and stability of assembly.
In determining the antibody constant region by database search and the antibody variable region by de novo sequencing, the antibody is treated with 3 protein sequence cleavage methods (a protease method, a microwave hydrolysis method and a microwave-assisted protease method), which ensures high overlap between the antibody peptide fragment sequences and improves coverage of the antibody protein sequence. In addition, specifying the protease and setting the parent ion deviation and fragment deviation to 20 ppm ensures amino-acid-level accuracy of the peptide fragments.
Key parameters in the sequence assembly process include the amino acid probability threshold of the homology tag and the kmer length K. The probability threshold of the homology tag is set to greater than 60% so that highly conserved sequences are extracted and used to select among the de novo sequenced peptide fragments, which improves the accuracy of sequence assembly. K is chosen as 5, 6, 7 or 8, and the different values have complementary effects. When K is small, the assembly result is easily affected by repeated peptide fragments, and wrongly repeated amino acid fragments appear in the assembly result, which reduces accuracy; on the other hand, a smaller K increases the length of the assembled sequence and thus the coverage. When K is large, the assembly result is less affected by repeated peptide fragments and its accuracy improves; however, a larger K filters out some of the shorter sequences and reduces the overlap between different kmers, resulting in shorter assembled sequences and reduced sequence coverage.
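The effect of K on repeated fragments can be illustrated with a minimal sketch. The helper names and the toy sequence are hypothetical and are not part of the method; the sketch only splits a sequence into kmers with a sliding window and lists the kmers that occur more than once for each K.

```python
from collections import Counter

def split_kmers(sequence, k):
    """Return all overlapping substrings of length k (sliding window, step 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def repeated_kmers(sequence, k):
    """Return the set of kmers that occur more than once in the sequence."""
    counts = Counter(split_kmers(sequence, k))
    return {kmer for kmer, n in counts.items() if n > 1}

# Hypothetical toy sequence in which the 6-residue stretch "STKGPS" occurs twice.
toy = "ASTKGPSVFPLAPSTKGPSQ"
for k in (5, 6, 7, 8):
    print(k, len(split_kmers(toy, k)), sorted(repeated_kmers(toy, k)))
```

On this toy sequence the repeated kmers disappear once K reaches 7, while the number of kmers available for overlap falls as K grows, which mirrors the trade-off between accuracy and coverage.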
The method is based on a homology probability distribution table obtained by data mining. Peptide fragments are aligned against this table and low-probability amino acids are replaced with high-probability amino acids, which corrects ambiguous and isomeric amino acids, while the peptide fragments with the best matching score are retained to obtain the preferred de novo sequencing result. This improves the accuracy of the peptide fragments used in assembly and hence the accuracy of antibody sequence identification.
The method of the invention is applicable to a wide range of antibody samples, including mixtures of polyclonal antibodies with different specificities in serum, pure monoclonal antibodies, monoclonal antibodies secreted by hybridoma cells, monoclonal antibodies secreted by plasma cells, and the like. Antibody proteins from these different sources can all be cleaved by amino-acid-level protein sequence cleavage methods to form overlapping peptide fragments, and the complete antibody sequence is obtained after assembly.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
An antibody (immunoglobulin, Ig) is a Y-shaped protein of large molecular weight (about 150 kDa) and about 10 nm in size. In humans and most mammals, an antibody unit consists of four polypeptide chains: two longer heavy chains (about 450-550 AA) and two shorter light chains (about 214 AA). Within the same antibody, the two heavy chains are identical in sequence, as are the two light chains. Depending on the variability of the peptide chain composition, an antibody can also be divided into variable and constant regions. Combining this with the chain lengths, the antibody can be divided into 4 main regions: the light chain constant region (CL), the light chain variable region (VL), the heavy chain constant region (CH) and the heavy chain variable region (VH). The variable region amino acid sequences of the light and heavy chains vary greatly in composition and lie near the N-terminus, accounting for 1/4 and 1/2 of the length of the heavy and light chains, respectively. The constant region amino acid sequences of the light and heavy chains are relatively stable in composition and lie near the C-terminus, accounting for 3/4 and 1/2 of the length of the heavy and light chains, respectively.
VH and VL each contain 3 regions whose amino acid composition and order are highly variable, called complementarity determining regions (CDRs): CDR1, CDR2 and CDR3, of which CDR3 varies the most. Each variable region also contains 4 framework regions (FRs) of relatively stable amino acid composition, and the FRs and CDRs alternate to form the variable region. The 3 CDRs of the VH are located at amino acids 29-31, 49-58 and 95-102, respectively, while the 3 CDRs of the VL are located at amino acids 28-35, 49-56 and 91-98, respectively. The constant regions of the heavy and light chains are referred to as CH and CL, respectively. The CL lengths of different types (kappa or lambda) of Ig are substantially identical, but the CH lengths differ between Ig classes: for example, IgG, IgA and IgD contain CH1, CH2 and CH3, whereas IgM and IgE contain CH1, CH2, CH3 and CH4.
Because the amino acid composition of the constant region is relatively stable, this part of the sequence is present in the Swiss-Prot protein sequence database; and because database searching is more accurate than de novo sequencing, the constant region can be identified by database searching. The CDRs in the variable region, however, are highly variable, and protein sequence databases typically do not contain these sequences, so database searching is not suitable for them. De novo sequencing does not rely on a sequence database and can therefore identify the variable region sequence. However, de novo sequencing has low accuracy, and its results contain many ambiguous peptide fragment sequences, so reliable additional information is needed to select the reliable ones. The constant region sequence is used to perform a homology search against an antibody database to obtain a number of homologous heavy chain and light chain sequences; the amino acid composition at each position of the variable regions of these homologous sequences is counted to obtain an amino acid frequency distribution table of the variable region; and ambiguous peptide fragments can be resolved on the basis of this table, so that the most reliable candidate variable region sequences are selected. The technical flow of the specific scheme is shown in FIG. 1.
A method of identifying an antibody sequence based on de novo sequencing and homology tag theory, the method steps comprising:
1. Sample processing (wet experiment)
1. Antibody enzymolysis
(1) Five portions of herceptin antibody, 20 μg each, were taken. Each portion was denatured at 95 ℃ for 10 min in 2% (mass/volume) aqueous sodium deoxycholate, 200 mM Tris-HCl and 10 mM tris(2-carboxyethyl)phosphine at pH 8.0, and then incubated at 35 ℃ for 30 min for reduction. Finally, the samples were alkylated by adding iodoacetic acid to a final concentration of 40 mM and incubating at room temperature (25 ℃) for 45 min in the dark, and the antibody solution samples were stored at -4 ℃.
(2) The antibody was cleaved at the amino acid level by 3 different methods: the protease method, the microwave hydrolysis method and the microwave-assisted protease method. Combining the three methods ensures that the peptide fragments formed after protein cleavage overlap by at least 4 amino acids, which improves the accuracy of sequencing and the coverage of antibody sequence assembly. Each of the three methods was applied to the antibody solution samples prepared in step (1); the specific steps are as follows:
1) Protease method: Five 3 μg antibody solution samples were taken, and each was digested at 37 ℃ with one of aspartic protease, trypsin, chymotrypsin, lysine protease and glutamate protease. The enzyme-to-antibody mass ratio was 1:50. The digestion mixture was dissolved in 100 μL of 50 mM ammonium bicarbonate solution and left to stand for 4 hours, after which the 5 digestion product solutions were stored at -4 ℃ until use. The method combines specific proteases (trypsin, aspartic protease, glutamate protease and lysine protease) with a non-specific protease (chymotrypsin) so that the cleavage sites differ and overlapping peptide fragments are formed, which improves the coverage of the antibody sequence.
2) Microwave hydrolysis method: One 3 μg antibody solution sample was placed in a glass vial, HCl was added to a final concentration of 3 M, the vial was placed on ice in a beaker, and the sample was microwave-heated at 200 W for 4 minutes in a microwave oven (pausing every 1 minute to replenish the ice, so that the peptide fragments in the antibody are not denatured by overheating). The microwave hydrolysate solution was stored at -4 ℃.
3) Microwave-assisted protease method: One 3 μg antibody solution sample was placed in a glass vial, HCl was added to a final concentration of 3 M, the vial was placed in a beaker with ice, and the sample was microwave-heated at 200 W for 2 minutes in a microwave oven (pausing every 1 minute to replenish the ice). The microwave hydrolysate was adjusted to pH 8 with 0.5 M aqueous NaOH, and 0.3 μg of trypsin was added. The microwave-assisted digestion was carried out by microwave heating at 100 W for 6 minutes (pausing every 1 minute to replenish the ice), and the reaction was then terminated by adding 1 M HCl to adjust the pH to 2. The microwave-assisted digestion product solution was stored at -4 ℃ until use.
(3) To each of the product solutions from the three amino-acid-level cleavage methods in step (2), 2 μL of formic acid was added, and the solutions were centrifuged at 14000 g for 20 min to remove sodium deoxycholate.
(4) After centrifugation, the supernatant was collected and desalted on 30 μm Oasis HLB 96 well plates.
(5) The Oasis HLB adsorbent was activated with 100v% acetonitrile and equilibrated with 10v% aqueous formic acid.
(6) After the digested peptide fragments had bound to the adsorbent, elution was performed twice with 10v% aqueous formic acid and then with 100ml of 50v% acetonitrile to 5v% formic acid.
(7) The eluted peptide fragment solution was vacuum-dried and stored.
2. Liquid phase separation and mass spectrometry
(1) The digested peptide fragment samples were dissolved in 0.1v% formic acid and analyzed online on an Agilent 1290 UHPLC coupled to an Orbitrap Q-Exactive HFX mass spectrometer; 1 μg of sample was loaded each time at a loading speed of 0.8 μg/min.
(2) The peptide fragments were separated by a 15cm C18 reverse phase chromatography column (inner diameter 100 μm,1.9 μm resin) and an elution gradient of 120 min. Mobile phase a was 0.1v% trifluoroacetic acid and 2v% acetonitrile, and mobile phase B was 0.1v% trifluoroacetic acid and 98v% acetonitrile. A120 min gradient (mobile phase B: 3v% at 0min, 5v% at 5min, 22v% at 95min, 30v% at 105min, 90v% at 115min, 90v% at 120 min) was used at a flow rate of 450 nL/min.
(3) The Orbitrap Q-Exactive mass spectrometer parameters were as follows: the spray voltage was 2100 V and the ion transfer tube temperature was 300 ℃. For the primary spectrum scan, the automatic gain control target was 1 × 10^5, the m/z scan range was 350-2000 and the resolution was 3 × 10^4. The top 20 parent ions in each primary spectrum were selected for secondary spectrum analysis and fragmented by higher-energy collisional dissociation (HCD), with the fragmentation energy set to 30%.
2. Determination of the constant regions by database search (this and the following sections are dry experiments), specifically:
1. The original mass spectrum files from the 3 sequence cleavage methods were each searched with pFind. The sequence database searched was the complete Swiss-Prot library.
2. The main database search parameters were as follows. Sequence database: Swiss-Prot protein sequence database. Protease type: 1) protease method: aspartic protease, trypsin, chymotrypsin, lysine protease or glutamate protease (chosen to match the protease used in the wet experiment); 2) microwave hydrolysis method: no enzyme; 3) microwave-assisted protease method: no enzyme and trypsin (chosen to match the wet experiment). Maximum number of missed cleavages: 3. Open search: no. Parent ion deviation: ±20 ppm. Fragment deviation: ±20 ppm. Fixed modification: carbamidomethylation. Variable modification: methionine oxidation. False discovery rate: 1%. Peptide fragment mass range: 600-8000. Peptide fragment length: 6-40.
3. From the search results, the amino acid sequences that are close in length to an antibody light chain constant region (about 107 amino acid residues) and an antibody heavy chain constant region (about 330 amino acid residues) and have the highest coverage were selected. The sequence names of these results were checked to determine the species, type and sequence composition of the light chain constant region and the heavy chain constant region.
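The selection in item 3 can be sketched as follows. This is a minimal illustration, not pFind functionality: the function names, the dictionary input and the 20% length tolerance are assumptions introduced here, and coverage is taken to be the fraction of residues of a candidate sequence covered by at least one identified peptide fragment.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein) if protein else 0.0

def pick_constant_region(candidates, peptides, expected_len, tolerance=0.2):
    """Pick the candidate closest to the description above.

    candidates: dict mapping a sequence name to its amino acid sequence.
    expected_len: about 107 for a light chain constant region, about 330 for heavy.
    tolerance: hypothetical relative length window (an assumption, not from the source).
    Returns the name of the highest-coverage candidate within the length window,
    or None if no candidate has a suitable length.
    """
    close = {
        name: seq
        for name, seq in candidates.items()
        if abs(len(seq) - expected_len) <= tolerance * expected_len
    }
    if not close:
        return None
    return max(close, key=lambda name: sequence_coverage(close[name], peptides))
```

In practice the candidate names would be the database entry names, which is what allows the species and chain type to be read off the selected result.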
3. Obtaining the variable region candidate sequences by de novo sequencing, specifically:
1. Files in mgf format corresponding to the 5 enzymes were extracted from the original mass spectrum files using mass spectrum data format conversion software.
2. The extracted mgf-format mass spectrum data were used as the input to pNovo for de novo sequencing. De novo sequencing was performed separately for each of the 3 protein sequence cleavage methods, with all parameters kept the same except the cleavage type.
3. The main pNovo de novo sequencing parameters were as follows. Fixed modification: carbamidomethylation. Variable modification: methionine oxidation. Parent ion deviation: ±20 ppm. Fragment deviation: ±20 ppm. Protease type: 1) protease method: aspartic protease, trypsin, chymotrypsin, lysine protease or glutamate protease (chosen to match the protease used in the wet experiment); 2) microwave hydrolysis method: no enzyme; 3) microwave-assisted protease method: no enzyme and trypsin (chosen to match the wet experiment).
4. The peptide fragments obtained by de novo sequencing for the protease method, the microwave hydrolysis method and the microwave-assisted protease method were combined, and those with a parent ion mass deviation of less than 10 ppm were retained as candidate variable region peptide fragment sequences.
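The 10 ppm filter in item 4 amounts to a relative mass error check. The sketch below assumes each de novo result is a dictionary with hypothetical keys `observed_mass` and `theoretical_mass`; this is not the pNovo output format, which would need to be parsed first.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass deviation in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

def keep_candidates(results, max_ppm=10.0):
    """Keep results whose absolute parent ion mass deviation is below max_ppm."""
    return [
        r for r in results
        if abs(ppm_error(r["observed_mass"], r["theoretical_mass"])) < max_ppm
    ]

# Hypothetical combined results from the three cleavage methods.
results = [
    {"peptide": "DIQMTQSPSS", "observed_mass": 1000.005, "theoretical_mass": 1000.0},
    {"peptide": "EVQLVESGGG", "observed_mass": 1000.020, "theoretical_mass": 1000.0},
]
kept = keep_candidates(results)
```

With these made-up masses the first entry deviates by about 5 ppm and is kept, while the second deviates by about 20 ppm and is discarded.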
4. Homologous search
1. A homology search was performed in the online antibody library abYsis using the determined light chain constant region sequence to obtain complete light chains that are highly homologous to the light chain constant region, with E values of less than 10^-5. The E value is the probability that, when two protein sequences of the same length have randomly arranged amino acid residues, the summed score of the aligned residue pairs under a scoring matrix reaches the observed total. The smaller the E value, the less likely it is that the total score arises by chance; a high total score together with a small E value indicates high homology between the two protein sequences.
2. The amino acid composition at each site of the variable region of the complete light chains was counted to obtain the light chain variable region amino acid probability distribution table.
3. Repeating the steps 1 and 2 on the heavy chain constant region sequence to obtain an amino acid probability distribution table of a heavy chain variable region.
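The counting in item 2 can be sketched as below. The sketch assumes the homologous variable regions have already been aligned to equal length, with "-" marking alignment gaps; the function name and the toy sequences are hypothetical. Each count is divided by the total number of sequences.

```python
from collections import Counter

def probability_table(aligned_sequences):
    """Per-position amino acid probabilities from equal-length aligned sequences.

    Returns a list with one dict per position, mapping amino acid -> probability.
    Gap characters ("-") are not reported, but the divisor is always the total
    number of sequences.
    """
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")
    total = len(aligned_sequences)
    table = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_sequences if seq[pos] != "-")
        table.append({aa: n / total for aa, n in counts.items()})
    return table

# Hypothetical aligned fragments of four homologous variable regions.
table = probability_table(["DIQM", "DIVM", "EIQM", "DIQL"])
```

For the toy input, position 0 gives D with probability 0.75 and E with 0.25, and position 1 gives I with probability 1.0.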
5. Amino acid frequency distribution assisted assembly
1. Adjacent amino acids with a probability greater than 60% were read from the light chain variable region amino acid probability distribution table to obtain the homology tags at those sites (amino acid composition pattern: within a window of 5 amino acids, 2 or more amino acids with a probability greater than 60% and 3 amino acids ranked among the top 5 by probability), as shown in FIG. 2.
2. Homologous tags are written using regular expressions, and candidate peptide fragments satisfying this composition pattern are retrieved as highly reliable variable region sequences from the de novo sequencing retention results. The de novo sequencing results were further filtered using multiple homology tags and a score table was obtained for each peptide fragment. In this table, peptide fragments are arranged in the order in which homologous tags appear, as shown in FIG. 3.
3. The high-reliability variable region sequences were converted into kmers with a default length K of 6 (K may be 5, 6, 7 or 8), and the number of occurrences of each kmer was counted. The probability of each kmer was obtained from the peptide fragment score table and converted into an assembly score S using the formula S = R × 10^P (R is the number of occurrences of the kmer and P is the amino acid probability value of the kmer). The kmer that satisfies the homology tag and has the highest score was selected as the assembly starting point. Kmers overlapping by K-1 amino acids were searched toward both ends, ambiguous amino acids were distinguished by kmer score, one amino acid was added at the end, and the process was repeated to obtain the contig for that homology tag starting point. This assembly process was repeated from different homology tag starting points to obtain a series of contigs. If the high-reliability variable region sequences are ambiguous at the same site, the choice is made using the probability distribution of the 20 amino acids at that site, the criterion being that the probability of the most probable amino acid at the site is more than 3 times that of the second most probable amino acid. The process was repeated to complete the assembly of the variable region, and the C-terminus of the assembled variable region was then extended by a further 4 to 7 amino acids to facilitate subsequent assembly with the constant region.
4. The light chain constant region sequence determined by the pFind database search and the light chain variable region sequence were subjected to secondary assembly, in which the additional 4 to 7 amino acids at the C-terminus of the variable region overlap the first 4 to 7 amino acids at the N-terminus of the constant region, giving the complete antibody light chain.
5. Repeating the steps 1-4 on the heavy chain constant region sequence to obtain the complete heavy chain of the antibody.
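The secondary assembly in item 4 is essentially a suffix-prefix join. The following is a minimal sketch under the assumption that the overlap must match exactly; the function name and the toy sequences are hypothetical.

```python
def join_by_overlap(variable, constant, min_overlap=4, max_overlap=7):
    """Join a variable region to a constant region through a shared overlap.

    Tries the longest overlap first (max_overlap down to min_overlap) and
    requires the C-terminal residues of `variable` to equal the N-terminal
    residues of `constant`. Returns the joined chain, or None if no overlap
    of the required length exists.
    """
    for n in range(max_overlap, min_overlap - 1, -1):
        if len(variable) >= n and len(constant) >= n and variable[-n:] == constant[:n]:
            return variable + constant[n:]
    return None

# Hypothetical fragments: the assembled variable region ends with 6 residues
# that also start the constant region found by database search.
chain = join_by_overlap("QGTKVEIKRTVAAP", "RTVAAPSVFIFPP")
```

Returning None when no 4 to 7 residue overlap is found makes a failed join explicit, so the variable region can be extended further or re-assembled rather than concatenated blindly.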
The herceptin antibody (light chain 214 amino acids, heavy chain 450 amino acids) was assembled by the above method. Compared with the correct antibody sequence, the accuracy was 99.5% on the light chain and 100% on the heavy chain, and the sequence coverage of both the heavy and light chains was 100%.
One of the difficulties in de novo sequencing of peptide fragments by mass spectrometry is distinguishing the isomeric residues leucine and isoleucine. Currently, the most commonly used approach distinguishes the two residues by the w-type ions of different masses generated by matrix-assisted laser desorption ionization tandem time-of-flight mass spectrometry (MALDI-TOF/TOF). This is an additional sequencing experiment, which significantly increases sequencing cost and is easily affected by the w-ion signal intensity. The homology probability distribution table obtained by data mining in the present scheme achieves the same effect, distinguishing the isomers leucine and isoleucine with 100% accuracy, without an additional complex and costly experiment.
Using the 3 amino-acid-level cleavage methods on the herceptin antibody, about 170,000 antibody peptide fragments were identified at the spectrum level. After these were processed into kmers and the occurrences counted, the average number of occurrences was 20, significantly higher than the average of 10 for kmers in mainstream antibody sequencing methods. The amino-acid-level reliability of kmers occurring more than 20 times increased from 60% for the mainstream methods to 90%.
Conventional mass-spectrometry-based antibody sequencing methods mostly rely on multi-enzyme digestion. However, when an antibody sequence contains several adjacent cleavage sites, overlap between the resulting sequences cannot be guaranteed, and the sequence therefore cannot be completely covered. In the invention, a microwave hydrolysis method and a microwave-assisted protease method are introduced in addition to multi-enzyme digestion. Because their cleavage sites are not fixed, a variety of peptide fragments is formed, the overlap between peptide fragments is significantly improved, and complete coverage of the antibody sequence can be achieved reliably.
The invention obtains an amino acid probability distribution table by homology alignment and uses it to extract homology tags that help select the best de novo sequencing peptide fragments. The constant region sequence identified by database search is used to perform a restricted homology search (restricted by species and sequence subtype) in the abYsis antibody database to obtain homologous sequences, which are then aligned by multiple sequence alignment. At each position, the number of occurrences of each of the 20 amino acids is divided by the total number of sequences to give the amino acid probability distribution at that position, and repeating this from the N-terminus to the C-terminus of the variable region gives the amino acid probability distribution table of the variable region. Currently, de novo sequencing does not identify peptide fragments with high accuracy, and there are three typical forms of error. Peptide fragments are aligned against the homology probability table, low-probability amino acids are replaced with high-probability amino acids to correct amino acid errors, and the peptide fragments with the best matching score are retained to obtain the preferred de novo sequencing result. This improves the accuracy of the peptide fragments used in assembly and hence the accuracy of antibody sequence identification.
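One way to turn a window of the probability table into a homology tag written as a regular expression is sketched below. The interpretation is an assumption: positions whose most probable amino acid exceeds 60% are fixed to that amino acid, the remaining positions accept any of the top 5 amino acids at that position, and a window qualifies only if it has at least 2 fixed positions. The function name and the toy window are hypothetical.

```python
import re

def tag_pattern(window, threshold=0.6, top_n=5, min_conserved=2):
    """Build a regular expression for one homology tag.

    window: list of dicts (amino acid -> probability), one per position.
    Returns a compiled pattern, or None if fewer than `min_conserved`
    positions exceed `threshold`.
    """
    parts = []
    conserved = 0
    for dist in window:
        ranked = sorted(dist, key=dist.get, reverse=True)
        if dist[ranked[0]] > threshold:
            parts.append(ranked[0])  # conserved site: fix the residue
            conserved += 1
        else:
            parts.append("[" + "".join(ranked[:top_n]) + "]")  # variable site
    if conserved < min_conserved:
        return None
    return re.compile("".join(parts))

# Hypothetical 5-position window from a light chain probability table.
window = [
    {"T": 0.9, "S": 0.1},
    {"I": 0.4, "L": 0.35, "V": 0.25},
    {"T": 0.5, "S": 0.3, "A": 0.2},
    {"C": 1.0},
    {"R": 0.55, "K": 0.45},
]
pattern = tag_pattern(window)
candidates = ["DRVTITCRASQ", "DRVTIPCRASQ"]
reliable = [p for p in candidates if pattern.search(p)]
```

For this window the pattern is `T[ILV][TSA]C[RK]`, so the first candidate is retained and the second, which has a proline at a position where only T, S or A is expected, is filtered out.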
The amino acid probability distribution is used to correct ambiguous amino acids and isomeric amino acids. Mass spectrometry distinguishes amino acids by their mass differences, and because de novo sequencing lacks reference sequence information, it cannot distinguish the isomers leucine and isoleucine directly from the original mass spectrum (amino acid mass information). Currently, leucine and isoleucine can be distinguished by the w-type ions of different masses generated by matrix-assisted laser desorption ionization tandem time-of-flight mass spectrometry, but this method has two major limitations: 1) when the signal intensity of the w-type ions is weak, or when w-type ions are not generated because of incomplete fragmentation, the method cannot distinguish leucine from isoleucine; 2) the use of w-type ions is limited by peptide length and sequence composition, and leucine and isoleucine can generally only be distinguished in peptide fragments of 7 to 15 amino acids. In the invention, the probability distributions of leucine and isoleucine are obtained by homology search against a large antibody sequence library, and an equivalent distinguishing effect is achieved without additional costly experiments.
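A minimal sketch of this correction at a single site follows. It applies the 3-fold criterion stated for ambiguous sites to the residues that the spectrum cannot tell apart; restricting the comparison to those candidate residues, and returning None when the criterion is not met, are assumptions made here, and the names are hypothetical.

```python
def resolve_ambiguous(candidates, distribution, min_ratio=3.0):
    """Choose between residues that mass spectrometry cannot distinguish.

    candidates: residues possible at the site, e.g. ["L", "I"].
    distribution: dict (amino acid -> probability) for the site, taken from
        the homology probability distribution table.
    Returns the most probable candidate if its probability is more than
    `min_ratio` times that of the next candidate, otherwise None.
    """
    ranked = sorted(candidates, key=lambda aa: distribution.get(aa, 0.0), reverse=True)
    best = distribution.get(ranked[0], 0.0)
    second = distribution.get(ranked[1], 0.0) if len(ranked) > 1 else 0.0
    if best > 0.0 and best > min_ratio * second:
        return ranked[0]
    return None

# Hypothetical site where homologous antibodies mostly carry leucine.
choice = resolve_ambiguous(["L", "I"], {"L": 0.8, "I": 0.1, "V": 0.1})
```

Here leucine is chosen because 0.8 is more than three times 0.1; with probabilities of 0.5 and 0.4 the function would return None and the site would stay unresolved.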
In sequence assembly, determining the start (N-terminus) and end (C-terminus) of a sequence is one of the most difficult steps. Currently, the main methods for determining the start of a protein sequence are Edman chemical degradation and chemical isotope labeling. Edman degradation determines the start of the sequence with high accuracy, but because the sequence is degraded one amino acid at a time, the reaction time is long, the cost is high, and the maximum identifiable length is only 30 amino acids. In chemical isotope labeling, determination of the N-terminus and C-terminus is affected by the signal intensity of the isotopically labeled peptide fragments, and internal peptide fragments of the antibody sequence are also easily labeled, so the start and end of the sequence cannot be reliably determined. In the present invention, the start and end of the antibody variable region sequence are determined using the homology tags collected after the homology search: the sequence at the start is encoded by the V gene, and the sequence at the end is the first 4 to 7 amino acids of the constant region determined by database search. The method needs no additional experiment; based on the homology information in antibody sequence data, it achieves the same accuracy as Edman degradation while greatly shortening the analysis time, from hours to within 10 minutes. Another difficulty in sequence assembly is determining the amount of overlap between peptide fragments and choosing the best overlapping peptide fragment when the overlap is ambiguous.
In the invention, the de novo sequenced peptide fragments are processed into kmers of uniform length by a sliding window method, and two different kmers are candidate assembly fragments for each other when they overlap by K-1 amino acids. When several candidate assembly fragments exist, ambiguous amino acids are distinguished by combining the homology probability score P and the number of occurrences R of each candidate into a comprehensive score S, which improves the accuracy of assembly.
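The scoring and extension can be sketched as a greedy procedure. This is an illustration under stated assumptions rather than the full method: P is supplied as a ready-made value per kmer (treated as 0 when missing), the starting kmer is passed in rather than chosen by homology tag, and each kmer is used at most once so that the loop always terminates. All names and toy peptides are hypothetical.

```python
from collections import Counter

def assembly_scores(peptides, k, probability):
    """Score every kmer as S = R * 10**P (R: occurrences, P: probability)."""
    counts = Counter(p[i:i + k] for p in peptides for i in range(len(p) - k + 1))
    return {kmer: r * 10 ** probability.get(kmer, 0.0) for kmer, r in counts.items()}

def assemble(seed, scores, k):
    """Greedily extend `seed` toward both ends using kmers overlapping by k-1."""
    contig, used = seed, {seed}
    # Extend toward the C-terminus: the next kmer must start with the last k-1 residues.
    while True:
        options = [km for km in scores if km not in used and km[:-1] == contig[-(k - 1):]]
        if not options:
            break
        best = max(options, key=scores.get)
        used.add(best)
        contig += best[-1]
    # Extend toward the N-terminus: the previous kmer must end with the first k-1 residues.
    while True:
        options = [km for km in scores if km not in used and km[1:] == contig[:k - 1]]
        if not options:
            break
        best = max(options, key=scores.get)
        used.add(best)
        contig = best[0] + contig
    return contig

# Hypothetical de novo peptides; "QMTQSP" and "QMTQAP" disagree at one site.
peptides = ["DIQMTQ", "QMTQSP", "QMTQSP", "QMTQAP", "TQSPSS"]
scores = assembly_scores(peptides, 4, probability={})
contig = assemble("QMTQ", scores, 4)
```

With no probabilities supplied, the branch seen twice wins and the contig is `DIQMTQSPSS`. Passing `probability={"MTQA": 1.0}` raises the score of the other branch from 1 to 10, and the contig becomes `DIQMTQAP`, which shows how the homology probability can override raw occurrence counts.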
In view of the foregoing, it will be appreciated that the invention includes but is not limited to the foregoing embodiments, and that any equivalent substitution or partial modification made within the spirit and principles of the invention shall fall within the scope of protection of the invention.