CN116779037B

CN116779037B - A method for antibody sequence identification based on de novo sequencing and homology tag graph theory

Info

Publication number: CN116779037B
Application number: CN202310815532.XA
Authority: CN
Inventors: 张永谦; 王旭; 李诺敏
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-07-27
Filing date: 2023-07-04
Publication date: 2025-08-05
Anticipated expiration: 2043-07-04
Also published as: CN116779037A

Abstract

The present invention relates to a method for antibody sequence identification based on de novo sequencing and homology tag graph theory, belonging to the field of bioanalysis technology. The method first involves cleaving the antibody to be identified, and subjecting the resulting peptide solution to liquid chromatography-mass spectrometry analysis. The light and heavy chain constant region peptide sequences are then determined through retrieval, and candidate variable region peptide sequences are determined through de novo sequencing. Finally, a dynamic windowing method, combined with a species-specific antibody homology library, is developed to effectively identify de novo sequencing errors and distinguish between isomeric forms of leucine and isoleucine, thereby improving assembly accuracy and stability.

Description

Antibody sequence identification method based on de novo sequencing and homologous tag graph theory

Technical Field

The invention relates to an antibody sequence identification method based on de novo sequencing and homology tag graph theory, and belongs to the technical field of biological analysis.

Background

The amino acid sequence and post-translational modifications of an antibody are determining factors for its specificity and effectiveness as a drug. In the development stage of monoclonal antibody drugs, the antibody gene sequences are mainly identified by DNA sequencing, and the corresponding amino acid sequences are obtained by codon decoding. However, some antibodies used clinically are derived from immune hosts, commercial antibodies or hybridoma cell products, and for many of these no cDNA sequence is available. In such cases, the amino acid sequence must be identified directly at the protein level to meet the antibody sequencing requirements of clinical trials.

In proteomics, mass spectrometry-based protein sequencing methods fall into two main types: database searching and de novo sequencing. Database searching can only identify protein sequences that already exist in the database, whereas de novo sequencing does not depend on existing sequence information and deduces unknown protein sequences directly from the fragment ion information in the secondary spectrum. Because the variable regions of antibodies are highly variable, antibody sequence data are typically missing from existing protein sequence databases, so database searching generally gives low sequence coverage for antibodies. Antibody sequences, or other protein sequences not present in the database, can therefore only be identified by de novo sequencing.

In order to obtain the most abundant sequence fragments, the current mainstream approach combines multiple enzymatic digestions with multiple mass spectrometry fragmentation modes to collect high-accuracy, high-quality mass spectrometry data, analyzes the data with de novo sequencing software, and then applies a purpose-designed sequence assembly algorithm to obtain the complete antibody sequence.

To address the insufficient coverage and ambiguous spectrum interpretation encountered when assembling peptide sequences, the PEAKS proteomics team proposed in 2016 an integrated system, ALPS, which was the first to automate the assembly of full-length monoclonal antibodies. ALPS integrates de novo peptide sequencing, mass spectrum intensity, positional confidence scores and database error-correction information from treatment with 3 enzymes (Asp-N, chymotrypsin, trypsin) into a weighted de Bruijn graph for protein sequence assembly. The ALPS system achieves 100% coverage and 96.64%-100% assembly accuracy, and the longest assembly it achieves is 441 amino acids. However, its assembly performance is susceptible to missing overlapping peptides, de novo sequencing errors and sequence homology. In 2020, Yang Chao et al. developed a method for determining the complete sequence of a protein based on continuous digestion with non-specific proteases. The method builds a continuous digestion device and digests the protein continuously with several non-specific proteases. By exploiting the non-specificity of the cleavage sites, different digestion times and the complementarity of the peptide fragments produced by different proteases, it increases the variety and overlap of the peptide fragments, and a protein sequence assembly algorithm then assembles the peptide sequences obtained by liquid chromatography-tandem mass spectrometry (LC-MS/MS) and de novo sequencing. The method was applied to the complete sequence determination of bovine serum albumin and the monoclonal antibody Herceptin; without distinguishing leucine from isoleucine, the sequencing accuracy reached 100% for bovine serum albumin and the Herceptin light chain, and 99.7% for the Herceptin heavy chain.

However, de novo sequencing often cannot distinguish peptide fragments of very similar mass. Three cases are common: 1) the isomers I and L; 2) amino acid combinations of similar mass, such as AG = Q and GG = N; and 3) identical amino acid compositions in different orders, such as AF and FA, or KR and RK. Most de novo sequencing errors arise from these 3 cases. Because assembly relies on overlap between fragments, erroneous peptide fragments reported by de novo sequencing pose a significant challenge to full-length coverage. Selecting the correct peptide fragment from among ambiguous candidates and correcting individual amino acid positions is therefore the primary prerequisite for completing the assembly.
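The mass coincidences listed above can be checked with a short calculation. The following is a minimal sketch, assuming standard monoisotopic residue masses rounded to five decimals; the helper names `mass` and `ppm_diff` are illustrative and not part of the patent. AG/Q and GG/N have the same elemental composition, so the sub-ppm differences printed here are only rounding in the table.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "N": 114.04293, "Q": 128.05858,
    "I": 113.08406, "L": 113.08406, "F": 147.06841,
    "K": 128.09496, "R": 156.10111,
}

def mass(seq):
    """Sum of residue masses for a short amino acid string."""
    return sum(RESIDUE_MASS[aa] for aa in seq)

def ppm_diff(a, b):
    """Absolute mass difference between two residue strings, in ppm of b."""
    return abs(mass(a) - mass(b)) / mass(b) * 1e6

# The three ambiguity classes named in the text.
for a, b in [("I", "L"), ("AG", "Q"), ("GG", "N"), ("AF", "FA"), ("KR", "RK")]:
    print(f"{a} vs {b}: {ppm_diff(a, b):.2f} ppm")
```

Every pair falls far inside a ±20 ppm tolerance, which is why precursor mass alone cannot separate them.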

Disclosure of Invention

In view of the above, it is an object of the present invention to provide a method for identifying an antibody sequence based on de novo sequencing and homology tag graph theory.

In order to achieve the above purpose, the technical scheme of the invention is as follows:

A method of identifying an antibody sequence based on de novo sequencing and homology tag graph theory, the method comprising the steps of:

(1) Cleaving the antibody to be identified at the amino acid level to obtain a peptide fragment solution;

(2) Subjecting the obtained peptide fragment solution to liquid chromatography separation and mass spectrometry analysis to obtain mass spectrometry files;

(3) Performing a pFind search on the obtained mass spectrometry files to determine the light chain constant region peptide sequence and the heavy chain constant region peptide sequence of the antibody to be identified, and performing pNovo de novo sequencing on the obtained mass spectrometry files to determine the candidate variable region peptide sequences of the antibody to be identified;

(4) Performing homology searches with the light chain constant region peptide sequence and the heavy chain constant region peptide sequence of the antibody to be identified respectively, to obtain complete light chains and complete heavy chains that are highly homologous to the light chain constant region and the heavy chain constant region;

(5) Selecting homology tags from the light chain variable region amino acid probability distribution table and the heavy chain variable region amino acid probability distribution table respectively, wherein each homology tag is a sequence of 5-8 amino acids that comprises 2 or more amino acids with a probability greater than 60% and 3 amino acids ranked in the top 5 by probability;

(6) Splitting the highly reliable variable region peptide sequences into small peptide sequences (kmers) of length K = 5-8, counting the number of occurrences and the probability of each small peptide sequence to obtain its assembly score, taking the small peptide sequence that satisfies the homology tag and has the highest score as the starting point of assembly, searching at both ends for small peptides that overlap by K-1 amino acids, adding one amino acid at the end according to the assembly score, and repeating the assembly to obtain the light chain variable region peptide sequence and the heavy chain variable region peptide sequence;

(7) Performing secondary assembly of the light chain constant region peptide sequence and the heavy chain constant region peptide sequence from step (3) with the light chain variable region peptide sequence and the heavy chain variable region peptide sequence respectively, to obtain the complete antibody light chain and heavy chain.
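The secondary assembly in step (7) amounts to joining two strings on a short shared segment. The following is a minimal sketch, assuming the 4 to 7 residue overlap between the variable region C-terminus and the constant region N-terminus that the description specifies; the function name `join_by_overlap` and the toy sequences are illustrative only.

```python
def join_by_overlap(variable, constant, min_overlap=4, max_overlap=7):
    """Join a variable region to a constant region on a shared segment.

    Tries the longest allowed overlap first and returns None when the
    C-terminus of `variable` does not match the N-terminus of `constant`.
    """
    for n in range(max_overlap, min_overlap - 1, -1):
        if len(variable) >= n and len(constant) >= n and variable[-n:] == constant[:n]:
            return variable + constant[n:]
    return None

# Toy fragments, not real assembly output.
print(join_by_overlap("EIKRTVAAP", "RTVAAPSVFIF"))  # shared segment: RTVAAP
```

Returning None rather than concatenating blindly makes a failed junction visible, so it can be inspected instead of silently producing a chain with a duplicated or missing segment.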

Further, in step (1), one or more of the protease method, the microwave hydrolysis method and the microwave-assisted protease method are used to cleave the antibody to be identified at the amino acid level, and peptide fragments obtained by different methods overlap by 4 or more amino acids.

Further, in step (1), the antibody to be identified is digested with specific proteases and non-specific proteases respectively, wherein the total number of specific and non-specific proteases is 3 or more.

Further, in step (2), the peptide solution is separated by liquid chromatography with mobile phase gradient elution on a C18 reversed-phase column; during mass spectrometry, the top 20 parent ions in the primary spectrum are selected for secondary spectrum analysis, and the parent ions are fragmented by higher-energy collisional dissociation (HCD).

Further, in step (3), the database searched by pFind is the complete Swiss-Prot library; during the search, the parent ion tolerance and the fragment tolerance are both ±20 ppm, and from the search results the amino acid sequences that are close in length to the antibody light chain constant region and the antibody heavy chain constant region and have the highest coverage are selected as the light chain constant region peptide sequence and the heavy chain constant region peptide sequence of the antibody to be identified.

Further, in step (3), pNovo de novo sequencing is performed on the obtained mass spectrometry files, and peptide fragments with a parent ion mass deviation of less than 10 ppm are retained as the candidate variable region peptide sequences of the antibody to be identified.

Further, in step (4), an online antibody library abYsis is used for homology searching.

Further, in step (4), the homology search retains highly homologous complete light and heavy chains whose E value against the light chain constant region or the heavy chain constant region is less than 10^-5. The E value is the probability that, for protein sequences of the same length whose amino acid residues are arranged randomly, the sum of the scores of each pair of amino acid residues, calculated from a scoring matrix, reaches the observed total.

Further, in step (6), the assembly score of each small peptide fragment is S = R × 10^P, where R is the number of occurrences of the small peptide fragment and P is the amino acid probability of the small peptide fragment.

Further, in step (6), if the same site is ambiguous during assembly, the amino acid with the highest probability in the probability distribution table is selected, provided that its probability is more than 3 times that of the second-best amino acid.

Further, in step (6), during assembly the C-terminus of the variable region is extended by an additional 4 to 7 amino acids so that it overlaps the first 4 to 7 amino acids at the N-terminus of the constant region.

Further, the antibody to be identified is a mixture of polyclonal antibodies with different specificities in serum, a purified monoclonal antibody, a monoclonal antibody secreted by hybridoma cells, or a monoclonal antibody secreted by plasma cells.

Advantageous effects

The invention provides an antibody sequence identification method based on de novo sequencing and homology tag graph theory. The antibody to be identified is first cleaved, and the resulting peptide solution is analyzed by liquid chromatography-mass spectrometry. The peptide sequences of the light chain and heavy chain constant regions are then determined by database search, and the candidate peptide sequences of the variable regions are determined by de novo sequencing. Finally, a dynamic window method combined with a species-specific antibody homology library is used to identify de novo sequencing errors effectively and to distinguish the isomers leucine and isoleucine, improving assembly accuracy and stability.

In the analysis, in which the antibody constant region is determined by database search and the antibody variable region by de novo sequencing, the antibody is treated with 3 protein cleavage methods (the protease method, the microwave hydrolysis method and the microwave-assisted protease method), which ensures high overlap between the antibody peptide sequences and improves coverage of the antibody protein sequence. In addition, specifying the protease and setting the parent ion and fragment tolerances to 20 ppm ensures amino-acid-level accuracy of the peptide fragments.

Key parameters in the sequence assembly process are the amino acid probability threshold of the homology tag and the kmer size. The probability threshold of the homology tag is set to greater than 60% so that highly conserved sequences are extracted and used to select among the de novo sequenced peptide fragments, improving the accuracy of sequence assembly. The kmer size is chosen as 5, 6, 7 or 8, and the sizes have complementary effects. When the kmer is small, the assembly is easily affected by repeated peptide fragments, and erroneous repeated amino acid segments appear in the result, reducing accuracy; on the other hand, smaller kmers increase the assembled length and thus the coverage. When the kmer is large, the assembly is less affected by repeated peptide fragments and accuracy improves; however, larger kmers filter out some of the shorter sequences and reduce the overlap between kmers, giving shorter assembled sequences and lower coverage.
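To make the role of the 60% threshold concrete, the sketch below turns a window of a probability table into a regular-expression tag. It is one possible reading of the tag rule, not the patented procedure: positions whose most probable amino acid exceeds the threshold become literals, other positions become a character class of their most probable residues, and a window needs at least 2 conserved positions. The function name `tag_regex`, the `top_n` parameter and the toy table are assumptions made for illustration.

```python
import re

def tag_regex(table, start, width=5, threshold=0.6, top_n=5):
    """Build a regex tag from `width` columns of a per-position probability table.

    `table` is a list of dicts mapping amino acid -> probability.
    Returns None when fewer than 2 positions in the window are conserved.
    """
    parts, conserved = [], 0
    for column in table[start:start + width]:
        ranked = sorted(column, key=column.get, reverse=True)
        if column[ranked[0]] > threshold:
            parts.append(ranked[0])          # conserved position: literal
            conserved += 1
        else:
            parts.append("[" + "".join(ranked[:top_n]) + "]")  # variable position
    if conserved < 2:
        return None
    return re.compile("".join(parts))

# Toy probability table (hypothetical values).
toy_table = [
    {"W": 0.95, "F": 0.05},
    {"Y": 0.70, "F": 0.20, "H": 0.10},
    {"Q": 0.90, "H": 0.10},
    {"Q": 0.80, "L": 0.20},
    {"K": 0.50, "R": 0.30, "T": 0.20},
]
tag = tag_regex(toy_table, start=0)
print(tag.pattern)
print(bool(tag.search("AWYQQKPGK")), bool(tag.search("WFQQK")))
```

Raising the threshold yields fewer literal positions and therefore looser tags, while lowering it admits less conserved positions as literals.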

The method is based on a homology probability distribution table obtained by data mining. The table is used to align the peptide fragments and to replace low-probability amino acids with high-probability amino acids, which corrects ambiguous and isomeric amino acids, while the peptide fragments with the best matching score are retained to give the preferred de novo sequencing result. This improves the accuracy of the peptide fragments taking part in assembly and hence the accuracy of antibody sequence identification.

The method is applicable to a wide range of antibody samples, including mixtures of polyclonal antibodies with different specificities in serum, purified monoclonal antibodies, monoclonal antibodies secreted by hybridoma cells and monoclonal antibodies secreted by plasma cells. Antibody proteins from these different sources can be cleaved by the amino-acid-level cleavage methods to form overlapping peptide fragments, and the complete antibody sequence is obtained after assembly.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 shows homologous tags obtained from a probability distribution table according to the method of the present invention.

FIG. 3 is an assembly drawing of peptide fragments of the method of the invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples.

An antibody (Ig) is a large molecular weight immunoglobulin (about 150 kDa) of about 10nm in size and similar in structure to a Y-shape. In humans and most mammals, an antibody unit consists of four polypeptide chains, two heavy chains (about 450-550 AA) of longer length and two light chains (about 214 AA) of shorter length, and within the same antibody, the two heavy chain sequences and the two light chain sequences are identical in composition. Antibodies can also be divided into variable and constant regions depending on the variability of the composition of the peptide chain. The antibody can be divided into 4 main regions, namely a light chain constant region (CL), a light chain variable region (VL), a heavy chain constant region (CH) and a heavy chain variable region (VH), by combining the length characteristics of peptide chains. The variable region amino acid sequences of the antibody light and heavy chains vary greatly in composition and near the N-terminus, accounting for 1/4 and 1/2 of the length of the heavy and light chains, respectively. The antibody light and heavy chain constant region amino acid sequences are relatively stable in composition and near the C-terminus, occupying 3/4 and 1/2 of the length of the heavy and light chains, respectively.

The variable regions of the light and heavy chains (VL and VH) each comprise 3 regions of highly variable amino acid composition and order, the complementarity determining regions (CDRs), namely CDR1, CDR2 and CDR3, of which CDR3 varies the most. The variable regions also comprise 4 framework regions (FRs) of relatively stable amino acid composition; the FRs and CDRs alternate to form the variable region. The 3 CDRs of the VH are located at amino acids 29-31, 49-58 and 95-102, respectively, while the 3 CDRs of the VL are located at amino acids 28-35, 49-56 and 91-98, respectively. The constant regions of the heavy and light chains are referred to as CH and CL, respectively. The CL lengths of different types (kappa or lambda) of Ig are substantially identical, but the CH lengths of different classes of Ig differ; for example, IgG, IgA and IgD include CH1, CH2 and CH3, while IgM and IgE include CH1, CH2, CH3 and CH4.

Because the amino acid composition of the constant regions is relatively stable, these sequences are present in the Swiss-Prot protein sequence database; in addition, database searching is more accurate than de novo sequencing. Combining these two facts, the constant regions can be identified by database searching. The CDRs in the variable region, however, are highly variable and are typically absent from protein sequence databases, so database searching is not suitable for them. De novo sequencing does not rely on a sequence database and can therefore identify the variable region sequence. However, de novo sequencing has low accuracy, and its results contain many ambiguous peptide sequences, so reliable additional information is needed to select the reliable peptide sequences. The constant region sequence is therefore used for a homology search against an antibody database to obtain multiple homologous heavy chain and light chain sequences; the amino acid composition at each position of the variable regions of these homologous sequences is counted to obtain an amino acid frequency distribution table for the variable region, and ambiguous peptide fragments can be discriminated using this table so that the most reliable candidate variable region sequences are selected. The technical flow of the scheme is shown in FIG. 1.

1. Sample processing (wet experiment)

1. Antibody enzymolysis

(1) 20 μg of Herceptin antibody was taken as one portion, and 5 portions were prepared. Each portion was denatured at 95 °C for 10 min in 2% (w/v) aqueous sodium deoxycholate, 200 mM Tris-HCl and 10 mM tris(2-carboxyethyl)phosphine at pH 8.0, and then incubated at 35 °C for 30 min for reduction. Finally, the samples were alkylated by adding iodoacetic acid to a final concentration of 40 mM and incubating at room temperature (25 °C) for 45 min in the dark, and the antibody solution samples were stored at -4 °C.

(2) The antibody was cleaved at the amino acid level by 3 different methods: the protease method, the microwave hydrolysis method and the microwave-assisted protease method. Used together, the three methods ensure that the peptide fragments formed after cleavage overlap by at least 4 amino acids, improving sequencing accuracy and the coverage of antibody sequence assembly. All three amino-acid-level cleavage methods start from the samples prepared in step (1); the specific steps are as follows:

1) Protease method: 5 portions of 3 μg antibody solution sample were taken, and each was digested at 37 °C with one of aspartic protease, trypsin, chymotrypsin, lysine protease and glutamate protease. The enzyme-to-antibody mass ratio was 1:50. The digestion mixture was dissolved in 100 μL of 50 mM ammonium bicarbonate solution, and after standing for 4 hours the 5 digest solutions were refrigerated at -4 °C until use. This method combines specific proteases (trypsin, aspartic protease, glutamate protease and lysine protease) with a non-specific protease (chymotrypsin) so that the cleavage sites differ and overlapping peptide fragments are formed, improving the coverage of the antibody sequence.

2) Microwave hydrolysis method: 1 portion of 3 μg antibody solution sample was placed in a glass vial, HCl was added to a final concentration of 3 M, the vial was placed on ice in a beaker, and the sample was microwave heated at 200 W for 4 minutes in a microwave oven (pausing every 1 minute to replenish the ice, so that the peptide fragments of the antibody are not denatured by overheating). The microwave hydrolysate solution was stored at -4 °C.

3) Microwave-assisted protease method: 1 portion of 3 μg antibody solution sample was placed in a glass vial, HCl was added to a final concentration of 3 M, the vial was placed in a beaker with ice, and the sample was microwave heated at 200 W for 2 minutes in a microwave oven (pausing every 1 minute to replenish the ice). The microwave hydrolysate was adjusted to pH 8 with 0.5 M aqueous NaOH, and 0.3 μg of trypsin was added. The mixture was microwave heated at 100 W for 6 minutes (pausing every 1 minute to replenish the ice) to carry out microwave-assisted digestion of the antibody; after the reaction, 1 M HCl was added to adjust the pH to 2 and terminate the digestion. The microwave-assisted digest solution was refrigerated at -4 °C until use.

(3) 2 μL of formic acid was added to each of the product solutions of the three amino-acid-level cleavage methods from step (2), and the solutions were centrifuged at 14000 g for 20 min to remove sodium deoxycholate.

(4) After centrifugation, the supernatant was collected and desalted on 30 μm Oasis HLB 96 well plates.

(5) The Oasis HLB adsorbent was activated with 100v% acetonitrile and equilibrated with 10v% aqueous formic acid.

(6) After the digested peptide fragments were bound to the adsorbent, the adsorbent was eluted twice with 10 v% aqueous formic acid and then with 100 ml of 50 v% acetonitrile to 5 v% formic acid.

(7) The eluted peptide fragment solution was vacuum dried and stored.

2. Liquid phase separation and mass spectrometry

(1) The digested peptide samples were dissolved in 0.1 v% formic acid and analyzed online on an Agilent 1290 UHPLC coupled to an Orbitrap Q-Exactive HFX mass spectrometer; 1 μg of sample was loaded each time at a loading speed of 0.8 μg/min.

(2) The peptide fragments were separated by a 15cm C18 reverse phase chromatography column (inner diameter 100 μm,1.9 μm resin) and an elution gradient of 120 min. Mobile phase a was 0.1v% trifluoroacetic acid and 2v% acetonitrile, and mobile phase B was 0.1v% trifluoroacetic acid and 98v% acetonitrile. A120 min gradient (mobile phase B: 3v% at 0min, 5v% at 5min, 22v% at 95min, 30v% at 105min, 90v% at 115min, 90v% at 120 min) was used at a flow rate of 450 nL/min.

(3) The parameters of the Orbitrap Q-Exactive mass spectrometer were as follows: spray voltage 2100 V; ion transfer tube temperature 300 °C. For the primary spectrum scan, the automatic gain control target was 1 × 10^5, the scanned m/z range was 350-2000 and the resolution was 3 × 10^4. The top 20 parent ions in the primary spectrum were selected for secondary spectrum analysis and fragmented by higher-energy collisional dissociation (HCD), with the fragmentation energy set to 30%.

2. Determination of the constant regions by database search (dry experiment, from this section onward), specifically:

1. The raw mass spectrometry files from the 3 cleavage methods are each searched with pFind. The sequence database searched is the complete Swiss-Prot library.

2. The main parameters of the database search are: sequence database, Swiss-Prot protein sequence database; protease type, 1) protease method: aspartic protease, trypsin, chymotrypsin, lysine protease or glutamate protease (matching the protease used in the wet experiment), 2) microwave hydrolysis method: no enzyme, 3) microwave-assisted protease method: no enzyme and trypsin (matching the protease used in the wet experiment); maximum number of missed cleavages, 3; open search, no; parent ion tolerance, ±20 ppm; fragment tolerance, ±20 ppm; fixed modification, carbamidomethylation; variable modification, methionine oxidation; false discovery rate, 1%; peptide mass range, 600-8000; peptide length, 6-40.

3. From the search results, the amino acid sequence results that are close in length to the antibody light chain constant region (about 107 amino acid residues) and the antibody heavy chain constant region (about 330 amino acid residues) and have the highest coverage are selected; the sequence names of these results are checked to determine the species, type and sequence composition of the light chain and heavy chain constant regions.

3. The variable region candidate sequence is obtained by de novo sequencing, specifically:

1. Mass spectrometry data format conversion software is used to extract, from the raw mass spectrometry files, the mgf-format files corresponding to the 5 enzymes.

2. The extracted mgf-format mass spectrometry data are used as the input to pNovo for de novo sequencing. De novo sequencing is performed separately for the 3 protein cleavage methods, with all parameters other than the cleavage type kept the same.

3. The main parameters of pNovo de novo sequencing are: fixed modification, carbamidomethylation; variable modification, methionine oxidation; parent ion tolerance, ±20 ppm; fragment tolerance, ±20 ppm; protease type, 1) protease method: aspartic protease, trypsin, chymotrypsin, lysine protease or glutamate protease (matching the protease used in the wet experiment), 2) microwave hydrolysis method: no enzyme, 3) microwave-assisted protease method: no enzyme and trypsin (matching the protease used in the wet experiment).

4. The peptide fragments obtained by de novo sequencing for the protease method, the microwave hydrolysis method and the microwave-assisted protease method are combined, and those whose parent ion mass deviation is less than 10 ppm are retained as candidate variable region peptide fragment sequences.
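The 10 ppm retention rule in item 4 is a simple relative-error filter. The following is a minimal sketch, assuming each de novo result has been parsed into a record with observed and theoretical parent ion masses; the field names `observed` and `theoretical` and the toy records are hypothetical and do not reflect pNovo's output format.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass deviation of the parent ion in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

def keep_candidates(peptides, max_ppm=10.0):
    """Retain peptides whose absolute parent ion deviation is below max_ppm."""
    return [
        p for p in peptides
        if abs(ppm_error(p["observed"], p["theoretical"])) < max_ppm
    ]

# Toy records: the first deviates by about 5 ppm, the second by about 20 ppm.
toy_results = [
    {"sequence": "DIQMTQSPSSLSASVGDR", "observed": 1000.005, "theoretical": 1000.0},
    {"sequence": "DLQMTQSPSSLSASVGDR", "observed": 1000.020, "theoretical": 1000.0},
]
kept = keep_candidates(toy_results)
print([p["sequence"] for p in kept])
```

Note that this filter is stricter than the ±20 ppm tolerance used during sequencing, so it removes the least precise half of the tolerance window before assembly.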

4. Homologous search

1. A homology search is performed in the online antibody library abYsis using the determined light chain constant region sequence to obtain highly homologous complete light chains whose E value against the light chain constant region is less than 10^-5. The E value is the probability that, for protein sequences of the same length whose amino acid residues are arranged randomly, the sum of the scores of each pair of amino acid residues, calculated from a scoring matrix, reaches the observed total. The smaller the E value, the less likely that total is to arise by chance; a high score sum together with a small E value indicates high homology between the two protein sequences.

2. The amino acid composition at each site of the variable region in the complete light chains is counted to obtain the light chain variable region amino acid probability distribution table.

3. Steps 1 and 2 are repeated for the heavy chain constant region sequence to obtain the heavy chain variable region amino acid probability distribution table.
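Counting the amino acid composition at each site, as in step 2, can be sketched as follows. This is an illustration under a simplifying assumption that the source does not state: the homologous variable regions are taken to be already aligned to equal length, so that index i refers to the same site in every sequence. The function name `probability_table` and the toy sequences are hypothetical.

```python
from collections import Counter

def probability_table(aligned_regions):
    """Per-site amino acid frequencies for equal-length, pre-aligned sequences.

    Returns a list with one dict per site, mapping amino acid -> probability.
    """
    length = len(aligned_regions[0])
    if any(len(seq) != length for seq in aligned_regions):
        raise ValueError("sequences must be aligned to the same length")
    table = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_regions)
        total = sum(counts.values())
        table.append({aa: n / total for aa, n in counts.items()})
    return table

# Toy homologous variable-region fragments (illustrative only).
toy_regions = ["DIQMT", "DIVMT", "EIQLT", "DIQMS"]
table = probability_table(toy_regions)
print(table[0])  # site 1: D in 3 of 4 sequences, E in 1 of 4
```

Each column of the resulting table is what the later steps consult, both to define homology tags and to choose between ambiguous residues at a site.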

5. Amino acid frequency distribution assisted assembly

1. Homology tags are obtained by reading, from the light chain variable region amino acid probability distribution table, adjacent amino acids with a probability greater than 60% (amino acid composition pattern: within a window of 5 amino acids, 2 or more amino acids with a probability greater than 60% and 3 amino acids ranked in the top 5 by probability), as shown in FIG. 2.

2. The homology tags are written as regular expressions, and candidate peptide fragments that satisfy the composition pattern are retrieved from the retained de novo sequencing results as highly reliable variable region sequences. The de novo sequencing results are further filtered using multiple homology tags, and a score table is obtained for the peptide fragments. In this table, peptide fragments are arranged in the order in which the homology tags appear, as shown in FIG. 3.

3. The highly reliable variable region sequences are split into kmers of default length K = 6 (K may be 5, 6, 7 or 8) and the number of occurrences of each kmer is counted. The probability for each kmer is obtained from the peptide fragment score table and converted into an assembly score S using the formula S = R × 10^P (R is the number of occurrences of the kmer and P is the amino acid probability value of the kmer). The kmer that satisfies the homology tag and has the highest score is selected as the starting point of assembly; kmers overlapping by K-1 amino acids are searched for at both ends, ambiguous amino acids are discriminated by kmer score, one amino acid is added at the end, and the process is repeated to obtain the contig for that homology tag starting point. The assembly process is repeated from different homology tag starting points to obtain a series of contigs. If the highly reliable variable region sequences are ambiguous at a site, the probability distribution of the 20 amino acids at that site is used for selection, the criterion being that the probability of the most probable amino acid at the site is more than 3 times that of the second-best amino acid. The process is repeated to complete the assembly of the variable region, and the C-terminus of the assembled variable region is then extended by a further 4 to 7 amino acids to facilitate subsequent assembly with the constant region.

4. The light chain constant region sequence determined by the pFind database search is then assembled with the light chain variable region sequence in a second assembly step: the 4 to 7 additional amino acids extended at the C-terminus of the variable region overlap the first 4 to 7 amino acids at the N-terminus of the constant region, yielding the complete antibody light chain.
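The overlap-based joining of the variable and constant regions can be illustrated with a short sketch. The sequences here are short made-up fragments rather than real antibody chains, and join_with_overlap is a hypothetical helper name; exact string matching of the overlap is an assumption of this sketch.

```python
def join_with_overlap(variable, constant, min_overlap=4, max_overlap=7):
    """Join two regions when the variable C-terminal suffix equals the
    constant N-terminal prefix over 4 to 7 residues; longest overlap first."""
    for n in range(max_overlap, min_overlap - 1, -1):
        if len(variable) >= n and variable[-n:] == constant[:n]:
            return variable + constant[n:]
    return None  # no acceptable overlap found

# Made-up fragments sharing the 5-residue overlap "RTVAA".
variable_region = "GTKVEIKRTVAA"
constant_region = "RTVAAPSVFIFPP"
light_chain = join_with_overlap(variable_region, constant_region)
```

Returning None when no overlap of 4 to 7 residues exists keeps a failed join explicit instead of silently concatenating the two regions.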

5. Steps 1 to 4 are repeated for the heavy chain constant region sequence to obtain the complete antibody heavy chain.

The method was used to assemble the Herceptin antibody (light chain of 214 amino acids, heavy chain of 450 amino acids). Compared with the correct antibody sequence, the accuracy was 99.5% on the light chain and 100% on the heavy chain, and the sequence coverage of both the heavy and light chains was 100%.

One of the difficulties in de novo sequencing of peptide fragments by mass spectrometry is distinguishing the isomeric residues leucine and isoleucine. Currently, the most commonly used approach distinguishes the two residues by the w-type ions of different masses generated by matrix-assisted laser desorption/ionization tandem time-of-flight mass spectrometry (MALDI-TOF/TOF). This is an additional sequencing experiment that significantly increases sequencing cost and is easily affected by the w-ion signal intensity. The homology probability distribution table obtained by data mining in this scheme achieves the same effect, distinguishing the isomers leucine and isoleucine with 100% accuracy, without additional complex experiments or high cost.

Using three methods of cleaving the antibody at the sequence level, about 170,000 spectrum-level antibody peptide fragments were identified for the Herceptin antibody. After these peptide fragments were processed into kmers and the occurrences counted, the average number of occurrences was 20, clearly higher than the 10 occurrences obtained with mainstream antibody sequencing methods. The amino-acid-level reliability of kmers occurring more than 20 times increased from 60% for mainstream methods to 90%.

Conventional mass spectrometry-based antibody sequencing methods mostly rely on multi-enzyme digestion. However, when several adjacent cleavage sites exist in an antibody sequence, overlap between the resulting sequences cannot be guaranteed, so the sequence cannot be completely covered. By introducing, in addition to multi-enzyme digestion, a microwave hydrolysis method and a microwave-assisted enzymolysis method, the invention makes the cleavage sites non-fixed, so that diverse peptide fragments are formed, the overlap among peptide fragments is markedly improved, and complete coverage of the antibody sequence can be achieved reliably.

The invention uses the amino acid probability distribution table obtained after homology alignment and extracts homology tags from it to help select the preferred de novo sequencing peptide results. The constant region sequence identified by database search is used for a restricted homology search (restricted by species and sequence subtype) in the abYsis antibody database to obtain homologous sequences, which are then aligned by multiple sequence alignment. At each position, the number of occurrences of each of the 20 amino acids is divided by the total number of sequences to give the amino acid probability distribution at that position; repeating this from the N-terminus to the C-terminus of the variable region yields the amino acid probability distribution table of the variable region. Current de novo sequencing methods do not identify peptide fragments with high accuracy, and there are three typical forms of error. Peptide fragments are aligned against the homology probability table, low-probability amino acids are replaced with high-probability ones to correct amino acid errors, and the peptide fragments with the best matching score are retained to obtain the preferred de novo sequencing results. This improves the accuracy of the peptide fragments used in assembly and therefore the accuracy of antibody sequence identification.
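The construction of the per-position probability table can be sketched as follows. The aligned sequences are invented for illustration, the function name probability_table is hypothetical, and treating "-" as the gap symbol (excluded from the counts but still included in the total number of sequences) is an assumption of this sketch.

```python
from collections import Counter

def probability_table(aligned, gap="-"):
    """For each alignment column, divide the count of each amino acid
    by the total number of sequences. Gap characters are not counted."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    total = len(aligned)
    table = []
    for column in zip(*aligned):
        counts = Counter(aa for aa in column if aa != gap)
        table.append({aa: n / total for aa, n in counts.items()})
    return table

# Made-up aligned homologous sequences (already multiply aligned).
aligned_homologs = ["DIQMT", "DIVMT", "EIQLT", "DIQM-"]
table = probability_table(aligned_homologs)
```

Each entry of the resulting list is the amino acid probability distribution of one variable region position, read from the N-terminus to the C-terminus.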

The amino acid probability distribution is used to correct ambiguous amino acids and isomeric amino acids. Mass spectrometry distinguishes amino acids by their mass differences, and because de novo sequencing lacks reference sequence information, it cannot distinguish the isomers leucine and isoleucine directly from the original mass spectrum (amino acid mass information). Currently, leucine and isoleucine can be distinguished by the w-type ions of different masses generated by matrix-assisted laser desorption/ionization tandem time-of-flight mass spectrometry, but this method has two major limitations: (1) when the signal intensity of the w-type ions is weak, or w-type ions cannot be generated because of incomplete fragmentation, the method cannot distinguish leucine from isoleucine; (2) the use of w-type ions is limited by peptide length and sequence composition, and leucine and isoleucine can generally only be distinguished in peptide fragments 7 to 15 amino acids long. In the invention, the probability distribution of leucine and isoleucine is obtained by homology search against a large antibody sequence library, and an equivalent distinguishing effect is achieved without additional costly experiments.
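A minimal sketch of how a probability table might be used to resolve leucine versus isoleucine is given below. It assumes that the peptide has already been placed at a known offset in the variable region and that the table is a list of per-position dictionaries; the table values, the peptides, and the function name resolve_leu_ile are all illustrative assumptions.

```python
def resolve_leu_ile(peptide, offset, table):
    """Replace each L or I with whichever of the two is more probable
    at the corresponding variable region position."""
    resolved = []
    for i, aa in enumerate(peptide):
        if aa in ("L", "I"):
            column = table[offset + i]
            p_leu = column.get("L", 0.0)
            p_ile = column.get("I", 0.0)
            if p_leu != p_ile:  # on a tie, keep the residue as reported
                aa = "L" if p_leu > p_ile else "I"
        resolved.append(aa)
    return "".join(resolved)

# Made-up probability table for four consecutive positions.
example_table = [
    {"D": 0.9, "E": 0.1},
    {"I": 0.95, "L": 0.05},
    {"Q": 1.0},
    {"L": 0.8, "M": 0.2},
]
corrected = resolve_leu_ile("DLQL", 0, example_table)
```

Only residues reported as L or I are touched, since these are the ones that cannot be told apart by mass; other residues pass through unchanged.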

In sequence assembly, determining the start (N-terminus) and end (C-terminus) of a sequence is one of the most difficult steps. Currently, the dominant methods for determining the start of a protein sequence are Edman chemical degradation and chemical isotope labeling. Edman degradation determines the start of the sequence with high accuracy, but because the sequence is degraded one amino acid at a time, the reaction time is long, the cost is high, and the maximum identifiable length is only 30 amino acids. Determination of the N-terminus and C-terminus by chemical labeling is affected by the signal intensity of the isotopically labeled peptides, and internal peptides of the antibody sequence are also easily labeled, so the start and end of the sequence cannot be determined. In the present invention, the start and end of the antibody variable region sequence are determined using the homology tags collected after the homology search: the sequence at the start is encoded by the V gene, and the sequence at the end is the first 4 to 7 amino acids of the constant region determined by database search. Without any additional experiment, and based only on the homology information in antibody sequence data, the method achieves accuracy equivalent to the Edman chemical method while greatly shortening the analysis time, from hours to within 10 minutes.

Another difficulty in sequence assembly is determining the amount of overlap between peptide fragments and selecting the best overlapping peptide fragment when there is ambiguity. In the invention, peptide fragments obtained by de novo sequencing are processed into kmers of uniform length using a sliding window, and two different kmers that overlap by K-1 amino acids are candidate assembly fragments. When several candidate assembly fragments exist, ambiguous amino acids are resolved by combining the homology probability score P and the number of occurrences R of each candidate into a comprehensive score S, which improves the accuracy of assembly.

In summary, the invention includes but is not limited to the foregoing embodiments; any equivalent substitution or partial modification made within the spirit and principles of the invention shall be regarded as falling within the scope of protection of the invention.

Claims

1. A method for identifying antibody sequences based on de novo sequencing and homology tag graph theory, characterized by comprising the following steps:

(3) Carrying out a pFind search on the obtained mass spectrum file to determine the light chain constant region peptide fragment sequence and the heavy chain constant region peptide fragment sequence of the antibody to be identified, and carrying out pNovo de novo sequencing on the obtained mass spectrum file to determine the light chain variable region candidate peptide fragment sequences and the heavy chain variable region candidate peptide fragment sequences of the antibody to be identified;

(4) Carrying out a homology search on the light chain constant region peptide fragment sequence and the heavy chain constant region peptide fragment sequence of the antibody to be identified, respectively, to obtain highly homologous complete light chains and complete heavy chains, and counting the amino acid composition at each position of the variable region in the complete light chains and the complete heavy chains, respectively, to obtain a light chain variable region amino acid probability distribution table and a heavy chain variable region amino acid probability distribution table;

(5) Obtaining homology tags from the probability distribution tables, and screening the variable region candidate peptide fragment sequences of the antibody to be identified from step (3) for peptide fragments containing a homology tag to obtain highly reliable variable region peptide fragment sequences, wherein a homology tag is a sequence of 5 to 8 amino acids that contains more than 2 amino acids with a probability above 60% and 3 amino acids whose probability ranks in the top 5;

(6) Dividing the highly reliable variable region peptide fragment sequences into a plurality of small peptide fragment sequences of length K, where K is 5 to 8, counting the number of occurrences and the probability of each small peptide fragment sequence to obtain its assembly score, taking the small peptide fragment sequence that satisfies a homology tag and has the highest score as the assembly starting point, searching at both ends for small peptide fragment sequences overlapping by K-1 amino acids, adding one amino acid at the end according to the assembly score, and repeating the assembly to obtain the light chain and heavy chain variable region peptide fragment sequences;

2. The method for identifying antibody sequences based on de novo sequencing and homology tag graph theory according to claim 1, characterized in that in step (1), more than one of a protease method, a microwave hydrolysis method and a microwave-assisted protease method is used to cleave the antibody to be identified at the amino acid level, and more than 4 overlapping amino acids exist between the peptide fragments obtained by the different methods.

3. The method for identifying antibody sequences based on de novo sequencing and homology tag graph theory according to claim 2, characterized in that in step (1), specific proteases and non-specific proteases are used to proteolyze the antibody to be identified, and the total number of specific and non-specific proteases is greater than or equal to 3.

4. The method for identifying antibody sequences based on de novo sequencing and homology tag graph theory according to claim 1, characterized in that in step (2), the peptide solution is separated by liquid chromatography with mobile phase gradient elution, the chromatographic column is a C18 reversed-phase column, and during mass spectrometry the top 20 parent ions in the primary spectrum are selected for secondary spectrum analysis, the parent ions being fragmented by higher-energy collisional dissociation.

5. The method for identifying antibody sequences based on de novo sequencing and homology tag graph theory, characterized in that in step (3), the database searched by pFind is the complete Swissprot database, the parent ion mass tolerance and the fragment ion mass tolerance during the search are both ±20 ppm, the amino acid sequence results that are close in length to the antibody light chain constant region sequence and the antibody heavy chain constant region sequence and have the highest coverage are selected from the search results as the light chain constant region peptide fragment sequence and the heavy chain constant region peptide fragment sequence of the antibody to be identified, pNovo de novo sequencing is carried out on the obtained mass spectrum file, and peptide fragments with a parent ion mass deviation of less than 10 ppm are retained as the variable region candidate peptide fragment sequences of the antibody to be identified.

6. The method of claim 1, wherein in step (4), the online antibody library abYsis is used for the homology search, and highly homologous complete light chains and heavy chains having an E value of less than 10^-5 are obtained from the light chain constant region and the heavy chain constant region.

7. The method of claim 1, wherein in step (6), the assembly score of each small peptide fragment is S = R × 10^P, wherein R is the number of occurrences of the small peptide fragment and P is the amino acid probability of the small peptide fragment.

8. The method of claim 1, wherein in step (6), if the same site is ambiguous during assembly, the amino acid with the highest probability in the probability distribution table is selected, provided that its probability is 3 times or more that of the second most probable amino acid.

9. The method of claim 1, wherein in step (6), the C-terminus of the variable region is extended by an additional 4 to 7 amino acids, and these additional 4 to 7 amino acids overlap the first 4 to 7 amino acids at the N-terminus of the constant region.

10. The method for identifying antibody sequences based on de novo sequencing and homology tag graph theory according to any one of claims 1 to 9, wherein the antibody to be identified comprises a mixture of polyclonal antibodies, pure monoclonal antibodies, monoclonal antibodies secreted by hybridoma cells, or monoclonal antibodies secreted by plasma cells with different specificities in serum.