
WO2008008766A2 - Co-expression of proline hydroxylases to facilitate Hyp-glycosylation of proteins expressed and secreted in plant cells - Google Patents


Info

Publication number
WO2008008766A2
Authority
WO
WIPO (PCT)
Prior art keywords
hyp
protein
proline
glycosylation
pro
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2007/073137
Other languages
English (en)
Other versions
WO2008008766A3 (fr
Inventor
Marcia Kieliszewski
Jianfeng Xu
Iver P. Cooper
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ohio University
Original Assignee
Ohio University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ohio University
Priority to EP07799434A (patent EP2084285A4)
Publication of WO2008008766A2
Publication of WO2008008766A3

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/63 Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
    • C12N15/79 Vectors or expression systems specially adapted for eukaryotic hosts
    • C12N15/82 Vectors or expression systems specially adapted for eukaryotic hosts for plant cells, e.g. plant artificial chromosomes (PACs)
    • C12N15/8241 Phenotypically and genetically modified plants via recombinant DNA technology
    • C12N15/8242 Phenotypically and genetically modified plants via recombinant DNA technology with non-agronomic quality (output) traits, e.g. for industrial processing; Value added, non-agronomic traits
    • C12N15/8257 Phenotypically and genetically modified plants via recombinant DNA technology with non-agronomic quality (output) traits, e.g. for industrial processing; Value added, non-agronomic traits for the production of primary gene products, e.g. pharmaceutical products, interferon
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K14/00 Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof
    • C07K14/435 Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof from animals; from humans
    • C07K14/78 Connective tissue peptides, e.g. collagen, elastin, laminin, fibronectin, vitronectin or cold insoluble globulin [CIG]
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00 Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/0004 Oxidoreductases (1.)
    • C12N9/0071 Oxidoreductases (1.) acting on paired donors with incorporation of molecular oxygen (1.14)
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12P FERMENTATION OR ENZYME-USING PROCESSES TO SYNTHESISE A DESIRED CHEMICAL COMPOUND OR COMPOSITION OR TO SEPARATE OPTICAL ISOMERS FROM A RACEMIC MIXTURE
    • C12P21/00 Preparation of peptides or proteins
    • C12P21/005 Glycopeptides, glycoproteins

Definitions

  • This invention relates to the secretion of proteins in plant cells.
  • HRGPs all of the cell surface HRGPs (extensins) form a covalently cross-linked cell wall network.
  • AGPs are initially tethered to the plasma membrane by a lipid anchor whose cleavage results in their movement from the periplasm through the cell wall to the exterior.
  • Hyp: hydroxyproline
  • HRGPs: Hyp-rich glycoproteins
  • AGPs: arabinogalactan proteins
  • PRPs: proline-rich proteins
  • AGPs [>90% (wt/wt) sugar] have repetitive variants of (Xaa-Hyp)n motifs with O-linked arabinogalactan polysaccharides involving an O-galactosyl-Hyp glycosidic bond.
  • Extensins [50% (wt/wt) sugar] have a diagnostic Ser-Hyp4 repeat that contains short oligosaccharides of arabinose (Hyp arabinosides) involving an O-L-arabinosyl-Hyp linkage.
  • Lightly arabinosylated PRPs [2-27% (wt/wt) sugar] are the most highly periodic, consisting largely of pentapeptide repeats, typically variants of Pro-Hyp-Val-Tyr-Lys (SEQ ID NO:2). Recombinant production of some Hyp-rich glycoproteins is discussed in Kieliszewski et al., USP 6,548,642, 6,570,062, and 6,639,050.
  • According to the Hyp contiguity hypothesis, discussed in Shpak et al. (1999) but advanced previously, clustered, noncontiguous Hyp residues (e.g., the Hyp's in Xaa-Hyp-Xaa-Hyp) are sites of arabinogalactan polysaccharide attachment, while small arabinooligosaccharides (1-5 Ara residues/Hyp) are attached to contiguous (dipeptidyl or larger) Hyp residues. Di-Hyp blocks are found in PRPs and tetra-Hyp blocks in extensins.
  • Shpak et al. (1999) expressed two synthetic genes, encoding putative AGP glycomodules, in plants.
  • The construct expressing noncontiguous Hyp [32 Ser-Hyp repeats] showed exclusive polysaccharide addition, whereas another construct containing noncontiguous Hyp and additional contiguous Hyp [three repeats of a 19-amino-acid sequence, SOOOTLSOSOTOTOOOGPH (SEQ ID NO: 3), from gum arabic glycoprotein, GAGP] showed both polysaccharide and arabinooligosaccharide addition, consistent with the predictions of the Hyp contiguity hypothesis.
  • Half of the Hyp residues in the di-Hyp blocks were arabinosylated, and almost 100% of those in the tetra-Hyp blocks. In the case of the tri-Pro blocks, these were incompletely hydroxylated at each of the three Pro's, resulting in a mixture of contiguous and non-contiguous Hyp and thus in partial arabinosylation.
  • the first criterion for classification as an AGP was that the protein had a PAST (Pro, Ala, Ser, Thr) content over 50%.
  • the second criterion was that the protein had an N-terminal signal sequence identifiable by the program SignalP, see Nielsen et al., Protein Eng 10:1-6 (1997).
  • 62 proteins were identified by the first criterion, of which 49 were predicted to be secreted. Schultz et al. admit that the 50% PAST threshold did not pick up PRP1-PRP4, for which the PAST value is 32-45%.
  • Schultz et al. also identified putative AG peptides by the following criteria: length of 50-75 amino acids; PAST composition of over 35%; and predicted to be secreted.
  • FLAs could not be found by a simple biased amino acid composition search because they are chimeric AGPs, that is, they include fasciclin domains, which are not AGP-like glycomodule domains.
  • the FLA7 protein is 39% PAST, but if the fasciclin domain is ignored, it is 52% PAST.
  • Schultz therefore screened for Arabidopsis proteins which were at least 39% PAST.
  • Schultz et al. then used a hidden Markov model for 88 known fasciclin domains to create a position-specific score matrix for identification of fasciclin domains.
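The PAST composition screen described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the procedure actually used by Schultz et al.; the function names `past_percent` and `passes_past_screen`, the `exclude` parameter (used here to ignore a region such as a fasciclin domain), and the toy sequences in the usage note are all assumptions made for this example.

```python
PAST_RESIDUES = set("PAST")  # Pro, Ala, Ser, Thr in one-letter code


def past_percent(seq, exclude=None):
    """Return the percentage of residues in seq that are Pro, Ala, Ser or Thr.

    exclude is an optional (start, end) half-open index range that is removed
    before counting, e.g. to ignore a non-AGP-like domain (hypothetical usage).
    """
    if exclude is not None:
        start, end = exclude
        seq = seq[:start] + seq[end:]
    if not seq:
        raise ValueError("no residues left to score")
    hits = sum(1 for aa in seq if aa in PAST_RESIDUES)
    return 100.0 * hits / len(seq)


def passes_past_screen(seq, threshold=50.0):
    """Apply the 'PAST content over threshold' criterion (strictly greater)."""
    return past_percent(seq) > threshold
```

For instance, `past_percent("PASTGGGG")` is 50.0, which does not pass a strict "over 50%" screen, while excluding the last four residues with `exclude=(4, 8)` raises the value to 100.0. This mirrors the way ignoring the fasciclin domain raises the PAST value of FLA7.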
  • Shimizu does not propose mutating any non-plant protein so that it can be secreted, or secreted more efficiently, in plant cells.
  • Shimizu does not propose expressing, in secretible form, any plant protein which is not natively secreted, even if that protein natively has the postulated Hyp-glycosylation motif.
  • Shimizu does not propose mutating any plant protein which does not include any sequences fitting the motif so that it possesses the motif.
  • Shimizu does not propose mutating any plant protein to increase the number of prolines which fit the motif.
  • Russell did not deliberately mutate the sFv-encoding sequence in order to facilitate expression and secretion in plant cells, and did not state any opinion as to why the single chain antibody was so efficiently produced therein.
  • the present inventors believe that Russell unsuspectingly chose to produce a single chain antibody which had several prolines which, according to the predictions of the present inventors' algorithm, would be hydroxylated and O-glycosylated, thus resulting in high-level secretion. That algorithm predicts that six of the prolines in Russell SEQ ID NO:6 would be so processed. (The present inventors also believe that the Asn-Pro-Ser site in Russell SEQ ID NO: 8 would be N-glycosylated.)
  • the sequence of this viral peptide corresponds to residues 1 to 23 of "virus protein 2", sequence EMBL database # AAV36761.1, with the position 23 Ser (S) being identified as Glp (pyrrolidone carboxylic acid (pyroglutamate)) in Gil.
  • Plant cells contain particular proline hydroxylases whose specificity determines which prolines, in a protein expressed therein, are hydroxylated.
  • the resulting hydroxyprolines may be glycosylated. It is the presence of glycosylated hydroxyprolines which is the most important determinant of the degree of secretion of the protein. Hence, we have developed methods of predicting which prolines will be hydroxylated and which hydroxyprolines will be glycosylated.
  • glycosylated residues (more specifically, prolines which will be post-translationally modified into arabinosylated or arabinogalactosylated hydroxyproline residues) can be identified in advance. In that manner, we can determine which proteins are likely to be readily secreted if expressed, in secretable form, in plant cells.
  • Proteins consequently can be divided into two categories, those which, without further modification (other than providing a signal peptide), will, if expressed in plant cells, undergo the processing of one or more prolines into glycosylated-Hyp, and those which will not.
  • the proteins in the latter category may be modified so that such processing is likely to occur. That is, modifications can be made which are predicted to increase the number and/or the likelihood of Hyp-glycosylation sites. However, such modifications could have an adverse effect on biological activity.
  • proline hydroxylation capabilities of the plant cell by co-expressing in that cell at least one exogenous proline hydroxylase.
  • This could be the proline hydroxylase of another plant species, or the proline hydroxylase of a non-plant organism. It could also be a non-naturally occurring proline hydroxylase, such as a chimeric proline hydroxylase.
  • the point is to employ a proline hydroxylase which will hydroxylate at least one proline site in the protein which is not hydroxylated by the endogenous proline hydroxylase, but which, if hydroxylated, can then be glycosylated by the plant cell, resulting in increased Hyp-glycosylation.
  • the co-expressed exogenous proline hydroxylase may still be useful in conjunction with modification of the protein itself. That is, since the prolines of the modified protein can be hydroxylated by either the endogenous plant proline hydroxylase or the exogenous proline hydroxylase, the opportunities for hydroxyproline formation are increased. The skilled worker also has more flexibility in selecting the sites for modification in such manner as to avoid adverse effects on biological activity.
  • One class of proteins of interest are naturally occurring non-plant proteins which fortuitously possess one or more prolines which, if expressed and secreted by suitable plant cells, and co-expressed with an exogenous proline hydroxylase, will be hydroxylated and glycosylated.
  • Another class of proteins of interest are non-plant proteins which are deficient in favorable prolines, but which can be engineered, based on the design methods set forth in this disclosure, to remedy this deficiency, said modified proteins comprising at least one site which is recognized by an exogenous proline hydroxylase but not by the proline hydroxylases of the plant cells intended to serve as host cells.
  • a third class of proteins of interest are plant proteins which are not naturally secreted, but which, if expressed as fusion proteins including a suitable signal peptide, and co-expressed with an exogenous proline hydroxylase, fortuitously possess the favorable prolines.
  • a fourth class of proteins of interest are plant proteins which are deficient in favorable prolines, but which can be engineered to remedy this deficiency, said modified proteins comprising at least one site which is recognized by an exogenous proline hydroxylase but not by the proline hydroxylases of the plant cells intended to serve as host cells.
  • the first step is to analyze the sequence of the human protein and determine whether it would, without modification, be hydroxylated and glycosylated by the desired plant cells, in such a manner as to achieve the desired level of secretion. If so, then this invention teaches that it is desirable that a mature protein coding sequence, suitable for plant cell expression, and operably linked to a signal sequence functional in plant cells, and to a promoter functional in plant cells, be introduced into such cells, and the transformed plant cells cultivated under conditions in which that human protein is expressed and secreted.
  • sequence of the human protein is not such as would achieve a desired level of secretion, then one may instead identify an exogenous proline hydroxylase, such that said protein would achieve the desired level of Hyp-glycosylation and protein secretion if said exogenous proline hydroxylase were co-expressed in said plant cells.
  • mutant protein which does achieve that level, granted co-expression of an exogenous proline hydroxylase, and which either retains substantially all of the desired biological activity of the reference human protein, or which can be processed (e.g., cleaved), in the culture medium or at a later stage of recovery, to yield a final protein which does satisfy this biological activity test.
  • the human protein is mutated by insertion of at least one "Hyp-glycomodule" at the amino and/or carboxy ends of the protein (in which case the reader may prefer to speak of the glycomodule as being “added” to the protein).
  • the term "Hyp-glycomodule" refers generally to a sequence containing one or more prolines so positioned that the plant cell will hydroxylate and glycosylate them (hence the "glyco" of the name). The term will be defined more precisely in a later section of this application.
  • Hyp-glycomodule to the native human protein moiety by a spacer which either 1) acts to distance the native human protein moiety from the Hyp-glycomodule in such manner as to increase the retention of native human protein biological activity by the Hyp-glycomodule-spacer-human protein fusion relative to that retained by a direct Hyp-glycomodule-human protein fusion, or 2) provides a site-specific cleavage site for an enzyme or chemical agent such that, after cleavage at that site, a new product is generated which does have the desired biological activity.
  • Hyp-glycomodule results in reduction of biological activity, that this can be ameliorated by mutations within the human protein moiety proper.
  • mutations may be substitution mutations (not necessarily introducing prolines) or truncation of one or more amino acids from either or both ends of the human protein (e.g., so that the Hyp-glycomodule is in whole or in part replacing an amino or carboxy sequence).
  • the human protein is mutated internally. Most often, this will be by one or more substitution mutations which introduce prolines at sites collectively favored for hydroxylation and subsequent glycosylation.
  • amino acids in the vicinity of a native or introduced proline may be replaced with other amino acids, so that said native or introduced proline becomes one collectively favored for hydroxylation and subsequent glycosylation.
  • any other desired substitutions can be made if they do not substantially adversely affect either plant cell secretion or (with certain caveats) the biological activity of the mutant protein. It is also possible, although more difficult from the standpoint of preserving biological activity, to foster proline hydroxylation and subsequent hydroxyproline glycosylation by deletion and/or internal insertion.
  • the first strategy in effect creates a Hyp-glycomodule within the protein by addition, whereas the second does so by substitution and/or deletion and/or internal insertion.
  • Hyp-glycomodule to one end of a human protein and also introduce glycosylation-increasing substitution mutations into the human protein moiety.
  • proteins comprising at least one native Hyp-glycomodule and/or at least one substitution and/or at least one internal insertion Hyp-glycomodule, whether or not they also comprise an addition Hyp-glycomodule, are of particular interest.
  • proteins comprising only one or more addition Hyp-glycomodules and no substitution Hyp-glycomodules are also within the contemplation of the present invention.
  • the modification may usefully inhibit one of the biological activities of the parental protein, while leaving another biological activity intact.
  • an agonist must bind to and activate a receptor. If the modification inhibits activation, but permits binding, then the agonist is converted into an antagonist.
  • An example of the use of a modification to introduce Hyp-glycosylation while converting an agonist into an antagonist is given in the Examples, in the discussion of Fibroblast Growth Factor 7.
  • the present invention thus relates, in part, to
  • precursor proteins consisting essentially of a plant specific signal peptide and a mature protein as described above, with one or more Hyp-glycosylation sites, not previously expressed in and secreted by plant cells
  • glycoproteins of the present invention are expected to be more efficiently secreted in plant cells; this of course presumes that they are expressed in a precursor form comprising a secretory signal peptide recognized by the host plant cell, which signal peptide is cleaved off, releasing the mature core protein. Glycosylation is post-translational, and occurs after the signal peptide is removed.
  • one or more of the glycosylated residues are hydroxyprolines. Hydroxyprolines arise through hydroxylation of proline residues; it is not presently known whether hydroxylation is co-translational or post-translational, and thus its timing relative to signal peptide cleavage.
  • glycoproteins may exhibit various additional advantages over their wild-type counterparts, including increased solubility, increased resistance to proteolytic enzymes, and/or increased stability. They may have comparable biological activity, or they may have improved pharmacodynamic or pharmacokinetic properties, such as increased biological half-life as compared to wild-type proteins. Finally, glycosylation makes possible the purification of the protein by carbohydrate affinity chromatography.
  • a glycoprotein is a protein containing one or more carbohydrate chains.
  • the core of a glycoprotein is the corresponding unglycosylated protein having the same amino acid sequence.
  • This core protein may include non-genetically encoded, and even non-naturally occurring, amino acids.
  • sequence as determined solely by the genetic code is referred to as the "genetically encoded sequence", the “genetically encodable sequence”, the “translated sequence”, the “nascent sequence”, the “initial sequence”, or the “initial core sequence”.
  • proline skeleton typically refers to this level of sequence analysis.
  • the portion of the intermediate sequence which ultimately becomes part of the mature protein — that is, which excludes the signal peptide — is referred to as the mature portion.
  • the "completely processed sequence", also known as the "mature sequence", the "secreted sequence" or the "final sequence", is the result of the hydroxylation of the prolines, the removal of the signal peptide, and the glycosylation.
  • prolines, unglycosylated hydroxyprolines, and glycosylated hydroxyprolines are distinguished.
  • sequences are not distinguished on the basis of the precise nature of the glycosylation at a particular amino acid position. We can however refer to proteins with different "glycosylation patterns."
  • pro-hydroxylation site means a proline residue which, according to the specified prediction method, is predicted to be hydroxylated if the protein to which it belongs is expressed and secreted in a plant cell.
  • any disclosed method, or art-recognized method may be used. Each disclosed method herein corresponds to a separate series of preferred embodiments, but the most preferred embodiments are those in which the standard quantitative prediction method, with the new matrix, is used.
  • actual Pro-hydroxylation site refers to a proline residue which in fact is hydroxylated if the protein to which it belongs is expressed and secreted in a plant cell.
  • proline residue which, according to the specified prediction method, is predicted to be hydroxylated to form hydroxyproline, and which hydroxyproline is predicted to be glycosylated, at least in part.
  • any disclosed method, or art-recognized method may be used. Each disclosed method herein corresponds to a series of preferred embodiments, but the more preferred embodiments are those in which the new standard prediction method is used.
  • actual Hyp-glycosylation site means a proline residue which, in a protein expressed and secreted in a plant cell, in fact acts as a target site of plant cell hydroxylation (forming a hydroxyproline) and subsequent glycosylation. Such glycosylation need not be complete; a Hyp is considered an actual target site for plant cell glycosylation if at least 25% of the protein molecules are glycosylated at that position in at least one species of plant cell.
  • Predicted hydroxyproline (i.e., Pro-hydroxylation) sites are deemed to be noncontiguous but clustered if they are part of a series (i.e., two or more) of non-contiguous sites, wherein any site is separated from the nearest site, on either side, by one and only one amino acid, and that separating amino acid is not a proline or hydroxyproline.
  • the smallest possible cluster, other than at the N- or C-terminus, is of the form -X-O-X-O-X-, since the two O are non-contiguous and separated from each other by one separating amino acid.
  • the third, fourth and fifth hydroxyprolines, which are boldfaced, are part of a single cluster of noncontiguous hydroxyprolines, while the first and second hydroxyprolines are a contiguous dipeptide block, and the final hydroxyproline is isolated (a hydroxyproline which is not part of a contiguous series, and not part of a cluster, is considered isolated).
  • O-O-X-O-X-O-O (SEQ ID NO: 51) does not feature a cluster, but rather two dipeptidyl Hyp with a lone unclustered Hyp in-between.
  • Clustered actual hydroxyproline sites are analogously defined.
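The contiguous / clustered / isolated distinction drawn above can be illustrated with a short Python sketch. It is one possible reading of the definitions, not the inventors' program: the function name `classify_hyp_sites` is hypothetical, and the input convention (a one-letter string in which `O` marks a hydroxyproline site, `P` an unmodified proline and any other letter another residue) is an assumption made for this example.

```python
def classify_hyp_sites(seq):
    """Label each 'O' position as 'contiguous', 'clustered' or 'isolated'.

    Assumed convention: 'O' = hydroxyproline site, 'P' = proline,
    any other letter = some other amino acid.
    """
    sites = [i for i, aa in enumerate(seq) if aa == "O"]
    site_set = set(sites)
    labels = {}

    # A site directly adjacent to another site belongs to a contiguous block.
    for i in sites:
        if i - 1 in site_set or i + 1 in site_set:
            labels[i] = "contiguous"

    # Remaining sites are non-contiguous singletons.
    singles = {i for i in sites if i not in labels}

    def linked(i, j):
        # Two singletons are linked when exactly one residue separates them
        # and that residue is neither proline nor hydroxyproline.
        return j in singles and seq[(i + j) // 2] not in "PO"

    for i in singles:
        if linked(i, i - 2) or linked(i, i + 2):
            labels[i] = "clustered"
        else:
            labels[i] = "isolated"
    return labels
```

Under this reading, `classify_hyp_sites("OOXOXOO")` labels the middle site as isolated and the two flanking pairs as contiguous, in line with the statement that O-O-X-O-X-O-O does not feature a cluster, while `classify_hyp_sites("XOXOX")` labels both sites as clustered.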
  • Predicted Pro-hydroxylation or Hyp-glycosylation sites are deemed to be proximate to each other if there are no intervening prolines (or hydroxyprolines) and if they are separated by not more than four intervening amino acids which are not prolines or hydroxyprolines (e.g., O-X-X-X-X-O). Proximate actual Pro-hydroxylation or Hyp-glycosylation sites are analogously defined.
  • Sites of a particular kind are said to be grouped if they are a series (i.e., two or more) of non-contiguous sites, each site is proximate to the next site in the series, and the sites don't satisfy the definition of clustered sites. Isolated sites may be grouped or not. If not grouped, they may be termed "highly isolated."
  • the term "predicted Hyp-glycomodule" is meant to refer to an amino acid sequence consisting of (1) an uninterrupted series of proximate predicted Hyp-glycosylation sites, (2) the amino acids, if any, between any two such Hyp-glycosylation sites of that series which are not themselves such Hyp-glycosylation sites, (3) the two amino acids, if any, before the first Hyp-glycosylation site of such series, and (4) the two amino acids, if any, after the last Hyp-glycosylation site of such series.
  • Hyp-glycosylation sites are said to be in series if the first site is proximate to the second, the second to third (if any), the third to the fourth (if any), and so on without any gap of more than four intervening amino acids which are not prolines or hydroxyprolines.
  • a Hyp-glycomodule could be, e.g., X-X-O-O-X-O-X-X-O-X-X-X-X-O-X-X-X-X-O-X-X-X-O-X-X (SEQ ID NO: 52), assuming that all of the hydroxyprolines (O) are in fact Hyp-glycosylation sites, as the sequence then includes a series of six sites, each proximate to the next one.
  • the term "actual Hyp-glycomodule" is analogously defined.
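The definition of a Hyp-glycomodule as a series of proximate sites plus up to two flanking residues on each side can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `find_glycomodules` and its parameters `max_gap` and `flank` are hypothetical, every `O` in the input string is taken to be a Hyp-glycosylation site, `P` is an unmodified proline, and a series consisting of a single site is allowed to form a module on its own.

```python
def find_glycomodules(seq, max_gap=4, flank=2):
    """Return a list of (start, end, sites) tuples, one per glycomodule.

    start/end are half-open indices into seq; sites lists the positions of
    the glycosylation sites ('O') that make up the series.
    """
    sites = [i for i, aa in enumerate(seq) if aa == "O"]
    series = []
    for s in sites:
        if series:
            previous = series[-1][-1]
            gap = seq[previous + 1:s]  # residues between two consecutive sites
            # Proximate: at most max_gap intervening residues, none a proline.
            if len(gap) <= max_gap and "P" not in gap:
                series[-1].append(s)
                continue
        series.append([s])
    # Add up to `flank` residues before the first and after the last site.
    return [
        (max(0, group[0] - flank), min(len(seq), group[-1] + flank + 1), group)
        for group in series
    ]
```

Applied to SEQ ID NO: 52 written without hyphens (`"XXOOXOXXOXXXXOXXXXOXXXOXX"`), the sketch returns a single module spanning the whole sequence, because every site is proximate to the next. A gap of five residues, as in `"XXOXXXXXOXX"`, splits the sites into two separate modules.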
  • Hyp-glycomodule may be used not only to refer to the final processed form of the moiety, including one or more glycosylated hydroxyprolines, but also, more loosely, to refer to the amino acid sequence of the Hyp-glycomodule before it undergoes any post-translational modification, or to the sequence which is hydroxylated (and thus includes one or more hydroxyprolines), but those hydroxyprolines are unglycosylated or incompletely glycosylated.
  • the equilibrium glycosylated form may be referred to as the mature or final Hyp-glycomodule
  • the immediately expressed form, prior to hydroxylation or glycosylation may be referred to as the nascent Hyp-glycomodule
  • any intermediate form may be referred to as an intermediate Hyp-glycomodule.
  • the amino acid sequence of the nascent Hyp-glycomodule may be referred to as the initial core sequence thereof and the amino acid sequence of the final Hyp-glycomodule, with hydroxyprolines identified (but ignoring glycosylation), may be referred to as the modified core sequence thereof.
  • Hyp-glycosylation types include, but are not limited to, arabinosylation and arabinogalactan-polysaccharide addition.
  • Arabinosylation generally involves the addition of short (e.g., generally about 1-5) arabinooligosaccharide (generally L-arabinofuranosyl residues) chains.
  • Arabinogalactan-polysaccharides are larger and generally are formed from a core β-1,3-D-galactan backbone periodically decorated with 1,6-additions of small side chains of D-galactose and L-arabinose and occasionally with other sugars such as L-rhamnose and sugar acids such as D-glucuronic acid and its 4-O-methyl derivative.
  • Arabinogalactan-polysaccharides can also take the form of a core β-1,6-D-galactan backbone periodically decorated with 1,6-additions of small side chains of arabinofuranosyl. Note that these adducts are added by a plant's natural enzymatic systems to proteins/peptides/polypeptides that include the target sites for glycosylation, i.e., the glycosylation sites. There may be variation in the actual molecular structure of the glycosylation that occurs.
  • the oligosaccharide chains may include any sugar which can be provided by the host cell, including, without limitation, Gal, GalNAc, Glc, GlcNAc, Fuc, xylose, mannose, and galacturonic acid.
  • Moreover, a rule created to explain a single site in a single protein may invoke a feature which is actually irrelevant or only marginally relevant to the susceptibility of that site to hydroxylation and glycosylation, and hence lead, when applied to new proteins, to erroneous predictions. (This is sometimes referred to as "over-training" a rule to match a data set.)
  • any reasonable prediction rule will result in both false positives (saying it is hydroxylated or glycosylated, when in fact it isn't) and false negatives (saying it isn't, when in fact it is). For this reason, we have been careful to define both predicted and actual Hyp-glycosylation sites. Nonetheless, we believe that the current prediction methods are sufficiently accurate to be useful in designing systems for secreting biologically active proteins (or proteins cleavable to release biologically active proteins) in plant cells.
  • the present disclosure sets forth three methods for the prediction of proline hydroxylation in plant cells as a result of the action of the endogenous proline hydroxylases. (These methods may also be applicable to predicting the action of an exogenous proline hydroxylase if it is of plant cell origin.)
  • the qualitative standard method is used.
  • the quantitative standard method which generates a Hyp-score, is used. (This preferably uses the new standard matrix, but may alternatively use the old one.)
  • the qualitative alternative method is used. These three series of embodiments overlap a great deal, but are not identical.
  • the quantitative standard method may further be classified into subseries of embodiments depending on the choice of the three parameters of the method.
  • the present disclosure sets forth three methods for the prediction of hydroxyproline glycosylation: 1) the old standard method, 2) the old alternative method, and 3) the new standard method.
  • the new standard method is used.
  • the old standard method is used.
  • the "extension" is used, and a subset in which it isn't.
  • the alternative method is used. While these methods attempt to predict the type of glycosylation which occurs at a particular residue, this is not as important as knowing whether glycosylation occurs at all.
  • the present program implementation of the methods for predicting hydroxylation and glycosylation doesn't include any subroutines for the prediction of signal peptidase cleavage sites. Consequently, if the sequence of the protein, as input into the program, includes the signal sequence, the program may predict Pro-hydroxylation sites and Hyp-glycosylation sites within the signal peptide. Moreover, residues in the signal sequence may be close enough to a Pro outside the signal sequence to influence the predictions made concerning that proline.
  • the programs don't include any subroutines for the prediction of GPI addition signals. Consequently, there could be prediction of Pro-hydroxylation or Hyp-glycosylation within or near the GPI addition signal, which might not be predicted if that signal were not within the inputted sequence. It is believed that GPI addition is post-translational, which implies that the GPI addition sequence (cleaved off, and the GPI anchor added, in the endoplasmic reticulum) can influence hydroxylation of nearby Pro, but not glycosylation of nearby Hyp.
  • GPI addition signals are primarily a concern in the case of naturally secreted proteins and modifications thereof.
  • Kivirikko and Myllyharju, "Proline 4-Hydroxylases and their Protein Disulfide Isomerase Subunit," Matrix Biology, 16: 357-68 (1997/98) discusses prolyl 4-hydroxylases.
  • the vertebrate enzymes (e.g., human, mouse, rat, chicken)
  • the vertebrate enzymes are all tetramers consisting of two alpha and two beta subunits. At least in humans and mice, there are two isoforms of the alpha subunit, denoted (I) and (II) respectively; they are not found within the same tetramer.
  • the human alpha I and alpha II subunits have a reported sequence identity of 64%.
  • the beta subunit is identical to protein disulfide isomerase.
  • the C. elegans proline 4-hydroxylase is an alpha-beta dimer. Ignoring the C-terminal extension present only in the nematode polypeptide, the nematode alpha subunit has sequence identities of 45 and 42%, respectively, with the human alpha I and II subunits.
  • the vertebrate prolyl 4-hydroxylases will hydroxylate the X-Pro-Gly triplet, especially in repeated form.
  • Vertebrate prolyl 4-hydroxylases have been co-expressed in insect and yeast cells, facilitating the production of recombinant human collagen in those hosts.
  • Poly-L-proline is a strong competitive inhibitor of the vertebrate type I prolyl 4-hydroxylase, and of the plant prolyl 4-hydroxylase, but not of the vertebrate type II or the nematode enzyme.
  • prolyl 4-hydroxylases include viral prolyl 4-hydroxylases and HIF (hypoxia-inducible factor) prolyl 4-hydroxylases. See the thesis at http://herkules.oulu.fi/isbn951427203X/html/c185.html.
  • the HIF hydroxylases are further discussed in Berra, "The hypoxia-inducible-factor hydroxylases bring fresh air into hypoxia signalling," EMBO Reports, 7(1):41-45 (2006).
  • These enzymes, also known as PHDs, and including PHD1, PHD2 and PHD3, hydroxylate two proline residues in a conserved LxxLAP (SEQ ID NO:118) sequence motif.
  • HIF prolyl-hydroxylase 2 is the key oxygen sensor setting low steady-state levels of HIF-1α in normoxia. EMBO J 22, 4082-4090.
  • Kivirikko KI & Pihlajaniemi T (1998) Collagen hydroxylases and the protein disulfide isomerase subunit of prolyl 4-hydroxylases. Adv Enzymol Relat Areas Mol Biol 72: 325-398.
  • Kivirikko KI, Myllyla R, Pihlajaniemi T (2000) Hydroxylation of proline and lysine residues in collagens and other animal and plant proteins. In: Post-Translational Modifications of Proteins. CRC Press, Boca Raton, Ann Arbor, London; Eds: J.J. Harding, M.J.C. Crabbe; pp 1-49.
  • Semenza GL (2001) HIF-1, O2, and the 3 PHDs: how animal cells signal hypoxia to the nucleus. Cell 107: 43-54.
  • the plant cell of interest is a higher plant
  • the heterologous prolyl-hydroxylase co-expressed in the (higher) plant cells of interest is one native to a lower plant (or mutatis mutandis, a mutant thereof).
  • a lower plant is an alga, moss (bryophyte), liverwort (Hepaticophyta), or hornwort (Anthocerotophyta). Algae, and more particularly the green algae and the diatoms, are especially preferred.
  • AAANOOSO SEQ ID NO: 131: Note hydroxylation of Pro following N
  • the major glycoprotein in the extracellular matrix of Volvox is composed of polyHyp. It is 50% carbohydrate, the major sugars being Glc, Ara, Gal and Man. This glycoprotein may exist as a one-, two-, or three-domain protein including a polyHyp region plus or minus a Hyp-free region.
  • the PolyHyp domain consists of pure polyHyp (750-1000 residues of Hyp) or polyHyp periodically interrupted by other amino acids, most commonly Ser. The only periodicity detected was (Hyp)10-Ser.
  • ENPSTOOXXOAOAXT (SEQ ID NO: 132) (where X = unidentified)
  • a proline immediately preceded by Lys, Ile, Gln, Arg, Leu, Phe, Tyr, Asp, Asn, Cys, Trp or Met is not hydroxylated.
  • a proline immediately preceded by Ala, Ser, Val, Thr or Pro is likely to be hydroxylated. This is even more likely to occur if the proline is both immediately preceded and immediately followed by one of those five amino acids, e.g., SPS, APS, TPA, APT, APA, APV, SPV, etc.
  • a proline immediately preceded by Glu, Gly or His can be hydroxylated, but this is more sensitive to the nature of other amino acids in the vicinity of that proline.
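The three qualitative rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program referred to in this disclosure: the function name `qualitative_calls`, the label strings, and the use of one-letter codes on the translated (pre-hydroxylation) sequence are all assumptions made for the example.

```python
# Residue groups taken from the three qualitative rules (one-letter codes).
NOT_HYDROXYLATED = set("KIQRLFYDNCWM")  # rule 1: preceding residue blocks hydroxylation
LIKELY_HYDROXYLATED = set("ASVTP")      # rule 2: preceding residue favors hydroxylation
CONTEXT_DEPENDENT = set("EGH")          # rule 3: outcome depends on nearby residues


def qualitative_calls(seq):
    """Return (1-based position, call) for every Pro ('P') in seq.

    The call labels are hypothetical names chosen for this sketch.
    """
    calls = []
    for i, aa in enumerate(seq):
        if aa != "P":
            continue
        prev = seq[i - 1] if i > 0 else None
        nxt = seq[i + 1] if i + 1 < len(seq) else None
        if prev is None:
            call = "no preceding residue"
        elif prev in NOT_HYDROXYLATED:
            call = "not hydroxylated"
        elif prev in LIKELY_HYDROXYLATED:
            # Rule 2 is strengthened when the following residue is also favorable.
            call = "very likely" if nxt in LIKELY_HYDROXYLATED else "likely"
        elif prev in CONTEXT_DEPENDENT:
            call = "context-dependent"
        else:
            call = "unclassified"
        calls.append((i + 1, call))
    return calls


print(qualitative_calls("KPSPSGPA"))
```

In the sample sequence, the proline after Lys is called "not hydroxylated", the proline flanked by Ser on both sides is called "very likely", and the proline after Gly is called "context-dependent".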
  • a quantitative prediction method is set forth in the next section.
  • the standard quantitative prediction method draws upon, but goes beyond, the teachings of the qualitative method set forth in the last section. In particular, it considers the effects of residues which are not adjacent to the target proline.
  • Hyp hydroxyproline
  • LCF is the Local Composition Factor Score
  • LCFB is the Local Composition Factor Baseline
  • MV is the Matrix Value, all as defined below.
  • the proline is predicted to be hydroxylated if the HypScore is greater than the Score Threshold.
  • the preferred (default) value of the Score Threshold is 0.5.
  • a proline for which the Hyp Score thus calculated is greater than the Score Threshold is considered to be a predicted Pro-hydroxylation site for that Score Threshold. Such a site is a candidate for evaluation for hydroxyproline glycosylation, as described in a later section.
  • the preferred (default) values are assumed.
  • the Matrix value is the sum of the matrix scores, from the table below, for the amino acids in positions n-2, n-1, n+1 and n+2, where the target proline is at position n. If position n is so close to the amino or carboxy terminal that one or more of these positions is null, then the null position(s) can be given a matrix score of zero. However, we would recommend that the proteins of choice be ones for which at least one proline predicted to be hydroxylated and glycosylated is not within three amino acids of the amino or carboxy terminal, as the applicability of our algorithm to these extreme cases is less certain.
  • the "new standard" matrix shown above differs slightly from the "old standard" one set forth in 60/697,337. Specifically, D (Asp) in position +1 was previously scored as -1 (now 0), and G (Gly) in position -1 was formerly scored as -0.75 (now 0). These changes make the scoring system more permissive, which should increase the number of both hits (correct prediction of hydroxylated prolines) and false positives (prolines predicted to be hydroxylated which aren't). In general, false positives are preferred to false negatives.
  • the new standard matrix is used, and references to the matrix, without qualification, assume its use.
  • the old standard matrix is used.
  • the residues favored by rule 2 are assigned matrix values ranging from +1 to +4. Thus, depending on the nature of the residues at positions -2, +1 and +2, the matrix score can be negative or positive.
  • the matrix reveals that the nearby residues most likely to hinder hydroxylation are, at the -2 position, Cys, Trp and Gln; at the +1 position, Cys and Trp; and at the +2 position, Cys, Asp, Asn and Arg.
  • Pro hydroxylation is common in proteins and regions of proteins that are highly repetitive and rich in Pro/Hyp (therefore less random); Pro hydroxylation is less likely in those that are not repetitive.
  • Shannon entropy is defined as $H = -\sum_i p_i \log_2 p_i$, summed over all signals $i$ for which $p_i > 0$, where $p_i$ is the probability of occurrence of signal $i$, where the signal $i$ is either yes or no (i.e., a binary channel).
  • the $p_i$ are the proportions of amino acids in a sequence which are a particular type $i$ of amino acid (e.g., proline, or leucine, or glycine).
  • up to twenty types may be represented.
  • the absolute entropy score for an amino acid sequence is defined as the Shannon entropy, with the $p_i$ calculated as explained above.
  • post-translational modifications such as Pro to Hyp, or glycosylation.
  • Repetitiveness is a form of order, and the entropy score is a formal mathematical measure of disorder.
  • the repetitiveness of the protein sequence is evaluated in a window around the target proline, so the entropy is a measure of the repetitiveness of the protein in a region localized around the target proline, rather than that of the protein as a whole (unless the window is large enough to include the entire protein).
  • the entropy calculated in this manner is an incomplete measure of repetitiveness in the sense that it only considers the amino acid composition of the sequence, and not the ordering of the amino acids within it, so a sequence in which two amino acids alternate would have the same Shannon entropy as a random sequence which is 50% one and 50% the other.
  • if a protein sequence consisted of only a single type of amino acid, the absolute entropy score would be zero. That is the smallest possible value. If a protein sequence had an equal number of each of the twenty possible amino acids (we will call this an equipolymer), the absolute entropy score would be $-\log_2(1/20)$, or 4.32193, which is the maximum entropy for an amino acid sequence.
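The absolute entropy score described above can be computed directly from amino acid counts. The following is a minimal sketch, assuming one-letter residue codes; the function name `absolute_entropy` is a hypothetical name chosen for illustration and is not taken from the program discussed in this disclosure.

```python
import math
from collections import Counter


def absolute_entropy(seq):
    """Shannon entropy (in bits) of the amino acid composition of seq.

    Only residue types that actually occur contribute, which matches the
    restriction of the sum to signals with p_i > 0.
    """
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A homopolymer has the minimum score; an equipolymer of all twenty
# amino acids reaches the maximum, log2(20).
print(absolute_entropy("PPPPPPPP"))
print(absolute_entropy("ACDEFGHIKLMNPQRSTVWY"))

# Composition only: an alternating sequence and a blocked sequence with the
# same 50/50 composition score identically.
print(absolute_entropy("SPSPSP") == absolute_entropy("SSSPPP"))
```

The last comparison illustrates the limitation noted above: the score depends on composition only, not on the ordering of the residues.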
  • the Local Composition Factor is the relative order as defined above, and it is normally evaluated over a window centered on and including the target Proline.
  • the window may be an odd or an even number of amino acids. If it is an odd number, and the position of the target proline is denoted n, then the normal window is from position n-a to position n+a, where a is (width-1)/2, and the width is 2a+1. If the window is even in size, then the window can be defined in two ways, either from position n-a to position n+a-1, or from position n-a+1 to position n+a, where a is the half-width, so the width is 2a.
  • the preferred standard window size is 21 amino acids, so the preferred standard window is from n-10 to n+10.
  • When the target proline is close to the amino or carboxy terminal of the protein of interest, the window will be truncated on that side of the proline, reducing the effective window size. For example, if we were using a standard window size of 21 amino acids, but the target proline were at the amino terminal, then the "left half" of the window would be truncated, reducing the effective window size to 11, and the Local Composition Factor would be calculated over positions 1-11 of the protein.
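The window definitions and the truncation rule above can be expressed as a small helper. This is a sketch under stated assumptions: the function name `window_bounds` and the `extra_on_left` flag (which selects between the two even-width conventions) are hypothetical, and positions are 1-based and inclusive.

```python
def window_bounds(n, protein_length, width=21, extra_on_left=True):
    """Return the 1-based inclusive (start, end) of the window around position n.

    Odd widths are centered on n. For even widths, extra_on_left=True gives
    n-a .. n+a-1 and extra_on_left=False gives n-a+1 .. n+a, where a = width/2.
    The window is truncated at either terminus of the protein.
    """
    if width % 2 == 1:
        left = right = (width - 1) // 2
    else:
        a = width // 2
        left, right = (a, a - 1) if extra_on_left else (a - 1, a)
    start = max(1, n - left)
    end = min(protein_length, n + right)
    return start, end


# Target proline at the amino terminal: the left half is truncated, leaving
# positions 1-11, as in the example above.
print(window_bounds(1, 300))
# An interior target with the default width of 21 spans n-10 .. n+10.
print(window_bounds(50, 300))
```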
  • the Local Composition Factor Baseline is the value of the Local Composition Factor (LCF) for which the effect of the local composition on hydroxylation of prolines, measured as described above, is considered to be neutral.
  • the preferred (default) value is 0.4.
  • Xaa1 is Ala, Val, Ser, Thr or Gly
  • Xaa3 is Ala, Val, Ser, Thr, Gly or Ala [sic]
  • Xaa4 is Gly, Ala, Val, Pro, Ser, Thr or Cys
  • Xaa5 is Ala, Pro, Ser or acidic (Asp or Glu)
  • Shimizu does not consider the n-2 position, at which the matrix score could be as high as 2.
  • Shimizu ignores the possibility of Pro, which we would score as +3.
  • Shimizu ignores the positive scoring Phe (+0.1), Lys (+1), Hyp (+2), Pro (+3), Arg (+1), and Tyr (+0.5).
  • Xaa4 our n+2
  • Shimizu ignores the positive scoring His (+1), Lys (+1), and Tyr (+0.5).
  • a class of embodiments of interest are those proteins in which at least one proline is predicted to be hydroxylated by our algorithm, even though that proline would not be predicted to be hydroxylated on the basis of Shimizu's consensus sequence.
  • proteins in which at least one proline is predicted to be hydroxylated by our algorithm even though none of the prolines in that protein satisfy Shimizu's consensus sequence.
  • the present computer implementation of the quantitative method doesn't take the species of plant cell into account, i.e.,
  • GP is not hydroxylated in Acacia or tobacco, but is in Arabidopsis
  • HP is not hydroxylated in the Solanaceae (e.g., tobacco, tomato, eggplant, nightshade, peppers) but is in maize and probably other graminaceous monocots
  • G has a matrix weight of 0 (neutral), H of -5 (strongly unfavorable), and E of -0.5 (slightly unfavorable). That means that the computer program will tend to overlook, e.g., HP which would be hydroxylated in a suitable plant cell.
  • a proline immediately preceded by Lys, Ile, Gln, Arg, Leu, Phe, Tyr, Asp, Asn, Cys, Trp, Met, or Glu is not hydroxylated.
  • a proline immediately preceded by Gly is hydroxylated in Arabidopsis, but not in Solanaceae or Leguminaceae.
  • a proline immediately preceded by His is usually not hydroxylated, but there is at least one exception (in maize).
  • Pro in the sequence Pro-Val is always hydroxylated unless hydroxylation is forbidden by rule 1.
  • the folding of a protein may be such as to occlude potential Pro-hydroxylation sites. This is most likely to be a problem with proteins which have significant tertiary or supersecondary structure. Indicators of potential problem proteins are the presence of disulfide bonds (which may be inferred from the presence of paired cysteines) and low proline (proline tends to interfere with the formation of secondary structures such as alpha helices and beta strands, and hence with formation of higher structures).
  • Pro-hydroxylation sites are preferably predicted, as described above, on the basis of the Hyp-score.
  • the number of predicted Pro-hydroxylation sites is then dependent on the choice of values in the Hyp-Score calculation for the LCFB, taken together with the Score Threshold, which determines whether the target proline is classified as a predicted Pro-hydroxylation site. Only predicted Pro-hydroxylation sites can be predicted Hyp-glycosylation sites. If the LCFB is given its preferred value as set forth above, then the number of predicted Pro-hydroxylation sites will be inversely (but not necessarily linearly) dependent on the Score Threshold.
  • the prediction of Pro-hydroxylation sites is based on the preferred Score Threshold of 0.5. This value was found to yield acceptable results in predicting the hydroxylation of a "problem set" of weakly hydroxylated proteins.
  • mutate a protein so as to improve the Hyp-score of one or more of the predicted Hyp-Glycosylation sites, rather than to create a new Hyp-Glycosylation site.
  • Whether a mutation merely improves the Hyp-Score of a predicted site, or creates a new site, is dependent on the Score Threshold.
  • For example, if a parental protein has four prolines, with Hyp scores of 0.6, 0.71, 0.83, and 1.2, and mutation increases the lowest score from 0.6 to 0.7, then there is an increase in the number of Pro-hydroxylation sites if the Score Threshold is 0.7, but not if the Score Threshold is 0.5.
  • the improvement of the Hyp-Score of a Pro-hydroxylation site predicted with the default Score Threshold can be characterized as equivalent to the creation of a new predicted Pro-hydroxylation site if a more stringent Score Threshold is employed.
  • PRPs are at best lightly arabinosylated but not arabinogalactosylated despite having some clustered non-contiguous Hyp.
  • An examination of protein sequence and composition provides clues.
  • Both PRPs and AGPs are Hyp-rich. However, AGPs are also rich in Ala, Ser, Thr, and sometimes Gly, but notably poor in Tyr and Lys, at least in the Hyp-rich domains, and AGPs are not highly repetitive. PRPs are the most repetitive of the HRGPs and rich in Hyp, Val, Tyr, and Lys and seldom contain Ala or Gly.
  • the most common repeat motifs of PRPs are variations of the pentapeptide/hexapeptide: Lys-Pro-Hyp-Val-Tyr/Lys-Pro-Hyp-Hyp-Val-Tyr (SEQ ID NO:60).
  • Extensins are highly repetitive (Ser-Hyp-Hyp-Hyp-Hyp-Hyp, SEQ ID NO:62, is the extensin identifying sequence), Lys, Tyr, Val-rich, generally Ala and Gly-poor. Extensins are not arabinogalactosylated.
  • Hyp in blocks of three or more contiguous Hyp are about 100% arabinosylated.
  • Hyp in blocks of only two contiguous Hyp are about 50-65% arabinosylated.
  • Non-contiguous Hyp residues can be arabinosylated, arabinogalactosylated, or non-glycosylated, as predicted by the rules below.
  • Hyp residues are Clustered Hyp residues (e.g., (X-Hyp)n, where
  • If condition 3.1.1 is not met, they are arabinosylated or non-glycosylated, and it is prudent to assume that they are non-glycosylated
  • Hyp residue is not immediately followed by Lys, Arg, His, Phe, Tyr, Trp, Leu or Ile.
  • If condition 3.2.2 applies, then the following method may be used to predict whether the Hyp is arabinosylated or not, but it should be noted that this extension is considered less accurate than the method as described up to this point. In essence, if condition 3.2.2 applies, the Hyp are non-glycosylated if at least two of the four conditions below are met for the aforementioned 11 amino acid window:
  • the window will be truncated on the terminal side.
  • Our earlier work (Shpak et al 2001, J. Biol. Chem. 276, 11272-11278) with repetitive Ser-Hyp-Hyp motifs, which necessarily include dipeptidyl Hyp, indicated the first Hyp in the dipeptide block is always arabinosylated and the second one is incompletely arabinosylated.
  • the old standard method classifies all Hyp residues as large block Hyp, dipeptidyl Hyp, clustered Hyp or isolated Hyp. It may be advantageous to recognize a spectrum of isolation, e.g.,
  • the hydroxyprolines form a series of three (including the target Hyp) proximate Hyp, and are therefore considered "grouped", while in the fourth line, the three hydroxyprolines are not proximate to each other and therefore are considered highly isolated.
  • we would expect grouped Hyp to be more likely to be glycosylated than would be highly isolated Hyp.
  • Hyp in blocks of three or more contiguous Hyp are about 100% arabinosylated.
  • Hyp in blocks of only two contiguous Hyp are about 50-65% arabinosylated.
  • Hyp which are not contiguous with other Hyp are arabinogalactosylated.
  • Test A: If residue 4 is Hyp then do test B, otherwise do test C.
  • Test B: If residue 6 is Hyp OR residue 3 is Hyp then return an answer of Arabinosylated for residue 5. Otherwise return an answer of unaltered Hydroxyproline for residue 5. End all tests for this window.
  • Test C: If residue 6 is Hyp return an answer of Arabinosylated for residue 5 and end all tests for this window, otherwise do test D.
  • Test D: If residue 3 is Hyp or Pro AND residue 2 is not Hyp then do test E, otherwise do test G.
  • Test E: If residue 4 is one of (Ser, Ala, Val or Gly) AND the total number of (Lys, Tyr, His) is fewer than two then return an answer of Arabinogalactosylated for residue 5, otherwise do test F.
  • Test F: If residue 4 is Thr then return an answer of Arabinosylated for residue 5, otherwise return an answer of unaltered Hydroxyproline for residue 5. End all tests for this window.
  • Test G: If residue 7 is Hyp or Pro AND residue 8 is not Hyp do test E, otherwise do test H.
  • Test H: If residues 4 to 6 inclusive have one of the sequences (Thr-Hyp-Lys), (Thr-Hyp-His), (Gly-Hyp-Lys) or (Ser-Hyp-Lys) then return an answer of Arabinosylated for residue 5, otherwise do test I.
  • Test I: If residue 7 or residue 3 is Pro do test J, otherwise do test K.
  • Test J: If residue 4 is one of (Ser, Ala, Val or Gly) AND residue 6 is one of (Leu, Ile, Glu or Asp) then return an answer of Arabinogalactosylated for residue 5, otherwise do test K.
  • Test K: If residue 6 is one of (Lys, Arg, His, Phe, Tyr, Trp, Leu or Ile) then return an answer of unaltered Hydroxyproline for residue 5, otherwise do test L.
  • Test L: If the total number of (Hyp, Pro) is greater than three then return an answer of unaltered Hydroxyproline for residue 5, otherwise do test M.
  • Test M: If the total number of (Ser, Thr, Ala) is fewer than four then return an answer of unaltered Hydroxyproline, otherwise do test N.
  • Test N: If the total number of different residue types is greater than three then return an answer of Arabinogalactosylated for residue 5, otherwise do test O.
  • Test O: If the total number of (Ser, Thr, Ala) is greater than four then return an answer of Arabinogalactosylated for residue 5, otherwise return an answer of unaltered Hydroxyproline for residue 5. End all tests for this window.
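Tests A through O form a decision tree, which can be sketched as a single Python function. This is an illustrative reading of the tests, not the program referred to in this disclosure. Several points are assumptions made for the sketch: the window is taken to be 9 residues with the target Hyp at residue 5 (the tests refer to residues 2 through 8 but do not state the window length here), "total number" counts are taken over the whole window including the target, Hyp is written as `O` in one-letter code, and the name `classify_hyp` is hypothetical.

```python
ARA = "Arabinosylated"
AG = "Arabinogalactosylated"
NONE = "unaltered Hydroxyproline"


def classify_hyp(window):
    """Apply tests A-O to an assumed 9-residue window whose residue 5 is Hyp ('O')."""
    if len(window) != 9 or window[4] != "O":
        raise ValueError("expected 9 residues with Hyp ('O') at residue 5")
    r = {i + 1: aa for i, aa in enumerate(window)}  # 1-based residue numbering

    def count(residues):
        return sum(window.count(aa) for aa in residues)

    if r[4] == "O":                                   # Test A
        return ARA if (r[6] == "O" or r[3] == "O") else NONE  # Test B
    if r[6] == "O":                                   # Test C
        return ARA
    test_d = r[3] in "OP" and r[2] != "O"             # Test D
    test_g = r[7] in "OP" and r[8] != "O"             # Test G (reached if D fails)
    if test_d or test_g:
        if r[4] in "SAVG" and count("KYH") < 2:       # Test E
            return AG
        return ARA if r[4] == "T" else NONE           # Test F
    if window[3:6] in ("TOK", "TOH", "GOK", "SOK"):   # Test H
        return ARA
    if r[7] == "P" or r[3] == "P":                    # Test I
        if r[4] in "SAVG" and r[6] in "LIED":         # Test J
            return AG
    if r[6] in "KRHFYWLI":                            # Test K
        return NONE
    if count("OP") > 3:                               # Test L
        return NONE
    if count("STA") < 4:                              # Test M
        return NONE
    if len(set(window)) > 3:                          # Test N
        return AG
    return AG if count("STA") > 4 else NONE           # Test O


print(classify_hyp("OSOSOSOSO"))  # clustered Ser-Hyp repeat
print(classify_hyp("AASOOASAA"))  # second Hyp of a dipeptidyl Hyp
```

Because tests D and G both lead to test E, they are combined with a logical OR; the path through the tree is otherwise the same as in the prose.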
  • Tests A-C deal with contiguous Hyp. If the scan encounters O*O, OO*, or X*O (where * is the target Hyp, O is other Hyp, and X is another amino acid), these tests predict that * is arabinosylated. Note that X*O could mean either the beginning of a 3+ block of Hyp, or the first Hyp of dipeptidyl Hyp. If it encounters XO*X it predicts that the * (the second Hyp of dipeptidyl Hyp) is left unglycosylated.
  • the subtle difference between new standard tests A-C and rule 2 of the old standard method is that for dipeptidyl Hyp, the old method said that the dipeptide was about 50% arabinosylated, while the new method identifies the first Hyp as arabinosylated and the second as non-glycosylated.
  • In test D, we have clustered non-contiguous Hyp/Pro sequences (specifically, X(O/P)X*X), and are directed to tests E and possibly also F.
  • Arabinogalactans are associated with such sequences when they are Ala, Ser, Val, Gly rich and Lys, Tyr, His poor.
  • Test E looks to whether there is A/S/V/G preceding *, and whether the window in general is K/Y/H poor. If so, then the * (which is the second, or later, Hyp of a cluster) is predicted to be arabinogalactosylated.
  • Although Thr can also promote arabinogalactan addition in this situation (as we have observed in tobacco cells expressing a repetitive TP synthetic sequence), and is common in AGPs, it was excluded from Test E because it doesn't appear to have the same effect in maize.
  • the person skilled in the art may wish to modify the algorithm to account for differences between, e.g., dicots like tobacco, and graminaceous monocots like maize. That is part of the test in view of, e.g., the lack of arabinogalactosylation of * in certain X(O/P)T*X sequences in maize THRGP (CAA45514) and maize-expressed human IgA1.
  • If test E is failed, the complementary test F predicts arabinosylation of * in X(O/P)T*X.
  • tests E and F predict arabinosylation, but not arabinogalactosylation, of certain T*X sequences, consistent with N. tabacum extensin (JU0465), maize THRGP (CAA45514) and maize-expressed human IgA1.
  • If test D is failed, we go to test G. If test G is satisfied, we reach test E by a new route.
  • the prior failure of test D means that the * is the first Hyp of a cluster. Satisfaction of test E means that it is arabinogalactosylated.
  • Test G was inspired by LeAGP-1 and the sequence HSOLPT (SEQ ID NO: 64) in Jay's gum, wherein the SOLP (AAs 1-4 thereof), while of the form XOXP, behaves much like XOXO.
  • Tests D-G of the new method deal, as did old rule 3.1, with clustered Hyp residues. However, unlike the old rule, they don't accept T*X. That is a problem with certain maize THRGP sequences, so test H, if satisfied, predicts arabinosylation of the * in the sequences T*K, T*H, G*K and S*K.
  • Tests I through K distinguish among AGP-like sequences having clustered Pro/Hyp, and PRP/extensin sequences having clustered Pro/Hyp.
  • Tests J and K deal with unique modules in 'problem proteins' like Jay's Gum and THRGP from Maize, which was a particular problem.
  • Test J was designed for test case 'Jay's Gum 1' (AKA [Gum-1]n in the paper: M.J. Kieliszewski and J. Xu, "Synthetic Genes for the Production of Novel Arabinogalactan-proteins and Plant Gums," Foods and Food Ingredients Journal of Japan, 211(1): 32-36 (2006)). Ile, Glu and Asp were added, speculatively, as amino acids following Pro that are likely to allow arabinogalactosylation.
  • Test K surveys composition in similar sequences and determines that when the target Hyp is followed by bulky amino acids like Lys, His, Tyr, I, F, L (at residue 6) the Hyp remains non-glycosylated. R, W were thrown in for cases that might arise although these amino acids are rare in HRGPs.
  • Gum Arabic Glycoprotein is one example; it contains the sequence TOOTG*HSOSOA (SEQ ID NO:43), with target Hyp shown as *. The O in GOH is not arabinoglycosylated.
  • Tests L-O deal with the situation of isolated Hyp residues, as did old 3.2. Tests L-M are defined so that if either is positive, the target Hyp is unaltered. On the other hand, tests N and O are defined so that if either is positive, the target Hyp is arabinogalactosylated.
  • In test L, we know that old 3.3.1(d) is negative, because if old 3.3.1(d) were positive, then test K would have been positive and unaltered target Hyp predicted.
  • Tests L-O are related to old rule 3.2, as follows: if old 3.2.1(a) is negative, test L is positive; if old 3.2.1(b) is negative, test M is positive; and if old 3.2.1(c) is positive, test N and/or test O are positive.
  • PS polysaccharide (i.e., arabinogalactosylation)
  • Ara arabinosylation
  • Gly glycosylation (sum of PS and Ara).
  • the number of actual Hyp-glycosylation sites should be sufficient to achieve the desired levels of secretion in plant cells. It does not appear that the level of secretion increases as a smooth function of the number of actual Hyp-glycosylation sites.
  • the non-plant proteins with addition glycomodules featuring as few as two and as many as over one hundred Hyp-glycosylation sites have demonstrated increased secretion. It is believed that even a single site can provide at least an improved level of secretion.
  • the number of actual Hyp-glycosylation sites may be one, two, three, four, five, six, seven, eight, nine, ten or more, such as at least fifteen, at least twenty, etc.
  • the main limitation on the number of actual Hyp-glycosylation sites is that the level of Hyp-glycosylation not be so great as to substantially interfere with expression, e.g., through excessive demand for sugar for incorporation into the glycoprotein.
  • the number of actual Hyp-glycosylation sites is not more than 1000, more preferably not more than 500, still more preferably not more than 200, even more preferably not more than 150, and most preferably not more than 100. That said, proteins with addition Hyp-glycomodules featuring as many as 160 Hyp-glycosylation sites have been expressed and secreted in plants.
  • all of the predicted Hyp-glycosylation sites are actual Hyp-glycosylation sites. In other embodiments, only some of them are actual Hyp-glycosylation sites, the others being false positives. Whether a predicted site is an actual site may in fact vary depending on the species of plant cell, as there are differences in hydroxylation and perhaps also glycosylation patterns, depending on the species. There may also be one or more false negatives (unpredicted actual Hyp-glycosylation sites). In general, the goal is to achieve a particular number (or range of numbers) of actual Hyp-glycosylation sites.
  • the desired number of predicted Hyp-glycosylation sites will then depend on the propensity of the Hyp-glycosylation prediction method toward false positives and negatives. For example, if you wanted to achieve at least two actual Hyp-glycosylation sites, and the prediction method was such that there was a 50% chance that the predicted Hyp-glycosylation site was a false positive (and there was a 0% chance of a false negative), then you would want at least four predicted Hyp-glycosylation sites.
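The arithmetic in the example above can be written out as a short calculation. This sketch assumes, as the example does, a uniform false-positive rate and no false negatives, and it works with expected values rather than guarantees; the function name `predicted_sites_needed` is hypothetical.

```python
import math


def predicted_sites_needed(actual_sites_wanted, false_positive_rate):
    """Predicted sites required so that the expected number of actual sites
    reaches the target, assuming each predicted site is independently a false
    positive with the given rate and there are no false negatives."""
    if not 0 <= false_positive_rate < 1:
        raise ValueError("false_positive_rate must be in [0, 1)")
    return math.ceil(actual_sites_wanted / (1 - false_positive_rate))


# Two actual sites wanted with a 50% false-positive rate: four predicted sites.
print(predicted_sites_needed(2, 0.5))
```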
  • Predicted Hyp-glycosylation sites may vary in terms of the probability that they are actually glycosylated, and the prediction method may be devised so as to state such a probability for each site.
  • for a site to be an actual Hyp-glycosylation site, it must also be an actual Pro-hydroxylation site.
  • the protein must have at least that number of actual Pro-hydroxylation sites.
  • for a site to be a predicted Hyp-glycosylation site, it must also be a predicted Pro-hydroxylation site.
  • predicted Pro-hydroxylation sites may vary in terms of the probability that the prolines in question are in fact hydroxylated, and the prediction method may be devised so as to state a probability for each site.
  • the Hyp-Score referred to above is believed to be related to that probability, with a high score indicating a high probability of hydroxylation.
  • you will generally need an equal or greater number of predicted Pro-hydroxylation sites.
  • the existence, or the total number, of the actual Pro-Hydroxylation sites and of the actual Hyp-glycosylation sites may be determined by any suitable method.
  • the glycosyl-Hyp linkage is base-stable.
  • base hydrolysis of a protein O-glycosylated through Hyp residues gives rise to a mixture of amino acids and Hyp-glycosides (the peptide bonds, but not the Hyp-glycosyl linkages, are broken).
  • The free amino acid Hyp and the Hyp occurring in Hyp-glycosides can be colorimetrically assayed and the amount of Hyp in a protein thereby quantified after base or acid hydrolysis of that protein.
  • Kivirikko, K.I. and Liesmaa, M., "A colorimetric method for determination of hydroxyproline in tissue hydrolysates," Scand. J. Clin. Lab. Invest. 11:128-131 (1959).
  • the assay involves opening of the Hyp ring by oxidation with alkaline hypobromite, subsequent coupling with acidic Ehrlich's reagent and monitoring absorbance at 560 nm.
  • Hyp-glycosides: Hyp-arabinogalactan polysaccharide, Hyp-Ara4, Hyp-Ara3, Hyp-Ara2, Hyp-Ara, and non-glycosylated Hyp.
  • the number of Hyp residues (i.e., actual Pro-hydroxylation sites) in a protein can be determined by amino acid analysis of the protein, see Bergman, T., M. Carlquist, and H. Jornvall; Amino Acid Analysis by High Performance Liquid Chromatography of Phenylthiocarbamyl Derivatives. Ed. B. Wittmann-Liebold. Berlin: Springer Verlag, 1986. 45-55.
  • the number of each Hyp species in a protein can be calculated. For instance, if a 200 residue protein contains 10 mol% Hyp, the 200-residue protein has 20 Hyp residues in it. If it also has 10% of its Hyp residues occurring as Hyp-arabinogalactan polysaccharide, 20% with Hyp-Ara3 and 70% non-glycosylated Hyp, the protein contains 2 Hyp-arabinogalactan polysaccharides, 4 Hyp-Ara3 moieties, and 14 non-glycosylated Hyp residues.
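The worked example above amounts to two multiplications, shown here as a small sketch. The function name `hyp_species_counts` and the dictionary keys are hypothetical labels chosen for illustration; the inputs are the figures from the example.

```python
def hyp_species_counts(n_residues, hyp_mol_percent, species_percent):
    """Return (total Hyp residues, {species: count}) from amino acid analysis data.

    hyp_mol_percent is the mol% of Hyp in the protein; species_percent maps each
    Hyp species to its share (in percent) of the total Hyp.
    """
    total_hyp = n_residues * hyp_mol_percent / 100
    counts = {name: total_hyp * pct / 100 for name, pct in species_percent.items()}
    return total_hyp, counts


total, counts = hyp_species_counts(
    200,
    10,
    {"Hyp-arabinogalactan polysaccharide": 10, "Hyp-Ara3": 20, "non-glycosylated Hyp": 70},
)
print(total)   # 20 Hyp residues in the 200-residue protein
print(counts)  # 2, 4 and 14 residues for the three species
```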
  • the location of the hydroxyprolines may be determined by fragmenting the proteins into peptides of sequenceable length, optionally deglycosylating the peptides, and then sequencing the peptides.
  • the proteins may be fragmented by treatment with one or more proteolytic non-enzymatic chemicals (e.g., cyanogen bromide) and/or one or more proteolytic enzymes.
  • Peptides may be deglycosylated, to simplify sequencing, by treatment with anhydrous hydrogen fluoride for 3h at room temperature, according to the method of Moor and Lamport.
  • Peptides may be sequenced by automated Edman degradation. In each cycle, the liberated amino acid is analyzed by reverse phase HPLC, by which it is compared to amino acid standards. Hydroxyproline standards are available.
  • peptides may be sequenced by tandem mass spectrometry.
  • the proteins of interest may be known, naturally occurring proteins which, without further modification, already contain a sufficient number of Hyp-glycosylation sites to be desirably secreted if suitably expressed in plant cells, i.e., when co-expressed with a suitable exogenous proline hydroxylase. They may be referred to as predisposed proteins because they are predisposed, by virtue of their translated amino acid sequence, and its propensity to Pro-hydroxylation and Hyp-glycosylation, to the desired level of Hyp-glycosylation.
  • the predisposed proteins may be non-plant proteins (preferably a vertebrate protein, more preferably a mammalian protein, most preferably a human protein), or they may be plant proteins which are not normally secreted.
  • the protein of interest if produced in tobacco cells, is not a naturally occurring human collagen, and more preferably not a naturally occurring collagen of any species, regardless of the choice of exogenous proline hydroxylase.
  • the protein of interest is not a naturally occurring human collagen, and more preferably not a naturally occurring collagen of any species, regardless of the choice of plant cell and the choice of exogenous proline hydroxylase. It should be understood that for the purpose of the disclaimer, and related preferred embodiments discussed in the above paragraph, the proteins are compared on the basis of the mature (non-signal) portions of their translated amino acid sequences, i.e., ignoring subsequent hydroxylation and glycosylation.
  • the proteins of interest may also be known proteins which are modified, in accordance with the teachings of the present invention, in such manner as to increase the number of predicted or actual Hyp-glycosylation sites therein, to increase the likelihood of Hyp-glycosylation at an existing site, and/or to alter the nature of the glycosylation at a Hyp-glycosylation site, wherein at least one such site has a greater likelihood of proline hydroxylation by the exogenous proline hydroxylase than by any endogenous proline hydroxylase.
  • the modified (mutant) proteins may but need not feature additional mutations, for other purposes, as well.
  • Parental proteins for which such modification is considered desirable may be collectively referred to as Hyp-glycosylation-deficient proteins, and the suitably modified proteins as Hyp-glycosylation-supplemented proteins.
  • When such modification is considered desirable, it may be helpful to distinguish the parental protein from the expressed (modified) protein. While the latter is necessarily a mutant protein, the parental protein could be a naturally occurring protein, or a protein mutated for other purposes. In those embodiments in which the protein is not modified to affect Hyp-glycosylation, the expressed protein is also the parental protein.
  • While we speak formally of modifying a parental protein, it is not necessary to synthesize a parental protein and then modify it chemically. Rather, we mean that the parental protein is used as a guide in the design of a mutant protein which differs from it at one or more amino acid positions, so that the mutant protein can be formally characterized as a modification of the parental protein.
  • the plant cell-expressed and -secreted protein is preferably biologically active. However, if it is not itself biologically active, it preferably is cleavable, by a site-specific cleaving agent such as an enzyme, so as to release a biologically active polypeptide. If it is biologically active, it preferably retains one or more biological activities, and more preferably all biological activities, of the parental protein.
  • The parental protein which is mutated may be a non-plant protein (preferably a vertebrate protein, more preferably a mammalian protein, most preferably a human protein), or it may be a plant protein, as not all plant proteins are in fact predisposed to Hyp-glycosylation (they may lack prolines, or the prolines may have a low predicted Hyp-score).
  • Proteins of interest are proteins which comprise at least one predicted Hyp-glycosylation site, and which, if expressed and secreted in plant cells, exhibit Hyp-glycosylation (thus necessarily comprising at least one actual Hyp-glycosylation site, regardless of whether the location of the site is correctly predicted).
  • at least one predicted Hyp-glycosylation site is also an actual Hyp-glycosylation site.
  • A protein is also of interest if it is a non-plant protein which, in nascent form, comprises at least one proline, and exhibits Hyp-glycosylation, regardless of whether it was predicted to contain a Hyp-glycosylation site. It is possible to simply express DNA encoding a non-plant protein, said DNA including at least one proline codon, and determine experimentally whether the protein, when expressed and secreted in plant cells, exhibits Hyp-glycosylation, without making any attempt to predict whether such Hyp-glycosylation would occur.
  • The mutant proteins of interest preferably have a greater number of actual Hyp-glycosylation sites and/or a greater number of predicted Hyp-glycosylation sites than does the parental protein.
  • the proteins of interest may each be classified in a number of ways.
  • In Hyp-glycosylation-deficient parental proteins, there may be zero, one, two, three, four, five, six, seven, eight, nine, ten or even more prolines.
  • These Hyp-glycosylation-deficient proteins have relatively few prolines, because each proline, if in a region favorable to hydroxylation and glycosylation, can become a Hyp-glycosylation site.
  • the Hyp-glycosylation-predisposed proteins and Hyp-glycosylation supplemented proteins necessarily include at least one proline. They may have one, two, three, four, five, six, seven, eight, nine, ten or even more prolines, such as at least fifteen, at least twenty, or at least twenty five prolines.
  • Hyp-glycosylation-predisposed and Hyp-glycosylation-deficient proteins may also be classified by proline content as follows: less than 2.5% proline, 2.5-10% proline, and more than 10% proline.
  • These proteins of interest may be classified according to the number of predicted Hyp-glycosylation sites. There may be zero (for Hyp-glycosylation-deficient proteins only), one, two, three, four, five, six, seven, eight, nine, ten or even more such sites, such as at least fifteen, at least twenty, or at least twenty five such sites.
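  • As an illustration only, the three proline-content classes above could be assigned from a one-letter amino acid sequence as sketched below. The function name is hypothetical, and the treatment of the exact boundary values (2.5% and 10% fall in the middle class) is an assumption, since the text does not say which class the boundaries belong to.

```python
def proline_content_class(sequence: str) -> str:
    """Assign a one-letter amino acid sequence to a proline-content class.

    Assumption: exactly 2.5% and exactly 10% are placed in the middle class.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    percent_pro = 100.0 * seq.count("P") / len(seq)
    if percent_pro < 2.5:
        return "less than 2.5% proline"
    if percent_pro <= 10.0:
        return "2.5-10% proline"
    return "more than 10% proline"


if __name__ == "__main__":
    # A (Ser-Pro) repeat is 50% proline.
    print(proline_content_class("SP" * 10))
```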
  • the proteins of interest may also be classified according to their total Hyp score, according to the quantitative standard method, for all of the prolines in the protein, divided by the score threshold. This could be, e.g., less than 2, at least 2 but less than 4, at least 4 but less than 8, at least 8 but less than 16, or at least 16.
  • Another structural feature of interest is the length of the protein.
  • Still another structural feature of interest is the number of disulfide bonds, which can be zero, one, two, three, four or more than four.
  • NCBI/GenBank maintains a taxonomy database.
  • The proteins of interest may be classified according to their species of origin, each taxonomic grouping defining a particular class of proteins of interest. (Mutant proteins are classified according to the species of origin of the parental protein.) At the highest level, these are Archaea, Bacteria, Eukaryota, Viroids, Viruses, and Other.
  • Eukaryotic taxons of particular interest include Viridiplantae and Vertebrata; within Vertebrata, Mammalia; and within Mammalia, Homo sapiens.
  • The protein may be a plant protein, in which case the plant may be an alga (algae are in some cases also microorganisms), or a vascular plant, especially a gymnosperm (particularly conifers) or an angiosperm.
  • Angiosperms may be monocots or dicots.
  • the plants of greatest interest are rice, wheat, corn, alfalfa, soybeans, potatoes, peanuts, tomatoes, melons, apples, pears, plums, pineapples, fir, spruce, pine, cedar, and oak.
  • the protein may be that of a microorganism, in which case the microorganism may be an alga, bacterium, fungus or virus.
  • the microorganism may be a human or other animal or plant pathogen, or it may be nonpathogenic. It may be a soil or water organism, or one which normally lives inside other living things, or one which lives in some other environment.
  • the protein may be that of an animal, and the animal may be a vertebrate or a nonvertebrate animal.
  • Nonvertebrate animals which are human or economic animal pathogens or parasites are of particular interest.
  • Nonvertebrate animals of interest include worms, mollusks, and arthropods.
  • the vertebrate animal may be a mammal, bird, reptile, fish or amphibian.
  • The animal preferably belongs to the order Primates (humans, apes and monkeys), Artiodactyla (e.g., cows, pigs, sheep, goats, horses), Rodentia (e.g., mice, rats), Lagomorpha (e.g., rabbits, hares), or Carnivora (e.g., cats, dogs).
  • the animals are preferably of the orders Anseriformes (e.g., ducks, geese, swans) or Galliformes (e.g., quails, grouse, pheasants, turkeys and chickens).
  • the animal is preferably of the order Clupeiformes (e.g., sardines, shad, anchovies, whitefish, salmon).
  • a third approach to classification is by gene ontology, and is discussed in a later section.
  • The proteins of interest include, but are not limited to, (1) the specific proteins set forth in sections I-III, classifying proteins on the basis of their native predicted Hyp-glycosylation sites, and (2) whether or not already listed under (1), vertebrate, preferably mammalian, more preferably human, proteins selected from the group consisting of growth hormone, growth hormone mutants which act as growth hormone or prolactin agonists or antagonists (a category discussed in more detail below), growth hormone releasing hormone, somatostatin, ghrelin, leptin, prolactin, prolactin mutants which act as prolactin or growth hormone antagonists, monocyte chemoattractant protein-1, interleukin-10, pleiotropin, interleukin-7, interleukin-8, interferon omega, interferon alpha 2a and 2b, interferon gamma, interleukin-1, fibroblast growth factor 6, IGF-I, insulin-like growth factor I, insulin
  • The level of expression of a protein may be determined by any art-recognized method.
  • the level of expression is directly related to the level of transcription, which can be determined by a northern blot analysis of the corresponding mRNA.
  • the level of expression may also be determined by Western blot analysis. (If the Western blot analysis is of the protein in the culture medium, then the analysis is measuring the level of protein both expressed and secreted. To determine the total expression, the cells may be lysed and the analysis consider the lysate as well as the medium.)
  • the non-plant proteins of the present invention are secreted in plant cells at a level which is increased relative to the level at which they have previously been secreted in non-plant cells.
  • the modified proteins of the present invention are secreted in plant cells at a level which is increased relative to that at which the parental protein can be secreted, using the identical plant cell species, culture conditions, promoter and secretion signal.
  • the level of secretion may be determined by any art-recognized method, including Western blot analysis of the level of the protein in the culture medium.
  • The level of secretion may be characterized by the concentration of the protein in the medium, by the level of the protein in the medium as a percentage of total soluble protein (TSP) in the medium, or by the level of the protein in the medium as a percentage of total secreted proteins in the medium.
  • Preferred (high) levels of secretion are at least 1 mg/L protein equivalent in medium, more preferably at least 5 mg/L, still more preferably at least 10 mg/L to 150 mg/L, most preferably at least about 30 mg/L. It is expected that for the parental proteins lacking Hyp-glycosylation, the level of secretion is typically less than 100 µg/L, or even less than 1 µg/L. That implies preferred increases in secretion of at least 10-fold, more preferably at least 100-fold, still more preferably at least 1,000-fold, most preferably at least 10,000-fold.
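  • The fold increases quoted above follow from simple unit conversion (1 mg/L equals 1,000 µg/L). The short sketch below, with a hypothetical helper name, reproduces the lower and upper ends of the stated range from the concentrations given in the text.

```python
def fold_increase(secreted_mg_per_l: float, parental_ug_per_l: float) -> float:
    """Ratio of a secreted level (mg/L) to a parental level (ug/L)."""
    if parental_ug_per_l <= 0:
        raise ValueError("parental level must be positive")
    return secreted_mg_per_l * 1000.0 / parental_ug_per_l


if __name__ == "__main__":
    # 1 mg/L versus a parental level of 100 ug/L, and 10 mg/L versus 1 ug/L.
    print(fold_increase(1, 100))
    print(fold_increase(10, 1))
```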
  • the protein of the present invention as a result of the native or introduced Hyp-glycomodules, the choice of secretion signal peptide, and, optionally, N-glycosylation, has a level of secretion of at least 1% TSP, more preferably at least 2% TSP.
  • the secreted protein of interest is at least 50%, more preferably at least 75%, still more preferably at least 85%, of the secreted proteins in the medium.
  • A non-naturally occurring protein is one which is not known to occur in a cell or virus, except as a result of human manipulation.
  • The present invention contemplates mutation of a parental protein to create a mutant, non-naturally occurring protein with an increased propensity to Pro-hydroxylation and/or Hyp-glycosylation. Preferably there is a net increase in the number of Pro-hydroxylation and Hyp-glycosylation sites. More preferably, no Pro-hydroxylation and Hyp-glycosylation sites are lost as a result of the mutation.
  • The mutant protein will of course be designed with a particular parental protein in mind.
  • the mutant is designed with reference to a particular protein, i.e., incorporating predetermined insertions, deletions and substitutions relative to a predetermined parental protein.
  • the mutant may come to more closely resemble some other protein, either fortuitously, or because the practitioner was guided by more than one parental protein in designing the mutant protein.
  • A first protein may be considered a mutant of a second protein if the first protein has an amino acid sequence which, when aligned by BlastP, with default parameters, to the sequence of the second protein, generates an alignment score which is statistically significant, i.e., is a higher score than would be expected if the mutant amino acid sequence were aligned with randomly jumbled amino acid sequences of the same length and amino acid composition.
  • If the predetermined parental protein used in such design is not known to the practitioner, it may be identifiable by using the sequence of the mutant protein as a query sequence in searching a suitable sequence database containing the parental sequence.
  • a mutant protein is not necessarily non-naturally occurring, as a mutant of protein A may coincidentally be identical to naturally occurring protein B.
  • A protein is considered to be a mutant of a non-plant protein if 1) it is known to have been designed as a mutant of a predetermined non-plant protein and remains more than 50% identical to that non-plant protein, 2) it was made by expression of a gene derived by mutation of a gene encoding a non-plant protein, 3) it has, or comprises a sequence which has, a biological activity which is found in a naturally occurring non-plant protein but which biological activity is not known to occur in any plant protein, or 4) it has, ignoring all Hyp-glycomodules as herein defined, a higher alignment score (aligning with BlastP, default settings) with respect to a non-plant protein than with respect to any known plant protein.
  • Hyp-glycomodules are common in some plant proteins and hence incorporating Hyp-glycomodules into, e.g., a human protein, will cause it to have a higher alignment score with those plant proteins than would otherwise be the case. If need be, each of these four definitional considerations may be used to define a separate class of mutants of non-plant proteins.
  • Mutants of vertebrate, mammalian and human proteins, as well as mutants of non-vertebrate, non-mammalian, and non-human proteins, may be defined in an analogous manner. Mutations may take the form of insertions, deletions or substitutions. While we recognize that a substitution may be conceptualized as a deletion followed by an insertion, we do not so consider it here.
  • each residue of the mutant protein is 1) aligned with an identical residue of the parental protein (in which case that is considered an unmutated position), 2) aligned with a non-identical residue of the parental protein (in which case that is considered a substitution), or 3) aligned with a null character (usually represented as a space or hyphen), implying that there is no corresponding residue in the parental protein (in which case the residue in question is considered an inserted amino acid).
  • a residue of the parental protein instead of being aligned with a residue of the mutant protein (resulting in the position being considered either unmutated or substituted), may be aligned with a null character, implying that there is no corresponding residue in the mutant protein (in which case the residue in question is considered a deleted amino acid).
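  • The column-by-column rules above can be sketched as a small routine that takes two already-aligned sequences of equal length, with a hyphen as the null character, and counts each kind of position. The function name and the example sequences are hypothetical; the alignment itself is assumed to have been produced elsewhere (e.g., by BlastP).

```python
def classify_alignment(parental_aln: str, mutant_aln: str, gap: str = "-") -> dict:
    """Count unmutated, substituted, inserted and deleted positions.

    Both inputs are rows of one pairwise alignment and must have equal length.
    """
    if len(parental_aln) != len(mutant_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"unmutated": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for p, m in zip(parental_aln, mutant_aln):
        if p == gap and m == gap:
            continue  # a column of two null characters carries no information
        if p == gap:
            counts["insertion"] += 1      # residue present only in the mutant
        elif m == gap:
            counts["deletion"] += 1       # residue present only in the parent
        elif p == m:
            counts["unmutated"] += 1
        else:
            counts["substitution"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical alignment: one substitution (T to S), one insertion (P),
    # and one deletion (Y).
    print(classify_alignment("MKT-AYL", "MKSPA-L"))
```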
  • The protein can retain a high degree of sequence identity to the parental protein. For example, it may be possible to create a new predicted Hyp-glycosylation site by as little as a single substitution mutation. In the worst possible case, a Hyp-glycosylation site can be created by five consecutive substitution mutations. Plainly, one can also have the intermediate situation in which the new Hyp-glycosylation site is created by two, three or four mutations within a consecutive five amino acid subsequence of the parental protein.
  • Thus, in a protein two hundred amino acids in length, a single Hyp-glycosylation site can be created by just 1-5 substitution mutations, which corresponds to a change in percentage identity (see below) of just 0.5-2.5%.
  • two new Hyp-glycosylation sites can be created by just 1-10 substitution mutations (the "1" is not a typographical error; a single substitution affects the Hyp-scores of prolines up to two amino acids before it and up to two amino acids after it, and therefore could cause the Hyp-scores of two or more nearby prolines to exceed the preferred threshold of the prediction algorithm), corresponding to a change in percentage identity of just 0.5-5%. If no other mutations were made, the resulting modified protein would still be at least 95% identical to the parental protein.
  • Mutation is not limited to proteins of two hundred amino acids in length, and the number of additional Hyp-glycosylation sites is not limited to one or two.
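  • The percentages quoted above can be checked with the following sketch, which assumes the two-hundred-amino-acid protein length that the text refers to and assumes that substitutions are the only mutations. The function name is hypothetical.

```python
def identity_after_substitutions(length: int, n_substitutions: int) -> float:
    """Percentage identity remaining after substitutions only (no gaps)."""
    if length <= 0 or not 0 <= n_substitutions <= length:
        raise ValueError("need 0 <= n_substitutions <= length and length > 0")
    return 100.0 * (length - n_substitutions) / length


if __name__ == "__main__":
    for n in (1, 5, 10):
        remaining = identity_after_substitutions(200, n)
        print(n, "substitutions:", remaining, "% identical,",
              100.0 - remaining, "% change")
```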
  • the practitioner must strike a balance between the addition of Hyp-glycosylation sites (with the potential for improved secretion and other advantages) and any adverse effect on biological activity and/or immunogenicity.
  • One method of concisely stating the relationship of two proteins is by stating a percentage identity.
  • This application contemplates two percentage identities, primary and secondary.
  • the primary percentage identity is determined by first aligning the two proteins by BlastP (a local alignment algorithm), with default parameters, and then expressing the number of matching aligned amino acids as a percentage of the length of the overlap region (which includes any gaps introduced during the alignment process).
  • the relationship of the proteins may also be expressed by a secondary ("global") percentage identity calculation, in which the number of matches is expressed as a percentage of the length of the longer sequence (which is likely to be the mutant protein).
  • If the mutant protein results from simple addition of one or more Hyp-glycomodules to the amino or carboxy terminal of the parental protein, then the mutant protein remains identical to the parental protein in the overlap region, i.e., the calculated primary percentage identity is 100% even though the mutant protein is longer than the parental protein.
  • the secondary percentage identity would be less than 100%.
  • The addition of (Ser-Hyp)10 to a 200 amino acid protein would result in a secondary percentage identity of 200/220, or about 91%.
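  • The difference between the two percentage identities can be made concrete with a minimal sketch. It assumes the number of matches and the overlap length have already been obtained from a BlastP alignment; the function names are hypothetical.

```python
def primary_identity(matches: int, overlap_length: int) -> float:
    """Matches as a percentage of the overlap region (gaps included)."""
    return 100.0 * matches / overlap_length


def secondary_identity(matches: int, len_a: int, len_b: int) -> float:
    """Matches as a percentage of the length of the longer sequence."""
    return 100.0 * matches / max(len_a, len_b)


if __name__ == "__main__":
    # A 200-residue parental protein with a 20-residue (Ser-Hyp)10 addition:
    # all 200 parental residues still match over a 200-residue overlap.
    print(primary_identity(200, 200))
    print(round(secondary_identity(200, 200, 220), 1))
```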
  • the mutants of the present invention are at least 50% identical, more preferably at least 60%, at least 70%, at least 80%, at least 85%, or at least 90%, such as at least 91, 92, 93, 94, 95, 96, 97, 98, or 99% identical, to the parental protein when percentage identity is calculated by the primary and/or by the secondary method.
  • A mutant cannot be identical to the parental protein, but as explained above, it may nonetheless have a primary percentage identity which is 100%.
  • substitutions can be conservative and/or nonconservative.
  • In conservative amino acid substitutions, the substituted amino acid has similar structural and/or chemical properties to the corresponding amino acid in the reference sequence.
  • Conservative substitutions (replacements) are defined as exchanges within the groups set forth below:
  • Non-conservative substitutions may be further classified as semi-conservative or as strongly non-conservative.
  • Inter-group exchanges of group I-III residues may be considered semi-conservative, as they are all hydrophilic, neutral (Gly), or only slightly hydrophobic (Ala).
  • Inter-group exchanges of Group IV and V residues can be considered semi-conservative, as they are all strongly hydrophobic.
  • Exchanges of Ala with amino acids of groups II-V can be considered semi-conservative, as this is the principle underlying Ala scanning mutagenesis. All other non-conservative substitutions are considered strongly non-conservative.
  • All substitutions are at least semi-conservative, more preferably, at least conservative.
  • All substitutions are at least semi-conservative, more preferably, at least conservative, and most preferably, are highly conservative.
  • each mutated position is one which is not a conserved position in the family.
  • the mutant protein may differ from the parental protein by further mutations not related to the control of the level of hydroxylation of proline and/or glycosylation of hydroxyproline, but it is desirable that such further mutations not substantially impair the biological activity of the protein (or, if the protein is to be further processed to yield the final biologically active molecule, of the latter).
  • A protein comprising at least one Hyp-glycosylation site must necessarily comprise at least one Hyp-glycomodule. It may comprise, e.g., two, three, four, five, six or more Hyp-glycomodules.
  • Each Hyp-glycomodule comprises, in accordance with the definition, at least one Hyp-glycosylation site. Again in accordance with the definition, Hyp-glycomodules may be adjacent to each other, or separated.
  • A Hyp-glycomodule may be classified according to its relationship, if any, to the underlying mutations which differentiate that mutant protein from a parental protein. Thus, it may be an insertion Hyp-Glycomodule (which optionally may further include substitutions and/or deletions), a substitution Hyp-Glycomodule (which optionally may further include deletions, but cannot include insertions), a deletion Hyp-Glycomodule (wherein only one or more deletions differentiate it from the aligned parental sequence), or a native Hyp-Glycomodule (which is identical to an aligned Hyp-Glycomodule of the parental protein).
  • An insertion Hyp-glycomodule is characterized as the result, at least in part, of insertion of one or more amino acids at the amino terminal, the carboxy terminal, or internally between two pre-existing amino acid positions, of the parental protein. If the insertions are solely of one or more amino acids at the amino or carboxy terminals, it may be further characterized as an addition glycomodule (a subtype of insertion glycomodule).
  • An insertion Hyp-glycomodule may, but need not, further involve one or more substitutions (replacements) and/or one or more deletions (without replacement thereof) of additional amino acids of the parental protein. If it is solely the result of insertion, it may be characterized as a simple insertion (or addition) glycomodule.
  • The present specification may refer to a Hyp-glycomodule as a substitution Hyp-glycomodule if it can be characterized as being solely the result of one or more substitutions (replacements), and, optionally one or more deletions, of amino acids of the parental protein.
  • the glycomodule is an insertion glycomodule, not a substitution glycomodule.
  • a substitution can be thought of as the result of a deletion followed by an insertion at the same location.
  • the insertions we have in mind are insertions in-between positions of the parental protein.
  • the mutant protein is a Hyp-glycosylation-supplemented protein
  • at least one of the Hyp-glycomodules must be an insertion, substitution, or deletion Hyp-Glycomodule.
  • it may optionally include one or more native Hyp-Glycomodules.
  • the Hyp-Glycomodule is necessarily a native Hyp-Glycomodule.
  • Hyp-glycomodules may be classified according to the nature of their proline skeleton, i.e., the locations of the prolines within the corresponding nascent Hyp-glycomodule.
  • the Hyp-glycomodule has a regularly and uniformly spaced proline residue skeleton.
  • the Hyp-glycomodule may consist essentially of a series of contiguous proline residues.
  • the Hyp-glycomodule may have a proline skeleton in which the proline residues are regularly and uniformly spaced, but noncontiguous, such as the proline skeleton patterns (Pro-X)n, (Pro-X-X)n, (Pro-X-X-X)n or (Pro-X-X-X-X)n, where n is at least two.
  • the Hyp-glycomodule has a proline skeleton in which the prolines are regularly but not uniformly spaced, e.g., there is a repeating pattern of prolines such as (X-P-P-P)n or (X-P-P-X)n, where n is at least two.
  • the Hyp-glycomodule has a proline skeleton in which the prolines are irregularly spaced.
  • The proline skeleton of the Hyp-glycomodule may be a combination of the above skeleton types or patterns, and may also include irregularly distributed prolines. It will be understood that in the formulae set forth above, the X may differ either within a single iteration of the repeating pattern or from iteration to iteration. However, it is preferable that the X be the same amino acid.
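  • The skeleton types described above (contiguous, regularly and uniformly spaced, regularly but not uniformly spaced, and irregular) can be told apart from the spacing between successive prolines. The following is a sketch only: the function name and labels are hypothetical, and the routine assumes that a repeating spacing pattern must be seen at least twice in full before it is called regular, so very short modules may be reported as irregular.

```python
def classify_proline_skeleton(sequence: str) -> str:
    """Classify a proline skeleton from the gaps between successive prolines."""
    positions = [i for i, aa in enumerate(sequence.upper()) if aa == "P"]
    if len(positions) < 2:
        return "too few prolines"
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if all(g == 1 for g in gaps):
        return "contiguous"
    if len(set(gaps)) == 1:
        return "regular, uniform"
    # Look for a repeating gap pattern observed at least twice in full.
    for period in range(2, len(gaps) // 2 + 1):
        if all(gaps[i] == gaps[i % period] for i in range(len(gaps))):
            return "regular, non-uniform"
    return "irregular"


if __name__ == "__main__":
    print(classify_proline_skeleton("PPPP"))          # contiguous prolines
    print(classify_proline_skeleton("SP" * 4))        # (X-Pro)n spacing
    print(classify_proline_skeleton("SPPP" * 3))      # (X-P-P-P)n
    print(classify_proline_skeleton("PASPAAAAPSP"))   # no repeating pattern
```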
  • Hyp-glycomodules may be classified according to the nature of their glycosylation.
  • A Hyp-glycomodule as now defined may include only arabinogalactosylated Hyp-glycosylation sites (an arabinogalactan Hyp-glycomodule), only arabinosylated Hyp-glycosylation sites (an arabinosylation Hyp-glycomodule), or a combination of the two (a mixed Hyp-glycomodule).
  • the nature of the proline skeleton has a direct effect on the nature of the glycosylation, as is evident from the glycosylation prediction methods set forth above.
  • Hyp may be glycosylated other than with arabinose or arabinogalactan, in which case the Hyp-glycomodule may be characterized as exotic.
  • n may be at least 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, or 500, and/or less than 999, 998, 997, 996, 995, 994, 993, 992, 991, 990, 900, 800, 700, 600, or 500; or indeed any other subrange of 2-1000.
  • Most of the Pro residues in these sequences will be hydroxylated to hydroxyproline and subsequently O-glycosylated with arabinosides ranging in size from one to five arabinose residues.
  • the value of n may be, e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, or 500, and/or less than 999, 998, 997, 996, 995, 994, 993, 992, 991, 990, 900, 800, 700, 600, or 500, or indeed any other subrange of 1-1000.
  • Many of the Pro residues in these sequences will be hydroxylated to hydroxyproline (Hyp) and subsequently O-glycosylated with arabinogalactan oligosaccharides or polysaccharides.
  • In light of the standard prediction method, with the quantitative standard method used to predict Pro-hydroxylation, we can see that a repeating sequence of the form X-Pro or Pro-X (where X is Lys, Ser, Thr, Val, Gly, or Ala) will, if there are sufficient repetitions, establish that most of the target prolines have Ser, Thr, Val, Gly, or Ala in the -1 and +1 positions, and Pro in the -2 and +2 positions.
  • the matrix scores will vary depending on the choice of X in each repetition.
  • One or more "central" prolines will have a local composition factor such that 11/21 amino acids in the preferred 21 amino acid window are proline and 10/21 are the alternative amino acid, yielding an absolute entropy of 0.998364, a relative entropy of 0.231, and a relative order (local composition factor) of 0.769 (which, being greater than the preferred baseline of 0.4, means that the local composition factor is favorable). While use of the same X for all repeats is preferred, it is not required.
  • the X's for each repeat are chosen so that the average local composition factor score for all of the Pro's in the Hyp-glycomodule is at least equal to the baseline, which has a preferred value of 0.4.
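  • The figures quoted above for the 21 amino acid window can be reproduced with the sketch below. It computes the Shannon entropy (in bits) of the window composition, divides by log2(20) to obtain the relative entropy, and subtracts from one to obtain the relative order. The division by log2(20) is an assumption inferred from the quoted numbers rather than a formula stated in this passage, and the function name is hypothetical.

```python
import math
from collections import Counter


def local_composition(window: str, alphabet_size: int = 20):
    """Return (absolute entropy, relative entropy, relative order) of a window.

    Assumption: relative entropy = absolute entropy / log2(alphabet_size).
    """
    if not window:
        raise ValueError("window must not be empty")
    total = len(window)
    counts = Counter(window.upper())
    absolute = -sum((c / total) * math.log2(c / total) for c in counts.values())
    relative = absolute / math.log2(alphabet_size)
    return absolute, relative, 1.0 - relative


if __name__ == "__main__":
    # A 21-residue window from a (Pro-Ser) repeat: 11 Pro and 10 Ser.
    absolute, relative, order = local_composition("PS" * 10 + "P")
    print(round(absolute, 6), round(relative, 3), round(order, 3))
    print("favorable" if order > 0.4 else "unfavorable")
```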
  • the proteins of the present invention feature at least one predicted/actual Hyp- glycomodule.
  • This may be an insertion Hyp-glycomodule (preferably an addition Hyp-glycomodule, more preferably a simple addition Hyp-glycomodule) or a substitution Hyp-glycomodule. If there is more than one Hyp-glycomodule, they may be of the same or different types.
  • The Hyp-glycomodule is preferably added at the amino-terminal and/or the carboxy terminal of the biologically active protein.
  • the glycomodule may be joined directly to the terminal amino acid of the parental protein, or indirectly.
  • The Hyp-glycomodule is linked to the native human protein moiety by a spacer which either 1) acts to distance the native human protein moiety from the Hyp-glycomodule in such manner as to increase the retention of native human protein biological activity by the Hyp-glycomodule-spacer-human protein fusion relative to that retained by a direct Hyp-glycomodule-human protein fusion, or 2) provides a site-specific cleavage site for an enzyme or chemical agent such that, after cleavage at that site, a new product is generated which does have the desired biological activity.
  • Site-specific cleavage sites are discussed in, e.g., Walker, "Cleavage Sites in Expression and Purification," http://stcveas.scripps.edu/webpage/htsb/cleavage.htm1 ; Barrett, et al., The Handbook of Proteolytic Enzymes. Please note that site-specific cleavage need not be achieved enzymatically; consider, e.g., the action of cyanogen bromide. In general, it is preferable to use cleavage agents which are specific for a cleavage site which is longer than two amino acids, so as to reduce the possibility that the parental protein will include a site sensitive to the desired agent.
  • the cleavable linker and cleavage agent are chosen so that the biologically active moiety of the fusion protein is not cleaved, only the linker connecting that moiety to the insertion (addition) glycomodule.
  • Hyp-glycomodule may be inserted in the interior of the parental protein. If so, then if the protein is a multi-domain protein, it is preferably inserted at an inter-domain boundary.
  • Other possible preferred insertion sites include turns and loops, or sites known, by comparison with homologous proteins, to be tolerant of insertion.
  • B-factors are indicative of the precision of the atom positions. If the model is of high quality (e.g., an R factor of 2 or less in a model with a resolution of 2.5 angstroms or better), then a high B-factor is likely to be indicative of freedom of movement of the atoms in that region.
  • the B-factor is at least 20, more preferably, at least 60. Similar considerations apply to NMR structures.
  • The Hyp-glycomodule may replace a portion of the amino-terminal or carboxy terminal of the biologically active protein, provided that it still extends beyond that original terminal. (If the glycomodule merely replaces an amino or carboxy terminal portion with a sequence of the same or lesser length, it is denoted a substitution glycomodule.)
  • One or more deletions may also be advantageous.
  • It may be advantageous to delete the membrane-spanning or -anchoring domain (avoiding the intrinsic tendency of glycosyltransferases, for example, to associate with ER/Golgi membranes).
  • A Hyp-glycomodule may replace a sequence of the parental protein. If a Hyp-glycomodule replaces a portion of the protein, then the non-proline residues of the Hyp-glycomodule may be chosen to minimize the number of substitutions, or at least the number of non-conservative substitutions, by which the replacement Hyp-glycomodule differs from the corresponding segment of the original protein.
  • the substitutions will take the form of 1) replacement of non-proline residues with prolines so as to create new sites, and/or 2) replacement of non-proline residues which are near (especially within two amino acids of) a proline so as to render that proline more likely to experience hydroxylation and glycosylation.
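  • As a hedged illustration of the second kind of substitution, the sketch below lists the non-proline positions lying within two amino acids of any proline, which are the positions whose replacement could alter that proline's likelihood of hydroxylation and glycosylation. The function name, the zero-based indexing, and the example sequence are hypothetical; whether any particular replacement is favorable would still have to be judged by the prediction method and by its effect on biological activity.

```python
def positions_flanking_proline(sequence: str, reach: int = 2) -> list:
    """Zero-based indices of non-proline residues within `reach` of a proline."""
    seq = sequence.upper()
    near = set()
    for i, aa in enumerate(seq):
        if aa != "P":
            continue
        for j in range(max(0, i - reach), min(len(seq), i + reach + 1)):
            if seq[j] != "P":
                near.add(j)
    return sorted(near)


if __name__ == "__main__":
    # Hypothetical sequence with a single proline at index 3.
    print(positions_flanking_proline("AKLPGDE"))
```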
  • Information about the wild-type protein may be useful in identifying where the substitutions might be tolerated. Such information could include any of the following:
  • the binding sites of the protein (this is typically determined either by testing fragments for activity or by some systematic mutagenesis method)
  • a protein comprises one or more prolines with a low Hyp-score
  • introduction of proline is not excluded. The introduction of proline is likely to be more tolerated in a position outside an alpha helix than in an alpha helix. In an alpha helix, it is more likely to be tolerated within the first turn.
  • Deletions may be made at the amino or carboxy terminal (also called truncation), and/or internally. Internal deletions are preferably made in the same protein regions which are the preferred locations for internal insertions. Deletions are most likely to be made to bring together two prolines, or a proline and one of the favored flanking amino acids (Ser, Thr, Val, Ala), or to eliminate an unfavorable amino acid (especially those with longer range effects, such as Cys, Tyr, Lys and His). However, as a practical matter, deletions are more likely to adversely affect biological activity than are substitutions or additions, and deletions can only make an existing Pro more favorable to hydroxylation and glycosylation; they do not increase the number of Pro in the protein.
  • Protein domains with disulfide bonds might not exhibit Pro hydroxylation or Hyp glycosylation, even at residues predicted to be favorable sites, as the disulfide bonds hold the protein in a folded conformation which hinders presentation of the polypeptide to the co- and/or post-translational machinery involved in hydroxylation of proline and/or glycosylation of hydroxyproline.
  • the protein to be expressed not comprise any cysteines expected to participate in disulfide bonds.
  • disulfide bond formation can be avoided or reduced by eliminating cysteines not essential to biological activity, e.g., by replacing the cysteines with serine, threonine, alanine or glycine.
  • one or more disulfide bonds must be maintained, then it may be desirable to use a larger number of predicted Hyp-glycosylation sites and/or distribute the predicted Hyp-glycosylation sites throughout the molecule so as to maximize the chance that at least one site is in fact glycosylated despite the folded conformation. It is also possible to use a variety of experimental methods to identify regions which are exposed, despite the folded conformation. For example, one may expose the folded protein to a chemical protein surface labeling agent and then determine which residues have been chemically modified by that agent. An agent of particular interest is tritium, as it is possible to elicit tritium exchange with all exposed hydrogens.
  • the 3D-structure of the protein has been determined by X-ray diffraction or by NMR, this may be used to identify surface sites for modification.
  • Proline scanning mutagenesis (systematic synthesis of a series of single proline substitution mutants, usually corresponding to the non-proline positions in a contiguous region of a protein) is described in Schulman and Kim, "Proline scanning mutagenesis of a molten globule reveals non-cooperative formation of a protein's overall topology," Nat. Struct. Biol., 3:682-7 (1996), Orzaez, et al., "Influence of proline residues in transmembrane helix packing," J.
  • a mutant may be characterized as a growth hormone mutant if, after alignments by BlastP, it has a higher percentage identity with a vertebrate growth hormone than it does with any known vertebrate prolactin or placental lactogen.
  • Prolactin and placental lactogen mutants are analogously defined.
  • This mutant may be an agonist, that is, it possesses at least one biological activity of a vertebrate growth hormone, prolactin, or placental lactogen. It should be noted that a growth hormone may be modified to become a better prolactin or placental lactogen agonist, and vice versa.
  • the mutant may be characterized as a growth hormone mutant if, after alignments by BlastP, it has a higher percentage identity with a vertebrate growth hormone than it does with any known vertebrate prolactin or placental lactogen. Prolactin and placental lactogen mutants are analogously defined.
  • the mutant may be an antagonist of a vertebrate growth hormone, prolactin, or placental lactogen.
  • the contemplated antagonist is a receptor antagonist, that is, a molecule that binds to the receptor but which substantially fails to activate it, thereby antagonizing receptor activity via the mechanism of competitive inhibition.
  • the mutant polypeptide sequence can be aligned with the sequence of a first reference vertebrate hormone of that superfamily.
  • One method of alignment is by BlastP, using the default setting for scoring matrix and gap penalties.
  • the first reference vertebrate hormone is the one for which such an alignment results in the lowest E value, that is, the lowest probability that an alignment with an alignment score as good or better would occur through chance alone. Alternatively, it is the one for which such alignment results in the highest percentage identity.
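The selection rule for the first reference hormone can be sketched as follows. This is a minimal illustration, assuming the BlastP results have already been summarized into records; the `Hit` class, the reference names, and every number below are hypothetical placeholders, not data from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Summary of one mutant-vs-reference alignment (hypothetical record)."""
    reference: str
    e_value: float
    percent_identity: float


def first_reference(hits, by="e_value"):
    """Pick the hit that defines the first reference hormone.

    by="e_value"  -> lowest E value (primary rule in the text)
    by="identity" -> highest percentage identity (alternative rule)
    """
    if not hits:
        raise ValueError("no alignments supplied")
    if by == "e_value":
        return min(hits, key=lambda h: h.e_value)
    if by == "identity":
        return max(hits, key=lambda h: h.percent_identity)
    raise ValueError(f"unknown criterion: {by}")


# Made-up numbers, chosen only so that the two rules disagree.
hits = [
    Hit("reference_A", 1e-80, 91.0),
    Hit("reference_B", 1e-95, 88.5),
    Hit("reference_C", 1e-20, 40.2),
]
```

With these placeholder values the two criteria select different references, which is why the choice of rule should be fixed before comparing mutants.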
  • the mutant polypeptide agonist is considered substantially identical to the reference vertebrate hormone if all of the differences can be justified as being (1) conservative substitutions of amino acids known to be preferentially exchanged in families of homologous proteins, (2) non-conservative substitutions of amino acid positions known or determinable (e.g., by virtue of alanine scanning mutagenesis) to be unlikely to result in the loss of the relevant biological activity, or (3) variations (substitutions, insertions, deletions) observed within the GH-PRL-PL superfamily (or, more particularly, within the relevant family).
  • the mutant polypeptide antagonist will additionally differ from the reference vertebrate hormone by virtue of one or more receptor antagonizing mutations.
  • the alignment algorithm(s) may introduce gaps into one or both sequences. If there is a length one gap in sequence A corresponding to position X in sequence B, then we can say, equivalently, that (1) sequence A differs from sequence B by virtue of the deletion of the amino acid at position X in sequence B, or (2) sequence B differs from sequence A by virtue of the insertion of the amino acid at position X of sequence B, between the amino acids of sequence A which were aligned with positions X-1 and X+1 of sequence B.
  • mutant sequence can be characterized as differing from the first reference hormone by deletion of the amino acid at that position in the first reference hormone, and such deletion is justified under clause (3) if another reference hormone differs from the first reference hormone in the same way.
  • the mutant sequence can be characterized as differing from the first reference hormone by insertion of the amino acid aligned with that gap, and such insertion is justified under clause (3) if another reference hormone differs from the first reference hormone in the same way.
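The gap interpretation described above can be made concrete with a short sketch. It assumes the two sequences are supplied as equal-length gapped alignment rows with "-" marking a gap; the function name and the toy sequences are hypothetical.

```python
def alignment_differences(reference_row, mutant_row):
    """Describe how the mutant differs from the reference.

    Both arguments are rows of one pairwise alignment (same length, '-' = gap).
    Positions are 1-based and refer to the ungapped reference sequence; an
    insertion is reported at the reference position it follows.
    """
    if len(reference_row) != len(mutant_row):
        raise ValueError("alignment rows must have equal length")
    differences = []
    ref_pos = 0
    for ref_res, mut_res in zip(reference_row, mutant_row):
        if ref_res != "-":
            ref_pos += 1
        if ref_res == mut_res:
            continue  # identical residue (or a gap in both rows)
        if mut_res == "-":
            # gap in the mutant: a reference residue has been deleted
            differences.append(("deletion", ref_pos, ref_res))
        elif ref_res == "-":
            # gap in the reference: the mutant carries an extra residue
            differences.append(("insertion", ref_pos, mut_res))
        else:
            differences.append(("substitution", ref_pos, f"{ref_res}>{mut_res}"))
    return differences


# Toy rows for illustration only.
print(alignment_differences("MKTA-LV", "MRT-GLV"))
```

Each reported deletion or insertion would then be checked against the other reference hormones to see whether it is justified under clause (3).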
  • the preferred vertebrate GH-derived GH receptor agonists of the present invention are fusion proteins which comprise a polypeptide sequence P for which the differences, if any, between said amino acid sequence and the amino acid sequence of a first reference vertebrate growth hormone, are independently selected from the group consisting of
  • the binding affinity of a single substitution mutant of the first reference vertebrate growth hormone, wherein said corresponding residue, which is not alanine, is replaced by alanine, is at least 10% of the binding affinity of the first vertebrate growth hormone for the vertebrate growth hormone receptor to which the first vertebrate growth hormone natively binds;
  • polypeptide sequence has at least 10% of the binding affinity of said first reference vertebrate growth hormone for a vertebrate growth hormone receptor, preferably one to which said first reference vertebrate growth hormone natively binds, and
  • fusion protein binds to and thereby activates a vertebrate growth hormone receptor.
  • GH-derived because the polypeptide sequence P qualifies as a vertebrate GH or as a vertebrate GH mutant as defined above.
  • a growth hormone natively binds a growth hormone receptor found in the same species, i.e., human growth hormone natively binds a human growth hormone receptor, bovine growth hormone, a bovine GH receptor, and so forth.
  • binding affinity is determined by the method described in Cunningham and Wells, "High-Resolution Mapping of hGH-Receptor Interactions by Alanine Scanning Mutagenesis", Science 244: 1081 (1989), and thus uses the hGHRbp as the target.
  • binding affinity is determined by the method described in WO92/03478, and thus uses the hPRLbp as the target.
  • binding affinity is determined by use, in order of preference, of the extracellular binding domain of the receptor, the purified whole receptor, and an unpurified source of the receptor (e.g., a membrane preparation).
  • the receptor binding fusion protein preferably has growth promoting activity in a vertebrate.
  • Growth promoting (or inhibitory) activity may be determined by the assays set forth in Kopchick, et al, which involve transgenic expression of the GH agonist or antagonist in mice. Or it may be determined by examining the effect of pharmaceutical administration of the GH agonist or antagonist to humans or nonhuman vertebrates.
  • polypeptide sequence P is at least 50%, more preferably at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90% or most preferably at least 95% identical to said first reference vertebrate growth hormone,
  • any deletion under clause (c) is of a residue which is not located at a conserved residue position of the vertebrate growth hormone family, and, more preferably is not a conserved residue position of the mammalian growth hormone subfamily,
  • the first reference vertebrate growth hormone is a mammalian growth hormone, more preferably, a human or bovine growth hormone,
  • any insertion under clause (d) is of a length such that another reference vertebrate growth hormone exists which differs from said first reference growth hormone by virtue of an equal length insertion at the same location of said first reference vertebrate growth hormone
  • the first reference vertebrate growth hormone is a nonhuman growth hormone, and the intended use is in binding or activating the human growth hormone receptor, the differences increase the overall identity to human growth hormone, (8) one or more of the substitutions are selected from the group consisting of one or more of the mutations characterizing the hGH mutants B2024 and/or B2036 as described below,
  • the polypeptide sequence P is at least 50%, more preferably at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95% or, if an agonist, most preferably 100% similar to said first reference vertebrate growth hormone, or
  • the polypeptide sequence P when aligned to the first reference vertebrate growth hormone by BlastP using the Blosum62 matrix and the gap penalties -11 for gap creation and -1 for each gap extension, results in an alignment for which the E value is less than e-10, more preferably less than e-20, e-30, e-40, e-50, e-60, e-70, e-80, e-90 or most preferably e-100.
  • condition (1) percentage identity is calculated by the BlastP methodology, i.e., identities as a percentage of the aligned overlap region including internal gaps.
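As a rough sketch of the identity calculation just described, the function below counts identical pairs over all alignment columns, with columns containing an internal gap counted in the denominator. It assumes the rows have already been trimmed to the aligned overlap region; the function name and the toy rows are hypothetical, and this is not BlastP itself.

```python
def percent_identity(row_a, row_b):
    """Identities as a percentage of the aligned overlap, gaps included.

    row_a and row_b are rows of one pairwise alignment, already trimmed to
    the overlap region, with '-' marking a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    # Ignore columns that are gaps in both rows; they carry no information.
    columns = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("empty alignment")
    identities = sum(1 for a, b in columns if a == b)
    return 100.0 * identities / len(columns)


# Toy rows: 4 identical pairs over 7 columns (two of which contain a gap).
print(round(percent_identity("MKTA-LV", "MRT-GLV"), 1))
```

Because gapped columns stay in the denominator, each insertion or deletion lowers the percentage just as a mismatch would.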
  • condition (2) highly conservative amino acid replacements are as follows: Asp/Glu, Arg/His/Lys, Met/Leu/Ile/Val, and Phe/Tyr/Trp.
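The four replacement groups listed above translate directly into a lookup. The sketch below uses standard one-letter amino acid codes; the function and constant names are hypothetical.

```python
# Highly conservative replacement groups from the text, in one-letter codes:
# Asp/Glu, Arg/His/Lys, Met/Leu/Ile/Val, Phe/Tyr/Trp.
HIGHLY_CONSERVATIVE_GROUPS = [
    {"D", "E"},
    {"R", "H", "K"},
    {"M", "L", "I", "V"},
    {"F", "Y", "W"},
]


def is_highly_conservative(original, replacement):
    """True if replacing `original` by `replacement` stays within one group."""
    original, replacement = original.upper(), replacement.upper()
    if original == replacement:
        return False  # not a replacement at all
    return any(
        original in group and replacement in group
        for group in HIGHLY_CONSERVATIVE_GROUPS
    )
```

Residues that belong to none of the four groups (for example Ala or Ser) never yield a highly conservative replacement under this definition.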
  • the conserved residue positions are those which, when all vertebrate growth hormones whose sequences are in a publicly available sequence database as of the time of filing are aligned as taught herein, are occupied only by amino acids belonging to the same conservative substitution exchange group (I, II, III, IV or V) as defined above.
  • the unconserved residue positions are those which are occupied by amino acids belonging to different exchange groups, and/or which are unoccupied (i.e., deleted) in one or more of the vertebrate growth hormones.
  • the fully conserved residue positions of the vertebrate growth hormone family are those residue positions which are occupied by the same amino acid in all of said vertebrate growth hormones.
  • Clause (c) does not permit deletion of a residue at one of the fully conserved residue positions.
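The three position classes defined above can be sketched as a per-column classifier over a multiple alignment. The exchange groups I to V are defined elsewhere in the application, so the mapping is passed in as a parameter; the two-group mapping used in the demonstration is a hypothetical stand-in, not the application's definition.

```python
def classify_position(column, group_of):
    """Classify one alignment column.

    column   : string holding the residue of every aligned hormone at this
               position, with '-' where a hormone has no residue
    group_of : dict mapping one-letter residue -> exchange group label
    Returns the most specific label that applies.
    """
    residues = set(column)
    if "-" in residues:
        return "unconserved"  # unoccupied in at least one hormone
    if len(residues) == 1:
        return "fully conserved"  # same amino acid everywhere
    groups = {group_of.get(residue) for residue in residues}
    if None not in groups and len(groups) == 1:
        return "conserved"  # all residues in one exchange group
    return "unconserved"


# Hypothetical stand-in mapping, for illustration only.
demo_groups = {"D": "acidic", "E": "acidic", "K": "basic", "R": "basic"}
```

A deletion would then be screened by rejecting any position whose column is labeled "fully conserved", and preferably any labeled "conserved" as well.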
  • hGH is preferably the form of hGH which corresponds to the mature portion (AAs 27-217) of the sequence set forth in Swiss-Prot SOMA_HUMAN, P01241, isoform 1 (22 kDa), and bovine growth hormone is preferably the form of bovine growth hormone which corresponds to the mature portion (AAs 28-217) of the sequence set forth in Swiss-Prot SOMA_BOVIN, P01246, per Miller W.L., Martial J.A., Baxter J.D.; "Molecular cloning of DNA complementary to bovine growth hormone mRNA."; J. Biol. Chem. 255:7521-7524 (1980). These references are incorporated by reference in their entirety.
  • percentage similarity is calculated by the BlastP methodology, i.e., positives (aligned pairs with a positive score in the Blosum62 matrix) as a percentage of the aligned overlap region including internal gaps.
  • Vertebrate GH-derived GH receptor antagonists of the present invention may be similarly defined, except that the polypeptide sequence must additionally differ from the sequence of the reference vertebrate growth hormone, e.g., at the position corresponding to Gly 119 in bovine growth hormone or Gly 120 in human growth hormone, in such manner as to impart GH receptor antagonist (binds but does not activate) activity to the polypeptide sequence and thereby to the fusion protein.
  • bGH Gly 119/hGH Gly 120 is presently believed to be a fully conserved residue position in the vertebrate GH family. It has been reported that an independent mutation, R77C, can result in growth inhibition.
  • the GH receptor antagonist has growth inhibitory activity.
  • the compound is considered to be growth-inhibitory if the growth of test animals of at least one vertebrate species which are treated with the compound (or which have been genetically engineered to express it themselves) is significantly (at a 0.95 confidence level) slower than the growth of control animals (the term "significant" being used in its statistical sense). In some embodiments, it is growth-inhibitory in a plurality of species, or at least in humans and/or bovines.
  • the GH antagonists may comprise an alpha helix essentially corresponding to the third major alpha helix of the first reference vertebrate growth hormone, and at least 50% identical (more preferably at least 80% identical) therewith.
  • the mutations need not be limited to the third major alpha helix.
  • the contemplated vertebrate GH antagonists include, in particular, fusions in which the polypeptide P corresponds to the hGH mutants B2024 and B2036 as defined in U.S. Patent No. 5,849,535.
  • B2024 and B2036 are both hGH mutants including, inter alia, a G120K substitution.
  • vertebrate prolactin agonists and antagonists and vertebrate placental lactogen agonists and antagonists, which agonize or antagonize a vertebrate prolactin receptor.
  • agonists and antagonists that are hybrids, or are mutants of hybrids, of two or more reference hormones of the vertebrate growth hormone - prolactin - placental lactogen hormone superfamily, and which retain at least 10% of at least one receptor binding activity of at least one of the reference hormones.
  • Secondary structure prediction may be made by, e.g., the method of Combet C., Blanchet C., Geourjon C. and Deleage G., "NPS@: Network Protein Sequence Analysis".
  • the controlled vocabularies are specified in the form of three structured networks of controlled terms to describe gene product attributes.
  • the three networks are molecular function, biological process, and cellular component.
  • Each network is composed of terms of differing breadth. If term A is a subset of term B, then term A is the child of B and B is the parent of A.
  • a child term can have more than one parent term.
  • the biological process term “hexose biosynthesis” has two parents, “hexose metabolism” and “monosaccharide biosynthesis”. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide. If a child term describes the gene product, then all of its parents must describe the gene product, and likewise all of the grandparents, great-grandparents, etc.
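The rule that every parent, grandparent, and so on must also describe the gene product amounts to walking the term network upward. The sketch below uses the "hexose biosynthesis" example from the text; the shared grandparent "monosaccharide metabolism" and the dictionary layout are assumptions added for illustration and are not taken from the GO data files.

```python
# Child -> set of parent terms. The two parents of "hexose biosynthesis" come
# from the text; "monosaccharide metabolism" is an assumed grandparent.
PARENTS = {
    "hexose biosynthesis": {"hexose metabolism", "monosaccharide biosynthesis"},
    "hexose metabolism": {"monosaccharide metabolism"},
    "monosaccharide biosynthesis": {"monosaccharide metabolism"},
    "monosaccharide metabolism": set(),
}


def ancestors(term, parents):
    """Return every parent, grandparent, etc. of `term` in the network."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


print(sorted(ancestors("hexose biosynthesis", PARENTS)))
```

Because a child can have more than one parent, the walk keeps a `seen` set so that a shared ancestor is reported only once.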
  • Molecular function describes the specific tasks performed by the gene product, i.e., its activities, such as catalytic or binding activities, at the molecular level.
  • GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place.
  • Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.
  • a single gene product might have several molecular functions, and many gene products can share a single molecular function.
  • gene products are often given names which set forth their molecular function, the use of a molecular function ontology term is meant to characterize the function of any gene product with that molecular function, not to refer to a particular gene product even if only one gene product is presently known to have that function.
  • Biological process describes the role of the gene product in achieving broad biological goals, such as mitosis or purine metabolism.
  • a biological process is accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cell growth and maintenance or signal transduction. Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have two or more distinct steps. Nonetheless, a biological process is not equivalent to a pathway, as the biological process ontologies do not attempt to capture any of the dynamics or dependencies that would be required to describe a pathway.
  • a cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object, which may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).
  • GO does not contain the following:
  • cytochrome c is not in the ontologies, but attributes of cytochrome c, such as electron transporter, are.
  • oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.
  • Attributes of sequence such as intron/exon parameters are not attributes of gene products and will be described in a separate sequence ontology (see the OBO web page for more information).
  • the Gene Ontology data structures define these ontology terms and their relationships.
  • the data structures may be downloaded from the Gene Ontology Consortium website.
  • a sample GO entry would be:
  • the annotation may include evidence codes to indicate the basis for assigning particular GOids to that gene or gene product.
  • the collaborating databases do not necessarily exhaustively annotate a gene. For example, if ontology A is child of B, and B is child of C, and C is child of D, and D is child of E, they may list the lower order ontologies A, B and C, but not the higher order ones D and E. It would, of course, be possible for a technician to examine all the terms in tables 3 and 4, determine which higher order ontologies have been omitted by comparing the terms with a complete directory of the gene ontology network, and add the missing higher order terms. We have not done this because, in general, the higher order ontologies, being less specific, are less likely to be of interest, at least taken by themselves.
  • the possible predisposed proteins and Hyp-glycosylation-deficient parental proteins may be classified by gene ontology.
  • Each gene ontology in the controlled vocabulary may be considered a separate embodiment.
  • one embodiment would relate to predisposed proteins with the function ontology of acyltransferase activity, and their expression and secretion in plants, another embodiment would be where the predisposed protein has the process ontology of cholesterol metabolism, a third where the predisposed protein has the component ontology of extracellular space.
  • the universe of predisposed proteins or of Hyp-glycosylation-deficient parental proteins, excluding proteins having one or more specified ontologies may be considered disclosed embodiments.
  • combinations of ontologies in which each ontology is from a different network, i.e., molecular function, biological process, cellular component
  • combinations of ontologies which include ontologies from more than one network, as well as more than one ontology from the same network, but where no ontology is a child or a parent of any other ontology in the same combination.
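The restriction that no ontology in a combination may be a child or a parent of another can be checked mechanically. The sketch below reads "child or parent" broadly, as any ancestor or descendant, which is an assumption; the terms A to E follow the chain used in the text, and "X" is a hypothetical unrelated term.

```python
# Child -> set of parents: A is a child of B, B of C, C of D, D of E.
PARENTS = {
    "A": {"B"},
    "B": {"C"},
    "C": {"D"},
    "D": {"E"},
    "E": set(),
    "X": set(),  # hypothetical term unrelated to the chain
}


def ancestors(term, parents):
    """Return every parent, grandparent, etc. of `term`."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_independent_combination(terms, parents):
    """True if no term in the combination is an ancestor of another term."""
    terms = set(terms)
    for term in terms:
        if ancestors(term, parents) & (terms - {term}):
            return False
    return True
```

A combination such as A with C is rejected because C is already implied by A, so listing both adds no further restriction.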
  • For secretion in plants, a nucleic acid construct is designed which encodes a precursor protein consisting of an N-terminal signal peptide which is functional in the plant cell of interest, followed by the amino acid sequence of the mature protein of interest (which may but need not be a mutant protein). The precursor protein is expressed and, as it is secreted through the membrane, the signal peptide is cleaved off.
  • the secretion signal peptide is one which, in the plant cell in question, can achieve secretion of a non-Hyp-glycosylated protein at a level of at least 0.01% TSP, more preferably at least 0.1% TSP, still more preferably at least 0.5% TSP, most preferably at least 1% TSP.
  • the signal peptide is one native to a plant protein, including but not limited to one of the following:
  • GFP Green fluorescent protein
  • hGM-CSF Human granulocyte-macrophage colony-stimulating factor
  • the signal peptide associated with a secreted plant virus protein is employed.
  • it may be the TMV omega coat protein signal peptide.
  • the non-plant protein's native signal peptide is used to achieve secretion in plants.
  • the protein is a modified protein, then we are referring to the signal peptide of the most closely related naturally occurring protein.
  • Many non-plant eukaryotic signals are functional in plants; examples are given below:
  • Human milk CD14 protein (Tobacco cell culture, CaMV35S promoter, native signal sequence or tomato extensin signal peptide, 5 ug/L medium, Girard et al., Plant Cell, Tissue and Organ Culture 78: 253-260, 2004)
  • Human interferon beta (Tobacco plant, CaMV35S promoter, native signal peptide, 0.01% fresh weight, J. Interferon Res. 12 (6): 449-453, 1992)
  • Human Interleukin-2 (Tobacco cell culture, CaMV35S promoter, native signal peptide, secreted, 0.1 ug/L, Magnuson et al., Protein Expr. Purif. 13 (1): 45-52, 1998)
  • Heat-labile enterotoxin B subunit (Potato plant, CaMV35S promoter, native signal peptide, 0.01% TSP, Mason et al., Vaccine 16(3): 1336-1343, 1996)
  • Norwalk virus capsid protein (Tobacco leaves and potato tubers, CaMV35S promoter or patatin promoter, native signal peptide, 0.23% TSP, Mason et al., PNAS 93 (11): 5335-5340, 1996)
  • the native signal could be the one native to either of the parental proteins, but normally the one native to the N-terminal domain would be preferred.
  • the signal peptide is a signal, functional in plants, which is neither the native signal of the foreign protein, nor one native to plants, or plant viruses.
  • Murine immunoglobulin signal peptide was previously used to secrete HIV-1 p24 antigen fused to human IgA (Tobacco plant, CaMV35S promoter, 1.4% TSP, Obregon, et al., Plant Biotechnol. J. 4(2): 195-207 (2006)).
  • the Obregon murine immunoglobulin signal peptide was also able to direct secretion of unfused HIV-1 p24 antigen, but secretion was at a level of 0.1% TSP.
  • the carbohydrate component of the glycoprotein accounts for at least 10% of the molecular weight of the protein.
  • O-glycosylation occurs at Ser, Thr, Tyr, and Hyl, as well as at Hyp.
  • GlcNAc, GalNAc, Gal, Man, Fuc, Pse, DiAcTridH, Glc, FucNAc, Xyl and Gal are reported to O-link to Ser, and GlcNAc, GalNAc, Gal, Man, Fuc, Pse, DiAcTridH, Glc and Gal to Thr.
  • GlcNAc, Gal and Ara are found on Hyp, Gal on Hyl, and Gal and Glc on Tyr. Spiro Table III provides consensus sequences for some of these glycosylation sites.
  • the proteins of the present invention may optionally include one or more O-glycosylated amino acids other than Hyp.
  • N-glycosylation occurs at Asn or Arg.
  • the principal sugar-peptide bonds identified are of GlcNAc, GalNAc, Glc and Rha to Asn, and of Glc to Arg.
  • the consensus sequence for attachment of GlcNAc to Asn is Asn-Xaa-Ser/Thr (i.e., an "NAS" or "NAT"), where Xaa is any amino acid except Pro.
  • the proteins of the present invention may optionally include one or more N-glycosylated amino acids.
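The Asn-Xaa-Ser/Thr consensus can be located in a sequence with a short regular expression. This is a minimal sketch using one-letter codes; the function name and the toy sequence are hypothetical, and a match only marks a candidate site, not a confirmed glycosylation.

```python
import re

# N, then any residue except P, then S or T. The lookahead lets overlapping
# candidate sites (e.g. "NNSS") all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(sequence):
    """Return (1-based position of Asn, matched tripeptide) for each site."""
    return [
        (match.start() + 1, match.group(1))
        for match in SEQUON.finditer(sequence.upper())
    ]


# Toy sequence: "NAS" and "NAT" match, "NPT" is excluded because Xaa is Pro.
print(find_sequons("MNASGNPTQNAT"))
```

The same scan could be run on an engineered sequence to confirm that an added N-terminal or C-terminal motif is the only new candidate site introduced.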
  • These N-glycosylation sites may be native to the protein and/or the result of genetic engineering. Genetic engineering of sites may involve the introduction of Asn or Arg by substitution and/or insertion, and/or the modification of nearby amino acids to increase the probability of N-glycosylation of Asn or Arg.
  • an NAS or NAT N-glycosylation motif may be provided at the N-terminal or C-terminal of the engineered protein.
  • N-glycosylated by the covalent linkage of glycans to asparagine (Asn) residues at Asn-X-Ser/Thr consensus sequence (Driouich et al., 1989).
  • the physiological function of N-glycosylation is thought to involve adjusting protein structure for secretion (Okushima et al., 1999). From results obtained in previous studies on protein secretion in plant cells, it appears that N-glycosylation is a prerequisite for transport of proteins from ER to Golgi apparatus, and finally to extracellular space.
  • Enhanced secretion of heterologous proteins was also found in yeast by introduction of an N-glycosylation site (Sagt et al., 2000). As a consequence, a specific N-glycan, or peripheral glycan epitopes, might be involved in protein targeting to the extracellular compartment.
  • glycosylation is desirable to improve secretion or to facilitate purification, but is not required in the protein for clinical use.
  • the glycoproteins may be deglycosylated, e.g., to improve their biological activity.
  • Deglycosylating agents may be enzymatic (e.g., peptide N-glycosidase F, "PNGase F", or endo-beta-N-acetylglucosaminidase H, "endo H") or chemical (e.g., trifluoromethanesulfonic acid; periodate; anhydrous hydrogen fluoride).
  • the recombinant genes are expressed in plant cells, such as cell suspension cultured cells, including but not limited to, BY2 tobacco cells. Expression can also be achieved in a range of intact plant hosts, and other organisms including but not limited to, invertebrates, plants, sponges, bacteria, fungi, algae, archaebacteria.
  • the expression construct/plasmid/recombinant DNA comprises a promoter. It is not intended that the present invention be limited to a particular promoter. Any promoter sequence which is capable of directing expression of an operably linked nucleic acid sequence encoding at least a portion of nucleic acids of the present invention, is contemplated to be within the scope of the invention. Promoters include, but are not limited to, promoter sequences of bacterial, viral and plant origins. Promoters of bacterial origin include, but are not limited to, octopine synthase promoter, nopaline synthase promoter, and other promoters derived from native Ti plasmids.
  • Viral promoters include, but are not limited to, 35S and 19S RNA promoters of cauliflower mosaic virus (CaMV), and T-DNA promoters from Agrobacterium.
  • Plant promoters include, but are not limited to, ribulose-1,5-bisphosphate carboxylase small subunit promoter, maize ubiquitin promoters, phaseolin promoter, E8 promoter, and Tob7 promoter.
  • the invention is not limited to the number of promoters used to control expression of a nucleic acid sequence of interest. Any number of promoters may be used so long as expression of the nucleic acid sequence of interest is controlled in a desired manner. Furthermore, the selection of a promoter may be governed by the desirability that expression be over the whole plant, or localized to selected tissues of the plant, e.g., root, leaves, fruit, etc. For example, promoters active in flowers are known (Benfey et al. (1990) Plant Cell 2:849-856).
  • Transformation of plant cells may be accomplished by a variety of methods, examples of which are known in the art, and include for example, particle mediated gene transfer (see, e.g., U.S. Pat. No. 5,584,807 hereby incorporated by reference); infection with an Agrobacterium strain containing the foreign DNA for random integration (U.S. Pat. No. 4,940,838 hereby incorporated by reference) or targeted integration (U.S. Pat. No. 5,501,967 hereby incorporated by reference) of the foreign DNA into the plant cell genome; electroinjection (Nan et al. (1995) In “Biotechnology in Agriculture and Forestry,” Ed. Y. P. S.
  • infectious and “infection” with a bacterium refer to co-incubation of a target biological sample, (e.g., cell, tissue, etc.) with the bacterium under conditions such that nucleic acid sequences contained within the bacterium are introduced into one or more cells of the target biological sample.
  • Agrobacterium refers to a soil-borne, Gram-negative, rod-shaped phytopathogenic bacterium, which causes crown gall.
  • Agrobacterium includes, but is not limited to, the strains Agrobacterium tumefaciens (which typically causes crown gall in infected plants), and Agrobacterium rhizogenes (which causes hairy root disease in infected host plants). Infection of a plant cell with Agrobacterium generally results in the production of opines (e.g., nopaline, agropine, octopine, etc.) by the infected cell.
  • Agrobacterium strains which cause production of nopaline are referred to as "nopaline-type" Agrobacteria.
  • Agrobacterium strains which cause production of octopine (e.g., strain LBA4404, Ach5, B6) are referred to as "octopine-type" Agrobacteria.
  • Agrobacterium strains which cause production of agropine (e.g., strain EHA105, EHA101, A281) are referred to as "agropine-type" Agrobacteria.
  • the terms “bombarding,” “bombardment,” and “biolistic bombardment” refer to the process of accelerating particles towards a target biological sample (e.g., cell, tissue, etc.) to effect wounding of the cell membrane of a cell in the target biological sample and/or entry of the particles into the target biological sample.
  • Methods for biolistic bombardment are known in the art (e.g., U.S. Pat. No. 5,584,807, the contents of which are herein incorporated by reference), and are commercially available (e.g., the helium gas-driven microprojectile accelerator PDS-1000/He, BioRad).
  • microwounding when made in reference to plant tissue refers to the introduction of microscopic wounds in that tissue. Microwounding may be achieved by, for example, particle or biolistic bombardment.
  • Plant cells can also be transformed according to the present invention through chloroplast genetic engineering, a process that is described in the art.
  • Methods for chloroplast genetic engineering can be performed as described, for example, in U.S. Patent Nos. 6,680,426, and in published U.S. Application Nos. 2003/0009783, 2003/0204864, 2003/0041353, 2002/0174453, 2002/0162135, the entire contents of each of which is incorporated herein by reference.
  • the present invention be limited by the host cells used for expression of the synthetic genes of the present invention, provided that they are plant cells capable of hydroxylating proline and of glycosylating (especially arabinosylating or arabinogalactosylating) hydroxyproline.
  • Plants that can be used as host cells include vascular and non-vascular plants.
  • Non-vascular plants include, but are not limited to, Bryophytes, which further include but are not limited to, mosses (Bryophyta), liverworts (Hepaticophyta), and hornworts (Anthocerotophyta).
  • Other cells contemplated to be within the scope of this invention are green algae types, such as Chlamydomonas and Volvox.
  • Vascular plants include, but are not limited to, lower (e.g., spore-dispersing) vascular plants, such as, Lycophyta (club mosses), including Lycopodiae, Selaginellae, and Isoetae, horsetails or equisetum (Sphenophyta), whisk ferns (Psilotophyta), and ferns (Pterophyta).
  • Vascular plants further include, but are not limited to, i) fossil seed ferns (Pteridophyta), ii) gymnosperms (seed not protected by a fruit), such as Cycadophyta (Cycads), Coniferophyta (Conifers, such as pine, spruce, fir, hemlock, yew), Ginkgophyta (e.g., Ginkgo), Gnetophyta (e.g., Gnetum, Ephedra, and Welwitschia), and iii) angiosperms (flowering plants — seed protected by a fruit), which includes Anthophyta, further comprising dicotyledons (dicots) and monocotyledons (monocots).
  • Specific plant host cells that can be used in accordance with the invention include, but are not limited to, legumes (e.g., soybeans) and solanaceous plants (e.g., tobacco, tomato, etc.).
  • The monocots of interest include Poaceae/Graminaceae (e.g., rice, maize, wheat, barley, rye, oats, millet, sugarcane, sorghum, bamboo), Araceae (e.g., Anthurium, Zantedeschia, taro, elephant ear, Dieffenbachia, Monstera, Philodendron), including those of the old classification Lemnaceae (e.g., duckweed (Lemna)), Orchidaceae (e.g., various orchids), and Cyperaceae (e.g., various sedges).
  • The dicots of interest may be eudicots or paleodicots, and include Solanaceae (e.g., potato, tobacco, tomato, pepper), Fabaceae (e.g., beans, peas, peanuts, soybeans, lentils, lupins, clover, alfalfa, cassia), Cucurbitaceae (e.g., squash, pumpkin, melon, cucumber), Rosaceae (e.g., apple, pear, cherry, apricot, plum, rose, raspberry, strawberry, hawthorn, quince, peach, almond, rowan), Brassicaceae (e.g., cabbage, broccoli, cauliflower, Brussels sprouts, collards, kale, Chinese kale, rutabaga, seakale, turnip, radish, kohlrabi, rapeseed, mustard, horseradish, wasabi, watercress, Arabidops
  • the present invention is not limited by the nature of the plant cells. All sources of plant tissue are contemplated.
  • The plant tissue that is selected as a target for transformation with vectors capable of expressing the invention's sequences is capable of regenerating a plant.
  • The term "regeneration," as used herein, means growing a whole plant from a plant cell, a group of plant cells, a plant part or a plant piece (e.g., from seed, a protoplast, callus, protocorm-like body, or tissue part).
  • Such tissues include but are not limited to seeds. Seeds of flowering plants consist of an embryo, a seed coat, and stored food.
  • When fully formed, the embryo generally consists of a hypocotyl-root axis bearing either one or two cotyledons and an apical meristem at the shoot apex and at the root apex.
  • The cotyledons of most dicotyledons are fleshy and contain the stored food of the seed. In other dicotyledons and most monocotyledons, food is stored in the endosperm and the cotyledons function to absorb the simpler compounds resulting from the digestion of the food.
  • Species from the following examples of genera of plants may be regenerated from transformed protoplasts: Fragaria, Lotus, Medicago, Onobrychis, Trifolium, Trigonella, Vigna, Citrus, Linum, Geranium, Manihot, Daucus, Arabidopsis, Brassica, Raphanus, Sinapis, Atropa, Capsicum, Hyoscyamus, Lycopersicon, Nicotiana, Solanum, Petunia, Digitalis, Majorana, Cichorium, Helianthus, Lactuca, Bromus, Asparagus, Antirrhinum, Hemerocallis, Nemesia, Pelargonium, Panicum, Pennisetum, Ranunculus, Senecio, Salpiglossis, Cucumis, Browallia, Glycine, Lolium, Zea, Triticum, Sorghum, and Datura.
  • For regeneration of transgenic plants from transgenic protoplasts, a suspension of transformed protoplasts or a petri plate containing transformed explants is first provided. Callus tissue is formed and shoots may be induced from callus and subsequently rooted. Alternatively, somatic embryo formation can be induced in the callus tissue. These somatic embryos germinate as natural embryos to form plants.
  • The culture media will generally contain various amino acids and plant hormones, such as auxin and cytokinins. It is also advantageous to add glutamic acid and proline to the medium, especially for such species as corn and alfalfa. Efficient regeneration will depend on the medium, on the genotype, and on the history of the culture. These three variables may be empirically controlled to result in reproducible regeneration.
  • Plants may also be regenerated from cultured cells or tissues.
  • Dicotyledonous plants which have been shown capable of regeneration from transformed individual cells to obtain transgenic whole plants include, for example, apple (Malus pumila), blackberry (Rubus), blackberry/raspberry hybrid (Rubus), red raspberry (Rubus), carrot (Daucus carota), cauliflower (Brassica oleracea), celery (Apium graveolens), cucumber.
  • The regenerated plants are transferred to standard soil conditions and cultivated in a conventional manner. After the expression vector is stably incorporated into regenerated transgenic plants, it can be transferred to other plants by vegetative propagation or by sexual crossing.
  • For vegetatively propagated crops, the mature transgenic plants are propagated by the taking of cuttings or by tissue culture techniques to produce multiple identical plants.
  • The mature transgenic plants are self-crossed to produce a homozygous inbred plant which is capable of passing the transgene to its progeny by Mendelian inheritance.
  • The inbred plant produces seed containing the nucleic acid sequence of interest. These seeds can be grown to produce plants that would produce the desired polypeptides.
  • The inbred plants can also be used to develop new hybrids by crossing the inbred plant with another inbred plant to produce a hybrid.
  • Monocotyledons include grasses, lilies, irises, orchids, cattails, palms, Zea mays (corn), rice, barley, and wheat.
  • Dicotyledons include almost all the familiar trees and shrubs (other than conifers) and many of the herbs (non-woody plants).
  • Tomato cultures are one example of a recipient for repetitive HRGP modules to be hydroxylated and glycosylated.
  • The cultures produce cell surface HRGPs in high yields easily eluted from the cell surface of intact cells, and they possess the required posttranslational enzymes unique to plants: HRGP prolyl hydroxylases, hydroxyproline O-glycosyltransferases and other specific glycosyltransferases for building complex polysaccharide side chains.
  • Other recipients for the invention's sequences include, but are not limited to, tobacco cultured cells and plants, e.g., tobacco BY 2 (bright yellow 2).
  • Nodules were produced from callus after 2 weeks and were used for transformation or transferred to fresh NP medium every 2 weeks for future use. (Nodules are partially organized light green cell masses.)
  • As used herein, “peptide,” “polypeptide,” and “protein” can and will be used interchangeably. “Peptide/polypeptide/protein” will occasionally be used to refer to any of the three, but recitations of any of the three contemplate the other two. That is, there is no intended limit on the size of the amino acid polymer (peptide, polypeptide, or protein) that can be expressed using the present invention. Additionally, the recitation of “protein” is intended to encompass enzymes, hormones, receptors, channels, intracellular signaling molecules, and proteins with other functions. Multimeric proteins can also be made in accordance with the present invention.
  • The signal peptide sequence is italicized. Please note that the prolines in the signal sequence should not be considered targets for hydroxylation and glycosylation. Note that there is sometimes uncertainty as to the exact bounds of the signal sequence. If in doubt, you can search on each of the putative mature sequences.
  • The preliminary predictive methods set forth above are biased toward over-prediction, i.e., they are more likely to produce false positives than false negatives. Consequently, the skilled worker may wish to more closely evaluate each predicted Pro-hydroxylation/Hyp-glycosylation site, e.g., comparing it to known plant Hyp-glycomodules, considering the known or predicted secondary, supersecondary or tertiary structure, etc.
  • Atrial Natriuretic Factor (NM006172.1)
  • Although ANF has only two predicted Hyp-glycosylation sites, it has a very strong motif, AALSPSPEVPP (amino acids 72 to 82 of SEQ ID NO: 7), which is rich in clustered Pro and contains many Ala, Ser, and Val residues.
  • Collagen Type I Alpha (NP000079.1)
  • DQEFGFDVGP VCFL SEQ ID NO: 8
  • Colony stimulating factor NP000749.2
  • Immunoglobulin Heavy Constant Delta (AAH63384.1) MGLLHKNMKH LWFFLLLVAA ORWVLSQVQL QESGOGLVKP SGTLSLTCAV SGGSISSSNW WSWVRQPOGK GLEWIGEIYH SGSTNYNPSL KSRVTISVDK SKNQFSLKLS SVTAADTAVY YCASLGDIYY YGMDVWGQGT TVTVSSAfTK AODVFPIISG CRHPKDNSOV VLACLITGYH PTSVTVTWYM GTQSQPQRTF
  • prolines are predicted to be Hyp-glycosylation sites or Pro-hydroxylation sites regardless of whether one inputs the entire sequence or just the mature sequence.
  • AfDTRPAfGS TAfOAHGVTS AfDTRPAOGS TAfOAHGVTS AfDTRPAfGS
  • AfDTRPAfGS TAfOAHGVTS AfDTRPAOGS TAfOAHGVTS AfDTRPAfGS
  • AfDTRPAfGS TAfOAHGVTS AfDTRPAOGS TAfOAHGVTS AfDTRPAfGS
  • Mucins are expected, when expressed and secreted in plants, to contain Hyp-glycomodules, too.
  • ARGGSRFERS ESRAHSGFYQ DDSLEEYYGQ RSRSREPLTD ADRGWAFSPA
  • C1-orf32, with five predicted Glyco-Hyp, has its proline-rich region in the middle of the protein, and the Pro's are somewhat spread out.
  • Although CSF has just two predicted Glyco-Hyp, it has a very strong hydroxylation/arabinogalactosylation region right at the N-terminus of the mature sequence, SPSPST... (AAs 22 to 27 of SEQ ID NO: 9).
  • This sequence resembles those that we deliberately add to the end of hGH, interferon, etc., to introduce hydroxylation/glycosylation.
  • The program may have a false negative at Pro-268 of C1-orf32.
  • The region 245-285 has quite a bit of Pro (12 of 40 residues), which means it probably has fairly rigid and extended stretches, and that region has an abundance of amino acids common in HRGPs.
  • The amino acids immediately surrounding these Pro's favor hydroxylation (A, S, T, V, P), but the overall environment (21 amino acid window) is not particularly rich in A, S, T, V, or P, and the target Pros are quite isolated from one another, or they occur within folded parts of the protein and are unlikely to be exposed to the post-translational machinery.
  • The environment is not considered rich if the 21 amino acid window (not counting the target residue on which it is centered) is less than 10% Pro, less than 10% A, less than 10% S, less than 10% T, and less than 10% V.
  • A protein is considered likely to be folded if it contains an even number of Cys residues, since these are likely to be paired off in disulfide bonds, and the disulfide bonds are likely to stabilize a folded conformation.
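The two screening rules just stated (the 21-residue composition window and the cysteine-parity check) can be sketched in Python as follows. This is a minimal illustration and not the predictive program referred to in the text. The function names, the handling of windows truncated at the ends of a sequence, and the requirement of at least one Cys are assumptions made here for illustration.

```python
WINDOW = 21           # residues, centered on the target Pro
FAVORABLE = "PASTV"   # residues named in the text
THRESHOLD = 0.10      # 10% of the flanking residues


def flanking_residues(seq, index, size=WINDOW):
    """Return the window around seq[index], excluding the target residue.

    Assumption: near a terminus the window is simply truncated.
    """
    half = size // 2
    return seq[max(0, index - half):index] + seq[index + 1:index + 1 + half]


def environment_is_rich(seq, index):
    """Rich unless P, A, S, T and V are each below 10% of the flank."""
    flank = flanking_residues(seq, index)
    if not flank:
        return False
    return any(flank.count(aa) / len(flank) >= THRESHOLD for aa in FAVORABLE)


def likely_folded(seq):
    """Heuristic from the text: an even number of Cys suggests disulfide pairing.

    Assumption: a sequence with no Cys at all is not flagged.
    """
    n_cys = seq.count("C")
    return n_cys > 0 and n_cys % 2 == 0


if __name__ == "__main__":
    motif = "AALSPSPEVPP"  # ANF motif quoted earlier in the text
    print(environment_is_rich(motif, motif.index("P")))
    print(likely_folded("ACDC"))
```

Because the rule is stated as a conjunction of "less than 10%" conditions, the sketch treats a window as rich as soon as any one of the five residues reaches 10% of the flanking positions.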
  • MGFQKFSPFL ALSILVLLQA GSLHAAPFRS ALESS#ADPA TLSEDEARLL LAALVQDYVQ MKASELEQEQ EREGSSLDSP RSKRCGNLST CMLGTYTQDF NKFHTFPQTA IGVGAPGKKR DMSSDLERDH RPHVSMPQNA N_(SEQ ID NO: 21)
  • prolines are predicted to be Hyp-glycosylation sites or Pro-hydroxylation sites regardless of whether one inputs the entire sequence or just the mature sequence.
  • VTKPDLVSKN TGMNMSITLI SEQ ID NO: 27
  • This protein has three predicted AraGal-Hyp sites. The third of these is the most likely to be accessible to the enzymes because it is in a Pro-rich stretch SA#MPEPQAP (amino acids 533-542 of SEQ ID NO:38).
  • The proteins of this category are likely to require modification in order to exhibit Hyp-glycosylation. It may therefore be advantageous to 1) mutate one or more non-proline amino acids to proline, at positions predicted to then be Hyp-glycosylation sites, 2) mutate one or more amino acids in the vicinity of a proline so as to increase the Hyp-score of that proline or the degree of glycosylation predicted to occur if that proline is hydroxylated, and/or 3) add a Hyp-glycomodule to one or both ends of the protein.
  • The Hyp-glycomodule strategy can be used with any of the proteins. However, for some of the proteins in this category, we also suggest below some specific substitutions which will create predicted arabinogalactosylated Hyp-glycosylation sites within those proteins. This could be done, without undue experimentation, for all of the proteins. Likewise, predicted arabinosylated Hyp-glycosylation sites can be created. Of course, finding mutations which will not also adversely affect biological activity is more difficult. See the discussion of mutational strategies, above.
  • Although the coagulation factor has predicted Hyp-glycosylation sites, they are not in Pro-rich regions, and hence are not likely to have an extended conformation (random coil, extended strand, polyproline helix).
  • Pro-37 is predicted to become arabinogalactosylated Hyp (#). However, that fails to take into account the fact that Pro-37 is part of the signal sequence. Another nominally predicted # site is at Pro-39. However, that fails to take into account that signal peptide residues are within the windows used in the predictive methods. If only the sequence of the mature protein is input, neither Pro-37 nor Pro-39 is predicted to be hydroxylated (and hence, there is no Hyp to be glycosylated).
  • The program still predicts that Pro-196 is hydroxylated (as shown above), but it is not thereby predicted to be glycosylated.
  • FGF-7 binds heparin through the interaction of positively charged Lys residues with the negatively charged heparin. See Wong and Burgess, "FGF2-Heparin Co-crystal Complex-assisted Design of Mutants FGF1 and FGF7 with Predictable Heparin Affinities," J. Biol. Chem., 273(29), 18617-18622 (1998).
  • The difference between enhanced GFP and ordinary GFP is that the former contains two amino acid substitutions in the vicinity of the chromophore (Phe-64 to Leu, Ser-65 to Thr).
  • Pro-20 and -22 would be predicted to be hydroxylated were they not part of the signal sequence.
  • This protein has predicted Pro-hydroxylation sites, but not predicted Hyp-glycosylation sites.
  • The sequence above is that of Interferon alpha2b. It differs from alpha2a at position 46 (23 of the mature sequence) (boldfaced), which is Arg in 2b and Lys in 2a.
  • LGLKDRRDFR FPQEMVKGSQ LQKAHVMSVL HEMLQQIFSL FHTERSSAAW
  • Pro-18 is predicted to become arabinogalactosylated-Hyp.
  • Several signal peptide residues are within the entropy window used in predicting whether Pro-Hydroxylation occurs.
  • Several signal peptide residues are also within the 11-aa window used for prediction of Hyp-glycosylation. If only the mature sequence is input, Pro-18 is not predicted to be hydroxylated.
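The effect described here, where a prediction for a Pro near the start of the mature protein changes depending on whether the signal peptide is included in the input, comes down to which residues fall inside the window. A minimal Python sketch of that comparison is given below. The helper names, the toy signal peptide `MKAAALLLSA`, and the truncation of windows at the sequence ends are hypothetical choices made for illustration. Only the 11-residue window size and the `SPSPST` motif come from the text.

```python
def centered_window(seq, index, size=11):
    """Return the residues within size // 2 positions of seq[index].

    Assumption: the window is truncated at the ends of the sequence.
    """
    half = size // 2
    return seq[max(0, index - half):index + half + 1]


def compare_windows(full_seq, signal_len, pro_index_full, size=11):
    """Return the window seen with the full input and with the mature input.

    signal_len is the number of signal peptide residues at the N-terminus;
    pro_index_full is the 0-based index of the target Pro in the full sequence.
    """
    pro_index_mature = pro_index_full - signal_len
    if pro_index_mature < 0:
        raise ValueError("target Pro lies inside the signal peptide")
    mature = full_seq[signal_len:]
    return (centered_window(full_seq, pro_index_full, size),
            centered_window(mature, pro_index_mature, size))


if __name__ == "__main__":
    signal = "MKAAALLLSA"   # hypothetical signal peptide
    mature = "SPSPSTQPWE"   # begins with the SPSPST motif cited in the text
    with_signal, mature_only = compare_windows(signal + mature, len(signal), 11)
    print(with_signal)   # includes hydrophobic signal residues
    print(mature_only)   # window truncated at the mature N-terminus
```

Running both inputs through the same predictor and comparing the results is one way to flag sites whose prediction depends on signal peptide residues.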
  • There are also cysteines in this protein.
  • Insulin-like Growth Factor I (AAA52539.1)
  • This protein has predicted Pro-hydroxylation sites, but not predicted Hyp-glycosylation sites.
  • Monocyte Chemotactic Protein-1 (NP002973.1) MKVSAALLCL LLIAATFIPQ GLAQPDAINA PVTCCYNFTN RKISVQRLAS YRRITSSKCP KEAVIFKTIV AKEICADPKQ KWVQDSMDHL DKQTQTPKT (SEQ ID NO: 49)
  • The plant expressed proteins are described in the following format: Protein name (host plant cell species, promoter, signal peptide, yield, references).
  • The signal peptide in the protein sequence is italicized. Pro residues in the protein sequence are bold (this doesn't mean that they are hydroxylated or glycosylated). N-glycosylation sites are "redlined".
  • Green Fluorescent Protein (GFP) (Tobacco cell suspension culture, CaMV 35S promoter, Arabidopsis basic chitinase signal peptide, 50% secreted, 12 mg/L; Su et al., High-level secretion of functional green fluorescent protein from transgenic tobacco cell cultures: characterization and sensing. Biotechnol. Bioeng. 85, 610-619, 2004).
  • Human serum albumin (Tobacco cell suspension culture, CaMV 35S promoter, tobacco extensin signal peptide, secreted, 5-10 mg/L detected in this lab; Tobacco leaves Chloroplasts, 11% TSP,
  • Human α1-antitrypsin (Rice cell suspension culture, RAmy3D promoter, RAmy3D signal peptide, secreted, 85 mg/L in shake flask, 25 mg/L in bioreactor; Terashima, M. et al. Production of functional human α1-antitrypsin by plant cell culture. Appl
  • Bryodin 1 (BD1) (Tobacco cell suspension culture, CaMV 35S promoter, tobacco extensin signal peptide, secreted, 30 mg/L; Francisco, J. A. et al. Expression and characterization of bryodin 1 and a bryodin 1-based single chain immunotoxin from tobacco cell culture. Bioconjug. Chem. 8, 708-713, 1997) 1 mikllvlwll iltiflkspt vegdvsfrls gatttsygvf iknlrealpy erkvynipll
  • Hepatitis B surface antigen (HBsAg) (Retained intracellular up to 22 mg/L in soybean and 2 mg/L in tobacco, (ocs)mas promoter, native signal peptide; Smith, M. L. et al. Hepatitis B surface antigen expression in plant cell culture: kinetics of antigen accumulation in batch culture and its intracellular form. Biotechnol. Bioeng. 80(7): 812-822, 2002;
  • Tobacco BY-2 cells, CaMV 35S promoter, soybean gene vspA signal peptide, 226 ng/mg TSP; Sojikul et al., PNAS, 100(5):2209-
  • mAb against HBsAg (Tobacco BY-2 cell suspension culture, CaMV 35S promoter, signal peptide of calreticulin of Nicotiana plumbaginifolia or signal peptide of hordothionin of barley, secreted, 2-7.5 mg/L; Yano, A. et al. Transgenic tobacco cells producing the human monoclonal antibody to Hepatitis B virus surface antigen. J. Med. Virol. 73, 208-215, 2004)
  • Human Interleukin-12 (N. tabacum cv Havana suspension culture, enhanced CaMV 35S promoter, native signal peptide, secreted, 800 µg/L; Kwon, T. H. et al. Expression and secretion of the heterodimeric protein interleukin-12 in plant cell suspension culture. Biotechnol. Bioeng. 81(7): 870-875, 2002)

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Molecular Biology (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biochemistry (AREA)
  • Biomedical Technology (AREA)
  • Microbiology (AREA)
  • Medicinal Chemistry (AREA)
  • Biophysics (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Cell Biology (AREA)
  • Physics & Mathematics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • General Chemical & Material Sciences (AREA)
  • Plant Pathology (AREA)
  • Toxicology (AREA)
  • Gastroenterology & Hepatology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Peptides Or Proteins (AREA)
  • Preparation Of Compounds By Using Micro-Organisms (AREA)
  • Enzymes And Modification Thereof (AREA)

Abstract

The present invention relates to plant cells that contain particular proline hydroxylases whose specificity determines which prolines, in a protein expressed therein, are hydroxylated. The resulting hydroxyprolines can, in turn, be glycosylated. It is the presence of glycosylated hydroxyprolines that is the most important determinant of the degree of secretion of the protein. The proline hydroxylation capabilities of the plant cell can be modified by co-expressing in that cell at least one exogenous proline hydroxylase that will hydroxylate at least one proline site in the protein that is not hydroxylated by the endogenous proline hydroxylase but that, if hydroxylated, can then be glycosylated by the plant cell, resulting in greater Hyp-glycosylation. Furthermore, even if the co-expressed exogenous proline hydroxylase does not achieve the desired level of Hyp-glycosylation in the native protein, such co-expression can still be used in conjunction with modification of the protein itself to increase the number and/or likelihood of Hyp-glycosylation sites.
PCT/US2007/073137 2006-07-10 2007-07-10 Co-expression of proline hydroxylases to facilitate Hyp-glycosylation of proteins expressed and secreted in plant cells Ceased WO2008008766A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07799434A EP2084285A4 (fr) 2006-07-10 2007-07-10 Co-expression of proline hydroxylases to facilitate Hyp-glycosylation of proteins expressed and secreted in plant cells

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US81955706P 2006-07-10 2006-07-10
US60/819,557 2006-07-10

Publications (2)

Publication Number Publication Date
WO2008008766A2 true WO2008008766A2 (fr) 2008-01-17
WO2008008766A3 WO2008008766A3 (fr) 2008-11-20

Family

ID=38924091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/073137 Ceased WO2008008766A2 (fr) 2006-07-10 2007-07-10 Co-expression of proline hydroxylases to facilitate Hyp-glycosylation of proteins expressed and secreted in plant cells

Country Status (2)

Country Link
EP (1) EP2084285A4 (fr)
WO (1) WO2008008766A2 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010139048A1 (fr) * 2009-06-01 2010-12-09 Kenneth James Friel Human milk peptides
US8623812B2 (en) 2004-04-19 2014-01-07 Ohio University Cross-linkable glycoproteins and methods of making the same
US8815806B2 (en) 2009-10-28 2014-08-26 University Of Manitoba Yellow pea seed protein-derived peptides
US8871468B2 (en) 1997-07-21 2014-10-28 Ohio University Synthetic genes for plant gums and other hydroxyproline-rich glycoproteins
US9006410B2 (en) 2004-01-14 2015-04-14 Ohio University Nucleic acid for plant expression of a fusion protein comprising hydroxyproline O-glycosylation glycomodule
EP2844750B1 (fr) * 2012-05-03 2018-06-20 DSM IP Assets B.V. Improved chymosin variants
CN109022464A (zh) * 2018-07-02 2018-12-18 西安巨子生物基因技术股份有限公司 Method for hydroxylation of recombinant human-type collagen
WO2024126708A1 (fr) * 2022-12-15 2024-06-20 Heinrich-Heine-Universität Düsseldorf Synthetic cytokines derived from the IL-6 family and their medical use

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7378506B2 (en) 1997-07-21 2008-05-27 Ohio University Synthetic genes for plant gums and other hydroxyproline-rich glycoproteins

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2133427B1 (fr) * 2004-01-14 2013-12-11 Ohio University Methods of producing peptides/proteins in plants and peptides/proteins produced thereby
EP3312282B1 (fr) * 2004-09-29 2019-07-03 Collplant Ltd. Collagen-producing plants and methods of generating and using same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of EP2084285A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8871468B2 (en) 1997-07-21 2014-10-28 Ohio University Synthetic genes for plant gums and other hydroxyproline-rich glycoproteins
US9006410B2 (en) 2004-01-14 2015-04-14 Ohio University Nucleic acid for plant expression of a fusion protein comprising hydroxyproline O-glycosylation glycomodule
US8623812B2 (en) 2004-04-19 2014-01-07 Ohio University Cross-linkable glycoproteins and methods of making the same
WO2010139048A1 (fr) * 2009-06-01 2010-12-09 Kenneth James Friel Human milk peptides
US8518894B2 (en) 2009-06-01 2013-08-27 Kenneth James Friel Human milk peptides
US8815806B2 (en) 2009-10-28 2014-08-26 University Of Manitoba Yellow pea seed protein-derived peptides
EP2844750B1 (fr) * 2012-05-03 2018-06-20 DSM IP Assets B.V. Improved chymosin variants
EP2844751B1 (fr) * 2012-05-03 2018-10-17 DSM IP Assets B.V. Improved chymosin enzyme variants
CN109022464A (zh) * 2018-07-02 2018-12-18 西安巨子生物基因技术股份有限公司 Method for hydroxylation of recombinant human-type collagen
WO2024126708A1 (fr) * 2022-12-15 2024-06-20 Heinrich-Heine-Universität Düsseldorf Synthetic cytokines derived from the IL-6 family and their medical use

Also Published As

Publication number Publication date
EP2084285A2 (fr) 2009-08-05
WO2008008766A3 (fr) 2008-11-20
EP2084285A4 (fr) 2010-01-13

Similar Documents

Publication Publication Date Title
WO2008008766A2 (fr) Co-expression of proline hydroxylases to facilitate Hyp-glycosylation of proteins expressed and secreted in plant cells
EP2133427B1 (fr) Methods of producing peptides/proteins in plants and peptides/proteins produced thereby
JP5517309B2 (ja) Collagen-producing plants and methods of generating and using same
CN107810271B (zh) Compositions and methods for producing polypeptides with altered glycosylation patterns in plant cells
EP1387884A2 (fr) Synthetic genes for plant gums and other hydroxyproline-rich glycoproteins
US20080242834A1 (en) Methods of Predicting Hyp-Glycosylation Sites For Proteins Expressed and Secreted in Plant Cells, and Related Methods and Products
JP5568702B2 (ja) Polypeptides having sequences for targeting control of expression and post-translational modification of recombinant polypeptides
WO2012052170A1 (fr) Secretion of recombinant polypeptides into the extracellular medium of diatoms
KR101906463B1 (ko) Method for producing transgenic rice callus for mass production of recombinant human acid alpha-glucosidase having high-mannose glycans for the treatment of Pompe disease, and transgenic rice callus for mass production of human acid alpha-glucosidase produced by the method
US10227604B2 (en) Plant synthesizing hypoallergenic paucimannose type N-glycan and uses thereof
EP2994479B1 (fr) Modified expression of prolyl-4-hydroxylase in Physcomitrella patens
JP3940793B2 (ja) Method for accumulating arbitrary peptides in plant protein bodies
Held Synthetic genes for the elucidation of the molecular requirements of P3 extensin intermolecular crosslinking
HK1139180B (en) Methods of producing peptides/proteins in plants and peptides/proteins produced thereby

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07799434

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

WWE Wipo information: entry into national phase

Ref document number: 2007799434

Country of ref document: EP