WO2011088044A2

WO2011088044A2 - Prime/proxy model enhancement

Info

Publication number: WO2011088044A2
Application number: PCT/US2011/020835
Authority: WO
Inventors: Karl Wassmann; Jay Magidson
Original assignee: Source Mdx
Priority date: 2010-01-12
Filing date: 2011-01-11
Publication date: 2011-07-21
Also published as: WO2011088044A3

Abstract

The present invention provides methods for identifying models useful for predicting a biological state of a subject.

Description

PRIME/PROXY MODEL ENHANCEMENT

RELATED APPLICATIONS

[0001] This application claims the benefit of USSN 61/294386 filed January 12, 2010, the content of which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

[0002] The present invention relates generally to the identification of models useful for predicting a biological state of a subject.

BACKGROUND

[0003] The advent of molecular medicine has resulted in explosive growth in the discovery of new biomarkers, and in the scope of biomarker knowledge as an indicator of a particular disease state, some other biological state of an organism, or drug discovery. Much of this is related to new technologies such as genomics, proteomics, and imaging. Yet in order to be useful, biomarkers must be rigorously and widely tested before they are deemed valid for specific medical uses. As a result many biomarkers have been available for decades, but their utility in predicting a particular disease state or in drug development and the clinic is still unclear. Thus, a need exists for methods of rapidly identifying biomarkers that have high predictive power and would more likely validate for specific medical uses.

SUMMARY OF THE INVENTION

[0004] In one aspect the invention provides methods of evaluating a biological state of a subject by providing (e.g., measuring) a test value for a prime indicator and a test value for a proxy indicator, and adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value. As a result, the prime indicator combined with the proxy indicator improves the predictive power of the prime indicator. The biological state of the subject is evaluated based upon the adjusted test value of the prime indicator.

[0005] In another aspect, the invention provides methods of determining the change in a value of a prime indicator attributed to a biological state of a subject by providing a test value for a prime indicator and a proxy indicator and adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value.

[0006] In a further aspect, the invention provides methods of increasing the predictive value of a prime indicator of a biological state by providing a test value for a prime indicator and a proxy indicator and adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value. As a result, the prime indicator combined with the proxy indicator improves the predictive power of the prime indicator.

[0007] In yet another aspect, the invention provides a method of evaluating a biological state of a subject over a period of time, comprising providing a test value for a prime indicator and a proxy indicator from a subject at a first period of time; adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value at the first period of time; providing a test value for a prime indicator and a proxy indicator from the subject at a second period of time; adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value at the second period of time; and comparing the change in value of the prime indicator at the first period of time to the change in value at the second period of time.

[0008] The indicator is a gene, or a gene expression product such as for example RNA or protein. Statistical significance is determined by the Wald test or the likelihood ratio test.

[0009] The value of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state. By a statistically significant contribution is meant having a p-value of < 0.05. The value of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state. The proxy indicator is correlated with the prime indicator, and the test value of the proxy indicator is similar in both a normal biological state and an altered biological state; as a result, the test value of the proxy indicator provides a surrogate normative value for the prime indicator. The proxy indicator and the prime indicator are correlated in both a normal biological state and an altered biological state. By a non-statistically significant contribution is meant having a p-value of > 0.05. By correlation is meant having a coefficient of correlation of > 0.5.

[00010] In some aspects the biological state is evaluated based upon (i) an index that is indicative of the state of the subject comprised of the adjusted test values of the prime indicators; (ii) the surrogate normative value for the prime indicator as determined by the proxy indicator; (iii) an index that is indicative of the state of the subject comprised of surrogate normative values for prime indicators as determined by the proxy indicators; or (iv) a comparison of the adjusted test value of the prime indicator to a control value (i.e., a reference value).

[00011] Also provided by the invention is a method of identifying a prime/proxy indicator pair for a function to predict a value indicative of the risk or probability of a biological state from a dataset of measurements of a plurality of indicators from a plurality of subjects each with the biological state known, by

a) providing a test value for each indicator where the test value represents a change in the value of each indicator in a subject with a first biological state compared to a subject with a second biological state;

b) determining the statistical significance of each test value in (a);

c) using the test values for each indicator to enumerate two indicator models indicative of a risk or probability of the biological state, where the two indicator models distinguish between a subject with the first biological state and a subject with the second biological state;

d) selecting the two indicator models enumerated in (c) capable of distinguishing between a subject with the first biological state and a subject with the second biological state with at least 75% accuracy;

e) determining a coefficient for each indicator in each two indicator model selected in (d) and identifying the two indicator models selected in (d) having one indicator having a positive coefficient and one indicator having a negative coefficient;

f) selecting the two indicator models identified in (e) in which the unique contribution of each individual indicator is statistically significant;

g) identifying the two indicator models selected in (f) that comprise one indicator whose test value alone makes a statistically significant contribution to the prediction of the biological state and one indicator whose test value alone does not make a statistically significant contribution to the prediction of the biological state as determined in (b);

h) determining whether the two indicators in the models identified in (g) are correlated to each other and selecting two indicator models wherein the two indicators have a correlation coefficient > 0.5, where the indicator that makes a statistically significant contribution as determined in step (b) is the prime indicator and the indicator that does not make a statistically significant contribution as determined in step (b) is the proxy indicator, thereby identifying a prime/proxy pair.

[00012] Optionally the method includes using the test values for each prime and proxy indicator identified in (h) to enumerate (i) two indicator models indicative of a risk or probability of the biological state, to identify a model comprising at least one prime indicator and at least one proxy indicator where the model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy; (ii) models comprised of multiple indicators indicative of a risk or probability of the biological state, to identify a model comprising multiple prime indicators and at least one proxy indicator that is correlated to each of the prime indicators in a two indicator model where the model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy; or (iii) models comprised of two indicator models and one or more Clinical Parameters indicative of a risk or probability of the biological state, to identify a model comprising at least one prime indicator, at least one proxy indicator and one or more Clinical Parameters where the model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy and improves the predictive power of the Clinical Parameters alone.
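By way of non-limiting illustration, the selection criteria of steps (d) through (h) of the identification method of paragraph [00011] may be sketched as a filter over two indicator models that have already been fitted. The sketch assumes that the accuracy, coefficients, in-model p-values, single-indicator p-values and correlation coefficient have been computed elsewhere (for example by logistic regression with a Wald test); the class name, field names, function name and indicator names below are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class TwoIndicatorModel:
    """Precomputed summary of one fitted two indicator model (hypothetical layout)."""
    a: str                # name of the first indicator
    b: str                # name of the second indicator
    accuracy: float       # fraction of subjects correctly classified
    coef_a: float         # model coefficient of indicator a
    coef_b: float         # model coefficient of indicator b
    p_a_in_model: float   # p-value of a's unique contribution within the model
    p_b_in_model: float   # p-value of b's unique contribution within the model
    correlation: float    # correlation coefficient between a and b


def find_prime_proxy_pairs(models, single_p, alpha=0.05,
                           min_accuracy=0.75, min_corr=0.5):
    """Return (prime, proxy) name pairs that satisfy steps (d) through (h).

    single_p maps each indicator name to its step (b) p-value, i.e. the
    significance of the indicator when used alone.
    """
    pairs = []
    for m in models:
        if m.accuracy < min_accuracy:                 # step (d)
            continue
        if m.coef_a * m.coef_b >= 0:                  # step (e): opposite signs
            continue
        if m.p_a_in_model >= alpha or m.p_b_in_model >= alpha:  # step (f)
            continue
        sig_a = single_p[m.a] < alpha
        sig_b = single_p[m.b] < alpha
        if sig_a == sig_b:                            # step (g): exactly one significant alone
            continue
        if m.correlation <= min_corr:                 # step (h)
            continue
        pairs.append((m.a, m.b) if sig_a else (m.b, m.a))
    return pairs


# Hypothetical summaries, for illustration only.
models = [
    TwoIndicatorModel("GENE_A", "GENE_B", 0.82, -1.4, 1.1, 0.001, 0.004, 0.71),
    TwoIndicatorModel("GENE_A", "GENE_C", 0.90, -1.2, -0.3, 0.001, 0.020, 0.60),
    TwoIndicatorModel("GENE_D", "GENE_B", 0.70, -0.9, 0.8, 0.010, 0.030, 0.65),
]
single_p = {"GENE_A": 0.003, "GENE_B": 0.41, "GENE_C": 0.01, "GENE_D": 0.20}
pairs = find_prime_proxy_pairs(models, single_p)
```

In this sketch the first element of each returned pair is the indicator that is significant alone (the prime indicator) and the second is the indicator that is not (the proxy indicator).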

[00013] In some aspects the plurality of indicators is greater than the plurality of subjects.

Also included in the invention is the use of the prime indicator identified by the methods of the invention to screen for therapeutic agents to treat the biological condition.

[00014] A method of evaluating a biological state of a subject over a period of time, comprising

providing a test value for a prime indicator from said subject at a first period of time, wherein the value of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state;

providing a test value for a proxy indicator from said subject at a first period of time, wherein the value of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and wherein

the proxy indicator is correlated with the prime indicator;

the test value of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the test value of the proxy indicator provides a surrogate normative value for the prime indicator;

adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value at the first period of time;

providing a test value for a prime indicator from said subject at a second period of time;

providing a test value for a proxy indicator from said subject at a second period of time;

adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value at the second period of time; and

comparing the change in value of the prime indicator at the first period of time to the second period of time.
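By way of non-limiting illustration, the comparison over two periods of time may be sketched as follows. The sketch assumes that the surrogate normative value is a linear function of the proxy test value with an intercept and slope determined beforehand; that functional form, the function names and the numeric values are assumptions made for illustration only.

```python
def change_from_surrogate(prime, proxy, intercept, slope):
    """Prime test value minus the surrogate normative value implied by the proxy."""
    return prime - (intercept + slope * proxy)


def compare_over_time(first, second, intercept, slope):
    """Compare the adjusted prime value at two periods of time.

    first and second are (prime, proxy) test values from the same subject.
    Returns the change at each period and the difference between them.
    """
    d1 = change_from_surrogate(*first, intercept, slope)
    d2 = change_from_surrogate(*second, intercept, slope)
    return d1, d2, d2 - d1


# Hypothetical test values, for illustration only.
d1, d2, trend = compare_over_time((12.0, 10.0), (11.0, 10.5),
                                  intercept=2.0, slope=1.0)
```

A non-zero difference between the two periods indicates that the prime indicator has moved relative to its surrogate normative value, independent of any movement in the proxy indicator.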

Also provided by the invention is a computer-readable medium having computer executable instructions thereon for performing the method of receiving a dataset of measurements for the patient, where the measurements include the following:

(i) at least one measurement for a prime indicator, where the value of the measurement of the prime indicator alone makes a statistically significant contribution to the evaluation of a biological state; and

(ii) at least one measurement for a proxy indicator, wherein the value of the measurement of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and wherein

the proxy indicator is correlated with the prime indicator;

the value of the measurement of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the value of measurement of the proxy indicator provides a surrogate normative value for the prime indicator,

and evaluating the dataset of measurements for the patient with a function to predict a value indicative of the risk or probability, wherein the function adjusts the value of the measurement of the prime indicator based upon the value of the measurement of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value.

[00015] The invention further provides an apparatus for identifying prime and proxy indicators for a function to predict a value indicative of the risk or probability of a biological state from a dataset of measurements of a plurality of clinical indicators from a plurality of subjects each with the biological state known, the apparatus comprising: a processor configured to evaluate the dataset of measurements from the subjects to thereby identify one or more prime indicators for the function and identify one or more proxy indicators for the function, wherein the measurements include the following:

wherein the value of the measurement of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state; and

wherein the value of the measurement of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and wherein

the proxy indicator is correlated with the prime indicator;

the value of the measurement of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the value of measurement of the proxy indicator provides a surrogate normative value for the prime indicator.

[00016] Optionally, the apparatus contains an output device for outputting the prime and the proxy indicators and/or an input device configured to receive the dataset of measurements.

[00017] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice of the present invention, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are expressly incorporated by reference in their entirety. In cases of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples described herein are illustrative only and are not intended to be limiting.

[00018] Other features and advantages of the invention will be apparent from and encompassed by the following detailed description and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[00019] This application is being filed with drawings that include color.

[00020] Figure 1A is a table showing the sample sizes of untreated localized prostate cancer subjects, healthy, normal subjects (without BPH) and BPH subjects by age and test group (i.e., Training Dataset and Test Dataset); Figure 1B is a table showing the mean PSA values of untreated localized prostate cancer subjects, healthy, normal subjects (without BPH) and BPH subjects by age and test group (i.e., Training Dataset and Test Dataset); Figure 1C is a table showing the percent of untreated localized prostate cancer subjects, healthy, normal subjects (without BPH) and BPH subjects amongst different test groups (i.e., Training and Test Datasets) meeting specified age-adjusted PSA criteria.

[00021] Figure 2 is a ROC curve based on PSA screening showing that PSA provides discrimination of prostate cancer patients (CaP) from age-matched normal, healthy subjects (without BPH) with a specificity of 94.7% (healthy normal subjects correctly classified) and a sensitivity of 71.1% (prostate cancer subjects correctly classified).

[00022] Figure 3 is a ROC curve for a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) compared to a model based on age-adjusted PSA criteria alone; the area under the curve (AUC) is 0.842 for PSA alone whereas the AUC is 0.946 for the 6-gene model.

[00023] Figure 4 is a ROC curve comparing the 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) combined with PSA to a model based on PSA alone; the area under the curve (AUC) is 0.842 for PSA alone whereas the AUC is 0.994 for the 6-gene+PSA model.

[00024] Figure 5 is a scatterplot showing that a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) combined with PSA discriminates prostate cancer patients (CaP) from age-matched normal, healthy subjects (without BPH). Only 2 of the 76 CaP and 3 of the 76 normal subjects are misclassified by the 6-gene+PSA model, based on a cut-off of 0.5.

[00025] Figure 6 is a discrimination plot showing that a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) combined with PSA discriminates prostate cancer patients (CaP) from age-matched normal, healthy subjects (without BPH) with 97.4% sensitivity (CaP subjects correctly classified; 2/76 subjects misclassified=97.4%) and 96.1% specificity (normal subjects correctly classified; 3/76 misclassified=96.1%).

[00026] Figure 7 is a discrimination plot of individual subject predicted probability scores based on a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) combined with PSA, showing that the 6-gene+PSA model provides good separation of prostate cancer (CaP) subjects from age-matched normal subjects.

[00027] Figure 8 is a ROC curve for a logit model based on PSA and age only, showing that PSA and age alone discriminate between prostate cancer (CaP) subjects and BPH subjects with 86.7% specificity (BPH subjects correctly classified) and 88.2% sensitivity (CaP subjects correctly classified).

[00028] Figure 9 is a ROC curve for a 5-gene logit model (S100A6, MYC, MAP2K1, C1QA and RP51077B9.4) combined with PSA and age showing that the 5-gene+PSA+age model discriminates between prostate cancer patients (CaP) and BPH subjects with 96.1% sensitivity (CaP correctly classified) and 93.3% specificity (BPH subjects correctly classified).

[00029] Figure 10 is a ROC curve comparing a 5-gene logit model (S100A6, MYC, MAP2K1, C1QA and RP51077B9.4) combined with PSA and age to a logit model based on PSA and age alone; the area under the curve (AUC) = 0.871 for the model based on PSA and age alone, whereas AUC = 0.989 for the 5-gene+PSA+age model.

[00030] Figure 11 is a discrimination plot based on the 5-gene logit model (S100A6, MYC, MAP2K1, C1QA and RP51077B9.4) combined with PSA and age showing that the 5-gene+PSA+age model discriminates between prostate cancer patients (CaP) and BPH subjects with a sensitivity of 96.1% (CaP correctly classified; 3/76 misclassified = 96.1%) and specificity of 93.3% (BPH correctly classified; 2/30 misclassified = 93.3%).

[00031] Figure 12 is a discrimination plot of individual subject predicted probability scores based on the 5-gene logit model (S100A6, MYC, MAP2K1, C1QA and RP51077B9.4) combined with PSA and age showing that the cut-off can be modulated to alter sensitivity and specificity of the model. A cut-off (probability of CaP) of 0.5 results in misclassification of 3 CaP subjects and 2 BPH subjects; a cut-off of 0.43 results in misclassification of 2 CaP subjects and 2 BPH subjects; and a cut-off of 0.17 results in misclassification of zero CaP subjects and 4 BPH subjects.

[00032] Figure 13 is a bivariate discrimination plot based on a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1)+PSA (Y-axis) and a 5-gene logit model (S100A6, MYC, MAP2K1, C1QA and RP51077B9.4)+PSA (X-axis) demonstrating that iterative classification based on the two models can yield almost perfect discrimination between untreated, localized prostate cancer subjects and normal healthy subjects (with and without BPH).

[00033] Figure 14 is a graph showing a comparison of differences in mean delta CT (cycle threshold) values for prostate cancer patients (CaP) versus normal subjects in two different test groups (Training Dataset versus Test Dataset).

[00034] Figure 15 depicts two scatterplots comparing the results obtained by using a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) alone (i.e., not used in combination with any other predictors) to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (without BPH) in two different test groups (Training Dataset versus Test Dataset).

[00035] Figure 16 depicts two ROC curves comparing the results obtained by using a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) alone (i.e., not used in combination with any other predictor) to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (without BPH) in two different test groups (Training Dataset versus Test Dataset).

[00036] Figure 17 depicts two scatterplots comparing the results obtained by using a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1)+PSA to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (without BPH) in two different test groups (Training Dataset versus Test Dataset).

[00037] Figure 18 depicts two ROC curves comparing the results obtained by using a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1)+PSA to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (without BPH) in two different test groups (Training Dataset versus Test Dataset).

[00038] Figure 19 is a ROC curve comparing the results obtained by using a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1)+PSA to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (with and without BPH).

[00039] Figures 20A and 20B are tables of re-estimated model parameters for the 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) (with PSA, Figure 20B; without PSA, Figure 20A) based on the combined results of two different test groups (Training and Test Datasets).

[00040] Figure 21 depicts two scatterplots comparing the combined results from two different test groups (Training Dataset and Test Dataset) of a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) used with and without PSA to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (without BPH), using the re-estimated parameters shown in Figures 20A and 20B.

[00041] Figure 22 is a ROC curve comparing the combined results from two different test groups (Training Dataset and Test Dataset) of a 6-gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) used with and without PSA to discriminate between prostate cancer subjects (CaP) and normal, healthy subjects (without BPH), using the re-estimated parameters shown in Figures 20A and 20B.

[00042] Figures 23A and 23B are tables for the Training and Test Datasets, respectively, of the 22 genes identified in Tables 3 and 4, with their mean gene expression values in prostate cancer subjects and normal, healthy subjects (without BPH) and their statistical significance in 1-gene models to discriminate those subjects. The three Prime genes (RP51077B9.4, CD97, and IQGAP1) in the 6-Gene logit model are highlighted blue, while the Proxy gene (SP1) from the 6-Gene model is highlighted yellow.

[00043] Figure 24 depicts two ROC curves comparing the results obtained by using a 1-gene model (SP1) developed in the Training Dataset and validated in the Test Dataset to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00044] Figures 25A and 25B depict ROC curves developed in the Training Dataset (Figure 25A) and validated in the Test Dataset (Figure 25B) comparing a 1-gene model (RP51077B9.4) to a 2-gene logit model (RP51077B9.4 and SP1) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00045] Figures 26A and 26B depict ROC curves developed in the Training Dataset (Figure 26A) and validated in the Test Dataset (Figure 26B) comparing a 1-gene model (CD97) to a 2-gene logit model (CD97 and SP1) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00046] Figures 27A and 27B depict ROC curves developed in the Training Dataset (Figure 27A) and validated in the Test Dataset (Figure 27B) comparing a 1-gene model (RP51077B9.4) to a 2-gene logit model (RP51077B9.4 and CD97) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00047] Figures 28A and 28B depict ROC curves developed in the Training Dataset (Figure 28A) and validated in the Test Dataset (Figure 28B) comparing a 2-gene logit model (RP51077B9.4 and CD97) to a 3-gene logit model (RP51077B9.4, CD97 and SP1) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00048] Figure 29 depicts scatterplots of CD97 expression versus SP1 expression in the Training and Test Datasets demonstrating the results obtained by using a threshold of 12.85 for CD97 expression to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00049] Figure 30 depicts the scatterplots of Figure 29 comparing a 1-gene model (CD97) to a 2-gene logit model (CD97 and SP1) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH).

[00050] Figure 31 depicts scatterplots of CD97 expression versus SP1 expression in the Training and Test Datasets. Figure 31A depicts the scatterplots of Figures 29 and 30 and includes concentration ellipses for the prostate cancer subjects and normal, healthy subjects (without BPH), respectively. Figure 31B depicts the scatterplots of Figure 31A where the CD97 expression data for the prostate cancer subjects have been transformed by subtracting a value of 0.527.

[00051] Figure 32 depicts scatterplots of CD97 expression (Figure 32A) or change in CD97 expression (Figure 32B) versus SP1 expression in the Training Dataset and includes concentration ellipses for the prostate cancer subjects and normal, healthy subjects (without BPH), respectively, as well as the regression line for the normal subjects. Change in CD97 expression is estimated by subtracting the predicted CD97 expression for a normal subject with the measured SP1 expression from the measured CD97 expression.

[00052] Figure 33 depicts scatterplots of CD97 expression (Figure 32A) versus SP1 expression (Figure 33A) or versus RP51077B9.4 expression (Figure 33B) in the Training Dataset and includes concentration ellipses for the prostate cancer subjects and normal, healthy subjects (without BPH), respectively, as well as discrimination lines for discriminating the normal subjects from the prostate cancer subjects.

[00053] Figure 34 is a table of the model parameters and statistics for the top ten of the eighty validated 2-gene logit models of Example 8. The models are ranked according to the validation entropy-based R². Prime genes are shaded blue, whereas Proxy genes are shaded yellow.

[00054] Figures 35A and 35B are tables with the results for the seven of the eighty validated 2-gene logit models of Example 8, where the model includes the SP1 Proxy gene. Figure 35A includes the model parameters and statistics based on the Training Dataset, as well as the respective p-values for the corresponding 1-gene models (the highlighted rows include genes that are included in the 6-Gene models of the previous Examples). Figure 35B depicts the correlation values for the respective Prime gene and SP1 in the Training Dataset, Test Dataset and the combined (pooled within prostate cancer subjects and normal subjects) Training and Test Datasets, respectively. Figure 35B is ranked based upon the pooled correlation values and further indicates whether the Prime/Proxy Gene pair is included in the "best" 6-Gene and 8-Gene models.

[00055] Figure 36 is a table of the top fourteen of the eighty validated 2-gene logit models of Example 8 comparing the measured down-regulation in the gene expression of the respective Prime gene between prostate cancer subjects and normal, healthy subjects (without BPH) to the predicted change in gene expression of the Prime gene based on the gene expression of its Proxy gene in the model.

[00056] Figures 37A and 37B are tables from Figures 20A and 20B of model parameters for the 6-Gene logit model (RP51077B9.4, CD97, CDK 2A, SP1, S100A6 and IQGAP1) (with PSA, Figure 37B; without PSA, Figure 37A) based on the combined Training and Test Datasets compared to the results of model parameters for models where CD97, RP51077B9.4 and IQGAP1 gene expression is replaced respectively with their corresponding predicted change in gene expression based on SP1 gene expression.

[00057] Figure 38 is a table of the percentages of the Normal, CaP1 and CaP4 cohorts correctly classified by applying the top 20 validated 2-gene logit models developed to discriminate between CaP1 subjects and normal healthy subjects (without BPH). The results for the CD97/SP1 model are highlighted.

[00058] Figure 39 depicts a scatterplot of CD97 expression versus SP1 expression to discriminate CaP4 subjects and normal, healthy subjects (without BPH) using the 2-gene (CD97 and SP1) logit model for CaP1 and normal subjects. 88.7% of the CaP4 subjects are correctly classified.

[00059] Figure 40 is a ROC curve comparing a 1-gene model (CD97) to a 2-gene logit model (CD97 and SP1) to discriminate between late stage prostate cancer subjects (CaP4) and normal, healthy subjects; the area under the curve (AUC) = 0.74 for the model based on CD97 alone, whereas AUC = 0.91 for the 2-gene model.

[00060] Figure 41 is a table of the nine validated gene models of Table 3, as well as a model based on age-adjusted PSA criteria alone for comparison, for the discrimination of prostate cancer subjects and normal, healthy subjects, which shows each model's sensitivity and specificity. Prime genes are shaded in blue, while Proxy genes are shaded in green.

[00061] Figure 42 is a table of the top twenty 8-gene models of Example 8 for the discrimination of prostate cancer subjects and normal, healthy subjects, which shows each model's sensitivity and specificity. Prime genes are shaded in blue, while Proxy genes are shaded in green.

DETAILED DESCRIPTION OF THE INVENTION

[00062] The present invention relates to the field of molecular medicine and in particular molecular diagnostics. In particular the invention provides a novel method for selecting biomarkers useful in the prediction of disease. The present invention is based in part upon the surprising observation of predictive two-gene models developed using linear regression based enumeration methodology. Specifically, it was observed that in two-gene models that were highly predictive in discriminating between two groups, one gene had a positive correlation coefficient and the other had a negative correlation coefficient. Furthermore, it was observed that when each gene of the pair was used separately to discriminate between groups, one of the genes maintained a highly significant discrimination whereas the other gene's discriminatory power was reduced or eliminated.

[00063] This observation is not only surprising but counterintuitive to the common approach taken by those skilled in the art in developing biomarker based models for prediction of disease. In contrast to the observation disclosed herein, it is common practice to exclude biomarkers that are not correlated or predictive as a single marker with the biological state.

[00064] A predictor variable that is predictive when used alone as the only predictor of an outcome is referred to herein as a "prime" indicator. A predictor variable that is not predictive of an outcome on its own but functions in a predictive model to increase the predictive power of the model is referred to herein as a "proxy" indicator. It was further discovered that the prime indicator changes as a result of the change in the biological state but the proxy indicator does not change. Accordingly, the proxy indicator can be used to predict the value of the prime indicator prior to the change in the biological state. In other words, the value of the proxy indicator can serve as a surrogate normative value for the prime indicator. This discovery solves a significant problem in the clinical utility of biomarkers: the lack of subject-specific longitudinal data. In molecular medicine, a change in the expression of a particular biomarker due to a biological state is often difficult to detect because the individual's normal level of expression cannot be determined. Thus, if a pre-disease level of expression of a biomarker can be estimated for an individual, the variation in the measure can be reduced and the predictive power of the model can be enhanced.

[00065] Mathematically, this discovery can be described as follows:

Y = expression value of an indicator predictive of a biological state ("prime indicator")

Y_0 = baseline expression of Y at a baseline time preceding the biological state

ΔY = change in Y from baseline = Y - Y_0

Y is correlated with ΔY

[00066] Assume that Y is a good predictor of the biological state because of its correlation with ΔY, which is itself a better predictor of the biological state. Thus, if Y is replaced in the predictive model by ΔY, prediction of the biological state can be improved. In the absence of longitudinal data, ΔY cannot be computed directly because the baseline measurement Y_0 is not available. However, if another variable X can be found to estimate the baseline value Y_0, then ΔY can be estimated using the following equation:

Estimated ΔY = Y - (a + λX), where a and λ are parameters to be estimated

[00067] In order to obtain a good estimate for ΔY, the variable X ("proxy indicator") must meet the following conditions: (1) X must be correlated with Y in both the disease group and the normal group, and (2) X must be non-predictive of the biological state.
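By way of non-limiting illustration, the estimate of ΔY described above can be sketched in Python as follows. The sketch assumes that the parameters a and λ are fitted by ordinary least squares to (X, Y) pairs from normal reference subjects; that fitting choice, the function names `fit_line` and `estimated_delta_y`, and all numerical values are hypothetical and are not taken from the disclosure.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y ~ a + lam * x; returns (a, lam)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    lam = sxy / sxx
    return my - lam * mx, lam


def estimated_delta_y(y, x, a, lam):
    """Estimated delta-Y = Y - (a + lam * X)."""
    return y - (a + lam * x)


# Hypothetical normal reference subjects: proxy values X and prime values Y.
normal_x = [1.0, 2.0, 3.0, 4.0]
normal_y = [3.1, 4.9, 7.2, 8.8]
a, lam = fit_line(normal_x, normal_y)  # assumption: parameters fitted on normals

# Hypothetical test subject: the proxy value predicts the subject's baseline,
# and the remainder of the prime value is the estimated change from baseline.
delta = estimated_delta_y(y=9.0, x=2.5, a=a, lam=lam)
print(f"a={a:.2f}, lambda={lam:.2f}, estimated delta-Y={delta:.2f}")
```

In this sketch the quantity a + λX plays the role of the surrogate normative value, so a subject whose prime value sits well above the value predicted from the proxy receives a large estimated ΔY.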

[00068] In other words, assume that (Y, X) follows a bivariate normal distribution with common variances and a common correlation ρ across the two populations: Disease (g=1) and Normals (g=0).

[00069] The population means, covariance and variances can be denoted as follows:

Y-means: $v_1$ for the disease subjects, and $v_0$ for normals

X-means: $u_1$ for the disease subjects, and $u_0$ for normals

Effects for Y and X in 1-gene models:

Variance(Y), Variance(X): $\sigma_Y^2$, $\sigma_X^2$

Covariance(Y,X): $\sigma_{YX}$

Let d = determinant of the Variance Covariance matrix: $d = \sigma_Y^2\,\sigma_X^2 - \sigma_{YX}^2$

[00070] Then, under the above assumptions, the 2-gene logit model:

$\mathrm{Logit}(\text{Disease}:\text{Normals}) = \alpha + \beta_{Y.X}\,Y + \beta_{X.Y}\,X$

has the same population parameters as the standard linear discriminant function. (Parameter estimates from logit and discriminant analyses are asymptotically equivalent.) Expressed in terms of the population means, variances and covariance:

[00071]

$d\,\beta_{Y.X} = \sigma_X^2\,(v_1 - v_0) - \sigma_{YX}\,(u_1 - u_0)$

$d\,\beta_{X.Y} = \sigma_Y^2\,(u_1 - u_0) - \sigma_{YX}\,(v_1 - v_0)$
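By way of non-limiting illustration, the two coefficient expressions above can be evaluated numerically. The Python sketch below uses a hypothetical function name, `logit_coefficients`, and hypothetical population values: a prime indicator Y whose mean shifts by one unit in disease, a proxy indicator X whose mean does not shift, unit variances, and a covariance of 0.8. The one-gene coefficient of Y shown for comparison, (v1 - v0) / Variance(Y), is the standard linear discriminant result under the same assumptions and is stated here as an assumption rather than quoted from the disclosure.

```python
def logit_coefficients(v1, v0, u1, u0, var_y, var_x, cov_yx):
    """Population coefficients (beta_Y.X, beta_X.Y) of the 2-gene model.

    v1, v0: Y-means for disease subjects and normals
    u1, u0: X-means for disease subjects and normals
    var_y, var_x, cov_yx: common variances and covariance of (Y, X)
    """
    d = var_y * var_x - cov_yx ** 2  # determinant of the covariance matrix
    if d <= 0:
        raise ValueError("covariance matrix must be positive definite")
    beta_y = (var_x * (v1 - v0) - cov_yx * (u1 - u0)) / d
    beta_x = (var_y * (u1 - u0) - cov_yx * (v1 - v0)) / d
    return beta_y, beta_x


# Hypothetical population: Y shifts in disease, X does not, correlation 0.8.
beta_y, beta_x = logit_coefficients(
    v1=1.0, v0=0.0, u1=0.0, u0=0.0, var_y=1.0, var_x=1.0, cov_yx=0.8
)
one_gene_effect = (1.0 - 0.0) / 1.0  # coefficient of Y when X is omitted

print(f"beta_Y.X = {beta_y:.3f}, beta_X.Y = {beta_x:.3f}")
print(f"implied lambda = {-beta_x / beta_y:.3f}")
```

With these hypothetical values the prime coefficient is positive and the proxy coefficient is negative, in line with the opposite-sign pattern noted above. Because the X-means are equal, the ratio -beta_X.Y / beta_Y.X reduces to Covariance(Y,X) / Variance(X), the slope of the regression of Y on X, so the two-gene model scores each subject on Y - λX up to a scale factor.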

[00072] Definitions

[00073] The following terms shall have the meanings indicated unless the context otherwise requires:

[00074] "Accuracy" refers to the degree of conformity of a measured or calculated quantity (a test reported value) to its actual (or true) value. Clinical accuracy relates to the proportion of true outcomes (true positives (TP) or true negatives (TN)) versus misclassified outcomes (false positives (FP) or false negatives (FN)), and may be stated as a sensitivity, specificity, positive predictive values (PPV) or negative predictive values (NPV), or as a likelihood, odds ratio, among other measures.

[00075] "Algorithm" is a set of rules for describing a biological condition. The rule set may be defined exclusively algebraically but may also include alternative or multiple decision points requiring domain-specific knowledge, expert interpretation or other clinical indicators.

[00076] An "agent" is a "composition" or a "stimulus", as those terms are defined herein, or a combination of a composition and a stimulus.

[00077] "Amplification" in the context of a quantitative RT-PCR assay is a function of the number of DNA replications that are required to provide a quantitative determination of its concentration.

[00078] A "baseline data set" is a set of values associated with an indicator resulting from evaluation of a biological sample (or population or set of samples) under a desired biological condition that is used for mathematically normative purposes. The desired biological condition may be, for example, the condition of a subject (or population or set of subjects) before exposure to an agent or in the presence of an untreated disease or in the absence of a disease.

Alternatively, or in addition, the desired biological condition may be health of a subject or a population or set of subjects. Alternatively, or in addition, the desired biological condition may be that associated with a population or set of subjects selected on the basis of at least one of age group, gender, ethnicity, geographic location, nutritional history, medical condition, clinical indicator, medication, physical activity, body mass, and environmental exposure.

[00079] A "biological state" of a subject is the condition of the subject, as with respect to circumstances or attributes of the biological condition.

[00080] A "biological condition" of a subject is the condition of the subject in a pertinent realm that is under observation, and such realm may include any aspect of the subject capable of being monitored for change in condition, such as health; disease including cancer; trauma; aging; infection; tissue degeneration; developmental steps; physical fitness; obesity; and mental state. As can be seen, a condition in this context may be chronic or acute or simply transient.

Moreover, a targeted biological condition may be manifest throughout the organism or population of cells or may be restricted to a specific organ (such as skin, heart, eye or blood) but in either case, the condition may be monitored directly by a sample of the affected population of cells or indirectly by a sample derived elsewhere from the subject. The term "biological condition" includes a "physiological condition". For example, the biological condition is cancer such as prostate cancer, ovarian cancer, lung cancer, breast cancer, skin cancer, colon cancer, or cervical cancer.

[00081] "Biomarker(s) " can be classified based on different parameters. They can be classified based on their characteristics such as imaging biomarkers (CT, PET, MRI) or molecular biomarkers. Molecular biomarkers can be used to refer to nonimaging biomarkers that have biophysical properties, which allow their measurements in biological samples (eg, plasma, serum, cerebrospinal fluid, bronchoalveolar lavage, biopsy) and include nucleic acids-based biomarkers such as gene mutations or polymorphisms and quantitative gene expression analysis, peptides, proteins, lipids metabolites, and other small molecules. Biomarkers can also be classified based on their application such as diagnostic biomarkers, staging of disease biomarkers, disease prognosis biomarkers, and biomarkers for monitoring the clinical response to an intervention. Another category of biomarkers includes those used in decision making in early drug development. For instance, pharmacodynamic (PD) biomarkers are markers of a certain pharmacological response, which are of special interest in dose optimization studies.

[00082]

[00083] "Body fluid" of a subject includes blood, urine, spinal fluid, lymph, mucosal secretions, prostatic fluid, semen, haemolymph or any other body fluid known in the art for a subject.

[00084] "Calibrated data set" is a function of a member of a first data set and a corresponding member of a baseline data set for a given constituent in a panel.

[00085] A "circulating endothelial cell" ("CEC") is an endothelial cell from the inner wall of blood vessels which sheds into the bloodstream under certain circumstances, including inflammation, and contributes to the formation of new vasculature associated with cancer pathogenesis. CECs may be useful as a marker of tumor progression and/or response to antiangiogenic therapy.

[00086] A "circulating tumor cell" ("CTC") is a tumor cell of epithelial origin which is shed from the primary tumor upon metastasis, and enters the circulation. The number of circulating tumor cells in peripheral blood is associated with prognosis in patients with metastatic cancer. These cells can be separated and quantified using immunologic methods that detect epithelial cells.

[00087] A "clinical indicator" is any physiological datum used alone or in conjunction with other data in evaluating the physiological condition of a collection of cells or of an organism. This term includes pre-clinical indicators.

[00088] "Clinical parameters" encompasses a subject's health status or other characteristics, such as, without limitation, age (AGE), ethnicity (RACE), gender (SEX), and family history of disease, such as cancer. A clinical parameter is also referred to as a covariate.

[00089] A "Composition" includes a chemical compound, a nutraceutical, a pharmaceutical, a homeopathic formulation, an allopathic formulation, a naturopathic formulation, a combination of compounds, a toxin, a food, a food supplement, a mineral, and a complex mixture of substances, in any physical state or in a combination of physical states.

[00090] A "Control Value" is a value obtained from a reference sample(s) in which the biological state is known. The control value may be an index.

[00091] "Correlation Coefficient" is a measure of the interdependence of two random variables that ranges in value from -1 to +1, indicating perfect negative correlation at -1, absence of correlation at zero, and perfect positive correlation at +1. Also called coefficient of correlation. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson product-moment correlation coefficient, or "Pearson's correlation," which is mainly sensitive to a linear relationship between two variables. It is obtained by dividing the covariance of the two variables by the product of their standard deviations. Other correlation coefficients have been developed to be more robust than the Pearson correlation, or more sensitive to nonlinear relationships.

[00092] By "correlated" it is meant that the correlation coefficient is greater than 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, or 0.9. Preferably, the correlation coefficient is 0.5 or greater.
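By way of non-limiting illustration, the Pearson correlation coefficient and the preferred threshold of 0.5 can be sketched in Python as follows. The function names `pearson_r` and `is_correlated` and the sample values are hypothetical. The sketch compares the signed coefficient against the threshold; whether the absolute value is intended instead is not stated in the definition and is left as an open assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (the common 1/n factors cancel)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def is_correlated(xs, ys, threshold=0.5):
    """True when the coefficient meets the chosen threshold (default 0.5)."""
    return pearson_r(xs, ys) >= threshold


# Hypothetical expression values for a prime and a proxy indicator.
prime = [1.0, 3.0, 2.0, 4.0]
proxy = [1.0, 2.0, 3.0, 4.0]
r = pearson_r(proxy, prime)
print(f"r = {r:.2f}, correlated at 0.5: {is_correlated(proxy, prime)}")
```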

[00093] To "derive" a data set from a sample includes determining a set of values associated with the indicator either (i) by direct measurement of such indicator in a biological sample or (ii) by indirect measurement of such indicator in a biological sample.

[00094] A "Digital computer system" includes a programmable calculator or other programmable device.

[00095] "Distinct RNA or protein constituent" is a distinct expressed product of a gene, whether RNA or protein. An "expression" product of a gene includes the gene product whether RNA or protein resulting from translation of the messenger RNA.

[00096] By "enumerated" or "enumeration" it is meant ascertaining the number of possible models predictive of a biological state. See, for example, the enumeration methodology described in Example 2.
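By way of non-limiting illustration, the number of candidate panels of a given size can be counted, and the panels listed, as in the Python sketch below. The sketch shows only the combinatorial step and does not reproduce the enumeration methodology of Example 2; the function name `enumerate_panels` and the placeholder symbols GENE_C and GENE_D are hypothetical.

```python
from itertools import combinations
from math import comb


def enumerate_panels(indicators, size):
    """Return every unordered panel of `size` distinct indicators."""
    return list(combinations(indicators, size))


# CD97 and SP1 are named in the figures; GENE_C and GENE_D are placeholders.
candidates = ["CD97", "SP1", "GENE_C", "GENE_D"]
panels = enumerate_panels(candidates, 2)

# The number of 2-indicator panels equals "n choose 2".
print(len(panels), comb(len(candidates), 2))
for panel in panels:
    print(panel)
```

Each listed panel would then be fitted and scored by whatever model and criterion the practitioner selects.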

[00097] "FN" is false negative, which for a disease state test means classifying a disease subject incorrectly as non-disease or normal.

[00098] "FP" is false positive, which for a disease state test means classifying a normal subject incorrectly as having disease.

[00099] A "formula," "algorithm" or "model" is any mathematical equation, algorithmic, analytical or programmed process, statistical technique, or comparison, that takes one or more continuous or categorical inputs and calculates an output value, sometimes referred to as an "index" or "index value." Non-limiting examples of "formulas" include comparisons to reference values or profiles, sums, ratios, and regression operators, such as coefficients or exponents, value transformations and normalizations (including, without limitation, those normalization schemes based on clinical parameters, such as gender, age, or ethnicity), rules and guidelines, statistical classification models, and neural networks trained on historical populations. Of particular use in combining indicators are linear and non-linear equations and statistical significance and classification analyses to determine the relationship between levels of an indicator detected in a subject sample and the survivability of the subject. Techniques which may be used in survival and time to event hazard analysis include, but are not limited to, Cox, Zero-Inflation Poisson, Markov, Weibull, Kaplan-Meier and Greenwood models, well known to those of skill in the art. In panel and combination construction, of particular interest are structural and syntactic statistical classification algorithms, and methods of risk index construction, utilizing pattern recognition features, including, without limitation, such established techniques as cross-correlation, Principal Components Analysis (PCA), factor rotation, Logistic Regression Analysis (LogReg), Kolmogorov-Smirnov tests (KS), Linear Discriminant Analysis (LDA), Eigengene Linear Discriminant Analysis (ELDA), Support Vector Machines (SVM), Random Forest (RF), Recursive Partitioning Tree (RPART), as well as other related decision tree classification techniques (CART, LART, LARTree, FlexTree, amongst others), Shrunken Centroids (SC), StepAIC, K-means, Kth-Nearest Neighbor, Boosting, Decision Trees, Neural Networks, Bayesian Networks, and Hidden Markov Models, among others. Many of these techniques are useful either combined with an indicator selection technique, such as forward selection, backwards selection, or stepwise selection, complete enumeration of all potential panels of a given size, genetic algorithms, voting and committee methods, or they may themselves include biomarker selection methodologies in their own technique. These may be coupled with information criteria, such as Akaike's Information Criterion (AIC) or Bayes Information Criterion (BIC), in order to quantify the tradeoff between additional biomarkers and model improvement, and to aid in minimizing overfit. The resulting predictive models may be validated in other clinical studies, or cross-validated within the study they were originally trained in, using such techniques as Bootstrap, Leave-One-Out (LOO) and 10-Fold cross-validation (10-Fold CV). At various steps, false discovery rates (FDR) may be estimated by value permutation according to techniques known in the art.

[000100] A "Gene Expression Panel" (Precision Profile™) is an experimentally verified set of constituents, each constituent being a distinct expressed product of a gene, whether RNA or protein, wherein constituents of the set are selected so that their measurement provides a measurement of a targeted biological condition.

[000101] A "Gene Expression Profile" is a set of values associated with constituents of a Gene Expression Panel (Precision Profile™) resulting from evaluation of a biological sample (or population or set of samples).

[000102] A "Gene Expression Profile Inflammation Index" is the value of an index function that provides a mapping from an instance of a Gene Expression Profile into a single-valued measure of inflammatory condition.

[000103] A "Gene Expression Profile Cancer Index" is the value of an index function that provides a mapping from an instance of a Gene Expression Profile into a single-valued measure of a cancerous condition.

[000104] The "health" of a subject includes mental, emotional, physical, spiritual, allopathic, naturopathic and homeopathic condition of the subject.

[000105] "Index" is an arithmetically or mathematically derived numerical characteristic developed for aid in simplifying or disclosing or informing the analysis of more complex quantitative information. A survivability and/or survival time index may be determined by the application of a specific algorithm to a plurality of subjects or samples with a common biological condition.

[000106] "Indicator" in the context of the present invention encompasses, without limitation, proteins, nucleic acids, and metabolites, together with their polymorphisms, mutations, variants, modifications, subunits, fragments, protein-ligand complexes, and degradation products, elements, related metabolites, and other analytes or sample-derived measures. Indicators can also include mutated proteins or mutated nucleic acids. Indicators also encompass non-blood-borne factors or non-analyte physiological markers of health status, such as "clinical parameters" defined herein, as well as "traditional laboratory risk factors", also defined herein. Indicators also include any calculated indices created mathematically or combinations of any one or more of the foregoing measurements, including temporal trends and differences. Where available, and unless otherwise described herein, biomarkers which are gene products are identified based on the official letter abbreviation or gene symbol assigned by the international Human Genome Organization Naming Committee (HGNC) and listed at the date of this filing at the US National Center for Biotechnology Information (NCBI) web site (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene), also known as Entrez Gene. An indicator is, for example, a biomarker.

[000107] "Inflammation" is used herein in the general medical sense of the word and may be an acute or chronic; simple or suppurative; localized or disseminated; cellular and tissue response initiated or sustained by any number of chemical, physical or biological agents or combination of agents.

[000108] "Inflammatory state" is used to indicate the relative biological condition of a subject resulting from inflammation, or characterizing the degree of inflammation.

[000109] A "large number" of data sets based on a common panel of genes is a number of data sets sufficiently large to permit a statistically significant conclusion to be drawn with respect to an instance of a data set based on the same panel.

[000110] "Measuring" or "measurement," means assessing the presence, absence, quantity or amount of either a given substance within a clinical or subject-derived sample, including the derivation of qualitative, semi-quantitative or quantitative concentration levels of such substances, or otherwise evaluating the values or categorization of a subject's non-analyte clinical parameters.

[000111] "Molecular risk assessment" means a procedure in which biomarkers (i.e., indicators) are used to estimate a person's risk for developing a biological condition.

[000112] "Negative predictive value" or "NPV" is calculated by TN/(TN + FN) or the true negative fraction of all negative test results. It also is inherently impacted by the prevalence of the disease and pre-test probability of the population intended to be tested.

[000113] See, e.g., O'Marcaigh AS, Jacobson RM, "Estimating the Predictive Value of a Diagnostic Test, How to Prevent Misleading or Confusing Results," Clin. Ped. 1993, 32(8): 485-491, which discusses specificity, sensitivity, and positive and negative predictive values of a test, e.g., a clinical diagnostic test. Often, for binary disease state classification approaches using a continuous diagnostic test measurement, the sensitivity and specificity are summarized by Receiver Operating Characteristics (ROC) curves according to Pepe et al., "Limitations of the Odds Ratio in Gauging the Performance of a Diagnostic, Prognostic, or Screening Marker," Am. J. Epidemiol 2004, 159 (9): 882-890, and summarized by the Area Under the Curve (AUC) or c-statistic, an indicator that allows representation of the sensitivity and specificity of a test, assay, or method over the entire range of test (or assay) cut points with just a single value. See also, e.g., Shultz, "Clinical Interpretation of Laboratory Procedures," chapter 14 in Tietz, Fundamentals of Clinical Chemistry, Burtis and Ashwood (eds.), 4th edition 1996, W.B. Saunders Company, pages 192-199; and Zweig et al., "ROC Curve Analysis: An Example Showing the Relationships Among Serum Lipid and Apolipoprotein Concentrations in Identifying Subjects with Coronary Artery Disease," Clin. Chem., 1992, 38(8): 1425-1428. An alternative approach using likelihood functions, BIC, odds ratios, information theory, predictive values, calibration (including goodness-of-fit), and reclassification measurements is summarized according to Cook, "Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction," Circulation 2007, 115: 928-935.
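By way of non-limiting illustration, the Area Under the Curve can be computed directly from model scores without drawing the ROC curve, using the equivalent rank-based form: the AUC equals the probability that a randomly chosen disease subject scores higher than a randomly chosen normal subject, with ties counted as one half. The Python sketch below uses a hypothetical function name, `auc`, and hypothetical scores.

```python
def auc(disease_scores, normal_scores):
    """AUC as the fraction of (disease, normal) pairs in which the disease
    subject has the higher score; tied pairs count as one half."""
    if not disease_scores or not normal_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for d in disease_scores:
        for n in normal_scores:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(disease_scores) * len(normal_scores))


# Hypothetical model scores (higher = more disease-like).
disease = [0.9, 0.6, 0.4]
normal = [0.5, 0.3]
print(f"AUC = {auc(disease, normal):.3f}")
```

An AUC of 0.5 corresponds to no discrimination and an AUC of 1.0 to perfect separation of the two groups.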

[000114] A "normal" subject is a subject who is generally in good health, has not been diagnosed with a biological condition, e.g., is asymptomatic for prostate cancer, and lacks the traditional laboratory risk factors for the biological condition.

[000115] A "normative value" is the value of the indicator in a normal subject.

[000116] An "Outcome category", synonymous with "outcome", refers to a particular category of a "categorical outcome variable".

[000117] An "Outcome score", synonymous with "outcome value", refers to a quantitative value associated with a given category or level of an "Outcome variable".

[000118] An "Outcome variable" is a variable containing at least one set of scores that are believed to be correlated with an underlying biological condition of the cases, and may be categorical ("categorical outcome variable") which may be nominal or ordinal, continuous or may denote an event history.

[000119] A "Panel" is an experimentally verified set of indicators. A "panel" includes a set of at least two indicators.

[000120] A "Profile" is a set of values associated with constituents of an indicator resulting from evaluation of a biological sample (or population or set of samples).

[000121] A "population of cells" refers to any group of cells wherein there is an underlying commonality or relationship between the members in the population of cells, including a group of cells taken from an organism or from a culture of cells or from a biopsy, for example.

[000122] "Positive predictive value" or "PPV" is calculated by TP/(TP+FP) or the true positive fraction of all positive test results. It is inherently impacted by the prevalence of the disease and pre-test probability of the population intended to be tested.

[000123] "Prime indicator" is an indicator that alone makes a statistically significant contribution to the evaluation of the biological state. Optimally, the change in the value of the prime indicator in a normal subject compared to a subject with an altered biological state is greater than the standard error of the test that is used to measure the value.

[000124] "Proxy indicator" is an indicator that alone does not make a statistically significant contribution to the evaluation of the biological state, is correlated with the prime indicator, and whose value is similar in both a normal biological state and an altered biological state.

[000125] "Risk" in the context of the present invention, relates to the probability that an event will occur over a specific time period, and can mean a subject's "absolute" risk or "relative" risk. Absolute risk can be measured with reference to either actual observation post-measurement for the relevant time cohort, or with reference to index values developed from statistically valid historical cohorts that have been followed for the relevant time period. Relative risk refers to the ratio of absolute risks of a subject compared either to the absolute risks of lower risk cohorts, across population divisions (such as tertiles, quartiles, quintiles, or deciles, etc.) or an average population risk, which can vary by how clinical risk factors are assessed. Odds ratios, the proportion of positive events to negative events for a given test result, are also commonly used (odds are according to the formula p/(1-p) where p is the probability of event and (1-p) is the probability of no event) to no-conversion.
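By way of non-limiting illustration, the odds formula p/(1-p) and the ratio of absolute risks can be sketched in Python as follows. The function names `odds` and `relative_risk` and the probabilities used are hypothetical.

```python
def odds(p):
    """Odds of an event with probability p, computed as p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def relative_risk(subject_risk, reference_risk):
    """Ratio of a subject's absolute risk to a reference cohort's absolute risk."""
    if reference_risk <= 0.0:
        raise ValueError("reference risk must be positive")
    return subject_risk / reference_risk


# Hypothetical absolute risks over the same time period.
subject_risk = 0.30      # subject's absolute risk
reference_risk = 0.10    # absolute risk in a lower-risk cohort

print(f"odds for subject  = {odds(subject_risk):.3f}")
print(f"relative risk     = {relative_risk(subject_risk, reference_risk):.1f}")
```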

[000126] "Risk evaluation" or "evaluation of risk" in the context of the present invention encompasses making a prediction of the probability, odds, or likelihood that an event (e.g., death) or disease state may occur, and/or the rate of occurrence of the event (e.g., death) or conversion from one disease state to another, i.e., from a normal condition to cancer or from cancer remission to cancer, or from primary cancer occurrence to occurrence of a cancer metastasis. Risk evaluation can also comprise prediction of future clinical parameters, traditional laboratory risk factor values, or other indices of cancer results, either in absolute or relative terms in reference to a previously measured population. Such differing use may require different combinations and individualized panels, mathematical algorithms, and/or cut-off points, but be subject to the same aforementioned measurements of accuracy and performance for the respective intended use.

[000127] A "sample" from a subject may include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, taken from the subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision or intervention or other means known in the art. The sample is blood, urine, spinal fluid, lymph, mucosal secretions, prostatic fluid, semen, haemolymph or any other body fluid known in the art for a subject. The sample is also a tissue sample. The sample is or contains a circulating endothelial cell or a circulating tumor cell.

[000128] "Sensitivity" is calculated by TP/(TP+FN) or the true positive fraction of disease subjects.

[000129] "Specificity" is calculated by TN/(TN+FP) or the true negative fraction of non-disease or normal subjects.
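By way of non-limiting illustration, sensitivity and specificity, together with the positive and negative predictive values defined herein, can be computed from the four classification counts as in the Python sketch below. The function name `diagnostic_metrics` and the counts are hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the four ratios from counts of true/false positives/negatives.

    Raises ZeroDivisionError if any denominator is zero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # true positive fraction of disease subjects
        "specificity": tn / (tn + fp),  # true negative fraction of normal subjects
        "ppv": tp / (tp + fp),          # true positive fraction of positive results
        "npv": tn / (tn + fn),          # true negative fraction of negative results
    }


# Hypothetical study: 100 disease subjects and 100 normal subjects.
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity depend only on counts within the disease group and the normal group respectively, whereas PPV and NPV mix the two groups and therefore change with the prevalence of disease in the tested population.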

[000130] By "statistically significant", it is meant that the alteration is greater than what might be expected to happen by chance alone (which could be a "false positive"). Statistical significance can be determined by any method known in the art. Commonly used measures of significance include the p-value, which presents the probability of obtaining a result at least as extreme as a given data point, assuming the data point was the result of chance alone. A result is often considered highly significant at a p-value of 0.05 or less and statistically significant at a p-value of 0.10 or less. Such p-values depend significantly on the power of the study performed. By "non-statistically significant" it is meant a p-value greater than 0.05.
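By way of non-limiting illustration, one of the many methods known in the art for obtaining a p-value is an exact permutation test, sketched in Python below. The choice of this test, the function name `permutation_p_value`, and the sample values are assumptions made for illustration and are not prescribed by the disclosure. The test asks how often a random relabeling of the pooled subjects produces a difference in group means at least as extreme as the one observed.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        # Mean of the relabeled group A minus mean of the remaining subjects.
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    splits = 0
    # Visit every way of choosing which pooled subjects are labeled group A.
    for idx in combinations(range(len(pooled)), n_a):
        splits += 1
        if abs(mean_diff(sum(pooled[i] for i in idx))) >= observed - 1e-12:
            extreme += 1
    return extreme / splits


# Hypothetical expression values for disease and normal subjects.
disease = [5.0, 6.0, 7.0]
normal = [1.0, 2.0, 3.0]
p = permutation_p_value(disease, normal)
print(f"p-value = {p:.2f}")
```

Exhaustive relabeling is practical only for small groups; for larger studies a random sample of relabelings is commonly used instead.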

[000131] A "set " or "population" of samples or subjects refers to a defined or selected group of samples or subjects wherein there is an underlying commonality or relationship between the members included in the set or population of samples or subjects.

[000132] A "subject" is a cell, tissue, or organism, human or non-human, whether in vivo, ex vivo or in vitro, under observation. As used herein, reference to predicting the survivability and/or survival time of a subject based on a sample from the subject, includes using blood or other tissue sample from a human subject to evaluate the human subject's predicted survivability and/or survival time; it also includes, for example, using a blood sample itself as the subject to evaluate, for example, the effect of therapy or an agent upon the sample.

[000133] A "stimulus" includes (i) a monitored physical interaction with a subject, for example ultraviolet A or B, or light therapy for seasonal affective disorder, or treatment of psoriasis with psoralen or treatment of cancer with embedded radioactive seeds, other radiation exposure, and (ii) any monitored physical, mental, emotional, or spiritual activity or inactivity of a subject.

[000134] "Survivability" refers to the ability to remain alive or continue to exist (i.e., alive or dead).

[000135] "Survival time" refers to the length or period of time a subject is able to remain alive or continue to exist as measured from an initial date (e.g., date of birth, date of diagnosis of a particular disease or stage of disease, date of initiating a therapeutic regimen, etc.) to a later date in time (e.g., date of death, date of termination of a particular therapeutic regimen, or an arbitrary date).

[000136] "Therapy" or "therapeutic regimen" includes all interventions whether biological, chemical, physical, metaphysical, or combination of the foregoing, intended to sustain or alter the monitored biological condition of a subject.

[000137] "TN" is true negative, which for a disease state test means classifying a non-disease or normal subject correctly.

[000138] "TP" is true positive, which for a disease state test means correctly classifying a disease subject.

[000139] A "value" is a numerical quantity measured, assigned or computed for the indicator.

[000140] Diagnostic and Prognostic Indications of the Invention

[000141] The methods disclosed herein are used for evaluating a biological state of a subject. By evaluating a biological condition it is meant that the methods are used with subjects at risk for developing a biological condition, subjects who may or may not have already been diagnosed with the biological condition and subjects undergoing treatment and/or therapies for the biological condition. The methods of the present invention can also be used to monitor or select a treatment regimen for a subject having a biological condition, and to screen subjects who have not been previously diagnosed as having a biological condition.

[000142] Preferably, the methods of the present invention are used to identify and/or diagnose subjects who are asymptomatic for a biological condition. "Asymptomatic" means not exhibiting the traditional symptoms.

[000143] The methods of the present invention may also be used to identify and/or diagnose subjects already at higher risk of developing a biological condition, or a reoccurrence of it, based solely on the traditional risk factors.

[000144] A biological state of a subject is evaluated by providing a test value for a prime indicator and a test value for a proxy indicator. The test value for the prime indicator is adjusted based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value; as a result, combining the prime indicator with the proxy indicator improves the predictive power of the prime indicator. The prime indicator alone makes a statistically significant contribution to the evaluation of the biological state. The proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and (i) the proxy indicator is correlated with the prime indicator; and (ii) the test value of the proxy indicator is similar in both a normal biological state and an altered biological state, so that the test value of the proxy indicator provides a surrogate normative value for the prime indicator.

[000145] Optimally, the proxy indicator and the prime indicator are correlated in both a normal biological state and an altered biological state. In some embodiments by providing a value it is meant measuring the value of the indicator. In some embodiments by evaluating is meant comparing the adjusted test value of the prime indicator to a control or reference value.

[000146] Preferably the indicators are selected so as to predict the occurrence, presence or reoccurrence of a biological state with at least 75% accuracy, more preferably 80%, 85%, 90%, 95%, 97%, 98%, 99% or greater accuracy.

[000147] A reference value can be relative to a number or value derived from population studies, including without limitation, such subjects having the same biological condition, subjects having the same or similar age range, subjects in the same or similar ethnic group, subjects having family histories of the biological condition, or relative to the starting sample of a subject undergoing treatment. Such reference values can be derived from statistical analyses and/or risk prediction data of populations obtained from mathematical algorithms and computed indices. Reference indices can also be constructed and used using algorithms and other methods of statistical and structural classification.

[000148] In one embodiment of the present invention, the reference value is a control sample derived from one or more subjects who are not at risk or at low risk for developing a biological condition. In another embodiment of the present invention, the reference value is a control sample derived from one or more subjects who are asymptomatic and/or lack traditional risk factors for the biological condition. In a further embodiment, such subjects are monitored and/or periodically retested for a diagnostically relevant period of time following such test to verify continued absence of the biological condition (disease or event free survival). Such period of time may be one year, two years, two to five years, five years, five to ten years, ten years, or ten or more years from the initial testing date for determination of the reference value. Furthermore, retrospective measurement of the indicators in properly banked historical subject samples may be used in establishing these reference values, thus shortening the study time required.

[000149] A reference value can also comprise indicator values derived from subjects who show an improvement in risk factors as a result of treatments and/or therapies for the biological condition. A reference value can also comprise indicator values derived from subjects who have confirmed disease by known invasive or non-invasive techniques, or are at high risk for developing the biological condition, or who have suffered from the biological condition.

[000150] In another embodiment, the reference value is an index value or a baseline value. An index value or baseline value is a composite sample of indicator values from one or more subjects who do not have the biological condition or subjects who are asymptomatic for the biological condition. A baseline value can also comprise the indicator values in a sample derived from a subject who has shown an improvement in risk factors for the biological condition as a result of cancer treatments or therapies. Optionally, subjects identified as having a biological condition or being at increased risk of developing a biological condition are chosen to receive a therapeutic regimen to slow the progression of the biological condition, or decrease or prevent the risk of developing the biological condition.

[000151] The progression of a biological condition, or effectiveness of a treatment of a biological condition can be monitored by providing indicator values from a subject over time. For example, a first sample can be obtained prior to the subject receiving treatment and one or more subsequent samples are taken after or during treatment of the subject. The biological condition is considered to be progressive (or, alternatively, the treatment does not prevent progression) if the adjusted prime indicator value changes over time relative to the reference value, whereas the biological condition is not progressive if the adjusted indicator values remain constant over time (relative to the reference population, or "constant" as used herein). The term "constant" as used in the context of the present invention is construed to include changes over time with respect to the reference value.
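The monitoring logic of the preceding paragraph can be sketched, purely for illustration, as a comparison of each later sample with the first sample after subtracting the reference value for that time point. The tolerance that separates "constant" from "changed" is not specified in this disclosure; it is a hypothetical parameter here, as are the function name and the data.

```python
def is_progressive(adjusted_values, reference_values, tolerance):
    """Return True if the adjusted prime value drifts relative to the reference.

    adjusted_values:  adjusted prime indicator values, in sampling order
    reference_values: reference value applicable at each sampling time
    tolerance:        hypothetical limit on drift still treated as "constant"
    """
    offsets = [v - r for v, r in zip(adjusted_values, reference_values)]
    baseline = offsets[0]  # first sample, e.g. taken before treatment
    return any(abs(o - baseline) > tolerance for o in offsets[1:])


# Hypothetical series: one stable subject and one whose value keeps rising.
stable = is_progressive([3.0, 3.1, 2.9], [1.0, 1.0, 1.0], tolerance=0.5)
worsening = is_progressive([3.0, 3.8, 4.6], [1.0, 1.0, 1.0], tolerance=0.5)
print(stable, worsening)
```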

[000152] The present invention further provides a method for screening for changes in a prime indicator attributed to the biological states, by providing a test value for a prime indicator and a test value for a proxy indicator. The test value for the prime indicator is adjusted based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value; as a result, the prime indicator combined with the proxy indicator improves the predictive power of the prime indicator. The prime indicator alone makes a statistically significant contribution to the evaluation of the biological state. The value of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and the proxy indicator (i) is correlated with the prime indicator and (ii) has a test value that is similar in both a normal biological state and an altered biological state, so that the test value of the proxy indicator provides a surrogate normative value for the prime indicator.

[000153] The present invention can also be used to screen patient or subject populations in any number of settings. For example, a health maintenance organization, public health entity or school health program can screen a group of subjects to identify those requiring interventions, as described above, or for the collection of epidemiological data. Insurance companies (e.g., health, life or disability) may screen applicants in the process of determining coverage or pricing, or existing clients for possible intervention. Data collected in such population screens, particularly when tied to any clinical progression to conditions like cancer or cancer reoccurrence, will be of value in the operations of, for example, health maintenance organizations, public health programs and insurance companies. Such data arrays or collections can be stored in machine-readable media and used in any number of health-related data management systems to provide improved healthcare services, cost effective healthcare, improved insurance operation, etc. See, for example, U.S. Patent Application No. 2002/0038227; U.S. Patent Application No. US 2004/0122296; U.S. Patent Application No. US 2004/0122297; and U.S. Patent No. 5,018,067. Such systems can access the data directly from internal data storage or remotely from one or more data storage sites as further detailed herein.

[000154] A machine-readable storage medium can comprise a data storage material encoded with machine-readable data or data arrays which, when using a machine programmed with instructions for using said data, is capable of use for a variety of purposes, such as, without limitation, subject information relating to the biological condition over time or in response to drug therapies. Measurements of effective amounts of the indicators of the invention and/or the resulting evaluation of risk from those indicators can be implemented in computer programs executing on programmable computers, comprising, inter alia, a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code can be applied to input data to perform the functions described above and generate output information. The output information can be applied to one or more output devices, according to methods known in the art. The computer may be, for example, a personal computer, microcomputer, or workstation of conventional design.

[000155] Each program can be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. The language can be a compiled or interpreted language. Each such computer program can be stored on a storage media or device (e.g., ROM or magnetic diskette or others as defined elsewhere in this disclosure) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. The health-related data management system of the invention may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform various functions described herein.

[000156] METHOD OF IDENTIFYING PRIME/PROXY PAIRS

[000157] The invention also provides a method of identifying a prime/proxy indicator pair from measurements of a plurality of indicators from a plurality of subjects. The prime/proxy indicator pair predicts a value indicative of the risk or probability of a biological state.

Prime/proxy indicator pairs are identified by:

1- providing a test value for each indicator. The test value represents a change in the value of each indicator in a subject with a first biological state compared to a subject with a second biological state;

2- determining the statistical significance of each test value in step (1);

3- using said test values for each indicator to enumerate two indicator models indicative of a risk or probability of the biological state. The two indicator models distinguish between a subject with the first biological state and a subject with the second biological state;

4- selecting the two indicator models enumerated in step (3) capable of distinguishing between a subject with the first biological state and a subject with the second biological state with at least 75% accuracy;

5- determining a coefficient for each indicator in each two indicator model selected in step (4) and identifying the two indicator models that have one indicator having a positive coefficient and one indicator having a negative coefficient;

6- selecting the two indicator models identified in step (5) in which the unique contribution of each indicator is statistically significant, and identifying the two indicator models that comprise one indicator whose test value alone makes a statistically significant contribution to the prediction of the biological state and one indicator whose test value alone does not make a statistically significant contribution to the prediction of the biological state, as determined in step (2);

7- determining whether the two indicators in the models identified in step (6) are correlated to each other and selecting two indicator models where the two indicators have a correlation coefficient >0.5. The indicator that makes a statistically significant contribution in step (2) is the prime indicator and the indicator that does not make a statistically significant contribution is the proxy indicator.
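The univariate and correlation screens in the steps above (steps (2), (6) in part, and (7)) can be sketched as follows. This is a simplified illustration and not the complete method: the enumeration, accuracy and coefficient-sign checks of steps (3) to (5) require a fitted two indicator model and are omitted. The sketch assumes a Welch t statistic with a normal approximation for the p-value, a significance level of 0.05, and a correlation that must exceed 0.5 within each biological state; these choices, the function names and the data are all hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    t = (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 2 * (1 - NormalDist().cdf(abs(t)))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def candidate_pairs(state1, state2, alpha=0.05, min_r=0.5):
    """Return (prime, proxy) candidates.

    state1, state2: dict mapping indicator name -> values for the same
    subjects (same order) in the first and second biological state.
    """
    p = {name: welch_p(state1[name], state2[name]) for name in state1}
    pairs = []
    for a, b in combinations(state1, 2):
        sig_a, sig_b = p[a] < alpha, p[b] < alpha
        if sig_a == sig_b:
            continue  # need exactly one univariately significant indicator
        prime, proxy = (a, b) if sig_a else (b, a)
        r1 = pearson(state1[prime], state1[proxy])
        r2 = pearson(state2[prime], state2[proxy])
        if min(r1, r2) > min_r:
            pairs.append((prime, proxy))
    return pairs


# Hypothetical measurements for three indicators, four subjects per state.
state1 = {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [1.1, 2.0, 3.1, 4.0], "G3": [2.0, 1.0, 2.0, 1.0]}
state2 = {"G1": [6.0, 7.0, 8.0, 9.0], "G2": [1.0, 2.1, 3.0, 4.1], "G3": [1.0, 2.0, 1.0, 2.0]}
pairs = candidate_pairs(state1, state2)
print(pairs)
```

In this hypothetical data set G1 differs between the two states while G2 does not, and the two track each other within each state, so G1 is returned as the candidate prime indicator with G2 as its proxy.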

[000158] Optionally the method further includes the step of using the test values for each prime and proxy indicator identified to enumerate models having at least two indicators that are indicative of a risk or probability of the biological state, to identify a model comprising at least one prime indicator and at least one proxy indicator wherein said model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy.

[000159] Optionally the method further includes the step of using the test values for each prime and proxy indicator identified to enumerate models having multiple indicators that are indicative of a risk or probability of the biological state, to identify a model comprising multiple prime indicators and at least one proxy indicator that is correlated to each of the prime indicators in a two indicator model, wherein said model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy.

[000160] By virtue of the prime indicator being functionally active, by elucidating its function, subjects with altered prime indicator values, for example, can be managed with agents/drugs that preferentially target such pathways. Thus also included in the invention are the prime indicators identified by the disclosed methods as therapeutic targets to treat a biological condition.

[000161] MEASUREMENT OF THE PRIME AND PROXY INDICATORS

[000162] The actual measurement of the test values provided for the prime and proxy indicators can be determined, for example, at the protein or nucleic acid level, using any method known in the art. For example, at the nucleic acid level, Northern and Southern hybridization analysis, as well as ribonuclease protection assays using probes which specifically recognize one or more of these sequences can be used to determine gene expression. Alternatively, the indicators can be measured using reverse-transcription-based PCR assays (RT-PCR), e.g., using primers specific for the differentially expressed sequence of genes or by branch-chain RNA amplification and detection methods by Panomics, Inc. Test values can also be determined at the protein level, e.g., by measuring the levels of peptides encoded by their respective gene products, or subcellular localization or activities thereof using technological platform such as for example AQUA. Such methods are well known in the art and include, e.g., immunoassays based on antibodies to proteins encoded by the genes, aptamers or molecular imprints. Alternatively, a suitable method can be selected to determine the activity of proteins encoded by marker genes according to the activity of each protein analyzed. More specific examples for measuring various types of indicators are described below for purposes of illustration, and are not intended to be limiting.

[000163] Detection of Nucleic Acids

[000164] Methods of detecting nucleic acids, including but not limited to, determining genotype, evaluating the copy number of a particular gene or chromosomal region, and detecting the presence or absence of sequence additions or deletions, are well known to those of skill in the art and include Hybridization-based Assays and Amplification-based Assays.

[000165] Hybridization-based Assays

[000166] Hybridization-based assays include, but are not limited to, traditional "direct probe" methods such as Southern Blots or In Situ Hybridization (e.g., FISH), and "comparative probe" methods such as Comparative Genomic Hybridization (CGH). The methods can be used in a wide variety of formats including, but not limited to, substrate (e.g., membrane or glass) bound methods or array-based approaches as described below.

[000167] In situ hybridization assays are well known (e.g., Angerer (1987) Meth. Enzymol 152: 649). Generally, in situ hybridization comprises the following major steps: (1) fixation of tissue or biological structure to be analyzed; (2) prehybridization treatment of the biological structure to increase accessibility of target DNA, and to reduce nonspecific binding; (3) hybridization of a mixture of nucleic acids to the nucleic acid in the biological structure or tissue; (4) post-hybridization washes to remove nucleic acid fragments not bound in the hybridization; and (5) detection of the hybridized nucleic acid fragments. The reagents used in each of these steps and the conditions for use vary depending on the particular application.

[000168] In a typical in situ hybridization assay, cells are fixed to a solid support, typically a glass slide. If a nucleic acid is to be probed, the cells are typically denatured with heat or alkali. The cells are then contacted with a hybridization solution at a moderate temperature to permit annealing of labeled probes specific to the nucleic acid sequence encoding the protein. The targets (e.g., cells) are then typically washed at a predetermined stringency or at an increasing stringency until an appropriate signal to noise ratio is obtained.

[000169] The probes are typically labeled, e.g., with radioisotopes or fluorescent reporters. The preferred size range is from about 200 bp to about 1000 bases, more preferably between about 400 to about 800 bp for double stranded, nick translated nucleic acids.

[000170] In some applications it is necessary to block the hybridization capacity of repetitive sequences. Thus, human genomic DNA or Cot-1 DNA is used to block non-specific hybridization.

[000171] In Comparative Genomic Hybridization methods a first collection of (sample) nucleic acids (e.g., from a possible tumor) is labeled with a first label, while a second collection of (control) nucleic acids (e.g., from a healthy cell tissue) is labeled with a second label. The ratio of hybridization of the nucleic acids is determined by the ratio of the two (first and second) labels binding to each fiber in the array. Where there are chromosomal deletions or multiplications, differences in the ratio of the signals from the two labels will be detected and the ratio will provide a measure of the copy number.
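As a hedged illustration of how a two-label signal ratio can be read as a copy number measure, the sketch below converts per-element sample and control intensities to log2 ratios and applies simple cut-offs. The cut-off values of ±0.3, the function names and the intensities are hypothetical and are not taken from this disclosure.

```python
from math import log2


def log2_ratios(sample_signal, control_signal):
    """Per-element log2 ratio of sample label signal to control label signal."""
    return [log2(s / c) for s, c in zip(sample_signal, control_signal)]


def call_copy_change(log_ratio, gain=0.3, loss=-0.3):
    """Classify one element; the gain and loss cut-offs are assumed values."""
    if log_ratio >= gain:
        return "gain"
    if log_ratio <= loss:
        return "loss"
    return "normal"


# Hypothetical intensities for three array elements.
sample = [200.0, 100.0, 50.0]
control = [100.0, 100.0, 100.0]
ratios = log2_ratios(sample, control)
calls = [call_copy_change(r) for r in ratios]
print(ratios, calls)
```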

[000172] Other Hybridization protocols suitable for use with the methods of the invention are described, e.g., in Albertson (1984) EMBO J. 3: 1227-1234; Pinkel (1988) Proc. Natl. Acad. Sci. USA 85: 9138-9142; EPO Pub. No. 430,402; Methods in Molecular Biology, Vol. 33: In Situ Hybridization Protocols, Choo, ed., Humana Press, Totowa, N.J. (1994), etc.

[000173] The methods of this invention are particularly well suited to array-based hybridization formats. Arrays are a multiplicity of different "probe" or "target" nucleic acids (or other compounds) attached to one or more surfaces (e.g., solid, membrane, or gel). The multiplicity of nucleic acids (or other moieties) is attached to a single contiguous surface or to a multiplicity of surfaces juxtaposed to each other.

[000174] In an array format a large number of different hybridization reactions can be run essentially "in parallel." This provides rapid, essentially simultaneous, evaluation of a number of hybridizations in a single "experiment." Methods of performing hybridization reactions in array based formats are well known to those of skill in the art (see, e.g., Pastinen (1997) Genome Res. 7: 606-614; Jackson (1996) Nature Biotechnology 14: 1685; Chee (1995) Science 274: 610; WO 96/17958).

[000175] Arrays, particularly nucleic acid arrays, can be produced according to a wide variety of methods well known to those of skill in the art. For example, in a simple embodiment, "low density" arrays can simply be produced by spotting (e.g. by hand using a pipette) different nucleic acids at different locations on a solid support (e.g. a glass surface, a membrane, etc.).

[000176] This simple spotting approach has been automated to produce high density spotted arrays (see, e.g., U.S. Pat. No. 5,807,522). This patent describes the use of automated systems that tap a microcapillary against a surface to deposit a small volume of a biological sample. The process is repeated to generate high density arrays. Arrays can also be produced using oligonucleotide synthesis technology. Thus, for example, U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092 teach the use of light-directed combinatorial synthesis of high density oligonucleotide arrays.

[000177] A spotted array can include genomic DNA, e.g. overlapping clones that provide a high resolution scan of the amplicon corresponding to the region of interest. Amplicon nucleic acid can be obtained from, e.g., MACs, YACs, BACs, PACs, P1s, cosmids, plasmids, inter-Alu PCR products of genomic clones, restriction digests of genomic clones, cDNA clones, amplification (e.g., PCR) products, and the like.

[000178] The array nucleic acids are derived from previously mapped libraries of clones spanning or including target sequences, as well as clones from other areas of the genome. The arrays can be hybridized with a single population of sample nucleic acid or can be used with two differentially labeled collections (as with a test sample and a reference sample).

[000179] Many methods for immobilizing nucleic acids on a variety of solid surfaces are known in the art. A wide variety of organic and inorganic polymers, as well as other materials, both natural and synthetic, can be employed as the material for the solid surface. Illustrative solid surfaces include, e.g., nitrocellulose, nylon, glass, quartz, diazotized membranes (paper or nylon), silicones, polyformaldehyde, cellulose, and cellulose acetate. In addition, plastics such as polyethylene, polypropylene, polystyrene, and the like can be used. Other materials which may be employed include paper, ceramics, metals, metalloids, semiconductive materials, cermets or the like. In addition, substances that form gels can be used. Such materials include, e.g., proteins (e.g., gelatins), lipopolysaccharides, silicates, agarose and polyacrylamides. Where the solid surface is porous, various pore sizes may be employed depending upon the nature of the system.

[000180] In preparing the surface, a plurality of different materials may be employed, particularly as laminates, to obtain various properties. For example, proteins (e.g., bovine serum albumin) or mixtures of macromolecules (e.g. Denhardt's solution) can be employed to avoid non-specific binding, simplify covalent conjugation, enhance signal detection or the like. If covalent bonding between a compound and the surface is desired, the surface will usually be polyfunctional or be capable of being polyfunctionalized. Functional groups which may be present on the surface and used for linking can include carboxylic acids, aldehydes, amino groups, cyano groups, ethylenic groups, hydroxyl groups, mercapto groups and the like. The manner of linking a wide variety of compounds to various surfaces is well known and is amply illustrated in the literature.

[000181] For example, methods for immobilizing nucleic acids by introduction of various functional groups to the molecules are known (see, e.g., Bischoff (1987) Anal. Biochem., 164: 336-344; Kremsky (1987) Nucl. Acids Res. 15: 2891-2910). Modified nucleotides can be placed on the target using PCR primers containing the modified nucleotide, or by enzymatic end labeling with modified nucleotides. Use of glass or membrane supports (e.g., nitrocellulose, nylon, polypropylene) for the nucleic acid arrays of the invention is advantageous because of well developed technology employing manual and robotic methods of arraying targets at relatively high element densities. Such membranes are generally available, and protocols and equipment for hybridization to membranes are well known.

[000182] Target elements of various sizes, ranging from 1 mm diameter down to 1 micron can be used. Smaller target elements containing low amounts of concentrated, fixed probe DNA are used for high complexity comparative hybridizations since the total amount of sample available for binding to each target element will be limited. Thus it is advantageous to have small array target elements that contain a small amount of concentrated probe DNA so that the signal that is obtained is highly localized and bright. Such small array target elements are typically used in arrays with densities greater than 10 /cm . Relatively simple approaches capable of quantitative fluorescent imaging of 1 cm² areas have been described that permit acquisition of data from a large number of target elements in a single image (see, e.g., Wittrup (1994) Cytometry 16:206-213).

[000183] Arrays on solid surface substrates with much lower fluorescence than membranes, such as glass, quartz, or small beads, can achieve much better sensitivity. Substrates such as glass or fused silica are advantageous in that they provide a very low fluorescence substrate, and a highly efficient hybridization environment. Covalent attachment of the target nucleic acids to glass or synthetic fused silica can be accomplished according to a number of known techniques (described above). Nucleic acids can be conveniently coupled to glass using commercially available reagents. For instance, materials for preparation of silanized glass with a number of functional groups are commercially available or can be prepared using standard techniques (see, e.g., Gait (1984) Oligonucleotide Synthesis: A Practical Approach, IRL Press, Wash., D.C.). Quartz cover slips, which have at least 10-fold lower auto fluorescence than glass, can also be silanized.

[000184] Alternatively, probes can also be immobilized on commercially available coated beads or other surfaces. For instance, biotin end-labeled nucleic acids can be bound to commercially available avidin-coated beads. Streptavidin or anti-digoxigenin antibody can also be attached to silanized glass slides by protein-mediated coupling using, e.g., protein A following standard protocols (see, e.g., Smith (1992) Science 258: 1122-1126). Biotin or digoxigenin end-labeled nucleic acids can be prepared according to standard techniques. Hybridization to nucleic acids attached to beads is accomplished by suspending them in the hybridization mix, and then depositing them on the glass substrate for analysis after washing. Alternatively, paramagnetic particles, such as ferric oxide particles, with or without avidin coating, can be used.

[000185] For example, a probe nucleic acid is spotted onto a surface (e.g., a glass or quartz surface). The nucleic acid is dissolved in a mixture of dimethylsulfoxide (DMSO) and nitrocellulose and spotted onto amino-silane coated glass slides. Small capillary tubes can be used to "spot" the probe mixture.

[000186] A variety of other nucleic acid hybridization formats are known to those skilled in the art. For example, common formats include sandwich assays and competition or displacement assays. Hybridization techniques are generally described in Hames and Higgins (1985) Nucleic Acid Hybridization, A Practical Approach, IRL Press; Gall and Pardue (1969) Proc. Natl. Acad. Sci. USA 63: 378-383; and John et al. (1969) Nature 223: 582-587.

[000187] Sandwich assays are commercially useful hybridization assays for detecting or isolating nucleic acid sequences. Such assays utilize a "capture" nucleic acid covalently immobilized to a solid support and a labeled "signal" nucleic acid in solution. The sample will provide the target nucleic acid. The "capture" nucleic acid and "signal" nucleic acid probe hybridize with the target nucleic acid to form a "sandwich" hybridization complex. To be most effective, the signal nucleic acid should not hybridize with the capture nucleic acid.

[000188] Detection of a hybridization complex may require the binding of a signal generating complex to a duplex of target and probe polynucleotides or nucleic acids. Typically, such binding occurs through ligand and anti-ligand interactions, such as between a ligand-conjugated probe and an anti-ligand conjugated with a signal.

[000189] The sensitivity of the hybridization assays may be enhanced through use of a nucleic acid amplification system that multiplies the target nucleic acid being detected. Examples of such systems include the polymerase chain reaction (PCR) system and the ligase chain reaction (LCR) system. Other methods recently described in the art are the nucleic acid sequence based amplification (NASBA, Cangene, Mississauga, Ontario) and Q Beta Replicase systems.

[000190] Nucleic acid hybridization simply involves providing a denatured probe and target nucleic acid under conditions where the probe and its complementary target can form stable hybrid duplexes through complementary base pairing. The nucleic acids that do not form hybrid duplexes are then washed away, leaving the hybridized nucleic acids to be detected, typically through detection of an attached detectable label. It is generally recognized that nucleic acids are denatured by increasing the temperature or decreasing the salt concentration of the buffer containing the nucleic acids, by the addition of chemical agents, or by raising the pH. Under low stringency conditions (e.g., low temperature and/or high salt and/or high target concentration) hybrid duplexes (e.g., DNA:DNA, RNA:RNA, or RNA:DNA) will form even where the annealed sequences are not perfectly complementary. Thus specificity of hybridization is reduced at lower stringency. Conversely, at higher stringency (e.g., higher temperature or lower salt) successful hybridization requires fewer mismatches.

[000191] One of skill in the art will appreciate that hybridization conditions may be selected to provide any degree of stringency. In a preferred embodiment, hybridization is performed at low stringency to ensure hybridization and then subsequent washes are performed at higher stringency to eliminate mismatched hybrid duplexes. Successive washes may be performed at increasingly higher stringency (e.g., down to as low as 0.25X SSPE-T at 37°C to 70°C) until a desired level of hybridization specificity is obtained. Stringency can also be increased by addition of agents such as formamide. Hybridization specificity may be evaluated by comparison of hybridization to the test probes with hybridization to the various controls that can be present.

[000192] In general, there is a tradeoff between hybridization specificity (stringency) and signal intensity. Thus, the wash is performed at the highest stringency that produces consistent results and that provides a signal intensity greater than approximately 10% of the background intensity. The hybridized array may be washed at successively higher stringency solutions and read between each wash. Analysis of the data sets thus produced will reveal a wash stringency above which the hybridization pattern is not appreciably altered and which provides adequate signal for the particular probes of interest.
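The selection rule in the preceding paragraph can be sketched as follows, assuming that the array has been read after each of a series of washes of increasing stringency. Only the signal criterion (signal greater than approximately 10% of background) is modelled; the requirement of consistent results is not. The function name, wash labels and intensities are hypothetical.

```python
def pick_wash(washes, min_fraction=0.10):
    """Return the label of the highest-stringency wash with adequate signal.

    washes: list of (label, signal, background) tuples ordered from the
    lowest to the highest stringency.
    """
    chosen = None
    for label, signal, background in washes:
        if signal > min_fraction * background:
            chosen = label  # later entries are more stringent, so keep the last
    return chosen


# Hypothetical reads taken between successive washes.
reads = [
    ("2X SSPE-T", 500.0, 100.0),
    ("1X SSPE-T", 120.0, 100.0),
    ("0.25X SSPE-T", 8.0, 100.0),  # signal has fallen below 10% of background
]
best = pick_wash(reads)
print(best)
```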

[000193] Background signal is reduced by the use of a detergent (e.g., C-TAB) or a blocking reagent (e.g., sperm DNA, cot-1 DNA, etc.) during the hybridization to reduce nonspecific binding. In a particularly preferred embodiment, the hybridization is performed in the presence of about 0.1 to about 0.5 mg/mL DNA (e.g., cot-1 DNA). The use of blocking agents in hybridization is well known to those of skill in the art (see, e.g., Chapter 8 in P. Tijssen, infra).

[000194] Methods of optimizing hybridization conditions are well known to those of skill in the art (see, e.g., Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology, Vol. 24: Hybridization With Nucleic Acid Probes, Elsevier, N.Y.).

[000195] Optimal conditions are also a function of the sensitivity of label (e.g., fluorescence) detection for different combinations of substrate type, fluorochrome, excitation and emission bands, spot size and the like. Low fluorescence background membranes can be used (see, e.g., Chu (1992) Electrophoresis 13: 105-114). The sensitivity for detection of spots ("target elements") of various diameters on the candidate membranes can be readily determined by, e.g., spotting a dilution series of fluorescently end labeled DNA fragments. These spots are then imaged using conventional fluorescence microscopy. The sensitivity, linearity, and dynamic range achievable from the various combinations of fluorochrome and solid surfaces (e.g., membranes, glass, fused silica) can thus be determined. Serial dilutions of pairs of fluorochrome in known relative proportions can also be analyzed. This determines the accuracy with which fluorescence ratio measurements reflect actual fluorochrome ratios over the dynamic range permitted by the detectors and fluorescence of the substrate upon which the probe has been fixed.
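One way to quantify the linearity of a dilution series, offered only as a sketch, is an ordinary least-squares line through spotted amount versus measured intensity together with its coefficient of determination. The function name and the readings are hypothetical; no particular acceptance threshold is implied by this disclosure.

```python
def fit_line(amounts, intensities):
    """Least-squares line and R-squared for intensity as a function of amount."""
    n = len(amounts)
    mx = sum(amounts) / n
    my = sum(intensities) / n
    sxx = sum((x - mx) ** 2 for x in amounts)
    syy = sum((y - my) ** 2 for y in intensities)
    sxy = sum((x - mx) * (y - my) for x, y in zip(amounts, intensities))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical two-fold dilution series (relative amount, fluorescence units).
amounts = [1.0, 2.0, 4.0, 8.0]
intensities = [105.0, 198.0, 410.0, 795.0]
slope, intercept, r_squared = fit_line(amounts, intensities)
print(slope, intercept, r_squared)
```

An R-squared close to 1 across the series suggests that the chosen fluorochrome and surface respond linearly over that range; the lowest amount still distinguishable from background indicates the sensitivity.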

[000196] The hybridized nucleic acids are detected by detecting one or more labels attached to the sample or probe nucleic acids. The labels may be incorporated by any of a number of means well known to those of skill in the art. Means of attaching labels to nucleic acids include, for example, nick translation or end-labeling (e.g., with a labeled RNA) by kinasing of the nucleic acid and subsequent attachment (ligation) of a nucleic acid linker joining the sample nucleic acid to a label (e.g., a fluorophore). A wide variety of linkers for the attachment of labels to nucleic acids are also known. In addition, intercalating dyes and fluorescent nucleotides can also be used.

[000197] Detectable labels suitable for use in the present invention include any composition detectable by spectroscopic, photochemical, biochemical, immunochemical, electrical, radiological, optical or chemical means. Useful labels in the present invention include biotin for staining with labeled streptavidin conjugate, magnetic beads (e.g., DYNABEADS™), fluorescent dyes (e.g., fluorescein, texas red, rhodamine, green fluorescent protein, and the like, see, e.g., Molecular Probes, Eugene, OR., USA), radiolabels (e.g., ³H, ¹²⁵I, ³⁵S, ¹⁴C, or ³²P), enzymes (e.g., horse radish peroxidase, alkaline phosphatase and others commonly used in an ELISA), and colorimetric labels such as colloidal gold (e.g., gold particles in the 40-80 nm diameter size range scatter green light with high efficiency) or colored glass or plastic (e.g., polystyrene, polypropylene, latex, etc.) beads. Patents teaching the use of such labels include U.S. Pat. Nos. 3,817,837; 3,850,752; 3,939,350; 3,996,345; 4,277,437; 4,275,149; and 4,366,241.

[000198] A fluorescent label is preferred because it provides a very strong signal with low background. It is also optically detectable at high resolution and sensitivity through a quick scanning procedure. The nucleic acid samples can all be labeled with a single label, e.g., a single fluorescent label. Alternatively, in another embodiment, different nucleic acid samples can be simultaneously hybridized where each nucleic acid sample has a different label. For instance, one target could have a green fluorescent label and a second target could have a red fluorescent label. The scanning step will distinguish sites of binding of the red label from those binding the green fluorescent label. Each nucleic acid sample (target nucleic acid) can be analyzed independently from one another.

[000199] Suitable chromogens which can be employed include those molecules and compounds which absorb light in a distinctive range of wavelengths so that a color can be observed or, alternatively, which emit light when irradiated with radiation of a particular wave length or wave length range, e.g., fluorescers.

[000200] Desirably, fluorescers should absorb light above about 300 nm, preferably about 350 nm, and more preferably above about 400 nm, usually emitting at wavelengths greater than about 10 nm higher than the wavelength of the light absorbed. It should be noted that the absorption and emission characteristics of the bound dye can differ from the unbound dye. Therefore, when referring to the various wavelength ranges and characteristics of the dyes, it is intended to indicate the dyes as employed and not the dye which is unconjugated and characterized in an arbitrary solvent.

[000201] Fluorescers are generally preferred because by irradiating a fluorescer with light, one can obtain a plurality of emissions. Thus, a single label can provide for a plurality of measurable events.

[000202] Detectable signal can also be provided by chemiluminescent and bioluminescent sources. Chemiluminescent sources include a compound which becomes electronically excited by a chemical reaction and can then emit light which serves as the detectable signal or donates energy to a fluorescent acceptor. Alternatively, luciferins can be used in conjunction with luciferase or lucigenins to provide bioluminescence. Spin labels are provided by reporter molecules with an unpaired electron spin which can be detected by electron spin resonance (ESR) spectroscopy. Exemplary spin labels include organic free radicals, transition metal complexes, particularly vanadium, copper, iron, and manganese, and the like. Exemplary spin labels also include nitroxide free radicals.

[000203] The label may be added to the target (sample) nucleic acid(s) prior to, or after the hybridization. So called "direct labels" are detectable labels that are directly attached to or incorporated into the target (sample) nucleic acid prior to hybridization. In contrast, so called "indirect labels" are joined to the hybrid duplex after hybridization. Often, the indirect label is attached to a binding moiety that has been attached to the target nucleic acid prior to the hybridization. Thus, for example, the target nucleic acid may be biotinylated before the hybridization. After hybridization, an avidin-conjugated fluorophore will bind the biotin bearing hybrid duplexes providing a label that is easily detected. For a detailed review of methods of labeling nucleic acids and detecting labeled hybridized nucleic acids see Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology, Vol. 24: Hybridization With Nucleic Acid Probes, Elsevier, N.Y.

[000204] Fluorescent labels are easily added during an in vitro transcription reaction. Thus, for example, fluorescein labeled UTP and CTP can be incorporated into the RNA produced in an in vitro transcription.

[000205] The labels can be attached directly or through a linker moiety. In general, the site of label or linker-label attachment is not limited to any specific position. For example, a label may be attached to a nucleoside, nucleotide, or analogue thereof at any position that does not interfere with detection or hybridization as desired. For example, certain Label-ON Reagents from Clontech (Palo Alto, Calif.) provide for labeling interspersed throughout the phosphate backbone of an oligonucleotide and for terminal labeling at the 3' and 5' ends. For example, labels can be attached at positions on the ribose ring or the ribose can be modified and even eliminated as desired. The base moieties of useful labeling reagents can include those that are naturally occurring or modified in a manner that does not interfere with the purpose to which they are put. Modified bases include but are not limited to 7-deaza A and G, 7-deaza-8-aza A and G, and other heterocyclic moieties.

[000206] It will be recognized that fluorescent labels are not to be limited to single species organic molecules, but include inorganic molecules, multi-molecular mixtures of organic and/or inorganic molecules, crystals, heteropolymers, and the like. Thus, for example, CdSe-CdS core-shell nanocrystals enclosed in a silica shell can be easily derivatized for coupling to a biological molecule (Bruchez et al. (1998) Science, 281: 2013-2016). Similarly, highly fluorescent quantum dots (zinc sulfide-capped cadmium selenide) have been covalently coupled to biomolecules for use in ultrasensitive biological detection (Warren and Nie (1998) Science, 281: 2016-2018).

[000207] Amplification-based Assays

[000208] In another embodiment, amplification-based assays can be used. In such amplification-based assays, the nucleic acid sequences act as a template in an amplification reaction (e.g., Polymerase Chain Reaction (PCR)). In a quantitative amplification, the amount of amplification product will be proportional to the amount of template in the original sample. Methods of "quantitative" amplification are well known to those of skill in the art. For example, quantitative PCR involves simultaneously co-amplifying a known quantity of a control sequence using the same primers. This provides an internal standard that may be used to calibrate the PCR reaction. Detailed protocols for quantitative PCR are provided in Innis et al. (1990) PCR Protocols, A Guide to Methods and Applications, Academic Press, Inc., N.Y.

[000209] Other suitable amplification methods include, but are not limited to, ligase chain reaction (LCR) (see Wu and Wallace (1989) Genomics 4: 560, Landegren et al. (1988) Science 241: 1077, and Barringer et al. (1990) Gene 89: 117); transcription amplification (Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86: 1173); and self-sustained sequence replication (Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87: 1874).

[000210] To determine genotype, assays designed to identify the allele or alleles at a particular polymorphic locus or loci in a DNA sample, a gene, and/or chromosome can be used. Such assays may employ single base extension reactions, DNA amplification reactions that amplify across one or more polymorphic loci, or may be as simple as sequencing across one or more polymorphic loci.

[000211] Detection of Gene Expression

[000212] Methods of detecting and/or quantifying gene transcripts using nucleic acid hybridization techniques are known to those of skill in the art. For example, a Northern transfer may be used for the detection of the desired mRNA directly. In brief, the mRNA is isolated from a given cell sample using, for example, an acid guanidinium-phenol-chloroform extraction method. The mRNA is then electrophoresed to separate the mRNA species and the mRNA is transferred from the gel to a nitrocellulose membrane. As with the Southern blots, labeled probes are used to identify and/or quantify the target mRNA.

[000213] Alternatively, the gene transcript can be measured using amplification (e.g., PCR) based methods as described above. For example, expression can be measured using reverse transcription-based PCR assays (RT-PCR), e.g., using primers specific for the differentially expressed sequences. RNA can also be quantified using, for example, other target amplification methods (e.g., TMA, SDA, NASBA), or signal amplification methods (e.g., bDNA), and the like.

[000214] Detection of Expressed Protein

[000215] A protein or polypeptide can be detected and quantified by any of a number of means well known to those of skill in the art. These may include analytic biochemical methods such as electrophoresis, capillary electrophoresis, high performance liquid chromatography (HPLC), thin layer chromatography (TLC), hyperdiffusion chromatography, and the like, or various immunological methods such as fluid or gel precipitin reactions, immunodiffusion (single or double), immunoelectrophoresis, radioimmunoassay (RIA), enzyme-linked immunosorbent assays (ELISAs), immunofluorescent assays, western blotting, and the like.

[000216] The protein or polypeptide can be detected using immunological methods in any suitable manner, but is typically detected by contacting a sample from the subject with an antibody which binds the protein or polypeptide and then detecting the presence or absence of a reaction product. The antibody may be monoclonal, polyclonal, chimeric, or a fragment of the foregoing, and the step of detecting the reaction product may be carried out with any suitable immunoassay. The sample from the subject is typically a biological fluid as described above, and may be the same sample of biological fluid used to conduct the method described above.

[000217] Immunoassays carried out in accordance with the present invention may be homogeneous assays or heterogeneous assays. In a homogeneous assay the immunological reaction usually involves the specific antibody, a labeled analyte, and the sample of interest. The signal arising from the label is modified, directly or indirectly, upon the binding of the antibody to the labeled analyte. Both the immunological reaction and detection of the extent thereof can be carried out in a homogeneous solution. Immunochemical labels which may be employed include free radicals, radioisotopes, fluorescent dyes, enzymes, bacteriophages, or coenzymes, or other label as described above.

[000218] In a heterogeneous assay approach, the reagents are usually the sample, the antibody, and means for producing a detectable signal. Samples as described above may be used. The antibody can be immobilized on a support, such as a bead (such as protein A and protein G agarose beads), plate or slide, and contacted with the specimen suspected of containing the antigen in a liquid phase. The support is then separated from the liquid phase and either the support phase or the liquid phase is examined for a detectable signal employing means for producing such signal. The signal is related to the presence of the analyte in the sample. Means for producing a detectable signal include the use of radioactive labels, fluorescent labels, or enzyme labels. For example, if the antigen to be detected contains a second binding site, an antibody which binds to that site can be conjugated to a detectable group and added to the liquid phase reaction solution before the separation step. The presence of the detectable group on the solid support indicates the presence of the antigen in the test sample. Examples of suitable immunoassays are oligonucleotides, immunoblotting, immunofluorescence methods, immunoprecipitation, quantum dots, multiplex fluorochromes, chemiluminescence methods, electrochemiluminescence (ECL) or enzyme-linked immunoassays.

[000219] Those skilled in the art will be familiar with numerous specific immunoassay formats and variations thereof which may be useful for carrying out the method disclosed herein. See generally E. Maggio, Enzyme-Immunoassay, (1980) (CRC Press, Inc., Boca Raton, Fla.); see also U.S. Pat. No. 4,727,022 to Skold et al., titled "Methods for Modulating Ligand-Receptor Interactions and their Application," U.S. Pat. No. 4,659,678 to Forrest et al., titled "Immunoassay of Antigens," U.S. Pat. No. 4,376,110 to David et al., titled "Immunometric Assays Using Monoclonal Antibodies," U.S. Pat. No. 4,275,149 to Litman et al., titled "Macromolecular Environment Control in Specific Receptor Assays," U.S. Pat. No. 4,233,402 to Maggio et al., titled "Reagents and Method Employing Channeling," and U.S. Pat. No. 4,230,767 to Boguslaski et al., titled "Heterogenous Specific Binding Assay Employing a Coenzyme as Label."

[000220] Antibodies can be conjugated to a solid support suitable for a diagnostic assay (e.g., beads such as protein A or protein G agarose, microspheres, plates, slides or wells formed from materials such as latex or polystyrene) in accordance with known techniques, such as passive binding. Antibodies as described herein may likewise be conjugated to detectable labels or groups such as radiolabels (e.g., ³⁵S, ¹²⁵I, ¹³¹I), enzyme labels (e.g., horseradish peroxidase, alkaline phosphatase), and fluorescent labels (e.g., fluorescein, Alexa, green fluorescent protein, rhodamine) in accordance with known techniques. High-sensitivity antibody detection strategies may be used that allow for evaluation of the antigen-antibody binding in a non-amplified configuration. In addition, antibodies may be conjugated to oligonucleotides, followed by Polymerase Chain Reaction and a variety of oligonucleotide detection methods.

[000221] Antibodies can also be useful for detecting post-translational modifications of a protein or polypeptide, such as tyrosine phosphorylation, threonine phosphorylation, serine phosphorylation, and glycosylation (e.g., O-GlcNAc). Such antibodies specifically detect the modified, e.g., phosphorylated, amino acids in a protein or proteins of interest, and can be used in immunoblotting, immunofluorescence, and ELISA assays described herein. These antibodies are well-known to those skilled in the art, and commercially available. Post-translational modifications can also be determined using metastable ions in reflector matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF) (Wirth, U. et al. (2002) Proteomics 2(10): 1445-51). In addition to post-translation modifications, these processes may be coupled to localization of the protein, such that a re-localization process is monitored, and the indicator is evaluated in a relative fashion exhibited by the constancy or change to the ratio of the protein in different compartments.

[000222] Suitable sources for antibodies for the detection of proteins or polypeptides include commercially available sources such as, for example, Abazyme, Abnova, Affinity Biologicals, AntibodyShop, Biogenesis, Biosense Laboratories, Calbiochem, Cell Sciences, Chemicon International, Chemokine, Clontech, Cytolab, DAKO, Diagnostic BioSystems, eBioscience, Endocrine Technologies, Enzo Biochem, Eurogentec, Fusion Antibodies, Genesis Biotech, GloboZymes, Haematologic Technologies, Immunodetect, Immunodiagnostik, Immunometrics, Immunostar, Immunovision, Biogenex, Invitrogen, Jackson ImmunoResearch Laboratory, KMI Diagnostics, Koma Biotech, LabFrontier Life Science Institute, Lee Laboratories, Lifescreen, Maine Biotechnology Services, Mediclone, Micro Pharm Ltd., ModiQuest, Molecular Innovations, Molecular Probes, Neoclone, Neuromics, New England Biolabs, Novocastra, Novus Biologicals, Oncogene Research Products, Orbigen, Oxford Biotechnology, Panvera, PerkinElmer Life Sciences, Pharmingen, Phoenix Pharmaceuticals, Pierce Chemical Company, Polymun Scientific, Polysciences, Inc., Promega Corporation, Proteogenix, Protos Immunoresearch, QED Biosciences, Inc., R&D Systems, Repligen, Research Diagnostics, Roboscreen, Santa Cruz Biotechnology, Seikagaku America, Serological Corporation, Ab Serotec, Sigma-Aldrich, StemCell Technologies, Synaptic Systems GmbH, Technopharm, Terra Nova Biotechnology, TiterMax, Trillium Diagnostics, Upstate Biotechnology, US Biological, Vector Laboratories, Wako Pure Chemical Industries, and Zeptometrix.

[000223] For a protein or polypeptide known to have enzymatic activity, the activities can be determined in vitro using enzyme assays known in the art. Such assays include, without limitation, kinase assays, phosphatase assays, and reductase assays, among many others. Modulation of the kinetics of enzyme activities can be determined by measuring the Michaelis constant K_M using known approaches, such as the Hill plot, the Michaelis-Menten equation, linear regression plots such as Lineweaver-Burk analysis, and the Scatchard plot.
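
As a minimal sketch of the Lineweaver-Burk analysis mentioned above, the following Python fits 1/v against 1/[S] by ordinary least squares and reads K_M and V_max from the slope and intercept. The function name `lineweaver_burk` and the synthetic, noise-free data are illustrative assumptions and are not part of the disclosed method.

```python
def lineweaver_burk(substrate, rate):
    """Estimate (Km, Vmax) from paired substrate concentrations and initial rates.

    Uses the double-reciprocal form 1/v = (Km/Vmax) * (1/[S]) + 1/Vmax.
    """
    if len(substrate) != len(rate) or len(substrate) < 2:
        raise ValueError("need at least two paired measurements")
    x = [1.0 / s for s in substrate]   # 1/[S]
    y = [1.0 / v for v in rate]        # 1/v
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                  # Km / Vmax
    intercept = mean_y - slope * mean_x  # 1 / Vmax
    vmax = 1.0 / intercept
    km = slope * vmax
    return km, vmax


# Hypothetical data generated from Km = 2.0 and Vmax = 10.0 (arbitrary units).
substrate = [0.5, 1.0, 2.0, 4.0, 8.0]
rate = [10.0 * s / (2.0 + s) for s in substrate]
km, vmax = lineweaver_burk(substrate, rate)
```

With real measurements the double-reciprocal transform amplifies noise at low substrate concentrations, so the estimates should be treated as approximate.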

[000224] Exemplary Measurement of Gene Expression

[000225] For measuring the amount of a particular RNA in a sample, methods known to one of ordinary skill in the art were used to extract and quantify transcribed RNA from a sample. (See detailed protocols below. Also see PCT application publication number WO 98/24935 herein incorporated by reference for RNA analysis protocols). Briefly, RNA is extracted from a sample such as any tissue, body fluid, cell (e.g., circulating tumor cell) or culture medium in which a population of cells of a subject might be growing. For example, cells may be lysed and RNA eluted in a suitable solution in which to conduct a DNAse reaction. Subsequent to RNA extraction, first strand synthesis may be performed using a reverse transcriptase. Gene amplification, more specifically quantitative PCR assays, can then be conducted and the gene of interest calibrated against an internal marker such as 18S rRNA (Hirayama et al., Blood 92, 1998: 46-52). Any other endogenous marker can be used, such as 28S-25S rRNA and 5S rRNA. Samples are measured in multiple replicates, for example, 3 replicates. In an embodiment of the invention, quantitative PCR is performed using amplification, reporting agents and instruments such as those supplied commercially by Applied Biosystems (Foster City, CA). Given a defined efficiency of amplification of target transcripts, the point (e.g., cycle number) that signal from amplified target template is detectable may be directly related to the amount of specific message transcript in the measured sample. Similarly, other quantifiable signals such as fluorescence, enzyme activity, disintegrations per minute, absorbance, etc., when correlated to a known concentration of target templates (e.g., a reference standard curve) or normalized to a standard with limited variability can be used to quantify the number of target templates in an unknown sample.
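
The calibration of a gene of interest against an internal marker can be sketched as follows. This is an illustration only: it assumes replicate cycle-threshold (Ct) values are available for both the target and the 18S marker, and that both amplify with the same efficiency, so that the relative amount is (1 + efficiency) raised to the negative Ct difference. The function names and the Ct values are hypothetical.

```python
from statistics import mean


def delta_ct(target_cts, reference_cts):
    """Mean Ct of the target gene minus mean Ct of the internal marker (e.g., 18S rRNA)."""
    return mean(target_cts) - mean(reference_cts)


def relative_quantity(dct, efficiency=1.0):
    """Target amount relative to the internal marker.

    Assumes both templates amplify with the same efficiency; an efficiency
    of 1.0 means the product doubles every cycle.
    """
    return (1.0 + efficiency) ** (-dct)


# Hypothetical triplicate Ct values for one sample.
target = [24.1, 24.3, 24.2]
ref_18s = [14.0, 14.2, 14.1]
dct = delta_ct(target, ref_18s)
rel = relative_quantity(dct)
```

A larger Ct difference corresponds to a lower amount of target transcript relative to the internal marker.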

[000226] Although not limited to amplification methods, quantitative gene expression techniques may utilize amplification of the target transcript. Alternatively or in combination with amplification of the target transcript, quantitation of the reporter signal for an internal marker generated by the exponential increase of amplified product may also be used. Amplification of the target template may be accomplished by isothermic gene amplification strategies or by gene amplification by thermal cycling such as PCR.

[000227] It is desirable to obtain a definable and reproducible correlation between the amplified target or reporter signal, i.e., internal marker, and the concentration of starting templates. It has been discovered that this objective can be achieved by careful attention to, for example, consistent primer-template ratios and a strict adherence to a narrow permissible level of experimental amplification efficiencies (for example 80.0 to 100% +/- 5% relative efficiency, typically 90.0 to 100% +/- 5% relative efficiency, more typically 95.0 to 100% +/- 2%, and most typically 98 to 100% +/- 1% relative efficiency). In determining gene expression levels with regard to a single Gene Expression Profile, it is necessary that all constituents of the panels, including endogenous controls, maintain similar amplification efficiencies, as defined herein, to permit accurate and precise relative measurements for each constituent. Amplification efficiencies are regarded as being "substantially similar", for the purposes of this description and the following claims, if they differ by no more than approximately 10%, preferably by less than approximately 5%, more preferably by less than approximately 3%, and most preferably by less than approximately 1%. Measurement conditions are regarded as being "substantially repeatable", for the purposes of this description and the following claims, if they differ by no more than approximately +/- 10% coefficient of variation (CV), preferably by less than approximately +/- 5% CV, more preferably +/- 2% CV. These constraints should be observed over the entire range of concentration levels to be measured associated with the relevant biological condition. While it is thus necessary for various embodiments herein to satisfy criteria that measurements are achieved under measurement conditions that are substantially repeatable and wherein specificity and efficiencies of amplification for all constituents are substantially similar, nevertheless, it is within the scope of the present invention as claimed herein to achieve such measurement conditions by adjusting assay results that do not satisfy these criteria directly, in such a manner as to compensate for errors, so that the criteria are satisfied after suitable adjustment of assay results.
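
The two acceptance criteria above can be expressed as simple checks. The sketch below is an assumption-laden illustration: it reads "differ by no more than approximately 10%" as a spread of at most 0.10 between the highest and lowest efficiency (expressed as fractions), and it computes CV as the sample standard deviation divided by the mean. The function names and default thresholds are hypothetical choices, not definitions taken from this description.

```python
from statistics import mean, stdev


def efficiencies_substantially_similar(efficiencies, max_spread=0.10):
    """True if amplification efficiencies (fractions, e.g. 0.98) differ by no
    more than max_spread across all constituents, endogenous controls included."""
    return max(efficiencies) - min(efficiencies) <= max_spread


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a fraction."""
    return stdev(values) / mean(values)


def substantially_repeatable(replicates, max_cv=0.10):
    """True if replicate measurements stay within the permitted CV."""
    return coefficient_of_variation(replicates) <= max_cv


# Hypothetical panel efficiencies and replicate Ct values.
panel_ok = efficiencies_substantially_similar([0.98, 0.99, 1.00])
panel_bad = efficiencies_substantially_similar([0.85, 1.00])
repeatable = substantially_repeatable([15.1, 15.3, 15.2], max_cv=0.02)
```

Tightening `max_spread` or `max_cv` corresponds to the "preferably" and "more preferably" tiers recited above.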

[000228] In practice, tests are run to assure that these conditions are satisfied. For example, the design of all primer-probe sets is done in house, and experimentation is performed to determine which set gives the best performance. Even though primer-probe design can be enhanced using computer techniques known in the art, and notwithstanding common practice, it has been found that experimental validation is still useful. Moreover, in the course of experimental validation, the selected primer-probe combination is associated with a set of features:

[000229] The reverse primer should be complementary to the coding DNA strand. In one embodiment, the primer should be located across an intron-exon junction, with not more than four bases of the three-prime end of the reverse primer complementary to the proximal exon. (If more than four bases are complementary, then it would tend to competitively amplify genomic DNA.)

[000230] In an embodiment of the invention, the primer probe set should amplify cDNA of less than 110 bases in length and should not amplify, or generate fluorescent signal from, genomic DNA or transcripts or cDNA from related but biologically irrelevant loci.

[000231] A suitable target of the selected primer probe is first strand cDNA, which in one embodiment may be prepared from whole blood as follows:

[000232] (a) Use of whole blood for ex vivo assessment of a biological condition

[000233] Human blood is obtained by venipuncture and prepared for assay. The aliquots of heparinized, whole blood are mixed with additional test therapeutic compounds and held at 37°C in an atmosphere of 5% CO₂ for 30 minutes. Cells are lysed and nucleic acids, e.g., RNA, are extracted by various standard means.

[000234] Nucleic acids, RNA and/or DNA, are purified from cells, tissues or fluids of the test population of cells. RNA is preferentially obtained from the nucleic acid mix using a variety of standard procedures (for example, RNA Isolation Strategies, pp. 55-104, in RNA Methodologies, A Laboratory Guide for Isolation and Characterization, 2nd edition, 1998, Robert E. Farrell, Jr., Ed., Academic Press), in the present case using a filter-based RNA isolation system from Ambion (RNAqueous™, Phenol-free Total RNA Isolation Kit, Catalog #1912, version 9908; Austin, Texas).

[000235] (b) Amplification strategies.

[000236] Specific RNAs are amplified using message specific primers or random primers. The specific primers are synthesized from data obtained from public databases (e.g., Unigene, National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD), including information from genomic and cDNA libraries obtained from humans and other animals. Primers are chosen to preferentially amplify from specific RNAs obtained from the test or indicator samples (see, for example, RT PCR, Chapter 15 in RNA Methodologies, A Laboratory Guide for Isolation and Characterization, 2nd edition, 1998, Robert E. Farrell, Jr., Ed., Academic Press; or Chapter 22, pp. 143-151, RNA Isolation and Characterization Protocols, Methods in Molecular Biology, Volume 86, 1998, R. Rapley and D.L. Manning, Eds., Humana Press; or Chapter 14, Statistical refinement of primer design parameters; or Chapter 5, pp. 55-72, PCR Applications: Protocols for Functional Genomics, M.A. Innis, D.H. Gelfand and J.J. Sninsky, Eds., 1999, Academic Press). Amplifications are carried out in either isothermic conditions or using a thermal cycler (for example, an ABI 9600 or 9700 or 7900 obtained from Applied Biosystems, Foster City, CA; see Nucleic acid detection methods, pp. 1-24, in Molecular Methods for Virus Detection, D.L. Wiedbrauk and D.H. Farkas, Eds., 1995, Academic Press). Amplified nucleic acids are detected using fluorescent-tagged detection oligonucleotide probes (see, for example, TaqMan™ PCR Reagent Kit, Protocol, part number 402823, Revision A, 1996, Applied Biosystems, Foster City, CA) that are identified and synthesized from publicly known databases as described for the amplification primers.

[000237] For example, without limitation, amplified cDNA is detected and quantified using detection systems such as the ABI Prism® 7900 Sequence Detection System (Applied Biosystems (Foster City, CA)), the Cepheid SmartCycler® and Cepheid GeneXpert® Systems, the Fluidigm BioMark™ System, and the Roche LightCycler® 480 Real-Time PCR System. Amounts of specific RNAs contained in the test sample can be related to the relative quantity of fluorescence observed (see, for example, Advances in Quantitative PCR Technology: 5' Nuclease Assays, Y.S. Lie and C.J. Petropoulos, Current Opinion in Biotechnology, 1998, 9:43-48, or Rapid Thermal Cycling and PCR Kinetics, pp. 211-229, chapter 14 in PCR Applications: Protocols for Functional Genomics, M.A. Innis, D.H. Gelfand and J.J. Sninsky, Eds., 1999, Academic Press). Examples of the procedure used with several of the above-mentioned detection systems are described below. In some embodiments, these procedures can be used for both whole blood RNA and RNA extracted from cultured cells (e.g., without limitation, CTCs and CECs). In some embodiments, any tissue, body fluid, or cell(s) (e.g., circulating tumor cells (CTCs) or circulating endothelial cells (CECs)) may be used for ex vivo assessment of a biological condition affected by an agent. Methods herein may also be applied using proteins where sensitive quantitative techniques, such as an Enzyme Linked Immunosorbent Assay (ELISA) or mass spectroscopy, are available and well-known in the art for measuring the amount of a protein constituent (see WO 98/24935 herein incorporated by reference).

[000238] An example of a procedure for the synthesis of first strand cDNA for use in PCR amplification is as follows:

[000239] Materials

1. Applied Biosystems TaqMan Reverse Transcription Reagents Kit (P/N 808-0234). Kit Components: 10X TaqMan RT Buffer, 25 mM Magnesium chloride, deoxyNTPs mixture, Random Hexamers, RNase Inhibitor, MultiScribe Reverse Transcriptase (50 U/μL).

2. RNase / DNase free water (DEPC Treated Water from Ambion (P/N 9915G), or equivalent).

Methods

1. Place RNase Inhibitor and MultiScribe Reverse Transcriptase on ice immediately. All other reagents can be thawed at room temperature and then placed on ice.

2. Remove RNA samples from -80°C freezer and thaw at room temperature and then place immediately on ice.

3. Prepare the following cocktail of Reverse Transcriptase Reagents for each 100 μL RT reaction (for multiple samples, prepare extra cocktail to allow for pipetting error):

                          1 reaction (μL)    11X, e.g., 10 samples (μL)
10X RT Buffer             10.0               110.0
25 mM MgCl₂               22.0               242.0
dNTPs                     20.0               220.0
Random Hexamers            5.0                55.0
RNase Inhibitor            2.0                22.0
Reverse Transcriptase      2.5                27.5
Water                     18.5               203.5
Total:                    80.0               880.0 (80 per sample)

4. Bring each RNA sample to a total volume of 20 μL in a 1.5 mL microcentrifuge tube (for example, remove 10 μL RNA and dilute to 20 μL with RNase / DNase free water; for whole blood RNA use 20 μL total RNA) and add 80 μL RT reaction mix from step 3. Mix by pipetting up and down.

5. Incubate sample at room temperature for 10 minutes.

6. Incubate sample at 37°C for 1 hour.

7. Incubate sample at 90°C for 10 minutes.

8. Quick spin samples in microcentrifuge.

9. Place sample on ice if doing PCR immediately, otherwise store sample at -20°C for future use.

10. PCR QC should be run on all RT samples using 18S and β-actin.
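
The scaling of the reverse transcription cocktail in step 3 can be sketched as below. The per-reaction volumes are taken from the cocktail table; the function name `scale_cocktail` and the choice of one extra reaction as the pipetting allowance (which reproduces the 11X column for 10 samples) are illustrative assumptions.

```python
# Per-reaction volumes in microliters, taken from the cocktail table in step 3.
RT_COCKTAIL_UL = {
    "10X RT Buffer": 10.0,
    "25 mM MgCl2": 22.0,
    "dNTPs": 20.0,
    "Random Hexamers": 5.0,
    "RNase Inhibitor": 2.0,
    "Reverse Transcriptase": 2.5,
    "Water": 18.5,
}


def scale_cocktail(per_reaction, n_samples, extra_reactions=1):
    """Scale each component for n_samples plus extra reactions that cover pipetting error."""
    if n_samples < 1 or extra_reactions < 0:
        raise ValueError("n_samples must be >= 1 and extra_reactions >= 0")
    factor = n_samples + extra_reactions
    return {name: round(volume * factor, 2) for name, volume in per_reaction.items()}


# Ten samples plus one extra reaction gives the 11X column of the table.
batch = scale_cocktail(RT_COCKTAIL_UL, n_samples=10)
per_reaction_total = sum(RT_COCKTAIL_UL.values())
batch_total = sum(batch.values())
```

Each sample then receives 80 μL of the cocktail on top of 20 μL of RNA, for the 100 μL reaction described in step 4.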

Following the synthesis of first strand cDNA, one particular embodiment of the approach for amplification of first strand cDNA by PCR, followed by detection and quantification of constituents of a Gene Expression Panel (Precision Profile™), is performed using the ABI Prism® 7900 Sequence Detection System as follows:

Materials

1. 20X Primer/Probe Mix for each gene of interest.

2. 20X Primer/Probe Mix for 18S endogenous control.

3. 2X Taqman Universal PCR Master Mix.

4. cDNA transcribed from RNA extracted from cells.

5. Applied Biosystems 96-Well Optical Reaction Plates.

6. Applied Biosystems Optical Caps, or optical-clear film.

7. Applied Biosystems Prism® 7700 or 7900 Sequence Detector.

Methods

1. Make stocks of each Primer/Probe mix containing the Primer/Probe for the gene of interest, Primer/Probe for 18S endogenous control, and 2X PCR Master Mix as follows. Make sufficient excess to allow for pipetting error, e.g., approximately 10% excess. The following example illustrates a typical set up for one gene with quadruplicate samples testing two conditions (2 plates).

                                          1X (1 well) (μL)
2X Master Mix                             7.5
20X 18S Primer/Probe Mix                  0.75
20X Gene of interest Primer/Probe Mix     0.75
Total                                     9.0

2. Make stocks of cDNA targets by diluting 95 μL of cDNA into 2000 μL of water. The amount of cDNA is adjusted to give Ct values between 10 and 18, typically between 12 and 16.

3. Pipette 9 μL of Primer/Probe mix into the appropriate wells of an Applied Biosystems 384-Well Optical Reaction Plate.

4. Pipette 10 μL of cDNA stock solution into each well of the Applied Biosystems 384-Well Optical Reaction Plate.

5. Seal the plate with Applied Biosystems Optical Caps, or optical-clear film.

6. Analyze the plate on the ABI Prism® 7900 Sequence Detector.
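
The stock preparation in steps 1 and 2 can be sketched as follows. The per-well volumes and the 95 μL into 2000 μL dilution come from the procedure; the function names and the assumed count of 8 wells (quadruplicates for two conditions) are hypothetical, since the procedure does not state an exact well count.

```python
# Per-well volumes in microliters, taken from the Primer/Probe mix table in step 1.
PCR_MIX_UL = {
    "2X Master Mix": 7.5,
    "20X 18S Primer/Probe Mix": 0.75,
    "20X Gene of interest Primer/Probe Mix": 0.75,
}


def mix_for_wells(per_well, n_wells, excess=0.10):
    """Scale the per-well mix for n_wells with a fractional excess for pipetting error."""
    if n_wells < 1 or excess < 0:
        raise ValueError("n_wells must be >= 1 and excess >= 0")
    factor = n_wells * (1.0 + excess)
    return {name: round(volume * factor, 3) for name, volume in per_well.items()}


def dilution_fraction(cdna_ul, water_ul):
    """Fraction of the diluted stock volume that is original cDNA."""
    return cdna_ul / (cdna_ul + water_ul)


# Assumed layout: one gene, quadruplicate wells, two conditions (8 wells).
stock = mix_for_wells(PCR_MIX_UL, n_wells=8)
stock_total = sum(stock.values())
fraction = dilution_fraction(95.0, 2000.0)
well_volume = sum(PCR_MIX_UL.values()) + 10.0  # 9 uL mix plus 10 uL cDNA stock
```

In practice the dilution would be adjusted per sample until the observed Ct values fall in the stated range.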

In another embodiment of the invention, the use of the primer probe with the first strand cDNA as described above to permit measurement of constituents of a Gene Expression Panel (Precision Profile™) is performed using a QPCR assay on Cepheid SmartCycler® and GeneXpert® Instruments as follows:

I. To run a QPCR assay in duplicate on the Cepheid SmartCycler® instrument containing three target genes and one reference gene, the following procedure should be followed.

A. With 20X Primer/Probe Stocks.

Materials

1. SmartMix™-HM lyophilized Master Mix.

2. Molecular grade water.

3. 20X Primer/Probe Mix for the 18S endogenous control gene. The endogenous control gene will be dual labeled with VIC-MGB or equivalent.

4. 20X Primer/Probe Mix for target gene one, dual labeled with FAM-BHQ1 or equivalent.

5. 20X Primer/Probe Mix for target gene two, dual labeled with Texas Red-BHQ2 or equivalent.

6. 20X Primer/Probe Mix for target gene three, dual labeled with Alexa 647-BHQ3 or equivalent.

7. Tris buffer, pH 9.0

8. cDNA transcribed from RNA extracted from sample.

9. SmartCycler® 25 μL tube.

10. Cepheid SmartCycler® instrument.

Methods

1. For each cDNA sample to be investigated, add the following to a sterile 650 μL tube.

SmartMix™-HM lyophilized Master Mix       1 bead
20X 18S Primer/Probe Mix                  2.5 μL
20X Target Gene 1 Primer/Probe Mix        2.5 μL
20X Target Gene 2 Primer/Probe Mix        2.5 μL
20X Target Gene 3 Primer/Probe Mix        2.5 μL
Tris Buffer, pH 9.0                       2.5 μL
Sterile Water                             34.5 μL
Total                                     47 μL

Vortex the mixture for 1 second three times to completely mix the reagents. Briefly centrifuge the tube after vortexing.

2. Dilute the cDNA sample so that a 3 μL addition to the reagent mixture above will give an 18S reference gene CT value between 12 and 16.

3. Add 3 μL of the prepared cDNA sample to the reagent mixture, bringing the total volume to 50 μL. Vortex the mixture for 1 second three times to completely mix the reagents. Briefly centrifuge the tube after vortexing.

4. Add 25 μL of the mixture to each of two SmartCycler® tubes, cap the tubes and spin for 5 seconds in a microcentrifuge having an adapter for SmartCycler® tubes.

5. Remove the two SmartCycler^® tubes from the microcentrifuge and inspect for air bubbles. If bubbles are present, re-spin, otherwise, load the tubes into the SmartCycler^® instrument.

6. Run the appropriate QPCR protocol on the SmartCycler^®, export the data and analyze the results.

B. With Lyophilized SmartBeads™.

Materials

1. SmartMix™-HM lyophilized Master Mix.

2. Molecular grade water.

3. SmartBeads™ containing the 18S endogenous control gene dual labeled with VIC-MGB or equivalent, and the three target genes, one dual labeled with FAM-BHQ1 or equivalent, one dual labeled with Texas Red-BHQ2 or equivalent and one dual labeled with Alexa 647-BHQ3 or equivalent.

4. Tris buffer, pH 9.0

5. cDNA transcribed from RNA extracted from sample.

6. SmartCycler^® 25 μL tube.

7. Cepheid SmartCycler^® instrument.

Methods

1. For each cDNA sample to be investigated, add the following to a sterile 650 μL tube.

SmartMix™-HM lyophilized Master Mix 1 bead

SmartBead™ containing four primer/probe sets 1 bead

Tris Buffer, pH 9.0 2.5 μL

Sterile Water 44.5 μL

Total 47 μL

3. Add 3 μL of the prepared cDNA sample to the reagent mixture, bringing the total volume to 50 μL. Vortex the mixture for 1 second three times to completely mix the reagents. Briefly centrifuge the tube after vortexing.

4. Add 25 μL of the mixture to each of two SmartCycler^® tubes, cap the tubes and spin for 5 seconds in a microcentrifuge having an adapter for SmartCycler^® tubes.

5. Remove the two SmartCycler^® tubes from the microcentrifuge and inspect for air bubbles. If bubbles are present, re-spin, otherwise, load the tubes into the SmartCycler® instrument.

II. To run a QPCR assay on the Cepheid GeneXpert^® instrument containing three target genes and one reference gene, the following procedure should be followed. Note that to do duplicates, two self contained cartridges need to be loaded and run on the GeneXpert^® instrument.

Materials

1. Cepheid GeneXpert^® self contained cartridge preloaded with a lyophilized SmartMix™-HM master mix bead and a lyophilized SmartBead™ containing four primer/probe sets.

2. Molecular grade water, containing Tris buffer, pH 9.0.

3. Extraction and purification reagents.

4. Clinical sample (whole blood, RNA, etc.)

5. Cepheid GeneXpert^® instrument.

Methods

1. Remove appropriate GeneXpert^® self contained cartridge from packaging.

2. Fill appropriate chamber of self contained cartridge with molecular grade water with Tris buffer, pH 9.0.

3. Fill appropriate chambers of self contained cartridge with extraction and purification reagents.

4. Load aliquot of clinical sample into appropriate chamber of self contained cartridge.

5. Seal cartridge and load into GeneXpert^® instrument.

6. Run the appropriate extraction and amplification protocol on the GeneXpert^® and analyze the resultant data.

In yet another embodiment of the invention, the use of the primer probe with the first strand cDNA as described above to permit measurement of constituents of a Gene Expression Panel (Precision Profile^™) is performed using a QPCR assay on the Roche LightCycler^® 480 Real-Time PCR System as follows:

Materials

1. 20X Primer/Probe stock for the 18S endogenous control gene. The endogenous control gene may be dual labeled with either VIC-MGB or VIC-TAMRA.

2. 20X Primer/Probe stock for each target gene, dual labeled with either FAM-TAMRA or FAM-BHQ1.

3. 2X LightCycler^® 480 Probes Master (master mix).

4. 1X cDNA sample stocks transcribed from RNA extracted from samples.

5. 1X TE buffer, pH 8.0.

6. LightCycler^® 480 384-well plates.

7. Source MDx 24 gene Precision Profile^™ 96-well intermediate plates.

8. RNase/DNase free 96-well plate.

9. 1.5 mL microcentrifuge tubes.

10. Beckman/Coulter Biomek 3000 Laboratory Automation Workstation.

11. Velocity11 Bravo™ Liquid Handling Platform.

12. LightCycler^® 480 Real-Time PCR System.

Methods

1. Remove a Source MDx 24 gene Precision Profile™ 96-well intermediate plate from the freezer, thaw and spin in a plate centrifuge.

2. Dilute four (4) 1X cDNA sample stocks in separate 1.5 mL microcentrifuge tubes with the total final volume for each of 540 μL.

3. Transfer the 4 diluted cDNA samples to an empty RNase/DNase free 96-well plate using the Biomek^® 3000 Laboratory Automation Workstation.

4. Transfer the cDNA samples from the cDNA plate created in step 3 to the thawed and centrifuged Source MDx 24 gene Precision Profile™ 96-well intermediate plate using Biomek^® 3000 Laboratory Automation Workstation. Seal the plate with a foil seal and spin in a plate centrifuge.

5. Transfer the contents of the cDNA-loaded Source MDx 24 gene Precision Profile™ 96-well intermediate plate to a new LightCycler® 480 384-well plate using the Bravo™ Liquid Handling Platform. Seal the 384-well plate with a LightCycler^® 480 optical sealing foil and spin in a plate centrifuge for 1 minute at 2000 rpm.

6. Place the sealed plate in a dark 4°C refrigerator for a minimum of 4 minutes.

7. Load the plate into the LightCycler^® 480 Real-Time PCR System and start the LightCycler® 480 software. Choose the appropriate run parameters and start the run.

8. At the conclusion of the run, analyze the data and export the resulting CP values to the database.

[000240] In some instances, target gene FAM measurements may be beyond the detection limit of the particular platform instrument used to detect and quantify constituents of a Gene Expression Panel (Precision Profile^™). To address the issue of "undetermined" gene expression measures as lack of expression for a particular gene, the detection limit may be reset and the "undetermined" constituents may be "flagged". For example without limitation, the ABI Prism^® 7900HT Sequence Detection System reports target gene FAM measurements that are beyond the detection limit of the instrument (>40 cycles) as "undetermined". Detection Limit Reset is performed when at least 1 of 3 target gene FAM C_T replicates is not detected after 40 cycles and is designated as "undetermined". "Undetermined" target gene FAM C_T replicates are re-set to 40 and flagged. C_T normalization (Δ C_T) and relative expression calculations that have used re-set FAM C_T values are also flagged.
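The Detection Limit Reset described above may be sketched, for illustration only, as follows. The sketch assumes that an "undetermined" replicate is represented as `None`, that replicates are averaged before normalization, and that Δ C_T is computed as the target C_T minus the reference (e.g., 18S) C_T; the function names are hypothetical and are not taken from any instrument software.

```python
from typing import List, Optional, Tuple

# Detection limit in cycles, as stated above for the ABI Prism 7900HT.
DETECTION_LIMIT_CT = 40.0


def reset_undetermined(
    replicates: List[Optional[float]],
    limit: float = DETECTION_LIMIT_CT,
) -> Tuple[List[float], bool]:
    """Re-set undetermined replicates (None) to the detection limit.

    Returns the re-set C_T values and a flag that is True when at least
    one replicate was re-set.
    """
    flagged = any(ct is None for ct in replicates)
    reset = [limit if ct is None else ct for ct in replicates]
    return reset, flagged


def delta_ct(
    target_cts: List[float], reference_ct: float, flagged: bool
) -> Tuple[float, bool]:
    """Normalize the mean target C_T against the reference C_T.

    Assumption: delta C_T = mean(target C_T) - reference C_T.
    The flag is carried forward so downstream values remain flagged.
    """
    mean_target = sum(target_cts) / len(target_cts)
    return mean_target - reference_ct, flagged


if __name__ == "__main__":
    cts, was_reset = reset_undetermined([31.2, None, 30.8])
    print(cts, was_reset)
    print(delta_ct(cts, 14.0, was_reset))
```

Carrying the flag alongside each derived value keeps any normalized or relative expression result traceable to a re-set measurement.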

[000241] Baseline profile data sets

[000242] The analyses of samples from single individuals and from large groups of individuals provide a library of profile data sets relating to a particular panel or series of panels. These profile data sets may be stored as records in a library for use as baseline profile data sets. As the term "baseline" suggests, the stored baseline profile data sets serve as comparators for providing a calibrated profile data set that is informative about a biological condition or agent. Baseline profile data sets may be stored in libraries and classified in a number of cross-referential ways. One form of classification may rely on the characteristics of the panels from which the data sets are derived. Another form of classification may be by particular biological condition, e.g., prostate cancer. The concept of a biological condition encompasses any state in which a cell or population of cells may be found at any one time. This state may reflect geography of samples, sex of subjects or any other discriminator. Some of the discriminators may overlap. The libraries may also be accessed for records associated with a single subject or particular clinical trial. The classification of baseline profile data sets may further be annotated with medical information about a particular subject, a medical condition, and/or a particular agent.

[000243] The choice of a baseline profile data set for creating a calibrated profile data set is related to the biological condition to be evaluated, monitored, or predicted, as well as the intended use of the calibrated panel, e.g., to monitor drug development, quality control or other uses. It may be desirable to access baseline profile data sets from the same subject for whom a first profile data set is obtained or from different subjects at varying times, exposures to stimuli, drugs or complex compounds; or the baseline profile data sets may be derived from like or dissimilar populations or sets of subjects. The baseline profile data set may be a normal, healthy baseline.

[000244] The profile data set may arise from the same subject for which the first data set is obtained, where the sample is taken at a separate or similar time, a different or similar site or in a different or similar biological condition. For example, a sample may be taken before stimulation or after stimulation with an exogenous compound or substance, such as before or after therapeutic treatment. Alternatively, the sample is taken before or after a surgical procedure for the biological condition, e.g., prostate cancer. The profile data set obtained from the unstimulated sample may serve as a baseline profile data set for the sample taken after stimulation. The baseline data set may also be derived from a library containing profile data sets of a population or set of subjects having some defining characteristic or biological condition. The baseline profile data set may also correspond to some ex vivo or in vitro properties associated with an in vitro cell culture. The resultant calibrated profile data sets may then be stored as a record in a database or library along with or separate from the baseline profile data base and optionally the first profile data set, although the first profile data set would normally become incorporated into a baseline profile data set under suitable classification criteria. The remarkable consistency of Gene Expression Profiles associated with a given biological condition makes it valuable to store profile data, which can be used, among other things, for normative reference purposes. The normative reference can serve to indicate the degree to which a subject conforms to a given biological condition (healthy or diseased) and, alternatively or in addition, to provide a target for clinical intervention.

[000245] Calibrated data

[000246] Given the repeatability achieved in measurement of gene expression, described above in connection with "Gene Expression Panels" (Precision Profiles™) and "gene amplification", it was concluded that where differences occur in measurement under such conditions, the differences are attributable to differences in biological condition. Thus, it has been found that calibrated profile data sets are highly reproducible in samples taken from the same individual under the same conditions. Similarly, it has been found that calibrated profile data sets are reproducible in samples that are repeatedly tested. Repeated instances have also been found wherein calibrated profile data sets obtained when samples from a subject are exposed ex vivo to a compound are comparable to calibrated profile data from a sample that has been exposed to the compound in vivo.

[000247] Calculation of calibrated profile data sets and computational aids

[000248] The calibrated profile data set may be expressed in a spreadsheet or represented graphically, for example, in a bar chart or tabular form, but may also be expressed in a three dimensional representation. The function relating the baseline and profile data may be a ratio expressed as a logarithm. The constituent may be itemized on the x-axis and the logarithmic scale may be on the y-axis. Members of a calibrated data set may be expressed as a positive value representing a relative enhancement of gene expression or as a negative value representing a relative reduction in gene expression with respect to the baseline.

[000249] Each member of the calibrated profile data set should be reproducible within a range with respect to similar samples taken from the subject under similar conditions. For example, the calibrated profile data sets may be reproducible within 20%, and typically within 10%. In accordance with embodiments of the invention, a pattern of increasing, decreasing and no change in relative gene expression from each of a plurality of gene loci examined in the Gene Expression Panel (Precision Profile^™) may be used to prepare a calibrated profile set that is informative with regards to a biological condition, biological efficacy of an agent, treatment conditions or for comparison to populations or sets of subjects or samples, or for comparison to populations of cells. Patterns of this nature may be used to identify likely candidates for a drug trial, used alone or in combination with other clinical indicators to be diagnostic or prognostic with respect to a biological condition or may be used to guide the development of a pharmaceutical or nutraceutical through manufacture, testing and marketing.
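By way of a non-limiting illustration, the logarithmic ratio described above for relating profile data to baseline data may be sketched as follows. The sketch assumes a base-10 logarithm and strictly positive expression amounts; the gene labels `GENE_A` and `GENE_B` and the function name are hypothetical.

```python
import math
from typing import Dict


def calibrated_profile(
    profile: Dict[str, float], baseline: Dict[str, float]
) -> Dict[str, float]:
    """Return log10(profile / baseline) for each constituent in both sets.

    A positive member indicates relative enhancement of expression with
    respect to the baseline; a negative member indicates relative reduction.
    Assumption: the amounts are strictly positive and the logarithm is base 10.
    """
    return {
        gene: math.log10(profile[gene] / baseline[gene])
        for gene in profile
        if gene in baseline
    }


if __name__ == "__main__":
    first_profile = {"GENE_A": 200.0, "GENE_B": 50.0}
    baseline_profile = {"GENE_A": 100.0, "GENE_B": 100.0}
    for gene, value in calibrated_profile(first_profile, baseline_profile).items():
        print(gene, round(value, 3))
```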

[000250] The numerical data obtained from quantitative gene expression and numerical data from calibrated gene expression relative to a baseline profile data set may be stored in databases or digital storage mediums and may be retrieved for purposes including managing patient health care or for conducting clinical trials or for characterizing a drug. The data may be transferred in physical or wireless networks via the World Wide Web, email, or internet access site for example or by hard copy so as to be collected and pooled from distant geographic sites.

[000251] The method also includes producing a calibrated profile data set for the panel, wherein each member of the calibrated profile data set is a function of a corresponding member of the first profile data set and a corresponding member of a baseline profile data set for the panel, and wherein the baseline profile data set is related to the biological condition, e.g., prostate cancer or condition related to prostate cancer, to be evaluated, with the calibrated profile data set being a comparison between the first profile data set and the baseline profile data set, thereby providing evaluation of the biological condition, for example, prostate cancer or a condition related to prostate cancer, of the subject.

[000252] In yet other embodiments, the function is a mathematical function and is other than a simple difference, including a second function of the ratio of the corresponding member of first profile data set to the corresponding member of the baseline profile data set, or a logarithmic function. In such embodiments, the first sample is obtained and the first profile data set quantified at a first location, and the calibrated profile data set is produced using a network to access a database stored on a digital storage medium in a second location, wherein the database may be updated to reflect the first profile data set quantified from the sample. Additionally, using a network may include accessing a global computer network.

[000253] In an embodiment of the present invention, a descriptive record is stored in a single database or multiple databases where the stored data includes the raw gene expression data (first profile data set) prior to transformation by use of a baseline profile data set, as well as a record of the baseline profile data set used to generate the calibrated profile data set including for example, annotations regarding whether the baseline profile data set is derived from a particular Signature Panel and any other annotation that facilitates interpretation and use of the data.

[000254] Because the data is in a universal format, data handling may readily be done with a computer. The data is organized so as to provide an output optionally corresponding to a graphical representation of a calibrated data set.

[000255] The above described data storage on a computer may provide the information in a form that can be accessed by a user. Accordingly, the user may load the information onto a second access site including downloading the information. However, access may be restricted to users having a password or other security device so as to protect the medical records contained within. A feature of this embodiment of the invention is the ability of a user to add new or annotated records to the data set so the records become part of the biological information.

[000256] The graphical representation of calibrated profile data sets pertaining to a product such as a drug provides an opportunity for standardizing a product by means of the calibrated profile, more particularly a signature profile. The profile may be used as a feature with which to demonstrate relative efficacy, differences in mechanisms of actions, etc. compared to other drugs approved for similar or different uses.

[000257] The various embodiments of the invention may be also implemented as a computer program product for use with a computer system. The product may include program code for deriving a first profile data set and for producing calibrated profiles. Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (for example, a diskette, CD-ROM, ROM, or fixed disk), or transmittable to a computer system via a modem or other interface device, such as a communications adapter coupled to a network. The network coupling may be for example, over optical or wired communications lines or via wireless techniques (for example, microwave, infrared or other transmission techniques) or some combination of these. The series of computer instructions preferably embodies all or part of the functionality previously described herein with respect to the system. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (for example, shrink wrapped software), preloaded with a computer system (for example, on system ROM or fixed disk), or distributed from a server or electronic bulletin board over a network (for example, the Internet or World Wide Web). In addition, a computer system is further provided including derivative modules for deriving a first data set and a calibration profile data set.

[000258] The calibration profile data sets in graphical or tabular form, the associated databases, and the calculated index or derived algorithm, together with information extracted from the panels, the databases, the data sets or the indices or algorithms are commodities that can be sold together or separately for a variety of purposes as described in WO 01/25473.

[000259] In other embodiments, a clinical indicator may be used to assess the biological condition, e.g., prostate cancer or condition related to prostate cancer, of the relevant set of subjects by interpreting the calibrated profile data set in the context of at least one other clinical indicator, wherein the at least one other clinical indicator is selected from the group consisting of blood chemistry (e.g., PSA levels), X-ray or other radiological or metabolic imaging technique, molecular markers in the blood, other chemical assays, and physical findings.

[000260] Index construction

[000261] In combination, (i) the remarkable consistency of Gene Expression Profiles with respect to a biological condition across a population or set of subjects or samples, or across a population of cells and (ii) the use of procedures that provide substantially reproducible measurement of constituents in a Gene Expression Panel (Precision Profile™) giving rise to a Gene Expression Profile, under measurement conditions wherein specificity and efficiencies of amplification for all constituents of the panel are substantially similar, make possible the use of an index that characterizes a Gene Expression Profile, and which therefore provides a measurement of a biological condition.

[000262] An index may be constructed using an index function that maps values in a Gene Expression Profile into a single value that is pertinent to the biological condition at hand. The values in a Gene Expression Profile are the amounts of each constituent of the Gene Expression Panel (Precision Profile™). These constituent amounts form a profile data set, and the index function generates a single value, the index, from the members of the profile data set.

[000263] The index function may conveniently be constructed as a linear sum of terms, each term being what is referred to herein as a "contribution function" of a member of the profile data set. For example, the contribution function may be a constant times a power of a member of the profile data set. So the index function would have the form

[000264] $I = \sum_i C_i M_i^{P(i)}$,

[000265] where I is the index, Mi is the value of the member i of the profile data set, Ci is a constant, and P(i) is a power to which Mi is raised, the sum being formed for all integral values of i up to the number of members in the data set. We thus have a linear polynomial expression. The coefficient Ci for a particular gene specifies whether a higher ΔCt value for this gene increases (a positive Ci) or decreases (a negative Ci) the likelihood of prostate cancer, the ΔCt values of all other genes in the expression being held constant.
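For illustration only, an index function of the form described above, a sum of terms each being a constant times a power of a member of the profile data set, may be sketched as follows. The function name and the example values are hypothetical, and the sketch assumes integer powers so that negative member values (for example, negative ΔCt values) remain valid inputs.

```python
from typing import Sequence


def index_value(
    members: Sequence[float],
    coefficients: Sequence[float],
    powers: Sequence[int],
) -> float:
    """Compute I as the sum over i of C_i * M_i ** P(i).

    members      -- M_i, the values of the profile data set
    coefficients -- C_i, one constant per member
    powers       -- P(i), one integer power per member (assumption)
    """
    if not (len(members) == len(coefficients) == len(powers)):
        raise ValueError("members, coefficients and powers must have equal length")
    return sum(c * m ** p for m, c, p in zip(members, coefficients, powers))


if __name__ == "__main__":
    # Two hypothetical constituents: 0.5 * 2.0**1 + (-1.0) * 3.0**2
    print(index_value([2.0, 3.0], [0.5, -1.0], [1, 2]))
```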

[000266] The values Ci and P(i) may be determined in a number of ways, so that the index I is informative of the pertinent biological condition. One way is to apply statistical techniques, such as latent class modeling, to the profile data sets to correlate clinical data or experimentally derived data, or other data pertinent to the biological condition. In this connection, for example, the software from Statistical Innovations, Belmont, Massachusetts, called Latent Gold^® may be employed. Alternatively, other simpler modeling techniques may be employed in a manner known in the art. The index function for prostate cancer may be constructed, for example, in a manner that a greater degree of prostate cancer (as determined by the profile data set for the Precision Profile™ listed in Table 1 described herein) correlates with a large value of the index function.

[000267] Just as a baseline profile data set, discussed above, can be used to provide an appropriate normative reference, and can even be used to create a Calibrated profile data set, as discussed above, based on the normative reference, an index that characterizes a Gene Expression Profile can also be provided with a normative value of the index function used to create the index. This normative value can be determined with respect to a relevant population or set of subjects or samples or to a relevant population of cells, so that the index may be interpreted in relation to the normative value. The relevant population or set of subjects or samples, or relevant population of cells may have in common a property that is at least one of age range, gender, ethnicity, geographic location, nutritional history, medical condition, clinical indicator, medication, physical activity, body mass, and environmental exposure.

[000268] As an example, the index can be constructed, in relation to a normative Gene Expression Profile for a population or set of healthy subjects, in such a way that a reading of approximately 1 characterizes normative Gene Expression Profiles of healthy subjects. Let us further assume that the biological condition that is the subject of the index is prostate cancer; a reading of 1 in this example thus corresponds to a Gene Expression Profile that matches the norm for healthy subjects (i.e., normal, healthy subjects or otherwise healthy subjects with BPH). A substantially higher reading then may identify a subject experiencing prostate cancer, or a condition related to prostate cancer. The use of 1 as identifying a normative value, however, is only one possible choice; another logical choice is to use 0 as identifying the normative value. With this choice, deviations in the index from zero can be indicated in standard deviation units (so that values lying between -1 and +1 encompass approximately 68% of a normally distributed reference population or set of subjects). Since it was determined that Gene Expression Profile values (and accordingly constructed indices based on them) tend to be normally distributed, the 0-centered index constructed in this manner is highly informative. It therefore facilitates use of the index in diagnosis of disease and setting objectives for treatment.

[000269] Still another embodiment is a method of providing an index pertinent to a biological condition, for example, prostate cancer or a condition related to prostate cancer, of a subject based on a first sample from the subject, the method comprising deriving from the first sample a profile data set, the profile data set including a plurality of members, each member being a quantitative measure of the amount of a distinct constituent (e.g., RNA) in a panel of constituents selected so that measurement of the constituents is indicative of the presumptive signs of the biological condition (e.g., prostate cancer). In deriving the profile data set, such measure for each constituent is achieved under measurement conditions that are substantially repeatable, and at least one measure from the profile data set is applied to an index function that provides a mapping from at least one measure of the profile data set into one measure of the presumptive signs of the biological condition, so as to produce an index pertinent to the biological condition of the subject.

[000270] As another embodiment of the invention, an index function I of the form

[000271] $I = C_0 + \sum_i C_i M_{1i}^{P1(i)} M_{2i}^{P2(i)}$,

can be employed, where M1i and M2i are values of the member i of the profile data set, Ci is a constant determined without reference to the profile data set, and P1 and P2 are powers to which M1i and M2i are raised. The role of P1(i) and P2(i) is to specify the specific functional form of the quadratic expression, whether in fact the equation is linear, quadratic, contains cross-product terms, or is constant. For example, when P1 = P2 = 0, the index function is simply the sum of constants; when P1 = 1 and P2 = 0, the index function is a linear expression; when P1 = P2 = 1, the index function is a quadratic expression.

[000273] The constant C₀ serves to calibrate this expression to the biological population of interest that is characterized by having the biological condition. In this embodiment, when the index value equals 0, the odds are 50:50 of the subject having the biological condition versus a normal subject. More generally, the predicted odds of the subject having the biological condition is [exp(Ii)], and therefore the predicted probability of having the biological condition is [exp(Ii)]/[1+exp(Ii)]. Thus, when the index exceeds 0, the predicted probability that a subject has the biological condition is higher than 0.5, and when it falls below 0, the predicted probability is less than 0.5.
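The relationship between the index value, the predicted odds, and the predicted probability stated above may be sketched as follows. The function names are hypothetical, and the sketch is intended for index values of moderate magnitude (a very large positive index would overflow the exponential).

```python
import math


def predicted_odds(index: float) -> float:
    """Predicted odds of having the biological condition: exp(I)."""
    return math.exp(index)


def predicted_probability(index: float) -> float:
    """Predicted probability of having the condition: exp(I) / (1 + exp(I))."""
    odds = predicted_odds(index)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for i in (-1.0, 0.0, 1.0):
        print(i, predicted_odds(i), predicted_probability(i))
```

An index of 0 gives odds of 1 (50:50) and a probability of 0.5; positive and negative index values fall on either side of 0.5, as stated above.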

[000274] The value of C₀ may be adjusted to reflect the prior probability of being in this population based on known exogenous risk factors for the subject. In an embodiment where C₀ is adjusted as a function of the subject's risk factors, where the subject has prior probability pi of having the biological condition based on such risk factors, the adjustment is made by increasing (decreasing) the unadjusted C₀ value by adding to C₀ the natural logarithm of the following ratio: the prior odds of having the biological condition taking into account the risk factors / the overall prior odds of having the biological condition without taking into account the risk factors.
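The adjustment of the constant term described above may be illustrated with the following sketch, which assumes that prior odds are computed from a prior probability p as p/(1 - p). The function names and the example probabilities are hypothetical.

```python
import math


def odds(probability: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds p / (1 - p)."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def adjusted_c0(c0: float, prior_with_risk: float, overall_prior: float) -> float:
    """Add to C0 the natural log of (risk-adjusted prior odds / overall prior odds)."""
    return c0 + math.log(odds(prior_with_risk) / odds(overall_prior))


if __name__ == "__main__":
    # A subject whose risk factors raise the prior from 0.2 to 0.5
    # has the constant term increased by ln(4).
    print(adjusted_c0(0.0, 0.5, 0.2))
```

When the subject's prior equals the overall prior, the ratio is 1, its logarithm is 0, and the constant is left unchanged.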

[000275] PERFORMANCE AND ACCURACY MEASURES OF THE INVENTION

[000276] The performance and thus absolute and relative clinical usefulness of the invention may be assessed in multiple ways as noted above. Amongst the various assessments of performance, the invention is intended to provide accuracy in clinical diagnosis and prognosis. The accuracy of a diagnostic or prognostic test, assay, or method concerns the ability of the test, assay, or method to distinguish between subjects having a biological condition and subjects not having it, based on whether the subjects have an "effective amount" or a "significant alteration" in the levels of an indicator. By "effective amount" or "significant alteration", it is meant that the measurement of an appropriate number of indicators (which may be one or more) is different than the predetermined cut-off point (or threshold value) for that indicator and therefore indicates that the subject has the biological condition for which the indicator(s) is a determinant.

[000277] The difference in the level of indicator(s) between normal and abnormal is preferably statistically significant. As noted below, and without any limitation of the invention, achieving statistical significance, and thus the preferred analytical and clinical accuracy, generally but not always requires that combinations of indicators be used together in panels and combined with mathematical algorithms in order to achieve a statistically significant index.

[000278] In the categorical diagnosis of a disease state, changing the cut point or threshold value of a test (or assay) usually changes the sensitivity and specificity, but in a qualitatively inverse relationship. Therefore, in assessing the accuracy and usefulness of a proposed medical test, assay, or method for assessing a subject's condition, one should always take both sensitivity and specificity into account and be mindful of what the cut point is at which the sensitivity and specificity are being reported because sensitivity and specificity may vary significantly over the range of cut points. Use of statistics such as AUC, encompassing all potential cut point values, is preferred for most categorical risk measures using the invention, while for continuous risk measures, statistics of goodness-of-fit and calibration to observed results or other gold standards, are preferred.
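As a minimal sketch of the statistics discussed above, the following computes sensitivity and specificity at a given cut point and an AUC over all cut points. The sketch assumes that higher scores indicate the condition, that labels are 1 (condition present) or 0 (absent), and that the AUC is estimated as the fraction of positive/negative pairs ranked correctly (ties counted as one half); the function names and data are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[int], cut_point: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity), calling a score >= cut_point positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut_point)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cut_point)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut_point)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cut_point)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8]
    labels = [0, 0, 1, 1]
    print(sensitivity_specificity(scores, labels, 0.5))
    print(sensitivity_specificity(scores, labels, 0.3))
    print(auc(scores, labels))
```

In this hypothetical data set, lowering the cut point from 0.5 to 0.3 raises sensitivity while lowering specificity, whereas the AUC does not depend on any single cut point.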

[000279] Using such statistics, an "acceptable degree of diagnostic accuracy" is herein defined as a test or assay (such as the test of the invention for determining an effective amount or a significant alteration of indicator(s), which thereby indicates the presence of a biological condition) in which the AUC (area under the ROC curve for the test or assay) is at least 0.60, desirably at least 0.65, more desirably at least 0.70, preferably at least 0.75, more preferably at least 0.80, and most preferably at least 0.85.

[000280] By a "very high degree of diagnostic accuracy", it is meant a test or assay in which the AUC (area under the ROC curve for the test or assay) is at least 0.75, desirably at least 0.775, more desirably at least 0.800, preferably at least 0.825, more preferably at least 0.850, and most preferably at least 0.875.

[000281] The predictive value of any test depends on the sensitivity and specificity of the test, and on the prevalence of the condition in the population being tested. This notion, based on Bayes' theorem, provides that the greater the likelihood that the condition being screened for is present in an individual or in the population (pre-test probability), the greater the validity of a positive test and the greater the likelihood that the result is a true positive. Thus, the problem with using a test in any population where there is a low likelihood of the condition being present is that a positive result has limited value (i.e., more likely to be a false positive). Similarly, in populations at very high risk, a negative test result is more likely to be a false negative.
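The dependence of predictive value on prevalence noted above may be illustrated by applying Bayes' theorem as follows. The function name and the example sensitivity, specificity, and prevalence figures are hypothetical.

```python
from typing import Tuple


def predictive_values(
    sensitivity: float, specificity: float, prevalence: float
) -> Tuple[float, float]:
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Same hypothetical test (90% sensitive, 90% specific) in two populations.
    print(predictive_values(0.9, 0.9, 0.01))  # low prevalence
    print(predictive_values(0.9, 0.9, 0.50))  # high prevalence
```

With these illustrative figures, a positive result in the 1% prevalence population is a true positive only about one time in twelve, while the same test in the 50% prevalence population yields a positive predictive value of 0.9.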

[000282] As a result, ROC and AUC can be misleading as to the clinical utility of a test in low disease prevalence tested populations (defined as those with less than 1% rate of occurrences (incidence) per annum, or less than 10% cumulative prevalence over a specified time horizon). Alternatively, absolute risk and relative risk ratios as defined elsewhere in this disclosure can be employed to determine the degree of clinical utility. Populations of subjects to be tested can also be categorized into quartiles by the test's measurement values, where the top quartile (25% of the population) comprises the group of subjects with the highest relative risk for developing prostate cancer, and the bottom quartile comprises the group of subjects having the lowest relative risk for developing prostate cancer. Generally, values derived from tests or assays having over 2.5 times the relative risk from top to bottom quartile in a low prevalence population are considered to have a "high degree of diagnostic accuracy," and those with five to seven times the relative risk for each quartile are considered to have a "very high degree of diagnostic accuracy." Nonetheless, values derived from tests or assays having only 1.2 to 2.5 times the relative risk for each quartile remain clinically useful and are widely used as risk factors for a disease. Often such lower diagnostic accuracy tests must be combined with additional parameters in order to derive meaningful clinical thresholds for therapeutic intervention, as is done with the aforementioned global risk assessment indices.
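The top-to-bottom quartile comparison described above may be sketched as follows. This is an illustrative assumption about how such a ratio could be computed; the function name `quartile_relative_risk` and the example data are hypothetical.

```python
def quartile_relative_risk(scores, events):
    """Ratio of event rates in the top versus bottom quartile of test values.

    scores: test measurement value for each subject
    events: 1 if the subject developed the condition, otherwise 0
    """
    if len(scores) < 4:
        raise ValueError("at least 4 subjects are needed to form quartiles")
    ranked = sorted(zip(scores, events), key=lambda pair: pair[0])
    size = len(ranked) // 4
    bottom, top = ranked[:size], ranked[-size:]
    risk_bottom = sum(event for _, event in bottom) / size
    risk_top = sum(event for _, event in top) / size
    if risk_bottom == 0:
        return float("inf")  # no events in the bottom quartile
    return risk_top / risk_bottom


# Hypothetical cohort of 16 subjects: 1 event in the bottom quartile, 3 in the top.
example_scores = list(range(1, 17))
example_events = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
example_rr = quartile_relative_risk(example_scores, example_events)
```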

[000283] A health economic utility function is yet another means of measuring the performance and clinical value of a given test, consisting of weighting the potential categorical test outcomes based on actual measures of clinical and economic value for each. Health economic performance is closely related to accuracy, as a health economic utility function specifically assigns an economic value for the benefits of correct classification and the costs of misclassification of tested subjects. As a performance measure, it is not unusual to require a test to achieve a level of performance which results in an increase in health economic value per test (prior to testing costs) in excess of the target price of the test.
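One possible form of such a utility function weights the probability of each categorical outcome by an assigned economic value. The sketch below assumes a simple four-outcome weighting; the function name `value_per_test` and all monetary figures are hypothetical and are not taken from the disclosure.

```python
def value_per_test(sensitivity, specificity, prevalence,
                   benefit_tp, benefit_tn, cost_fp, cost_fn):
    """Expected economic value per test, before testing costs."""
    p_tp = sensitivity * prevalence
    p_fn = (1.0 - sensitivity) * prevalence
    p_tn = specificity * (1.0 - prevalence)
    p_fp = (1.0 - specificity) * (1.0 - prevalence)
    return (p_tp * benefit_tp + p_tn * benefit_tn
            - p_fp * cost_fp - p_fn * cost_fn)


# Hypothetical inputs; the result would be compared with the target test price.
example_value = value_per_test(0.80, 0.90, 0.10,
                               benefit_tp=1000.0, benefit_tn=0.0,
                               cost_fp=200.0, cost_fn=500.0)
```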

[000284] In general, alternative methods of determining diagnostic accuracy are commonly used for continuous measures, when a disease category or risk category (such as those at risk for having a bone fracture) has not yet been clearly defined by the relevant medical societies and practice of medicine, where thresholds for therapeutic use are not yet established, or where there is no existing gold standard for diagnosis of the pre-disease. For continuous measures of risk, measures of diagnostic accuracy for a calculated index are typically based on curve fit and calibration between the predicted continuous value and the actual observed values (or a historical index calculated value) and utilize measures such as R squared, Hosmer-Lemeshow P-value statistics and confidence intervals. It is not unusual for predicted values using such algorithms to be reported including a confidence interval (usually 90% or 95% CI) based on a historical observed cohort's predictions, as in the test for risk of future breast cancer recurrence commercialized by Genomic Health, Inc. (Redwood City, California).

[000285] In general, defining the degree of diagnostic accuracy (i.e., cut points on a ROC curve), defining an acceptable AUC value, and determining the acceptable ranges in relative concentration of what constitutes an effective amount of the indicator(s) of the invention allow one of skill in the art to use the indicator(s) to identify, diagnose, or prognose subjects with a pre-determined level of predictability and performance.

[000286] Results from the indicator(s) indices thus derived can then be validated through their calibration with actual results, that is, by comparing the predicted versus observed rate of disease in a given population, and the best predictive indicator(s) selected for and optimized through mathematical models of increased complexity. Many such formulae may be used; beyond the simple non-linear transformations, such as logistic regression, of particular interest in this use of the present invention are structural and syntactic classification algorithms, and methods of risk index construction, utilizing pattern recognition features, including established techniques such as the Kth-Nearest Neighbor, Boosting, Decision Trees, Neural Networks, Bayesian Networks, Support Vector Machines, and Hidden Markov Models, as well as other formulae described herein.

[000287] Furthermore, the application of such techniques to panels of multiple indicator(s) is provided, as is the use of such combination to create single numerical "risk indices" or "risk scores" encompassing information from multiple indicator(s) inputs. Individual indicator(s) may also be included or excluded in the panel of indicator(s) used in the calculation of the indices so derived above, based on various measures of relative performance and calibration in validation, and employing repetitive training methods such as forward, reverse, and stepwise selection, as well as genetic algorithm approaches, with or without the use of constraints on the complexity of the resulting indices.

[000288] The above measurements of diagnostic accuracy for indicator(s) are only a few of the possible measurements of the clinical performance of the invention. It should be noted that the appropriateness of one measurement of clinical accuracy or another will vary based upon the clinical application, the population tested, and the clinical consequences of any potential misclassification of subjects. Other important aspects of the clinical and overall performance of the invention include the selection of indicator(s) so as to reduce overall indicator(s) variability (whether due to method (analytical variability), to biology (pre-analytical variability, for example, diurnal variation), or to the integration and analysis of results into indices and cut-off ranges (post-analytical variability)), to assess analyte stability or sample integrity, or to allow the use of differing sample matrices amongst blood, cells, serum, plasma, urine, etc.

[000289] KITS

[000290] The invention also includes prime indicator and proxy indicator detection reagents, e.g., nucleic acids that identify any of the prime or proxy genes listed herein by having homologous nucleic acid sequences, such as oligonucleotide sequences, complementary to a portion of the prime or proxy genes, and/or antibodies to proteins encoded by the prime or proxy genes, packaged together in the form of a kit. The oligonucleotides can be fragments of the prime or proxy genes. For example, the oligonucleotides can be 200, 150, 100, 50, 25, 10 or fewer nucleotides in length. In another embodiment, the detection reagent is one or more antibodies that specifically identify one or more proteins encoded by a prime or proxy gene.

[000291] The kit may contain in separate containers a nucleic acid or antibody (either already bound to a solid matrix or packaged separately with reagents for binding them to the matrix), control formulations (positive and/or negative), and/or a detectable label. Instructions (e.g., written, tape, VCR, CD-ROM, etc.) for carrying out the assay may be included in the kit. The assay may, for example, be in the form of PCR, a Northern hybridization or a sandwich ELISA, as known in the art.

[000292] For example, the kit may comprise one or more antibodies or antibody fragments which specifically bind to a protein equivalent of a prime or proxy gene described herein. The antibodies may be conjugated to a solid support suitable for a diagnostic assay (e.g., beads, plates, slides or wells formed from materials such as latex or polystyrene) in accordance with known techniques, such as precipitation. Antibodies as described herein may likewise be conjugated to detectable groups such as radiolabels (e.g., ³⁵S, ¹²⁵I, ¹³¹I), enzyme labels (e.g., horseradish peroxidase, alkaline phosphatase), and fluorescent labels (e.g., fluorescein) in accordance with known techniques. Alternatively, the kit comprises (a) an antibody conjugated to a solid support and (b) a second antibody of the invention conjugated to a detectable group, or (a) an antibody, and (b) a specific binding partner for the antibody conjugated to a detectable group.

[000293] In another embodiment, prime or proxy gene detection reagents can be immobilized on a solid matrix such as a porous strip to form at least one prostate cancer associated gene detection site. The measurement or detection region of the porous strip may include a plurality of sites containing a nucleic acid. A test strip may also contain sites for negative and/or positive controls. Alternatively, control sites can be located on a separate strip from the test strip. Optionally, the different detection sites may contain different amounts of immobilized nucleic acids, i.e., a higher amount in the first detection site and lesser amounts in subsequent sites. Upon the addition of test sample, the number of sites displaying a detectable signal provides a quantitative indication of the amount of prostate cancer associated genes present in the sample. The detection sites may be configured in any suitably detectable shape and are typically in the shape of a bar or dot spanning the width of a test strip.

[000294] Alternatively, prime or proxy genes can be labeled (e.g., with one or more fluorescent dyes) and immobilized on lyophilized beads to form at least one prime or proxy gene detection site. The beads may also contain sites for negative and/or positive controls. Upon addition of the test sample, the number of sites displaying a detectable signal provides a quantitative indication of the amount of prime/proxy genes present in the sample.

[000295] Alternatively, the kit contains a nucleic acid substrate array comprising one or more nucleic acid sequences. The nucleic acids on the array specifically identify one or more nucleic acid sequences represented by prime or proxy genes. In various embodiments, the expression of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 40 or 50 or more of the sequences represented by the genes can be identified by virtue of binding to the array. The substrate array can be on, e.g., a solid substrate, e.g., a "chip" as described in U.S. Patent No. 5,744,305. Alternatively, the substrate array can be a solution array, e.g., Luminex, Cyvera, Vitra and Quantum Dots' Mosaic.

[000296] The skilled artisan can routinely make antibodies and nucleic acid probes, e.g., oligonucleotides, aptamers, siRNAs, and antisense oligonucleotides, against any of the prime or proxy genes listed herein.

EXAMPLES

Example 1: Patient Population

Screening for prostate cancer with PSA testing is limited by a high number of false positives, particularly in the setting of benign prostatic hypertrophy (BPH). The goal of the studies described herein was to develop whole blood RNA transcript-based diagnostic tests that improve the diagnosis of untreated, localized prostate cancer over the use of the PSA test alone.

Several multi-gene models (i.e., Precision Profiles™) having improved discrimination between prostate cancer subjects and normal, healthy, or otherwise healthy subjects with BPH, over the use of PSA alone, are described herein. These multi-gene models were identified using RNA samples isolated from a "Training Set" of subjects, and validated using RNA samples isolated from a "Test Set" of subjects.

RNA was isolated from whole blood that was collected in PaxGene™ Blood RNA Tubes from a total of 204 prostate cancer subjects and 2 control groups consisting of age-matched, medically defined normal subjects (N=170) and otherwise healthy subjects with BPH (N=110), for a total of 484 subject samples. Blood RNA tubes were manually processed to total RNA. RNA quality and quantity were assessed on the Agilent Bioanalyzer 2100. RNA was converted to cDNA in a random hexamer primed reaction with reverse transcriptase. cDNA was quality checked and used as the template in a quantitative PCR assay optimized for precision and calibration.

The subject samples were divided into a training set and test set as follows:

Training Set:

A total of 76 untreated, localized prostate cancer subjects, 76 age-matched, medically defined normal, healthy subjects, and 30 age-matched BPH subjects (N_total=182) were selected to identify a preliminary biomarker panel. The 174 inflammation and cancer-related genes listed in the Precision Profile™ for Prostate Cancer Detection (Table 1) were assayed against RNA samples isolated from the training set. The resulting gene models identified using the gene expression analysis from these subject samples are described in the Examples below.

Test Set:

A total of 128 untreated, localized prostate cancer subjects, 94 medically defined age-matched normal subjects and 80 age-matched BPH subjects (N_total=302) were selected for validating the biomarker panel identified using the Training set. Twenty-one genes (selected from the training set) were assayed against RNA samples isolated from the test set. The resulting gene models identified using gene expression analysis based on these subject samples are described in the Examples below.

Age-Matching/Age-adjusted PSA cut-offs:

The prostate cancer subjects and normal subjects (with and without BPH) were age matched (i.e., selected to be similar in age to each other) within 5 years in both the Training and Test datasets, as reflected in column 1 of Figure 1A. In some examples, PSA levels of the subjects were also age-adjusted (represented by a dummy (dichotomous) variable coded 1 for all subjects (normal, BPH or CaP) whose PSA level fell above a given cut-off dependent on their age), as shown in Figure 1. The PSA cut-off levels applied to each given age range are shown in column 2 of Figure 1A. The mean PSA value by age and group (CaP, normal, BPH) is shown in Figure 1B and the percent meeting the age-adjusted PSA criteria is shown in Figure 1C.

The Examples below describe multi-gene logistic regression models capable of distinguishing between prostate cancer subjects; normal, healthy subjects; or otherwise healthy subjects with BPH.

Example 2: Enumeration and Classification Methodology based on Logistic Regression Models

Introduction

The following methods were used to generate gene models capable of distinguishing between subjects diagnosed with prostate cancer and normal subjects, with at least 75% classification accuracy, as described in the Examples below.

Given measurements on G genes from samples of N₁ subjects belonging to group 1 and N₂ members of group 2, the purpose was to identify models containing g < G genes which discriminate between the 2 groups. The groups might be such that one consists of reference subjects (e.g., healthy, normal subjects) while the other group might have a specific disease, or subjects in group 1 may have disease A while those in group 2 may have disease B.

Specifically, parameters from a linear logistic regression model were estimated to predict a subject's probability of belonging to group 1 given his (her) measurements on the g genes in the model. After all the models were estimated (all G 1-gene models were estimated, as well as all G*(G-1)/2 2-gene models, and all (G choose 3) = G*(G-1)*(G-2)/6 3-gene models based on G genes (the number of combinations taken 3 at a time from G)), they were evaluated using a 2-dimensional screening process. The first dimension employed a statistical screen (significance of incremental p-values) that eliminated models that were likely to overfit the data and thus may not validate when applied to new subjects. The second dimension employed a clinical screen to eliminate models for which the expected misclassification rate was higher than an acceptable level. As a threshold analysis, the gene models showing less than 75% discrimination between N₁ subjects belonging to group 1 and N₂ members of group 2 (i.e., misclassification of 25% or more of subjects in either of the 2 sample groups), and genes with incremental p-values that were not statistically significant, were eliminated.
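The number of candidate models implied by the enumeration above grows quickly with G. A minimal sketch of the counting and enumeration, using only the Python standard library, is given below; the function names `count_models` and `enumerate_models` and the placeholder gene labels are hypothetical.

```python
from itertools import combinations
from math import comb


def count_models(num_genes, max_genes_per_model=3):
    """Number of distinct g-gene models for g = 1 .. max_genes_per_model."""
    return {g: comb(num_genes, g) for g in range(1, max_genes_per_model + 1)}


def enumerate_models(genes, max_genes_per_model=3):
    """Yield every gene subset of size 1 .. max_genes_per_model."""
    for g in range(1, max_genes_per_model + 1):
        yield from combinations(genes, g)


# For the 174-gene panel mentioned in the text:
# 174 one-gene, 15,051 two-gene and 862,924 three-gene models.
counts_174 = count_models(174)

# Small illustration with placeholder gene labels.
small_models = list(enumerate_models(["GENE_A", "GENE_B", "GENE_C", "GENE_D"]))
```

The size of the 3-gene search space is one reason the text applies a statistical screen against overfitting in addition to the clinical screen.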

Methodological, Statistical and Computing Tools Used

The Latent GOLD program (Vermunt and Magidson, 2005) was used to estimate the logistic regression models. For efficiency in processing the models, the LG-Syntax™ Module available with version 4.5 of the program (Vermunt and Magidson, 2007) was used in batch mode, and all g-gene models associated with a particular dataset were submitted in a single run to be estimated. That is, all 1-gene models were submitted in a single run, all 2-gene models were submitted in a second run, etc.

The Data

The data consists of ΔC_T values for each sample subject in each of the 2 groups (e.g., prostate cancer subjects vs. reference subjects (e.g., healthy, normal subjects or otherwise healthy subjects with BPH)) on each of G(k) genes obtained from a particular class k of genes (e.g., the 174 inflammation and prostate cancer specific genes shown in Table 1).

Analysis Steps

The steps in a given analysis of the G(k) genes measured on N₁ subjects in group 1 and N₂ subjects in group 2 are as follows:

1) Eliminate low expressing genes: In some instances, target gene FAM measurements were beyond the detection limit (i.e., very high ΔC_T values which indicate low expression) of the particular platform instrument used to detect and quantify constituents of a Gene Expression Panel (Precision Profile™). To address the issue of "undetermined" gene expression measures as lack of expression for a particular gene, the detection limit was reset and the "undetermined" constituents were "flagged", as previously described. C_T normalization (ΔC_T) and relative expression calculations that have used re-set FAM C_T values were also flagged. In some instances, these low expressing genes (i.e., re-set FAM C_T values) were eliminated from the analysis in step 1 if 50% or more ΔC_T values from either of the 2 groups were flagged. Although such genes were eliminated from the statistical analyses described herein, one skilled in the art would recognize that such genes may be relevant in a disease state.

2) Estimate logistic regression (logit) models predicting P(i) = the probability of being in group 1 for each subject i = 1, 2, ..., N₁+N₂. Since there are only 2 groups, the probability of being in group 2 equals 1 - P(i). The maximum likelihood (ML) algorithm implemented in Latent GOLD 4.0 (Vermunt and Magidson, 2005) was used to estimate the model parameters. All 1-gene models were estimated first, followed by all 2-gene models and, in cases where the sample sizes N₁ and N₂ were sufficiently large, all 3-gene models.

3) Screen out models that fail to meet the statistical or clinical criteria: Regarding the statistical criteria, models were retained if the incremental p-values for the parameter estimates for each gene (i.e., for each predictor in the model) fell below the cut-off point alpha = 0.05.

Regarding the clinical criteria, models were retained if the percentage of cases within each group (e.g., disease group, and reference group (e.g., healthy, normal subjects)) that was correctly predicted to be in that group was at least 75%. For technical details, see the section "Application of the Statistical and Clinical Criteria to Screen Models".

4) Each model yielded an index that could be used to rank the sample subjects. Such an index value could also be computed for new cases not included in the sample. See the section "Computing Model-based Indices for each Subject" for details on how this index was calculated.

5) A cut-off value somewhere between the lowest and highest index value was selected and, based on this cut-off, subjects with indices above the cut-off were classified (predicted to be) in the disease group, and those below the cut-off were classified into the reference group (i.e., normal, healthy subjects). Based on such classifications, the percent of each group that is correctly classified was determined. See the section labeled "Classifying Subjects into Groups" for details on how the cut-off was chosen.

6) Among all models that survived the screening criteria (Step 3), an entropy-based R² statistic was used to rank the models from high to low, i.e., from the models with the highest percent classification rate to those with the lowest percent classification rate. The top 5 such models were then evaluated with respect to the percent correctly classified, and the one having the highest percentages was selected as the single "best" model. While there are several possible R² statistics that might be used for this purpose, it was determined that the one based on entropy was most sensitive to the extent to which a model yields clear separation between the 2 groups. Such sensitivity provides a model which can be used as a tool by a practitioner (e.g., primary care physician, oncologist, etc.) to ascertain the necessity of future screening or treatment options. For more detail on this issue, see the section labeled "Using R² Statistics to Rank Models" below.
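The elimination rule of step 1 (drop a gene when 50% or more of its values are flagged in either group) may be sketched as follows. The data layout, the function name `drop_low_expressing` and the gene labels are hypothetical assumptions made for illustration; the disclosure does not specify an implementation.

```python
def drop_low_expressing(flags_group1, flags_group2, threshold=0.5):
    """Return the genes retained after the step 1 filter.

    flags_groupN maps a gene label to a list of booleans, one per subject in
    that group, where True means the value was flagged (re-set detection limit).
    A gene is dropped when the flagged fraction reaches the threshold in
    either group.
    """
    kept = []
    for gene in flags_group1:
        frac1 = sum(flags_group1[gene]) / len(flags_group1[gene])
        frac2 = sum(flags_group2[gene]) / len(flags_group2[gene])
        if frac1 < threshold and frac2 < threshold:
            kept.append(gene)
    return kept


# Hypothetical flags for two placeholder genes in two groups of four subjects.
group1_flags = {"GENE_A": [False, False, True, False],
                "GENE_B": [True, True, False, False]}
group2_flags = {"GENE_A": [False, False, False, False],
                "GENE_B": [False, False, False, False]}
retained = drop_low_expressing(group1_flags, group2_flags)
```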

Computing Model-based Indices for each Subject

The model parameter estimates were used to compute a numeric value (logit, odds or probability) for each diseased and reference subject (e.g., healthy, normal subject) in the sample. For illustrative purposes only, in an example of a 2-gene logit model for prostate cancer containing the genes ALOX5 and S100A6, the following parameter estimates listed in Table A were obtained:

Table A:

For a given subject with particular ΔC_T values observed for these genes, the predicted logit associated with prostate cancer vs. reference (i.e., normals) was computed as:

LOGIT (ALOX5, S100A6) = [alpha(1) - alpha(2)] + beta(1)*ALOX5 + beta(2)*S100A6.

The predicted odds of having prostate cancer would be:

ODDS (ALOX5, S100A6) = exp[LOGIT (ALOX5, S100A6)]

and the predicted probability of belonging to the prostate cancer group is:

P (ALOX5, S100A6) = ODDS (ALOX5, S100A6) / [1 + ODDS (ALOX5, S100A6)]
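The three expressions above chain together as logit, then odds, then probability. A minimal sketch follows; the function names and the parameter values in the example are hypothetical placeholders and are not the estimates of Table A.

```python
import math


def predicted_logit(alpha1, alpha2, beta1, beta2, alox5, s100a6):
    """LOGIT = [alpha(1) - alpha(2)] + beta(1)*ALOX5 + beta(2)*S100A6."""
    return (alpha1 - alpha2) + beta1 * alox5 + beta2 * s100a6


def predicted_odds(logit):
    """ODDS = exp(LOGIT)."""
    return math.exp(logit)


def predicted_probability(odds):
    """P = ODDS / (1 + ODDS)."""
    return odds / (1.0 + odds)


# Hypothetical parameter estimates and delta-CT values, for illustration only.
example_logit = predicted_logit(alpha1=1.0, alpha2=-1.0, beta1=-2.0, beta2=2.0,
                                alox5=15.0, s100a6=14.0)
example_p = predicted_probability(predicted_odds(example_logit))
```

A logit of 0 corresponds to odds of 1 and a predicted probability of 0.5, which is the boundary used by a cut-off of 0.5.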

Note that the ML estimates for the alpha parameters were based on the relative proportion of the group sample sizes. Prior to computing the predicted probabilities, the alpha estimates may be adjusted to take into account the relative proportion in the population to which the model will be applied (e.g., the incidence of prostate cancer in the population of adult men in the U.S.).

Classifying Subjects into Groups

The "modal classification rule" was used to predict into which group a given case belongs. This rule classifies a case into the group for which the model yields the highest predicted probability. Using the same prostate cancer example previously described (for illustrative purposes only), use of the modal classification rule would classify any subject having P > 0.5 into the prostate cancer group, and the others into the reference group (e.g., healthy, normal subjects). The percentage of all N₁ prostate cancer subjects that were correctly classified was computed as the number of such subjects having P > 0.5 divided by N₁. Similarly, the percentage of all N₂ reference (e.g., normal healthy) subjects that were correctly classified was computed as the number of such subjects having P < 0.5 divided by N₂. Alternatively, a cut-off point P₀ could be used instead of the modal classification rule so that any subject i having P(i) > P₀ is assigned to the prostate cancer group, and otherwise to the reference group (e.g., normal, healthy group).

Application of the Statistical and Clinical Criteria to Screen Models

Clinical screening criteria

In order to determine whether a model met the clinical 75% correct classification criteria, the following approach was used:

A. All sample subjects were ranked from high to low by their predicted probability P (e.g., see Table B).

B. Taking P₀(i) = P(i) for each subject, one at a time, the percentage of group 1 and group 2 that would be correctly classified, P₁(i) and P₂(i), was computed.

C. The information in the resulting table was scanned and any models for which none of the potential cut-off probabilities met the clinical criteria (i.e., no cut-offs P₀(i) exist such that both P₁(i) > 0.75 and P₂(i) > 0.75) were eliminated. Hence, models that did not meet the clinical criteria were eliminated.

The example shown in Table B has many cut-offs that meet these criteria. For example, the cut-off P₀ = 0.4 yields correct classification rates of 92% for the reference group (i.e., normal, healthy subjects), and 93% for prostate cancer subjects.
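The scan over candidate cut-offs in steps A to C may be sketched as follows. The function name `passes_clinical_screen` and the example data are hypothetical. The sketch treats the 75% requirement as "at least 75%", as stated in the analysis steps, which is an assumption where the text elsewhere writes a strict inequality.

```python
def passes_clinical_screen(probabilities, in_group1, min_rate=0.75):
    """True if some subject's P, used as cut-off, classifies both groups well.

    probabilities: predicted probability of group 1 membership per subject
    in_group1: True for group 1 subjects, False for group 2 subjects
    """
    n1 = sum(1 for g in in_group1 if g)
    n2 = len(in_group1) - n1
    for cutoff in sorted(set(probabilities)):
        correct1 = sum(1 for p, g in zip(probabilities, in_group1)
                       if g and p > cutoff)
        correct2 = sum(1 for p, g in zip(probabilities, in_group1)
                       if not g and p <= cutoff)
        if correct1 / n1 >= min_rate and correct2 / n2 >= min_rate:
            return True
    return False


# Hypothetical models: one separates the groups cleanly, one does not.
separable = passes_clinical_screen(
    [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1],
    [True, True, True, True, False, False, False, False])
inseparable = passes_clinical_screen(
    [0.9, 0.1, 0.8, 0.2],
    [True, True, False, False])
```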

Statistical screening criteria

In order to determine whether a model met the statistical criteria, the following approach was used to compute the incremental p-value for each gene g = 1, 2, ..., G:

i. Let LSQ(0) denote the overall model L-squared output by Latent GOLD for an unrestricted model.

ii. Let LSQ(g) denote the overall model L-squared output by Latent GOLD for the restricted version of the model where the effect of gene g is restricted to 0.

iii. With 1 degree of freedom, use a 'components of chi-square' table to determine the p-value associated with the LR difference statistic LSQ(g) - LSQ(0).

Note that this approach required estimating g restricted models as well as 1 unrestricted model.
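In place of a table lookup, the p-value in step iii can be obtained in closed form, because the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)). The sketch below uses that identity; the function name `incremental_p_value` and the L-squared values in the example are hypothetical, and the L-squared inputs are assumed to have been obtained separately (e.g., from the estimation software).

```python
import math


def incremental_p_value(lsq_restricted, lsq_unrestricted):
    """p-value of the LR difference LSQ(g) - LSQ(0) on 1 degree of freedom."""
    lr_difference = max(lsq_restricted - lsq_unrestricted, 0.0)
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lr_difference / 2.0))


def gene_is_significant(lsq_restricted, lsq_unrestricted, alpha=0.05):
    """Apply the statistical screen at the stated cut-off alpha = 0.05."""
    return incremental_p_value(lsq_restricted, lsq_unrestricted) < alpha


# Hypothetical L-squared values: a difference of 6.0 gives p of roughly 0.014.
example_p_value = incremental_p_value(106.0, 100.0)
```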

Discrimination Plots

For a 2-gene model, a discrimination plot consisted of plotting the ΔC_T values for each subject in a scatterplot where the values associated with one of the genes served as the vertical axis, the other serving as the horizontal axis. Two different symbols were used for the points to denote whether the subject belongs to group 1 or 2.

A line was appended to a discrimination graph to illustrate how well the 2-gene model discriminated between the 2 groups. The slope of the line was determined by computing the ratio of the ML parameter estimate associated with the gene plotted along the horizontal axis divided by the corresponding estimate associated with the gene plotted along the vertical axis. The intercept of the line was determined as a function of the cut-off point.
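One way to obtain such a line is to set the model logit equal to the logit of the cut-off and solve for the vertical-axis gene. The sketch below makes that assumption; under this sign convention the slope is the negative of the ratio of the two estimates. The function name `discrimination_line` and the parameter values are hypothetical.

```python
import math


def discrimination_line(alpha, beta_horizontal, beta_vertical, cutoff=0.5):
    """Slope and intercept of the boundary where predicted P equals the cut-off.

    Solves alpha + beta_horizontal * x + beta_vertical * y = logit(cutoff)
    for y, where alpha is the net intercept of the logit model.
    """
    target_logit = math.log(cutoff / (1.0 - cutoff))
    slope = -beta_horizontal / beta_vertical
    intercept = (target_logit - alpha) / beta_vertical
    return slope, intercept


# Hypothetical estimates, for illustration only.
example_slope, example_intercept = discrimination_line(
    alpha=2.0, beta_horizontal=-2.0, beta_vertical=2.0, cutoff=0.5)
```

Changing the cut-off shifts the intercept only, leaving the slope unchanged, which matches the statement that the intercept is a function of the cut-off point.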

For a 3-gene model, a 2-dimensional slice defined as a linear combination of 2 of the genes was plotted along one of the axes, the remaining gene being plotted along the other axis. The particular linear combination was determined based on the parameter estimates. For example, if a 3rd gene were added to the 2-gene model consisting of ALOX5 and S100A6 and the parameter estimates for ALOX5 and S100A6 were beta(1) and beta(2) respectively, the linear combination beta(1)*ALOX5 + beta(2)*S100A6 could be used. This approach can be readily extended to the situation with 4 or more genes in the model by taking additional linear combinations. For example, with 4 genes one might use beta(1)*ALOX5 + beta(2)*S100A6 along one axis and beta(3)*gene3 + beta(4)*gene4 along the other, or beta(1)*ALOX5 + beta(2)*S100A6 + beta(3)*gene3 along one axis and gene4 along the other axis. When producing such plots with 3 or more genes, genes with parameter estimates having the same sign were chosen for combination.

Using R² Statistics to Rank Models

The R² in traditional OLS (ordinary least squares) linear regression of a continuous dependent variable can be interpreted in several different ways, such as 1) proportion of variance accounted for, 2) the squared correlation between the observed and predicted values, and 3) a transformation of the F-statistic. When the dependent variable is not continuous but categorical (in our models the dependent variable is dichotomous: membership in the diseased group or reference group), this standard R² defined in terms of variance (see definition 1 above) is only one of several possible measures. The term 'pseudo R²' has been coined for the generalization of the standard variance-based R² for use with categorical dependent variables, as well as other settings where the usual assumptions that justify OLS do not apply.

The general definition of the (pseudo) R² for an estimated model is the reduction of errors compared to the errors of a baseline model. For the purpose of the present invention, the estimated model is a logistic regression model for predicting group membership based on 1 or more continuous predictors (ΔC_T measurements of different genes). The baseline model is the regression model that contains no predictors; that is, a model where the regression coefficients are restricted to 0. More precisely, the pseudo R² is defined as:

R² = [Error(baseline)- Error(model)]/Error(baseline)

Regardless of how error is defined, if prediction is perfect, Error(model) = 0, which yields R² = 1. Similarly, if all of the regression coefficients do in fact turn out to equal 0, the model is equivalent to the baseline, and thus R² = 0. In general, this pseudo R² falls somewhere between 0 and 1.

When Error is defined in terms of variance, the pseudo R² becomes the standard R². When the dependent variable is dichotomous group membership, scores of 1 and 0, -1 and +1, or any other 2 numbers for the 2 categories yield the same value for R². For example, if the dichotomous dependent variable takes on the scores of 1 and 0, the variance is defined as P*(1-P), where P is the probability of being in 1 group and 1-P the probability of being in the other.

A common alternative in the case of a dichotomous dependent variable is to define error in terms of entropy. In this situation, entropy can be defined as -[P*ln(P) + (1-P)*ln(1-P)] (for further discussion of the variance and the entropy based R², see Magidson, Jay, "Qualitative Variance, Entropy and Correlation Ratios for Nominal Dependent Variables," Social Science Research 10 (June), pp. 177-194).
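An entropy-based pseudo R² can be sketched by taking the error of each model to be its minus mean log-likelihood, with the baseline error equal to the standard binary entropy -[P*ln(P) + (1-P)*ln(1-P)] of the observed group proportion. The function names and example values below are hypothetical, and the sketch is not asserted to reproduce the output of any particular software.

```python
import math


def binary_entropy(p):
    """Entropy of a dichotomous variable with probability p (natural log)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_r2(probabilities, labels):
    """Pseudo R² = [Error(baseline) - Error(model)] / Error(baseline).

    probabilities: predicted probability of group 1, strictly between 0 and 1
    labels: 1 for group 1 subjects, 0 for group 2 subjects (both groups present)
    """
    n = len(labels)
    error_baseline = binary_entropy(sum(labels) / n)
    error_model = -sum(math.log(p if y == 1 else 1.0 - p)
                       for p, y in zip(probabilities, labels)) / n
    return (error_baseline - error_model) / error_baseline


# Hypothetical predictions: uninformative versus nearly perfect separation.
r2_uninformative = entropy_r2([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
r2_sharp = entropy_r2([0.99, 0.99, 0.01, 0.01], [1, 1, 0, 0])
```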

The R² statistic was used in the enumeration methods described herein to identify the "best" gene-model. R² can be calculated in different ways depending upon how the error variation and total observed variation are defined. For example, four different R² measures output by Latent GOLD are based on:

a) Standard variance and mean squared error (MSE)

b) Entropy and minus mean log-likelihood (-MLL)

c) Absolute variation and mean absolute error (MAE)

d) Prediction errors and the proportion of errors under modal assignment (PPE)

Each of these 4 measures equals 0 when the predictors provide zero discrimination between the groups, and equals 1 if the model is able to classify each subject into their actual group with 0 error. For each measure, Latent GOLD defines the total variation as the error of the baseline (intercept-only) model which restricts the effects of all predictors to 0. Then for each, R² is defined as the proportional reduction of errors in the estimated model compared to the baseline model. For the 2-gene prostate cancer example used to illustrate the enumeration methodology described herein, the baseline model classifies all cases as being in the diseased group since this group has a larger sample size, resulting in 50 misclassifications (all 50 normal subjects are misclassified) for a prediction error of 50/107 = 0.467. In contrast, there are only 10 prediction errors (10/107 = 0.093) based on the 2-gene model using the modal assignment rule, thus yielding a prediction error R² of 1 - 0.093/0.467 = 0.8. As shown in Exhibit 1, 4 normal and 6 cancer subjects would be misclassified using the modal assignment rule. Note that the modal rule utilizes P₀ = 0.5 as the cut-off. If P₀ = 0.4 were used instead, there would be only 8 misclassified subjects.
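The prediction error R² in the worked example above can be reproduced from the misclassification counts alone. The function name `prediction_error_r2` is hypothetical; the counts are those stated in the text.

```python
def prediction_error_r2(baseline_errors, model_errors, n_subjects):
    """Proportional reduction in prediction errors relative to the baseline."""
    error_baseline = baseline_errors / n_subjects
    error_model = model_errors / n_subjects
    return (error_baseline - error_model) / error_baseline


# Counts from the text: 50 baseline errors and 10 model errors among 107 subjects.
r2_modal = prediction_error_r2(50, 10, 107)
```

The same function can be applied to the 8 misclassifications reported for the cut-off P₀ = 0.4 to compare cut-offs on a common scale.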

To reduce the likelihood of obtaining models that capitalize on chance variations in the observed samples, the models may be limited to contain at most M genes as predictors. (Although a model may meet the significance criteria, it may overfit the data and thus would not be expected to validate when applied to a new sample of subjects.) For example, for M = 3, all models would be estimated which contain:

A. 1-gene models: G such models

B. 2-gene models: G*(G-1)/2 such models

C. 3-gene models: (G choose 3) = G*(G-1)*(G-2)/6 such models

Table B: ΔCT Values and Model Predicted Probability of Prostate Cancer for Each Subject

ALOX5 S100A6 p Group ALOX5 S100A6 p Group

13.92 16.13 1,0000 Cancer 16.52: 15.38 0.5343 i Cancer

13.90 15.77 1.0000 Cancer 15.54 13,67 0.5255 Norma!

13.75 15.17 1.0000 Cancer 15.28 13.11 0.4537 Cancer

13.62 14.51 t.0000 Cancer 15,96 1 ,23 0.4207 Cancer

15.33 17.16 1.0000 Cancer 15,96 14.20 0.3928 Norma! 13.86 14.61 1.0000 Cancer 16,25 14.69 03887 Can er

14.14 15.09 1.0000 Cancer 5604 14.32 03874 Can er

13.49 13.60 0.9999 Cancer 56.28 14.71 0.3883 Normal

15 24 1661: 0.9999 Cancer 55,97 14 18 0.371:0 Cancer

14.03 14,45 0.9999 Cancer 5,93 14.06 0.3407 Norma!

14.98 16.05 0.9999 Cancer 16,23 14.41 0.2378 Cancer

13.95 14.25 0.9999 Cancer 16.02 13 91 0.1743 Norma! 14.09 14.13 o. m Cancer 15.99 13 78 0.1501: Norma!

15.01 15.89 0.9997 Cancer 18.74 15,05 0.1389 Norma!

14.13 14.15 0.9997 Cancer 16.86 1 ,90 0.1349 Norma!

14.37 16.91 15,20 0.0994 Norma!

14.43 0.9996 Cancer

16,47

er 14.31 0.0721 Normal

14.14 13.88 0.9994 Canc

1663 14.57 00672 Normal

14.33 14.17 0.9993 Cancer

56,25 13,90 00663 Normal

14.97 15.06 0.9988 Cancer

56,82 14.84 00596 Normal

14.59 14.30 0.9984 Cancer

56,75 14.73 0.0587 Norma!

14.45 13.93 0.9978 Cancer

56,69 14.54 0.0474 Normal

14.40 13.77 0.9972 Cancer

17.13 15,25 0.0416 Norma!

14.72 14.31: 0.9971: Cancer

16,87 14.72 0.0329 Normal

14 81: 14 38 0.9963 Cancer 13.35 13 76 0.0285 Norma!

14.54 13.91: 0.9963 Cancer 18.41 13 83 0.0255 Norma!

14.88 14,48 0.9962 Cancer 18.88 1 ,20 0.0205 Norma!

14.85 14,42 0.9959 Cancer 16.58 3,97 0.01:69 Norma! 15,40 15.30 0.9951 Cancer 16.86 1 ,09 0.01:67 Norma!

15.58 15.30 0.9951 Cancer 16,92 1 ,49 0.0140 Normal

14.82 14.28 0.9950 Cancer 5693 14.51 00139 Normal

14.78 14.06 0.9924 Cancer 5727 15.04 00123 Normal

14.68 13.88 0.9922 Cancer 56,45 13,80 0.011:6 Normal

14.54 13.64 0.9922 Cancer 57.52 15.44 0.011:0 Norma!

15.86 15.91: 0.9920 Cancer 57.12 14.46 0.0051: Normal

15.71: 15.60 0.9908 Cancer 17.13 14.46 0.0048 Norma!

16.24 16.36 0.98:58 Cancer 16,78 13.86 0.0047 Normal

16.09 15.94 0.9774 Cancer 17.10 14 36 0.0041: Norma!

15.26 14.41: 0.9705 Cancer 18.75 13 69 0.0034 Norma!

14 93 13 81: 0.9693 Cancer 17.27 1 ,49 0.0027 Norma!

15,44 14.67 0.9670 Cancer 17.07 14.08 0.0022 Norma!

15.69 15.08 0.9663 Cancer 17.16 1 ,08 0.0014 Norma!

15,40 14.54 0.9615 Cancer 17.50 1 ,41 0.0007 Normal 15.80 15.21 0.9586 Cancer 57 50 14.18 00004 Norma!

15.98 15.43 0.9485 Cancer 57 45 14.02 00003 Normal

15.20 14.08 0.9461 Norma} 5753 13,90 00001 Normal

15.03 13,62 0.9196 Cancer 58,21 15.06 0.0001 Norma!

15.20 13.91 0.9184 Cancer 57.99 14.63 0.0001 Norma!

15.04 13.54 0.8972 Cancer .73 14.05 0.0001: Norma!

15.30 13.92 0.8774 Cancer 17.S7 14.40 0.0001 Norma! 15.80 14.68 0.8404 Cancer 17.S8 14 35 0.0001: Norma!

15.61 14.23 0.7939 Normal 18.47 15 16 0.0001 Norma!

15.89 14.64 0.7577 Normal 18.28 14.59 0.0000 Norma!

15.44 13.66 0.6445 Cancer 18.37 14.71 0.0000 Normal

These data support that Gene Expression Profiles with sufficient precision and calibration as described herein (1) can determine subsets of individuals with a known biological condition, particularly individuals with prostate cancer or individuals with conditions related to prostate cancer; (2) may be used to monitor the response of patients to therapy; (3) may be used to assess the efficacy and safety of therapy; and (4) may be used to guide the medical management of a patient by adjusting therapy to bring one or more relevant Gene Expression Profiles closer to a target set of values, which may be normative values or other desired or achievable values.

Gene Expression Profiles are used for characterization and monitoring of treatment efficacy of individuals with prostate cancer, or individuals with conditions related to prostate cancer. Use of the algorithmic and statistical approaches discussed above to achieve such identification and to discriminate in such fashion is within the scope of various embodiments herein.

Example 3: Discrimination of Prostate Cancer Subjects from Healthy, Normal Subjects Using RNA Transcript-Based Gene Expression: Training Dataset

The cDNA derived from patient blood samples, as described in Example 1, was quality checked and used as the template in a quantitative PCR assay optimized for precision and calibration. Custom primers and probes were prepared for the targeted 174 genes shown in the Precision Profile™ for Prostate Cancer Detection (shown in Table 1), selected to be informative relative to biological state of inflammation and prostate cancer. Individual target genes were multiplexed with 18S rRNA endogenous control. Assays were configured in a 384-well plate formatted for triplicate measures and run on the ABI Prism® 7900HT Sequence Detection System. Gene expression profiles for the 174 prostate cancer specific genes were analyzed using the RNA samples obtained from the Training Dataset (i.e., 76 prostate cancer, 76 medically defined age-matched normals, and 30 age-matched BPH), described in Example 1.

Logistic regression models yielding the best discrimination between subjects diagnosed with prostate cancer and normal subjects (excluding subjects with BPH) were generated using the enumeration and classification methodology described in Example 2. Data files were "filtered-by-rule" to ensure all replicate values met predefined metrics. Normalized gene expression values (delta CT values) for each amplified target gene were calculated (target gene CT - endogenous control CT). Logistic regression methodology was used to obtain all possible 1-, 2- and 3-gene models. Top qualifying 3-gene models were used to develop higher order models (4-6 gene) through stepwise regression technique. Several thousand logistic regression models were identified as capable of distinguishing between subjects diagnosed with prostate cancer and normal subjects (excluding subjects with BPH) with at least 75% accuracy. For example, a total of 11,105 3-gene models capable of distinguishing between subjects diagnosed with prostate cancer and normal subjects (excluding BPH) were identified. No additional predictors which discriminate between prostate cancer and normal subjects (e.g., PSA or age) were used in conjunction with these gene-models. As used in this Example, sensitivity refers to the percentage of prostate cancer subjects correctly classified by the gene models described herein, whereas specificity refers to the percentage of normal subjects (without BPH) correctly classified.
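As a rough illustration of the normalization and enumeration steps just described, the sketch below computes a delta CT value and counts the candidate 1-, 2- and 3-gene subsets for a 174-gene panel. Averaging the triplicate wells before subtraction is an assumption (the text states only that measures were taken in triplicate), the CT values are hypothetical, and the counts are an upper bound on candidate subsets rather than the number of models estimated in any particular analysis.

```python
from itertools import combinations
from math import comb
from statistics import mean


def delta_ct(target_ct_replicates, control_ct_replicates):
    """Delta CT = target gene CT - endogenous control CT.

    Assumption: replicate wells are averaged before subtracting.
    """
    return mean(target_ct_replicates) - mean(control_ct_replicates)


def count_models(n_genes: int, max_size: int) -> dict:
    """Number of distinct k-gene subsets for k = 1..max_size."""
    return {k: comb(n_genes, k) for k in range(1, max_size + 1)}


# Hypothetical triplicate CT readings for one target gene and the 18S control.
example_dct = delta_ct([20.0, 20.2, 19.8], [6.0, 6.1, 5.9])

# Candidate subsets for a 174-gene panel.
counts = count_models(174, 3)
print(counts)  # {1: 174, 2: 15051, 3: 862924}

# Enumerating the subsets themselves, shown on a tiny hypothetical panel.
tiny_panel = ["CD97", "CDK2", "SP1", "S100A6"]
three_gene_sets = list(combinations(tiny_panel, 3))
print(len(three_gene_sets))  # 4
```

Each enumerated subset would then be fitted as a separate logistic regression model and ranked, for example by entropy R².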

For example, the "best" 3-gene logistic regression model capable of distinguishing between prostate cancer subjects and normal, healthy subjects (defined as the model with the highest entropy R² value, as described in Example 2) based on the 174 genes included in the Precision Profile™ for Prostate Cancer Detection is CD97, CDK2 and SP1, capable of classifying normal subjects with 81.6% accuracy (81.6% specificity), and prostate cancer subjects with 81.6% accuracy (81.6% sensitivity). Each of the 76 normal RNA samples and the 76 prostate cancer RNA samples was analyzed for this 3-gene model; no values were excluded. This 3-gene model correctly classifies 62 of the normal subjects as being in the normal patient population, and misclassifies 14 of the normal subjects as being in the prostate cancer patient population. This 3-gene model correctly classifies 62 of the prostate cancer subjects as being in the prostate cancer patient population and misclassifies 14 of the prostate cancer subjects as being in the normal patient population.
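The sensitivity and specificity figures follow directly from the classification counts. A minimal sketch, using an illustrative helper name and the counts stated above for the CD97, CDK2 and SP1 model:

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)  # fraction of prostate cancer subjects correctly classified
    specificity = tn / (tn + fp)  # fraction of normal subjects correctly classified
    return sensitivity, specificity


# 62 of 76 prostate cancer subjects and 62 of 76 normal subjects correctly classified.
sens, spec = sensitivity_specificity(tp=62, fn=14, tn=62, fp=14)
print(f"{sens:.1%} {spec:.1%}")  # 81.6% 81.6%
```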

Example 4: Discrimination of Prostate Cancer Subjects from Healthy, Normal Subjects (without BPH) Using RNA Transcript-Based Gene Expression and PSA Values: Training Dataset

The PSA test is currently used as a predictor for identifying subjects with prostate cancer. However, this test is unreliable and results in a high incidence of false positives, especially in the setting of BPH, leading to additional costly and unnecessary testing.

PSA values were available for the 76 untreated, localized prostate cancer subjects and 76 age-matched normal subjects from the Training Dataset described in Example 1. The prostate cancer subjects and age-matched normal subjects had a median age of 60 years. These PSA values were used as the sole predictor to discriminate the prostate cancer subjects from the age-matched normal subjects. As shown in the ROC curve in Figure 2, PSA alone had a specificity of 94.7%, but sensitivity of only 71.1% for diagnosis of prostate cancer, using a cut-off of 4 ng/mL. When age-adjusted PSA was used as the sole predictor, age-adjusted PSA alone had a specificity of 90.8% but a sensitivity of only 77.6% for diagnosis of prostate cancer.

Stepwise methodology was used to determine whether transcript based gene expression combined with PSA levels could improve the sensitivity (i.e., percentage of prostate cancer subjects correctly classified) and specificity (i.e., percentage of normal, healthy subjects (without BPH) correctly classified) over the use of PSA testing alone. Both gene expression data and PSA were available for the 76 untreated, localized prostate cancer subjects and the 76 age-matched normal subjects from the Training Dataset described in Example 1. All possible 2- and 3-gene logit models were estimated based on the 174 target genes assayed (Table 1) and PSA using the methodology described in Example 2. A total of 790,244 3-gene models were enumerated. Of these 790,244 3-gene models, a total of 3,533 models displayed a specificity and sensitivity for diagnosis of prostate cancer of over 88%.

Note that the variable plnPSA used in these logit models was a logarithmic transformation of PSA in which PSA values less than 1 were recoded to 1 prior to taking the natural logarithm.
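The plnPSA transformation can be written in one line. The function name below is an illustrative assumption; the floor at 1 and the natural logarithm are as described above.

```python
import math


def pln_psa(psa_ng_per_ml: float) -> float:
    """Natural log of PSA after recoding values below 1 to 1."""
    return math.log(max(psa_ng_per_ml, 1.0))


print(pln_psa(0.4))            # 0.0 (values below 1 are floored, so ln(1) = 0)
print(round(pln_psa(4.0), 4))  # 1.3863
```

One consequence of the floor is that plnPSA is never negative, so very low PSA values all contribute the same (zero) amount to the logit score.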

The "best" 3-gene logistic regression model capable of distinguishing between prostate cancer subjects and normal healthy subjects when combined with PSA values (defined as the model with the highest entropy R² value, as described in Example 2) based on the 174 genes included in the Precision Profile™ for Prostate Cancer Detection is CD97, RP51077B9.4 and SP1, capable of classifying normal subjects with 89.5% accuracy, and prostate cancer subjects with 89.5% accuracy. Each of the 76 normal RNA samples and the 76 prostate cancer RNA samples was analyzed for this 3-gene model; no values were excluded. This 3-gene model correctly classifies 68 of the normal subjects as being in the normal patient population, and misclassifies 8 of the normal subjects as being in the prostate cancer patient population. This 3-gene model correctly classifies 68 of the prostate cancer subjects as being in the prostate cancer patient population and misclassifies 8 of the prostate cancer subjects as being in the normal patient population. This 3-gene logit model (CD97, RP51077B9.4 and SP1) was used to develop a 6-gene model, RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1, based on the Stepwise regression technique. This 6-gene model significantly improved prediction of prostate cancer compared with age-adjusted PSA. This 6-gene model was capable of distinguishing between prostate cancer subjects and normal, healthy subjects (without BPH) with 97.4% sensitivity and 96.1% specificity. A ROC curve for this 6-gene model compared to age-adjusted PSA criteria is shown in Figure 3. As shown in Figure 3, there is improved area under the ROC curve for the 6-gene model (AUC=0.946) as compared to age-adjusted PSA criteria alone (AUC=0.842). The AUC difference of 0.104 is statistically significant (p-value=0.005).

Transcript based gene expression levels of the 6-gene model, combined with PSA values of the 76 prostate cancer subjects and 76 age-matched normal subjects from the Training Dataset, gave even higher specificity (96.1%) and a much improved sensitivity (97.4%) for prostate cancer diagnosis (criterion: Prob (CaP) > .5) over the use of PSA alone (94.7% specificity, 71.1% sensitivity). A ROC curve for the 6-gene model + PSA model compared to age-adjusted PSA alone is shown in Figure 4. Improved area under the ROC curve further supports the improved discrimination of prostate cancer versus age-matched normal subjects when combining PSA with gene expression as compared to PSA alone. As shown in Figure 4, the area under the ROC curve is 0.842 for age-adjusted PSA alone compared to 0.994 for PSA+6 genes. This improvement is statistically significant (p-value = 2.0E-06).

The 6-Gene Model + PSA retains its superiority over age-adjusted PSA alone when BPH Subjects were included with the normal subjects without BPH. The 6-gene+PSA model yielded a sensitivity of 97.4% and specificity of 91.5% for discriminating between prostate cancer subjects and normal subjects (with and without BPH; CaP (N=76) vs. Normals (N=76), BPH (N=30)). In contrast, age-adjusted PSA alone yielded a sensitivity of only 77.6% and a specificity of only 87.7% when BPH subjects were included with normal subjects without BPH.

The 6-gene model, RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1, did not over-fit based on K-fold cross-validation. The following analysis was done to test for over-fitting: a) the data were randomly split into K=10 equal sized sub samples; b) the target model was re-estimated 10 times, each time omitting 1 sub sample; c) the re-estimated model was applied to the omitted sub sample; d) results were accumulated across all sub samples; e) validation log likelihood (Validation LL) was calculated (standard LL always increases when an additional gene is included in the model, whereas Validation LL should decrease if the additional gene is extraneous).

The subjects in the Training Dataset with PSA values between 2 ng/mL and 4 ng/mL included a large number of both prostate cancer subjects and normal subjects. As shown in Figure 5, when using PSA alone to discriminate between prostate cancer subjects and normal subjects with PSA values between 2 ng/mL and 4 ng/mL, 22 prostate cancer subjects are misclassified based on a cut-off of 4.0 ng/mL. However, 17 prostate cancer subjects and 17 age-matched normal subjects have PSA between 2 ng/mL and 4 ng/mL. Thus, reducing the cut-off below 4 ng/mL results in many false positive diagnoses. In contrast, when using the transcript based gene expression levels of the 6-gene model, RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1, combined with PSA values, only 2 of the 76 prostate cancer subjects and 3 of the 76 normal subjects are misclassified based on a cut-off of 0.5 (cut-off shown by arrow on Y-axis, Figure 5). Additionally, the 65 subjects with the highest model scores are all prostate cancer subjects, while the 65 subjects with the lowest model scores are all normal subjects.
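The K-fold over-fitting check in steps a) through e) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as actually implemented: the model-fitting step is left as a pluggable pair of functions (`fit`, `predict_proba`, both hypothetical names), and the demonstration uses an intercept-only placeholder model on synthetic labels purely to exercise the loop.

```python
import math
import random


def kfold_validation_ll(X, y, fit, predict_proba, k=10, seed=0):
    """Accumulate the out-of-fold log-likelihood over K subsamples."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)             # a) random split
    folds = [idx[i::k] for i in range(k)]        #    into K near-equal subsamples
    total_ll = 0.0
    for held_out in folds:                       # b) re-estimate K times
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:                       # c) apply to the omitted subsample
            p = predict_proba(model, X[i])
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total_ll += math.log(p) if y[i] == 1 else math.log(1.0 - p)  # d) accumulate
    return total_ll                              # e) Validation LL


# Placeholder model (assumption): predicts the training prevalence for every subject.
def fit_intercept_only(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_intercept_only(model, x):
    return model


# Synthetic labels, not data from the study.
X_demo = [[0.0] for _ in range(100)]
y_demo = [1] * 50 + [0] * 50
validation_ll = kfold_validation_ll(X_demo, y_demo, fit_intercept_only, predict_intercept_only)
print(validation_ll)
```

In practice the placeholder would be replaced by the logistic regression fit for the candidate gene set, and the Validation LL of a model with and without an additional gene would be compared.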

Thus, the use of the 6-gene model combined with PSA provides excellent discrimination between prostate cancer and age-matched normal subjects.

A discrimination plot of the 6-gene model, RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1, combined with PSA is shown in Figure 6. As shown in Figure 6, the normal subjects are represented by circles, whereas prostate cancer subjects are represented by X's. The line appended to the discrimination graph in Figure 6 illustrates how well the 6-gene model plus PSA discriminates between the 2 groups. Values above the line represent subjects predicted by the 6-gene plus PSA model to be in the prostate cancer population. Values below the line represent subjects predicted to be in the normal subject population. As shown in Figure 6, only 3 normal subjects (circles) and 2 prostate cancer subjects (X's) are classified in the wrong patient population.

Individual subject predicted probability scores based on the 6-gene model, RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1, combined with PSA also provide good separation of prostate cancer subjects from age-matched normal subjects. As shown in Figure 7, many prostate cancer subjects have predicted probability of 0.8 or higher of having prostate cancer (above arrow shown on Y-axis, Figure 7). Using a cut-off probability of 0.5 (probability (CaP)) misclassifies only 2 prostate cancer subjects and only 3 normal subjects.

This 6-gene model was validated using RNA samples from the Test Dataset, as described in Example 6 below. In addition to this 6-gene model, five 2-gene models, two 4-gene models, and an additional 6-gene model, as shown in Table 3, all capable of distinguishing between prostate cancer subjects and normal subjects (without BPH) with over 75% sensitivity and specificity, will be validated using the Test data set population, as described in Example 6 below.

Example 5: Discrimination of Prostate Cancer Subjects from Healthy, Subjects With Benign Prostatic Hyperplasia (BPH) Using RNA Transcript-Based Gene Expression and Age-Adjusted PSA Values: Training Dataset

Example 4 describes several thousand 2-, 3- and 4-gene models and a 6-gene model which improve the specificity and sensitivity of prostate cancer screening when combined with PSA values, over the use of PSA testing alone. The data presented in this Example demonstrate that age-adjusted PSA values, when combined with transcript based gene expression, can also improve sensitivity (i.e., percentage of prostate cancer subjects correctly classified) and specificity (i.e., percentage of BPH subjects correctly classified) of prostate cancer screening over the use of age-adjusted PSA values alone.

The 76 prostate cancer subjects, 76 normal subjects, and 30 BPH subjects in the Training Dataset were age-matched as shown in Figure 1, and PSA values were age adjusted. The age-adjusted PSA criterion was represented by a dummy (dichotomous) variable coded 1 for all subjects (normal, BPH or CaP) if their PSA level fell above the given cut-off dependent on their age, as shown in Figure 1. The prostate cancer cohort had a median age of 60 years, while the BPH cohort had a median age of 70 years.

Using age-adjusted PSA criteria as the sole predictor to screen for prostate cancer among the 76 untreated, localized prostate cancer subjects and 106 normal subjects (combined normal and BPH subjects) resulted in a specificity of 88.1% and sensitivity of only 77.6% for diagnosis of prostate cancer.

Using age-adjusted PSA criteria as the sole predictor to screen for prostate cancer among the 76 untreated, localized prostate cancer subjects and the 30 BPH subjects resulted in a specificity of 86.7% and sensitivity of 88.2%. A ROC curve demonstrating the ability of age and PSA to discriminate the prostate cancer patients from the BPH subjects of the Training Dataset is shown in Figure 8.

Stepwise methodology was used to identify multi-gene models which combined with age-adjusted PSA levels could improve the sensitivity and specificity over the use of age-adjusted PSA values alone to discriminate between the prostate cancer subjects and BPH subjects. Both gene expression data and PSA values were available for the 76 untreated, localized prostate cancer subjects and 30 BPH subjects from the Training Dataset described in Example 1. All possible 3-gene logit models were estimated based on the 174 target genes assayed (Table 1) and age-adjusted PSA values using the methodology described in Example 2, resulting in an enumeration of 5,597 3-gene models which discriminate between prostate cancer and BPH subjects with at least 75% correct classification.

The "best" logistic regression model capable of distinguishing between prostate cancer subjects and BPH subjects when combined with age and PSA (defined as the model with the highest entropy R² value, as described in Example 2) based on the 174 genes included in the Precision Profile™ for Prostate Cancer Detection is MAP2K1, MYC and S100A6, capable of classifying BPH subjects with 90% accuracy, and prostate cancer subjects with 89.5% accuracy. Each of the 30 BPH RNA samples and the 76 prostate cancer RNA samples was analyzed for this 3-gene model; no values were excluded. This 3-gene model correctly classifies 27 of the BPH subjects as being in the BPH patient population, and misclassifies 3 of the BPH subjects as being in the prostate cancer patient population. This 3-gene model correctly classifies 68 of the prostate cancer subjects as being in the prostate cancer patient population and misclassifies 8 of the prostate cancer subjects as being in the BPH patient population.

This 3-gene logit model (MAP2K1, MYC and S100A6) was used to develop a 5-gene model, S100A6, MYC, MAP2K1, C1QA and RP51077B9.4, based on the Stepwise regression technique. Transcript based gene expression levels of the 5-gene model integrated with PSA and age gave higher specificity (93.3% of BPH subjects correctly classified) and a much improved sensitivity (96.1% of prostate cancer subjects correctly classified) for prostate cancer diagnosis over the use of PSA and age alone (86.7% specificity, 88.2% sensitivity, as shown in Figure 8).

A ROC curve for the 5-gene+PSA+Age model is shown in Figure 9. Improved area under the ROC curve further supports the improved discrimination of prostate cancer versus BPH subjects when combining PSA and age with gene expression as compared to age-adjusted PSA alone. As shown in Figure 10, the area under the ROC curve is 0.871 for the model based on PSA and age alone, as compared to 0.989 when expression values for the 5-gene model are included with PSA and age. This improvement is statistically significant (p-value = 0.0001).

A discrimination plot of the 5-gene model, S100A6, MYC, MAP2K1, C1QA and RP51077B9.4, combined with PSA + age is shown in Figure 11. As shown in Figure 11, the BPH subjects are represented by circles, whereas prostate cancer subjects are represented by X's. The line appended to the discrimination graph in Figure 11 illustrates how well the 5-gene model combined with PSA and age discriminates between the 2 groups. Values above the line represent subjects predicted by the 5-gene+PSA+age model to be in the BPH subject population. Values below the line represent subjects predicted to be in the prostate cancer population. As shown in Figure 11, only 2 of the 30 BPH subjects (circles) and 3 of the 76 prostate cancer subjects (X's) are classified in the wrong patient population. However, all 5 misclassifications are close to the discrimination line.

Individual subject predicted probability scores based on the 5-gene model, S100A6, MYC, MAP2K1, C1QA and RP51077B9.4, combined with PSA + age also provide good separation of prostate cancer subjects from BPH subjects, as shown in Figure 12. The cut-off probability (probability (CaP)) can be modulated to alter sensitivity and specificity of the model. For example, a cut-off probability of 0.17 yields a sensitivity of 100% (all prostate cancer subjects above the cut-off line), which reduces the specificity to 87% (26 of 30 BPH subjects below the line).
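The trade-off described above, where lowering the cut-off raises sensitivity at the cost of specificity, can be illustrated with a small sketch. The helper name and the eight predicted probabilities are hypothetical and are not values from the study.

```python
def sens_spec_at_cutoff(scores, labels, cutoff):
    """labels: 1 = CaP, 0 = BPH. A subject is called CaP when score > cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities of CaP for 4 CaP and 4 BPH subjects.
scores = [0.95, 0.80, 0.30, 0.20, 0.40, 0.25, 0.05, 0.15]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(sens_spec_at_cutoff(scores, labels, 0.5))   # (0.5, 1.0)
print(sens_spec_at_cutoff(scores, labels, 0.17))  # (1.0, 0.5)
```

Sweeping the cut-off over all observed scores and plotting sensitivity against (1 - specificity) yields the ROC curve.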

Two or more of the gene-models described herein can be used incrementally or iteratively to provide almost perfect discrimination of prostate cancer patients from non-prostate cancer patients (normals and BPH). For example, combining the 6-gene (RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1)+PSA model which discriminates between prostate cancer and normal, healthy subjects (described in Example 4 and validated in Example 6) with the 5-gene (S100A6, MYC, MAP2K1, C1QA and RP51077B9.4)+PSA+age model which discriminates between prostate cancer and BPH subjects provides almost perfect discrimination. As shown in Figure 13, prostate cancer subjects are almost exclusively in the upper right quadrant, above the cut-off on the prostate cancer versus normals model (cut off shown as the horizontal line intersecting the Y-axis) and above the cut-off on the prostate cancer versus BPH model (cut off shown as the vertical line intersecting the Y-axis).

This 5-gene model, in addition to the three 1-gene models, five 2-gene models, three 3-gene models, and three 5-gene models, as shown in Table 4, all capable of distinguishing between prostate cancer subjects and normal subjects with BPH with over 75% sensitivity and specificity, will be validated using the RNA samples from the Test Dataset.

Example 6: Discrimination of Prostate Cancer Subjects from Healthy, Normal Subjects (without BPH) Using RNA Transcript-Based Gene Expression: Test Dataset

RNA samples from the Test Dataset were used to validate the 6-gene (RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1) model's ability to discriminate between prostate cancer subjects and normal subjects (without BPH), identified using samples from the Training Dataset, as described in Example 4.

A comparison of differences in mean delta CT values for prostate cancer patients versus normal subjects demonstrated high consistency between training and test sample measurements for the 6-gene (RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1) model (see Figure 14, mean delta CT differences (CaP-normals) with associated 95% confidence intervals).

Validation of 6-gene logit model alone (i.e., not combined with PSA)

Validation of the 6-gene logit model followed a pre-specified validation plan as follows:

Test A:

a) Using pre-specified model coefficients (beta) established in TRAINING dataset compute a model logit score.

b) Apply pre-specified cut-point established in the TRAINING dataset to yield 2 groups.

Subjects with logit scores above 0 (predicted probability of CaP = 0.5) are predicted to be CaP and those with scores below 0 are predicted to be healthy normal.

c) Form 2 x 2 table of frequency counts (actual by predicted classification). Compute likelihood ratio chi-squared (L²) and derive p-value with 1 degree of freedom. A validation p-value<.05 constitutes a successful validation (meaning test results deviate from independence with 95% confidence).

Test B:

a) Repeat Test A using model coefficients (beta parameters) re-estimated on the test dataset.

b) Validation is successful if the re-estimated beta parameters are in the same direction as the original model and the predictions based on the logit cut-point of 0 result in a p-value <0.05.

Test C:

a) Using pre-specified model coefficients established in the TRAINING dataset compute a model logit score.

b) Construct comparative ROC curves using the 6-gene model logit score vs. the age-adjusted PSA criterion. The model validates if the improvement in the area under the curve (AUC) associated with the 6-gene logit model vs. age-adjusted PSA is significant (p<0.05).
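Steps b) and c) of Test A can be sketched as below: subjects are classified at a logit cut-point of 0, and the likelihood ratio chi-squared (L²) is computed for the resulting 2 x 2 table. Function names are illustrative assumptions. The p-value uses the standard identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x/2)). The example counts are the Training Dataset counts reported for the 3-gene model in Example 3, used purely to exercise the calculation; they are not Test Dataset results.

```python
import math


def classify(logit_score: float) -> str:
    """Logit score above 0 (predicted probability of CaP above 0.5) is called CaP."""
    return "CaP" if logit_score > 0 else "Normal"


def likelihood_ratio_chi2(table):
    """table = [[a, b], [c, d]] of counts (actual by predicted). Returns (L2, p-value)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    l2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
                expected = row[i] * col[j] / n
                l2 += 2.0 * observed * math.log(observed / expected)
    l2 = max(l2, 0.0)  # protect against tiny negative rounding error
    p_value = math.erfc(math.sqrt(l2 / 2.0))  # chi-squared upper tail, 1 degree of freedom
    return l2, p_value


# Rows: actual CaP, actual normal. Columns: predicted CaP, predicted normal.
l2, p = likelihood_ratio_chi2([[62, 14], [14, 62]])
print(round(l2, 2), p < 0.05)  # 65.49 True
```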

Test A results:

Using pre-specified model coefficients (5.19 RP51077B9.4, 3.67 CD97, 1.89 CDKN2A, -9.47 SP1, -1.43 S100A6, 3.44 IQGAP1) and a pre-specified cut point probability of 0.5, the 6-gene logit model demonstrated a sensitivity (% CaP subjects correctly classified) and specificity (% normal subjects (without BPH) correctly classified) of 85.9% and 83.0%, respectively, with a validation p-value = 1.3E-26.

Test B results:

6-gene Model -- Parameter Estimates and p-values

Training Test

All coefficients estimated based on the test data have the same sign as the original model estimated on the training data.

Test C results:

Comparative test dataset ROC curves using the 6-gene model logit score vs. the age-adjusted PSA criterion were constructed. The area under the ROC curve (AUC) for the 6-gene logit model vs. the age-adjusted PSA was 0.898 vs. 0.816, respectively. This represents a statistically significant improvement with a validation p-value = 0.014.
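For readers unfamiliar with the AUC statistic, a minimal sketch follows. It uses the rank-based interpretation of the AUC (the probability that a randomly chosen CaP subject scores higher than a randomly chosen normal subject, with ties counted as one half). The function name and the scores are hypothetical, and the sketch does not reproduce the significance test used to compare the two curves.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical logit scores, not values from the study.
cap_scores = [2.1, 0.4, 1.3, -0.2]
normal_scores = [-1.5, 0.1, -0.7, 0.9]
auc = roc_auc(cap_scores, normal_scores)
print(auc)  # 0.8125
```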

The Test Dataset confirms that the 6-gene logit model alone (i.e., not used in combination with PSA) is capable of discriminating prostate cancer patients from normal subjects (without BPH) with high statistical significance. A comparison of the Training Dataset results and the Test Dataset results is shown in Figure 15. The results for the 6-gene model from the training sample yielded a sensitivity of 88.2% (CaP) and specificity of 85.5% (normals) while the test set results yielded a sensitivity of 85.9% (CaP) and specificity of 83% (normals). The Test Dataset exhibited a slight fall-off in sensitivity (88.2% to 85.9%) and specificity (85.5% to 83%) from the Training Dataset. In comparison, the age-adjusted PSA criteria only yielded a sensitivity of 69.5% and specificity of 93.6%.

The Test Dataset further confirms that the area under the ROC curve (AUC) is significantly improved for the 6-gene model over the age-adjusted PSA criterion. As shown in Figure 16, the AUC for the ROC curve for the 6-gene model in the results from the training set is 0.946 whereas the AUC for the curve for age-adjusted PSA criteria alone is 0.842 (p-value=0.005). The AUC for the ROC curve for the 6-gene model in the test set results is 0.898, whereas the AUC for the age-adjusted PSA alone is 0.816. Note that the AUC for the ROC curve is somewhat smaller on the Test Dataset for both the 6-gene model and the age-adjusted PSA criterion.

Validation of 6-gene+PSA model:

Validation of the 6-gene logit model (RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1)+PSA that discriminates prostate cancer patients from normal subjects (without BPH) followed a pre-specified plan as follows:

Test A:

a) Using pre-specified model coefficients (beta) established in TRAINING dataset (red box in Table) compute a model logit score.

Predictors beta p-value

Intercept -50.66

plnPSA 4.50 4.4E-05

SP1 -15.11 2.8E-04

CD97 6.31 9.3E-04

RP51077B9.4 7.65 1.9E-03

CDKN2A 2.94 4.1E-03

S100A6 -2.63 0.014

IQGAP1 4.03 0.024

c) Form 2 x 2 table of frequency counts (actual by predicted classification). Compute likelihood ratio chi-squared (L²) and derive p-value with 1 degree of freedom. A validation p-value<0.05 constitutes a successful validation (meaning test results deviate from independence with 95% confidence).

Test B:

b) Construct comparative ROC curves using the 6-gene + PSA model logit score vs. the age-adjusted PSA criterion. The model validates if the improvement in the area under the curve (AUC) associated with the 6-gene logit model + PSA vs. age-adjusted PSA is significant (p<0.05).

Test A results:

Using pre-specified model coefficients (4.50 plnPSA, 7.65 RP51077B9.4, 6.31 CD97, 2.94 CDKN2A, -15.11 SP1, -2.63 S100A6, 4.03 IQGAP1) and a pre-specified cut point probability of 0.5, the 6-gene logit model + PSA demonstrated a sensitivity and specificity of 87.5% and 92.6%, respectively with a validation p-value = 9.6E-37.
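The logit score and predicted probability for the 6-gene + PSA model can be sketched using the training coefficients and intercept listed above. The function names and the example subject are assumptions for illustration; the input values are hypothetical delta CT and plnPSA values, not measurements from the study.

```python
import math

# Training-set coefficients for the 6-gene + PSA model, as listed above.
INTERCEPT = -50.66
COEFFS = {
    "plnPSA": 4.50,
    "RP51077B9.4": 7.65,
    "CD97": 6.31,
    "CDKN2A": 2.94,
    "SP1": -15.11,
    "S100A6": -2.63,
    "IQGAP1": 4.03,
}


def logit_score(values: dict) -> float:
    """Intercept plus the coefficient-weighted sum of the predictors."""
    return INTERCEPT + sum(beta * values[name] for name, beta in COEFFS.items())


def prob_cap(logit: float) -> float:
    """Logistic transform, written to avoid overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def predict(values: dict) -> str:
    """Cut-point of 0 on the logit scale, i.e. probability 0.5."""
    return "CaP" if logit_score(values) > 0 else "Normal"


# Hypothetical subject (values invented for illustration only).
subject = {"plnPSA": 1.5, "RP51077B9.4": 16.0, "CD97": 13.0, "CDKN2A": 20.0,
           "SP1": 15.0, "S100A6": 14.0, "IQGAP1": 14.0}
score = logit_score(subject)
print(predict(subject), prob_cap(score))
```

A logit score of exactly 0 corresponds to a predicted probability of 0.5, which is why the two cut-points are interchangeable.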

Test B results:

'6-gene + PSA' Model -- Parameter Estimates and p-values

Training Test

Predictors beta p-value beta p-value

Intercept -50.66 -14.90

plnPSA 4.50 4.4E-05 3.01 6.8E-10

SP1 -15.11 2.8E-04 -5.64 2.3E-05

CD97 6.31 9.3E-04 4.45 2.4E-04

RP51077B9.4 7.65 1.9E-03 1.69 0.10

CDKN2A 2.94 4.1E-03 0.79 0.10

S100A6 -2.63 0.014 -0.19 0.71

IQGAP1 4.03 0.02 0.24 0.84

Again, all coefficients estimated based on the test data have the same sign as the original model estimated on the training data.
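The direction check in Test B amounts to comparing coefficient signs between the two fits. A short sketch using the training and test estimates from the table above (the function name is an illustrative assumption):

```python
# Beta estimates from the '6-gene + PSA' parameter table (training vs. test).
TRAINING = {"plnPSA": 4.50, "SP1": -15.11, "CD97": 6.31, "RP51077B9.4": 7.65,
            "CDKN2A": 2.94, "S100A6": -2.63, "IQGAP1": 4.03}
TEST = {"plnPSA": 3.01, "SP1": -5.64, "CD97": 4.45, "RP51077B9.4": 1.69,
        "CDKN2A": 0.79, "S100A6": -0.19, "IQGAP1": 0.24}


def same_direction(original: dict, reestimated: dict) -> dict:
    """For each predictor, True when both estimates have the same sign."""
    return {name: (original[name] > 0) == (reestimated[name] > 0) for name in original}


agreement = same_direction(TRAINING, TEST)
print(all(agreement.values()))  # True
```

Note that sign agreement says nothing about magnitude: several test-set estimates are much smaller than their training counterparts and have p-values above 0.05.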

Test C results:

Comparative test dataset ROC curves using the 6-gene + PSA model logit score vs. the age-adjusted PSA criterion were constructed. The area under the ROC curve (AUC) for the 6-gene + PSA logit model vs. the age-adjusted PSA was 0.962 and 0.816, respectively. This represents a statistically significant improvement with a validation p-value = 1.5E-7.

The Test Dataset confirms that the 6-gene logit model+PSA is capable of discriminating prostate cancer patients from normal subjects (without BPH) with high statistical significance. A comparison of the Training Set results and the Test Set results is shown in Figure 17. The results for the 6-gene model+PSA from the training sample yielded a sensitivity of 97.4% (CaP) and specificity of 96.1% (normals) while the test set results yielded a sensitivity of 87.5% (CaP) and specificity of 92.6% (normals) (validation p-value=9.6E-37). The test dataset exhibited a slight fall-off in sensitivity (97.4% to 87.5%) and specificity (96.1% to 92.6%) from the Training Dataset. Inclusion of BPH subjects in the Training and independent validation sets reduced the sensitivity and specificity to 91.5% and 91.4% respectively. In comparison, the age-adjusted PSA criteria yielded a sensitivity of 69.5% and specificity of 93.6%.

The Test Dataset further confirms that the area under the ROC curve (AUC) is significantly improved for the 6-gene+PSA model over the age-adjusted PSA criterion. As shown in Figure 18, the AUC for the ROC curve for the 6-gene+PSA model in the results from the Training Set is 0.994 whereas the AUC for the curve for age-adjusted PSA criteria alone is 0.842 (p-value=0.005). The AUC for the ROC curve for the 6-gene+PSA model in the Test Set results is 0.962, whereas the AUC for the age-adjusted PSA alone is 0.816 (validation p-value=1.5E-7). Given a specificity range of 91-93%, the 6-gene + PSA model has higher sensitivity (97% training / 87.5% test) compared to the age-adjusted PSA alone (<78% training / <70% test).

The 6-Gene Model + PSA retains its superiority over age-adjusted PSA alone when BPH subjects are included with the normal subjects without BPH. As stated in Example 4, the Training Set results of the 6-gene+PSA model yielded a sensitivity of 97.4% and specificity of 91.5% for discriminating between prostate cancer subjects and normal subjects (with and without BPH; CaP (N=76) vs. Normals (N=76), BPH (N=30)). These results were validated using the Test Dataset, which yielded a sensitivity of 87.5% and specificity of 91.4% for the 6-gene+PSA model, as compared to a sensitivity of 69.5% and specificity of 93.1% for the age-adjusted PSA model alone. A ROC curve is shown in Figure 19. The AUC for the 6-gene+PSA model is 0.953, whereas the AUC for the age-adjusted PSA alone is 0.813. The AUC difference is statistically significant (p-value = 9.0E-8). Development of a 6-gene model in a Training Set of samples that is further validated in an independent dataset strongly suggests that specific whole blood RNA transcript levels can assess abnormal gene expression levels associated with untreated, localized CaP. Validation of such a model with and without the inclusion of PSA supports its potential value as a diagnostic tool in the management of early stage prostate cancer with economic benefits to the healthcare system.

Re-estimation of model parameters based on the combined Training and Test Datasets will be used to refine the 6-gene model (with and without PSA) for use in future multi-site validation studies (see Figure 19; all coefficients estimated based on the test data have the same sign as the original model estimated on the training data). Using the combined training and test datasets with re-estimated parameters, improved sensitivity and specificity is observed when comparing the 6-gene model with PSA to the 6-gene model without PSA (see Figure 20). The 6-gene + PSA model exhibited improvement in sensitivity (85.8% to 93.6%) and specificity (87.1% to 94.7%) when compared to the 6-gene model alone.

Using the combined Training and Test Datasets with re-estimated parameters, improved AUC for ROC curve is also observed when comparing the 6-Gene+PSA model (AUC=0.977) to the 6-Gene model without PSA (AUC=0.920), as shown in Figure 21 (p-value=3.1E-7).

However, when using the combined Training and Test datasets, the AUC for the ROC curve for the age-adjusted PSA criterion (AUC=0.825) does not provide statistically significant improvement over the global PSA>4 criterion (AUC=0.82) (p-value >0.05, see Figure 22).

Example 7: Discrimination of Prostate Cancer Subjects from Healthy, Normal Subjects (without BPH) Using RNA Transcript-Based Gene Expression: SP1 gene expression is a proxy for CD97, RP51077B9.4 and IQGAP1 gene expression

In the 6-Gene model (RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1) of the previous Examples, certain patterns were observed between certain pairs of genes. Applicants have also observed these patterns in other predictive models for cancer (both in models developed to discriminate normal subjects from cancer subjects and in models developed to predict survival in later stages of cancer). These observations, discussed more fully below with respect to SP1 gene expression and CD97, RP51077B9.4, or IQGAP1 gene expression, include: (i) one gene of the pair (referred to herein as a "Prime" gene) is significant when used separately in a 1-gene model; (ii) the other gene of the pair (referred to herein as a "Proxy" gene) is NOT significant when used separately in a 1-gene model; (iii) however, when the Proxy gene is included in a 2-gene model with the Prime gene, the Proxy gene significantly improves the predictive area under the ROC curve of the Prime gene alone; (iv) in the 2-gene model, one gene has a significant positive coefficient, while the other gene has a significant negative coefficient; and (v) the two genes have moderate to high positive correlation (>0.6). As will be discussed more fully below, in the 6-Gene model, SP1 is a Proxy gene for the Prime genes CD97, RP51077B9.4 and IQGAP1.
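The five observations (i) through (v) can be expressed as a simple screening rule. The following is a sketch under stated assumptions rather than the procedure actually used: the class and function names are hypothetical, a significance level of 0.05 is assumed, and the statistics are assumed to have been computed beforehand by logistic regression.

```python
from dataclasses import dataclass


@dataclass
class PairStats:
    prime_p_alone: float    # p-value of the candidate Prime gene in a 1-gene model
    proxy_p_alone: float    # p-value of the candidate Proxy gene in a 1-gene model
    auc_gain_p: float       # p-value for the AUC gain of the 2-gene over the 1-gene model
    prime_coef: float       # 2-gene model coefficient of the Prime gene
    proxy_coef: float       # 2-gene model coefficient of the Proxy gene
    prime_p_in_pair: float  # p-value of the Prime gene within the 2-gene model
    proxy_p_in_pair: float  # p-value of the Proxy gene within the 2-gene model
    correlation: float      # correlation between the two genes


def is_prime_proxy(s, alpha=0.05, min_corr=0.6):
    """Apply observations (i) to (v) to one candidate gene pair."""
    return (
        s.prime_p_alone < alpha                   # (i) Prime significant alone
        and s.proxy_p_alone >= alpha              # (ii) Proxy not significant alone
        and s.auc_gain_p < alpha                  # (iii) Proxy significantly improves AUC
        and s.prime_coef * s.proxy_coef < 0       # (iv) coefficients of opposite sign ...
        and s.prime_p_in_pair < alpha             #      ... both significant
        and s.proxy_p_in_pair < alpha
        and s.correlation > min_corr              # (v) moderate to high positive correlation
    )


# Hypothetical inputs for illustration only; these are not study data.
candidate = PairStats(0.01, 0.90, 0.04, 4.0, -4.0, 0.01, 0.001, 0.80)
redundant = PairStats(0.01, 0.01, 0.90, 2.0, 1.0, 0.01, 0.20, 0.70)
```

Here `candidate` satisfies all five conditions, whereas `redundant` fails because both genes are significant on their own and the second gene adds no significant AUC gain.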

In the 6-Gene model, SP1, the Proxy gene, is the gene that makes the largest contribution to the model. As shown for example in Figure 20, SP1 is the gene with the largest coefficient in both the 6-Gene+PSA model and the 6-Gene model without PSA. Meanwhile, CD97, RP51077B9.4 and IQGAP1 make the next largest contributions, respectively, of the genes to the model with the next largest coefficients (and largest coefficients opposite in sign to the coefficient of SP1) in both the 6-Gene+PSA model and the 6-Gene model without PSA (see Figure 21). However, as shown in Figure 23, SP1 is not significant in a 1-gene model and has no direct predictive power of its own. By contrast, RP51077B9.4 and CD97 validate in the Test Dataset as the most significant 1-gene models (see Figure 23B). As shown in Figure 24, the AUC for the ROC curve for the 1-gene SP1 model developed in the Training Dataset is 0.50 (p-value=0.93), while the AUC for the ROC curve for the 1-gene SP1 model validated in the Test Dataset results is 0.52 (p-value=0.57). This demonstrates that SP1 gene expression has no direct predictive power of its own to discriminate normal subjects from prostate cancer subjects.

SP1 improves prediction by transforming the effects of the Prime genes CD97, RP51077B9.4 and IQGAP1 to the more predictive effects of changes in the expression of the Prime genes, as is discussed below. Figure 25 depicts comparative ROC curves comparing a 1-gene model (RP51077B9.4) to a 2-gene logit model (RP51077B9.4 and SP1) developed in the Training Dataset (Figure 25A) and validated in the Test Dataset (Figure 25B) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH). As shown in Figure 25A in the Training Dataset, there is improved area under the ROC curve for the 2-gene model (AUC=0.82) as compared to the 1-gene model alone (AUC=0.76). The contribution of SP1 to the 2-gene logit model in the Training Dataset is statistically significant (p-value=0.0001). The Test Dataset further confirms that the area under the ROC curve (AUC) is significantly improved for the 2-gene model over the 1-gene model. As shown in Figure 25B, the AUC for the ROC curve for the 2-gene model in the Test Dataset is 0.81 whereas the AUC for the ROC curve for the 1-gene model alone is 0.74 (AUC difference p-value=0.095). The contribution of SP1 in the Test Dataset to the 2-gene logit model is statistically significant (p-value=4.6E-8).

Similarly, Figure 26 depicts comparative ROC curves comparing a 1-gene model (CD97) to a 2-gene logit model (CD97 and SP1) developed in the Training Dataset (Figure 26A) and validated in the Test Dataset (Figure 26B) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH). As shown in Figure 26A, in the Training Dataset there is improved area under the ROC curve for the 2-gene model (AUC=0.84) as compared to the 1-gene model alone (AUC=0.73), which is a statistically significant AUC difference (p-value=0.047). The contribution of SP1 to the 2-gene logit model in the Training Dataset is statistically significant (p-value=2.2E-6). The Test Dataset further confirms that the area under the ROC curve (AUC) is significantly improved for the 2-gene model over the 1-gene model. As shown in Figure 26B, the AUC for the ROC curve for the 2-gene model in the Test Dataset is 0.87, whereas the AUC for the ROC curve for the 1-gene model alone is 0.70 (AUC difference p-value=7.6E-5). The contribution of SP1 to the 2-gene logit model in the Test Dataset is statistically significant (p-value=1.4E-11).

By contrast, addition of CD97 in a 2-gene logit model (RP51077B9.4 and CD97) does not significantly improve the ROC curves over the 1-gene model for RP51077B9.4 alone.

Figure 27 depicts comparative ROC curves comparing the 1-gene model to the 2-gene logit model developed in the Training Dataset (Figure 27A) and validated in the Test Dataset (Figure 27B) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH). As shown in Figure 27A in the Training Dataset, there is no significant AUC difference (p-value=0.96) for the area under the ROC curve for the 2-gene model (AUC=0.76) as compared to that for the 1-gene model alone (AUC=0.76). The Test Dataset further confirms that the area under the ROC curve (AUC) is virtually unchanged for the 2-gene model over the 1-gene model. As shown in Figure 27B, the AUC for the ROC curve for the 2-gene model in the Test Dataset is 0.74 whereas the AUC for the ROC curve for the 1-gene model alone is 0.73 (AUC difference p-value=1.0).

Furthermore, addition of SP1 to a 2-gene logit model (RP51077B9.4 and CD97) significantly improves the ROC curves over the 2-gene model alone. Figure 28 depicts comparative ROC curves comparing the 2-gene logit model (RP51077B9.4 and CD97) to the 3-gene logit model (RP51077B9.4, CD97 and SP1) developed in the Training Dataset (Figure 28A) and validated in the Test Dataset (Figure 28B) to discriminate between prostate cancer subjects and normal, healthy subjects (without BPH). As shown in Figure 28A, in the Training Dataset there is improved area under the ROC curve for the 3-gene model (AUC=0.89) as compared to the 2-gene model alone (AUC=0.76), which is a statistically significant AUC difference (p-value=0.006). The Test Dataset further confirms that the area under the ROC curve (AUC) is significantly improved for the 3-gene model over the 2-gene model. As shown in Figure 28B, the AUC for the ROC curve for the 3-gene model in the Test Dataset is 0.90 whereas the AUC for the ROC curve for the 2-gene model is 0.74 (AUC difference p-value=2.5E-5).

Accordingly, while RP51077B9.4 is slightly more predictive as a 1-gene model than CD97, the 2-gene CD97 and SP1 model is more predictive than the 2-gene RP51077B9.4 and SP1 model. This is because there is a higher positive correlation of SP1 with CD97 than with RP51077B9.4. The combined (pooled within-group) correlation for SP1 with CD97 is 0.83, while for SP1 with RP51077B9.4 it is 0.69. The third Prime gene from the 6-Gene model, IQGAP1, has an even higher correlation with SP1 of 0.91.
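A pooled within-group correlation of the kind referred to above may be sketched as follows. The sketch assumes the standard definition in which each group (for example, Normals and CaP) is centered at its own means before the sums of squares and cross-products are pooled; the text does not spell out the exact computation that was used, and the function name and data are hypothetical.

```python
import math


def pooled_within_group_correlation(groups):
    """groups: list of (xs, ys) pairs of equal-length lists, one pair per group.

    Each group is centered at its own means, so that a shift in level between
    the groups does not inflate or deflate the correlation.
    """
    sxx = syy = sxy = 0.0
    for xs, ys in groups:
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        for x, y in zip(xs, ys):
            dx, dy = x - mean_x, y - mean_y
            sxx += dx * dx
            syy += dy * dy
            sxy += dx * dy
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: the same linear relation in both groups, with one group
# shifted upward in the second variable.
normals = ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
cases = ([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
r_pooled = pooled_within_group_correlation([normals, cases])
```

Computed this way, the correlation reflects how closely the two genes track one another within each group, independent of any disease-related shift between the groups.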

Since common industry practice is to exclude genes that are not significant in 1-gene models, Proxy genes will be excluded. Thus, predictive models developed using common industry practice will tend to underperform models that include Proxy genes.

Comparisons of the Training Set results and the Test Set results are shown in Figures 29 and 30, which are scatterplots of CD97 expression versus SP1 expression. As shown in Figure 29, using CD97 in a 1-gene model with a horizontal discrimination line of CD97 = 12.85 results in a high rate of false positives. Because CD97 and SP1 are highly correlated, normal subjects with high SP1 values will tend to have high CD97 values, thereby yielding a large number of false positives. This model yielded a sensitivity of 68.4% and a specificity of 67.8% in the Training Dataset and a sensitivity of 78.9% and a specificity of only 53.2% in the Test Dataset.

A 2-gene logistic regression model estimated on the combined Training and Test Datasets yielded: Logit(CaP:Normal) = 6.74 + 4.46 CD97 - 4.01 SP1. Using a Logit cut-point of 0 and solving for CD97 yields the equi-probability line (shown in Figure 30 as dashed lines): CD97 = -1.51 + 0.90 SP1. As SP1 is a Proxy gene, the slope of the discrimination line (0.90) is identical to the slope of the linear regression lines for predicting CD97 from SP1, for both Normals (green, lower solid lines in Figure 30) and CaP (red, upper solid lines in Figure 30). Assuming that the Normals are representative of the gene expression for the CaP subjects when they were in a normal state, the change in CD97 expected as a subject goes from a normal to a CaP state can be estimated by the vertical distance between the red and green solid lines in Figure 30, which is 0.53 (down regulation of 0.53 ct).
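The derivation of the equi-probability line from the 2-gene logit model can be checked with a short sketch. The coefficients are those given above; the function and variable names are hypothetical.

```python
# Coefficients of Logit(CaP:Normal) = 6.74 + 4.46 CD97 - 4.01 SP1, as given above.
B0, B_CD97, B_SP1 = 6.74, 4.46, -4.01


def equiprobability_line(b0, b_prime, b_proxy, logit_cut=0.0):
    """Solve b0 + b_prime * prime + b_proxy * proxy = logit_cut for the prime value.

    Returns (intercept, slope) of the line prime = intercept + slope * proxy.
    """
    intercept = (logit_cut - b0) / b_prime
    slope = -b_proxy / b_prime
    return intercept, slope


def logit(cd97, sp1):
    """Logit score of the 2-gene model for one subject."""
    return B0 + B_CD97 * cd97 + B_SP1 * sp1


intercept, slope = equiprobability_line(B0, B_CD97, B_SP1)
# intercept is about -1.51 and slope about 0.90, matching CD97 = -1.51 + 0.90 SP1.
```

Because the CD97 coefficient is positive, a subject whose CD97 value lies above this line for the observed SP1 value has a logit score above 0.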

Figure 31A depicts the scatterplots of Figures 29 and 30 and includes concentration ellipses for the prostate cancer subjects and normal, healthy subjects (without BPH), respectively. The concentration ellipses shown in Figure 31 (red ellipse for CaP subjects, blue ellipse for normals) are analogous to confidence intervals of +/- 1 standard deviation. Most normal subjects are contained within the blue ellipse, while most prostate cancer subjects are contained within the red ellipse. Figure 31B depicts the scatterplots of Figure 31A where the CD97 expression data for the prostate cancer subjects have been transformed by subtracting a value of 0.527. By shifting the values for the prostate cancer subjects by the value of 0.527 for CD97, both the normal subjects and prostate cancer subjects are contained by the single common, Normal ellipse as shown in Figure 31B.

Figure 32 depicts scatterplots of CD97 expression (Figure 32A) or change in CD97 expression (Figure 32B) versus SP1 expression in the Training Dataset and includes concentration ellipses for the prostate cancer subjects and normal, healthy subjects (without BPH), respectively, as well as the regression line for the normal subjects. Change in CD97 expression is estimated by subtracting the predicted CD97 expression for a normal subject with the observed SP1 expression from the observed CD97 expression. As shown in Figure 32B, most of the prostate cancer subjects have a change in CD97 expression that is greater than the regression line for the normal subjects.
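The estimate of the change in Prime gene expression described above can be sketched as a residual from a regression line fitted to normal subjects. Ordinary least squares is assumed here for the regression line; the function names and the small data set are hypothetical and do not represent study measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def change_in_prime(prime_observed, proxy_observed, intercept, slope):
    """Observed Prime value minus the value predicted for a normal subject
    with the same Proxy value (the surrogate normative value)."""
    return prime_observed - (intercept + slope * proxy_observed)


# Hypothetical normal subjects: Proxy (e.g. SP1) and Prime (e.g. CD97) values.
normal_proxy = [0.0, 2.0, 4.0]
normal_prime = [2.0, 3.0, 4.0]
intercept, slope = fit_line(normal_proxy, normal_prime)

# A hypothetical test subject with Prime = 5.0 and Proxy = 4.0.
delta = change_in_prime(5.0, 4.0, intercept, slope)
```

In this toy example the normal regression line is Prime = 2.0 + 0.5 Proxy, so the test subject's change in Prime expression is 1.0 relative to the surrogate normative value of 4.0.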

Figure 33 depicts scatterplots of CD97 expression versus SP1 expression (Figure 33A) or versus RP51077B9.4 expression (Figure 33B) in the Training Dataset and includes concentration ellipses for the prostate cancer subjects and normal, healthy subjects (without BPH), respectively, as well as discrimination lines for discriminating the normal subjects from the prostate cancer subjects.

The combined (pooled within-group) correlation for SP1 with CD97 is 0.83. However, because SP1 is not predictive for discriminating the normal subjects from the prostate cancer subjects, the discrimination line corresponds to the predicted CD97 expression for the given SP1 expression as determined on the combined Training and Test Datasets. The 2-gene Prime/Proxy model (CD97/SP1) yields good separation of normal subjects from prostate cancer subjects, with a sensitivity of 79.4% and specificity of 80.0% (see Figure 33A). By contrast, the 2-gene Prime/Prime model (CD97/RP51077B9.4) does not perform as well, with a sensitivity of 68.1% and specificity of 67.7% (see Figure 33B). RP51077B9.4 and CD97 are positively correlated (0.68) based upon the combined (pooled within-group) correlation. However, as both RP51077B9.4 and CD97 are predictive for discriminating the normal subjects from the prostate cancer subjects, they are largely redundant. Therefore, the discrimination of this model is comparable to that for the 1-gene models alone.

Example 8: Discrimination of Prostate Cancer Subjects from Healthy, Normal Subjects (without BPH) Using RNA Transcript-Based Gene Expression: Additional Prime/Proxy Gene Models

In this Example, Applicants determined whether any other gene pairs from the genes listed in Tables 3 and 4 were candidate Prime/Proxy 2-gene models, and also generated 8-Gene models, for discriminating prostate cancer subjects from normal, healthy subjects. The models of Tables 3 and 4 represent 22 unique genes, which are shown in Figure 23, with their mean gene expression values in prostate cancer subjects and normal, healthy subjects (without BPH) and their statistical significance in 1-gene models to discriminate those subjects.

The top thirteen 1-gene models in Figure 23A were significant (p-value < 0.05) in the Training Dataset. Of these thirteen 1-gene models, ten were also significant in the Test Dataset, and hence were Prime candidates. The resulting ten Prime gene candidates included three genes, CD97; RP51077B9.4; and IQGAP1, found in the 6-Gene model. CD97 and RP51077B9.4 validated as the two most significant discriminators, while IQGAP1 was one of the least significant of the Prime gene candidates.

Furthermore, logistic regression methodology was used to estimate all possible 2-gene models for the 22 genes (total number of models 231) in the Training Dataset. Of the resulting models, 113 resulted in significant p-values for both genes ("significant models"). Of the significant models, 94 (83%) were models where one gene coefficient was positive and the other was negative. All the models were re-estimated in the Test Dataset. Of the significant models, 80 were also found to be significant in the Test Dataset, and all of these were models where one gene coefficient was positive and the other was negative.
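The enumeration of 2-gene models and the subsequent filtering can be sketched as follows. The gene labels are placeholders for the 22 genes, the dictionary keys are hypothetical, and the per-model p-values and coefficients are assumed to have been produced by a separate logistic regression step that is not shown.

```python
from itertools import combinations
from math import comb

# Placeholder labels standing in for the 22 unique genes of Tables 3 and 4.
genes = [f"gene_{i:02d}" for i in range(1, 23)]

# Every unordered pair of genes defines one candidate 2-gene model.
pairs = list(combinations(genes, 2))
n_models = len(pairs)  # 22 choose 2 = 231


def is_significant(model, alpha=0.05):
    """Both genes have significant p-values in the 2-gene model."""
    return model["p_a"] < alpha and model["p_b"] < alpha


def has_opposite_signs(model):
    """One coefficient is positive and the other is negative."""
    return model["coef_a"] * model["coef_b"] < 0


def validated(training_models, test_models):
    """Pairs significant in the Training Dataset and again in the Test Dataset.

    Both arguments map a gene pair to a dict with keys p_a, p_b, coef_a, coef_b.
    """
    return [pair for pair, m in training_models.items()
            if is_significant(m) and is_significant(test_models[pair])]


# Share of significant training models with opposite-sign coefficients (94 of 113).
share_opposite = 94 / 113
```

The count of 231 models follows directly from choosing 2 genes out of 22, and 94 of 113 corresponds to the reported 83%.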

Shown in Figure 34 are the model parameters and statistics for the top ten (ranked according to the validation entropy based R²) of the eighty validated 2-gene logit models.

Values for the Prime genes are shaded blue, whereas values for the Proxy genes are shaded yellow. The "best" of these models was the CD97/SP1 model. Furthermore, fourteen of the eighty validated 2-gene models contain one Proxy gene that significantly enhances the effect of the paired Prime gene (see Figure 36), and are summarized below.


MAP2K1 was a Prime gene in four of these fourteen models. IQGAP1, CDK2, ABL1 and CD97 were each Prime genes in two of these fourteen models, while TNF and RP51077B9.4 were each a Prime gene in one of these fourteen models. SP1 was a Proxy gene in seven of these fourteen models. MAPK1 and MYC were each Proxy genes in three of these fourteen models, while GSK3B was a Proxy gene in one of these fourteen models. Figure 35 provides more detail for the seven models that include the SP1 Proxy gene, three of which are included in the 6-Gene model (highlighted). Furthermore, a retrospective analysis of the nine validated gene models of Table 3 surprisingly demonstrated that six of them included Prime and corresponding Proxy genes (see Figure 41). The CD97/SP1 Prime/Proxy gene pair is a core component of the best models.

As shown in Figure 37, the model parameters for the 6-gene logit model (RP51077B9.4, CD97, CDKN2A, SP1, S100A6 and IQGAP1) (with PSA - Figure 37B; without PSA - Figure 37A) based on the combined Training and Test Datasets were compared to the model parameters for models where CD97, RP51077B9.4 and IQGAP1 gene expression is replaced respectively with the corresponding predicted change in gene expression based on SP1 gene expression. When the three Prime genes are replaced by their predicted change in expression, the direct effect for SP1 is not significant (model including PSA - p-value=0.14; model without PSA - p-value=0.53). These results indicate that SP1 has no direct effect in the models, but serves to enhance the effects of the three Prime genes.

Stepwise logistic regression was then used to further identify 8-gene models for the 22 genes listed in Tables 3 and 4 that are capable of discriminating prostate cancer subjects from normal, healthy subjects (without BPH) without coincidental measurement of PSA values. Enumeration of possible 8-gene models yielded approximately 9,000 8-gene models with over 75% correct classification and about 1,000 8-gene models with over 85% correct classification. The top twenty of these 8-gene models are shown in Figure 42, and SP1 is a Proxy gene in each.

The "best" 8-gene model, ABL1, BRCA1, CD97, IL18, IQGAP1, RP51077B9.4, SP1 and TNF, shown in Figure 42, is capable of classifying normal subjects with 90% accuracy (90% specificity), and prostate cancer subjects with 89.2% accuracy (89.2% sensitivity). This model includes five Prime genes, ABL1, CD97, IQGAP1, RP51077B9.4 and TNF, and the Proxy gene, SP1. The model parameters for this 8-gene logit model are shown in the table below:

As a further example, the "best" 8-gene logistic regression model, combined with PSA values, capable of distinguishing between prostate cancer subjects and normal, healthy subjects based on the 22 genes analyzed is BRCA1, CD97, CDK2, IQGAP1, PTPRC, RP51077B9.4, SP1 and TNF. This model is capable of classifying normal subjects with 96.5% accuracy (96.5% specificity), and prostate cancer subjects with 93.1% accuracy (93.1% sensitivity). This model includes five Prime genes, CDK2, CD97, IQGAP1, RP51077B9.4 and TNF, and the Proxy gene, SP1. The model parameters for this 8-gene logit model are shown in the table below:

Example 9: Discrimination of Late Stage Prostate Cancer Subjects from Healthy, Normal Subjects (without BPH) Using RNA Transcript-Based Gene Expression: 2-Gene Prime/Proxy Models

In addition to the 204 untreated, localized prostate cancer subjects (referred to in this Example as "CaP1") and the 170 age-matched, medically defined normal, healthy subjects of the Training and Test Sets of the previous Examples, a cohort of 62 subjects with castrate resistant (late stage) prostate cancer with or without bone metastases (referred to in this Example as "CaP4") was selected. RNA samples were isolated from the CaP4 cohort and transcript-based gene expression was measured for this cohort as for the other cohorts of the previous Examples.

The 80 validated 2-gene models from Example 8 for the discrimination of the CaP1 cohort from the 170 Normals were applied to the CaP4 cohort to determine their predictive power to discriminate late stage prostate cancer from the same 170 Normals. Re-estimation of model parameters based on the combined training and test datasets for CaP1 and Normals was used to refine the 2-gene models.

Shown in Figure 38 are the percentages of the Normal, CaP1 and CaP4 cohorts correctly classified by applying the top 20 validated 2-gene logit models from Example 8, developed to discriminate between CaP1 subjects and normal healthy subjects (without BPH). All but four of the 80 validated 2-gene models were also able to discriminate well between Normals and the late stage cancer subjects. Surprisingly, of the resulting 76 models, 68 had greater predictive power for the CaP4 group than for the original CaP1 group, including the CD97/SP1 model. Generally, when a model whose coefficients are estimated on one set of data is applied to another set of data, there is some falloff in prediction, as occurs for example when a model developed on training sets is applied to validation sets. It is rare that a model actually performs better on some other population. This result is consistent with the model's prediction representing a measure of the underlying biological condition, the condition apparently worsening for late stage cancer subjects in this case.

Figure 39 depicts a scatterplot of CD97 expression versus SP1 expression to discriminate CaP4 subjects and normal, healthy subjects (without BPH) using the 2-gene logit model (CD97 and SP1) for CaP1 and normal subjects. 88.7% of the CaP4 subjects are correctly classified, while 80.9% of CaP1 subjects and 74.1% of Normal subjects are correctly classified.

A ROC curve comparing a 1-gene model (CD97) to a 2-gene logit model (CD97 and SP1) to discriminate between late stage prostate cancer subjects (CaP4) and normal, healthy subjects is shown in Figure 40. Improved area under the ROC curve further supports the improved discrimination of late stage prostate cancer versus normal subjects when combining Prime and Proxy gene expression as compared to Prime gene expression alone. As shown in Figure 40, the area under the ROC curve is 0.74 for the model based on the expression values of the Prime CD97 gene alone, as compared to 0.91 when expression values for the Proxy SP1 gene are included with CD97 gene expression. This improvement is statistically significant (p-value = 3.3E-5).

The references listed below are hereby incorporated herein by reference.

References

Magidson, J. GOLDMineR User's Guide (1998). Belmont, MA: Statistical Innovations Inc.

Vermunt and Magidson (2005). Latent GOLD 4.0 Technical Guide, Belmont MA: Statistical Innovations.

Vermunt and Magidson (2007). LG-Syntax™ User's Guide: Manual for Latent GOLD® 4.5 Syntax Module, Belmont MA: Statistical Innovations.

Vermunt, J.K. and Magidson, J. (2002). Latent Class Cluster Analysis. In J. A. Hagenaars and A. L. McCutcheon (eds.), Applied Latent Class Analysis, 89-106. Cambridge: Cambridge University Press.

Magidson, J. "Maximum Likelihood Assessment of Clinical Trials Based on an Ordered Categorical Response." (1996) Drug Information Journal, Maple Glen, PA: Drug Information Association, Vol. 30, No. 1, pp 143-170.

TABLE 1: Precision Profile for Prostate Cancer Detection

TABLE 2: Precision Protein Panel for Prostate Cancer Detection

CDKN2A         cyclin-dependent kinase inhibitor 2A (melanoma, p16, inhibits CDK4)   NP_000068
E2F1           E2F transcription factor 1                                            NP_005216
GSK3B          glycogen synthase kinase 3 beta                                       NP_002084
ICAM1          Intercellular adhesion molecule 1                                     NP_000192
IQGAP1         IQ motif containing GTPase activating protein 1                       NP_003861
MAP2K1         mitogen-activated protein kinase kinase 1                             NP_002746
MTA1           metastasis associated 1                                               NP_004680
MYC            v-myc myelocytomatosis viral oncogene homolog (avian)                 NP_002458
PTPRC          protein tyrosine phosphatase, receptor type, C                        NP_002829
RB1            retinoblastoma 1 (including osteosarcoma)                             NP_000312
RP5-1077B9.4   invasion inhibitory protein 45                                        NP_068752
S100A6         S100 calcium binding protein A6                                       NP_055439
SIAH2          seven in absentia homolog 2 (Drosophila)                              NP_005058
SOCS1          suppressor of cytokine signaling 1                                    NP_003736
SP1            Sp1 transcription factor                                              NP_612482
TGFB1          transforming growth factor, beta-induced, 68kDa                       NP_000651
THBS1          thrombospondin 1                                                      NP_003237
TNF            tumor necrosis factor (TNF superfamily, member 2)                     NP_000585

Table 3: Prostate Cancer vs. Normals: select gene models to be validated using Test Dataset

Five 2-gene models     Two 4-gene models               Two 6-gene models
ABL1, BRCA1            CD97, CDK2, RP51077B9.4, SP1    RP51077B9.4, CD97, CDKN2A, SP1, S100A6, IQGAP1
MAP2K1, MAPK1          BRCA1, GSK3B, RB1, TNF          CD97, GSK3B, PTPRC, RP51077B9.4, SP1, TNF
BRCA1, MAP2K1
PTPRC, RP51077B9.4
CD97, SP1

Table 4: Prostate cancer vs. BPH Normals: select gene models to be validated using Test Dataset

Three 1-gene models   Five 2-gene models     Three 3-gene models      Three 5-gene models
IL18                  CD97, S100A6           MAP2K1, MYC, S100A6      MAP2K1, MYC, S100A6, C1QA, RP51077B9.4
RP51077B9.4           IL18, RP51077B9.4      MAP2K1, S100A6, TP53     MAP2K1, SMAD3, S100A6, CCNE1, TP53
S100A6                MAP2K1, S100A6         MAP2K1, S100A6, SMAD3    MAP2K1, TP53, S100A6, CCNE1, ST14
                      RP51077B9.4, S100A6
                      RP51077B9.4, SP1

Claims

1. A method of evaluating a biological state of a subject, comprising

a. providing a test value for a prime indicator, wherein the value of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state;

b. providing a test value for a proxy indicator, wherein the value of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and wherein

i. the proxy indicator is correlated with the prime indicator;

ii. the test value of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the test value of the proxy indicator provides a surrogate normative value for the prime indicator,

c. adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value and as a result the prime indicator combined with the proxy indicator improves the predictive power of the prime indicator;

d. evaluating the biological state of the subject based upon the adjusted test value of the prime indicator.

2. The method of claim 1, wherein the proxy indicator and the prime indicator are correlated in both a normal biological state and an altered biological state.

3. The method of claim 1, wherein the biological state of the subject is evaluated based upon an index that is indicative of the state of the subject comprised of the adjusted test values of the prime indicators.

4. The method of claim 1, wherein the biological state of the subject is evaluated based upon the surrogate normative value for the prime indicator as determined by the proxy indicator.

5. The method of claim 1, wherein the biological state of the subject is evaluated based upon an index that is indicative of the state of the subject comprised of surrogate normative values for prime indicators as determined by the proxy indicators.

6. The method of claim 1, wherein a statistically significant contribution is defined as having a p-value of < 0.05.

7. The method of claim 1, wherein a non-statistically significant contribution is defined as having a p-value of > 0.05.

8. The method of claim 1, wherein correlation is defined as having a coefficient of correlation of > 0.5.

9. The method of claim 1, wherein an indicator is a gene, or a gene expression product.

10. The method of claim 6, wherein a gene expression product is an RNA or a protein.

11. The method of claim 1, wherein said providing step(s) in step (a) and/or step (b) comprises measuring the value of the indicator.

12. The method of claim 1, wherein said evaluating step in step (d) comprises comparing the adjusted test value of the prime indicator to a control value.

13. A method of determining the change in a value of a prime indicator attributed to a biological state of a subject comprising

a. providing a test value for a prime indicator, wherein the value of the prime

b. providing a test value for a proxy indicator, wherein the value of the proxy

i. the proxy indicator is correlated with the prime indicator;

ii. the test value of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the test value of the proxy indicator provides a surrogate normative value for the prime indicator,

c. adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value.

14. A method of increasing the predictive value of a prime indicator of a biological state, comprising the steps of:

a. providing a test value for a prime indicator, wherein the value of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state;

b. providing a test value for a proxy indicator, wherein the value of the proxy

c. adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value and as a result the prime indicator combined with the proxy indicator improves the predictive power of the prime indicator.

15. A method of identifying a prime/proxy indicator pair for a function to predict a value indicative of the risk or probability of a biological state from a dataset of measurements of a plurality of indicators from a plurality of subjects each with the biological state known, the method comprising:

a. providing a test value for each indicator wherein said test value represents a change in the value of each indicator in a subject with a first biological state compared to a subject with a second biological state;

b. determining the statistical significance of each test value in step (a);

c. using said test values for each indicator to enumerate two indicator models indicative of a risk or probability of the biological state, wherein said two indicator models distinguish between a subject with the first biological state and a subject with the second biological state;

d. selecting the two indicator models enumerated in step (c) capable of distinguishing between a subject with the first biological state and a subject with the second biological state with at least 75% accuracy;

e. determining a coefficient for each indicator in each two indicator model selected in step (d) and identifying the two indicator models selected in step (d) having one indicator with a positive coefficient and one indicator with a negative coefficient;

f. selecting the two indicator models identified in step (e) in which the unique contribution of each individual indicator is statistically significant;

g. identifying the two indicator models selected in step (f) that comprise one indicator whose test value alone makes a statistically significant contribution to the prediction of the biological state and one indicator whose test value alone does not make a statistically significant contribution to the prediction of the biological state, as determined in step (b);

h. determining whether the two indicators in the models identified in step (g) are correlated to each other and selecting two indicator models wherein the two indicators have a correlation coefficient > 0.5, wherein the indicator that makes a statistically significant contribution as determined in step (b) is the prime indicator and the indicator that does not make a statistically significant contribution as determined in step (b) is the proxy indicator, thereby identifying a prime/proxy pair.
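
The selection cascade of steps (d) through (h) can be sketched as a sequence of filters applied to two indicator models that have already been fitted. The sketch below assumes that model fitting, accuracy estimation and significance testing were done elsewhere; the `TwoIndicatorModel` record, its field names, the 0.05 significance threshold and the example values are hypothetical and serve only to illustrate the order of the checks.

```python
from dataclasses import dataclass

@dataclass
class TwoIndicatorModel:
    """Summary of one fitted two indicator model (all fields are hypothetical)."""
    name_a: str
    name_b: str
    accuracy: float     # step (d): fraction of subjects classified correctly
    coef_a: float       # step (e): fitted coefficient of indicator A
    coef_b: float       # step (e): fitted coefficient of indicator B
    p_joint_a: float    # step (f): p-value of A's unique contribution in the model
    p_joint_b: float    # step (f): p-value of B's unique contribution in the model
    p_alone_a: float    # step (b): p-value of A evaluated on its own
    p_alone_b: float    # step (b): p-value of B evaluated on its own
    correlation: float  # step (h): correlation coefficient between A and B

def prime_proxy_pair(m, alpha=0.05, min_accuracy=0.75, min_corr=0.5):
    """Return (prime, proxy) if the model passes steps (d)-(h), otherwise None."""
    if m.accuracy < min_accuracy:                     # (d) at least 75% accuracy
        return None
    if m.coef_a * m.coef_b >= 0:                      # (e) coefficients of opposite sign
        return None
    if m.p_joint_a >= alpha or m.p_joint_b >= alpha:  # (f) both unique contributions significant
        return None
    a_alone = m.p_alone_a < alpha
    b_alone = m.p_alone_b < alpha
    if a_alone == b_alone:                            # (g) exactly one significant on its own
        return None
    if m.correlation <= min_corr:                     # (h) correlation coefficient > 0.5
        return None
    # The indicator that is significant alone is the prime; the other is the proxy.
    return (m.name_a, m.name_b) if a_alone else (m.name_b, m.name_a)

# Illustrative candidates; names and numbers are invented for the example.
candidates = [
    TwoIndicatorModel("G1", "G2", 0.82, 1.4, -1.1, 0.01, 0.02, 0.03, 0.40, 0.7),
    TwoIndicatorModel("G3", "G4", 0.90, 1.0, 0.8, 0.01, 0.01, 0.01, 0.02, 0.9),
]
pairs = [prime_proxy_pair(m) for m in candidates]
```

In this toy input the first model yields the pair (G1, G2), while the second is rejected at step (e) because both coefficients are positive.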

16. The method of claim 15, further comprising using said test values for each prime and proxy indicator identified in step (h) to enumerate two indicator models indicative of a risk or probability of the biological state, to identify a model comprising at least one prime indicator and at least one proxy indicator wherein said model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy.

17. The method of claim 15, further comprising using said test values for each prime and proxy indicator identified in step (h) to enumerate models comprised of multiple indicators indicative of a risk or probability of the biological state, to identify a model comprising multiple prime indicators and at least one proxy indicator that is correlated to each of the prime indicators in a two indicator model wherein said model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy.

18. The method of claim 15, further comprising using said test values for each prime and proxy indicator identified in step (h) to enumerate models comprised of two indicator models and one or more Clinical Parameters indicative of a risk or probability of the biological state, to identify a model comprising at least one prime indicator, at least one proxy indicator and one or more Clinical Parameters wherein said model distinguishes between a subject with the first biological state and a subject with the second biological state with at least 85% accuracy and improves the predictive power of the Clinical Parameters alone.

19. The method of claim 15, wherein statistical significance in step (b) or step (f) is determined by the Wald test or the likelihood ratio test.

20. The method of claim 15, wherein a statistically significant contribution is defined as having a p-value of < 0.05.

21. The method of claim 15, wherein a non-statistically significant contribution is defined as having a p-value of > 0.05.
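
As a minimal sketch of how the p-values referred to in claims 19 to 21 might be obtained, the functions below compute a two-sided Wald p-value from a coefficient and its standard error, and a likelihood ratio p-value for the removal of a single parameter. The function names are hypothetical, the inputs (coefficient, standard error, log-likelihoods) are assumed to come from a model fitted elsewhere, and the likelihood ratio helper covers only the one-degree-of-freedom case.

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coefficient = 0, using z = coef / std_err."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

def likelihood_ratio_p_value(loglik_full, loglik_reduced):
    """p-value for dropping ONE parameter (chi-square with 1 degree of freedom)."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # Survival function of chi-square(1) at x equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def is_significant(p, alpha=0.05):
    """Threshold of claim 20: a contribution is significant when p < 0.05."""
    return p < alpha
```

For example, `is_significant(wald_p_value(3.0, 1.0))` is true, whereas a coefficient equal to its own standard error (z = 1) is not significant at the 0.05 level.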

22. The method of claim 15, wherein the plurality of indicators is greater than the plurality of subjects.

23. The use of the prime indicator identified by claim 15 to screen for therapeutic agents to treat the biological condition.

24. A method of evaluating a biological state of a subject over a period of time, comprising a. providing a test value for a prime indicator from said subject at a first period of time, wherein the value of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state;

b. providing a test value for a proxy indicator from said subject at a first period of time, wherein the value of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and wherein i. the proxy indicator is correlated with the prime indicator; ii. the test value of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the test value of the proxy indicator provides a surrogate normative value for the prime indicator,

c. adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value at the first period of time;

d. providing a test value for a prime indicator from said subject at a second period of time,

e. providing a test value for a proxy indicator from said subject at a second period of time,

f. adjusting the test value for the prime indicator based upon the test value of the proxy indicator to arrive at a change in value of the prime indicator relative to the surrogate normative value at the second period of time;

g. comparing the change in value of the prime indicator at the first period of time to the change in value of the prime indicator at the second period of time.
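
The comparison in step (g) can be illustrated with a short sketch. It assumes, purely for illustration, that the surrogate normative value is a linear function of the proxy value with a hypothetical intercept and slope; the function name and the measurements are invented and are not taken from the source.

```python
def change_vs_surrogate(prime, proxy, intercept=0.0, slope=1.0):
    """Change in the prime value relative to the proxy-derived surrogate normative value.

    The linear form (intercept + slope * proxy) is an assumption of this sketch.
    """
    return prime - (intercept + slope * proxy)

# Hypothetical measurements from the same subject at two periods of time.
first = change_vs_surrogate(prime=18.0, proxy=12.0)   # steps (a)-(c)
second = change_vs_surrogate(prime=16.5, proxy=12.5)  # steps (d)-(f)

# Step (g): compare the two adjusted changes.
shift = second - first
```

Here the adjusted change falls from 6.0 to 4.0, a shift of -2.0, even though the proxy value itself drifted upward between the two measurements.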

25. A computer-readable medium comprising computer executable instructions thereon for performing the method comprising:

receiving a dataset of measurements for the patient, wherein the measurements include the following:

a. at least one measurement for a prime indicator, wherein the value of the measurement of the prime indicator alone makes a statistically significant contribution to the evaluation of a biological state; and

b. at least one measurement for a proxy indicator, wherein the value of the measurement of the proxy indicator alone does not make a statistically significant contribution to the evaluation of the biological state, and wherein

i. the proxy indicator is correlated with the prime indicator; ii. the value of the measurement of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the value of measurement of the proxy indicator provides a surrogate normative value for the prime indicator,

c. evaluating the dataset of measurements for the patient with a function to predict a value indicative of the risk or probability of the biological state, wherein the function adjusts the value of the measurement of the prime indicator based upon the value of the measurement of the proxy indicator to arrive at a change in value of the prime gene relative to the surrogate normative value.
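
The claim recites only "a function" and does not fix its form. A minimal sketch, assuming a logistic function with coefficients of opposite sign for the prime and proxy measurements (consistent with step (e) of claim 15), is shown below. The function name and the coefficient values are hypothetical.

```python
import math

def risk_probability(prime, proxy, b0, b_prime, b_proxy):
    """Logistic risk score from one prime and one proxy measurement.

    The logistic form and the coefficients are assumptions of this sketch.
    """
    logit = b0 + b_prime * prime + b_proxy * proxy
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients of equal magnitude and opposite sign: the score then
# depends only on (prime - proxy), so the proxy acts as a per-patient baseline.
elevated = risk_probability(prime=14.0, proxy=12.0, b0=0.0, b_prime=1.0, b_proxy=-1.0)
baseline = risk_probability(prime=14.0, proxy=14.0, b0=0.0, b_prime=1.0, b_proxy=-1.0)
```

With these illustrative coefficients the same prime measurement gives a probability of about 0.88 when the proxy is lower and exactly 0.5 when the proxy matches it.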

26. An apparatus for identifying prime and proxy indicators for a function to predict a value indicative of the risk or probability of a biological state from a dataset of measurements of a plurality of clinical indicators from a plurality of subjects each with the biological state known, the apparatus comprising:

a processor configured to evaluate the dataset of measurements from the subjects to thereby identify one or more prime indicators for the function and identify one or more proxy indicators for the function, wherein the measurements include the following:

wherein the value of the measurement of the prime indicator alone makes a statistically significant contribution to the evaluation of the biological state; and

i. the proxy indicator is correlated with the prime indicator; ii. the value of the measurement of the proxy indicator is similar in both a normal biological state and an altered biological state and as a result the value of measurement of the proxy indicator provides a surrogate normative value for the prime indicator.

27. The apparatus of claim 26, wherein the apparatus comprises an output device for outputting the prime and the proxy indicators.

28. The apparatus of claim 26, wherein the apparatus comprises an input device configured to receive the dataset of measurements.