Roadmap for Developing and Validating Therapeutically Relevant Genomic Classifiers
Richard Simon
Oncologists need improved tools for selecting treatments for individual patients. The development of therapeutically relevant prognostic markers has traditionally been slowed by poor study design, inconsistent findings, and lack of proper validation studies. Microarray expression profiling provides an exciting new technology for relating tumor gene expression to patient outcome, but it also provides increased challenges for translating initial research findings into robust diagnostics that benefit patients and physicians in therapeutic decision making. This article attempts to clarify some of the misconceptions about the development and validation of multigene expression signature classifiers and highlights the steps needed to move genomic signatures into clinical application as therapeutically relevant and robust diagnostics.

INTRODUCTION

Oncologists need improved tools for selecting treatments for individual patients. Most cancer treatments benefit only a minority of the patients to whom they are administered. Being able to predict which patients are most likely to benefit would not only save patients from unnecessary toxicity and inconvenience, but might facilitate their receiving drugs that are more likely to help them. In addition, the current overtreatment of patients results in major expense for individuals and society, an expense that may not [...] provided an exciting new technology for attempting to identify classifiers for tailoring treatments to patients. To date, however, [...] been widely adopted into oncology practice and very few are close to achieving such status. Development of biomarker classifiers [...] and sufficiently validated for broad clinical application is difficult, and more difficult for expression signature classifiers. The field of microarray expression profiling is [...] and excessive skepticism. In this article, I will attempt to clarify some of the misconceptions about the development and validation of multigene expression signature classifiers and highlight the steps needed to move genomic signatures into clinical application as therapeutically relevant and robust diagnostics.

WHY ARE SO FEW PROGNOSTIC FACTORS USED IN ONCOLOGY?

Although there is a large literature on prognostic factors for cancer patients, very few such factors are used in clinical practice. Prognostic factors are unlikely to be used unless they are therapeutically relevant, [...] studies are conducted using a convenience sample of patients for whom tissue is available, but the cohort is often far too heterogeneous with regard to stage and treatment to support therapeutically relevant conclusions. Additional problems in the prognostic marker literature derive from the fact [...]
Information downloaded from jco.ascopubs.org and provided by SWETS SUBSCRIPTION SERVICE for Bayerische
Staatsbibliothek on March 4, 2008 from 194.95.59.195.
Copyright 2005 by the American Society of Clinical Oncology. All rights reserved.
markers and prognostic models, but do not test prespecified models using independent data. Clinical drug trials are generally prospective, with patient selection criteria, primary end point, hypotheses, and analysis plan specified in advance in a written protocol. The consumers of clinical trial reports have been educated to be skeptical of data dredging to find something "statistically significant" to report in clinical trials. They are skeptical of analyses with multiple end points or multiple subsets, knowing that the chances of erroneous conclusions increase rapidly once one leaves the context of a focused, single-hypothesis clinical trial. Prognostic marker studies are generally performed with no written protocol, no eligibility criteria, no primary end point or hypotheses, and no defined analysis plan. The analysis often includes numerous analyses of different end points and patient subsets. The problem is not just that the studies are for developing prognostic markers rather than validating previously specified markers, but that even as developmental studies the planning and analysis [...]

Another feature that has hindered the use of prognostic markers in medical practice is the lack of studies demonstrating the reproducibility of results for assaying markers either between laboratories, between samples of the same tissue specimen, or between times and readers [...]

Many of these problems apply to studies of prognostic classifiers on gene expression profiles. Some of the problems are even more formidable. Because of the number of genes available for analysis, microarray data can be a veritable fountain of false findings unless a structured approach to model development and validation [...]

Some of the key steps in obtaining a classifier that is ready for "prime time" are listed in Table 1. These steps are discussed in the following sections. We have already discussed the importance of developing the classifier for a specific therapeutic decision problem and using cases relevant to that decision context. That is of key importance. There are, however, some well-defined therapeutic decision contexts where even accurate, reproducible, and well-validated classifiers are unlikely to be used widely. For example, consider the treatment of patients with advanced disease treated with a potentially curative treatment. A classifier for predicting the patients unlikely to respond to that therapy may not be widely used if there is no good alternative treatment. The classifier would have to have a very high negative predictive value in order to justify withholding a potentially curative therapy. It is important to evaluate carefully the context of therapeutic decision making if one wants to develop a classifier that has a sufficiently great chance of having clinical impact to warrant the large expense and time commitment required to achieve the other parts of Table 1.

Table 1. Key Steps in Development and Validation of Therapeutically Relevant Genomic Classifiers
Develop classifier for addressing a specific important therapeutic decision
  Patients are sufficiently homogeneous and receiving uniform treatment so that results are therapeutically relevant
  Treatment options and costs of misclassification are such that a [...]
Perform internal validation of classifier to assess whether it appears sufficiently accurate relative to standard prognostic factors that it is [...]
Translate classifier to platform that would be used for broad clinical [...]
Demonstrate that the classifier is reproducible
Independent validation of the completely specified classifier on a [...]

WHAT IS A MULTIGENE CLASSIFIER?

A multigene expression signature classifier is a function that provides a classification of a tumor based on the expression levels of the component genes. The classes are often good-risk or poor-risk, but classifiers can be defined to distinguish any set of classes for which a training set of cases exists for each class. The term "classifier" is somewhat over-restrictive because a multigene biomarker can be a function that provides a continuous risk score rather than a class identifier. Here we will use the term "classifier," however, because for validation purposes it is usually important that cutoff thresholds of a risk score be defined [...]

Some people prefer the phrase "multigene biomarker" to "multigene classifier." This can lead to serious misunderstandings, however. A completely defined classifier can be used to select patients and stratify patients for therapy, and the clinical effectiveness of the classifier can potentially be validated. Specifying only the genes involved does not enable one to structure prospective clinical validation experiments in which patients are assigned or stratified in prospectively well-defined ways. Hence, one is forever correlating expression of individual genes against outcomes, but never evaluating the use of a defined diagnostic classifier that can be applied to patients. The gene sets identified as associated with outcome tend to be unstable because gene groups are correlated by co-regulation and the stringent criteria used for identifying differentially expressed genes result in reduced statistical power for gene selection. It is often much easier to develop a classifier that performs accurately than it is to identify exactly the [...]

The components of expression signature classifiers need not be valid biomarkers in the sense of the US Food and Drug Administration.3 Those criteria require that the role of the biomarker be mechanistically understood and accepted as markers of disease activity. Such criteria are relevant for biomarkers used as surrogate end points but not for the components of expression
signatures used for tailoring treatments. It is, of course, desirable to understand the mechanistic relationship of the components of an expression signature, but the classifier can be validated without such understanding and clear biologic interpretation may be more difficult to achieve [...]

The concept of "validation" has been problematic for the development of traditional disease biomarkers. Much of the confusion derives from attempting to define validation in an absolute sense. A much more pragmatic and productive approach is to focus on validation for a specified purpose. For example, an expression signature should be developed for the purpose of predicting outcome for a well-defined set of patients who receive a well-defined therapy. The signature classifier would be developed using data from such patients and would be validated for an independent set of such patients. The developmental study would identify the genes to be included in the classifier, usually by screening a much larger set of genes to find those whose expression is most correlated with outcome. The developmental study would also combine the genes into a completely specified classifier that can be used and potentially validated in a subsequent study. The validation does not consist of seeing whether the same genes are prognostic in the subsequent study. The validation should be focused on addressing whether the application of the previously defined classifier to a new set of patients results in clinical benefit. This is discussed further in a [...]

DEVELOPING A GENOMIC CLASSIFIER

What Kinds of Classifiers Are Most Useful?

Many algorithms have been used effectively with DNA microarray data for class prediction. A linear discriminant is a function

$$ l(\mathbf{x}) = \sum_{i \in F} w_i x_i $$

where x_i denotes the expression measurement for the ith gene, w_i is the weight given to that gene, and the summation is over the set F of features (genes) selected for inclusion in the classifier. For a two-class problem, there is a threshold value c that must be defined; a sample with expression profile defined by a vector x of values is predicted to be in class 1 or class 2 depending on whether l(x) as computed from the equation is less than or greater than c.

Many kinds of classifiers used in the literature have the form shown in the preceding equation. They differ with regard to how the weights are determined. These classifiers include Fisher's linear discriminant analysis and diagonal discriminant analysis,5 the compound covariate predictor of Radmacher et al,6 the weighted voting method of Golub et al,7 support vector machines with inner product kernel,8 perceptrons,9 and the naïve Bayes classifier for [...]

When the number of genes (p) is greater than the number of cases (n), perfect separation of a training set is always possible with a linear classifier. In fact, there are an infinite number of linear classifiers that achieve perfect separation. That suggests that there may not be sufficient information in most datasets to effectively utilize nonlinear classifiers. Although complex nonlinear classifiers are popular, there is very little evidence that they perform any better than simpler methods.

In the study of Dudoit et al,5 the simplest methods, diagonal linear discriminant analysis and nearest-neighbor classification, performed as well as or better than the more complex methods. Nearest-neighbor classification is based on a distance function d(x, y), which measures the distance between the expression profiles x and y of two samples. The distance function utilizes only the genes in the selected set of genes F. To classify a sample with expression profile y, compute d(x, y) for each sample x in the training set. The predicted class of y is the class of the sample in the training set that is closest to y with regard to the distance function.

Paik et al11 used linear classifiers for predicting recurrence risk of patients with primary breast cancer. Paik et al identified 16 genes for inclusion in the classifier. These included five proliferation genes, four genes related to estrogen metabolism, two Her2 genes, two genes related to tissue invasion, and three other genes. These genes were selected on the basis of their correlation with recurrence in a training set of data. The classifier was based on computing the average expression level for each gene group and then a weighted average of the gene group–specific averages. The genes not in the proliferation, estrogen, Her2, or invasion groups were taken as members of singleton groups. The weights were determined to optimize prediction on the training set. The final components of the classifier determined from the training set were two cutpoints for the weighted sum of gene expression in order to define groups with a low risk, intermediate risk, and high risk [...]

How Many Genes Should Be Included in the Classifier?

Most classifiers do not use all of the genes whose expression is measured. Consequently, one step in developing a classifier is determining which genes to include; this is called feature selection. Using all of the genes means that all of the genes would have to be measured in the future for classification of new patients. That is particularly problematic if the classifier is going to be converted to a real-time reverse transcriptase polymerase chain reaction (RT-PCR) platform. Also, the number of genes that are actually differentially expressed between the classes (ie, "informative
genes") is usually small compared to the number of genes that are not differentially expressed ("noise genes"). Including too many noise genes can dilute the influence of the informative genes and reduce the accuracy of prediction. It also makes interpretation and future use of the [...]

It is sometimes possible to distinguish very different cell types based on expression levels of a small number of genes. Even if such genes are not known a priori, they can be identified if they are very differentially expressed in the two cell types. This is often not the case for more difficult classification problems, however. For these problems there may be a dozen or more differentially expressed genes, but the fold differences in expression may not be large and it may be difficult to identify these genes from among the thousands of noise genes. Omitting informative genes from a classifier has a greater deleterious effect on classification accuracy than does inclusion of noise genes, so long as the number of noise genes included is not too great. Consequently, in many cases accurate classifiers can be developed, but it is more difficult to develop such classifiers based on a very small number of genes.

INTERNAL VALIDATION OF A CLASSIFIER IN DEVELOPMENTAL STUDIES

It is useful to divide genomic classifier studies into developmental studies and validation studies. Developmental studies are the ones that first develop the classifiers and are analogous to phase II clinical trials. They should include an indication of whether the genomic classifier is promising and worthy of phase III evaluation. There are special problems in evaluating whether a genomic classifier is promising based on a developmental study, however. The difficulty derives from the fact that the number of candidate genes available for use in the classifier is much larger than the number of cases available for analysis. In such situations, it is always possible to find classifiers that accurately classify the data on which they were developed even if there is no relationship between expression of any of the genes and outcome.6 Consequently, even in developmental studies, some kind of validation on data not used for developing the model is necessary. This internal validation is usually accomplished either by splitting the data into two portions, one used for training the model and the other for testing the model, or by some form of cross validation based on repeated model development and testing on random data partitions. This internal validation should not, however, be confused with the kind of external validation of the classifier in a setting simulating broad [...]

Split-Sample Validation

The most straightforward method of estimating the accuracy of future prediction is the split-sample validation method of partitioning the set of samples into a training set and a test set. Rosenwald et al12 used this approach successfully in their international study of prognostic prediction for large B cell lymphoma. They used two thirds of their samples as a training set. Multiple kinds of predictors were studied on the training set. When the collaborators of that study agreed on a single fully specified prediction model, they accessed the test set for the first time. On the test set there was no adjustment of the model or fitting of parameters. They merely used the samples in the test set to evaluate the predictions of the model that was completely specified using only the training data. In addition to estimating the overall error rate on the test set, one can also estimate other important operating characteristics of the test such as sensitivity, specificity, positive and negative [...]

The split-sample method is often used with so few samples in the test set, however, that the validation is almost meaningless. One can evaluate the adequacy of the size of the test set by computing the statistical significance of the classification error rate on the test set or by computing a confidence interval for the test-set error rate. Since the test set is separate from the training set, the number of errors on the test set has a binomial distribution.

Michiels et al13 suggested that multiple training-test partitions be used, rather than just one. The split sample approach is mostly useful, however, when one does not have a well-defined algorithm for developing the classifier. When there is a single training set–test set partition, one can perform numerous unplanned analyses on the training set to develop a classifier and then test that classifier on the test set. With multiple training-test partitions, however, that type of flexible approach to model development cannot be used. If one has an algorithm for classifier development, it is generally better to use one of the cross validation or bootstrap resampling approaches to estimating error rate because the split sample approach does not provide as efficient a use of the available data.14 Some of the conclusions of Michiels et al about the inaccuracy of published expression profiles may be artifacts of their [...]

Cross Validation

Cross validation is an alternative to the split sample method of estimating prediction accuracy.6 Molinaro et al14 describe and evaluate many variants of cross-validation and bootstrap resampling for classification problems where the number of candidate predictors vastly exceeds the number of cases. For illustration we will describe leave-one-out cross validation (LOOCV). LOOCV starts like split-sample cross validation in forming a training set of samples and a test set. With LOOCV, however, the test set consists of only a single sample; the rest of
the samples are placed in the training set. The sample in the test set is placed aside and not utilized at all in the development of the class prediction model. Using only the training set, the informative genes are selected and the parameters of the model are fit to the data. Let us call M1 the model developed with sample 1 in the test set. When this model is fully developed, it is used to predict the class of sample 1. This prediction is made using the expression profile of sample 1, but obviously without using knowledge of the true class of sample 1. This predicted class is compared to the true class label of sample 1. If they disagree, then the prediction is in error. Then a new training set–test set partition is created. This time sample 2 is placed in the test set and all of the other samples, including sample 1, are placed in the training set. A new model is constructed from scratch using the samples in the new training set. Call this model M2. Although the same algorithm for gene selection and parameter estimation is used, since model M2 is constructed from scratch on the new training set, it will in general not contain exactly the same gene set as M1. After creating M2, it is applied to the expression profile of sample 2, which was omitted. If this predicted class does not agree with the true class label of the second sample, then the prediction is in error. The process is repeated leaving each of the n biologically independent samples out of the training set, one at a time. During the steps, n different models are created and each one is used to predict the class of the omitted sample. The number of prediction errors is totaled and reported as the leave-one-out cross-validated estimate of the prediction error.

At the end of the LOOCV procedure, you have constructed n different models. They were constructed in order only to estimate the prediction error associated with the type of model constructed. The model that would be used for future predictions is one constructed using all n samples. That is the best model for future prediction and the one that should be reported in the publication. The cross-validated error rate is an estimate of the error rate to be expected in use of this model for future samples, assuming that the relationship between class and expression profile is the same for future samples as for the currently available samples. With two classes, one can use a similar approach to obtain cross-validated estimates of the sensitivity, specificity, and the negative and positive predictive values of the classification procedure. One could even estimate an entire receiver operating characteristics curve.

The cross-validated prediction error is an estimate of the prediction error associated with application of the algorithm for model building to the entire dataset. A commonly used invalid estimate is called the re-substitution estimate. You use all the samples to develop a model. Then you predict the class of each sample using that model. The predicted class labels are compared to the true class labels and the errors are totaled.

Simon et al15 performed a simulation to examine the bias in estimated error rates for class prediction. Two types of LOOCV were studied: one with removal of the left-out specimen before selection of differentially expressed genes and one with removal of the left-out specimen before computation of gene weights and the prediction rule but after gene selection. They also computed the re-substitution estimate of the error rate. In a simulated dataset, 20 gene expression profiles of length 6,000 were randomly generated from the same distribution. Ten profiles were arbitrarily assigned to class 1 and the other 10 to class 2, creating an artificial separation of the profiles into two classes. Since no true underlying difference exists between the two classes, class prediction will perform no better than a random guess for future biologically independent samples. Hence, the estimated error rates for simulated data sets should be centered around 0.5 (ie, 10 misclassifications of 20).

Figure 1 shows the observed number of misclassifications resulting from each level of cross validation for 2,000 simulated data sets. It is well known that the re-substitution estimate of error is biased for small data sets and the simulation confirms this, with an astounding 98.2% of the simulated data sets resulting in zero misclassifications even though no true underlying difference exists between the two groups. Moreover, the maximum number of misclassified profiles using the resubstitution method was [...]

Fig 1. The effect of various levels of cross validation on the estimated error rate of a predictor. Two thousand datasets were simulated as described in the text. Class labels were arbitrarily assigned to the specimens within each dataset, and so poor classification accuracy is expected. Class prediction was performed on each dataset as described in the supplemental information, varying the level of leave-one-out cross validation used in prediction. Vertical bars indicate the proportion of simulated data sets (of 2,000) resulting in a given number of misclassifications for a specified cross-validation strategy. Reprinted from Simon et al.15 (Panel title: Cross validation: none (resubstitution method). x-axis: No. of Misclassifications, 0 to 20. y-axis: Proportion of Simulated Data Sets.)

Cross validating the prediction rule after selection of differentially expressed genes from the full data set does little to correct the bias of the re-substitution estimator: 90.2% of simulated data sets still result in zero misclassifications. It is not until gene selection is also subjected to cross validation that we observe results in line with our
expectation: the median number of misclassified profiles
ple, Rosenwald et al12 developed a classifier of outcome for
jumps to 11, although the range is large (0 to 20).
patients with advanced diffuse large B cell lymphoma
The simulation results underscore the importance
receiving CHOP chemotherapy. The International Prog-
of cross validating all steps of predictor construction in
nostic Index (IPI) is easily measured and prognostically
estimating the error rate. A study of breast cancer also
important for such patients, however, and so it was impor-
illustrates the point: van’t Veer et al16 predicted clinical out-
tant for Rosenwald et al to address whether their classifier
come of patients with axillary node-negative breast cancer
(metastatic disease within 5 years v disease free at 5 years)
The most effective way of addressing whether a classi-
from gene expression profiles. The investigators controlled
fier adds predictive accuracy to a standard classification sys-
the number of misclassified recurrent cases (ie, the sensitiv-
tem is to examine outcome for the new system within the
ity of the test) in both situations, so here we focus attention
levels of the standard system. This was the approach used by
on the difference in estimated error rates for the disease-free
Rosenwald et al12 for data in their separate test set. This is
cases. Partial and complete cross validation resulted in esti-
illustrated in Figure 2. The spread of the outcome survival
mated error rates of 27% (12 of 44) and 41% (18 of 44),
curves for the classes defined by the new expression classi-
respectively. The improperly cross-validated method results
fier within levels of the IPI indicate the extent to which the
in a seriously biased underestimate of the error rate, prob-
new system adds classification accuracy. When the classifier
ably largely due to overfitting the predictor to the specific
has been completely determined on a training set of data,
dataset. Other examples of incorrect use of cross validation
then the statistical significance of the contribution of the
are described by Ambroise and McLachlan.17 There are
new classifier to the standard IPI can be computed easily
numerous articles in the most prominent journals, written
from a log-rank test using the test-set data.
by both biologists and methodologists, that make claims
Measuring whether a classifier adds predictive accu-
for gene expression classifiers and for new classification
racy when there is not a separate test set is more difficult.
algorithms, which are invalid because they have cross
Curves such as those shown in Figure 2 can be constructed
using the predicted class of each case as determined by
It is important to compute the statistical significance
cross validation. The separation of the survival curves
of the cross-validated estimate of classification error. This
within levels of the standard prognostic factor is still a valid
determines the probability of obtaining a cross-validated
measure of the independent contribution of the expression
classification error as small as actually achieved if there
classifier, but the statistical significance of the contribution
were no relationship between the expression data and class
can no longer be determined by computing a log-rank
identifiers. A flexible method for computing this statistical
significance was described by Radmacher et al.6 It involves randomly permuting the class identifiers among the patients and then recalculating the cross-validated classification error for the permuted data. This is done a large number of times to generate the null distribution of the cross-validated prediction error. If the value of the cross-validated error obtained for the real data lies far enough in the tail of this null distribution, then the results are statistically significant. This method of computing statistical significance of cross-validated error rate for a wide variety of classifier functions is implemented in the BRB-ArrayTools software (National Cancer Institute, Bethesda, MD).18 Statistical significance, however, does not imply that the prediction accuracy is sufficient for the test to have clinical value.
DOES THE CLASSIFIER PERFORM BETTER THAN STANDARD PROGNOSTIC FACTORS?
Even if a classifier is developed for a set of patients sufficiently homogeneous and uniformly treated to be therapeutically relevant, it may be important to evaluate whether the classifier predicts more accurately than do standard prognostic factors or adds predictive accuracy to that provided by standard prognostic factors. For exam-
test of the separation in survival curves. The standard log-rank test is not valid because the classes were not determined independently of the data. The cross-validation process induces a dependence among cases that invalidates the standard statistical analysis. The statistical significance of the independent contribution of the new classifier can be determined using more complex permutation [...]
Several important publications have attempted to determine the relative importance of an expression classifier and standard prognostic factors by using standard multivariate statistical models, such as the logistic model for binary response data and the proportional hazards model for survival data. The models often include standard prognostic factors and the predicted class of a case based on a cross-validation analysis.16 Statistical significance and CIs for the regression coefficients corresponding to each factor are then computed using the usual formulas. This kind of analysis is problematic, however.20 There is also a more fundamental problem with this kind of analysis. The value of an expression-based classifier is determined by its prediction accuracy. Consequently, the analysis should emphasize estimating prediction accuracy, not the size of regression coefficients, in additive multivariate models.21
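The permutation procedure described above for judging the significance of a cross-validated error rate can be illustrated with a minimal Python sketch. It is not the BRB-ArrayTools implementation. The nearest-centroid classifier, the leave-one-out scheme, the add-one correction to the P value, and all function names are assumptions chosen for illustration. In a real analysis, any gene selection step would also have to be repeated inside each cross-validation loop and for each permutation.

```python
import random


def nearest_centroid_predict(train_x, train_y, x):
    """Return the label whose training centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(xs, ys):
    """Leave-one-out cross-validated misclassification rate."""
    errors = 0
    for i in range(len(xs)):
        # The classifier is rebuilt from scratch without case i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Compare the observed cross-validated error with its permutation null distribution."""
    rng = random.Random(seed)
    observed = loocv_error(xs, ys)
    as_good = 0
    for _ in range(n_perm):
        permuted = list(ys)
        rng.shuffle(permuted)  # randomly reassign class identifiers among patients
        if loocv_error(xs, permuted) <= observed:
            as_good += 1
    # Add-one correction (an assumption of this sketch) keeps the estimate above zero.
    return observed, (as_good + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of eight cases, two features each.
    xs = [[0.1 * i, 0.1 * (i % 3)] for i in range(8)] + \
         [[5 + 0.1 * i, 5 + 0.1 * (i % 3)] for i in range(8)]
    ys = [0] * 8 + [1] * 8
    err, p = permutation_p_value(xs, ys)
    print(f"cross-validated error = {err:.3f}, permutation P = {p:.4f}")
```

A small P value here indicates only that the cross-validated error is lower than expected by chance; as noted above, it does not by itself show that the accuracy is sufficient for clinical use.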
Information downloaded from jco.ascopubs.org and provided by SWETS SUBSCRIPTION SERVICE for Bayerische
Staatsbibliothek on March 4, 2008 from 194.95.59.195.
Copyright 2005 by the American Society of Clinical Oncology. All rights reserved.
Fig 2. Survival curves for diffuse large-B-cell lymphoma patients by gene expression classifier stratified by three levels of International Prognostic Index (IPI) score: (A) IPI scores 0-1; (B) IPI scores 2-3; (C) IPI scores 4-5. Four prognostic classes were defined based on gene expression risk score. Graphs show survival curves for patients with risk score below the median (quartiles 1 and 2) versus patients with risk score above the median (quartiles 3 and 4). Axis label in each panel: Probability of Survival. Reprinted from Rosenwald et al.12
TRANSLATION OF PLATFORMS AND DEMONSTRATING ASSAY REPRODUCIBILITY
The power of microarray expression profiling lies in the parallel measurement of expression levels for thousands of genes. This is useful for screening genes to find those that should be included in a classifier, but it is rarely necessary to measure expression for hundreds or thousands of genes in application of the classifier to subsequent cases. There are considerable challenges with microarray expression profiling of formalin-fixed paraffin-embedded (FFPE) tissue. With appropriately designed primers, however, RT-PCR can be performed on FFPE tissue.22 Consequently, the developmental strategy of screening the genome using microarrays and then developing genomic classifiers based on a limited number of genes whose expression is measured using RT-PCR on FFPE tissue is [...]
Whether the classifier is based on DNA microarray analysis or on RT-PCR analysis, it is important that the assay be standardized and that evaluations of reproducibility be conducted. The study by Dobbin et al23 demonstrated that microarray protocols using Affymetrix arrays could be sufficiently standardized to achieve good inter- and intra-laboratory reproducibility. Achieving such reproducibility requires standardization of protocols and standardization of platform and reagents, however. One of the challenges in moving genomic classifiers to the clinic is the conduct of such studies. If a genomic classifier is used for identifying a patient population for which an experimental drug is shown to be effective, the drug sponsor has a financial incentive to adequately standardize and validate the classifier so that the classifier can be approved as a diagnostic test. In using genomic classifiers with commercially available therapy, however, it is not clear whether anyone has sufficient incentive to do the laborious but necessary studies needed to standardize and validate the reproducibility of the assay for measuring the classifier.
INDEPENDENT VALIDATION OF GENOMIC CLASSIFIERS
Although studies that develop classifiers often report a seemingly impressive accuracy for predicting outcome, there is abundant reason to demand external validation based on truly independent data. We refer to this as external validation because it is based on independent data external to the study used to develop the classifier. The analysis of high-dimensional gene expression data is complex and there are many examples of serious errors in internal estimates of accuracy included in publications in the best journals. There are also potential biases in internal estimates of accuracy based on tissue handling and assay reagent differences between cases and controls or responders and nonresponders. Developmental studies also often utilize patients selected in a manner that may not be representative of the diversity of patients to whom the classifier would be applied if it were adopted for broad clinical use. Developmental studies also often have the assay performed in one research laboratory based on archived specimens and this may not reflect the sources of assay variability likely to be encountered in broad practice.24
Often the initial study in which the classifier is developed will not be large enough to estimate the positive and negative predictive values of the test with sufficient
precision to determine whether the test has real clinical utility. It is important that the intended clinical use of the classifier be carefully considered in planning the external validation study so that these performance characteristics [...]
The objective of external validation is to determine whether use of a completely specified diagnostic classifier for therapeutic decision making in a defined clinical context results in patient benefit. The objective is not to repeat the developmental study and see if the same genes are prognostic or if the same classifier is obtained. An independent validation study could be a prospective clinical trial in which patients are randomly assigned to treatment assignment without use of the classifier versus treatment assignment with the aid of the classifier. Often, however, this design will be inefficient and require a huge sample size because many or most of the patients will receive the same treatment either way they are assigned. For example, consider women with lymph node-negative, estrogen receptor (ER)-positive breast cancers. Approximately one third of such patients might be expected to be classified as low risk for recurrence based on the Oncotype DX expression signature-based risk score.11 If one wants to test the strategy of withholding cytotoxic chemotherapy from the subset of patients classified as low risk, it would be inefficient to randomly assign all of the node-negative, ER-positive patients. If one randomly assigns all the patients and performs the assay on only the half assigned to have classifier-based therapy, then the two randomization groups must be compared overall, although two thirds of the patients receive the same treatment in both arms. A more efficient alternative is to perform the assay up front for all patients, and then randomly assign only those classified as low risk. Those patients would be assigned to receive either tamoxifen alone or tamoxifen plus cytotoxic chemotherapy. If the low-risk patients do not benefit from cytotoxic chemotherapy, then the genomic classifier is clinically useful because it enables chemotherapy to be withheld from patients who otherwise would have received it.
Randomly assigning only the patients classified as low risk is more efficient than assigning all of the patients, but it still would require many patients. It is a therapeutic equivalence trial in the sense that finding no difference in outcome changes clinical practice; consequently it is important to be able to detect small differences. Since the expected recurrence rate is so low, it would take many patients to detect a difference between the treatment arms. But if the recurrence rate is as low as predicted by the classifier, then the benefit of chemotherapy is necessarily extremely small. Consequently, an alternative design for external validation is a single-arm study in which the patients classified as low risk are treated with tamoxifen alone. If, with long follow-up, these patients have a very low recurrence rate, then the classifier is considered validated for providing clinical benefit because it enabled the identification of patients whose prognosis was so good with tamoxifen monotherapy that they could be spared the toxicity, inconvenience and expense of chemotherapy. This was the approach used by Paik et al11 for validation of the Oncotype DX classifier for patients with node-negative, ER-positive breast cancer. The genes that seemed prognostic were initially identified based on published microarray studies. Primers for measuring expression of those genes using RT-PCR of FFPE tissue were developed and a classifier was developed based on archived tissue from National Surgical Adjuvant Breast and Bowel (NSABP) studies. The completely prespecified classifier was then tested on 668 patients from NSABP B-14 who received tamoxifen alone as systemic therapy. Fifty-one percent of the assayed patients fell into the low-risk group. They had a distant recurrence rate at 10 years of 6.8% (95% CI, 4.0% to 9.6%). Much higher rates of distant recurrence were seen in the intermediate- and high-risk groups of the classifier (14.3% and 30.5%, respectively).
One might argue that treatment determination using a genomic classifier for women with stage I ER-positive breast cancer should not be compared with the strategy of administering to all such women tamoxifen plus chemotherapy, because there are practice guidelines available based on tumor size and age that withhold chemotherapy from some patients. Nevertheless, it would still be inefficient to randomly assign women to genomic classifier-determined therapy or nongenomic practice guidelines-determined therapy in which the genomic classifier is measured only on the women randomly assigned to its use. Most of the women will probably receive the same treatment in whichever arm they are assigned to. It is much more efficient to perform the assay for measuring the genomic classifier, and then randomly assign only the women for whom the two treatment strategies differ. The current plan for independently validating the classifier developed by van’t Veer et al16 for women with primary breast cancer utilizes this design strategy.
Phase III clinical trials generally attempt to utilize an intervention in a manner that it might be used if adopted in broad clinical practice. For evaluating a diagnostic classifier, a multicenter clinical trial provides the challenges of distributed tissue handling and real time assay performance that would be met in general use. The assays might be performed in multiple laboratories and cannot be batched in time with a single set of reagents as might be done in a retrospective study. Consequently, the prospective clinical trial is the gold standard for external validation [...]
External validation based on a new prospective clinical trial will require a long follow-up time for low-risk patients, however. In such circumstances it can be useful to conduct a prospectively planned validation using patients
treated in a previously conducted prospective multicenter clinical trial if archived tumor specimens are available for the vast majority of patients. The validation study should be prospectively planned with at least as much detail and rigor as for prospective accrual of new patients. Although assaying procedures probably cannot be distributed over time in the same way as for newly accrued patients, assay reproducibility studies should be conducted to demonstrate that the assay has been standardized and quality controlled sufficiently so that such sources of variation are negligible. A written protocol should be developed to ensure that the study is planned prospectively to evaluate the clinical benefit of a completely specified genomic classifier for a defined therapeutic decision in a defined population in a hypothesis testing manner as it would be for a prospective clinical trial. The study of Paik et al11 of the Oncotype DX classifier for women with node-negative, ER-positive breast cancer is an example of careful prospective planning of an independent validation [...]
USE OF GENOMIC CLASSIFIERS IN NEW DRUG DEVELOPMENT
The objective of validation of a genomic classifier differs somewhat for existing therapy compared to an experimental therapy. With existing therapy, the emphasis should be on validation of the clinical benefit of using the classifier. With an experimental therapy, however, the emphasis should be on demonstrating effectiveness of the drug in a population identified by the classifier as being more likely to benefit. Simon and Maitournam25 demonstrated that use of a genomic classifier for focusing a clinical trial in this manner can result in a dramatic reduction in required sample size, depending on the sensitivity and specificity of the classifier for identifying such patients. Not only can such targeting provide a huge improvement in efficiency in phase III development, it also provides an increased therapeutic ratio of benefit to toxicity and results in a greater proportion of treated patients who benefit.
Developing a genomic classifier of which patients are likely to benefit for targeting phase III trials may require larger phase II studies. This depends on the type of drug being developed. For example, if the drug is an inhibitor of a kinase mutated in cancer, then there is a natural diagnostic and no genome-wide screening is needed. Similarly, in the comparison of trastuzumab plus chemotherapy to chemotherapy alone in chemotherapy-naïve and -refractory metastatic breast cancer patients,26,27 cases with less than a 2+ level of expression of the Her2/neu protein were excluded. In the development of gefitinib, had the phosphorylation domain of the EGFR gene been sequenced in responders and nonresponders on phase II trials of non-small-cell lung cancer patients, mutation status could have been used in focusing the phase III trials.28,29 For many molecularly targeted drugs, however, the appropriate assay for selecting patients is not known, and development of a classifier based on comparing expression profiles for phase II responders versus phase II nonresponders may be the best approach. In such instances, one may not have sufficient confidence in the genomic classifier developed in phase II to use it for excluding patients in phase III trials. It may be better in this case to accept all conventionally eligible patients, and use the classifier to define a single subset analysis for the patients predicted to be most responsive to the new drug. The overall null hypothesis for all randomly assigned patients is tested at the 0.04 significance level. The remaining 0.01 of the usual 5% false-positive rate is reserved for testing the new treatment in the subset predicted by the classifier to be responsive. This analysis strategy provides sponsors an incentive for developing genomic classifiers for targeting therapy in a manner that does not unduly deprive them of the possibility of broad labeling indications when justified by [...]
CONCLUSIONS
Oncologists need improved tools for selecting treatments for individual patients. The genomic technologies available today are sufficient to develop such tools. There is not broad understanding of the steps needed to translate research findings of correlations between gene expression and prognosis into robust diagnostics validated to be of clinical benefit. This article has attempted to identify some of the major steps needed for such translation. Many of these steps are neither easy nor cheap. For therapeutic decision settings of sufficient importance, attention should be devoted to establishing a means of funding and expeditiously carrying out these steps.
Author’s Disclosures of Potential Conflicts of Interest
The authors indicated no potential conflicts of interest.
REFERENCES
1. Simon R, Altman DG: Statistical aspects of prognostic factor studies in oncology. Br J [...]
2. Simon RM, Korn EL, McShane LM, et al: Design and analysis of DNA microarray [...]
3. FDA: Draft guidance for industry: Pharmacogenomics data submission. Rockville, MD, [...]
4. Pusztai L, Hess KR: Clinical trial design [...]
5. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for classification of tumors using gene expression [...]
6. Radmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles. J Comput Biol 9:505-511, [...]
7. Golub TR, Slonim DK, Tamayo P, et al: Molecular classification of cancer: Class discovery and class prediction by gene expression [...]
8. Ramaswamy S, Tamayo P, Rifkin R, et al: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A [...]
9. Khan J, Wei JS, Ringner M, et al: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7:673-679, [...]
10. Hand DJ, Yu K: Idiot’s Bayes: Not so stupid after all? Int Stat Rev 69:385-398, 2001
11. Paik S, Shak S, Tang G, et al: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med [...]
12. Rosenwald A, Wright G, Chan WC, et al: The use of molecular profiling to predict survival [...]
13. Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: A multiple random validation strategy. The Lancet 365:488- [...]
14. Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: A comparison of resampling methods. Bioinformatics 2005 (in [...]
15. Simon R, Radmacher MD, Dobbin K, et al: Pitfalls in the analysis of DNA microarray data: Class prediction methods. J Natl Cancer Inst [...]
16. van’t Veer LJ, Dai H, Vijver MJVD, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530-536, [...]
17. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A [...]
18. Simon R, Lam AP: BRB-ArrayTools (Version 3.3). Bethesda MD, Biometric Research Branch, National Cancer Institute, http://linus [...]
19. Vasselli J, Shih JH, Iyengar SR, et al: Predicting survival in patients with metastatic kidney cancer by gene expression profiling in the primary tumor. Proc Natl Acad Sci U S A 100: [...]
20. Lusa L, McShane LM, Radmacher MD, et al: Appropriateness of inference procedures based on within-sample validation for assessing gene expression microarray-based prognostic classifier performance. (Submitted for publication)
21. Kattan MW: Judging new markers by their ability to improve predictive accuracy. J Natl [...]
22. Cronin M, Pho M, Dutta D, et al: Measurement of gene expression in archival paraffin-embedded tissues: development and performance of a 92-gene reverse transcriptase-polymerase chain reaction assay. Am J Pathol [...]
23. Dobbin [...] oligonucleotide microarrays. Clin Cancer Res 11: [...]
24. Simon R: When is a genomic classifier ready for prime time? Nat Clin Pract Oncology [...]
25. Simon R, Maitournam A: Evaluating the efficiency of targeted designs for randomized clinical trials. Clin Cancer Res 10:6759-6763, [...]
26. Baselga J: Herceptin alone or in combination with chemotherapy in the treatment of HER2-positive metastatic breast cancer: Pivotal [...]
27. Eiermann [...]
28. Lynch TJ, Bell DW, Sordella R, et al: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J [...]
29. Paez JG, Janne PA, Lee JC, et al: EGFR mutations in lung cancer: Correlation with clinical response to gefitinib therapy. Science 304:1497- [...]