Roadmap for Developing and Validating Therapeutically Relevant Genomic Classifiers
Richard Simon
Oncologists need improved tools for selecting treatments for individual patients. The development of therapeutically relevant prognostic markers has traditionally been slowed by poor study design, inconsistent findings, and lack of proper validation studies. Microarray expression profiling provides an exciting new technology for relating tumor gene expression to patient outcome, but it also provides increased challenges for translating initial research findings into robust diagnostics that benefit patients and physicians in therapeutic decision making.
This article attempts to clarify some of the misconceptions about the development and validation of multigene expression signature classifiers and highlights the steps needed to move genomic signatures into clinical application as therapeutically relevant and robust diagnostics.
Oncologists need improved tools for selecting treatments for individual patients. Most cancer treatments benefit only a minority of the patients to whom they are administered. Being able to predict which patients are most likely to benefit would not only save patients from unnecessary toxicity and inconvenience, but might facilitate their receiving drugs that are more likely to help them. In addition, the current overtreatment of patients results in major expense for individuals and society, an expense that may not [...] provided an exciting new technology for attempting to identify classifiers for tailoring treatments to patients. To date, however, [...] been widely adopted into oncology practice and very few are close to achieving such status. Development of biomarker classifiers [...] and sufficiently validated for broad clinical application is difficult, and more difficult for expression signature classifiers. The field of microarray expression profiling is [...] and excessive skepticism. In this article, I will attempt to clarify some of the misconceptions [...] classifiers and highlight the steps needed [...] application as therapeutically relevant and [...]
WHY ARE SO FEW PROGNOSTIC FACTORS USED IN ONCOLOGY?
Although there is a large literature on prognostic factors for cancer patients, very few such factors are used in clinical practice. Prognostic factors are unlikely to be used unless they are therapeutically relevant, [...] studies are conducted using a convenience sample of patients for whom tissue is available, but the cohort is often far too heterogeneous with regard to stage and treatment to support therapeutically relevant conclusions. Additional problems in the prognostic marker literature derive from the fact
Information downloaded from and provided by SWETS SUBSCRIPTION SERVICE for Bayerische Staatsbibliothek on March 4, 2008 from Copyright 2005 by the American Society of Clinical Oncology. All rights reserved.
markers and prognostic models, but do not test prespecified models using independent data. Clinical drug trials are generally prospective, with patient selection criteria, primary end point, hypotheses, and analysis plan specified in advance in a written protocol. The consumers of clinical trial reports have been educated to be skeptical of data dredging to find something "statistically significant" to report in clinical trials. They are skeptical of analyses with multiple end points or multiple subsets, knowing that the chances of erroneous conclusions increase rapidly once one leaves the context of a focused, single-hypothesis clinical trial. Prognostic marker studies are generally performed with no written protocol, no eligibility criteria, no primary end point or hypotheses and no defined analysis plan. The analysis often includes numerous analyses of different end points and patient subsets. The problem is not just that the studies are for developing prognostic markers rather than validating previously specified markers, but that even as developmental studies the planning and analysis [...]
Another feature that has hindered the use of prognostic markers in medical practice is the lack of studies demonstrating the reproducibility of results for assaying markers either between laboratories, between samples of the same tissue specimen, or between times and readers [...]
Many of these problems apply to studies of prognostic classifiers on gene expression profiles. Some of the problems are even more formidable. Because of the number of genes available for analysis, microarray data can be a veritable fountain of false findings unless a structured approach to model development and validation [...]
Some of the key steps in obtaining a classifier that is ready for "prime time" are listed in Table 1. These steps are discussed in the following sections. We have already discussed the importance of developing the classifier for a specific therapeutic decision problem and using cases relevant to that decision context. That is of key importance. There are, however, some well-defined therapeutic decision contexts where even accurate, reproducible, and well-validated classifiers are unlikely to be used widely. For example, consider the treatment of patients with advanced disease treated with a potentially curative treatment. A classifier for predicting the patients unlikely to respond to that therapy may not be widely used if there is no good alternative treatment. The classifier would have to have a very high negative predictive value in order to justify withholding a potentially curative therapy. It is important to evaluate carefully the context of therapeutic decision making if one wants to develop a classifier that has a sufficiently great chance of having clinical impact to warrant the large expense and time commitment required to achieve the other parts of Table 1.
Table 1. Key Steps in Development and Validation of Therapeutically [...]
Develop classifier for addressing a specific important therapeutic decision
Patients are sufficiently homogeneous and receiving uniform treatment so that results are therapeutically relevant
Treatment options and costs of mis-classification are such that a [...]
Perform internal validation of classifier to assess whether it appears sufficiently accurate relative to standard prognostic factors that it is [...]
Translate classifier to platform that would be used for broad clinical [...]
Demonstrate that the classifier is reproducible
Independent validation of the completely specified classifier on a [...]
WHAT IS A MULTIGENE CLASSIFIER?
A multigene expression signature classifier is a function that provides a classification of a tumor based on the expression levels of the component genes. The classes are often good-risk or poor-risk, but classifiers can be defined to distinguish any set of classes for which a training set of cases exist for each class. The term "classifier" is somewhat over-restrictive because a multigene biomarker can be a function that provides a continuous risk score rather than a class identifier. Here we will use the term "classifier" however, because for validation purposes it is usually important that cutoff thresholds of a risk score be defined [...]
Some people prefer the phrase "multigene biomarker" to "multigene classifier." This can lead to serious misunderstandings, however. A completely defined classifier can be used to select patients and stratify patients for therapy, and the clinical effectiveness of the classifier can potentially be validated. Specifying only the genes involved does not enable one to structure prospective clinical validation experiments in which patients are assigned or stratified in prospectively well-defined ways. Hence, one is forever correlating expression of individual genes against outcomes, but never evaluating the use of a defined diagnostic classifier that can be applied to patients. The gene sets identified as associated with outcome tend to be unstable because gene groups are correlated by co-regulation and the stringent criteria used for identifying differentially expressed genes results in reduced statistical power for gene selection. It is often much easier to develop a classifier that performs accurately than it is to identify exactly the [...]
The components of expression signature classifiers need not be valid biomarkers in the sense of the US Food and Drug Administration.3 Those criteria require that the role of the biomarker be mechanistically understood and accepted as markers of disease activity. Such criteria are relevant for biomarkers used as surrogate end points but not for the components of expression
signatures used for tailoring treatments. It is, of course, desirable to understand the mechanistic relationship of the components of an expression signature, but the classifier can be validated without such understanding and clear biologic interpretation may be more difficult to achieve [...]
The concept of "validation" has been problematic for the development of traditional disease biomarkers. Much of the confusion derives from attempting to define validation in an absolute sense. A much more pragmatic and productive approach is to focus on validation for a specified purpose. For example, an expression signature should be developed for the purpose of predicting outcome for a well-defined set of patients who receive a well-defined therapy. The signature classifier would be developed using data from such patients and would be validated for an independent set of such patients. The developmental study would identify the genes to be included in the classifier, usually by screening a much larger set of genes to find those whose expression is most correlated with outcome. The developmental study would also combine the genes into a completely specified classifier that can be used and potentially validated in a subsequent study. The validation does not consist of seeing whether the same genes are prognostic in the subsequent study. The validation should be focused on addressing whether the application of the previously defined classifier to a new set of patients results in clinical benefit. This is discussed further in a [...]
DEVELOPING A GENOMIC CLASSIFIER
What Kinds of Classifiers Are Most Useful?
Many algorithms have been used effectively with DNA microarray data for class prediction. A linear discriminant [...]
$$l(x) = \sum_{i \in F} w_i x_i$$
where $x_i$ denotes the expression measurement for the ith gene, $w_i$ is the weight given to that gene, and the summation is over the set F of features (genes) selected for inclusion in the classifier. For a two-class problem, there is a threshold value c that must be defined; a sample with expression profile defined by a vector x of values is predicted to be in class 1 or class 2 depending on whether l(x) as computed from the equation is less than or greater than c.
Many kinds of classifiers used in the literature have the form shown in the preceding equation. They differ with regard to how the weights are determined. These classifiers include Fisher's linear discriminant analysis and diagonal discriminant analysis,5 the compound covariate predictor of Radmacher et al,6 the weighted voting method of Golub et al,7 support vector machines with inner product kernel,8 perceptrons,9 and the naïve Bayes classifier for [...]
When the number of genes (p) is greater than the number of cases (n), perfect separation of a training set is always possible with a linear classifier. In fact, there are an infinite number of linear classifiers that achieve perfect separation. That suggests that there may not be sufficient information in most datasets to effectively utilize nonlinear classifiers. Although complex nonlinear classifiers are popular, there is very little evidence that they perform any better than simpler methods.
In the study of Dudoit et al,5 the simplest methods, diagonal linear discriminant analysis and nearest-neighbor classification, performed as well or better than the more complex methods. Nearest-neighbor classification is based on a distance function d(x, y), which measures the distance between the expression profiles x and y of two samples. The distance function utilizes only the genes in the selected set of genes F. To classify a sample with expression profile y, compute d(x, y) for each sample x in the training set. The predicted class of y is the class of the sample in the training set that is closest to y with regard to the distance [...]
Paik et al11 used linear classifiers for predicting recurrence risk of patients with primary breast cancer. Paik et al identified 16 genes for inclusion in the classifier. These included five proliferation genes, four genes related to estrogen metabolism, two Her2 genes, two genes related to tissue invasion, and three other genes. These genes were selected on the basis of their correlation with recurrence in a training set of data. The classifier was based on computing the average expression level for each gene group and then a weighted average of the gene group-specific averages. The genes not in the proliferation, estrogen, Her2 or invasion groups were taken as members of singleton groups. The weights were determined to optimize prediction on the training set. The final component of the classifier determined based on the training set were two cutpoints for the weighted sum of gene expression in order to define groups with a low risk, intermediate risk, and [...]
How Many Genes Should Be Included in the Classifier?
Most classifiers do not use all of the genes whose expression is measured. Consequently, one step in developing a classifier is determining which genes to include; this is called feature selection. Using all of the genes means that all of the genes would have to be measured in the future for classification of new patients. That is particularly problematic if the classifier is going to be converted to a real-time reverse transcriptase polymerase chain reaction (RT-PCR) platform. Also, the number of genes that are actually differentially expressed between the classes (ie, "informative
genes") is usually small compared to the number of genes that are not differentially expressed ("noise genes"). Including too many noise genes can dilute the influence of the informative genes and reduce the accuracy of prediction. It also makes interpretation and future use of the [...]
It is sometimes possible to distinguish very different cell types based on expression levels of a small number of genes. Even if such genes are not known a priori, they can be identified if they are very differentially expressed in the two cell types. This is often not the case for more difficult classification problems however. For these problems there may be a dozen or more differentially expressed genes, but the fold differences in expression may not be large and it may be difficult to identify these genes from among the thousands of noise genes. Omitting informative genes from a classifier has a greater deleterious effect on classification accuracy than does inclusion of noise genes, so long as the number of noise genes included is not too great. Consequently, in many cases accurate classifiers can be developed, but it is more difficult to develop such classifiers based on a very small number of genes.
INTERNAL VALIDATION OF A CLASSIFIER IN DEVELOPMENTAL STUDIES
It is useful to divide genomic classifier studies into developmental studies and validation studies. Developmental studies are the ones that first develop the classifiers and are analogous to phase II clinical trials. They should include an indication of whether the genomic classifier is promising and worthy of phase III evaluation. There are special problems in evaluating whether a genomic classifier is promising based on a developmental study, however. The difficulty derives from the fact that the number of candidate genes available for use in the classifier is much larger than the number of cases available for analysis. In such situations, it is always possible to find classifiers that accurately classify the data on which they were developed even if there is no relationship between expression of any of the genes and outcome.6 Consequently, even in developmental studies, some kind of validation on data not used for developing the model is necessary. This internal validation is usually accomplished either by splitting the data into two portions, one used for training the model and the other for testing the model, or some form of cross validation based on repeated model development and testing on random data partitions. This internal validation should not, however, be confused with the kind of external validation of the classifier in a setting simulating broad [...]
Split-Sample Validation
The most straightforward method of estimating the accuracy of future prediction is the split-sample validation method of partitioning the set of samples into a training set and a test set. Rosenwald et al12 used this approach successfully in their international study of prognostic prediction for large B cell lymphoma. They used two thirds of their samples as a training set. Multiple kinds of predictors were studied on the training set. When the collaborators of that study agreed on a single fully specified prediction model, they accessed the test set for the first time. On the test set there was no adjustment of the model or fitting of parameters. They merely used the samples in the test set to evaluate the predictions of the model that was completely specified using only the training data. In addition to estimating the overall error rate on the test set, one can also estimate other important operating characteristics of the test such as sensitivity, specificity, positive and negative [...]
The split-sample method is often used with so few samples in the test set, however, that the validation is almost meaningless. One can evaluate the adequacy of the size of the test set by computing the statistical significance of the classification error rate on the test set or by computing a confidence interval for the test-set error rate. Since the test set is separate from the training set, the number of errors on the test set has a binomial [...]
Michiels et al13 suggested that multiple training-test partitions be used, rather than just one. The split sample approach is mostly useful, however, when one does not have a well-defined algorithm for developing the classifier. When there is a single training set-test set partition, one can perform numerous unplanned analyses on the training set to develop a classifier and then test that classifier on the test set. With multiple training-test partitions however, that type of flexible approach to model development cannot be used. If one has an algorithm for classifier development, it is generally better to use one of the cross validation or bootstrap resampling approaches to estimating error rate because the split sample approach does not provide as efficient a use of the available data.14 Some of the conclusions of Michiels et al about the inaccuracy of published expression profiles may be artifacts of their [...]
Cross Validation
Cross validation is an alternative to the split sample method of estimating prediction accuracy.6 Molinaro et al14 describe and evaluate many variants of cross-validation and bootstrap resampling for classification problems where the number of candidate predictors vastly exceeds the number of cases. For illustration we will describe leave-one-out cross validation (LOOCV). LOOCV starts like split-sample cross validation in forming a training set of samples and a test set. With LOOCV, however, the test set consists of only a single sample; the rest of
the samples are placed in the training set. The sample in the test set is placed aside and not utilized at all in the development of the class prediction model. Using only the training set, the informative genes are selected and the parameters of the model are fit to the data. Let us call M1 the model developed with sample 1 in the test set. When this model is fully developed, it is used to predict the class of sample 1. This prediction is made using the expression profile of sample 1, but obviously without using knowledge of the true class of sample 1. This predicted class is compared to the true class label of sample 1. If they disagree, then the prediction is in error. Then a new training set-test set partition is created. This time sample 2 is placed in the test set and all of the other samples, including sample 1, are placed in the training set. A new model is constructed from scratch using the samples in the new training set. Call this model M2. Although the same algorithm for gene selection and parameter estimation is used, since model M2 is constructed from scratch on the new training set, it will in general not contain exactly the same gene set as M1. After creating M2, it is applied to the expression profile of sample 2, which was omitted. If this predicted class does not agree with the true class label of the second sample, then the prediction is in error. The process is repeated leaving each of the n biologically independent samples out of the training set, one at a time. During the steps, n different models are created and each one is used to predict the class of the omitted sample. The number of prediction errors is totaled and reported as the leave-one-out cross-validated estimate of the prediction error.
At the end of the LOOCV procedure, you have constructed n different models. They were constructed in order only to estimate the prediction error associated with the type of model constructed. The model that would be used for future predictions is one constructed using all n samples. That is the best model for future prediction and the one that should be reported in the publication. The cross-validated error rate is an estimate of the error rate to be expected in use of this model for future samples, assuming that the relationship between class and expression profile is the same for future samples as for the currently available samples. With two classes, one can use a similar approach to obtain cross-validated estimates of the sensitivity, specificity, and the negative and positive predictive values of the classification procedure. One could even estimate an entire receiver operating characteristics curve.
The cross-validated prediction error is an estimate of the prediction error associated with application of the algorithm for model building to the entire dataset. A commonly used invalid estimate is called the re-substitution estimate. You use all the samples to develop a model. Then you predict the class of each sample using that model. The predicted class labels are compared to the true class labels and the errors are totaled.
Simon et al15 performed a simulation to examine the bias in estimated error rates for class prediction. Two types of LOOCV were studied: one with removal of the left-out specimen before selection of differentially expressed genes and one with removal of the left-out specimen before computation of gene weights and the prediction rule but after gene selection. They also computed the re-substitution estimate of the error rate. In a simulated dataset, 20 gene expression profiles of length 6,000 were randomly generated from the same distribution. Ten profiles were arbitrarily assigned to class 1 and the other 10 to class 2, creating an artificial separation of the profiles into two classes. Since no true underlying difference exists between the two classes class prediction will perform no better than a random guess for future biologically independent samples. Hence, the estimated error rates for simulated data sets should be centered around 0.5 (ie, 10 misclassifications of 20).
Figure 1 shows the observed number of misclassifications resulting from each level of cross validation for 2,000 simulated data sets. It is well known that the re-substitution estimate of error is biased for small data sets and the simulation confirms this, with an astounding 98.2% of the simulated data sets resulting in zero misclassifications even though no true underlying difference exists between the two groups. Moreover, the maximum number of misclassified profiles using the resubstitution method was [...]
Fig 1. The effect of various levels of cross validation on the estimated error rate of a predictor. Two thousand datasets were simulated as described in the text. Class labels were arbitrarily assigned to the specimens within each dataset, and so poor classification accuracy is expected. Class prediction was performed on each dataset as described in the supplemental information, varying the level of leave-one-out cross validation used in prediction. Vertical bars indicate the proportion of simulated data sets (of 2,000) resulting in a given number of misclassifications for a specified cross-validation strategy. Reprinted from Simon et al.15 [Axes: No. of Misclassifications (0 to 20) versus Proportion of Simulated Data Sets; panel label: Cross validation: none (resubstitution method).]
Cross validating the prediction rule after selection of differentially expressed genes from the full data set does little to correct the bias of the re-substitution estimator: 90.2% of simulated data sets still result in zero misclassifications. It is not until gene selection is also subjected to cross validation that we observe results in line with our
expectation: the median number of misclassified profiles ple, Rosenwald et al12 developed a classifier of outcome for jumps to 11, although the range is large (0 to 20).
patients with advanced diffuse large B cell lymphoma The simulation results underscore the importance receiving CHOP chemotherapy. The International Prog- of cross validating all steps of predictor construction in nostic Index (IPI) is easily measured and prognostically estimating the error rate. A study of breast cancer also important for such patients, however, and so it was impor- illustrates the point: van’t Veer et al16 predicted clinical out- tant for Rosenwald et al to address whether their classifier come of patients with axillary node-negative breast cancer (metastatic disease within 5 years v disease free at 5 years) The most effective way of addressing whether a classi- from gene expression profiles. The investigators controlled fier adds predictive accuracy to a standard classification sys- the number of misclassified recurrent cases (ie, the sensitiv- tem is to examine outcome for the new system within the ity of the test) in both situations, so here we focus attention levels of the standard system. This was the approach used by on the difference in estimated error rates for the disease-free Rosenwald et al12 for data in their separate test set. This is cases. Partial and complete cross validation resulted in esti- illustrated in Figure 2. The spread of the outcome survival mated error rates of 27% (12 of 44) and 41% (18 of 44), curves for the classes defined by the new expression classi- respectively. The improperly cross-validated method results fier within levels of the IPI indicate the extent to which the in a seriously biased underestimate of the error rate, prob- new system adds classification accuracy. When the classifier ably largely due to overfitting the predictor to the specific has been completely determined on a training set of data, dataset. 
Other examples of incorrect use of cross validation then the statistical significance of the contribution of the are described by Ambroise and McLachlan.17 There are new classifier to the standard IPI can be computed easily numerous articles in the most prominent journals, written from a log-rank test using the test-set data.
by both biologists and methodologists, that make claims Measuring whether a classifier adds predictive accu- for gene expression classifiers and for new classification racy when there is not a separate test set is more difficult.
algorithms, which are invalid because they have cross
It is important to compute the statistical significance of the cross-validated estimate of classification error. This determines the probability of obtaining a cross-validated classification error as small as actually achieved if there were no relationship between the expression data and class identifiers. A flexible method for computing this statistical significance was described by Radmacher et al.6 It involves randomly permuting the class identifiers among the patients and then recalculating the cross-validated classification error for the permuted data. This is done a large number of times to generate the null distribution of the cross-validated prediction error. If the value of the cross-validated error obtained for the real data lies far enough in the tail of this null distribution, then the results are statistically significant. This method of computing statistical significance of cross-validated error rate for a wide variety of classifier functions is implemented in the BRB-ArrayTools software (National Cancer Institute, Bethesda, MD).18 Statistical significance does not imply, however, that the prediction accuracy is sufficient for the test to have clinical value.
DOES THE CLASSIFIER PERFORM BETTER THAN STANDARD PROGNOSTIC FACTORS?
Even if a classifier is developed for a set of patients sufficiently homogeneous and uniformly treated to be therapeutically relevant, it may be important to evaluate whether the classifier predicts more accurately than do standard prognostic factors or adds predictive accuracy to that provided by standard prognostic factors. For exam-
Curves such as those shown in Figure 2 can be constructed using the predicted class of each case as determined by cross validation. The separation of the survival curves within levels of the standard prognostic factor is still a valid measure of the independent contribution of the expression classifier, but the statistical significance of the contribution can no longer be determined by computing a log-rank test of the separation in survival curves. The standard log-rank test is not valid because the classes were not determined independently of the data. The cross-validation process induces a dependence among cases that invalidates the standard statistical analysis. The statistical significance of the independent contribution of the new classifier can be determined using more complex permutation
Several important publications have attempted to determine the relative importance of an expression classifier and standard prognostic factors by using standard multivariate statistical models, such as the logistic model for binary response data and the proportional hazards model for survival data. The models often include standard prognostic factors and the predicted class of a case based on a cross-validation analysis.16 Statistical significance and CIs for the regression coefficients corresponding to each factor are then computed using the usual formulas. This kind of analysis is problematic, however.20 There is also a more fundamental problem with this kind of analysis. The value of an expression based classifier is determined by its prediction accuracy. Consequently, the analysis should emphasize estimating prediction accuracy, not the size of regression coefficients, in additive multivariate models.21
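The permutation procedure attributed above to Radmacher et al can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the nearest-centroid classifier, the ranking of genes by difference in class means, the simulated toy data, and the function names `loocv_error` and `permutation_p_value`. It is not the BRB-ArrayTools implementation. The one point it is meant to show is that gene selection is repeated inside every cross-validation fold, and that the whole cross-validation is rerun for each permutation of the class labels.

```python
import random

def centroid(rows, genes):
    """Mean expression of each gene over the given training cases."""
    return [sum(r[g] for r in rows) / len(rows) for g in genes]

def loocv_error(X, y, n_genes=5):
    """Leave-one-out error of a nearest-centroid classifier (illustrative choice).

    Genes are re-selected inside every fold, so the held-out case never
    influences which genes are used.
    """
    n, all_genes = len(y), range(len(X[0]))
    wrong = 0
    for i in range(n):
        a = [X[j] for j in range(n) if j != i and y[j] == 0]
        b = [X[j] for j in range(n) if j != i and y[j] == 1]
        ca, cb = centroid(a, all_genes), centroid(b, all_genes)
        # Rank genes by absolute difference in class means, training cases only.
        top = sorted(all_genes, key=lambda g: -abs(ca[g] - cb[g]))[:n_genes]
        da = sum((X[i][g] - ca[g]) ** 2 for g in top)
        db = sum((X[i][g] - cb[g]) ** 2 for g in top)
        predicted = 1 if db < da else 0
        wrong += predicted != y[i]
    return wrong / n

def permutation_p_value(X, y, n_perm=100, seed=0):
    """Share of label permutations whose cross-validated error is as small as observed."""
    rng = random.Random(seed)
    observed = loocv_error(X, y)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)  # break any link between expression and class
        if loocv_error(X, shuffled) <= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Simulated toy data (an assumption): 16 cases, 40 genes, 5 informative genes.
rng = random.Random(1)
y = [0] * 8 + [1] * 8
X = [[rng.gauss(0, 1) + (3.0 if label == 1 and g < 5 else 0.0) for g in range(40)]
     for label in y]
observed, p_value = permutation_p_value(X, y)
print("cross-validated error:", observed, "permutation p-value:", p_value)
```

A small p-value here says only that the cross-validated error is unlikely under no relationship between expression and class; as the text notes, it does not show that the accuracy is high enough to be clinically useful.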
Fig 2. Survival curves for diffuse large-B-cell lymphoma patients by gene expression classifier stratified by three levels of International Prognostic Index (IPI) score: (A) IPI scores 0-1; (B) IPI scores 2-3; (C) IPI scores 4-5. Four prognostic classes were defined based on gene expression risk score. Graphs show survival curves (vertical axis: probability of survival) for patients with risk score below the median (quartiles 1 and 2) versus patients with risk score above the median (quartiles 3 and 4). Reprinted from Rosenwald et al.12
TRANSLATION OF PLATFORMS AND DEMONSTRATING
The power of microarray expression profiling lies in the parallel measurement of expression levels for thousands of genes. This is useful for screening genes to find those that should be included in a classifier, but it is rarely necessary to measure expression for hundreds or thousands of genes in application of the classifier to subsequent cases.
There are considerable challenges with microarray expression profiling of formalin-fixed paraffin-embedded (FFPE) tissue. With appropriately designed primers, however, RT-PCR can be performed on FFPE tissue.22 Consequently, the developmental strategy of screening the genome using microarrays and then developing genomic classifiers based on a limited number of genes whose expression is measured using RT-PCR on FFPE tissue is
Whether the classifier is based on DNA microarray analysis or on RT-PCR analysis, it is important that the assay be standardized and that evaluations of reproducibility be conducted. The study by Dobbin et al23 demonstrated that microarray protocols using Affymetrix arrays could be sufficiently standardized to achieve good inter- and intra-laboratory reproducibility. Achieving such reproducibility requires standardization of protocols and standardization of platform and reagents, however. One of the challenges in moving genomic classifiers to the clinic is the conduct of such studies. If a genomic classifier is used for identifying a patient population for which an experimental drug is shown to be effective, the drug sponsor has a financial incentive to adequately standardize and validate the classifier so that the classifier can be approved as a diagnostic test. In using genomic classifiers with commercially available therapy, however, it is not clear whether anyone has sufficient incentive to do the laborious but necessary studies needed to standardize and validate the reproducibility of the assay for measuring the classifier.
Although studies that develop classifiers often report a seemingly impressive accuracy for predicting outcome, there is abundant reason to demand external validation based on truly independent data. We refer to this as external validation because it is based on independent data external to the study used to develop the classifier. The analysis of high-dimensional gene expression data is complex and there are many examples of serious errors in internal estimates of accuracy included in publications in the best journals. There are also potential biases in internal estimates of accuracy based on tissue handling and assay reagent differences between cases and controls or responders and nonresponders. Developmental studies also often utilize patients selected in a manner that may not be representative of the diversity of patients to whom the classifier would be applied if it were adopted for broad clinical use. Developmental studies also often have the assay performed in one research laboratory based on archived specimens and this may not reflect the sources of assay variability likely to be encountered in broad practice.24
Often the initial study in which the classifier is developed will not be large enough to estimate the positive and negative predictive values of the test with sufficient
precision to determine whether the test has real clinical utility. It is important that the intended clinical use of the classifier be carefully considered in planning the external validation study so that these performance character-
The objective of external validation is to determine whether use of a completely specified diagnostic classifier for therapeutic decision making in a defined clinical context results in patient benefit. The objective is not to repeat the developmental study and see if the same genes are prognostic or if the same classifier is obtained. An independent validation study could be a prospective clinical trial in which patients are randomly assigned to treatment assignment without use of the classifier versus treatment assignment with the aid of the classifier. Often, however, this design will be inefficient and require a huge sample size because many or most of the patients will receive the same treatment either way they are assigned. For example, consider women with lymph node-negative, estrogen receptor (ER)-positive breast cancers. Approximately one third of such patients might be expected to be classified as low risk for recurrence based on the Oncotype-DX expression signature-based risk score.11 If one wants to test the strategy of withholding cytotoxic chemotherapy from the subset of patients classified as low risk, it would be inefficient to randomly assign all of the node-negative, ER-positive patients. If one randomly assigns all the patients and performs the assay on only the half assigned to have classifier based therapy, then the two randomization groups must be compared overall, although two thirds of the patients receive the same treatment in both arms. A more efficient alternative is to perform the assay up front for all patients, and then randomly assign only those classified as low risk. Those patients would be assigned to receive either tamoxifen alone or tamoxifen plus cytotoxic chemotherapy. If the low-risk patients do not benefit from cytotoxic chemotherapy, then the genomic classifier is clinically useful because it enables chemotherapy to be withheld from patients who otherwise would have received it.
Randomly assigning only the patients classified as low risk is more efficient than assigning all of the patients, but it still would require many patients. It is a therapeutic equivalence trial in the sense that finding no difference in outcome changes clinical practice; consequently it is important to be able to detect small differences. Since the expected recurrence rate is so low, it would take many patients to detect a difference between the treatment arms. But if the recurrence rate is as low as predicted by the classifier, then the benefit of chemotherapy is necessarily extremely small. Consequently, an alternative design for external validation is a single-arm study in which the patients classified as low risk are treated with tamoxifen alone. If, with long follow-up, these patients have a very low recurrence rate, then the classifier is considered validated for providing clinical benefit because it enabled the identification of patients whose prognosis was so good with tamoxifen monotherapy that they could be spared the toxicity, inconvenience and expense of chemotherapy.
This was the approach used by Paik et al11 for validation of the OncoType Dx classifier for patients with node-negative, ER-positive breast cancer. The genes that seemed prognostic were initially identified based on published microarray studies. Primers for measuring expression of those genes using RT-PCR of FFPE tissue were developed and a classifier was developed based on archived tissue from National Surgical Adjuvant Breast and Bowel (NSABP) studies. The completely prespecified classifier was then tested on 668 patients from NSABP B-14 who received tamoxifen alone as systemic therapy. Fifty-one percent of the assayed patients fell into the low-risk group. They had a distant recurrence rate at 10 years of 6.8% (95% CI, 4.0% to 9.6%). Much higher rates of distant recurrence were seen in the intermediate- and high-risk groups of the classifier (14.3% and 30.5%, respectively).
One might argue that treatment determination using a genomic classifier for women with stage I ER-positive breast cancer should not be compared with the strategy of administering to all such women tamoxifen plus chemotherapy, because there are practice guidelines available based on tumor size and age that withhold chemotherapy from some patients. Nevertheless, it would still be inefficient to randomly assign women to genomic classifier-determined therapy or nongenomic practice guidelines-determined therapy in which the genomic classifier is measured only on the women randomly assigned to its use. Most of the women will probably receive the same treatment in whichever arm they are assigned to. It is much more efficient to perform the assay for measuring the genomic classifier, and then randomly assign only the women for whom the two treatment strategies differ. The current plan for independently validating the classifier developed by van’t Veer et al16 for women with primary breast cancer utilizes this design strategy.
Phase III clinical trials generally attempt to utilize an intervention in a manner that it might be used if adopted in broad clinical practice. For evaluating a diagnostic classifier, a multicenter clinical trial provides the challenges of distributed tissue handling and real time assay performance that would be met in general use. The assays might be performed in multiple laboratories and cannot be batched in time with a single set of reagents as might be done in a retrospective study. Consequently, the prospective clinical trial is the gold standard for external validation
External validation based on a new prospective clinical trial will require a long follow-up time for low-risk patients, however. In such circumstances it can be useful to conduct a prospectively planned validation using patients
treated in a previously conducted prospective multicenter clinical trial if archived tumor specimens are available for the vast majority of patients. The validation study should be prospectively planned with at least as much detail and rigor as for prospective accrual of new patients. Although assaying procedures probably cannot be distributed over time in the same way as for newly accrued patients, assay reproducibility studies should be conducted to demonstrate that the assay has been standardized and quality controlled sufficiently so that such sources of variation are negligible. A written protocol should be developed to ensure that the study is planned prospectively to evaluate the clinical benefit of a completely specified genomic classifier for a defined therapeutic decision in a defined population in a hypothesis testing manner as it would for a prospective clinical trial. The study of Paik et al11 of the OncoType Dx classifier for women with node-negative, ER-positive breast cancer is an example of careful prospective planning of an independent validation
USE OF GENOMIC CLASSIFIERS IN NEW
The objective of validation of a genomic classifier differs somewhat for existing therapy compared to an experimental therapy. With existing therapy, the emphasis should be on validation of the clinical benefit of using the classifier. With an experimental therapy, however, the emphasis should be on demonstrating effectiveness of the drug in a population identified by the classifier as being more likely to benefit. Simon and Maitournam25 demonstrated that use of a genomic classifier for focusing a clinical trial in this manner can result in a dramatic reduction in required sample size, depending on the sensitivity and specificity of the classifier for identifying such patients. Not only can such targeting provide a huge improvement in efficiency in phase III development, it also provides an increased therapeutic ratio of benefit to toxicity and results in a greater proportion of treated patients who benefit.
Developing a genomic classifier of which patients are likely to benefit for targeting phase III trials may require larger phase II studies. This depends on the type of drug being developed. For example, if the drug is an inhibitor of a kinase mutated in cancer, then there is a natural diagnostic and no genome-wide screening is needed. Similarly, in the comparison of trastuzumab plus chemotherapy to chemotherapy alone in chemotherapy-naïve and -refractory metastatic breast cancer patients,26,27 cases with less than a 2+ level of expression of the Her2/neu protein were excluded. In the development of gefitinib, had the phosphorylation domain of the EGFR gene been sequenced in responders and nonresponders on phase II trials of non-small-cell lung cancer patients, mutation status could have been used in focusing the phase III trials.28,29 For many molecularly targeted drugs, however, the appropriate assay for selecting patients is not known, and development of a classifier based on comparing expression profiles for phase II responders versus phase II nonresponders may be the best approach. In such instances, one may not have sufficient confidence in the genomic classifier developed in phase II to use it for excluding patients in phase III trials. It may be better in this case to accept all conventionally eligible patients, and use the classifier to define a single subset analysis for the patients predicted to be most responsive to the new drug. The overall null hypothesis for all randomly assigned patients is tested at the .04 significance level. The remaining portion (0.01) of the usual 5% false-positive rate is reserved for testing the new treatment in the subset predicted by the classifier to be responsive, so that the two tests together stay within the overall 5% rate. This analysis strategy provides sponsors an incentive for developing genomic classifiers for targeting therapy in a manner that does not unduly deprive them of the possibility of broad labeling indications when justified by
CONCLUSIONS
Oncologists need improved tools for selecting treatments for individual patients. The genomic technologies available today are sufficient to develop such tools. There is not broad understanding of the steps needed to translate research findings of correlations between gene expression and prognosis into robust diagnostics validated to be of clinical benefit. This article has attempted to identify some of the major steps needed for such translation. Many of these steps are neither easy nor cheap. For therapeutic decision settings of sufficient importance, attention should be devoted to establishing a means of funding and expeditiously carrying out these steps.
Author’s Disclosures of Potential Conflicts of Interest
The authors indicated no potential conflicts of interest.
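Returning to the phase III analysis plan described under new drug development, in which the 5% false-positive rate is divided between an overall test at .04 and a classifier-defined subset test at .01, the decision rule can be written as a short sketch. The function name `split_alpha_decision`, the returned labels, and the choice to look at the subset only when the overall test fails are assumptions made for illustration; the article states the two significance levels but not an implementation.

```python
def split_alpha_decision(p_overall, p_subset,
                         alpha_overall=0.04, alpha_subset=0.01):
    """Return the claim supported under a split false-positive-rate plan.

    p_overall: p-value for the treatment effect in all randomized patients.
    p_subset:  p-value for the treatment effect in the subset the classifier
               predicts to be responsive.
    Because the two thresholds sum to 0.05, the chance of any false-positive
    claim is at most 5% by the union (Bonferroni) bound.
    """
    if alpha_overall + alpha_subset > 0.05 + 1e-12:
        raise ValueError("thresholds exceed the overall 5% false-positive rate")
    if p_overall <= alpha_overall:
        return "effective in all randomized patients"
    # Assumed fallback order: the subset is examined when the overall test fails.
    if p_subset <= alpha_subset:
        return "effective in classifier-selected subset"
    return "no claim"

print(split_alpha_decision(0.03, 0.50))
print(split_alpha_decision(0.20, 0.004))
print(split_alpha_decision(0.045, 0.02))
```

The design choice the sketch makes visible is that a sponsor keeps nearly the full significance level for a broad indication while still being able to claim benefit in the classifier-selected subset when the overall comparison falls short.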
REFERENCES
1. Simon R, Altman DG: Statistical aspects of prognostic factor studies in oncology. Br J
2. Simon RM, Korn EL, McShane LM, et al: Design and analysis of DNA microarray
3. FDA: Draft guidance for industry: Pharmacogenomics data submission. Rockville, MD,
4. Pusztai L, Hess KR: Clinical trial design
5. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for classification of tumors using gene expression
6. Radmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles. J Comput Biol 9:505-511,
7. Golub TR, Slonim DK, Tamayo P, et al: Molecular classification of cancer: Class discovery and class prediction by gene expression
8. Ramaswamy S, Tamayo P, Rifkin R, et al: Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA
9. Khan J, Wei JS, Ringner M, et al: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7:673-679,
10. Hand DJ, Yu K: Idiot’s Bayes: Not so stupid after all? Int Stat Rev 69:385-398, 2001
11. Paik S, Shak S, Tang G, et al: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med
12. Rosenwald A, Wright G, Chan WC, et al: The use of molecular profiling to predict survival
13. Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: A multiple random validation strategy. The Lancet 365:488-
14. Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: A comparison of resampling methods. Bioinformatics 2005 (in press)
15. Simon R, Radmacher MD, Dobbin K, et al: Pitfalls in the analysis of DNA microarray data: Class prediction methods. J Natl Cancer Inst
16. van’t Veer LJ, Dai H, Vijver MJVD, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530-536,
17. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A
18. Simon R, Lam AP: BRB-ArrayTools (Version 3.3). Bethesda MD, Biometric Research Branch, National Cancer Institute, http://linus
19. Vasselli J, Shih JH, Iyengar SR, et al: Predicting survival in patients with metastatic kidney cancer by gene expression profiling in the primary tumor. Proc Natl Acad Sci U S A 100:
20. Lusa L, McShane LM, Radmacher MD, et al: Appropriateness of inference procedures based on within-sample validation for assessing gene expression microarray-based prognostic classifier performance. (Submitted for publication)
21. Kattan MW: Judging new markers by their ability to improve predictive accuracy. J Natl
22. Cronin M, Pho M, Dutta D, et al: Measurement of gene expression in archival paraffin-embedded tissues: development and performance of a 92-gene reverse transcriptase-polymerase chain reaction assay. Am J Pathol
23. Dobbin … oligonucleotide microarrays. Clin Cancer Res 11:
24. Simon R: When is a genomic classifier ready for prime time? Nat Clin Pract Oncology
25. Simon R, Maitournam A: Evaluating the efficiency of targeted designs for randomized clinical trials. Clin Cancer Res 10:6759-6763,
26. Baselga J: Herceptin alone or in combination with chemotherapy in the treatment of HER2-positive metastatic breast cancer: Pivotal
27. Eiermann
28. Lynch TJ, Bell DW, Sordella R, et al: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J
29. Paez JG, Janne PA, Lee JC, et al: EGFR mutations in lung cancer: Correlation with clinical response to gefitinib therapy. Science 304:1497-

