Chapter 5 Risk Prediction and Validation (Part I)


5.1 Risk Prediction/Stratification

  • For binary outcomes, a risk prediction is related to the predicted probability that an individual has or will develop a certain condition/trait.

  • Examples of a risk prediction for a binary outcome:

    • Probability that someone has hypertension.

    • Probability that someone has type-2 diabetes.

    • Probability that someone will be hospitalized within the next year.

  • Risk stratification: Often, a risk prediction model will only report which of several risk strata someone belongs to.

    • For example, high, medium, or low risk.
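  • As a quick illustration, predicted probabilities can be mapped to risk strata with cut() in R; the probabilities and the 0.1/0.3 cutoffs below are hypothetical choices made up for this sketch.
pred.prob <- c(0.02, 0.15, 0.45, 0.80)  # made-up predicted probabilities
cut(pred.prob, breaks=c(0, 0.1, 0.3, 1),
    labels=c("low", "medium", "high"), include.lowest=TRUE)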

5.2 Area under the ROC curve and the C-statistic

5.2.1 Sensitivity and Specificity

  • Let $g(x_i)$ be a risk score for an individual with covariate vector $x_i$.

  • Higher values of $g(x_i)$ are supposed to imply a greater probability that the binary outcome $Y_i$ is equal to 1.

    • Though $g(x_i)$ does not necessarily have to be a predicted probability.
  • The value $t$ is a threshold that determines whether we predict $Y_i = 1$ or $Y_i = 0$.

    • $g(x_i) \geq t$ implies we predict $Y_i = 1$
    • $g(x_i) < t$ implies we predict $Y_i = 0$

  • The sensitivity of the risk score $g(x_i)$ with the threshold $t$ is defined as the probability you “predict” $Y_i = 1$ assuming that $Y_i$ is, in fact, equal to 1.

  • In other words, the sensitivity is the probability of making the right decision given that $Y_i = 1$.

    • Sensitivity is often called the “true positive rate”.
  • The sensitivity is defined as:
    $$\text{Sensitivity}(t; g) = P\{g(x_i) \geq t \mid Y_i = 1\} = \frac{P\{g(x_i) \geq t,\, Y_i = 1\}}{P\{Y_i = 1\}} = \frac{P\{g(x_i) \geq t,\, Y_i = 1\}}{P\{g(x_i) \geq t,\, Y_i = 1\} + P\{g(x_i) < t,\, Y_i = 1\}}$$

  • For a worthless risk score that is totally uninformative about the outcome, we should expect the sensitivity to be close to $P\{g(x_i) \geq t\}$.


  • You can compute the in-sample sensitivity with
    $$\widehat{\text{Sensitivity}}(t; g) = \frac{\sum_{i=1}^{n} I\{g(x_i) \geq t,\, Y_i = 1\}}{\sum_{i=1}^{n} I\{g(x_i) \geq t,\, Y_i = 1\} + \sum_{i=1}^{n} I\{g(x_i) < t,\, Y_i = 1\}} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}$$

  • The specificity of the risk score $g(x_i)$ with the threshold $t$ is defined as the probability you “predict” $Y_i = 0$ assuming that $Y_i$ is, in fact, equal to 0.

  • The specificity is defined as:
    $$\text{Specificity}(t; g) = P\{g(x_i) < t \mid Y_i = 0\} = \frac{P\{g(x_i) < t,\, Y_i = 0\}}{P\{g(x_i) < t,\, Y_i = 0\} + P\{g(x_i) \geq t,\, Y_i = 0\}}$$

  • Note that $1 - \text{Specificity}(t; g) = P\{g(x_i) \geq t \mid Y_i = 0\}$.

    • $1 - \text{Specificity}(t; g)$ is often called the “false positive rate”.

  • You can compute the in-sample specificity with
    $$\widehat{\text{Specificity}}(t; g) = \frac{\sum_{i=1}^{n} I\{g(x_i) < t,\, Y_i = 0\}}{\sum_{i=1}^{n} I\{g(x_i) < t,\, Y_i = 0\} + \sum_{i=1}^{n} I\{g(x_i) \geq t,\, Y_i = 0\}} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}$$

  • For a worthless risk score that is totally uninformative about the outcome, we should expect the specificity to be close to $P\{g(x_i) < t\}$.

  • Note that high values of both sensitivity and specificity are good.

    • For a “perfect” risk score, both sensitivity and specificity would be equal to 1.
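  • To make these formulas concrete, here is a tiny numeric illustration in R; the confusion-table counts below are made up.
tp <- 40; fn <- 10   # outcomes with Y=1: predicted positive / predicted negative
tn <- 80; fp <- 20   # outcomes with Y=0: predicted negative / predicted positive
tp/(tp + fn)         # sensitivity (true positive rate) = 0.8
tn/(tn + fp)         # specificity (true negative rate) = 0.8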

5.2.2 The ROC curve

  • The receiver operating characteristic (ROC) curve graphically depicts how sensitivity and specificity change as the threshold t varies.

  • Let $t_i = g(x_i)$ and let $t_{(1)} > t_{(2)} > \dots > t_{(n)}$ be the ordered values of $t_1, \dots, t_n$.

  • To construct an ROC curve, we are going to plot sensitivity vs. 1 − specificity for each of the thresholds $t_{(1)}, \dots, t_{(n)}$.


  • Let $x_i = 1 - \widehat{\text{Specificity}}(t_{(i)}; g)$ and $y_i = \widehat{\text{Sensitivity}}(t_{(i)}; g)$.

  • We will define $x_0 = y_0 = 0$ and $x_{n+1} = y_{n+1} = 1$.

    • $x_0$, $y_0$ represent 1 − specificity and sensitivity when using $t = \infty$.
    • $x_{n+1}$, $y_{n+1}$ represent 1 − specificity and sensitivity when using $t = -\infty$.
  • Plotting $y_i$ vs. $x_i$ for $i = 0, \dots, n+1$ will give you the ROC curve.

  • Ideally, the values of $y_i$ will be close to 1 for all $i$.

  • For a worthless risk score, we should expect both $y_i$ and $x_i$ to be roughly equal to $P\{g(x_i) \geq t_{(i)}\}$.

  • Hence, the plot of $y_i$ vs. $x_i$ should be fairly close to the line $y = x$.

5.2.3 Computing the ROC curve

  • To practice computing an ROC curve, we will use the Wisconsin Breast Cancer data, available as the biopsy dataset in the MASS package.
library(MASS)
data(biopsy)
head(biopsy)
##        ID V1 V2 V3 V4 V5 V6 V7 V8 V9     class
## 1 1000025  5  1  1  1  2  1  3  1  1    benign
## 2 1002945  5  4  4  5  7 10  3  2  1    benign
## 3 1015425  3  1  1  1  2  2  3  1  1    benign
## 4 1016277  6  8  8  1  3  4  3  7  1    benign
## 5 1017023  4  1  1  3  2  1  3  1  1    benign
## 6 1017122  8 10 10  8  7 10  9  7  1 malignant
## Look at number of "benign" and "malignant" tumors
table(biopsy$class)
## 
##    benign malignant 
##       458       241
biopsy$tumor.type <- ifelse(biopsy$class=="malignant", 1, 0)
table(biopsy$tumor.type)
## 
##   0   1 
## 458 241
  • Let’s compute risk scores for tumor malignancy by fitting a logistic regression model with tumor.type (malignant vs. benign) as the outcome and variables V1, V3, V4, V7, V8 as the covariates.

  • Our risk score for the $i$th case will be the predicted probability of having a malignant tumor given the covariate information.

logreg.model <- glm(tumor.type ~ V1 + V3 + V4 + V7 + V8, family="binomial", data=biopsy)
risk.score <- logreg.model$fitted.values
  • Let’s now compute the sensitivity and specificity for a threshold of t=0.5
Sensitivity <- function(thresh, Y, risk.score) {
    # proportion of Y = 1 cases with risk score >= thresh (true positive rate)
    sum((risk.score >= thresh)*Y)/sum(Y)
}
Specificity <- function(thresh, Y, risk.score) {
    # proportion of Y = 0 cases with risk score < thresh (true negative rate)
    sum((risk.score < thresh)*(1 - Y))/sum(1 - Y)
}
Sensitivity(0.5, Y=biopsy$tumor.type, risk.score)
## [1] 0.9502075
Specificity(0.5, Y=biopsy$tumor.type, risk.score)
## [1] 0.9759825
  • For the threshold of t=0.5, we have a sensitivity of about 0.95 and a specificity of about 0.975.

  • Now, let’s compute the ROC curve by computing sensitivity and specificity for each risk score (plus the values of 0 and 1).
sorted.riskscores <- c(1, sort(risk.score, decreasing=TRUE), 0) 
mm <- length(sorted.riskscores)
roc.y <- roc.x <- rep(0, mm)
for(k in 1:mm) {
   thresh.val <- sorted.riskscores[k]
   roc.y[k] <- Sensitivity(thresh.val, Y=biopsy$tumor.type, risk.score)
   roc.x[k] <- 1 - Specificity(thresh.val, Y=biopsy$tumor.type, risk.score)
}
plot(roc.x, roc.y, main="ROC curve for biopsy data", xlab="1 - Specificity",
     ylab="Sensitivity", las=1)
lines(roc.x, roc.y, type="s")
abline(0, 1)


  • Let’s compare the logistic regression ROC curve with a worthless risk score where we generate risk scores randomly from a uniform distribution.
rr <- runif(nrow(biopsy))
sorted.rr <- c(1, sort(rr, decreasing=TRUE), 0) 
mm <- length(sorted.rr)
roc.random.y <- roc.random.x <- rep(0, mm)
for(k in 1:mm) {
   thresh.val <- sorted.rr[k]
   roc.random.y[k] <- Sensitivity(thresh.val, Y=biopsy$tumor.type, rr)
   roc.random.x[k] <- 1 - Specificity(thresh.val, Y=biopsy$tumor.type, rr)
}
plot(roc.random.x, roc.random.y, main="ROC curve for biopsy data", xlab="1 - Specificity",
     ylab="Sensitivity", las=1, col="red", type="n")
lines(roc.random.x, roc.random.y, type="s", col="red")
lines(roc.x, roc.y, type="s")
legend("bottomright", legend=c("logistic regression risk scores", "random risk scores"),
       col=c("black", "red"), lwd=2, bty='n')
abline(0, 1)

5.3 Area under the ROC curve

  • The area under the ROC curve (AUC) is the area under the graph of the points $(x_i, y_i)$, $i = 0, \dots, n+1$.

  • The AUC is given by
    $$\text{AUC} = \sum_{i=0}^{n} y_i (x_{i+1} - x_i)$$

5.3.1 Rewriting the formula for the AUC

  • Note that
    $$y_i = \sum_{k=1}^{n} I\{g(x_k) \geq t_{(i)}\} I\{Y_k = 1\} \Big/ \sum_{k=1}^{n} I\{Y_k = 1\} = \frac{1}{n\hat{p}} \sum_{k=1}^{n} a_{ki},$$
    where $n\hat{p} = \sum_{k=1}^{n} I\{Y_k = 1\}$ and $a_{ki} = I\{g(x_k) \geq t_{(i)}\} I\{Y_k = 1\}$.

  • Because $t_{(i)} \geq t_{(i+1)}$:
    $$\begin{aligned} x_{i+1} - x_i &= \sum_{k=1}^{n} \left( I\{g(x_k) \geq t_{(i+1)}, Y_k = 0\} - I\{g(x_k) \geq t_{(i)}, Y_k = 0\} \right) \Big/ \sum_{k=1}^{n} I\{Y_k = 0\} \\ &= \frac{1}{n(1 - \hat{p})} \sum_{k=1}^{n} I\{g(x_k) < t_{(i)}\} I\{g(x_k) \geq t_{(i+1)}\} I\{Y_k = 0\} \\ &= \frac{1}{n(1 - \hat{p})} \sum_{k=1}^{n} I\{g(x_k) = t_{(i+1)}\} I\{Y_k = 0\} = \frac{1}{n(1 - \hat{p})} \sum_{k=1}^{n} b_{ki}, \end{aligned}$$
    where $n(1 - \hat{p}) = \sum_{k=1}^{n} I\{Y_k = 0\}$ and $b_{ki} = I\{g(x_k) = t_{(i+1)}\} I\{Y_k = 0\}$.


  • So, we can express the AUC as:
    $$\text{AUC} = \frac{1}{n^2 \hat{p}(1 - \hat{p})} \sum_{i=0}^{n} \left( \sum_{k=1}^{n} a_{ki} \right) \left( \sum_{j=1}^{n} b_{ji} \right) = \frac{1}{n^2 \hat{p}(1 - \hat{p})} \sum_{k=1}^{n} \sum_{j=1}^{n} \sum_{i=0}^{n} a_{ki} b_{ji} = \frac{1}{n^2 \hat{p}(1 - \hat{p})} \sum_{k=1}^{n} \sum_{j=1}^{n} I\{Y_k = 1\} I\{Y_j = 0\} \sum_{i=0}^{n} I\{g(x_k) \geq t_{(i)}\} I\{g(x_j) = t_{(i+1)}\} \tag{5.1}$$

  • Note now that, because the $t_{(i)}$ are the ordered values of the $g(x_h)$,
    $$\sum_{i=0}^{n} I\{g(x_k) \geq t_{(i)}\} I\{g(x_j) = t_{(i+1)}\} = I\{g(x_k) \geq t_{(j^*-1)}\} = I\{g(x_k) > t_{(j^*)}\},$$
    where $j^*$ is the index such that $t_{(j^*)} = g(x_j)$.

  • Hence,
    $$\sum_{i=0}^{n} I\{g(x_k) \geq t_{(i)}\} I\{g(x_j) = t_{(i+1)}\} = I\{g(x_k) > g(x_j)\} \tag{5.2}$$


  • Now, by plugging (5.2) into (5.1), we can finally express the AUC as
    $$\text{AUC} = \frac{1}{n^2 \hat{p}(1 - \hat{p})} \sum_{k=1}^{n} \sum_{j=1}^{n} I\{g(x_k) > g(x_j)\} I\{Y_k = 1\} I\{Y_j = 0\}$$

5.3.2 Interpreting the AUC

  • We can write the AUC as $\text{AUC} = S_1/S_2$.

  • The sum $S_2$ counts the number of all discordant pairs of responses.

    • The pair of outcomes $(Y_k, Y_j)$ is discordant if $Y_k = 1$ and $Y_j = 0$, or vice versa.
    $$S_2 = \left( \sum_{i=1}^{n} I\{Y_i = 0\} \right)\left( \sum_{i=1}^{n} I\{Y_i = 1\} \right) = \sum_{j=1}^{n} \sum_{k=1}^{n} I\{Y_j = 0\} I\{Y_k = 1\}$$
    • Note that $S_2 = n^2 \hat{p}(1 - \hat{p})$, the denominator appearing in the AUC expression above.
  • Now, look at the sum $S_1$:
    $$S_1 = \sum_{k=1}^{n} \sum_{j=1}^{n} I\{g(x_k) > g(x_j)\} I\{Y_k = 1\} I\{Y_j = 0\}$$

  • $S_1$ looks at all discordant pairs of responses and counts the number of pairs where the risk score ordering agrees with the ordering of the responses.


  • To summarize: the AUC is the proportion of discordant outcome pairs where the risk score ordering for that pair agrees with the ordering of the outcomes.

  • An AUC of 0.5 means that the risk score is performing about the same as a risk score generated at random.

  • The AUC is often referred to as the concordance index, or c-index.


  • Let’s compute the AUC for our logistic regression-based risk score for the biopsy data:
AUC.biopsy <- sum(roc.y[-length(roc.y)]*diff(roc.x)) # rectangle sum of y_i*(x_{i+1} - x_i)
round(AUC.biopsy, 4)
## [1] 0.9928
  • This is a very high AUC: 0.9928
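  • As a check on the double-sum (concordance) formula derived above, the sketch below computes the AUC directly from all pairs of discordant outcomes, reusing the risk scores from Section 5.2.3; it should reproduce the value of AUC.biopsy.
## Count, over all (Y=1, Y=0) pairs, how often the Y=1 case has the higher risk score
case.scores <- risk.score[biopsy$tumor.type == 1]
control.scores <- risk.score[biopsy$tumor.type == 0]
S1 <- sum(outer(case.scores, control.scores, ">")) # pairs where the case outranks the control
S2 <- length(case.scores)*length(control.scores)   # total number of discordant outcome pairs
S1/S2                                              # should match AUC.biopsy above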

5.3.3 Computing the AUC in R

  • You can compute the AUC in R without writing all your own functions by using the pROC package.

  • With the pROC package, you start by inputting your binary outcomes and risk scores into the roc function in order to get an “roc object”.

library(pROC)
roc.biopsy <- roc(biopsy$tumor.type, risk.score)
  • To find the AUC, you can then use auc(roc.biopsy)
auc( roc.biopsy )
## Area under the curve: 0.9928

  • You can plot the ROC curve by just plugging in the roc object into the plot function.
plot(roc.biopsy)

  • To also print the AUC value, you can just add print.auc=TRUE
plot(roc.biopsy, print.auc=TRUE)

5.4 Calibration

  • The AUC, or c-index, is a measure of the statistical discrimination of the risk score.

  • However, a high value of the AUC does not imply that the risk score is well-calibrated.

  • Calibration refers to the agreement between observed frequency of the outcome and the fitted probabilities of those outcomes.


  • For example, if we look at a group of individuals all of whom have a fitted probability of 0.2, then we should expect that roughly 20% of the outcomes in this group will equal 1.

  • We can examine this graphically by looking at the observed proportion of successes vs. the fitted probabilities for several risk strata.


  • Specifically, if we have risk-score cutoffs $r_0 < r_1 < \dots < r_G$, the observed proportion of successes in the $k$th risk stratum is
    $$O_k = \sum_{i=1}^{n} Y_i I\{r_{k-1} < g(x_i) \leq r_k\} \Big/ \sum_{i=1}^{n} I\{r_{k-1} < g(x_i) \leq r_k\}$$

  • The expected proportion of successes in the $k$th risk stratum is
    $$P_k = \sum_{i=1}^{n} g(x_i) I\{r_{k-1} < g(x_i) \leq r_k\} \Big/ \sum_{i=1}^{n} I\{r_{k-1} < g(x_i) \leq r_k\}$$

  • If $g(x_i)$ is well-calibrated, $O_k$ and $P_k$ should be fairly similar for each $k$.


  • Let’s make a calibration plot for the biopsy data.

  • First, let’s make 10 risk strata using the quantiles of our logistic regression-based risk score.

rr <- c(0, quantile(risk.score, prob=seq(0.1, 0.9, by=0.1)), 1)
rr
##                     10%         20%         30%         40%         50% 
## 0.000000000 0.001505966 0.002756161 0.005385182 0.010154320 0.019231612 
##         60%         70%         80%         90%             
## 0.074574998 0.841857858 0.989561027 0.999471042 1.000000000
  • Now, compute observed and expected frequencies for each of the risk strata and plot the result:
observed.freq <- pred.freq <- rep(0, 10)
for(k in 2:11) {
   ind <- risk.score <= rr[k] & risk.score > rr[k-1] # indicators for stratum k-1
   observed.freq[k-1] <- sum(biopsy$tumor.type[ind])/sum(ind) # observed proportion of malignant tumors
   pred.freq[k-1] <- sum(risk.score[ind])/sum(ind) # average fitted probability in the stratum
}
plot(observed.freq, pred.freq, xlab="Observed Frequency",
     ylab = "Predicted Frequency",
     las=1, main="Calibration plot for biopsy data")
lines(observed.freq, pred.freq)
abline(0, 1, lty=2)


  • Most of the risk scores are concentrated near zero, but this calibration plot shows fairly good calibration.
    • The expected vs. observed frequencies are mostly close to the $y = x$ line.
  • Sometimes the estimated intercept and slope from a regression of expected on observed frequencies are reported.
    • We should expect the intercept to be close to 0 and the slope to be close to 1 for a well-calibrated risk score.
lm(pred.freq ~ observed.freq)
## 
## Call:
## lm(formula = pred.freq ~ observed.freq)
## 
## Coefficients:
##   (Intercept)  observed.freq  
##      0.001759       0.994948
  • The Hosmer–Lemeshow test is a more formal test that compares these types of observed vs. expected frequencies.
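  • A minimal sketch of a Hosmer–Lemeshow-type statistic for the 10 risk strata used above is given below; it reuses the cutoffs rr and the fitted risk.score, and compares the statistic to a chi-squared distribution with G − 2 degrees of freedom. Packaged implementations are also available (e.g., hoslem.test() in the ResourceSelection package, if installed).
strata <- cut(risk.score, breaks=rr, include.lowest=TRUE) # same 10 risk strata as above
n.k <- tapply(biopsy$tumor.type, strata, length)          # stratum sizes
obs.k <- tapply(biopsy$tumor.type, strata, mean)          # observed proportions O_k
pred.k <- tapply(risk.score, strata, mean)                # expected proportions P_k
HL.stat <- sum(n.k*(obs.k - pred.k)^2/(pred.k*(1 - pred.k)))
HL.stat
pchisq(HL.stat, df=length(n.k) - 2, lower.tail=FALSE)     # approximate p-value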

5.5 Longitudinal Data and Risk Score Validation

  • When you have longitudinal responses $Y_{ij}$, sensitivity, specificity, and the AUC/c-index will depend on what exactly you are trying to predict.

  • If you are trying to predict $Y_{ij} = 1$ vs. $Y_{ij} = 0$ for each $j$, sensitivity and specificity will be defined by $P\{g(x_{ij}) \geq t \mid Y_{ij} = 1\}$ and $P\{g(x_{ij}) < t \mid Y_{ij} = 0\}$.

  • Here, $g(x_{ij})$ would be a risk score based on covariate information up to and including time $t_{ij}$, and it could include responses observed before time $t_{ij}$.

  • In this case, the AUC would be calculated in the same way as in the non-longitudinal case: you would just sum over all responses and risk scores across all individuals and time points.


  • In other cases, you may want to predict a single outcome $\tilde{Y}_i$ even though the covariates are collected longitudinally over additional time points.

  • For example, $\tilde{Y}_i$ might be an indicator of whether or not a patient is hypertensive over a particular time window.

  • In this case, you might compute a single risk score $\hat{g}_i$ for each individual (see the sketch after the examples below).

    • For example, $\hat{g}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} g(x_{ij})$.

    • For example, $\hat{g}_i = \text{median}\{g(x_{i1}), \dots, g(x_{i n_i})\}$.

    • For example, $\hat{g}_i = \max\{g(x_{i1}), \dots, g(x_{i n_i})\}$.
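  • As an illustration, these per-visit scores could be aggregated in R as sketched below; long.dat, id, and visit.risk are hypothetical names for a long-format data set with one row per visit.
## Collapse per-visit risk scores g(x_ij) into one score per individual (hypothetical data)
g.mean <- tapply(long.dat$visit.risk, long.dat$id, mean)     # average over visits
g.median <- tapply(long.dat$visit.risk, long.dat$id, median) # median over visits
g.max <- tapply(long.dat$visit.risk, long.dat$id, max)       # highest visit-level risk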


  • To “validate” a risk score, you generally want to look at out-of-sample performance of the risk score.

  • For example, you might build your risk model using data from individuals enrolled within a specific six-month time window and look at the AUC statistic for outcomes for individuals who enrolled in a different time window.

  • Looking at the out-of-sample performance over a different time window or using data from a different source is often a good way of justifying the robustness/generalizability of a particular risk model.
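  • As a simple illustration of out-of-sample validation, the sketch below uses a random 70/30 train/test split of the biopsy data as a stand-in for the separate-enrollment-window split described above; the split proportion and seed are arbitrary choices.
## Fit the risk model on a training subset and evaluate the AUC on held-out data
set.seed(123)
train <- sample(nrow(biopsy), size=round(0.7*nrow(biopsy)))
fit.train <- glm(tumor.type ~ V1 + V3 + V4 + V7 + V8, family="binomial",
                 data=biopsy[train,])
test.score <- predict(fit.train, newdata=biopsy[-train,], type="response")
auc(roc(biopsy$tumor.type[-train], test.score)) # out-of-sample AUC (pROC loaded above)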