Chapter 8 Variable Importance Measures
8.1 Partial Dependence Plots
Beyond variable importance, assessing the impact of a particular covariate can be tricky with more “black box” prediction methods.
For these situations, partial dependence plots are a useful tool.
Suppose we have a method that takes covariate vector \(\mathbf{x}\) as input and generates a prediction \(\hat{f}(\mathbf{x})\).
- For example, \(\hat{f}(\mathbf{x})\) could be the output of a random forest or of a boosting method.
For this method, the partial dependence function for variable \(k\) at point \(u\) is defined as \[\begin{equation} \rho_{k}( u ) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}(\mathbf{x}_{i,-k}, u) \end{equation}\]
\(\mathbf{x}_{i,-k}\) - all of the covariates for person \(i\) except for covariate \(k\).
\((\mathbf{x}_{i,-k}, u)\) - setting \(x_{ik} = u\) and using the observed values for all other covariates.
The partial dependence function at \(u\) captures the “effect” of setting \(x_{ik} = u\), averaged over the observed values of all the other covariates.
Plotting \(\rho_{k}(u)\) versus \(u\) captures how changing \(x_{ik}\) influences the estimated prediction.
Note that partial dependence plots are mostly useful for continuous covariates.
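To make the definition concrete, here is a minimal sketch of how \(\rho_{k}(u)\) could be computed for any fitted model with a predict() method; the function name partial_dependence and its arguments are illustrative choices, not part of any particular package.
## A minimal sketch of computing rho_k(u) for a generic fitted model 'fit'
## with a predict() method; 'dat' is the data frame of covariates used to fit
## the model. The function name and arguments are illustrative only.
partial_dependence <- function(fit, dat, varname, grid) {
  sapply(grid, function(u) {
    dat.mod <- dat
    dat.mod[[varname]] <- u                 ## set x_ik = u for every observation i
    mean(predict(fit, newdata = dat.mod))   ## average prediction: (1/n) sum_i fhat(x_{i,-k}, u)
  })
}
## Hypothetical usage, for a fitted model 'fit' and covariate data frame 'dat':
## grid <- seq(min(dat$x1), max(dat$x1), length.out = 25)
## rho  <- partial_dependence(fit, dat, "x1", grid)
## plot(grid, rho, type = "l", xlab = "u", ylab = expression(rho[k](u)))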
8.2 Uncertainty in Variable Importance Measures
Many machine learning/penalized regression methods generate measures of variable importance.
Random forests generate variable importance scores.
Partial dependence plots are useful for assessing the impact of a covariate for any learning method.
The lasso provides variable selection and the magnitudes of its estimated regression coefficients.
These variable importance measures do not usually come with a measure of uncertainty for each variable importance score.
For example, you may want to report a confidence interval for each variable importance (VIMP) score.
8.2.1 Subsampling for Random Forest VIMP scores
A general approach to assessing the uncertainty of a variable importance measure is to use some form of repeated subsampling/sample splitting.
The basic idea is to draw subsamples and, for each subsample, compute variable importance scores. Then, estimate the variance of each variable importance score from its variation across the different subsamples.
- See Ishwaran and Lu (2019) for more details of this approach.
- Steps in the subsampling approach:
1. Draw a subsample of size \(b\) from the original dataset. Call it \(D_{s}\).
2. Using a random forest fit to dataset \(D_{s}\), compute VIMP scores \(I_{s,j}\) for variables \(j=1,\ldots,p\).
3. Repeat steps 1-2 \(S\) times. This will produce \(I_{s,j}\) for all subsamples \(s = 1, \ldots, S\) and all variables \(j = 1, \ldots, p\).
4. Estimate the variance of the VIMP score for variable \(j\) with the quantity \[\begin{equation} \hat{v}_{j} = \frac{b}{nS} \sum_{s=1}^{S}\Big( I_{s,j} - \bar{I}_{\cdot,j} \Big)^{2}, \end{equation}\] where \(\bar{I}_{\cdot,j} = \frac{1}{S}\sum_{s=1}^{S} I_{s,j}\) is the average of the subsample VIMP scores for variable \(j\) and \(n\) is the number of observations in the full dataset.
- A \(95\%\) confidence interval for the variable importance of variable \(j\) will then be
\[\begin{equation}
I_{j} \pm 1.96 \times \sqrt{\hat{v}_{j}}
\end{equation}\]
- Here, \(I_{j}\) is the variable importance score from the full dataset.
To test this out, we will use the diabetes data.
- This can be obtained from https://hastie.su.domains/CASI/data.html
This dataset has 442 observations and 10 covariates
The outcome variable of interest is prog
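The chunk that produced the output below is not shown; here is a minimal sketch, assuming the data have been downloaded from the page above and saved locally as diabetes.csv (the file name and the read.csv call are assumptions).
## Assumes the diabetes data were downloaded from the CASI data page and saved
## locally as "diabetes.csv"; adjust the file name/path as needed.
diabetes <- read.csv("diabetes.csv")
dim(diabetes)     ## should be 442 x 11
head(diabetes)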
## [1] 442 11
## age sex bmi map tc ldl hdl tch ltg glu prog
## 1 59 1 32.1 101 157 93.2 38 4 2.110590 87 151
## 2 48 0 21.6 87 183 103.2 70 3 1.690196 69 75
## 3 72 1 30.5 93 156 93.6 41 4 2.029384 85 141
## 4 24 0 25.3 84 198 131.4 40 5 2.123852 89 206
## 5 50 0 23.0 101 192 125.4 52 4 1.863323 80 135
## 6 23 0 22.6 89 139 64.8 61 2 1.819544 68 97
- Let’s first fit a randomForest to the entire dataset and plot the variable importance measures
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
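The messages above come from loading the randomForest package; the chunk that fit the forest is not shown. Below is a minimal sketch consistent with the output that follows (the object name rf.full is my own choice; Imp.Full is the name used in the later subsampling code).
library(randomForest)
## Fit a random forest to the full diabetes data and store its importance scores
rf.full <- randomForest(prog ~ ., data = diabetes)
Imp.Full <- importance(rf.full)   ## IncNodePurity scores for a regression forest
varImpPlot(rf.full)               ## plot of the variable importance measures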
- You can extract the actual values of the variable importance scores by using the importance function.
## IncNodePurity
## age 142384.26
## sex 32265.63
## bmi 595349.82
## map 295325.63
## tc 146983.44
## ldl 149014.08
## hdl 201732.42
## tch 164443.58
## ltg 545383.65
## glu 230024.81
- Now, let’s compute variable importance scores across \(S\) subsamples, each of size \(b\), and store them in a \(10 \times S\) matrix called Imp.Subs (the code below uses \(S = 5\) subsamples of size \(b = 100\)):
S <- 5
b <- 100
Imp.Subs <- matrix(0, nrow=nrow(Imp.Full), ncol=S)
rownames(Imp.Subs) <- rownames(Imp.Full)
for(k in 1:S) {
  ## sample without replacement
  subs <- sample(1:nrow(diabetes), size=b)
  diabetes.sub <- diabetes[subs,]
  rf.sub <- randomForest(prog ~ ., data=diabetes.sub)
  Imp.Subs[,k] <- importance(rf.sub)
}
- From Imp.Subs, we can compute the variance estimates \(\hat{v}_{j}\).
imp.mean <- rowMeans(Imp.Subs)
vhat <- (b/nrow(diabetes))*rowMeans((Imp.Subs - imp.mean)^2)
print(vhat)
## age sex bmi map tc ldl
## 15586019.5 539851.9 250660310.7 137517689.8 1305928.4 377357.7
## hdl tch ltg glu
## 41431341.1 10972266.1 69701261.0 45808711.6
- We can now report confidence intervals for the variable importance scores:
vi.upper <- Imp.Full[,1] + 1.96*sqrt(vhat)
vi.lower <- Imp.Full[,1] - 1.96*sqrt(vhat)
VIMP_CI <- cbind(Imp.Full[,1], vi.lower, vi.upper)
colnames(VIMP_CI) <- c("estimate", "lower", "upper")
VIMP_CI[order(-VIMP_CI[,1]),]
## estimate lower upper
## bmi 595349.82 564318.60 626381.04
## ltg 545383.65 529020.14 561747.15
## map 295325.63 272341.11 318310.14
## glu 230024.81 216759.12 243290.51
## hdl 201732.42 189116.45 214348.39
## tch 164443.58 157951.20 170935.97
## ldl 149014.08 147810.06 150218.10
## tc 146983.44 144743.61 149223.28
## age 142384.26 134646.35 150122.17
## sex 32265.63 30825.52 33705.73
8.2.2 Stability Selection for Penalized Regression
- Let’s use the lasso with penalty \(\lambda = 10\) on the diabetes data:
## Loading required package: Matrix
## Loaded glmnet 4.1-8
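The messages above come from loading glmnet; the fitting chunk itself is not shown. Below is a minimal sketch, assuming the fitted lasso object is named diabet.mod (the name used in the code that follows); converting the covariates with as.matrix() is a defensive choice, since glmnet documents its x argument as a matrix.
library(glmnet)
## Lasso fit at the single penalty value lambda = 10
diabet.mod <- glmnet(x = as.matrix(diabetes[,1:10]), y = diabetes$prog, lambda = 10)
coef(diabet.mod)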
- We can look at the estimated coefficients to see which variables were selected
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -159.4129428
## age .
## sex .
## bmi 1.2690867
## map 0.7852953
## tc .
## ldl .
## hdl .
## tch .
## ltg 100.9574403
## glu .
- The selected variables are those with nonzero coefficients:
# Look at selected coefficients ignoring the intercept:
selected <- abs(coef(diabet.mod)[-1]) > 0
selected
## [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
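If you want the names of the selected variables rather than a logical vector, you can index into the covariate names (a small illustrative addition); with \(\lambda = 10\) this picks out bmi, map, and ltg.
## Names of the variables with nonzero lasso coefficients
names(diabetes)[1:10][selected]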
When thinking about how certain you might be about a given set of selected variables, one natural question is how the set of selected variables would change if you re-ran the lasso on a different subset of the data.
You might have more confidence in a variable that is consistently selected across random subsets of your data.
For a given choice of \(\lambda\), the stability of variable \(j\) is defined as \[\begin{equation} \hat{\pi}_{j}(\lambda) = \frac{1}{S}\sum_{s=1}^{S} I( A_{j,s}(\lambda) = 1), \end{equation}\] where …
\(A_{j,s}(\lambda) = 1\) if variable \(j\) in data subsample \(s\) is selected and
\(A_{j,s}(\lambda) = 0\) if variable \(j\) in data subsample \(s\) is not selected
Meinshausen and Bühlmann (2010) recommend drawing subsamples of size \(n/2\).
The quantity \(\hat{\pi}_{j}(\lambda)\) can be thought of as an estimate of the probability that variable \(j\) is in the “selected set” of variables.
Variables with a large value of \(\hat{\pi}_{j}(\lambda)\) have a greater “selection stability”.
You can plot \(\hat{\pi}_{j}(\lambda)\) across different values of \(\lambda\) to get a sense of the range of selection stability.
- For the diabetes data, let’s first compute an \(S \times m \times 10\) array (here \(S = 200\) subsamples and \(m = 40\) values of \(\lambda\)), where the \((k, h, j)\) element of this array equals \(1\) if variable \(j\) was selected in subsample \(k\) with penalty term \(\lambda_{h}\):
nsamps <- 200
b <- floor(nrow(diabetes)/2)
nlambda <- 40 ## 40 different lambda values
lambda.seq <- seq(0.1, 20.1, length.out=nlambda)
## Create an nsamps x nlambda x 10 array
SelectionArr <- array(0, dim=c(nsamps, nlambda, 10))
for(k in 1:nsamps) {
  subs <- sample(1:nrow(diabetes), size=b)
  diabetes.sub <- diabetes[subs,]
  for(h in 1:nlambda) {
    sub.fit <- glmnet(x=diabetes.sub[,1:10], y=diabetes.sub$prog,
                      lambda=lambda.seq[h])
    selected <- abs(coef(sub.fit)[-1]) > 0
    SelectionArr[k,h,] <- selected
  }
}
- From this array, we can compute a matrix containing the selection probability estimates \(\hat{\pi}_{j}(\lambda)\).
- The \((j, h)\) component of this matrix has the value \(\hat{\pi}_{j}(\lambda_{h})\).
SelectionProb <- matrix(0, nrow=10, ncol=nlambda)
rownames(SelectionProb) <- names(diabetes)[1:10]
for(h in 1:nlambda) {
  SelectionProb[,h] <- colMeans(SelectionArr[,h,])
}
- The first few columns of SelectionProb look like the following:
## [,1] [,2] [,3] [,4] [,5]
## age 0.930 0.750 0.580 0.475 0.315
## sex 1.000 1.000 1.000 1.000 1.000
## bmi 1.000 1.000 1.000 1.000 1.000
## map 1.000 1.000 1.000 1.000 1.000
## tc 0.980 0.920 0.840 0.705 0.565
## ldl 0.890 0.220 0.225 0.255 0.270
## hdl 0.930 0.915 0.980 0.995 1.000
## tch 0.945 0.625 0.405 0.285 0.175
## ltg 1.000 1.000 1.000 1.000 1.000
## glu 0.990 0.890 0.805 0.730 0.660
- We can now plot the stability measures as a function of \(\lambda\). First, convert SelectionProb to long form; a sketch of the plotting code is given after the output below.
## Convert to long form:
df <- data.frame(varname=rep(names(diabetes)[1:10], each=nlambda),
                 selection.prob=c(t(SelectionProb)), lambda=rep(lambda.seq, 10))
head(df)
## varname selection.prob lambda
## 1 age 0.930 0.1000000
## 2 age 0.750 0.6128205
## 3 age 0.580 1.1256410
## 4 age 0.475 1.6384615
## 5 age 0.315 2.1512821
## 6 age 0.250 2.6641026
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
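The messages above come from loading ggplot2; the plotting chunk itself is not shown. Here is a minimal sketch of the stability plot, using the long-form data frame df (the axis labels are my own choices).
library(ggplot2)
## Selection probability pi_j(lambda) versus lambda, one line per variable
ggplot(df, aes(x = lambda, y = selection.prob, color = varname)) +
  geom_line() +
  labs(x = expression(lambda), y = "selection probability", color = "variable")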