varSel {randomSurvivalForest} | R Documentation |
Variable selection for random survival forests using minimal depth (Ishwaran et al. 2010).
varSel(formula = NULL,
       data = NULL,
       object = NULL,
       method = c("vh", "vhVIMP", "md")[3],
       ntree = (if (method == "md") 1000 else 500),
       mvars = (if (!is.null(data) & method != "md") min(1000, round(ncol(data)/5)) else NULL),
       mtry = (if (!is.null(data) & method == "md") max(sqrt(ncol(data)), ncol(data)/3) else NULL),
       nodesize = (if (method == "vh" | method == "vhVIMP") 2 else NULL),
       nsplit = 10,
       predictorWt = NULL,
       big.data = FALSE,
       na.action = c("na.omit", "na.impute")[1],
       do.trace = 0,
       always.use = NULL,
       nrep = 50,
       K = 5,
       nstep = 1,
       verbose = TRUE,
       ...)
formula |
A symbolic description of the model to be fit.
Must be specified unless |
data |
Data frame containing the data used in the formula.
Missing values allowed. Must be specified unless |
object |
An object of class |
method |
Variable selection method: " |
ntree |
Number of trees to grow. |
mvars |
Number of randomly selected variables used in the variable hunting algorithm. |
mtry |
Number of variables randomly sampled at each split. Should be large when the goal is variable selection. |
nodesize |
Minimum number of deaths with unique survival times required for a terminal node. Should be small if number of variables is large. |
nsplit |
Non-negative integer value. If non-zero, the specified tree splitting rule is randomized, which significantly increases speed. |
predictorWt |
Vector of non-negative weights specifying the probability of selecting a variable for splitting. Must be of dimension equal to the number of variables. Default (NULL) invokes a data-adaptive method. |
big.data |
Only set this value to TRUE when the sample size is very large. |
na.action |
Action to be taken if the data contains NA's. |
do.trace |
Should trace output be enabled? Default is
FALSE. A positive integer value causes output to be printed
each |
always.use |
Character vector of variable names to be always included in the model selection procedure and in the final selected model. |
nrep |
Number of Monte Carlo iterations of the variable hunting algorithm. |
K |
Integer value specifying the K-fold size used in the variable hunting algorithm. |
nstep |
Integer value controlling the step size used in the forward selection process of the variable hunting algorithm. Increasing this will encourage more variables to be selected. |
verbose |
Set to TRUE to get verbose output. |
... |
Further arguments passed to or from other methods. |
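The default expressions in the usage line can be hard to read at a glance. The following is a minimal sketch, assuming a data frame has been supplied, of how those defaults resolve for a given number of columns. The helper name varsel_defaults and its argument ncol_data are hypothetical and are not part of the package.

```r
# Hypothetical helper (not part of randomSurvivalForest): mirrors the
# default expressions shown in the usage line, assuming 'data' is supplied.
# 'ncol_data' stands in for ncol(data).
varsel_defaults <- function(ncol_data, method = "md") {
  list(
    ntree    = if (method == "md") 1000 else 500,
    mvars    = if (method != "md") min(1000, round(ncol_data / 5)) else NULL,
    mtry     = if (method == "md") max(sqrt(ncol_data), ncol_data / 3) else NULL,
    nodesize = if (method == "vh" || method == "vhVIMP") 2 else NULL
  )
}

# Minimal depth on 30 columns versus variable hunting on 5000 columns
str(varsel_defaults(30, method = "md"))
str(varsel_defaults(5000, method = "vh"))
```

Under these expressions, mvars and nodesize are only set for the variable hunting methods, and mtry is only set for minimal depth.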
Variable selection using minimal depth. The method option
allows for two different approaches: (1) minimal depth, which uses all
data and all variables simultaneously; and (2) variable hunting,
which uses K-fold Monte Carlo validation, random selection of variables,
and regularized forward selection.
The following is a brief description of the two methods. For complete details, see Ishwaran et al. (2010).
--> Minimal Depth variable selection (method = "md")
The maximal subtree for a variable x is the largest subtree
whose root node splits on x (all parent nodes of x's
maximal subtree have nodes that split on variables other than x).
The minimal depth of a maximal subtree equals the
shortest distance (the depth) from the root node to the parent node
of the maximal subtree (zero is the smallest value possible). The
smaller the minimal depth, the more impact x has on prediction.
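As an illustration of the definition, the sketch below computes minimal depth on a hand-built toy tree. It assumes a simplified reading in which the minimal depth of a variable is the depth of the shallowest node that splits on it. The tree layout, the variable names, and the function minimal_depth are made up for this example and do not reflect the package's internal representation.

```r
# Toy tree (an assumption for illustration): one row per node.
# Terminal nodes have split_var = NA. The root node has depth 0.
tree <- data.frame(
  node      = 1:7,
  depth     = c(0, 1, 1, 2, 2, 2, 2),
  split_var = c("age", "bili", "albumin", "bili", "age", NA, NA),
  stringsAsFactors = FALSE
)

# Depth of the shallowest node splitting on each variable.
minimal_depth <- function(tree) {
  internal <- tree[!is.na(tree$split_var), ]
  sapply(split(internal$depth, internal$split_var), min)
}

md <- minimal_depth(tree)
print(md)
```

Here "age" splits the root and so has the smallest possible minimal depth of zero, while deeper repeat splits on the same variable do not change its value.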
Variables are selected using an adaptive threshold based on minimal depth coupled with minor supervision using VIMP (variable importance).
Set mtry to larger values when the number of variables is high.
--> Variable Hunting (method = "vh" or method = "vhVIMP")
Variable hunting is used for problems where the number of variables is orders of magnitude larger than the sample size and the sample size is reasonably small. Microarray data is a good example.
Using training data from random K-fold subsampling, a forest is fit
to a randomly selected set of variables of size mvars, where
variables are chosen with probability proportional to weights
determined using an initial forest fit on the training data. The
subset of variables is ordered by increasing minimal depth, and
variables are added sequentially (starting from a minimal model) until
joint VIMP no longer increases (signifying the final model). A forest is
refit with these variables and applied to test data to estimate prediction
error and VIMP. The process is repeated nrep times. Final
selected variables are the top P ranked variables, where P is the
average model size and variables are ranked by average minimal
depth.
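The forward selection step and the final ranking step described above can be sketched as follows. This is a rough illustration only: the functions forward_select and select_top, the toy scoring function standing in for joint VIMP, the choice of the first ordered variable as the minimal model, and the rounding of the average model size are all assumptions made for this example, not the package's implementation.

```r
# Add variables in the given order until the joint score stops increasing.
# 'joint_score' is a hypothetical stand-in for joint VIMP.
forward_select <- function(ordered_vars, joint_score) {
  model <- ordered_vars[1]          # assumed minimal model: first variable
  best  <- joint_score(model)
  for (v in ordered_vars[-1]) {
    candidate <- c(model, v)
    score <- joint_score(candidate)
    if (score <= best) break        # no increase: keep the current model
    model <- candidate
    best  <- score
  }
  model
}

# Rank variables by average minimal depth across replicates and keep the
# top P, where P is the (rounded, by assumption) average model size.
select_top <- function(depth_mat, model_sizes) {
  P <- round(mean(model_sizes))
  avg_depth <- colMeans(depth_mat, na.rm = TRUE)
  head(names(sort(avg_depth)), P)
}

# Toy inputs: x1 and x3 carry signal, x2 and x4 are noise.
toy_score <- function(vars) sum(vars %in% c("x1", "x3"))
ordered   <- c("x1", "x3", "x2", "x4")   # already sorted by minimal depth
final_model <- forward_select(ordered, toy_score)

depth_mat <- matrix(c(0, 3, 1, 5,
                      1, 4, 0, 6,
                      0, 2, 2, 5),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, c("x1", "x2", "x3", "x4")))
top_vars <- select_top(depth_mat, model_sizes = c(2, 2, 3))
```

Each row of depth_mat plays the role of one Monte Carlo replicate. In the actual procedure a variable may be absent from a replicate, which is why missing values are skipped when averaging.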
A rough rule for choosing mvars is to set it equal to some
fraction of the number of variables.
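To make the role of mvars concrete, the sketch below draws a candidate set of variables with probability proportional to a weight vector. The weights here are random placeholders; in the procedure described above they would come from an initial forest. The function name sample_candidates is hypothetical, and the one-fifth fraction is taken from the default shown in the usage line.

```r
# Draw 'mvars' distinct variable indices with probability proportional
# to non-negative weights (hypothetical helper, for illustration only).
sample_candidates <- function(weights, mvars) {
  stopifnot(all(weights >= 0), mvars <= sum(weights > 0))
  sample(seq_along(weights), size = mvars, replace = FALSE, prob = weights)
}

set.seed(42)
p <- 200                      # number of variables (toy value)
w <- runif(p)                 # placeholder weights
mvars <- min(1000, round(p / 5))
picked <- sample_candidates(w, mvars)
length(picked)
```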
The same algorithm is used when method = "vhVIMP", but
variables are ordered using VIMP (including the final model). This
is faster, but not as accurate.
If method = "vh", and the number of variables is large,
set nsplit to a fairly large number, such as 10, to ensure
that tree splitting is not overly influenced by noisy variables.
--> Miscellanea
If big.data = TRUE, and variable hunting is used, the training
data is chosen to be of size n/K, where n = sample size (i.e., the
size of the training data is swapped with the test data). This
speeds up the algorithm. Increasing K also helps.
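The effect of big.data on the split sizes can be sketched as below. This assumes that one fold of roughly n/K observations is drawn at random and that the fold size is rounded down; the function split_fold is hypothetical and only illustrates the swap of training and test roles.

```r
# Draw one random fold of size floor(n / K) (rounding is an assumption).
# By default the fold is the test set; with big.data = TRUE the roles swap
# and the fold becomes the training set.
split_fold <- function(n, K, big.data = FALSE) {
  fold <- sample(n, size = floor(n / K))
  rest <- setdiff(seq_len(n), fold)
  if (big.data) list(train = fold, test = rest)
  else          list(train = rest, test = fold)
}

set.seed(1)
regular <- split_fold(100, K = 5)
swapped <- split_fold(100, K = 5, big.data = TRUE)
c(length(regular$train), length(swapped$train))
```

With n = 100 and K = 5, the regular split trains on 80 observations and the swapped split trains on 20, which is where the speed-up comes from.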
For efficiency, transformations used in the formula (such as logs etc.) are ignored. Variables are interpreted as is.
varSel can be used for competing risk data. Variable selection is based on the ensemble cumulative hazard function (CHF).
A list with the following components:
err.rate |
Prediction error for the forest (a vector of
length |
modelSize |
Number of variables selected. |
topvars |
Character vector of names of the final selected variables. |
varselect |
Matrix of values used in determining the set of selected variables. |
rsf.out |
Refitted forest using the final set of selected variables.
NULL if |
Hemant Ishwaran hemant.ishwaran@gmail.com
Udaya B. Kogalur kogalurshear@gmail.com
Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2010). Random survival forests for high-dimensional data.
max.subtree, rsf.
## Not run:
#------------------------------------------------------------
# Minimal depth variable selection: pbc data with noise

data(pbc, package = "randomSurvivalForest")
vs <- varSel(Surv(days, status) ~ ., pbc)

# As dimension increases, mtry should increase
pbc.noise <- cbind(pbc, noise = matrix(rnorm(nrow(pbc) * 1000), nrow(pbc)))
vs.bigp <- varSel(Surv(days, status) ~ ., pbc.noise, mtry = 100)

#------------------------------------------------------------
# Variable hunting: van de Vijver microarray breast cancer
# Note: nrep is small for illustration; typical values are nrep = 100

data(vdv, package = "randomSurvivalForest")
vh <- varSel(Surv(Time, Censoring) ~ ., vdv, method = "vh", nrep = 10, nstep = 5)

# Same analysis, but using predefined weights for selecting a gene
# for node splitting. We illustrate this using univariate cox p-values.

if (library("survival", logical.return = TRUE)
    & library("Hmisc", logical.return = TRUE))
{
  cox.weights <- function(rsf.f, rsf.data) {
    event.names <- all.vars(rsf.f)[1:2]
    p <- ncol(rsf.data) - 2
    event.pt <- match(event.names, names(rsf.data))
    predictor.pt <- setdiff(1:ncol(rsf.data), event.pt)
    sapply(1:p, function(j) {
      cox.out <- coxph(rsf.f, rsf.data[, c(event.pt, predictor.pt[j])])
      pvalue <- summary(cox.out)$coef[5]
      if (is.na(pvalue)) 1.0 else 1/(pvalue + 1e-100)
    })
  }
  data(vdv, package = "randomSurvivalForest")
  rsf.f <- as.formula(Surv(Time, Censoring) ~ .)
  cox.wts <- cox.weights(rsf.f, vdv)
  vh.cox <- varSel(rsf.f, vdv, method = "vh", nstep = 5, predictorWt = cox.wts)
}

## End(Not run)