rsf {randomSurvivalForest}                                    R Documentation

Random Survival Forests

Description

Prediction and variable selection for right-censored survival and competing risk data using Random Survival Forests (RSF) (Ishwaran, Kogalur, Blackstone and Lauer, 2008). A random forest (Breiman, 2001) of survival trees is used for ensemble estimation of the cumulative hazard function (CHF) in right-censored settings and of the conditional cumulative hazard function (CCHF) in the case of competing risks. Different survival tree splitting rules can be used to grow trees. An “out-of-bag” estimate of Harrell's concordance index (Harrell et al., 1982) is provided for assessing prediction accuracy. Variable importance (VIMP), for single as well as grouped variables, can be used to filter variables and to assess variable predictiveness. Minimal depth variable selection is also available. Missing data (x-variables, survival times, censoring indicators) can be imputed on both training and test data.

Usage

rsf(formula, data = NULL, ntree = 1000, mtry = NULL,
    nodesize = NULL, splitrule = NULL, nsplit = 0,
    importance = c("randomsplit", "permute", "none")[1], 
    big.data = FALSE, na.action = c("na.omit", "na.impute")[1],
    nimpute = 1, predictorWt = NULL, forest = TRUE, 
    proximity = FALSE, varUsed = NULL, split.depth = FALSE, 
    seed = NULL, do.trace = FALSE, ...)

Arguments

formula

A symbolic description of the model to be fit.

data

Data frame containing the data used in the formula. Missing values allowed. See na.action for details.

ntree

Number of trees to grow. This should not be set too small; larger values help ensure that every input row gets predicted out-of-bag at least a few times.

mtry

Number of variables randomly sampled at each split. The default is sqrt(p), where p equals the number of variables.

nodesize

Minimum number of deaths with unique survival times required for a terminal node. Default is approximately 3 for right-censoring and 6 for competing risk data. Larger values create smaller trees.

splitrule

Splitting rule used to grow trees. See details below.

nsplit

Non-negative integer value. If non-zero, the specified tree splitting rule is randomized, which can significantly increase speed. See details below.

importance

Method used to compute variable importance. See details below.

big.data

Set this value to TRUE when the number of variables p is very large, or the sample size is very large. See details below.

na.action

Action to be taken if the data contains NA's. Possible values are "na.omit" and "na.impute". Default is "na.omit", which removes the entire record if even one of its entries is NA (for x-variables this applies only to those specifically listed in formula). The action "na.impute" implements a sophisticated tree imputation technique. See details below.

nimpute

Number of iterations of the missing data algorithm.

predictorWt

Vector of non-negative weights where entry k, after normalizing, is the probability of selecting variable k as a candidate for splitting. Default is to use uniform weights. Vector must be of dimension p, where p equals the number of variables.

forest

Should the forest object be returned? Used for prediction on new data and required for many of the wrappers to work.

proximity

Should the proximity between observations be calculated? Creates an n x n matrix (which can be huge). Default is FALSE.

varUsed

Analyzes which variables are used (split upon) in the forest. Default is NULL. Possible values are "all.trees" and "by.tree". See details below.

split.depth

Return minimal depth for each variable for each case. Default is FALSE. Used for variable selection: see details below.

seed

Seed for random number generator. Must be a negative integer (the R wrapper handles incorrectly set seed values).

do.trace

Should trace output be enabled? Default is FALSE. Integer values can also be passed: a positive value causes output to be printed every do.trace iterations.

...

Further arguments passed to or from other methods.

Details

—> Splitting Rules:

Four primary splitting rules are available for growing a survival forest for right-censored data: "logrank", "conserve", "logrankscore", and "random".

The default rule, "logrank", splits tree nodes by maximization of the log-rank test statistic (Segal, 1988; LeBlanc and Crowley, 1993). The "conserve" rule splits nodes by finding daughters closest to the conservation of events principle (see Naftel, Blackstone and Turner, 1985). The "logrankscore" rule uses a standardized log-rank statistic (Hothorn and Lausen, 2003). The "random" rule implements pure random splitting. For each node, a variable is randomly selected from a random set of mtry variables and the node is split using a random split point (Cutler and Zhao, 2001; Lin and Jeon, 2006). Note, however, that because random splitting promotes splits near the edges, node splitting can terminate early, resulting in extremely unbalanced trees. To correct this, the definition of nodesize is taken in this setting (and this setting alone) to equal the minimum number of unique deaths within a node required to split the node.

A random version of the "logrank", "conserve" and "logrankscore" splitting rules can be invoked using nsplit. If nsplit is set to a positive integer, then a maximum of nsplit split points are chosen randomly for each of the mtry variables within a node (in contrast to deterministic splitting, i.e. nsplit=0, where all possible split points for each of the mtry variables are considered). The splitting rule is applied to these random split points and the node is split on the variable and random split point maximizing survival difference (as measured by the splitting rule).

A detailed study carried out by Ishwaran et al. (2008) found "logrank" and "logrankscore" to be the most accurate in terms of prediction error, followed by "conserve". Randomized splitting with nsplit=1 under the "logrank" rule gave performance close to deterministic "logrank" splitting, but with significantly shorter computation times. Accuracy can be improved further without overly compromising speed by using larger values of nsplit.

Trees tend to favor splits on continuous variables (Loh and Shih, 1997), so it is good practice to use the nsplit option when the data contains a mix of continuous and discrete variables. Using a reasonably small value of nsplit mitigates this bias.
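
As a rough sketch (veteran data from the package assumed; timings will vary by machine), deterministic and randomized log-rank splitting can be compared directly:

# sketch: deterministic vs. randomized log-rank splitting
data(veteran, package = "randomSurvivalForest")
system.time(rsf(Surv(time, status) ~ ., veteran,
                ntree = 500, splitrule = "logrank", nsplit = 0))
system.time(rsf(Surv(time, status) ~ ., veteran,
                ntree = 500, splitrule = "logrank", nsplit = 3))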

—> Large Data Sets:

Computation times for very large data sets can be improved by discretizing continuous variables and/or the observed survival times, and by using random splitting. Discretization does not have to be overly granular for substantial gains to be seen. Users may also consider setting big.data=TRUE for data with a large number of variables. This bypasses the large overhead R incurs in creating design matrices and parsing the formula. Be aware, however, that under this option variables are not processed and are interpreted as is: think of the data frame as containing the time and censoring information, with the rest of the data being the pre-processed design matrix. In particular, transformations used in the formula (such as logs etc.) are ignored.
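
As a minimal sketch (veteran data assumed; the bin width is purely illustrative), a continuous x-variable can be coarsened before growing; observed survival times can be coarsened in the same way:

# sketch: discretize a continuous x-variable to reduce candidate split points
data(veteran, package = "randomSurvivalForest")
veteran$age <- round(veteran$age / 5) * 5   # 5-year bins (illustrative)
vet.out <- rsf(Surv(time, status) ~ ., veteran,
               ntree = 500, nsplit = 1, big.data = TRUE)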

—> Formula:

A typical RSF formula has the form Surv(time, censoring) ~ terms, where "time" is survival time and "censoring" is a binary censoring indicator. Censoring must be coded as a non-negative integer with 0 reserved for censoring and (usually) 1=death (event). Also, "time" must be strictly positive.
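
For instance (veteran data from the package assumed):

# sketch: the left side names the survival time and censoring status
data(veteran, package = "randomSurvivalForest")
rsf(Surv(time, status) ~ age + karno + celltype, data = veteran, ntree = 100)
# a '.' on the right side uses all remaining columns as x-variables
rsf(Surv(time, status) ~ ., data = veteran, ntree = 100)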

—> Factors and Variable Types:

Variables encoded as factors are treated as such. If the factor is ordered, splits behave as for real-valued variables. If the factor is unordered, a split moves a subset of the levels in the parent node to the left daughter and the complementary subset to the right daughter. All possible complementary pairs are considered; this applies to factors with any number of levels. However, an optimization check ensures that the number of splits attempted is not greater than the number of cases in a node (this internal check overrides the nsplit value in random splitting mode if nsplit is large enough). Note that when predicting on test data involving factors, the factor labels in the test data must match those in the grow (training) data. Consider setting labels that occur only in the test data to missing to avoid issues (a sketch is given below).

Other than factors, all other x-variables are coerced and treated as being real valued.
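
A hedged sketch of aligning test-data factor labels with the grow data (the 100/37 split is purely illustrative, and predict.rsf is assumed to accept the test data as its second argument):

# sketch: set test-only factor labels to missing before prediction
data(veteran, package = "randomSurvivalForest")
veteran$celltype <- factor(veteran$celltype)
train <- veteran[1:100, ]
test  <- veteran[101:137, ]
# labels absent from the grow data are set to NA, as suggested above
test$celltype[!(test$celltype %in% unique(train$celltype))] <- NA
v.grow <- rsf(Surv(time, status) ~ ., train, ntree = 100)
v.pred <- predict.rsf(v.grow, test)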

—> Variable Importance:

Variable importance (VIMP) is computed similarly to Breiman (2001), although there are two ways to perturb a variable to determine its VIMP: "randomsplit" and "permute". The default method is "randomsplit", which works as follows. Out-of-bag (OOB) cases are dropped down the in-bag (bootstrap) survival tree. A case is assigned a daughter node randomly whenever a split on x is encountered. An OOB ensemble cumulative hazard function (CHF) is computed from the forest of such trees and its OOB error rate calculated. The VIMP for x is the difference between this error rate and the OOB error rate for the original forest (without random node assignment for x). If "permute" is used, then x is randomly permuted in the OOB data and dropped down the in-bag tree. See Ishwaran et al. (2008) for further details.
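
A minimal sketch (veteran data assumed) of switching the perturbation method and inspecting VIMP:

# sketch: permutation-based VIMP instead of the default random assignment
data(veteran, package = "randomSurvivalForest")
v.out <- rsf(Surv(time, status) ~ ., veteran,
             ntree = 500, importance = "permute")
sort(v.out$importance, decreasing = TRUE)   # larger values = more predictive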

—> Prediction Error:

Prediction error is measured by 1-C, where C is Harrell's concordance index. Prediction error is between 0 and 1, and measures how well the ensemble correctly ranks (classifies) two random individuals in terms of survival. A value of 0.5 is no better than random guessing. A value of 0 is perfect. Because VIMP is based on the concordance index, VIMP indicates how much misclassification increases, or decreases, for a new test case if a given variable were not available for that case (given that the forest was grown using that variable).
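
For example, the OOB error rate can be read off the grow object (err.rate is described under Value below):

# sketch: OOB prediction error (1 - Harrell's C) as trees are added
data(veteran, package = "randomSurvivalForest")
v.out <- rsf(Surv(time, status) ~ ., veteran, ntree = 500)
v.out$err.rate[v.out$ntree]   # error rate of the full ensemble
plot(v.out$err.rate, type = "l",
     xlab = "number of trees", ylab = "OOB error rate")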

—> Competing Risks:

The implementation is similar to right-censoring but with the following caveats:

(1) Censoring must be coded as a non-negative integer where 0 indicates right-censoring and non-zero values indicate different event types. While 0,1,2,...,J is standard, events can be coded non-sequentially, although 0 must always be used for censoring.

(2) The default splitting rule is "logrankCR", a modified log-rank splitting rule tailored for competing risks. Overriding this by manually selecting any splitting rule other than "logrankCR" or "random" will result in a right-censored analysis in which all (non-censored) events are treated as if they were one event type (indeed, they will be coerced as such). Note that nsplit works as in right-censoring.

(3) The ensemble (see below) is a 3-D array in which the 3rd dimension is reserved for the ensemble CHF and each of the J ensemble conditional CHFs (CCHFs); a short sketch follows this list. The wrapper competing.risk can be used to process the ensemble and to generate event-specific cumulative incidence functions (CIF) and subsurvival functions (see Gray (1988) for background and definitions).

(4) The cases within a terminal node are used to estimate both the unconditional survival function and the event-specific subsurvival functions and for this reason nodesize should generally be set larger than in right-censored data settings.
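
A minimal sketch (follic data from the package assumed, as in Example 3 below) of inspecting the competing-risk ensemble:

# sketch: for competing risks the ensemble is a 3-D array
data(follic, package = "randomSurvivalForest")
fol.out <- rsf(Surv(time, status) ~ ., follic, ntree = 100, nsplit = 3)
dim(fol.out$ensemble)    # n x length(timeInterest) x (1 + J)
competing.risk(fol.out)  # CIFs and subsurvival functions via the wrapper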

—> Missing Data and Imputation:

Setting na.action="na.impute" implements a tree imputation method whereby missing data (x-variables or outcomes) are imputed dynamically as a tree is grown by randomly sampling from the distribution within the current node (Ishwaran et al. 2008). OOB data are not used in imputation, to avoid biasing prediction error and VIMP estimates. Final imputation for integer-valued variables and censoring indicators uses a maximal class rule, whereas continuous variables and survival time use a mean rule. Records in which all outcome and x-variable information is missing are removed. Variables having all missing values are removed. The algorithm can be iterated by setting nimpute to an integer greater than 1. A few iterations should be used in heavy missing-data settings to improve the accuracy of imputed values (see Ishwaran et al., 2008). Note that if the algorithm is iterated, a side effect is that missing values in the returned objects predictors, time and cens are replaced by imputed values, while imputed objects such as imputedData are set to NULL. See the examples below. Also see the wrapper impute.rsf for a fast impute interface.
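
A short sketch of iterated imputation (pbc data assumed; Example 5 below gives a fuller treatment):

# sketch: iterated tree imputation; note the side effects described above
data(pbc, package = "randomSurvivalForest")
p.out <- rsf(Surv(days, status) ~ ., pbc, nsplit = 3,
             na.action = "na.impute", nimpute = 3)
p.out$imputedData        # NULL when the algorithm is iterated
head(p.out$predictors)   # missing x-values now replaced by imputed values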

—> Miscellanea:

Setting varUsed="all.trees" returns a vector where each element is a count of the number of times a split occurred on a variable. If varUsed="by.tree", a matrix of size ntree x p is returned. Each element [i][j] is the count of the number of times a split occurred on variable [j] in tree [i].

Setting split.depth=TRUE returns a matrix of size n x p where entry [i][j] is the mean minimal depth for variable [j] for case [i]. Used to select variables at the case-level. See max.subtree for more details regarding minimal depth.
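
For instance (veteran data assumed), both diagnostics can be requested in one grow call:

# sketch: per-variable split counts and case-level minimal depth
data(veteran, package = "randomSurvivalForest")
v.out <- rsf(Surv(time, status) ~ ., veteran, ntree = 200,
             varUsed = "all.trees", split.depth = TRUE)
v.out$varUsed                      # split counts, one per variable
apply(v.out$splitDepth, 2, mean)   # mean minimal depth by variable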

Value

An object of class (rsf, grow) with the following components:

call

The original call to rsf.

formula

The formula used in the call.

n

Sample size of the data (depends upon NA's, see na.action).

ndead

Number of deaths.

ntree

Number of trees grown.

mtry

Number of variables randomly selected for splitting at each node.

nodesize

Minimum size of terminal nodes.

splitrule

Splitting rule used.

nsplit

Number of randomly selected split points.

time

Vector of length n of survival times.

cens

Vector of length n of censoring information (0=censored).

timeInterest

Sorted unique event times. Ensemble values are given for these time points only.

predictorNames

A character vector of the variable names used in growing the forest.

predictorWt

Vector of non-negative weights used for randomly sampling variables for splitting.

predictors

Data frame comprising x-variables used to grow the forest.

ensemble

Matrix for the in-bag ensemble CHF with each row corresponding to an individual's CHF evaluated at each of the time points in timeInterest. For competing risks, a 3-D array where the 3rd dimension is for the ensemble CHF and each of the CCHFs, respectively.

oob.ensemble

Same as ensemble, but based on OOB data.

poe

Matrix for the in-bag ensemble probability of an event (poe) for each individual: used to estimate the CIF. Rows correspond to each of the event types. Applies only to competing risk data. NULL otherwise.

oob.poe

Same as poe, but based on OOB data.

mortality

Vector of length n recording the in-bag ensemble mortality for each individual in the data. Ensemble mortality values should be interpreted in terms of total number of deaths.

oob.mortality

Same as mortality, but based on oob.ensemble.

err.rate

Vector of length ntree containing OOB error rates for the ensemble, with the bth element being the error rate for the ensemble formed using the first b trees. Error rates are measured using 1-C, where C is Harrell's concordance index. For competing risks, a matrix with rows corresponding to the ensemble CHF and each of the CCHFs, respectively.

leaf.count

Number of terminal nodes for each tree in the forest. Vector of length ntree. A value of zero indicates a rejected tree (sometimes occurs when imputing missing data). Values of one indicate tree stumps.

importance

Vector recording VIMP for each variable. For competing risks, a matrix with rows corresponding to the ensemble CHF and each of the CCHFs, respectively.

forest

If forest=TRUE, the forest object is returned. This object can then be used for prediction with new test data sets and is required for other R-wrappers.

proximity

If proximity=TRUE, a matrix of dimension n x n recording the frequency with which pairs of data points occur within the same terminal node. The value returned is a vector of the lower diagonal of the matrix. Use plot.proximity to extract this information.

varUsed

Count of the number of times a variable is used in growing the forest. Can be a vector, matrix, or NULL.

imputedIndv

Vector of indices for cases with missing values. Can be NULL.

imputedData

Data frame comprising imputed data. First two columns are censoring and survival time, respectively. Remaining columns are the x-variables. Row i contains imputed outcomes and x-variables for row j of predictors, where j=imputedIndv[i]. See the examples below. Can be NULL.

splitDepth

Matrix of size n x p where entry [i][j] is the mean minimal depth for variable [j] for case [i]. Can be NULL.

Note

The key deliverable is the matrix ensemble (and its OOB counterpart, oob.ensemble) containing the ensemble CHF for each individual evaluated at a set of distinct time points. The vector mortality (likewise oob.mortality) is a weighted sum over the columns of ensemble, weighted by the number of individuals at risk at the different time points. Entry [i] of the vector represents the estimated total mortality of individual i, in terms of total number of deaths. In other words, if individual i has a mortality value of 100, then if all individuals had the same x-values as i, there would be on average 100 deaths in the dataset.
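
As a quick sanity check (veteran data assumed), ensemble mortality can be compared against the observed number of deaths:

# sketch: ensemble mortality is on the scale of number of deaths
data(veteran, package = "randomSurvivalForest")
v.out <- rsf(Surv(time, status) ~ ., veteran, ntree = 500)
summary(v.out$oob.mortality)   # per-case estimated total mortality
v.out$ndead                    # observed number of deaths, for reference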

Different R-wrappers are provided to aid in parsing the ensemble.

Author(s)

Hemant Ishwaran hemant.ishwaran@gmail.com

Udaya B. Kogalur kogalurshear@gmail.com

References

Breiman L. (2001). Random forests, Machine Learning, 45:5-32.

Cutler A. and Zhao G. (2001). PERT - perfect random tree ensembles. Comp. Sci. Statist., 33:490-497.

Gray R.J. (1988). A class of k-sample tests for comparing the cumulative incidence of a competing risk, Ann. Statist., 16:1141-1154.

Harrell F.E. et al. (1982). Evaluating the yield of medical tests, J. Amer. Med. Assoc., 247:2543-2546.

Hothorn T. and Lausen B. (2003). On the exact distribution of maximally selected rank statistics, Comp. Statist. Data Anal., 43:121-137.

Ishwaran H. (2007). Variable importance in binary regression trees and forests, Electronic J. Statist., 1:519-537.

Ishwaran H., Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.

Ishwaran H., Kogalur U.B., Gorodeski E.Z, Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.

Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2010). Random survival forests for high-dimensional data.

Ishwaran H., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2010). Random survival forests for competing risks.

LeBlanc M. and Crowley J. (1993). Survival trees by goodness of split, J. Amer. Statist. Assoc., 88:457-467.

Liaw A. and Wiener M. (2002). Classification and regression by randomForest, R News, 2:18-22.

Lin Y. and Jeon Y. (2006). Random forests and adaptive nearest neighbors, J. Amer. Statist. Assoc., 101:578-590.

Loh W.-Y. and Shih Y.-S. (1997). Split selection methods for classification trees, Statist. Sinica, 7:815-840.

Naftel D.C., Blackstone E.H. and Turner M.E. (1985). Conservation of events, unpublished notes.

Segal M.R. (1988). Regression trees for censored data, Biometrics, 44:35-47.

See Also

competing.risk, find.interaction, impute.rsf, max.subtree, plot.ensemble, plot.variable, plot.error, plot.proximity, pmml2rsf, predict.rsf, print.rsf, rsf2rfz, rsf2pmml, varSel, vimp.

Examples

#------------------------------------------------------------
# Example 1:  Veteran's Administration lung cancer data
# Randomized trial of two treatment regimens for lung cancer
# See Kalbfleisch & Prentice

data(veteran, package = "randomSurvivalForest")
veteran.out <- rsf(Surv(time, status) ~ ., data = veteran)
print(veteran.out)
plot(veteran.out)

#------------------------------------------------------------
# Example 2:  More detailed call (veteran data)

#read the data, set various options
data(veteran, package = "randomSurvivalForest")
veteran.f <- as.formula(Surv(time, status) ~ .)
ntree <- 200
nsplit <- 3
varUsed <- "by.tree"

# coerce 'celltype' to a factor and 'karno' (Karnofsky score)
# to an ordered factor to illustrate factor usage
veteran$celltype <- factor(veteran$celltype,
    labels = c("squamous", "smallcell", "adeno", "large"))
veteran$karno <- factor(veteran$karno, ordered = TRUE)

# grow call
veteran2.out <- rsf(veteran.f, veteran, 
       ntree = ntree, nsplit = nsplit, varUsed = varUsed) 

# plot of ensemble survival for a single individual
surv.ensb <- t(exp(-veteran2.out$oob.ensemble))
plot(veteran2.out$timeInterest, surv.ensb[, 1], type = "l",
     xlab = "time", ylab = "survival")

# take a peek at the forest
head(veteran2.out$forest$nativeArray)

# average number of times a variable was split 
apply(veteran2.out$varUsed, 2, mean)

# partial plot of top variable
plot.variable(veteran2.out, partial = TRUE, npred = 1)


## Not run: 
#------------------------------------------------------------
# Example 3:  Competing risks

# Follicular Cell Lymphoma
data(follic, package = "randomSurvivalForest")
follic.out <- rsf(Surv(time, status) ~ ., follic, nsplit = 3, ntree = 400)
print(follic.out)
plot(follic.out, sorted = FALSE)
competing.risk(follic.out)

# Hodgkin's disease
data(hd, package = "randomSurvivalForest")
hd.out <- rsf(Surv(time, status) ~ ., hd, nsplit = 3, ntree = 400)
print(hd.out)
plot(hd.out, sorted = FALSE)
competing.risk(hd.out)

#------------------------------------------------------------
# Example 4:  Primary biliary cirrhosis (PBC) of the liver
# See Appendix D.1 of Fleming and Harrington

data(pbc, package = "randomSurvivalForest") 
pbc.out <- rsf(Surv(days, status) ~ ., pbc, nsplit = 3)
print(pbc.out)

#------------------------------------------------------------
# Example 5:  Same as Example 4, but with data imputation
# Also see the R-wrapper "impute.rsf"

# rsf call with imputation
data(pbc, package = "randomSurvivalForest") 
pbc2.out <- rsf(Surv(days, status)~., pbc, 
                nsplit = 3, na.action="na.impute")
print(pbc2.out)

# here's a nice wrapper to combine original data + imputed data
combine.impute <- function(object) {
  imputed.data <- cbind(cens = object$cens,
                        time = object$time,
                        object$predictors)
  if (!is.null(object$imputedIndv)) {
    imputed.data[object$imputedIndv, ] <- object$imputedData
  }
  colnames(imputed.data)[c(2,1)] <- all.vars(object$formula)[1:2]
  imputed.data
}

# combine original data + imputed data
pbc.imputed.data <- combine.impute(pbc2.out)

# iterate the missing data algorithm
# compare to non-iterated algorithm
pbc3.out <- rsf(Surv(days, status) ~ ., pbc, nsplit = 5,
                na.action = "na.impute", nimpute = 3)
pbc.iterate.imputed.data <- combine.impute(pbc3.out)
tail(pbc.imputed.data)
tail(pbc.iterate.imputed.data)


#------------------------------------------------------------
# Example 6:  German breast cancer data
# Variable selection using minimal depth

data(breast, package = "randomSurvivalForest")
breast.out <- rsf(Surv(time, cens) ~ . , breast, nsplit = 3)

# use varSel to select variables
# see the help file of varSel for details/examples

breast.vs <- varSel(object=breast.out)

#------------------------------------------------------------
# Example 7:  Compare Cox regression to RSF using PBC data
# OOB estimate of C-index for Cox based on 100 bootstraps
# Assumes "Hmisc" and "survival" libraries are loaded

if (library("survival", logical.return = TRUE) 
    & library("Hmisc", logical.return = TRUE))
{
  data(pbc, package = "randomSurvivalForest")
  rsf.f <- as.formula(Surv(days, status) ~ .)
  pbc3.out <- rsf(rsf.f, pbc, nsplit = 10, mtry = 2)
  B <- 100 
  cox.err <- rep(NA, B) 
  pbc.data <- pbc[apply(is.na(pbc), 1, sum) == 0,] ##remove NA's 
  cat("Out-of-bag Cox Analysis ...", "\n")
  for (b in 1:B) {
    cat("Cox bootstrap:", b, "\n") 
    bag.sample <- sample(1:nrow(pbc.data),
                         nrow(pbc.data),
                         replace = TRUE) 
    oob.sample <- setdiff(1:nrow(pbc.data), bag.sample)
    train <- pbc.data[bag.sample,]
    test <- pbc.data[oob.sample,]
    cox.out <- tryCatch({coxph(rsf.f, train)}, error=function(ex){NULL})
    if (is.list(cox.out)) {
      cox.predict <- predict(cox.out, test)
      cox.err[b] <- rcorr.cens(cox.predict, 
              Surv(pbc.data$days[oob.sample],
              pbc.data$status[oob.sample]))[1]
    }
  }
  cat("Error rates:", "\n")
  cat("Random Survival Forests:", pbc3.out$err.rate[pbc3.out$ntree], "\n")
  cat("         Cox Regression:", mean(cox.err, na.rm = TRUE), "\n")
}

## End(Not run)
