predict.rsf {randomSurvivalForest} | R Documentation |
Prediction on test data using Random Survival Forests.
predict.rsf(object = NULL, test = NULL, importance = c("randomsplit", "permute", "none")[1], na.action = c("na.omit", "na.impute")[1], outcome = c("train", "test")[1], proximity = FALSE, split.depth = FALSE, seed = NULL, do.trace = FALSE, ...)
object |
An object of class |
test |
Data frame containing test data. Missing values allowed. |
importance |
Method used to compute variable importance (VIMP). Only applies when test data contains survival outcomes. |
na.action |
Action to be taken if the data contains NA's. Possible
values are " |
outcome |
Data frame used in calculating the ensemble. By default this is always the training data, but see details below. |
proximity |
Should proximity measure between test observations be calculated? Can be large. |
split.depth |
Return minimal depth for each variable for each test set individual? |
seed |
Seed (negative integer) for random number generator. |
do.trace |
Logical. Should trace output be enabled? Integer
values can also be passed. A positive value causes output to
be printed each |
... |
Further arguments passed to or from other methods. |
The test data is dropped down the grow-forest (i.e., the forest grown
from the training data) yielding the ensemble cumulative hazard
function (CHF) for each individual in the test data evaluated at each
unique death time point from the grow data. If survival outcome
information is present in the test data, the overall error rate and
VIMP for each variable is also returned. Setting
na.action
="na.impute
" imputes missing test data
(x-variables or outcomes). Imputation uses the grow-forest such that
only training data is used when imputing test data to avoid biasing
error rates and VIMP (Ishwaran et al. 2008).
For competing risks, the ensemble conditional CHF (CCHF) is computed for each event type in addition to the ensemble CHF.
If outcome
="test
", the ensemble is calculated by
specifically using survival information from the test data (survival
information must be present). In this case, the terminal nodes from
the grow-forest are recalculated using survival data from the test
set. This yields a modified predictor in which the topology of the
forest is based solely on the training data, but where the predicted
value is based on the test data. Error rates and VIMP are calculated
by bootstrapping the test data and using out-of-bagging to ensure
unbiased estimates. See Examples 2 and 3 below for illustration.
An object of class (rsf, predict)
, which is a list with the
following components:
call |
The original grow call to |
forest |
The grow forest. |
ntree |
Number of trees in the grow forest. |
leaf.count |
Number of terminal nodes for each tree in the
grow forest. Vector of length |
timeInterest |
Sorted unique event times from grow (training) data. Ensemble values given for these time points only. |
n |
Sample size of test data (depends upon NA's, see |
ndead |
Number of deaths in test data (can be NULL). |
time |
Vector of survival times from test data (can be NULL). |
cens |
Vector of censoring indicators from test data (can be NULL). |
predictorNames |
Character vector of variable names. |
predictors |
Data frame comprising x-variables used for prediction. |
ensemble |
Matrix containing the ensemble CHF for the test
data. Each row corresponds to the CHF for an individual in the
test set evaluated at each of the time points in
|
poe |
Matrix containing the ensemble probability of an event (POE) for each test set individual: used to estimate the CIF. Rows correspond to each of the event types. Applies only to competing risk data. NULL otherwise. |
mortality |
Vector containing ensemble mortality for each
individual in the test data. Ensemble mortality should
be interpreted in terms of total number of training deaths if
|
err.rate |
Vector of length |
importance |
VIMP of each variable in the test data. For competing risks, a matrix with rows corresponding to the ensemble CHF and each of the CCHFs, respectively. Can be NULL. |
proximity |
If |
imputedIndv |
Vector of indices of records in test data with missing values. Can be NULL. |
imputedData |
Data frame containing the imputed test data. First
two columns are censoring and survival time, respectively. The
remaining columns are the x-variables. Row i contains imputed
outcomes and x-variables for row |
splitDepth |
Matrix where entry [i][j] is the mean minimal
depth for variable [j] for case [i] in the test data. Used for
variable selection (see |
Hemant Ishwaran hemant.ishwaran@gmail.com
Udaya B. Kogalur kogalurshear@gmail.com
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Ishwaran H., Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.
Ishwaran H., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2010). Random survival forests for competing risks.
rsf
.
#------------------------------------------------------------ # Example 1: Typical call (veteran data) data(veteran, package = "randomSurvivalForest") pt.train <- sample(1:nrow(veteran), round(nrow(veteran)*0.80)) veteran.out <- rsf(Surv(time, status) ~ ., data = veteran[pt.train , ]) veteran.pred <- predict(veteran.out, veteran[-pt.train , ]) ## Not run: #------------------------------------------------------------ # Example 2: Get out-of-bag error rate using the training # data as test data (pbc example) data(pbc, package = "randomSurvivalForest") pbc.grow <- rsf(Surv(days, status) ~ ., pbc, nsplit = 3) pbc.pred <- predict(pbc.grow, pbc, outcome = "test") cat("GROW error rate :", round(pbc.grow$err.rate[1000], 3)) cat("PRED error rate :", round(pbc.pred$err.rate, 3)) #------------------------------------------------------------ # Example 3: Verify reproducibility of forest (pbc data) #primary call data(pbc, package = "randomSurvivalForest") pt.train <- sample(1:nrow(pbc), round(nrow(pbc)*0.50)) pbc.out <- rsf(Surv(days, status) ~ ., nsplit = 3, data = pbc[pt.train, ]) #make separate predict calls using the outcome option pbc.train <- predict(pbc.out, pbc[-pt.train, ], outcome = "train") pbc.test <- predict(pbc.out, pbc[-pt.train, ], outcome = "test") #check forest reproducibilility by comparing predicted survival curves timeInterest <- pbc.out$timeInterest surv.train <- exp(-pbc.train$ensemble) surv.test <- exp(-pbc.test$ensemble) matplot(timeInterest, t(surv.train - surv.test), type = "l") #test reproducibility by repeating B times #compute l1-difference in predicted survival B <- 25 l1.valid <- rep(NA, B) for (b in 1:B) { cat("Replication:", b, "\n") pt.train <- sample(1:nrow(pbc), round(nrow(pbc)*0.50)) pbc.out <- rsf(Surv(days, status) ~ ., nsplit = 3, data = pbc[pt.train, ]) surv.train <- exp(-predict(pbc.out, pbc[-pt.train, ], outcome = "train")$ensemble) surv.test <- exp(-predict(pbc.out, pbc[-pt.train, ], outcome = "test")$ensemble) l1.valid <- mean(apply(abs(surv.train - surv.test), 1, mean, na.rm = TRUE), na.rm = TRUE) } cat("l1-reproducibility:", round(mean(l1.valid, na.rm = TRUE), 3), "\n") ## End(Not run)