Title: | SuperLearner Method for MICE |
---|---|
Description: | Adds a Super Learner ensemble model method (using the 'SuperLearner' package) to the 'mice' package. Laqueur, H. S., Shev, A. B., Kagawa, R. M. C. (2021) <doi:10.1093/aje/kwab271>. |
Authors: | Aaron B. Shev |
Maintainer: | Aaron B. Shev <[email protected]> |
License: | GPL-3 |
Version: | 1.1.1 |
Built: | 2025-03-05 04:18:32 UTC |
Source: | https://github.com/abshev/supermice |
Function to generate imputations using SuperLearner for data with a binary outcome.
binarySuperLearner(y, x, wy, SL.library, ...)
Argument | Description |
---|---|
y | Vector of observed values of the variable to be imputed. |
x | Numeric matrix of variables to be used as predictors in SuperLearner methods, with rows corresponding to values in y. |
wy | Logical vector of length length(y). A TRUE value indicates locations in y for which imputations are created. |
SL.library | Either a character vector of prediction algorithms or a list containing character vectors. A list of functions included in the SuperLearner package can be found with SuperLearner::listWrappers(). |
... | Further arguments passed to SuperLearner(). |
Binary vector of randomly drawn imputed values.
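The kind of draw this method produces can be illustrated directly with the SuperLearner package. The sketch below is only an illustration, assuming (as the binary-case note in the mice.impute.SuperLearner details suggests) that imputations are Bernoulli draws from the ensemble's predicted probabilities; the data and object names are made up, and this is not the package's internal code.

## Minimal sketch (not the package's internal code): Bernoulli draws from
## SuperLearner predicted probabilities for the cases to be imputed.
library(SuperLearner)

set.seed(1)
n  <- 100
x  <- data.frame(x1 = rnorm(n))
y  <- rbinom(n, 1, plogis(x$x1))   # binary variable
wy <- runif(n) < 0.2               # TRUE where an imputation is needed

fit <- SuperLearner(Y = y[!wy], X = x[!wy, , drop = FALSE],
                    newX = x[wy, , drop = FALSE],
                    family = binomial(), SL.library = c("SL.mean", "SL.glm"))

p.miss  <- fit$SL.predict[, 1]          # ensemble predicted probabilities
imputed <- rbinom(sum(wy), 1, p.miss)   # random draws for the missing cases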
Function to generate imputations using SuperLearner for data with a continuous outcome.
continuousSuperLearner(y, x, wy, SL.library, kernel, bw, bw.update, ...)
Argument | Description |
---|---|
y | Vector of observed and missing/imputed values of the variable to be imputed. |
x | Numeric matrix of variables to be used as predictors in SuperLearner models, with rows corresponding to observed values of the variable to be imputed and columns corresponding to individual predictor variables. |
wy | Logical vector. A TRUE value indicates locations in y for which imputations are created. |
SL.library | Either a character vector of prediction algorithms or a list containing character vectors. A list of functions included in the SuperLearner package can be found with SuperLearner::listWrappers(). |
kernel | One of "gaussian", "uniform", or "triangular". Kernel function used to compute weights. |
bw | NULL or numeric vector of candidate bandwidths for the kernel function (as standard deviations of the kernel). |
bw.update | Logical indicating whether bandwidths should be computed at every iteration or only at the first iteration. Default is TRUE. |
... | Further arguments passed to SuperLearner(). |
numeric vector of randomly drawn imputed values.
Kernel functions used for local imputation
gaussianKernel(x, xcenter, bw = 1, lambda = NULL)
Argument | Description |
---|---|
x | Numeric vector of values to weight. |
xcenter | Numeric value at which to center the kernel. |
bw | Bandwidth of the kernel. |
lambda | Kernel radius, a function of bw. |
Kernel values for x centered at xcenter.
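For reference, a common unnormalized form of the Gaussian kernel weight is sketched below; it illustrates the weighting idea only and is not necessarily the exact expression used internally (in particular, the role of lambda is not shown). The helper name gaussian_weight is hypothetical.

## Common (unnormalized) Gaussian kernel weight: values of x close to xcenter
## receive weight near 1, values far away receive weight near 0.
gaussian_weight <- function(x, xcenter, bw = 1) {
  exp(-((x - xcenter)^2) / (2 * bw^2))
}

gaussian_weight(c(-1, 0, 0.5, 2), xcenter = 0, bw = 1)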
Jackknife method for bandwidth selection
jackknifeBandwidthSelection(i, bwGrid, preds, y, delta, kernel)
Argument | Description |
---|---|
i | Integer referring to the index of the missing value to be imputed. |
bwGrid | Numeric vector of candidate bandwidth values. |
preds | Numeric vector of predicted values for missing observations. |
y | Numeric vector of observed and missing values of the variable to be imputed. |
delta | Binary vector of the same length as y indicating which values of y are observed and which are missing. |
kernel | One of "gaussian", "uniform", or "triangular". Kernel function used to compute weights. |
The selected bandwidth.
Computes jackknife variance
jackknifeVariance(j, kernMatrix, delta, y)
Argument | Description |
---|---|
j | Integer index for the deleted observation in the jackknife procedure. |
kernMatrix | Matrix of kernel weights. |
delta | Binary vector of the same length as y indicating which values of y are observed and which are missing. |
y | Numeric vector of observed and missing values of the variable to be imputed. |
returns a single numeric value for the estimate of the jackknife variance.
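As a point of reference, the standard leave-one-out jackknife variance estimator is sketched below with the sample mean as the estimator; this is the textbook form of the technique, shown for orientation only, and the package's implementation operates on kernel-weighted quantities rather than this plain example. The helper name jackknife_var is hypothetical.

## Textbook leave-one-out jackknife variance of an estimator (here the sample
## mean), shown only to illustrate the general technique.
jackknife_var <- function(y, estimator = mean) {
  n         <- length(y)
  theta.loo <- vapply(seq_len(n), function(j) estimator(y[-j]), numeric(1))
  (n - 1) / n * sum((theta.loo - mean(theta.loo))^2)
}

jackknife_var(rnorm(50))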
Function to generate imputations using non-parametric and semi-parametric local imputation methods.
localImputation(i, preds, y, delta, bw = NULL, kernel = c("gaussian", "uniform", "triangular"))
Argument | Description |
---|---|
i | Integer referring to the index of the missing value to be imputed. |
preds | Numeric vector of predictions of missing values from SuperLearner. |
y | Numeric vector of the variable to be imputed. |
delta | Binary vector of the same length as y indicating which values of y are observed and which are missing. |
bw | NULL or numeric value for the bandwidth of the kernel function. |
kernel | One of "gaussian", "uniform", or "triangular". Kernel function used to compute weights. |
numeric vector of randomly drawn imputed values.
SuperLearner method for the mice package. Method for the mice package that uses SuperLearner as the predictive algorithm. Model fitting is done using the SuperLearner package.
mice.impute.SuperLearner(y, ry, x, wy = NULL, SL.library, kernel = c("gaussian", "uniform", "triangular"), bw = c(0.1, 0.2, 0.25, 0.3, 0.5, 1, 2.5, 5, 10, 20), bw.update = TRUE, ...)
Argument | Description |
---|---|
y | Vector to be imputed. |
ry | Logical vector of length length(y) indicating the observed (TRUE) and missing (FALSE) values in y. |
x | Numeric design matrix with length(y) rows containing the predictors for y. |
wy | Logical vector of length length(y). A TRUE value indicates locations in y for which imputations are created. |
SL.library | For SuperLearner: either a character vector of prediction algorithms or a list containing character vectors, as specified by the SuperLearner package. See details below. |
kernel | One of "gaussian", "uniform", "triangular". Kernel function used to compute weights. |
bw | NULL or numeric vector of candidate bandwidths for the kernel function (as standard deviations of the kernel). |
bw.update | Logical indicating whether bandwidths should be computed at every iteration or only at the first iteration. Default is TRUE. |
... | Further arguments passed to SuperLearner(). |
mice.impute.SuperLearner() is a method for use with the mice() function that incorporates the ensemble predictive model SuperLearner (van der Laan, 2011) into the mice (van Buuren, 2011) multiple imputation procedure. This function is never called directly; instead, a user who wishes to use SuperLearner in MICE simply needs to set the argument method = "SuperLearner" in the call to mice(). Arguments for the SuperLearner() function are passed from mice as extra arguments in the mice() call.
All MICE methods randomly generate imputed values for a number of data sets. The approach of SuperMICE is to estimate parameters for a normal distribution centered at the point estimate for an imputed value predicted by a SuperLearner model. The point estimates are obtained by fitting a selection of different predictive models on complete cases and determining an optimal weighted average of candidate models to predict the missing cases. SuperMICE uses the implementation of SuperLearner found in the SuperLearner package.

The models to be used with SuperLearner() are supplied by the user as a character vector. For a full list of available methods see listWrappers().
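To make the "optimal weighted average of candidate models" concrete, the sketch below fits a small SuperLearner ensemble and inspects the weights assigned to each candidate algorithm; the data and library choice are made up for illustration.

## Illustration of the ensemble step: fit candidate models and inspect the
## weights SuperLearner gives each one.
library(SuperLearner)

set.seed(2)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- X$x1 + X$x2^2 + rnorm(n)

fit <- SuperLearner(Y = y, X = X, family = gaussian(),
                    SL.library = c("SL.mean", "SL.glm"))

fit$coef              # weight given to each candidate algorithm
head(fit$SL.predict)  # ensemble (weighted average) predictions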
SuperLearner models do not produce standard errors for estimates, so instead we use a kernel-based estimate of the local variance around each point estimate as the variance parameter in the normal distribution used to randomly sample values. The kernel can be set by the user with the kernel argument as either a Gaussian, uniform, or triangular kernel. The user must also supply a list of candidate bandwidths in the bw argument as a numeric vector. For more information on the variance and bandwidth selection see Laqueur et al. (2021). In every iteration of the mice procedure, the optimal bandwidth is reselected. To speed up the total run time of the imputation, the bandwidth may instead be selected only on the first iteration by setting bw.update to FALSE; however, this may bias the results. Note that this only applies to continuous response variables. In the binary case the variance is a function of the SuperLearner estimate.
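A compact sketch of the continuous-case idea is given below, assuming a Gaussian kernel and a kernel-weighted variance of observed values around each prediction; the helper impute_one and the data are made up, and the jackknife bandwidth selection over the bw grid is deliberately omitted.

## Sketch of the continuous-case draw: centre a normal at the SuperLearner
## prediction and use a kernel-weighted local variance of observed values
## (illustration only; bandwidth selection is omitted).
set.seed(3)
obs.pred  <- rnorm(100)                        # predictions for observed cases
obs.y     <- obs.pred + rnorm(100, sd = 0.5)   # observed outcome values
miss.pred <- c(-0.3, 0.8)                      # predictions for two missing cases
bw        <- 0.5

impute_one <- function(p, obs.pred, obs.y, bw) {
  w        <- exp(-((obs.pred - p)^2) / (2 * bw^2))   # Gaussian kernel weights
  mu.local <- sum(w * obs.y) / sum(w)                 # kernel-weighted local mean
  v.local  <- sum(w * (obs.y - mu.local)^2) / sum(w)  # kernel-weighted local variance
  rnorm(1, mean = p, sd = sqrt(v.local))              # draw centred at the prediction
}

sapply(miss.pred, impute_one, obs.pred = obs.pred, obs.y = obs.y, bw = bw)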
Vector with imputed data, same type as y, and of length sum(wy).
Laqueur, H. S., Shev, A. B., Kagawa, R. M. C. (2021). SuperMICE: An Ensemble Machine Learning Approach to Multiple Imputation by Chained Equations. American Journal of Epidemiology, kwab271, doi:10.1093/aje/kwab271.
Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. doi:10.18637/jss.v045.i03.

van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1), Article 25.
#Multiple imputation with missingness on a continuous variable.
#Randomly generated data with missingness in x2. The probability of x2
# being missing increases with the value of x1.
n <- 20
pmissing <- 0.10
x1 <- runif(n, min = -3, max = 3)
x2 <- x1^2 + rnorm(n, mean = 0, sd = 1)
error <- rnorm(n, mean = 0, sd = 1)
y <- x1 + x2 + error
f <- ecdf(x1)
x2 <- ifelse(runif(n) < (f(x1) * 2 * pmissing), NA, x2)
dat <- data.frame(y, x1, x2)

#Create vector of SuperLearner method names
# Note: see SuperLearner::listWrappers() for a full list of methods
# available.
SL.lib <- c("SL.mean", "SL.glm")

#Run mice().
# Note 1: m >= 30 and maxit >= 10 are recommended outside of this
# toy example
# Note 2: a denser bandwidth grid is recommended, see default for bw
# argument for example.
imp.SL <- mice::mice(dat, m = 2, maxit = 2, method = "SuperLearner",
                     print = TRUE, SL.library = SL.lib, kernel = "gaussian",
                     bw = c(0.25, 1, 5))