We are interested in assessing the population intervention direct effect and the population intervention indirect effect, based on the effect decomposition of the population intervention effect introduced in Dı́az and Hejazi (2020).
To proceed, we’ll use as our running example a simple data set from
an observational study of the relationship between BMI and kids
behavior, distributed as part of the mma
R package
on CRAN. First, let’s load the packages we’ll be using and set a
seed; then, load this data set and take a quick look at it
# preliminaries
library(data.table)
library(dplyr)
library(tidyr)
library(sl3)
library(medshift)
library(mma)
set.seed(429153)
# load and examine data
data(weight_behavior)
dim(weight_behavior)
## [1] 691 15
## bmi age sex race numpeople car gotosch snack tvhours cmpthours
## 1 18.20665 12.2 F OTHER 5 3 2 1 4 0
## 2 22.78401 12.8 M OTHER 4 3 2 1 4 2
## 3 19.60725 12.6 F OTHER 4 2 4 2 NA NA
## 4 25.56754 12.1 M OTHER 2 3 2 1 0 2
## 5 15.07408 12.3 M OTHER 4 1 2 1 2 1
## 6 22.98338 11.8 M OTHER 4 1 1 1 4 3
## cellhours sports exercises sweat overweigh
## 1 0 2 2 1 0
## 2 0 1 8 2 0
## 3 NA <NA> 4 2 0
## 4 0 2 9 1 1
## 5 3 1 12 1 0
## 6 2 1 1 1 0
The documentation for the data set describes it as a “database obtained from the Louisiana State University Health Sciences Center, New Orleans, by Dr. Richard Scribner. He explored the relationship between BMI and kids behavior through a survey at children, teachers and parents in Grenada in 2014. This data set includes 691 observations and 15 variables.”
Unfortunately, the data set contains a few observations with missing values. As these are unrelated to the object of our analysis, we’ll simply remove these for the time being. Note that in a real data analysis, we might consider strategies to fully make of the observed data, perhaps by imputing missing values. For now, we simply remove the incomplete observations, resulting in a data set with fewer observations but much the same structure as the original:
## [1] 567 15
## bmi age sex race numpeople car gotosch snack tvhours cmpthours
## 1 18.20665 12.2 F OTHER 5 3 2 1 4 0
## 2 22.78401 12.8 M OTHER 4 3 2 1 4 2
## 3 25.56754 12.1 M OTHER 2 3 2 1 0 2
## 4 15.07408 12.3 M OTHER 4 1 2 1 2 1
## 5 22.98338 11.8 M OTHER 4 1 1 1 4 3
## 6 19.15658 12.1 F OTHER 3 3 2 1 0 0
## cellhours sports exercises sweat overweigh
## 1 0 1 2 1 0
## 2 0 0 8 2 0
## 3 0 1 9 1 1
## 4 3 0 12 1 0
## 5 2 0 1 1 0
## 6 1 0 1 3 0
For the analysis of this observational data set, we focus on the
effect of participating in a sports team (sports
) on the
BMI of children (bmi
), taking several related covariates as
mediators (snack
, exercises
,
overweigh
) and all other collected covariates as potential
confounders. Considering an NPSEM, we separate the observed variables
from the data set into their corresponding nodes as follows
Y <- weight_behavior_complete$bmi
A <- weight_behavior_complete$sports
Z <- weight_behavior_complete %>%
select(snack, exercises, overweigh)
W <- weight_behavior_complete %>%
select(age, sex, race, numpeople, car, gotosch, tvhours, cmpthours,
cellhours, sweat)
Finally, in our analysis, we consider an incremental propensity score intervention (IPSI), as first proposed by Kennedy (2018), wherein the odds of participating in a sports team is modulated by some fixed amount (0 ≤ δ ≤ ∞) for each individual. Such an intervention may be interpreted as the effect of a school program that motivates children to participate in sports teams. To exemplify our approach, we postulate a motivational intervention that triples the odds of participating in a sports team for each individual:
To easily incorporate ensemble machine learning into the estimation
procedure, we rely on the facilities provided in the sl3
R package (coyle2020sl3?). For a complete
guide on using the sl3
R package, consider consulting https://tlverse.org/sl3,
or https://tlverse.org
(and https://github.com/tlverse) for the tlverse
ecosystem, of which sl3
is a major part. We construct an
ensemble learner using a handful of popular machine learning algorithms
below
# SL learners used for continuous data (the nuisance parameter M)
xgb_contin_lrnr <- Lrnr_xgboost$new(nrounds = 50, objective = "reg:linear")
enet_contin_lrnr <- Lrnr_glmnet$new(alpha = 0.5, family = "gaussian",
nfolds = 3)
lasso_contin_lrnr <- Lrnr_glmnet$new(alpha = 1, family = "gaussian",
nfolds = 3)
fglm_contin_lrnr <- Lrnr_glm_fast$new(family = gaussian())
contin_lrnr_lib <- Stack$new(enet_contin_lrnr, lasso_contin_lrnr,
fglm_contin_lrnr, xgb_contin_lrnr)
sl_contin_lrnr <- Lrnr_sl$new(learners = contin_lrnr_lib,
metalearner = Lrnr_nnls$new())
# SL learners used for binary data (nuisance parameters G and E in this case)
xgb_binary_lrnr <- Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic")
enet_binary_lrnr <- Lrnr_glmnet$new(alpha = 0.5, family = "binomial",
nfolds = 3)
lasso_binary_lrnr <- Lrnr_glmnet$new(alpha = 1, family = "binomial",
nfolds = 3)
fglm_binary_lrnr <- Lrnr_glm_fast$new(family = binomial())
binary_lrnr_lib <- Stack$new(enet_binary_lrnr, lasso_binary_lrnr,
fglm_binary_lrnr, xgb_binary_lrnr)
logistic_metalearner <- make_learner(Lrnr_solnp,
metalearner_logistic_binomial,
loss_loglik_binomial)
sl_binary_lrnr <- Lrnr_sl$new(learners = binary_lrnr_lib,
metalearner = logistic_metalearner)
We may decompose the population intervention effect (PIE) in terms of a population intervention direct effect (PIDE) and a population intervention indirect effect (PIIE):
This decomposition of the PIE as the sum of the population intervention direct and indirect effects has an interpretation analogous to the corresponding standard decomposition of the average treatment effect. In the sequel, we will compute each of the components of the direct and indirect effects above using appropriate estimators as follows
npcausal
R
package.As given in Dı́az and Hejazi (2020), the
statistical functional identifying the decomposition term that appears
in both the PIDE and PIIE 𝔼{Y(Aδ, Z)},
which corresponds to altering the exposure mechanism while keeping the
mediation mechanism fixed, is for which a one-step estimator is
available. The corresponding efficient influence function (EIF)
with respect to the nonparametric model ℳ is Dη, δ(o) = Dη, δY(o) + Dη, δA(o) + Dη, δZ, W(o) − θ(δ).
The one-step estimator may be computed using the EIF estimating
equation, making use of cross-fitting (Zheng and
van der Laan 2011; Chernozhukov et al. 2018) to circumvent any
need for entropy conditions (i.e., Donsker class restrictions). The
resultant estimator is which is implemented in the medshift
R package. We make use of that implementation to estimate 𝔼{Y(Aδ, Z)}
via its one-step estimator θ̂(δ) below
# let's compute the parameter where A (but not Z) are shifted
theta_eff <- medshift(W = W, A = A, Z = Z, Y = Y,
delta = delta_shift_ipsi,
g_learners = sl_binary_lrnr,
e_learners = sl_binary_lrnr,
m_learners = sl_contin_lrnr,
phi_learners = Lrnr_hal9001$new(),
estimator = "onestep",
estimator_args = list(cv_folds = 3))
## [02:46:10] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:10] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:11] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:47] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:47] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:46:48] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:07] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:08] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:08] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:08] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [02:47:08] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## lwr_ci param_est upr_ci param_var eif_mean estimator
## 18.77974 19.133084 19.486428 0.032501 9.767717e-16 onestep
Recall that, based on the decomposition outlined previously, the population intervention direct effect may be denoted βPIDE(δ) = θ0(δ) − 𝔼Y. Thus, an estimator of the PIDE, β̂PIDE(δ) may be expressed as a composition of estimators of its constituent parameters:
Based on the above, we may construct an estimator of the PIDE using quantities already computed. The convenience function below applies the simple delta method required in the case of a linear contrast between the two constituent parameters:
# convenience function to compute inference via delta method: EY1 - EY0
linear_contrast <- function(params, eifs, ci_level = 0.95) {
# bounds for confidence interval
ci_norm_bounds <- c(-1, 1) * abs(stats::qnorm(p = (1 - ci_level) / 2))
param_est <- params[[1]] - params[[2]]
eif <- eifs[[1]] - eifs[[2]]
se_eif <- sqrt(var(eif) / length(eif))
param_ci <- param_est + ci_norm_bounds * se_eif
# parameter and inference
out <- c(param_ci[1], param_est, param_ci[2])
names(out) <- c("lwr_ci", "param_est", "upr_ci")
return(out)
}
With the above convenience function in hand, we’ll construct or extract the necessary components from existing objects and simply apply the function:
# parameter estimates and EIFs for components of direct effect
EY <- mean(Y)
eif_EY <- Y - EY
params_de <- list(theta_eff$theta, EY)
eifs_de <- list(theta_eff$eif, eif_EY)
# direct effect = EY - estimated quantity
de_est <- linear_contrast(params_de, eifs_de)
de_est
## lwr_ci param_est upr_ci
## -0.481588292 0.005976251 0.493540793
As given above, we have for our estimate of the direct effect β̂PIDE(δ)= 0.006.