Type: | Package |
Title: | Hierarchical GxE Interactions in a Regularized Regression Model |
Version: | 1.0.2 |
Date: | 2021-11-28 |
Author: | Natalia Zemlianskaia |
Maintainer: | Natalia Zemlianskaia <natasha.zemlianskaia@gmail.com> |
Description: | The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <doi:10.48550/arXiv.2103.13510>. |
License: | MIT + file LICENSE |
Imports: | Rcpp (≥ 1.0.3), Matrix, bigmemory, methods |
Depends: | dplyr, R (≥ 3.5) |
Suggests: | glmnet, testthat, knitr, rmarkdown, ggplot2 |
LinkingTo: | Rcpp, RcppEigen, RcppThread, BH, bigmemory |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2021-11-28 17:20:02 UTC; nataliazemlianskaia |
Repository: | CRAN |
Date/Publication: | 2021-11-30 07:30:02 UTC |
Hierarchical GxE Interactions in a Regularized Regression Model
Description
The method focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure for the joint selection of interaction terms in a regularized regression model. For details see Zemlianskaia et al. (2021) <arxiv:2103.13510>.
Author(s)
Natalia Zemlianskaia
Maintainer: Natalia Zemlianskaia <natasha.zemlianskaia@gmail.com>
References
"A Scalable Hierarchical Lasso for Gene-Environment Interactions", Natalia Zemlianskaia, W.James Gauderman, Juan Pablo Lewinger https://arxiv.org/abs/2103.13510
Data Generation
Description
Generates genotypes data matrix G (sample_size
by p
), vector of environmental measurments E, and an outcome vector Y of size sample_size
. Simulates training, validation, and test datasets.
Usage
data.gen(sample_size = 100, p = 20, n_g_non_zero = 15, n_gxe_non_zero = 10,
family = "gaussian", mode = "strong_hierarchical",
normalize = FALSE, normalize_response = FALSE,
seed = 1, pG = 0.2, pE = 0.3,
n_confounders = NULL)
Arguments
sample_size |
sample size of the data |
p |
total number of main effects |
n_g_non_zero |
number of non-zero main effects to generate |
n_gxe_non_zero |
number of non-zero interaction effects to generate |
family |
"gaussian" for continous outcome Y and "binomial" for binary 0/1 outcome |
mode |
either "strong_hierarchical", "hierarchical", or "anti_hierarchical". In the strong hierarchical mode the hierarchical structure is maintained (beta_g = 0 then beta_gxe = 0) and also |beta_g| >= |beta_gxe|. In the hierarchical mode the hierarchical structure is maintained, but |beta_G| < |beta_gxe|. In the anti_hierarchical mode the hierarchical structure is violated (beta_g = 0 then beta_gxe != 0). |
normalize |
|
normalize_response |
|
pG |
genotypes prevalence, value from 0 to 1 |
pE |
environment prevalence, value from 0 to 1 |
seed |
random seed |
n_confounders |
number of confounders to generate, either |
Value
A list of simulated datasets and generating coefficients
G_train , G_valid , G_test |
generated genotypes matrices |
E_train , E_valid , E_test |
generated vectors of environmental values |
Y_train , Y_valid , Y_test |
generated outcome vectors |
C_train , C_valid , C_test |
generated confounders matrices |
GxE_train , GxE_valid , GxE_test |
generated GxE matrix |
Beta_G |
main effect coefficients vector |
Beta_GxE |
interaction coefficients vector |
beta_0 |
intercept coefficient value |
beta_E |
environment coefficient value |
Beta_C |
confounders coefficient values |
index_beta_non_zero , index_beta_gxe_non_zero , index_beta_zero , index_beta_gxe_zero |
inner data generation variables |
n_g_non_zero |
number of non-zero main effects generated |
n_gxe_non_zero |
number of non-zero interactions generated |
n_total_non_zero |
total number of non-zero variables |
SNR_g |
signal-to-noise ratio for the main effects |
SNR_gxe |
signal-to-noise ratio for the interactions |
family , p , sample_size , mode , seed |
input simulation parameters |
Examples
data = data.gen(sample_size=100, p=100)
G = data$G_train; GxE = data$GxE_train
E = data$E_train; Y = data$Y_train
Get model coefficients
Description
A function to obtain coefficients from the model fit object corresponding to the desired pair of tuning parameters lambda
= (lambda_1
, lambda_2
).
Usage
gesso.coef(fit, lambda)
Arguments
fit |
model fit object obtained either by using function |
lambda |
a pair of tuning parameters organized in a tibble (ex: |
Value
A list of model coefficients corresponding to lambda
values of tuning parameters
beta_0 |
estimated intercept value |
beta_e |
estimated environmental coefficient value |
beta_g |
a vector of estimated main effect coefficients |
beta_c |
a vector of estimated confounders coefficients |
beta_gxe |
a vector of estimated interaction coefficients |
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train, grid_size=20,
parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g
Get model coefficients with specified number of non-zero interactions
Description
A function to obtain coefficients with target_b_gxe_non_zero
specified to control the desired sparsity of interactions in the model.
Usage
gesso.coefnum(cv_model, target_b_gxe_non_zero, less_than = TRUE)
Arguments
cv_model |
cross-validated model fit object obtained by using function |
target_b_gxe_non_zero |
number of non-zero interactions we want to inlcude in the model |
less_than |
|
Value
A list of model coefficients corresponding to the best model that contains at most or at least target_b_gxe_non_zero
non-zero interaction terms.
The target model is selected based on the averaged cross-validation (cv) results: for each pair of parameters lambda
=(lambda_1, lambda_2) in the grid and each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda
and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero
non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full data with the selected tuning parameters.
Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.
beta_0 |
estimated intercept value |
beta_e |
estimated environmental coefficient value |
beta_g |
a vector of estimated main effect coefficients |
beta_gxe |
a vector of estimated interaction coefficients |
beta_c |
a vector of estimated confounders coefficients |
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
model_coefficients = gesso.coefnum(model, 5)
gxe_coefficients = model_coefficients$beta_gxe; sum(gxe_coefficients!=0)
Cross-Validation
Description
Performs nfolds
-fold cross-validation to tune hyperparmeters lambda_1
and lambda_2
for the gesso model.
Usage
gesso.cv(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE, grid = NULL,
grid_size = 20, grid_min_ratio = NULL, alpha = NULL, family = "gaussian",
type_measure = "loss", fold_ids = NULL, nfolds = 4,
parallel = TRUE, seed = 42, tolerance = 1e-3, max_iterations = 5000,
min_working_set_size = 100, verbose = TRUE)
Arguments
G |
matrix of main effects of size |
E |
vector of environmental measurments |
Y |
outcome vector. Set |
C |
matrix of confounders of size |
normalize |
|
normalize_response |
|
grid |
grid sequence for tuning hyperparameters, we use the same grid for |
grid_size |
specify |
grid_min_ratio |
parameter to determine |
alpha |
if |
family |
|
type_measure |
loss to use for cross-validation. Specity |
fold_ids |
option to input custom folds assignments |
tolerance |
tolerance for the dual gap convergence criterion |
max_iterations |
maximum number of iterations |
min_working_set_size |
minimum size of the working set |
nfolds |
number of cross-validation splits |
parallel |
|
seed |
set random seed to control random folds assignments |
verbose |
|
Value
A list of objects
cv_result |
a tibble with cross-validation results: averaged across folds loss and the number of non-zero coefficients for each value of (
|
lambda_min |
a tibble of optimal ( |
fit |
list, return of the function gesso.fit on the full data |
grid |
vector of values used for hyperparameters tuning |
full_cv_result |
inner variables |
Examples
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train,
grid_size=20, parallel=TRUE, nfolds=3)
gxe_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_gxe
g_coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)$beta_g
gesso fit
Description
Fits gesso model over the two dimentional grid of hyperparmeters lambda_1
and lambda_2
, returns estimated coefficients for each pair of hyperparameters.
Usage
gesso.fit(G, E, Y, C = NULL, normalize = TRUE, normalize_response = FALSE,
grid = NULL, grid_size = 20, grid_min_ratio = NULL,
alpha = NULL, family = "gaussian", weights = NULL,
tolerance = 1e-3, max_iterations = 5000,
min_working_set_size = 100,
verbose = FALSE)
Arguments
G |
matrix of main effects of size |
E |
vector of environmental measurments |
Y |
outcome vector. Set |
C |
matrix of confounders of size |
normalize |
|
normalize_response |
|
grid |
grid sequence for tuning hyperparameters, we use the same grid for |
grid_size |
specify |
grid_min_ratio |
parameter to determine |
alpha |
if |
family |
|
tolerance |
tolerance for the dual gap convergence criterion |
max_iterations |
maximum number of iterations |
min_working_set_size |
minimum size of the working set |
weights |
inner fitting parameter |
verbose |
|
Value
A list of estimated coefficients and other model fit metrics for each pair of hyperparameters (lambda_1
, lambda_2
)
beta_0 |
vector of estimated intercept values of size |
beta_e |
vector of estimated environment coefficients of size |
beta_g |
matrix of estimated main effects coefficients organized by rows, size ( |
beta_gxe |
matrix of estimated interactions coefficients organized by rows, size ( |
beta_c |
matrix of estimated confounders coefficients organized by rows, size ( |
num_iterations |
number of iterations until convergence for each fit |
working_set_size |
maximum number of variables in the working set for each fit |
has_converged |
1 if the model converged within given |
objective_value |
objective function (loss) value for each fit |
beta_g_nonzero |
number of estimated non-zero main effects for each fit |
beta_gxe_nonzero |
number of estimated non-zero interactions for each fit |
lambda_1 |
|
lambda_2 |
|
grid |
vector of values used for hyperparameters tuning |
Examples
data = data.gen()
fit = gesso.fit(G=data$G_train, E=data$E_train, Y=data$Y_train, normalize=TRUE)
plot(fit$beta_g_nonzero, pch=19, cex=0.4,
ylab="num of non-zero features", xlab="lambdas path")
points(fit$beta_gxe_nonzero, pch=19, cex=0.4, col="red")
Predict new outcome vector
Description
Predict new outcome vector based on the new data and estimated model coefficients.
Usage
gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E,
beta_c=NULL, new_C=NULL, family = "gaussian")
Arguments
beta_0 |
estimated intercept value |
beta_e |
estimated environmental coefficient value |
beta_g |
a vector of estimated main effect coefficients |
beta_gxe |
a vector of estimated interaction coefficients |
new_G |
matrix of main effects, variables organized by columns |
new_E |
vector of environmental measurments |
beta_c |
a vector of estimated confounders coefficients |
new_C |
matrix of confounders, variables organized by columns |
family |
set |
Value
Returns a vector of predicted values
Examples
data = data.gen()
tune_model = gesso.cv(data$G_train, data$E_train, data$Y_train)
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe
new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
cor(new_Y, data$Y_test)^2
Selection metrics
Description
Calculates principal selection metrics for the binary zero/non-zero classification problem (sensitivity, specificity, precision, auc).
Usage
selection.metrics(true_b_g, true_b_gxe, estimated_b_g, estimated_b_gxe)
Arguments
true_b_g |
vector of true main effect coefficients |
true_b_gxe |
vector of true interaction coefficients |
estimated_b_g |
vector of estimated main effect coefficients |
estimated_b_gxe |
vector of estimated interaction coefficients |
Value
A list of principal selection metrics
b_g_non_zero |
number of non-zero main effects |
b_gxe_non_zero |
number of non-zero interactions |
mse_b_g |
mean squared error for estimation of main effects effect sizes |
mse_b_gxe |
mean squared error for estimation of interactions effect sizes |
sensitivity_g |
recall of the non-zero main effects |
specificity_g |
recall of the zero main effects |
precision_g |
precision with respect to non-zero main effects |
sensitivity_gxe |
recall of the non-zero interactions |
specificity_gxe |
recall of the zero interactions |
precision_gxe |
precision with respect to non-zero interactions |
auc_g |
area under the curve for zero/non-zero binary classification problem for main effects |
auc_gxe |
area under the curve for zero/non-zero binary classification problem for interactions |
Examples
data = data.gen()
model = gesso.cv(data$G_train, data$E_train, data$Y_train)
gxe_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_gxe
g_coefficients = gesso.coef(model$fit, model$lambda_min)$beta_g
selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients, gxe_coefficients)