Type: | Package |
Title: | Estimation of Generalized Matrix Factorization Models via Stochastic Gradient Descent |
Version: | 1.0.1 |
Date: | 2025-05-17 |
Description: | Efficient framework to estimate high-dimensional generalized matrix factorization models using penalized maximum likelihood under a dispersion exponential family specification. Either deterministic and stochastic methods are implemented for the numerical maximization. In particular, the package implements the stochastic gradient descent algorithm with a block-wise mini-batch strategy to speed up the computations and an efficient adaptive learning rate schedule to stabilize the convergence. All the theoretical details can be found in Castiglione et al. (2024, <doi:10.48550/arXiv.2412.20509>). Other methods considered for the optimization are the alternated iterative re-weighted least squares and the quasi-Newton method with diagonal approximation of the Fisher information matrix discussed in Kidzinski et al. (2022, http://jmlr.org/papers/v23/20-1104.html). |
License: | MIT + file LICENSE |
Imports: | Rcpp (≥ 1.0.10), RcppArmadillo, RSpectra, parallel, doParallel, foreach, MASS, SuppDists, methods, generics, reshape2, ggpubr, viridisLite |
LinkingTo: | Rcpp, RcppArmadillo |
Depends: | R (≥ 4.0.0), ggplot2 |
Suggests: | testthat (≥ 3.0.0), Rtsne, dplyr, knitr, rmarkdown |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
URL: | https://github.com/CristianCastiglione/sgdGMF |
BugReports: | https://github.com/CristianCastiglione/sgdGMF/issues |
RoxygenNote: | 7.2.3 |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2025-05-17 12:50:04 UTC; crist |
Author: | Cristian Castiglione
|
Maintainer: | Cristian Castiglione <cristian_castiglione@libero.it> |
Repository: | CRAN |
Date/Publication: | 2025-05-17 22:00:02 UTC |
sgdGMF: Estimation of Generalized Matrix Factorization Models via Stochastic Gradient Descent
Description
Efficient framework to estimate high-dimensional generalized matrix factorization models using penalized maximum likelihood under a dispersion exponential family specification. Either deterministic and stochastic methods are implemented for the numerical maximization. In particular, the package implements the stochastic gradient descent algorithm with a block-wise mini-batch strategy to speed up the computations and an efficient adaptive learning rate schedule to stabilize the convergence. All the theoretical details can be found in Castiglione et al. (2024, doi:10.48550/arXiv.2412.20509). Other methods considered for the optimization are the alternated iterative re-weighted least squares and the quasi-Newton method with diagonal approximation of the Fisher information matrix discussed in Kidzinski et al. (2022, http://jmlr.org/papers/v23/20-1104.html).
Author(s)
Maintainer: Cristian Castiglione cristian_castiglione@libero.it (ORCID)
Other contributors:
Davide Risso davide.risso@unipd.it (ORCID) [contributor]
Alexandre Segers alexandre.segers@ugent.be (ORCID) [contributor]
See Also
Useful links:
Report bugs at https://github.com/CristianCastiglione/sgdGMF/issues
Biplot of an initialized GMF model
Description
Plot the observations on a two-dimensional projection determined by the estimated score matrix
Usage
## S3 method for class 'initgmf'
biplot(
x,
...,
choices = 1:2,
arrange = TRUE,
byrow = FALSE,
normalize = FALSE,
labels = NULL,
palette = NULL
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
choices |
a length 2 vector specifying the components to plot |
arrange |
if |
byrow |
if |
normalize |
if |
labels |
a vector of labels which should be plotted |
palette |
the color-palette which should be used |
Value
If arrange=TRUE
, a single ggplot object with the selected biplots,
otherwise, a list of two ggplot objects showing the row and column latent variables.
See Also
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the biplot of a GMF model
biplot(init) # 1st vs 2nd principal components
biplot(init, choices = 2:3) #2nd vs 3rd principal components
Biplot of a GMF model
Description
Plot the observations on a two-dimensional projection determined by the estimated score matrix
Usage
## S3 method for class 'sgdgmf'
biplot(
x,
...,
choices = 1:2,
arrange = TRUE,
byrow = FALSE,
normalize = FALSE,
labels = NULL,
palette = NULL,
titles = c(NULL, NULL)
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
choices |
a length 2 vector specifying the components to plot |
arrange |
if |
byrow |
if |
normalize |
if |
labels |
a vector of labels which should be plotted |
palette |
the color-palette which should be used |
titles |
a 2-dimensional string vector containing the plot titles |
Value
If arrange=TRUE
, a single ggplot object with the selected biplots,
otherwise, a list of two ggplot objects showing the row and column latent variables.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the biplot of a GMF model
biplot(gmf)
Extract the coefficient of an initialized GMF model
Description
Return the initialized coefficients of a GMF model, i.e., the row- and column-specific regression effects, the latent scores and loadings.
Usage
## S3 method for class 'initgmf'
coefficients(
object,
...,
type = c("all", "colreg", "rowreg", "scores", "loadings")
)
## S3 method for class 'initgmf'
coef(object, ..., type = c("all", "colreg", "rowreg", "scores", "loadings"))
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of coefficients which should be returned |
Value
If type="all"
, a list of coefficients containing the fields B
, A
, U
and V
.
Otherwise, a matrix of coefficients, corresponding to the selected type
.
See Also
coefficients.sgdgmf
and coef.sgdgmf
.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the estimated coefficients of a GMF model
str(coefficients(init)) # returns all the coefficients
str(coefficients(init, type = "scores")) # returns only the scores, say U
str(coefficients(init, type = "loadings")) # returns only the loadings, say V
Extract the coefficient of a GMF model
Description
Return the estimated coefficients of a GMF model, i.e., the row- and column-specific regression effects, the latent scores and loadings.
Usage
## S3 method for class 'sgdgmf'
coefficients(
object,
...,
type = c("all", "colreg", "rowreg", "scores", "loadings")
)
## S3 method for class 'sgdgmf'
coef(object, ..., type = c("all", "colreg", "rowreg", "scores", "loadings"))
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of coefficients which should be returned |
Value
If type="all"
, a list of coefficients containing the fields B
, A
, U
and V
.
Otherwise, a matrix of coefficients, corresponding to the selected type
.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the estimated coefficients of a GMF model
str(coefficients(gmf)) # returns all the coefficients
str(coefficients(gmf, type = "scores")) # returns only the scores, say U
str(coefficients(gmf, type = "loadings")) # returns only the loadings, say V
Fisher scoring algorithm for GLMs
Description
Internal function implementing the Fisher scoring algorithms for the estimation of GLMs. It is used in the AIRWLS algorithm for the estimation of GMF models.
Usage
cpp.airwls.glmfit(
beta,
y,
X,
familyname,
linkname,
varfname,
offset,
weights,
penalty,
nsteps = 100L,
stepsize = 0.1,
print = FALSE
)
Arguments
beta |
initial value of the regression coefficients to be estimated |
y |
response vector |
X |
design matrix |
familyname |
model family name |
linkname |
link function name |
varfname |
variance function name |
offset |
vector of constants to be added to the linear predictor |
weights |
vector of constants non-negative weights |
penalty |
penalty parameter of a ridge-type penalty |
nsteps |
number of iterations |
stepsize |
stepsize parameter of the Fisher scoring algorithm |
print |
if |
Compute one Fisher scoring step for GLMs
Description
Internal function to compute one Fisher scoring step for GLMs. It constitutes the building block of the AIRWLS algorithm for the estimation of GMF models.
Usage
cpp.airwls.glmstep(
beta,
y,
X,
familyname,
linkname,
varfname,
offset,
weights,
penalty
)
Arguments
beta |
current value of the regression coefficients to be updated |
y |
response vector |
X |
design matrix |
familyname |
model family name |
linkname |
link function name |
varfname |
variance function name |
offset |
vector of constants to be added to the linear predictor |
weights |
vector of constants non-negative weights |
penalty |
penalty parameter of a ridge-type penalty |
AIRWLS update for GMF models
Description
Internal function implementing one step of AIRWLS for the estimation of GMF models.
Usage
cpp.airwls.update(
beta,
Y,
X,
familyname,
linkname,
varfname,
idx,
offset,
weights,
penalty,
transp = FALSE,
nsteps = 100L,
stepsize = 0.1,
print = FALSE,
parallel = FALSE,
nthreads = 1L
)
Arguments
beta |
initial value of the regression coefficients to be estimated |
Y |
response vector |
X |
design matrix |
familyname |
model family name |
linkname |
link function name |
varfname |
variance function name |
idx |
index identifying the parameters to be updated in |
offset |
vector of constants to be added to the linear predictor |
weights |
vector of constants non-negative weights |
penalty |
penalty parameter of a ridge-type penalty |
transp |
if |
nsteps |
number of iterations |
stepsize |
stepsize parameter of the Fisher scoring algorithm |
print |
if |
parallel |
if |
nthreads |
number of threads to be run in parallel (only if |
Fit a GMF model using the AIRWLS algorithm
Description
Fit a GMF model using the AIRWLS algorithm
Usage
cpp.fit.airwls(
Y,
X,
B,
A,
Z,
U,
V,
O,
W,
familyname,
linkname,
varfname,
ncomp,
lambda,
maxiter = 500L,
nsteps = 1L,
stepsize = 0.1,
eps = 1e-08,
nafill = 1L,
tol = 1e-05,
damping = 0.001,
verbose = TRUE,
frequency = 10L,
parallel = FALSE,
nthreads = 1L
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
B |
initial row-effect matrix ( |
A |
initial column-effect matrix ( |
Z |
matrix of column fixed effects ( |
U |
initial factor matrix ( |
V |
initial loading matrix ( |
O |
matrix of constant offset ( |
W |
matrix of constant weights ( |
familyname |
a |
linkname |
a |
varfname |
variance function name |
ncomp |
rank of the latent matrix factorization |
lambda |
penalization parameters |
maxiter |
maximum number of iterations |
nsteps |
number of inner Fisher scoring iterations |
stepsize |
stepsize of the inner Fisher scoring algorithm |
eps |
shrinkage factor for extreme predictions |
nafill |
how often the missing values are updated |
tol |
tolerance threshold for the stopping criterion |
damping |
diagonal dumping factor for the Hessian matrix |
verbose |
if |
frequency |
how often the optimization status is printed |
parallel |
if |
nthreads |
number of cores to be used in parallel |
Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling
Description
Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling
Usage
cpp.fit.block.sgd(
Y,
X,
B,
A,
Z,
U,
V,
O,
W,
familyname,
linkname,
varfname,
ncomp,
lambda,
maxiter = 1000L,
eps = 0.01,
nafill = 10L,
tol = 1e-08,
size1 = 100L,
size2 = 100L,
burn = 0.75,
rate0 = 0.01,
decay = 0.01,
damping = 0.001,
rate1 = 0.95,
rate2 = 0.99,
parallel = FALSE,
nthreads = 1L,
verbose = TRUE,
frequency = 250L,
progress = FALSE
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
B |
initial row-effect matrix ( |
A |
initial column-effect matrix ( |
Z |
matrix of column fixed effects ( |
U |
initial factor matrix ( |
V |
initial loading matrix ( |
O |
matrix of constant offset ( |
W |
matrix of constant weights ( |
familyname |
a |
linkname |
a |
varfname |
variance function name |
ncomp |
rank of the latent matrix factorization |
lambda |
penalization parameters |
maxiter |
maximum number of iterations |
eps |
shrinkage factor for extreme predictions |
nafill |
how often the missing values are updated |
tol |
tolerance threshold for the stopping criterion |
size1 |
row-minibatch dimension |
size2 |
column-minibatch dimension |
burn |
burn-in period in which the learning late is not decreased |
rate0 |
initial learning rate |
decay |
decay rate of the learning rate |
damping |
diagonal dumping factor for the Hessian matrix |
rate1 |
decay rate of the first moment estimate of the gradient |
rate2 |
decay rate of the second moment estimate of the gradient |
parallel |
if |
nthreads |
number of cores to be used in parallel |
verbose |
if |
frequency |
how often the optimization status is printed |
progress |
if |
Fit a GMF model using the adaptive SGD with coordinate-wise minibatch subsampling algorithm
Description
Fit a GMF model using the adaptive SGD with coordinate-wise minibatch subsampling algorithm
Usage
cpp.fit.coord.sgd(
Y,
X,
B,
A,
Z,
U,
V,
O,
W,
familyname,
linkname,
varfname,
ncomp,
lambda,
maxiter = 1000L,
eps = 0.01,
nafill = 10L,
tol = 1e-08,
size1 = 100L,
size2 = 100L,
burn = 0.75,
rate0 = 0.01,
decay = 0.01,
damping = 0.001,
rate1 = 0.95,
rate2 = 0.99,
parallel = FALSE,
nthreads = 1L,
verbose = TRUE,
frequency = 250L,
progress = FALSE
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
B |
initial row-effect matrix ( |
A |
initial column-effect matrix ( |
Z |
matrix of column fixed effects ( |
U |
initial factor matrix ( |
V |
initial loading matrix ( |
O |
matrix of constant offset ( |
W |
matrix of constant weights ( |
familyname |
a |
linkname |
a |
varfname |
variance function name |
ncomp |
rank of the latent matrix factorization |
lambda |
penalization parameters |
maxiter |
maximum number of iterations |
eps |
shrinkage factor for extreme predictions |
nafill |
how often the missing values are updated |
tol |
tolerance threshold for the stopping criterion |
size1 |
row-minibatch dimension |
size2 |
column-minibatch dimension |
burn |
burn-in period in which the learning late is not decreased |
rate0 |
initial learning rate |
decay |
decay rate of the learning rate |
damping |
diagonal dumping factor for the Hessian matrix |
rate1 |
decay rate of the first moment estimate of the gradient |
rate2 |
decay rate of the second moment estimate of the gradient |
parallel |
if |
nthreads |
number of cores to be used in parallel |
verbose |
if |
frequency |
how often the optimization status is printed |
progress |
if |
Fit a GMF model using the diagonal quasi-Newton algorithm
Description
Fit a GMF model using the diagonal quasi-Newton algorithm
Usage
cpp.fit.newton(
Y,
X,
B,
A,
Z,
U,
V,
O,
W,
familyname,
linkname,
varfname,
ncomp,
lambda,
maxiter = 500L,
stepsize = 0.1,
eps = 1e-08,
nafill = 1L,
tol = 1e-05,
damping = 0.001,
verbose = TRUE,
frequency = 10L,
parallel = FALSE,
nthreads = 1L
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
B |
initial row-effect matrix ( |
A |
initial column-effect matrix ( |
Z |
matrix of column fixed effects ( |
U |
initial factor matrix ( |
V |
initial loading matrix ( |
O |
matrix of constant offset ( |
W |
matrix of constant weights ( |
familyname |
a |
linkname |
a |
varfname |
variance function name |
ncomp |
rank of the latent matrix factorization |
lambda |
penalization parameters |
maxiter |
maximum number of iterations |
stepsize |
stepsize of the quasi-Newton update |
eps |
shrinkage factor for extreme predictions |
nafill |
how often the missing values are updated |
tol |
tolerance threshold for the stopping criterion |
damping |
diagonal dumping factor for the Hessian matrix |
verbose |
if |
frequency |
how often the optimization status is printed |
parallel |
if |
nthreads |
number of cores to be used in parallel |
Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling
Description
Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling
Usage
cpp.fit.random.block.sgd(
Y,
X,
B,
A,
Z,
U,
V,
O,
W,
familyname,
linkname,
varfname,
ncomp,
lambda,
maxiter = 1000L,
eps = 0.01,
nafill = 10L,
tol = 1e-08,
size1 = 100L,
size2 = 100L,
burn = 0.75,
rate0 = 0.01,
decay = 0.01,
damping = 0.001,
rate1 = 0.95,
rate2 = 0.99,
parallel = FALSE,
nthreads = 1L,
verbose = TRUE,
frequency = 250L,
progress = FALSE
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
B |
initial row-effect matrix ( |
A |
initial column-effect matrix ( |
Z |
matrix of column fixed effects ( |
U |
initial factor matrix ( |
V |
initial loading matrix ( |
O |
matrix of constant offset ( |
W |
matrix of constant weights ( |
familyname |
a |
linkname |
a |
varfname |
variance function name |
ncomp |
rank of the latent matrix factorization |
lambda |
penalization parameters |
maxiter |
maximum number of iterations |
eps |
shrinkage factor for extreme predictions |
nafill |
how often the missing values are updated |
tol |
tolerance threshold for the stopping criterion |
size1 |
row-minibatch dimension |
size2 |
column-minibatch dimension |
burn |
burn-in period in which the learning late is not decreased |
rate0 |
initial learning rate |
decay |
decay rate of the learning rate |
damping |
diagonal dumping factor for the Hessian matrix |
rate1 |
decay rate of the first moment estimate of the gradient |
rate2 |
decay rate of the second moment estimate of the gradient |
parallel |
if |
nthreads |
number of cores to be used in parallel |
verbose |
if |
frequency |
how often the optimization status is printed |
progress |
if |
Compute deviance, AIC and BIC of an initialized GMF model
Description
Compute deviance, AIC and BIC of an initialized GMF object
Usage
## S3 method for class 'initgmf'
deviance(object, ..., normalize = FALSE)
## S3 method for class 'initgmf'
AIC(object, ..., k = 2)
## S3 method for class 'initgmf'
BIC(object, ...)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
normalize |
if |
k |
the penalty parameter to be used for AIC; the default is |
Value
The value of the deviance extracted from a initgmf
object.
See Also
deviance.sgdgmf
, AIC.sgdgmf
and AIC.sgdgmf
.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the GMF deviance, AIC and BIC
deviance(init)
AIC(init)
BIC(init)
Compute deviance, AIC and BIC of a GMF model
Description
Compute deviance, AIC and BIC of a GMF object
Usage
## S3 method for class 'sgdgmf'
deviance(object, ..., normalize = FALSE)
## S3 method for class 'sgdgmf'
AIC(object, ..., k = 2)
## S3 method for class 'sgdgmf'
BIC(object, ...)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
normalize |
if |
k |
the penalty parameter to be used for AIC; the default is |
Value
The value of the deviance extracted from a sgdgmf
object.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the GMF deviance, AIC and BIC
deviance(gmf)
AIC(gmf)
BIC(gmf)
Rank selection via adjust correlation thresholding
Description
Select the number of significant principal components of a matrix via adjust correlation threshold (ACT)
Usage
eigengap.act(covmat, nobs, maxcomp = NULL)
Arguments
covmat |
matrix to be decomposed |
nobs |
number of observations used to compute the covariance matrix |
maxcomp |
maximum number of eigenvalues to compute |
References
Fan, J., Guo, j. and Zheng, S. (2020). Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, 117(538): 852–861
Rank selection via eigenvalue ratio maximization
Description
Select the number of significant principal components of a matrix via the eigenvalue ratio (EVR) maximization method
Usage
eigengap.evr(covmat, maxcomp = 50, thr = 0.95)
Arguments
covmat |
matrix to be decomposed |
maxcomp |
maximum number of eigenvalues to compute |
References
Ahn, S.C., Horenstein, A.R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81, 1203-1227
Rank selection via optimal hard thresholding
Description
Select the number of significant principal components of a matrix via optimal hard thresholding (OHT)
Usage
eigengap.oht(covmat, nobs, maxcomp = NULL)
Arguments
covmat |
matrix to be decomposed |
nobs |
number of observations used to compute the covariance matrix |
maxcomp |
maximum number of eigenvalues to compute |
References
Gavish, M., Donoho, D.L. (2014) The optimal hard thresholding for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8): 5040–5053
Rank selection via the Onatski method
Description
Select the number of significant principal components of a matrix via the Onatski method
Usage
eigengap.onatski(covmat, maxcomp = 50, maxiter = 100)
Arguments
covmat |
matrix to be decomposed |
maxcomp |
maximum number of eigenvalues to compute |
maxiter |
maximum number of iterations |
References
Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. Review of Economics and Statistics, 92(4): 1004-1016
Extract the fitted values of an initialized GMF model
Description
Computes the fitted values of an initialized GMF model.
Usage
## S3 method for class 'initgmf'
fitted(object, ..., type = c("link", "response", "terms"), partial = FALSE)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of fitted values which should be returned |
partial |
if |
Value
If type="terms"
, a list of fitted values containing the fields XB
,
AZ
and UV
. Otherwise, a matrix of fitted values in the link or
response scale, depending on the selected type
.
See Also
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the fitted values of a GMF model
str(fitted(init)) # returns the overall fitted values in link scale
str(fitted(init, type = "response")) # returns the overall fitted values in response scale
str(fitted(init, partial = TRUE)) # returns the partial fitted values in link scale
Extract the fitted values of a GMF models
Description
Computes the fitted values of a GMF model.
Usage
## S3 method for class 'sgdgmf'
fitted(object, ..., type = c("link", "response", "terms"), partial = FALSE)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of fitted values which should be returned |
partial |
if |
Value
If type="terms"
, a list of fitted values containing the fields XB
,
AZ
and UV
. Otherwise, a matrix of fitted values in the link or
response scale, depending on the selected type
.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the fitted values of a GMF model
str(fitted(gmf)) # returns the overall fitted values in link scale
str(fitted(gmf, type = "response")) # returns the overall fitted values in response scale
Heatmap of an initialized GMF model
Description
Plots a heatmap of either the data, the fitted values, or the residual values of a GMF model allowing for different types of transformations and normalizations. Moreover, it also permits to plot the latent score and loading matrices.
Usage
## S3 method for class 'initgmf'
image(
x,
...,
type = c("data", "response", "link", "scores", "loadings", "deviance", "pearson",
"working"),
resid = FALSE,
symmetric = FALSE,
transpose = FALSE,
limits = NULL,
palette = NULL
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of data/predictions/residuals which should be returned |
resid |
if |
symmetric |
if |
transpose |
if |
limits |
the color limits which should be used |
palette |
the color-palette which should be used |
Value
A ggplot object showing the selected heatmap.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the heatmap of a GMF model
image(init, type = "data") # original data
image(init, type = "response") # fitted values in response scale
image(init, type = "scores") # estimated score matrix
image(init, type = "loadings") # estimated loading matrix
image(init, type = "deviance", resid = TRUE) # deviance residual matrix
Heatmap of a GMF model
Description
Plots a heatmap of either the data, the fitted values, or the residual values of a GMF model allowing for different types of transformations and normalizations. Moreover, it also permits to plot the latent score and loading matrices.
Usage
## S3 method for class 'sgdgmf'
image(
x,
...,
type = c("data", "response", "link", "scores", "loadings", "deviance", "pearson",
"working"),
resid = FALSE,
symmetric = FALSE,
transpose = FALSE,
limits = NULL,
palette = NULL
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of data/predictions/residuals which should be returned |
resid |
if |
symmetric |
if |
transpose |
if |
limits |
the color limits which should be used |
palette |
the color-palette which should be used |
Value
A ggplot object showing the selected heatmap.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the heatmap of a GMF model
image(gmf, type = "data") # original data
image(gmf, type = "response") # fitted values in response scale
image(gmf, type = "scores") # estimated score matrix
image(gmf, type = "loadings") # estimated loading matrix
image(gmf, type = "deviance", resid = TRUE) # deviance residual matrix
Fix sign ambiguity of eigen-vectors
Description
Fix sign ambiguity of eigen-vectors by making U positive diagonal
Usage
make.pos.diag(U)
Arguments
U |
target matrix |
Model deviance of a GMF model
Description
Compute the overall deviance averaging the contributions of all data
Usage
matrix.deviance(mu, y, family = gaussian())
Frobenius penalty for the parameters of a GMF model
Description
Compute the Frobenius penalty for all the parameters in the model
Usage
matrix.penalty(U, penalty)
Procrustes distance
Description
Compute the Procrustes distance between two matrices
Usage
norm.procrustes(A, B)
Arguments
A |
target matrix |
B |
matrix to be rotated |
Normalize the matrices U and V
Description
Rotate U and V using either QR or SVD decompositions. The QR methods rotate U and V in such a way to obtain an orthogonal U and a lower triangular V. The SVD method rotate U and V in such a way to obtain an orthogonal U and a scaled orthogonal V.
Usage
normalize.uv(U, V, method = c("qr", "svd"))
Estimate the coefficients of a multivariate linear model
Description
Estimate the coefficients of a multivariate linear model via ordinary least squares.
Usage
ols.fit.coef(Y, X, offset = NULL)
Arguments
Y |
|
X |
|
offset |
|
Check if OpenMP is enabled
Description
Internal function to check if OpenMP is enabled
Usage
omp.check()
Split the data matrix in train and test sets
Description
Returns a list of two matrices train
and test
.
train
corresponds to the input matrix with a fixed persentage of
entries masked by NA values. test
is the complement of train
and contains the values of the input matrix in the cells where train
is NA, while all the other entries are filled by NA values.
Usage
partition(y, p = 0.3)
Arguments
y |
input matrix to be split into train and test sets |
p |
fraction of observations to be used for the test set |
Plot diagnostics for an initialized GMF model
Description
Plots (one of) six diagnostics to graphically analyze the marginal and conditional distribution of the residuals of a GMF model. Currently, the following plots are available: residuals against observation indices, residuals agains fitted values, absolute square-root residuals against fitted values, histogram of the residuals, residual QQ-plot, residual ECDF-plot.
Usage
## S3 method for class 'initgmf'
plot(
x,
...,
type = c("res-idx", "res-fit", "std-fit", "hist", "qq", "ecdf"),
resid = c("deviance", "pearson", "working", "response", "link"),
subsample = FALSE,
sample.size = 500,
partial = FALSE,
normalize = FALSE,
fillna = FALSE
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of plot which should be returned |
resid |
the type of residuals which should be used |
subsample |
if |
sample.size |
the dimension of the sub-sample which should be used |
partial |
if |
normalize |
if |
fillna |
if |
Value
A ggplot object showing the selected diagnostic plot.
See Also
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Plot the residual-based GMF diagnostics
plot(init, type = "res-fit") # Residuals vs fitted values
plot(init, type = "std-fit") # Abs-sqrt-transformed residuals vs fitted values
plot(init, type = "qq") # Residual QQ-plot
plot(init, type = "hist") # Residual histogram
Plot diagnostics for a GMF model
Description
Plots (one of) six diagnostics to graphically analyze the marginal and conditional distribution of the residuals of a GMF model. Currently, the following plots are available: residuals against observation indices, residuals agains fitted values, absolute square-root residuals against fitted values, histogram of the residuals, residual QQ-plot, residual ECDF-plot.
Usage
## S3 method for class 'sgdgmf'
plot(
x,
...,
type = c("res-idx", "res-fit", "std-fit", "hist", "qq", "ecdf"),
resid = c("deviance", "pearson", "working", "response", "link"),
subsample = FALSE,
sample.size = 500,
partial = FALSE,
normalize = FALSE,
fillna = FALSE
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of plot which should be returned |
resid |
the type of residuals which should be used |
subsample |
if |
sample.size |
the dimension of the sub-sample which should be used |
partial |
if |
normalize |
if |
fillna |
if |
Value
A ggplot object showing the selected diagnostic plot.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Plot the residual-based GMF diagnostics
plot(gmf, type = "res-fit") # Residuals vs fitted values
plot(gmf, type = "std-fit") # Abs-sqrt-transformed residuals vs fitted values
plot(gmf, type = "qq") # Residual QQ-plot
plot(gmf, type = "hist") # Residual histogram
Pointwise deviance of a GMF model
Description
Compute the pointwise deviance for all the observations in the sample
Usage
pointwise.deviance(mu, y, family = gaussian())
Predict method for GMF models
Description
Computes the predictions of a GMF model. Out-of-sample predictions for a new
set of responses and covariates are computed via MLE, by keeping fixed the values
of the estimated B
and V
and maximizing the likelihood with respect
to A
and U
.
Usage
## S3 method for class 'sgdgmf'
predict(
object,
...,
newY = NULL,
newX = NULL,
type = c("link", "response", "terms", "coef"),
parallel = FALSE,
nthreads = 1
)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
newY |
optionally, a matrix of new response variable |
newX |
optionally, a matrix of new covariate values |
type |
the type of prediction which should be returned |
parallel |
if |
nthreads |
number of cores to be used in parallel (only if |
Details
If newY
and newX
are omitted, the predictions are based on the data
used for the fit. In that case, the predictions corresponds to the fitted values.
If newY
and newX
are provided, a corresponding set of A
and
U
are estimated via maximum likelihood using the glm.fit
function.
By doing so, B
and V
are kept fixed.
Value
If type="link"
or typr="response"
, a matrix of predictions.
If type="terms"
, a list of matrices containing the fields XB
, AZ
and UV
.
If type="coef"
, a list of matrices containing the field B
, A
, U
and V
.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 120, m = 20, ncomp = 5, family = poisson())
train = sample(1:120, size = 100)
test = setdiff(1:120, train)
Y = data$Y[train, ]
newY = data$Y[test, ]
# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(Y, ncomp = 3, family = poisson())
# Get the fitted values of a GMF model
str(predict(gmf)) # returns the overall fitted values in link scale
str(predict(gmf, type = "response")) # returns the overall fitted values in response scale
str(predict(gmf, type = "terms")) # returns the partial fitted values in link scale
str(predict(gmf, newY = newY)) # returns the predictions for the new set of responses
Print the fundamental characteristics of an initialized GMF
Description
Print some summary information of an initialized GMF model.
Usage
## S3 method for class 'initgmf'
print(x, ...)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
Value
No return value, called only for printing.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Print the GMF object
print(init)
Print the fundamental characteristics of a GMF
Description
Print some summary information of a GMF model.
Usage
## S3 method for class 'sgdgmf'
print(x, ...)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
Value
No return value, called only for printing.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Print the GMF object
print(gmf)
Procrustes rotation of two configurations
Description
Rotates a configuration to maximum similarity with another configuration
Usage
procrustes(X, Y, scale = TRUE, symmetric = FALSE)
Arguments
X |
target matrix |
Y |
matrix to be rotated |
scale |
allow scaling of axes of Y |
symmetric |
if |
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics
Refine the final estimate of a GMF model
Description
Refine the estimated latent scores of a GMF model via IRWLS
Usage
## S3 method for class 'sgdgmf'
refit(
object,
...,
normalize = TRUE,
verbose = FALSE,
parallel = FALSE,
nthreads = 1
)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
normalize |
if |
verbose |
if |
parallel |
if |
nthreads |
number of cores to be used in the |
Value
An sgdgmf
object containing the re-fitted model.
See Also
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model using SGD
gmf_old = sgdgmf.fit(data$Y, ncomp = 3, family = poisson(), method = "sgd")
# Refine the score matrix estimate
gmf_new = refit(gmf_old)
# Get the fitted values in the link and response scales
mu_hat_old = fitted(gmf_old, type = "response")
mu_hat_new = fitted(gmf_new, type = "response")
# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(2,2), mar = c(1,1,3,1))
image(data$Y, axes = FALSE, main = expression(Y))
image(data$mu, axes = FALSE, main = expression(mu))
image(mu_hat_old, axes = FALSE, main = expression(hat(mu)[old]))
image(mu_hat_new, axes = FALSE, main = expression(hat(mu)[new]))
par(oldpar)
Extract the residuals of an initialized GMF model
Description
Extract the residuals of an initialized GMF model and, if required, compute the eigenvalues of the residuals covariance/correlation matrix. Moreover, if required, return the partial residual of the model obtained by excluding the matrix decomposition from the linear predictor.
Usage
## S3 method for class 'initgmf'
residuals(
object,
...,
type = c("deviance", "pearson", "working", "response", "link"),
partial = FALSE,
normalize = FALSE,
fillna = FALSE,
spectrum = FALSE,
ncomp = 50
)
## S3 method for class 'initgmf'
resid(
object,
...,
type = c("deviance", "pearson", "working", "response", "link"),
partial = FALSE,
normalize = FALSE,
fillna = FALSE,
spectrum = FALSE,
ncomp = 50
)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of residuals which should be returned |
partial |
if |
normalize |
if |
fillna |
if |
spectrum |
if |
ncomp |
number of eigenvalues to be calculated (only if |
Value
If spectrum=FALSE
, a matrix containing the selected residuals.
If spectrum=TRUE
, a list containing the residuals (res
), the first ncomp
eigenvalues of the residual covariance matrix, say (lambdas
), the variance explained by the first
ncomp
principal component of the residuals (explained.var
), the variance not
explained by the first ncomp
principal component of the residuals (residual.var
),
the total variance of the residuals (total.var
).
See Also
residuals.sgdgmf
and resid.sgdgmf
for more details on the residual computation.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the deviance residuals of a GMF model
str(residuals(init)) # returns the overall deviance residuals
str(residuals(init, partial = TRUE)) # returns the partial residuals
str(residuals(init, spectrum = TRUE)) # returns the eigenvalues of the residual var-cov matrix
Extract the residuals of a GMF model
Description
Extract the residuals of a GMF model and, if required, compute the eigenvalues of the residuals covariance/correlation matrix. Moreover, if required, return the partial residual of the model obtained by excluding the matrix decomposition from the linear predictor.
Usage
## S3 method for class 'sgdgmf'
residuals(
object,
...,
type = c("deviance", "pearson", "working", "response", "link"),
partial = FALSE,
normalize = FALSE,
fillna = FALSE,
spectrum = FALSE,
ncomp = 50
)
## S3 method for class 'sgdgmf'
resid(
object,
...,
type = c("deviance", "pearson", "working", "response", "link"),
partial = FALSE,
normalize = FALSE,
fillna = FALSE,
spectrum = FALSE,
ncomp = 50
)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
type |
the type of residuals which should be returned |
partial |
if |
normalize |
if |
fillna |
if |
spectrum |
if |
ncomp |
number of eigenvalues to be calculated (only if |
Details
Let g(\mu) = \eta = X B^\top + \Gamma Z^\top + U V^\top
be the linear predictor of a
GMF model. Let R = (r_{ij})
be the correspondent residual matrix.
The following residuals can be considered:
deviance:
r_{ij}^{_D} = \textrm{sign}(y_{ij} - \mu_{ij}) \sqrt{D(y_{ij}, \mu_{ij})}
;Pearson:
r_{ij}^{_P} = (y_{ij} - \mu_{ij}) / \sqrt{\nu(\mu_{ij})}
;working:
r_{ij}^{_W} = (y_{ij} - \mu_{ij}) / \{g'(\mu_{ij}) \,\nu(\mu_{ij})\}
;response:
r_{ij}^{_R} = y_{ij} - \mu_{ij}
;link:
r_{ij}^{_G} = g(y_{ij}) - \eta_{ij}
.
If partial=TRUE
, mu
is computed excluding the latent matrix decomposition
from the linear predictor, so as to obtain the partial residuals.
Let \Sigma
be the empirical variance-covariance matrix of R
, being
\sigma_{ij} = \textrm{Cov}(r_{:i}, r_{:j})
. Then, the latent spectrum of
the model is the collection of eigenvalues of \Sigma
.
Notice that, in case of Gaussian data, the latent spectrum corresponds to the principal component analysis on the regression residuals, whose eigenvalues can be used to infer the amount of variance explained by each principal component. Similarly, we can use the (partial) latent spectrum in non-Gaussian data settings to infer the correct number of principal components to include into the GMF model or to detect some residual dependence structures not already explained by the model.
Value
If spectrum=FALSE
, a matrix containing the selected residuals.
If spectrum=TRUE
, a list containing the residuals (res
), the first ncomp
eigenvalues of the residual covariance matrix, say (lambdas
), the variance explained by the first
ncomp
principal component of the residuals (explained.var
), the variance not
explained by the first ncomp
principal component of the residuals (residual.var
),
the total variance of the residuals (total.var
).
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the deviance residuals of a GMF model
str(residuals(gmf)) # returns the overall deviance residuals
str(residuals(gmf, partial = TRUE)) # returns the partial residuals
str(residuals(gmf, spectrum = TRUE)) # returns the eigenvalues of the residual var-cov matrix
Screeplot for the residuals of an initialized GMF model
Description
Plots the variances of the principal components of the residuals against the number of principal component.
Usage
## S3 method for class 'initgmf'
screeplot(
x,
...,
ncomp = 20,
type = c("deviance", "pearson", "working", "response", "link"),
partial = FALSE,
normalize = FALSE,
cumulative = FALSE,
proportion = FALSE
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
ncomp |
number of components to be plotted |
type |
the type of residuals which should be used |
partial |
if |
normalize |
if |
cumulative |
if |
proportion |
if |
Value
A ggplot object showing the residual screeplot of the model.
See Also
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())
# Get the partial residual spectrum of a GMF model
screeplot(init) # screeplot of the var-cov matrix of the deviance residuals
screeplot(init, partial = TRUE) # screeplot of the partial residuals
screeplot(init, cumulative = TRUE) # cumulative screeplot
screeplot(init, proportion = TRUE) # proportion of explained residual variance
Screeplot for the residuals of a GMF model
Description
Plots the variances of the principal components of the residuals against the number of principal component.
Usage
## S3 method for class 'sgdgmf'
screeplot(
x,
...,
ncomp = 20,
type = c("deviance", "pearson", "working", "response", "link"),
partial = FALSE,
normalize = FALSE,
cumulative = FALSE,
proportion = FALSE
)
Arguments
x |
an object of class |
... |
further arguments passed to or from other methods |
ncomp |
number of components to be plotted |
type |
the type of residuals which should be used |
partial |
if |
normalize |
if |
cumulative |
if |
proportion |
if |
Value
A ggplot object showing the residual screeplot of the model.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Get the partial residual spectrum of a GMF model
screeplot(gmf) # screeplot of the var-cov matrix of the deviance residuals
screeplot(gmf, partial = TRUE) # screeplot of the partial residuals
screeplot(gmf, cumulative = TRUE) # cumulative screeplot
screeplot(gmf, proportion = TRUE) # proportion of explained residual variance
Check and set the control parameters for the AIRWLS algorithm
Description
Check if the input control parameters of the AIRWLS algorithm are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.
Usage
set.control.airwls(
normalize = TRUE,
maxiter = 100,
nstep = 1,
stepsize = 0.1,
eps = 1e-08,
nafill = 1,
tol = 1e-05,
damping = 0.001,
verbose = FALSE,
frequency = 10,
parallel = FALSE,
nthreads = 1
)
Arguments
normalize |
if |
maxiter |
maximum number of iterations |
nstep |
number of IRWLS steps in each inner loop of AIRWLS |
stepsize |
step-size parameter scaling each IRWLS step |
eps |
how much shrinkage has to be introduced on extreme predictions lying outside of the data range |
nafill |
how frequently the |
tol |
tolerance threshold for the stopping criterion |
damping |
regularization parameter which is added to the diagonal of the Hessian to ensure numerical stability |
verbose |
if |
frequency |
how often the optimization status is printed (only if |
parallel |
if |
nthreads |
number of cores to be used in parallel (only if |
Value
A list
of control parameters for the AIRWLS algorithm
Examples
library(sgdGMF)
# Empty call
set.control.airwls()
# Parametrized call
set.control.airwls(maxiter = 100, nstep = 5, stepsize = 0.3)
Check and set the control parameters for the select optimization algorithm
Description
Check if the input control parameters are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.
Usage
set.control.alg(
method = c("airwls", "newton", "sgd"),
sampling = c("block", "coord", "rnd-block"),
control = list()
)
Arguments
method |
optimization method to use |
sampling |
sub-sampling method to use |
control |
list of algorithm-specific control parameters |
Details
It is not necessary to provide a complete list of control parameters, one can
just specify a list containing the parameters he/she needs to change from the
default values. Wrongly specified parameters are ignored or set to default values.
For a detailed description of all the algorithm-specific control parameters,
please refer to
set.control.airwls
(method="airwls"
),
set.control.newton
(method="newton"
),
set.control.block.sgd
(method="sgd"
, sampling="block"
).
set.control.coord.sgd
(method="sgd"
, sampling="coord"
),
Value
A list
of control parameters for the selected estimation algorithm
See Also
set.control.init
, set.control.cv
, sgdgmf.fit
Examples
library(sgdGMF)
# Empty call
set.control.alg()
# Parametrized call
set.control.alg(method = "airwls", control = list(maxiter = 200, stepsize = 0.3))
Check and set the control parameters for the blockwise-SGD algorithm
Description
Check if the input control parameters are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.
Usage
set.control.block.sgd(
normalize = TRUE,
maxiter = 1000,
eps = 1e-08,
nafill = 10,
tol = 1e-08,
size = c(100, 100),
burn = 1,
rate0 = 0.01,
decay = 0.01,
damping = 0.001,
rate1 = 0.1,
rate2 = 0.01,
verbose = FALSE,
frequency = 250,
progress = FALSE
)
Arguments
normalize |
if |
maxiter |
maximum number of iterations |
eps |
how much shrinkage has to be introduced on extreme predictions lying outside of the data range |
nafill |
how frequently the |
tol |
tolerance threshold for the stopping criterion |
size |
mini-batch size, the first value is for row sub-sample, the second value is for column sub-sample |
burn |
percentage of iterations to ignore before performing Polyak averaging |
rate0 |
initial learning rate |
decay |
learning rate decay |
damping |
regularization parameter which is added to the Hessian to ensure numerical stability |
rate1 |
exponential decay rate for the moment estimate of the gradient |
rate2 |
exponential decay rate for the moment estimate of the Hessian |
verbose |
if |
frequency |
how often the optimization status is printed (only if |
progress |
if |
Value
A list
of control parameters for the adaptive SGD algorithm with block-wise sub-sampling
Examples
library(sgdGMF)
# Empty call
set.control.block.sgd()
# Parametrized call
set.control.block.sgd(maxiter = 2000, rate0 = 0.01, decay = 0.01)
Check and set the control parameters for the coordinate-SGD algorithm
Description
Check if the input control parameters are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.
Usage
set.control.coord.sgd(
normalize = TRUE,
maxiter = 1000,
eps = 1e-08,
nafill = 10,
tol = 1e-08,
size = c(100, 100),
burn = 1,
rate0 = 0.01,
decay = 0.01,
damping = 0.001,
rate1 = 0.1,
rate2 = 0.01,
verbose = FALSE,
frequency = 250,
progress = FALSE
)
Arguments
normalize |
if |
maxiter |
maximum number of iterations |
eps |
how much shrinkage has to be introduced on extreme predictions lying outside of the data range |
nafill |
how frequently the |
tol |
tolerance threshold for the stopping criterion |
size |
mini-batch size, the first value is for row sub-sample, the second value is for column sub-sample |
burn |
percentage of iterations to ignore before performing Polyak averaging |
rate0 |
initial learning rate |
decay |
learning rate decay |
damping |
regularization parameter which is added to the Hessian to ensure numerical stability |
rate1 |
exponential decay rate for the moment estimate of the gradient |
rate2 |
exponential decay rate for the moment estimate of the Hessian |
verbose |
if |
frequency |
how often the optimization status is printed (only if |
progress |
if |
Value
A list
of control parameters for the adaptive SGD algorithm with coordinate-wise sub-sampling
Examples
library(sgdGMF)
# Empty call
set.control.coord.sgd()
# Parametrized call
set.control.coord.sgd(maxiter = 2000, rate0 = 0.01, decay = 0.01)
Check and set the cross-validation parameters
Description
Check if the input cross-validation parameters are allowed and set them to default values if they are not. Returns a list of well-defined cross-validation parameters.
Usage
set.control.cv(
criterion = c("dev", "mae", "mse", "aic", "bic"),
refit = TRUE,
nfolds = 5,
proportion = 0.3,
init = c("common", "separate"),
verbose = FALSE,
parallel = FALSE,
nthreads = 1
)
Arguments
criterion |
information criterion to minimize for selecting the matrix rank |
refit |
if |
nfolds |
number of cross-validation folds |
proportion |
proportion of the data to be used as test set in each fold |
init |
initialization approach to use |
verbose |
if |
parallel |
if |
nthreads |
number of cores to use in parallel (only if |
Value
A list
of control parameters for the cross-validation algorithm
See Also
set.control.init
, set.control.alg
, sgdgmf.cv
Examples
library(sgdGMF)
# Empty call
set.control.cv()
# Parametrized call
set.control.cv(criterion = "bic", proportion = 0.2)
Check and set the initialization parameters for a GMF model
Description
Check if the input initialization parameters are allowed and set them to default
values if they are not. Returns a list of well-defined options which specify how
to initialize a GMF model. See sgdgmf.init
for more details upon the methods used for initialisation.
Usage
set.control.init(
method = c("ols", "glm", "random", "values"),
type = c("deviance", "pearson", "working", "link"),
values = list(),
niter = 5,
normalize = TRUE,
verbose = FALSE,
parallel = FALSE,
nthreads = 1
)
Arguments
method |
initialization method (see |
type |
residual type to be decomposed (see |
values |
list of custom initialization parameters fixed by the user |
niter |
number if refinement iterations in the |
normalize |
if |
verbose |
if |
parallel |
if |
nthreads |
number of cores to be used in the |
Value
A list
of control parameters for the initialization
See Also
set.control.alg
, set.control.cv
, sgdgmf.init
Examples
library(sgdGMF)
# Empty call
set.control.init()
# Parametrized call
set.control.init(method = "glm", type = "deviance", niter = 10)
Check and set the control parameters for the Newton algorithm
Description
Check if the input control parameters of the quasi-Newton algorithm are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.
Usage
set.control.newton(
normalize = TRUE,
maxiter = 500,
stepsize = 0.01,
eps = 1e-08,
nafill = 1,
tol = 1e-05,
damping = 0.001,
verbose = FALSE,
frequency = 50,
parallel = FALSE,
nthreads = 1
)
Arguments
normalize |
if |
maxiter |
maximum number of iterations |
stepsize |
step-size parameter scaling each IRWLS step |
eps |
how much shrinkage has to be introduced on extreme predictions lying outside of the data range |
nafill |
how frequently the |
tol |
tolerance threshold for the stopping criterion |
damping |
regularization parameter which is added to the Hessian to ensure numerical stability |
verbose |
if |
frequency |
how often the optimization status is printed (only if |
parallel |
if |
nthreads |
number of cores to be used in parallel (only if |
Value
A list
of control parameters for the quasi-Newton algorithm
Examples
library(sgdGMF)
# Empty call
set.control.newton()
# Parametrized call
set.control.newton(maxiter = 1000, stepsize = 0.01, tol = 1e-04)
Check and set the model family
Description
Check if the model family is allowed and return it eventually with a
different family name for compatibility with the C++
implementation
Usage
set.family(family)
Arguments
family |
a |
Check and set the covariate matrix X
Description
Check if the input covariate matrix X is well-defined and return the same matrix without attributes such as row and column names.
Usage
set.mat.X(X, n, m)
Check and set the response matrix Y
Description
Check if the input response matrix is well-defined and return the same matrix without attributes such as row and column names.
Usage
set.mat.Y(Y)
Check and set the covariate matrix X
Description
Check if the input covariate matrix X is well-defined and return the same matrix without attributes such as row and column names.
Usage
set.mat.Z(Z, n, m)
Check and set the offset matrix
Description
Check if the input offset matrix is well-defined and return the same matrix without attributes such as row and column names.
Usage
set.mat.offset(O, n, m)
Check and set the weighting matrix
Description
Check if the input weighting matrix is well-defined and return the same matrix without attributes such as row and column names.
Usage
set.mat.weights(W, n, m)
Check and set the penalty parameters
Description
Check if the input penalty parameters are allowed and set them to default values if they are not. Returns a list of well-defined penalty parameters.
Usage
set.penalty(B = 0, A = 0, U = 1, V = 0)
Arguments
B |
penalty parameter of |
A |
penalty parameter of |
U |
penalty parameter of |
V |
penalty parameter of |
Model selection via cross-validation for generalized matrix factorization models
Description
K-fold cross-validation for generalized matrix factorization (GMF) models.
Usage
sgdgmf.cv(
Y,
X = NULL,
Z = NULL,
family = gaussian(),
ncomps = seq(from = 1, to = 10, by = 1),
weights = NULL,
offset = NULL,
method = c("airwls", "newton", "sgd"),
sampling = c("block", "coord", "rnd-block"),
penalty = list(),
control.init = list(),
control.alg = list(),
control.cv = list()
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
Z |
matrix of column fixed effects ( |
family |
a |
ncomps |
ranks of the latent matrix factorization used in cross-validation (default 1 to 10) |
weights |
an optional matrix of weights ( |
offset |
an optional matrix of offset values ( |
method |
estimation method to minimize the negative penalized log-likelihood |
sampling |
sub-sampling strategy to use if |
penalty |
list of penalty parameters (see |
control.init |
list of control parameters for the initialization (see |
control.alg |
list of control parameters for the optimization (see |
control.cv |
list of control parameters for the cross-validation (see |
Details
Cross-validation is performed by minimizing the estimated out-of-sample error, which
can be measured in terms of averaged deviance, AIC or BIC calculated on fold-specific
test sets. Within each fold, the test set is defined as a fixed proportion of entries
in the response matrix which are held out from the estimation process.
To this end, the test set entries are hidden by NA
values when training the
model. Then, the predicted, i.e. imputed, values are used to compute the fold-specific
out-of-sample error.
Value
If refit = FALSE
(see set.control.cv
), the function returns a list containing control.init
,
control.alg
, control.cv
and summary.cv
. The latter is a matrix
collecting the cross-validation results for each combination of fold and latent
dimension.
If refit = TRUE
(see set.control.cv
), the function returns an object of class sgdgmf
,
obtained by refitting the model on the whole data matrix using the latent dimension
selected via cross-validation. The returned object also contains the summary.cv
information along with the other standard output of the sgdgmf.fit
function.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Set the data dimensions
n = 100; m = 20; d = 5
# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)
# Set RUN = TRUE to run the example, it may take some time. To speed up
# the computation it is possible to run CV in parallel specifying
# control.cv = list(parallel = TRUE, nthreads = <number_of_workers>)
# as an argument of sgdgmf.cv()
RUN = FALSE
if (RUN) {
# Initialize the GMF parameters assuming 3 latent factors
gmf_pois = sgdgmf.cv(data_pois$Y, ncomp = 1:10, family = poisson())
gmf_bin = sgdgmf.cv(data_bin$Y, ncomp = 3, family = binomial())
gmf_gam = sgdgmf.cv(data_gam$Y, ncomp = 3, family = Gamma(link = "log"))
# Get the fitted values in the link and response scales
mu_hat_pois = fitted(gmf_pois, type = "response")
mu_hat_bin = fitted(gmf_bin, type = "response")
mu_hat_gam = fitted(gmf_gam, type = "response")
# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(1,3), mar = c(1,1,3,1))
image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
image(mu_hat_pois, axes = FALSE, main = expression(hat(mu)[Pois]))
image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
image(mu_hat_bin, axes = FALSE, main = expression(hat(mu)[Bin]))
image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
image(mu_hat_gam, axes = FALSE, main = expression(hat(mu)[Gam]))
par(oldpar)
}
Single step of cross-validation for generalized matrix factorization models
Description
Internal function running a single step of cross-validation for generalized matrix factorization (GMF) models and calculating some goodness-of-fit measures on the train and test sets.
Usage
sgdgmf.cv.step(
train,
test,
X,
Z,
family,
ncomp,
maxcomp,
fold,
nfolds,
weights,
offset,
method,
sampling,
penalty,
control.init,
control.alg,
control.cv
)
Arguments
train |
train-set matrix of responses ( |
test |
test-set matrix of responses ( |
X |
matrix of row fixed effects ( |
Z |
matrix of column fixed effects ( |
family |
a |
ncomp |
ranks of the latent matrix factorization used in cross-validation (default 1 to 10) |
maxcomp |
maximum rank allowed in the cross-validation exploration |
fold |
integer number identifying the current fold |
nfolds |
maximum number of folds in the cross-validation |
weights |
an optional matrix of weights ( |
offset |
an optional matrix of offset values ( |
method |
estimation method to minimize the negative penalized log-likelihood |
sampling |
sub-sampling strategy to use if |
penalty |
list of penalty parameters (see |
control.init |
list of control parameters for the initialization (see |
control.alg |
list of control parameters for the optimization (see |
control.cv |
list of control parameters for the cross-validation (see |
Value
Returns a data.frame
containing the current number of latent factors
in the model (ncomp
), the fold identifier (fold
), the degrees of
freedom, i.e. the number of parameters, of the model (df
), the AIC, BIC
and deviance (respectively, aic
, bic
, dev
)
calculated on the train and test sets.
Factorize a matrix of non-Gaussian observations using GMF
Description
Fit a generalized matrix factorization (GMF) model for non-Gaussian data using either deterministic or stochastic optimization methods. It is an alternative to PCA when the observed data are binary, counts, and positive scores or, more generally, when the conditional distribution of the observations can be appropriately described using a dispersion exponential family or a quasi-likelihood model. Some examples are Gaussian, Gamma, Binomial and Poisson probability laws.
The dependence among the observations and the variables in the sample can be taken into account through appropriate row- and column-specific regression effects. The residual variability is then modeled through a low-rank matrix factorization.
For the estimation, the package implements two deterministic optimization methods, (AIRWLS and Newton) and two stochastic optimization algorithms (adaptive SGD with coordinate-wise and block-wise sub-sampling).
Usage
sgdgmf.fit(
Y,
X = NULL,
Z = NULL,
family = gaussian(),
ncomp = 2,
weights = NULL,
offset = NULL,
method = c("airwls", "newton", "sgd"),
sampling = c("block", "coord", "rnd-block"),
penalty = list(),
control.init = list(),
control.alg = list()
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row fixed effects ( |
Z |
matrix of column fixed effects ( |
family |
a |
ncomp |
rank of the latent matrix factorization (default 2) |
weights |
an optional matrix of weights ( |
offset |
an optional matrix of offset values ( |
method |
estimation method to minimize the negative penalized log-likelihood |
sampling |
sub-sampling strategy to use if |
penalty |
list of penalty parameters (see |
control.init |
list of control parameters for the initialization (see |
control.alg |
list of control parameters for the optimization (see |
Details
Model specification
The model we consider is defined as follows.
Let Y = (y_{ij})
be a matrix of observed data of dimension n \times m
.
We assume for the (i,j)
th observation in the matrix a dispersion exponential family law
(y_{ij} \mid \theta_{ij}) \sim EF(\theta_{ij}, \phi)
, where \theta_{ij}
is the
natural parameter and \phi
is the dispersion parameter.
Recall that the conditional probability density function of y_{ij}
is given by
f (y_{ij}; \psi) = \exp \big[ w_{ij} \{(y_{ij} \theta_{ij} - b(\theta_{ij})\} / \phi - c(y_{ij}, \phi / w_{ij}) \big],
where \psi
is the vector of unknown parameters to be estimated,
b(\cdot)
is a convex twice differentiable log-partition function,
and c(\cdot,\cdot)
is the cumulant function of the family.
The conditional mean of y_{ij}
, say \mu_{ij}
, is then modeled as
g(\mu_{ij}) = \eta_{ij} = x_i^\top \beta_j + \alpha_i^\top z_j + u_i^\top v_j,
where g(\cdot)
is a bijective twice differentiable link function, \eta_{ij}
is
a linear predictor, x_i \in \mathbb{R}^p
and z_j \in \mathbb{R}^q
are
observed covariate vectors, \beta_j \in \mathbb{R}^p
and \alpha_j \in \mathbb{R}^q
are unknown regression parameters and, finally, u_i \in \mathbb{R}^d
and
v_j \in \mathbb{R}^d
are latent vector explaining the residual varibility
not captured by the regression effects.
Equivalently, in matrix form, we have
g(\mu) = \eta = X B^\top + A Z^\top + U V^\top.
The natural parameter \theta_{ij}
is linked to the conditional mean of y_{ij}
through the equation E(y_{ij}) = \mu_{ij} = b'(\theta_{ij})
.
Similarly, the variance of y_{ij}
is given by
\text{Var}(y_{ij}) = (\phi / w_{ij}) \,\nu(\mu_{ij}) = (\phi / w_{ij}) \,b''(\mu_{ij})
,
where \nu(\cdot)
is the so-called variance function of the family.
Finally, we denote by D_\phi(y,\mu)
the deviance function of the family, which
is defined as D_\phi(y,\mu) = - 2 \log\{ f(y, \psi) / f_0 (y) \}
,
where f_0(y)
is the likelihood of the saturated model.
The estimation of the model parameters is performed by minimizing the penalized deviance function
\displaystyle \ell_\lambda (\psi; y) = - \sum_{i = 1}^{n} \sum_{j = 1}^{m} D_\phi(y_{ij}, \mu_{ij}) + \frac{\lambda_{\scriptscriptstyle U}}{2} \| U \|_F^2 + \frac{\lambda_{\scriptscriptstyle V}}{2} \| V \|_F^2,
where \lambda_{\scriptscriptstyle U} > 0
and \lambda_{\scriptscriptstyle V} > 0
are regularization parameters and \|\cdot\|_F
is the Frobenious norm.
Additional \ell_2
penalization terms can be introduced to regularize B
and A
.
Quasi-likelihood models can be considered as well, where D_\phi(y, \mu)
is substituted by
Q_\phi(y, \mu) = - \log (\phi/w) - (w / \phi) \int_y^\mu \{(y - t) / \nu(t) \} \,dt,
under an appropriate specification of mean, variance and link functions.
Identifiability constraints
The GMF model is not identifiable being invariant with respect to rotation, scaling
and sign-flip transformations of U
and V
. To enforce the uniqueness of the
solution, we impose the following identifiability constraints:
-
\text{Cov}(U) = U^\top (I_n - 1_n 1_n^\top / n) U / n = I_d
, -
V
is lower triangular, with positive diagonal entries,
where I_n
and 1_n
are, respectively, the n
-dimensional identity
matrix and unitary vector.
Alternative identifiability constraints on U
and V
can be easily obtained
by post processing. For instance, a PCA-like solution, say U^\top U
is diagonal
and V^\top V = I_d
, can by obtained by applying the truncated SVD decomposition
U V^\top = \tilde{U} \tilde{D} \tilde{V}^\top
, and setting
U = \tilde{U} \tilde{D}
and V = \tilde{V}
.
Estimation algorithms
To obtain the penalized maximum likelihood estimate, we here employs four different algorithms
AIRWLS: alternated iterative re-weighted least squares (
method="airwls"
);Newton: quasi-Newton algorithm with diagonal Hessian (
method="newton"
);C-SGD: adaptive stochastic gradient descent with coordinate-wise sub-sampling (
method="sgd", sampling="coord"
);B-SGD: adaptive stochastic gradient descent with block-wise sub-sampling (
method="sgd", sampling="block"
);RB-SGD: as B-SGD but with an alternative rule to scan randomly the minibatch blocks (
method="sgd", sampling="rnd-block"
).
Likelihood families
Currently, all standard glm
families are supported, including neg.bin
and negative.binomial
families from the MASS
package.
In such a case, the deviance function we consider takes the form
D_\phi(y, \mu) = 2 w \big[ y \log(y / \mu) - (y + \phi) \log\{ (y + \phi) / (\mu + \phi) \} \big]
.
This corresponds to a Negative Binomial model with variance function \nu(\mu) = \mu + \mu^2 / \phi
.
Then, for \phi \rightarrow \infty
, the Negative Binomial likelihood converges
to a Poisson likelihood, having linear variance function, say \nu(\mu) = \mu
.
Notice that the over-dispersion parameter, that here is denoted as \phi
,
in the MASS
package is referred to as \theta
.
If the Negative Binomial family is selected, a global over-dispersion parameter
\phi
is estimated from the data using the method of moments.
Parallelization
Parallel execution is implemented in C++
using OpenMP
. When installing
and compiling the sgdGMF
package, the compiler check whether OpenMP
is installed in the system. If it is not, the package is compiled excluding all
the OpenMP
functionalities and no parallel execution is allowed at C++
level.
Notice that OpenMP
is not compatible with R
parallel computing packages,
such as parallel
and foreach
. Therefore, when parallel=TRUE
,
it is not possible to run the sgdgmf.fit
function within R
level
parallel functions, e.g., foreach
loop.
Value
An sgdgmf
object, namely a list, containing the estimated parameters of the GMF model.
In particular, the returned object collects the following information:
-
method
: the selected estimation method -
family
: the model family -
ncomp
: rank of the latent matrix factorization -
npar
: number of unknown parameters to be estimated -
control.init
: list of control parameters used for the initialization -
control.alg
: list of control parameters used for the optimization -
control.cv
: list of control parameters used for the cross.validation -
Y
: response matrix -
X
: row-specific covariate matrix -
Z
: column-specific covariate matrix -
B
: the estimated col-specific coefficient matrix -
A
: the estimated row-specific coefficient matrix -
U
: the estimated factor matrix -
V
: the estimated loading matrix -
weights
: weighting matrix -
offset
: offset matrix -
eta
: the estimated linear predictor -
mu
: the estimated mean matrix -
var
: the estimated variance matrix -
phi
: the estimated dispersion parameter -
penalty
: the penalty value at the end of the optimization -
deviance
: the deviance value at the end of the optimization -
objective
: the penalized objective function at the end of the optimization -
aic
: Akaike information criterion -
bic
: Bayesian information criterion -
names
: list of row and column names for all the output matrices -
exe.time
: the total execution time in seconds -
trace
: a trace matrix recording the optimization history -
summary.cv
:
References
Kidzinnski, L., Hui, F.K.C., Warton, D.I. and Hastie, J.H. (2022). Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays. Journal of Machine Learning Research, 23: 1-29.
Wang, L. and Carvalho, L. (2023). Deviance matrix factorization. Electronic Journal of Statistics, 17(2): 3762-3810.
Castiglione, C., Segers, A., Clement, L, Risso, D. (2024). Stochastic gradient descent estimation of generalized matrix factorization models with application to single-cell RNA sequencing data. arXiv preprint: arXiv:2412.20509.
See Also
set.control.init
, set.control.alg
,
sgdgmf.init
, sgdgmf.rank
,
refit.sgdgmf
, coef.sgdgmf
, resid.sgdgmf
,
fitted.sgdgmf
, predict.sgdgmf
, plot.sgdgmf
,
screeplot.sgdgmf
, biplot.sgdgmf
, image.sgdgmf
Examples
# Load the sgdGMF package
library(sgdGMF)
# Set the data dimensions
n = 100; m = 20; d = 5
# Generate data using Poisson, Binomial and Gamma models
data = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
# Estimate the GMF parameters assuming 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson(), method = "airwls")
# Get the fitted values in the link and response scales
mu_hat = fitted(gmf, type = "response")
# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(1,3), mar = c(1,1,3,1))
image(data$Y, axes = FALSE, main = expression(Y))
image(data$mu, axes = FALSE, main = expression(mu))
image(mu_hat, axes = FALSE, main = expression(hat(mu)))
par(oldpar)
Initialize the parameters of a generalized matrix factorization model
Description
Provide four initialization methods to set the initial values of
a generalized matrix factorization (GMF) model identified by a glm
family
and a linear predictor of the form g(\mu) = \eta = X B^\top + A Z^\top + U V^\top
,
with bijective link function g(\cdot)
.
See sgdgmf.fit
for more details on the model specification.
Usage
sgdgmf.init(
Y,
X = NULL,
Z = NULL,
ncomp = 2,
family = gaussian(),
weights = NULL,
offset = NULL,
method = c("ols", "glm", "random", "values"),
type = c("deviance", "pearson", "working", "link"),
niter = 0,
values = list(),
verbose = FALSE,
parallel = FALSE,
nthreads = 1,
savedata = TRUE
)
sgdgmf.init.ols(
Y,
X = NULL,
Z = NULL,
ncomp = 2,
family = gaussian(),
weights = NULL,
offset = NULL,
type = c("deviance", "pearson", "working", "link"),
verbose = FALSE
)
sgdgmf.init.glm(
Y,
X = NULL,
Z = NULL,
ncomp = 2,
family = gaussian(),
weights = NULL,
offset = NULL,
type = c("deviance", "pearson", "working", "link"),
verbose = FALSE,
parallel = FALSE,
nthreads = 1
)
sgdgmf.init.random(
Y,
X = NULL,
Z = NULL,
ncomp = 2,
family = gaussian(),
weights = NULL,
offset = NULL,
sigma = 1
)
sgdgmf.init.custom(
Y,
X = NULL,
Z = NULL,
ncomp = 2,
family = gaussian(),
values = list(),
verbose = FALSE
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row-specific fixed effects ( |
Z |
matrix of column-specific fixed effects ( |
ncomp |
rank of the latent matrix factorization |
family |
a model family, as in the |
weights |
matrix of constant weights ( |
offset |
matrix of constant offset ( |
method |
optimization method to be used for the initial fit |
type |
type of residuals to be used for initializing |
niter |
number of iterations to refine the initial estimate (only if |
values |
a list of custom initial values for |
verbose |
if |
parallel |
if |
nthreads |
number of cores to be used in parallel (only if |
savedata |
if |
Details
If method = "ols"
, the initialization is performed fitting a sequence of linear
regressions followed by a residual SVD decomposition.
To account for non-Gaussian distribution of the data, regression and
decomposition are applied on the transformed response matrix Y_h = (g \circ h)(Y)
,
where h(\cdot)
is a function which prevent Y_h
to take infinite values.
For instance, in the Binomial case h(y) = 2 (1-\epsilon) y + \epsilon
,
while in the Poisson case h(y) = y + \epsilon
, where \epsilon
is a small
positive constant, typically 0.1
or 0.01
.
If method = "glm"
, the initialization is performed by fitting a sequence of
generalized linear models followed by a residual SVD decomposition.
In particular, to set \beta_j
, we use independent GLM fit with y_j \sim X \beta_j
.
Similarly, to set \alpha_i
, we fit the model y_i \sim Z \alpha_i + o_i
, with offset o_i = B x_i
.
Then, we obtain U
via SVD on the residuals. Finally, we obtain V
via independent GLM fit
under the model y_j \sim U v_j + o_j
, with offset o_i = X \beta_j + A z_j
.
Both under method = "ols"
and method = "glm"
, it is possible to specify the
parameter type
to change the type of residuals used for the SVD decomposition.
If method = "random"
, the initialization is performed using independent Gaussian
random values for all the parameters in the model.
If method = "values"
, the initialization is performed using user-specified
values provided as an input, which must have compatible dimensions.
Value
An initgmf
object, namely a list, containing the initial estimates of the GMF parameters.
In particular, the returned object collects the following information:
-
Y
: response matrix (only ifsavedata=TRUE
) -
X
: row-specific covariate matrix (only ifsavedata=TRUE
) -
Z
: column-specific covariate matrix (only ifsavedata=TRUE
) -
B
: the estimated col-specific coefficient matrix -
A
: the estimated row-specific coefficient matrix -
U
: the estimated factor matrix -
V
: the estimated loading matrix -
phi
: the estimated dispersion parameter -
method
: the selected estimation method -
family
: the model family -
ncomp
: rank of the latent matrix factorization -
type
: type of residuals used for the initialization ofU
-
verbose
: ifTRUE
, print the status of the initialization process -
parallel
: ifTRUE
, allows for parallel computing -
nthreads
: number of cores to be used in parallel -
savedata
: ifTRUE
, stores a copy of the input data
Examples
library(sgdGMF)
# Set the data dimensions
n = 100; m = 20; d = 5
# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)
# Initialize the GMF parameters assuming 3 latent factors
init_pois = sgdgmf.init(data_pois$Y, ncomp = 3, family = poisson(), method = "ols")
init_bin = sgdgmf.init(data_bin$Y, ncomp = 3, family = binomial(), method = "ols")
init_gam = sgdgmf.init(data_gam$Y, ncomp = 3, family = Gamma(link = "log"), method = "ols")
# Get the fitted values in the link and response scales
mu_hat_pois = fitted(init_pois, type = "response")
mu_hat_bin = fitted(init_bin, type = "response")
mu_hat_gam = fitted(init_gam, type = "response")
# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(3,3), mar = c(1,1,3,1))
image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
image(mu_hat_pois, axes = FALSE, main = expression(hat(mu)[Pois]))
image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
image(mu_hat_bin, axes = FALSE, main = expression(hat(mu)[Bin]))
image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
image(mu_hat_gam, axes = FALSE, main = expression(hat(mu)[Gam]))
par(oldpar)
Rank selection via eigenvalue-gap methods
Description
Select the number of significant principal components of a GMF model via exploitation of eigenvalue-gap methods
Usage
sgdgmf.rank(
Y,
X = NULL,
Z = NULL,
maxcomp = ncol(Y),
family = gaussian(),
weights = NULL,
offset = NULL,
method = c("evr", "onatski", "act", "oht"),
type.reg = c("ols", "glm"),
type.res = c("deviance", "pearson", "working", "link"),
normalize = FALSE,
maxiter = 10,
parallel = FALSE,
nthreads = 1,
return.eta = FALSE,
return.mu = FALSE,
return.res = FALSE,
return.cov = FALSE
)
Arguments
Y |
matrix of responses ( |
X |
matrix of row-specific fixed effects ( |
Z |
matrix of column-specific fixed effects ( |
maxcomp |
maximum number of eigenvalues to compute |
family |
a family as in the |
weights |
matrix of optional weights ( |
offset |
matrix of optional offsets ( |
method |
rank selection method |
type.reg |
regression method to be used to profile out the covariate effects |
type.res |
residual type to be decomposed |
normalize |
if |
maxiter |
maximum number of iterations |
parallel |
if |
nthreads |
number of cores to be used in parallel (only if |
return.eta |
if |
return.mu |
if |
return.res |
if |
return.cov |
if |
Value
A list containing the method
, the selected latent rank ncomp
,
and the eigenvalues used to select the latent rank lambdas
.
Additionally, if required, in the output list will also provide the linear predictor
eta
, the predicted mean matrix mu
, the residual matrix res
, and
the implied residual covariance matrix covmat
.
References
Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. Review of Economics and Statistics, 92(4): 1004-1016
Ahn, S.C., Horenstein, A.R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81, 1203-1227
Gavish, M., Donoho, D.L. (2014) The optimal hard thresholding for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8): 5040–5053
Fan, J., Guo, J. and Zheng, S. (2020). Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, 117(538): 852–861
Wang, L. and Carvalho, L. (2023). Deviance matrix factorization. Electronic Journal of Statistics, 17(2): 3762-3810
Examples
library(sgdGMF)
# Set the data dimensions
n = 100; m = 20; d = 5
# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)
# Initialize the GMF parameters assuming 3 latent factors
ncomp_pois = sgdgmf.rank(data_pois$Y, family = poisson(), normalize = TRUE)
ncomp_bin = sgdgmf.rank(data_bin$Y, family = binomial(), normalize = TRUE)
ncomp_gam = sgdgmf.rank(data_gam$Y, family = Gamma(link = "log"), normalize = TRUE)
# Get the selected number of components
print(paste("Poisson:", ncomp_pois$ncomp))
print(paste("Binomial:", ncomp_bin$ncomp))
print(paste("Gamma:", ncomp_gam$ncomp))
# Plot the screeplot used for the component determination
oldpar = par(no.readonly = TRUE)
par(mfrow = c(3,1))
barplot(ncomp_pois$lambdas, main = "Poisson screeplot")
barplot(ncomp_bin$lambdas, main = "Binomial screeplot")
barplot(ncomp_gam$lambdas, main = "Gamma screeplot")
par(oldpar)
Simulate non-Gaussian data from a GMF model
Description
Simulate synthetic non-Gaussian data from a generalized matrix factorization (GMF) model.
Usage
sim.gmf.data(n = 100, m = 20, ncomp = 5, family = gaussian(), dispersion = 1)
Arguments
n |
number of observations |
m |
number of variables |
ncomp |
rank of the latent matrix factorization |
family |
a |
dispersion |
a positive dispersion parameter |
Details
The loadings, V
, are independently sampled from a standard normal distribution.
The scores, U
, are simulated according to sinusoidal signals evaluated at different
phases, frequencies and amplitudes. These parameters are randomly sampled from independent
uniform distributions.
Value
A list containing the following objects:
-
Y
: simulated response matrix -
U
: simulated factor matrix -
V
: simulated loading matrix -
eta
: linear predictor matrix -
mu
: conditional mean matrix -
phi
: scalar dispersion parameter -
family
: model family -
ncomp
: rank of the latent matrix factorization -
param
: a list containing time, phase, frequency and amplitude vectors used to generateU
Examples
library(sgdGMF)
# Set the data dimensions
n = 100; m = 20; d = 5
# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)
# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(3,3), mar = c(1,1,3,1))
image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
image(data_pois$U, axes = FALSE, main = expression(U[Pois]))
image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
image(data_bin$U, axes = FALSE, main = expression(U[Bin]))
image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
image(data_gam$U, axes = FALSE, main = expression(U[Gam]))
par(oldpar)
Simulate new data
Description
Generic function to simulate new data from a statistical model
Usage
simulate(object, ...)
Arguments
object |
an object from which simulate new data |
... |
additional arguments passed to or from other methods |
Value
An array containing the simulated data.
Simulate method for GMF models
Description
Simulate new data from a fitted generalized matrix factorization models
Usage
## S3 method for class 'sgdgmf'
simulate(object, ..., nsim = 1)
Arguments
object |
an object of class |
... |
further arguments passed to or from other methods |
nsim |
number of samples |
Value
An 3-fold array containing the simulated data.
Examples
# Load the sgdGMF package
library(sgdGMF)
# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())
# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())
# Simulate new data from a GMF model
str(simulate(gmf))
Estimate the coefficients of a vector generalized linear model
Description
Estimate the coefficients of a vector generalized linear model via parallel iterative re-weighted least squares. Computations can be performed in parallel to speed up the execution.
Usage
vglm.fit.coef(
Y,
X,
family = gaussian(),
weights = NULL,
offset = NULL,
parallel = FALSE,
nthreads = 1,
clust = NULL
)
Arguments
Y |
|
X |
|
family |
a |
weights |
|
offset |
|
parallel |
if |
nthreads |
number of cores to be used in parallel (only if |
clust |
registered cluster to be used for distributing the computations (only if |
Compute the whitening matrix from a given covariance matrix
Description
Compute the whitening matrix from a given covariance matrix
Usage
whitening.matrix(
sigma,
method = c("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky")
)
Arguments
sigma |
covariance matrix. |
method |
determines the type of whitening transformation. |
Details
This function is an internal re-implementation of the function whiteningMatrix
in the whitening
package. See the original documentation to get more details.