Type: Package
Title: Estimation of Generalized Matrix Factorization Models via Stochastic Gradient Descent
Version: 1.0.1
Date: 2025-05-17
Description: Efficient framework to estimate high-dimensional generalized matrix factorization models using penalized maximum likelihood under a dispersion exponential family specification. Either deterministic and stochastic methods are implemented for the numerical maximization. In particular, the package implements the stochastic gradient descent algorithm with a block-wise mini-batch strategy to speed up the computations and an efficient adaptive learning rate schedule to stabilize the convergence. All the theoretical details can be found in Castiglione et al. (2024, <doi:10.48550/arXiv.2412.20509>). Other methods considered for the optimization are the alternated iterative re-weighted least squares and the quasi-Newton method with diagonal approximation of the Fisher information matrix discussed in Kidzinski et al. (2022, http://jmlr.org/papers/v23/20-1104.html).
License: MIT + file LICENSE
Imports: Rcpp (≥ 1.0.10), RcppArmadillo, RSpectra, parallel, doParallel, foreach, MASS, SuppDists, methods, generics, reshape2, ggpubr, viridisLite
LinkingTo: Rcpp, RcppArmadillo
Depends: R (≥ 4.0.0), ggplot2
Suggests: testthat (≥ 3.0.0), Rtsne, dplyr, knitr, rmarkdown
Config/testthat/edition: 3
Encoding: UTF-8
URL: https://github.com/CristianCastiglione/sgdGMF
BugReports: https://github.com/CristianCastiglione/sgdGMF/issues
RoxygenNote: 7.2.3
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2025-05-17 12:50:04 UTC; crist
Author: Cristian Castiglione ORCID iD [aut, cre], Davide Risso ORCID iD [ctb], Alexandre Segers ORCID iD [ctb]
Maintainer: Cristian Castiglione <cristian_castiglione@libero.it>
Repository: CRAN
Date/Publication: 2025-05-17 22:00:02 UTC

sgdGMF: Estimation of Generalized Matrix Factorization Models via Stochastic Gradient Descent

Description

Efficient framework to estimate high-dimensional generalized matrix factorization models using penalized maximum likelihood under a dispersion exponential family specification. Either deterministic and stochastic methods are implemented for the numerical maximization. In particular, the package implements the stochastic gradient descent algorithm with a block-wise mini-batch strategy to speed up the computations and an efficient adaptive learning rate schedule to stabilize the convergence. All the theoretical details can be found in Castiglione et al. (2024, doi:10.48550/arXiv.2412.20509). Other methods considered for the optimization are the alternated iterative re-weighted least squares and the quasi-Newton method with diagonal approximation of the Fisher information matrix discussed in Kidzinski et al. (2022, http://jmlr.org/papers/v23/20-1104.html).

Author(s)

Maintainer: Cristian Castiglione cristian_castiglione@libero.it (ORCID)

Other contributors:

See Also

Useful links:


Biplot of an initialized GMF model

Description

Plot the observations on a two-dimensional projection determined by the estimated score matrix

Usage

## S3 method for class 'initgmf'
biplot(
  x,
  ...,
  choices = 1:2,
  arrange = TRUE,
  byrow = FALSE,
  normalize = FALSE,
  labels = NULL,
  palette = NULL
)

Arguments

x

an object of class initgmf

...

further arguments passed to or from other methods

choices

a length 2 vector specifying the components to plot

arrange

if TRUE, return a single plot with two panels

byrow

if TRUE, the panels are arranged row-wise (if arrange=TRUE)

normalize

if TRUE, orthogonalizes the scores using SVD

labels

a vector of labels which should be plotted

palette

the color-palette which should be used

Value

If arrange=TRUE, a single ggplot object with the selected biplots, otherwise, a list of two ggplot objects showing the row and column latent variables.

See Also

biplot.sgdgmf.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the biplot of a GMF model
biplot(init) # 1st vs 2nd principal components
biplot(init, choices = 2:3) #2nd vs 3rd principal components


Biplot of a GMF model

Description

Plot the observations on a two-dimensional projection determined by the estimated score matrix

Usage

## S3 method for class 'sgdgmf'
biplot(
  x,
  ...,
  choices = 1:2,
  arrange = TRUE,
  byrow = FALSE,
  normalize = FALSE,
  labels = NULL,
  palette = NULL,
  titles = c(NULL, NULL)
)

Arguments

x

an object of class sgdgmf

...

further arguments passed to or from other methods

choices

a length 2 vector specifying the components to plot

arrange

if TRUE, return a single plot with two panels

byrow

if TRUE, the panels are arranged row-wise (if arrange=TRUE)

normalize

if TRUE, orthogonalizes the scores using SVD

labels

a vector of labels which should be plotted

palette

the color-palette which should be used

titles

a 2-dimensional string vector containing the plot titles

Value

If arrange=TRUE, a single ggplot object with the selected biplots, otherwise, a list of two ggplot objects showing the row and column latent variables.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the biplot of a GMF model
biplot(gmf)


Extract the coefficient of an initialized GMF model

Description

Return the initialized coefficients of a GMF model, i.e., the row- and column-specific regression effects, the latent scores and loadings.

Usage

## S3 method for class 'initgmf'
coefficients(
  object,
  ...,
  type = c("all", "colreg", "rowreg", "scores", "loadings")
)

## S3 method for class 'initgmf'
coef(object, ..., type = c("all", "colreg", "rowreg", "scores", "loadings"))

Arguments

object

an object of class initgmf

...

further arguments passed to or from other methods

type

the type of coefficients which should be returned

Value

If type="all", a list of coefficients containing the fields B, A, U and V. Otherwise, a matrix of coefficients, corresponding to the selected type.

See Also

coefficients.sgdgmf and coef.sgdgmf.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the estimated coefficients of a GMF model
str(coefficients(init)) # returns all the coefficients
str(coefficients(init, type = "scores")) # returns only the scores, say U
str(coefficients(init, type = "loadings")) # returns only the loadings, say V


Extract the coefficient of a GMF model

Description

Return the estimated coefficients of a GMF model, i.e., the row- and column-specific regression effects, the latent scores and loadings.

Usage

## S3 method for class 'sgdgmf'
coefficients(
  object,
  ...,
  type = c("all", "colreg", "rowreg", "scores", "loadings")
)

## S3 method for class 'sgdgmf'
coef(object, ..., type = c("all", "colreg", "rowreg", "scores", "loadings"))

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

type

the type of coefficients which should be returned

Value

If type="all", a list of coefficients containing the fields B, A, U and V. Otherwise, a matrix of coefficients, corresponding to the selected type.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the estimated coefficients of a GMF model
str(coefficients(gmf)) # returns all the coefficients
str(coefficients(gmf, type = "scores")) # returns only the scores, say U
str(coefficients(gmf, type = "loadings")) # returns only the loadings, say V


Fisher scoring algorithm for GLMs

Description

Internal function implementing the Fisher scoring algorithms for the estimation of GLMs. It is used in the AIRWLS algorithm for the estimation of GMF models.

Usage

cpp.airwls.glmfit(
  beta,
  y,
  X,
  familyname,
  linkname,
  varfname,
  offset,
  weights,
  penalty,
  nsteps = 100L,
  stepsize = 0.1,
  print = FALSE
)

Arguments

beta

initial value of the regression coefficients to be estimated

y

response vector

X

design matrix

familyname

model family name

linkname

link function name

varfname

variance function name

offset

vector of constants to be added to the linear predictor

weights

vector of constants non-negative weights

penalty

penalty parameter of a ridge-type penalty

nsteps

number of iterations

stepsize

stepsize parameter of the Fisher scoring algorithm

print

if TRUE, print the algorithm history


Compute one Fisher scoring step for GLMs

Description

Internal function to compute one Fisher scoring step for GLMs. It constitutes the building block of the AIRWLS algorithm for the estimation of GMF models.

Usage

cpp.airwls.glmstep(
  beta,
  y,
  X,
  familyname,
  linkname,
  varfname,
  offset,
  weights,
  penalty
)

Arguments

beta

current value of the regression coefficients to be updated

y

response vector

X

design matrix

familyname

model family name

linkname

link function name

varfname

variance function name

offset

vector of constants to be added to the linear predictor

weights

vector of constants non-negative weights

penalty

penalty parameter of a ridge-type penalty


AIRWLS update for GMF models

Description

Internal function implementing one step of AIRWLS for the estimation of GMF models.

Usage

cpp.airwls.update(
  beta,
  Y,
  X,
  familyname,
  linkname,
  varfname,
  idx,
  offset,
  weights,
  penalty,
  transp = FALSE,
  nsteps = 100L,
  stepsize = 0.1,
  print = FALSE,
  parallel = FALSE,
  nthreads = 1L
)

Arguments

beta

initial value of the regression coefficients to be estimated

Y

response vector

X

design matrix

familyname

model family name

linkname

link function name

varfname

variance function name

idx

index identifying the parameters to be updated in beta

offset

vector of constants to be added to the linear predictor

weights

vector of constants non-negative weights

penalty

penalty parameter of a ridge-type penalty

transp

if TRUE, transpose the data

nsteps

number of iterations

stepsize

stepsize parameter of the Fisher scoring algorithm

print

if TRUE, print the algorithm history

parallel

if TRUE, run the updates in parallel using openMP

nthreads

number of threads to be run in parallel (only if parallel=TRUE)


Fit a GMF model using the AIRWLS algorithm

Description

Fit a GMF model using the AIRWLS algorithm

Usage

cpp.fit.airwls(
  Y,
  X,
  B,
  A,
  Z,
  U,
  V,
  O,
  W,
  familyname,
  linkname,
  varfname,
  ncomp,
  lambda,
  maxiter = 500L,
  nsteps = 1L,
  stepsize = 0.1,
  eps = 1e-08,
  nafill = 1L,
  tol = 1e-05,
  damping = 0.001,
  verbose = TRUE,
  frequency = 10L,
  parallel = FALSE,
  nthreads = 1L
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

B

initial row-effect matrix (n \times p)

A

initial column-effect matrix (n \times q)

Z

matrix of column fixed effects (m \times q)

U

initial factor matrix (n \times d)

V

initial loading matrix (m \times d)

O

matrix of constant offset (n \times m)

W

matrix of constant weights (n \times m)

familyname

a glm model family name

linkname

a glm link function name

varfname

variance function name

ncomp

rank of the latent matrix factorization

lambda

penalization parameters

maxiter

maximum number of iterations

nsteps

number of inner Fisher scoring iterations

stepsize

stepsize of the inner Fisher scoring algorithm

eps

shrinkage factor for extreme predictions

nafill

how often the missing values are updated

tol

tolerance threshold for the stopping criterion

damping

diagonal dumping factor for the Hessian matrix

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed

parallel

if TRUE, allows for parallel computing

nthreads

number of cores to be used in parallel


Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling

Description

Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling

Usage

cpp.fit.block.sgd(
  Y,
  X,
  B,
  A,
  Z,
  U,
  V,
  O,
  W,
  familyname,
  linkname,
  varfname,
  ncomp,
  lambda,
  maxiter = 1000L,
  eps = 0.01,
  nafill = 10L,
  tol = 1e-08,
  size1 = 100L,
  size2 = 100L,
  burn = 0.75,
  rate0 = 0.01,
  decay = 0.01,
  damping = 0.001,
  rate1 = 0.95,
  rate2 = 0.99,
  parallel = FALSE,
  nthreads = 1L,
  verbose = TRUE,
  frequency = 250L,
  progress = FALSE
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

B

initial row-effect matrix (n \times p)

A

initial column-effect matrix (n \times q)

Z

matrix of column fixed effects (m \times q)

U

initial factor matrix (n \times d)

V

initial loading matrix (m \times d)

O

matrix of constant offset (n \times m)

W

matrix of constant weights (n \times m)

familyname

a glm model family name

linkname

a glm link function name

varfname

variance function name

ncomp

rank of the latent matrix factorization

lambda

penalization parameters

maxiter

maximum number of iterations

eps

shrinkage factor for extreme predictions

nafill

how often the missing values are updated

tol

tolerance threshold for the stopping criterion

size1

row-minibatch dimension

size2

column-minibatch dimension

burn

burn-in period in which the learning late is not decreased

rate0

initial learning rate

decay

decay rate of the learning rate

damping

diagonal dumping factor for the Hessian matrix

rate1

decay rate of the first moment estimate of the gradient

rate2

decay rate of the second moment estimate of the gradient

parallel

if TRUE, allows for parallel computing

nthreads

number of cores to be used in parallel

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed

progress

if TRUE, print an progress bar


Fit a GMF model using the adaptive SGD with coordinate-wise minibatch subsampling algorithm

Description

Fit a GMF model using the adaptive SGD with coordinate-wise minibatch subsampling algorithm

Usage

cpp.fit.coord.sgd(
  Y,
  X,
  B,
  A,
  Z,
  U,
  V,
  O,
  W,
  familyname,
  linkname,
  varfname,
  ncomp,
  lambda,
  maxiter = 1000L,
  eps = 0.01,
  nafill = 10L,
  tol = 1e-08,
  size1 = 100L,
  size2 = 100L,
  burn = 0.75,
  rate0 = 0.01,
  decay = 0.01,
  damping = 0.001,
  rate1 = 0.95,
  rate2 = 0.99,
  parallel = FALSE,
  nthreads = 1L,
  verbose = TRUE,
  frequency = 250L,
  progress = FALSE
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

B

initial row-effect matrix (n \times p)

A

initial column-effect matrix (n \times q)

Z

matrix of column fixed effects (m \times q)

U

initial factor matrix (n \times d)

V

initial loading matrix (m \times d)

O

matrix of constant offset (n \times m)

W

matrix of constant weights (n \times m)

familyname

a glm model family name

linkname

a glm link function name

varfname

variance function name

ncomp

rank of the latent matrix factorization

lambda

penalization parameters

maxiter

maximum number of iterations

eps

shrinkage factor for extreme predictions

nafill

how often the missing values are updated

tol

tolerance threshold for the stopping criterion

size1

row-minibatch dimension

size2

column-minibatch dimension

burn

burn-in period in which the learning late is not decreased

rate0

initial learning rate

decay

decay rate of the learning rate

damping

diagonal dumping factor for the Hessian matrix

rate1

decay rate of the first moment estimate of the gradient

rate2

decay rate of the second moment estimate of the gradient

parallel

if TRUE, allows for parallel computing

nthreads

number of cores to be used in parallel

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed

progress

if TRUE, print an progress bar


Fit a GMF model using the diagonal quasi-Newton algorithm

Description

Fit a GMF model using the diagonal quasi-Newton algorithm

Usage

cpp.fit.newton(
  Y,
  X,
  B,
  A,
  Z,
  U,
  V,
  O,
  W,
  familyname,
  linkname,
  varfname,
  ncomp,
  lambda,
  maxiter = 500L,
  stepsize = 0.1,
  eps = 1e-08,
  nafill = 1L,
  tol = 1e-05,
  damping = 0.001,
  verbose = TRUE,
  frequency = 10L,
  parallel = FALSE,
  nthreads = 1L
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

B

initial row-effect matrix (n \times p)

A

initial column-effect matrix (n \times q)

Z

matrix of column fixed effects (m \times q)

U

initial factor matrix (n \times d)

V

initial loading matrix (m \times d)

O

matrix of constant offset (n \times m)

W

matrix of constant weights (n \times m)

familyname

a glm model family name

linkname

a glm link function name

varfname

variance function name

ncomp

rank of the latent matrix factorization

lambda

penalization parameters

maxiter

maximum number of iterations

stepsize

stepsize of the quasi-Newton update

eps

shrinkage factor for extreme predictions

nafill

how often the missing values are updated

tol

tolerance threshold for the stopping criterion

damping

diagonal dumping factor for the Hessian matrix

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed

parallel

if TRUE, allows for parallel computing

nthreads

number of cores to be used in parallel


Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling

Description

Fit a GMF model using the adaptive SGD with block-wise minibatch subsampling

Usage

cpp.fit.random.block.sgd(
  Y,
  X,
  B,
  A,
  Z,
  U,
  V,
  O,
  W,
  familyname,
  linkname,
  varfname,
  ncomp,
  lambda,
  maxiter = 1000L,
  eps = 0.01,
  nafill = 10L,
  tol = 1e-08,
  size1 = 100L,
  size2 = 100L,
  burn = 0.75,
  rate0 = 0.01,
  decay = 0.01,
  damping = 0.001,
  rate1 = 0.95,
  rate2 = 0.99,
  parallel = FALSE,
  nthreads = 1L,
  verbose = TRUE,
  frequency = 250L,
  progress = FALSE
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

B

initial row-effect matrix (n \times p)

A

initial column-effect matrix (n \times q)

Z

matrix of column fixed effects (m \times q)

U

initial factor matrix (n \times d)

V

initial loading matrix (m \times d)

O

matrix of constant offset (n \times m)

W

matrix of constant weights (n \times m)

familyname

a glm model family name

linkname

a glm link function name

varfname

variance function name

ncomp

rank of the latent matrix factorization

lambda

penalization parameters

maxiter

maximum number of iterations

eps

shrinkage factor for extreme predictions

nafill

how often the missing values are updated

tol

tolerance threshold for the stopping criterion

size1

row-minibatch dimension

size2

column-minibatch dimension

burn

burn-in period in which the learning late is not decreased

rate0

initial learning rate

decay

decay rate of the learning rate

damping

diagonal dumping factor for the Hessian matrix

rate1

decay rate of the first moment estimate of the gradient

rate2

decay rate of the second moment estimate of the gradient

parallel

if TRUE, allows for parallel computing

nthreads

number of cores to be used in parallel

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed

progress

if TRUE, print an progress bar


Compute deviance, AIC and BIC of an initialized GMF model

Description

Compute deviance, AIC and BIC of an initialized GMF object

Usage

## S3 method for class 'initgmf'
deviance(object, ..., normalize = FALSE)

## S3 method for class 'initgmf'
AIC(object, ..., k = 2)

## S3 method for class 'initgmf'
BIC(object, ...)

Arguments

object

an object of class initgmf

...

further arguments passed to or from other methods

normalize

if TRUE, normalize the result using the null-deviance

k

the penalty parameter to be used for AIC; the default is k = 2

Value

The value of the deviance extracted from a initgmf object.

See Also

deviance.sgdgmf, AIC.sgdgmf and AIC.sgdgmf.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the GMF deviance, AIC and BIC
deviance(init)
AIC(init)
BIC(init)


Compute deviance, AIC and BIC of a GMF model

Description

Compute deviance, AIC and BIC of a GMF object

Usage

## S3 method for class 'sgdgmf'
deviance(object, ..., normalize = FALSE)

## S3 method for class 'sgdgmf'
AIC(object, ..., k = 2)

## S3 method for class 'sgdgmf'
BIC(object, ...)

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

normalize

if TRUE, normalize the result using the null-deviance

k

the penalty parameter to be used for AIC; the default is k = 2

Value

The value of the deviance extracted from a sgdgmf object.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the GMF deviance, AIC and BIC
deviance(gmf)
AIC(gmf)
BIC(gmf)


Rank selection via adjust correlation thresholding

Description

Select the number of significant principal components of a matrix via adjust correlation threshold (ACT)

Usage

eigengap.act(covmat, nobs, maxcomp = NULL)

Arguments

covmat

matrix to be decomposed

nobs

number of observations used to compute the covariance matrix

maxcomp

maximum number of eigenvalues to compute

References

Fan, J., Guo, j. and Zheng, S. (2020). Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, 117(538): 852–861


Rank selection via eigenvalue ratio maximization

Description

Select the number of significant principal components of a matrix via the eigenvalue ratio (EVR) maximization method

Usage

eigengap.evr(covmat, maxcomp = 50, thr = 0.95)

Arguments

covmat

matrix to be decomposed

maxcomp

maximum number of eigenvalues to compute

References

Ahn, S.C., Horenstein, A.R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81, 1203-1227


Rank selection via optimal hard thresholding

Description

Select the number of significant principal components of a matrix via optimal hard thresholding (OHT)

Usage

eigengap.oht(covmat, nobs, maxcomp = NULL)

Arguments

covmat

matrix to be decomposed

nobs

number of observations used to compute the covariance matrix

maxcomp

maximum number of eigenvalues to compute

References

Gavish, M., Donoho, D.L. (2014) The optimal hard thresholding for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8): 5040–5053


Rank selection via the Onatski method

Description

Select the number of significant principal components of a matrix via the Onatski method

Usage

eigengap.onatski(covmat, maxcomp = 50, maxiter = 100)

Arguments

covmat

matrix to be decomposed

maxcomp

maximum number of eigenvalues to compute

maxiter

maximum number of iterations

References

Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. Review of Economics and Statistics, 92(4): 1004-1016


Extract the fitted values of an initialized GMF model

Description

Computes the fitted values of an initialized GMF model.

Usage

## S3 method for class 'initgmf'
fitted(object, ..., type = c("link", "response", "terms"), partial = FALSE)

Arguments

object

an object of class initgmf

...

further arguments passed to or from other methods

type

the type of fitted values which should be returned

partial

if TRUE, returns the partial fitted values

Value

If type="terms", a list of fitted values containing the fields XB, AZ and UV. Otherwise, a matrix of fitted values in the link or response scale, depending on the selected type.

See Also

fitted.sgdgmf.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the fitted values of a GMF model
str(fitted(init)) # returns the overall fitted values in link scale
str(fitted(init, type = "response")) # returns the overall fitted values in response scale
str(fitted(init, partial = TRUE)) # returns the partial fitted values in link scale


Extract the fitted values of a GMF models

Description

Computes the fitted values of a GMF model.

Usage

## S3 method for class 'sgdgmf'
fitted(object, ..., type = c("link", "response", "terms"), partial = FALSE)

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

type

the type of fitted values which should be returned

partial

if TRUE, returns the partial fitted values

Value

If type="terms", a list of fitted values containing the fields XB, AZ and UV. Otherwise, a matrix of fitted values in the link or response scale, depending on the selected type.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the fitted values of a GMF model
str(fitted(gmf)) # returns the overall fitted values in link scale
str(fitted(gmf, type = "response")) # returns the overall fitted values in response scale


Heatmap of an initialized GMF model

Description

Plots a heatmap of either the data, the fitted values, or the residual values of a GMF model allowing for different types of transformations and normalizations. Moreover, it also permits to plot the latent score and loading matrices.

Usage

## S3 method for class 'initgmf'
image(
  x,
  ...,
  type = c("data", "response", "link", "scores", "loadings", "deviance", "pearson",
    "working"),
  resid = FALSE,
  symmetric = FALSE,
  transpose = FALSE,
  limits = NULL,
  palette = NULL
)

Arguments

x

an object of class initgmf

...

further arguments passed to or from other methods

type

the type of data/predictions/residuals which should be returned

resid

if TRUE, plots the residual values

symmetric

if TRUE, symmetrizes the color limits

transpose

if TRUE, transposes the matrix before plotting it

limits

the color limits which should be used

palette

the color-palette which should be used

Value

A ggplot object showing the selected heatmap.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the heatmap of a GMF model
image(init, type = "data") # original data
image(init, type = "response") # fitted values in response scale
image(init, type = "scores") # estimated score matrix
image(init, type = "loadings") # estimated loading matrix
image(init, type = "deviance", resid = TRUE) # deviance residual matrix


Heatmap of a GMF model

Description

Plots a heatmap of either the data, the fitted values, or the residual values of a GMF model allowing for different types of transformations and normalizations. Moreover, it also permits to plot the latent score and loading matrices.

Usage

## S3 method for class 'sgdgmf'
image(
  x,
  ...,
  type = c("data", "response", "link", "scores", "loadings", "deviance", "pearson",
    "working"),
  resid = FALSE,
  symmetric = FALSE,
  transpose = FALSE,
  limits = NULL,
  palette = NULL
)

Arguments

x

an object of class sgdgmf

...

further arguments passed to or from other methods

type

the type of data/predictions/residuals which should be returned

resid

if TRUE, plots the residual values

symmetric

if TRUE, symmetrizes the color limits

transpose

if TRUE, transposes the matrix before plotting it

limits

the color limits which should be used

palette

the color-palette which should be used

Value

A ggplot object showing the selected heatmap.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

 # Fit a GMF model
 gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the heatmap of a GMF model
image(gmf, type = "data") # original data
image(gmf, type = "response") # fitted values in response scale
image(gmf, type = "scores") # estimated score matrix
image(gmf, type = "loadings") # estimated loading matrix
image(gmf, type = "deviance", resid = TRUE) # deviance residual matrix


Fix sign ambiguity of eigen-vectors

Description

Fix sign ambiguity of eigen-vectors by making U positive diagonal

Usage

make.pos.diag(U)

Arguments

U

target matrix


Model deviance of a GMF model

Description

Compute the overall deviance averaging the contributions of all data

Usage

matrix.deviance(mu, y, family = gaussian())

Frobenius penalty for the parameters of a GMF model

Description

Compute the Frobenius penalty for all the parameters in the model

Usage

matrix.penalty(U, penalty)

Procrustes distance

Description

Compute the Procrustes distance between two matrices

Usage

norm.procrustes(A, B)

Arguments

A

target matrix

B

matrix to be rotated


Normalize the matrices U and V

Description

Rotate U and V using either QR or SVD decompositions. The QR methods rotate U and V in such a way to obtain an orthogonal U and a lower triangular V. The SVD method rotate U and V in such a way to obtain an orthogonal U and a scaled orthogonal V.

Usage

normalize.uv(U, V, method = c("qr", "svd"))

Estimate the coefficients of a multivariate linear model

Description

Estimate the coefficients of a multivariate linear model via ordinary least squares.

Usage

ols.fit.coef(Y, X, offset = NULL)

Arguments

Y

n \times m matrix of response variables

X

n \times p matrix of covariates

offset

n \times m matrix of offset values


Check if OpenMP is enabled

Description

Internal function to check if OpenMP is enabled

Usage

omp.check()

Split the data matrix in train and test sets

Description

Returns a list of two matrices train and test. train corresponds to the input matrix with a fixed persentage of entries masked by NA values. test is the complement of train and contains the values of the input matrix in the cells where train is NA, while all the other entries are filled by NA values.

Usage

partition(y, p = 0.3)

Arguments

y

input matrix to be split into train and test sets

p

fraction of observations to be used for the test set


Plot diagnostics for an initialized GMF model

Description

Plots (one of) six diagnostics to graphically analyze the marginal and conditional distribution of the residuals of a GMF model. Currently, the following plots are available: residuals against observation indices, residuals agains fitted values, absolute square-root residuals against fitted values, histogram of the residuals, residual QQ-plot, residual ECDF-plot.

Usage

## S3 method for class 'initgmf'
plot(
  x,
  ...,
  type = c("res-idx", "res-fit", "std-fit", "hist", "qq", "ecdf"),
  resid = c("deviance", "pearson", "working", "response", "link"),
  subsample = FALSE,
  sample.size = 500,
  partial = FALSE,
  normalize = FALSE,
  fillna = FALSE
)

Arguments

x

an object of class initgmf

...

further arguments passed to or from other methods

type

the type of plot which should be returned

resid

the type of residuals which should be used

subsample

if TRUE, computes the residuals over o small fraction of the data

sample.size

the dimension of the sub-sample which should be used

partial

if TRUE, computes the partial residuals

normalize

if TRUE, standardizes the residuals column-by-column

fillna

if TRUE, fills the NA values with 0

Value

A ggplot object showing the selected diagnostic plot.

See Also

plot.sgdgmf.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Plot the residual-based GMF diagnostics
plot(init, type = "res-fit") # Residuals vs fitted values
plot(init, type = "std-fit") # Abs-sqrt-transformed residuals vs fitted values
plot(init, type = "qq") # Residual QQ-plot
plot(init, type = "hist") # Residual histogram


Plot diagnostics for a GMF model

Description

Plots (one of) six diagnostics to graphically analyze the marginal and conditional distribution of the residuals of a GMF model. Currently, the following plots are available: residuals against observation indices, residuals agains fitted values, absolute square-root residuals against fitted values, histogram of the residuals, residual QQ-plot, residual ECDF-plot.

Usage

## S3 method for class 'sgdgmf'
plot(
  x,
  ...,
  type = c("res-idx", "res-fit", "std-fit", "hist", "qq", "ecdf"),
  resid = c("deviance", "pearson", "working", "response", "link"),
  subsample = FALSE,
  sample.size = 500,
  partial = FALSE,
  normalize = FALSE,
  fillna = FALSE
)

Arguments

x

an object of class sgdgmf

...

further arguments passed to or from other methods

type

the type of plot which should be returned

resid

the type of residuals which should be used

subsample

if TRUE, computes the residuals over o small fraction of the data

sample.size

the dimension of the sub-sample which should be used

partial

if TRUE, computes the partial residuals

normalize

if TRUE, standardizes the residuals column-by-column

fillna

if TRUE, fills the NA values with 0

Value

A ggplot object showing the selected diagnostic plot.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Plot the residual-based GMF diagnostics
plot(gmf, type = "res-fit") # Residuals vs fitted values
plot(gmf, type = "std-fit") # Abs-sqrt-transformed residuals vs fitted values
plot(gmf, type = "qq") # Residual QQ-plot
plot(gmf, type = "hist") # Residual histogram


Pointwise deviance of a GMF model

Description

Compute the pointwise deviance for all the observations in the sample

Usage

pointwise.deviance(mu, y, family = gaussian())

Predict method for GMF models

Description

Computes the predictions of a GMF model. Out-of-sample predictions for a new set of responses and covariates are computed via MLE, by keeping fixed the values of the estimated B and V and maximizing the likelihood with respect to A and U.

Usage

## S3 method for class 'sgdgmf'
predict(
  object,
  ...,
  newY = NULL,
  newX = NULL,
  type = c("link", "response", "terms", "coef"),
  parallel = FALSE,
  nthreads = 1
)

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

newY

optionally, a matrix of new response variable

newX

optionally, a matrix of new covariate values

type

the type of prediction which should be returned

parallel

if TRUE, allows for parallel computing using the package foreach

nthreads

number of cores to be used in parallel (only if parallel=TRUE)

Details

If newY and newX are omitted, the predictions are based on the data used for the fit. In that case, the predictions corresponds to the fitted values. If newY and newX are provided, a corresponding set of A and U are estimated via maximum likelihood using the glm.fit function. By doing so, B and V are kept fixed.

Value

If type="link" or typr="response", a matrix of predictions. If type="terms", a list of matrices containing the fields XB, AZ and UV. If type="coef", a list of matrices containing the field B, A, U and V.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 120, m = 20, ncomp = 5, family = poisson())
train = sample(1:120, size = 100)
test = setdiff(1:120, train)

Y = data$Y[train, ]
newY = data$Y[test, ]

# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(Y, ncomp = 3, family = poisson())

# Get the fitted values of a GMF model
str(predict(gmf)) # returns the overall fitted values in link scale
str(predict(gmf, type = "response")) # returns the overall fitted values in response scale
str(predict(gmf, type = "terms")) # returns the partial fitted values in link scale
str(predict(gmf, newY = newY)) # returns the predictions for the new set of responses


Print the fundamental characteristics of an initialized GMF

Description

Print some summary information of an initialized GMF model.

Usage

## S3 method for class 'initgmf'
print(x, ...)

Arguments

x

an object of class initgmf

...

further arguments passed to or from other methods

Value

No return value, called only for printing.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Print the GMF object
print(init)


Print the fundamental characteristics of a GMF

Description

Print some summary information of a GMF model.

Usage

## S3 method for class 'sgdgmf'
print(x, ...)

Arguments

x

an object of class sgdgmf

...

further arguments passed to or from other methods

Value

No return value, called only for printing.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Print the GMF object
print(gmf)


Procrustes rotation of two configurations

Description

Rotates a configuration to maximum similarity with another configuration

Usage

procrustes(X, Y, scale = TRUE, symmetric = FALSE)

Arguments

X

target matrix

Y

matrix to be rotated

scale

allow scaling of axes of Y

symmetric

if TRUE, use symmetric Procrustes statistic


Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

generics

refit


Refine the final estimate of a GMF model

Description

Refine the estimated latent scores of a GMF model via IRWLS

Usage

## S3 method for class 'sgdgmf'
refit(
  object,
  ...,
  normalize = TRUE,
  verbose = FALSE,
  parallel = FALSE,
  nthreads = 1
)

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

normalize

if TRUE, normalize U and V to uncorrelated Gaussian U and upper triangular V with positive diagonal

verbose

if TRUE, print the optimization status

parallel

if TRUE, use parallel computing using the foreach package

nthreads

number of cores to be used in the "glm" method

Value

An sgdgmf object containing the re-fitted model.

See Also

sgdgmf.fit

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model using SGD
gmf_old = sgdgmf.fit(data$Y, ncomp = 3, family = poisson(), method = "sgd")

# Refine the score matrix estimate
gmf_new = refit(gmf_old)

# Get the fitted values in the link and response scales
mu_hat_old = fitted(gmf_old, type = "response")
mu_hat_new = fitted(gmf_new, type = "response")

# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(2,2), mar = c(1,1,3,1))
image(data$Y, axes = FALSE, main = expression(Y))
image(data$mu, axes = FALSE, main = expression(mu))
image(mu_hat_old, axes = FALSE, main = expression(hat(mu)[old]))
image(mu_hat_new, axes = FALSE, main = expression(hat(mu)[new]))
par(oldpar)


Extract the residuals of an initialized GMF model

Description

Extract the residuals of an initialized GMF model and, if required, compute the eigenvalues of the residuals covariance/correlation matrix. Moreover, if required, return the partial residual of the model obtained by excluding the matrix decomposition from the linear predictor.

Usage

## S3 method for class 'initgmf'
residuals(
  object,
  ...,
  type = c("deviance", "pearson", "working", "response", "link"),
  partial = FALSE,
  normalize = FALSE,
  fillna = FALSE,
  spectrum = FALSE,
  ncomp = 50
)

## S3 method for class 'initgmf'
resid(
  object,
  ...,
  type = c("deviance", "pearson", "working", "response", "link"),
  partial = FALSE,
  normalize = FALSE,
  fillna = FALSE,
  spectrum = FALSE,
  ncomp = 50
)

Arguments

object

an object of class initgmf

...

further arguments passed to or from other methods

type

the type of residuals which should be returned

partial

if TRUE, computes the residuals excluding the matrix factorization from the linear predictor

normalize

if TRUE, standardize the residuals column-by-column

fillna

if TRUE, fills NA values column-by-column

spectrum

if TRUE, returns the eigenvalues of the residual covariance matrix

ncomp

number of eigenvalues to be calculated (only if spectrum=TRUE)

Value

If spectrum=FALSE, a matrix containing the selected residuals. If spectrum=TRUE, a list containing the residuals (res), the first ncomp eigenvalues of the residual covariance matrix, say (lambdas), the variance explained by the first ncomp principal component of the residuals (explained.var), the variance not explained by the first ncomp principal component of the residuals (residual.var), the total variance of the residuals (total.var).

See Also

residuals.sgdgmf and resid.sgdgmf for more details on the residual computation.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the deviance residuals of a GMF model
str(residuals(init)) # returns the overall deviance residuals
str(residuals(init, partial = TRUE)) # returns the partial residuals
str(residuals(init, spectrum = TRUE)) # returns the eigenvalues of the residual var-cov matrix


Extract the residuals of a GMF model

Description

Extract the residuals of a GMF model and, if required, compute the eigenvalues of the residuals covariance/correlation matrix. Moreover, if required, return the partial residual of the model obtained by excluding the matrix decomposition from the linear predictor.

Usage

## S3 method for class 'sgdgmf'
residuals(
  object,
  ...,
  type = c("deviance", "pearson", "working", "response", "link"),
  partial = FALSE,
  normalize = FALSE,
  fillna = FALSE,
  spectrum = FALSE,
  ncomp = 50
)

## S3 method for class 'sgdgmf'
resid(
  object,
  ...,
  type = c("deviance", "pearson", "working", "response", "link"),
  partial = FALSE,
  normalize = FALSE,
  fillna = FALSE,
  spectrum = FALSE,
  ncomp = 50
)

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

type

the type of residuals which should be returned

partial

if TRUE, computes the residuals excluding the matrix factorization from the linear predictor

normalize

if TRUE, standardize the residuals column-by-column

fillna

if TRUE, fills NA values column-by-column

spectrum

if TRUE, returns the eigenvalues of the residual covariance matrix

ncomp

number of eigenvalues to be calculated (only if spectrum=TRUE)

Details

Let g(\mu) = \eta = X B^\top + \Gamma Z^\top + U V^\top be the linear predictor of a GMF model. Let R = (r_{ij}) be the correspondent residual matrix. The following residuals can be considered:

If partial=TRUE, mu is computed excluding the latent matrix decomposition from the linear predictor, so as to obtain the partial residuals.

Let \Sigma be the empirical variance-covariance matrix of R, being \sigma_{ij} = \textrm{Cov}(r_{:i}, r_{:j}). Then, the latent spectrum of the model is the collection of eigenvalues of \Sigma.

Notice that, in case of Gaussian data, the latent spectrum corresponds to the principal component analysis on the regression residuals, whose eigenvalues can be used to infer the amount of variance explained by each principal component. Similarly, we can use the (partial) latent spectrum in non-Gaussian data settings to infer the correct number of principal components to include into the GMF model or to detect some residual dependence structures not already explained by the model.

Value

If spectrum=FALSE, a matrix containing the selected residuals. If spectrum=TRUE, a list containing the residuals (res), the first ncomp eigenvalues of the residual covariance matrix, say (lambdas), the variance explained by the first ncomp principal component of the residuals (explained.var), the variance not explained by the first ncomp principal component of the residuals (residual.var), the total variance of the residuals (total.var).

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model with 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the deviance residuals of a GMF model
str(residuals(gmf)) # returns the overall deviance residuals
str(residuals(gmf, partial = TRUE)) # returns the partial residuals
str(residuals(gmf, spectrum = TRUE)) # returns the eigenvalues of the residual var-cov matrix


Screeplot for the residuals of an initialized GMF model

Description

Plots the variances of the principal components of the residuals against the number of principal component.

Usage

## S3 method for class 'initgmf'
screeplot(
  x,
  ...,
  ncomp = 20,
  type = c("deviance", "pearson", "working", "response", "link"),
  partial = FALSE,
  normalize = FALSE,
  cumulative = FALSE,
  proportion = FALSE
)

Arguments

x

an object of class sgdgmf

...

further arguments passed to or from other methods

ncomp

number of components to be plotted

type

the type of residuals which should be used

partial

if TRUE, plots the eigenvalues of the partial residuals

normalize

if TRUE, plots the eigenvalues of the standardized residuals

cumulative

if TRUE, plots the cumulative sum of the eigenvalues

proportion

if TRUE, plots the fractions of explained variance

Value

A ggplot object showing the residual screeplot of the model.

See Also

screeplot.sgdgmf.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
init = sgdgmf.init(data$Y, ncomp = 3, family = poisson())

# Get the partial residual spectrum of a GMF model
screeplot(init) # screeplot of the var-cov matrix of the deviance residuals
screeplot(init, partial = TRUE) # screeplot of the partial residuals
screeplot(init, cumulative = TRUE) # cumulative screeplot
screeplot(init, proportion = TRUE) # proportion of explained residual variance


Screeplot for the residuals of a GMF model

Description

Plots the variances of the principal components of the residuals against the number of principal component.

Usage

## S3 method for class 'sgdgmf'
screeplot(
  x,
  ...,
  ncomp = 20,
  type = c("deviance", "pearson", "working", "response", "link"),
  partial = FALSE,
  normalize = FALSE,
  cumulative = FALSE,
  proportion = FALSE
)

Arguments

x

an object of class sgdgmf

...

further arguments passed to or from other methods

ncomp

number of components to be plotted

type

the type of residuals which should be used

partial

if TRUE, plots the eigenvalues of the partial residuals

normalize

if TRUE, plots the eigenvalues of the standardized residuals

cumulative

if TRUE, plots the cumulative sum of the eigenvalues

proportion

if TRUE, plots the fractions of explained variance

Value

A ggplot object showing the residual screeplot of the model.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Get the partial residual spectrum of a GMF model
screeplot(gmf) # screeplot of the var-cov matrix of the deviance residuals
screeplot(gmf, partial = TRUE) # screeplot of the partial residuals
screeplot(gmf, cumulative = TRUE) # cumulative screeplot
screeplot(gmf, proportion = TRUE) # proportion of explained residual variance


Check and set the control parameters for the AIRWLS algorithm

Description

Check if the input control parameters of the AIRWLS algorithm are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.

Usage

set.control.airwls(
  normalize = TRUE,
  maxiter = 100,
  nstep = 1,
  stepsize = 0.1,
  eps = 1e-08,
  nafill = 1,
  tol = 1e-05,
  damping = 0.001,
  verbose = FALSE,
  frequency = 10,
  parallel = FALSE,
  nthreads = 1
)

Arguments

normalize

if TRUE, normalize U and V to uncorrelated Gaussian U and upper triangular V with positive diagonal

maxiter

maximum number of iterations

nstep

number of IRWLS steps in each inner loop of AIRWLS

stepsize

step-size parameter scaling each IRWLS step

eps

how much shrinkage has to be introduced on extreme predictions lying outside of the data range

nafill

how frequently the NA values are filled, by default NA values are filled at each iteration of the algorithm

tol

tolerance threshold for the stopping criterion

damping

regularization parameter which is added to the diagonal of the Hessian to ensure numerical stability

verbose

if TRUE, print the optimization status (default TRUE)

frequency

how often the optimization status is printed (only if verbose=TRUE)

parallel

if TRUE, allows for parallel computing using the C++ library OpenMP

nthreads

number of cores to be used in parallel (only if parallel=TRUE)

Value

A list of control parameters for the AIRWLS algorithm

Examples

library(sgdGMF)

# Empty call
set.control.airwls()

# Parametrized call
set.control.airwls(maxiter = 100, nstep = 5, stepsize = 0.3)



Check and set the control parameters for the select optimization algorithm

Description

Check if the input control parameters are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.

Usage

set.control.alg(
  method = c("airwls", "newton", "sgd"),
  sampling = c("block", "coord", "rnd-block"),
  control = list()
)

Arguments

method

optimization method to use

sampling

sub-sampling method to use

control

list of algorithm-specific control parameters

Details

It is not necessary to provide a complete list of control parameters, one can just specify a list containing the parameters he/she needs to change from the default values. Wrongly specified parameters are ignored or set to default values. For a detailed description of all the algorithm-specific control parameters, please refer to set.control.airwls (method="airwls"), set.control.newton (method="newton"), set.control.block.sgd (method="sgd", sampling="block"). set.control.coord.sgd (method="sgd", sampling="coord"),

Value

A list of control parameters for the selected estimation algorithm

See Also

set.control.init, set.control.cv, sgdgmf.fit

Examples

library(sgdGMF)

# Empty call
set.control.alg()

# Parametrized call
set.control.alg(method = "airwls", control = list(maxiter = 200, stepsize = 0.3))



Check and set the control parameters for the blockwise-SGD algorithm

Description

Check if the input control parameters are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.

Usage

set.control.block.sgd(
  normalize = TRUE,
  maxiter = 1000,
  eps = 1e-08,
  nafill = 10,
  tol = 1e-08,
  size = c(100, 100),
  burn = 1,
  rate0 = 0.01,
  decay = 0.01,
  damping = 0.001,
  rate1 = 0.1,
  rate2 = 0.01,
  verbose = FALSE,
  frequency = 250,
  progress = FALSE
)

Arguments

normalize

if TRUE, normalize U and V to uncorrelated Gaussian U and upper triangular V with positive diagonal

maxiter

maximum number of iterations

eps

how much shrinkage has to be introduced on extreme predictions lying outside of the data range

nafill

how frequently the NA values are filled, by default NA values are filled at each iteration of the algorithm

tol

tolerance threshold for the stopping criterion

size

mini-batch size, the first value is for row sub-sample, the second value is for column sub-sample

burn

percentage of iterations to ignore before performing Polyak averaging

rate0

initial learning rate

decay

learning rate decay

damping

regularization parameter which is added to the Hessian to ensure numerical stability

rate1

exponential decay rate for the moment estimate of the gradient

rate2

exponential decay rate for the moment estimate of the Hessian

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed (only if verbose=TRUE)

progress

if TRUE, print a compact progress-bar instead of a full-report of the optimization status (only if verbose=TRUE)

Value

A list of control parameters for the adaptive SGD algorithm with block-wise sub-sampling

Examples

library(sgdGMF)

# Empty call
set.control.block.sgd()

# Parametrized call
set.control.block.sgd(maxiter = 2000, rate0 = 0.01, decay = 0.01)



Check and set the control parameters for the coordinate-SGD algorithm

Description

Check if the input control parameters are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.

Usage

set.control.coord.sgd(
  normalize = TRUE,
  maxiter = 1000,
  eps = 1e-08,
  nafill = 10,
  tol = 1e-08,
  size = c(100, 100),
  burn = 1,
  rate0 = 0.01,
  decay = 0.01,
  damping = 0.001,
  rate1 = 0.1,
  rate2 = 0.01,
  verbose = FALSE,
  frequency = 250,
  progress = FALSE
)

Arguments

normalize

if TRUE, normalize U and V to uncorrelated Gaussian U and upper triangular V with positive diagonal

maxiter

maximum number of iterations

eps

how much shrinkage has to be introduced on extreme predictions lying outside of the data range

nafill

how frequently the NA values are filled, by default NA values are filled at each iteration of the algorithm

tol

tolerance threshold for the stopping criterion

size

mini-batch size, the first value is for row sub-sample, the second value is for column sub-sample

burn

percentage of iterations to ignore before performing Polyak averaging

rate0

initial learning rate

decay

learning rate decay

damping

regularization parameter which is added to the Hessian to ensure numerical stability

rate1

exponential decay rate for the moment estimate of the gradient

rate2

exponential decay rate for the moment estimate of the Hessian

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed (only if verbose=TRUE)

progress

if TRUE, print a compact progress-bar instead of a full-report of the optimization status (only if verbose=TRUE)

Value

A list of control parameters for the adaptive SGD algorithm with coordinate-wise sub-sampling

Examples

library(sgdGMF)

# Empty call
set.control.coord.sgd()

# Parametrized call
set.control.coord.sgd(maxiter = 2000, rate0 = 0.01, decay = 0.01)


Check and set the cross-validation parameters

Description

Check if the input cross-validation parameters are allowed and set them to default values if they are not. Returns a list of well-defined cross-validation parameters.

Usage

set.control.cv(
  criterion = c("dev", "mae", "mse", "aic", "bic"),
  refit = TRUE,
  nfolds = 5,
  proportion = 0.3,
  init = c("common", "separate"),
  verbose = FALSE,
  parallel = FALSE,
  nthreads = 1
)

Arguments

criterion

information criterion to minimize for selecting the matrix rank

refit

if TRUE, refit the model with the selected rank and return the fitted model

nfolds

number of cross-validation folds

proportion

proportion of the data to be used as test set in each fold

init

initialization approach to use

verbose

if TRUE, print the cross-validation status

parallel

if TRUE, allows for parallel computing

nthreads

number of cores to use in parallel (only if parallel=TRUE)

Value

A list of control parameters for the cross-validation algorithm

See Also

set.control.init, set.control.alg, sgdgmf.cv

Examples

library(sgdGMF)

# Empty call
set.control.cv()

# Parametrized call
set.control.cv(criterion = "bic", proportion = 0.2)


Check and set the initialization parameters for a GMF model

Description

Check if the input initialization parameters are allowed and set them to default values if they are not. Returns a list of well-defined options which specify how to initialize a GMF model. See sgdgmf.init for more details upon the methods used for initialisation.

Usage

set.control.init(
  method = c("ols", "glm", "random", "values"),
  type = c("deviance", "pearson", "working", "link"),
  values = list(),
  niter = 5,
  normalize = TRUE,
  verbose = FALSE,
  parallel = FALSE,
  nthreads = 1
)

Arguments

method

initialization method (see sgdgmf.init for more details upon the initialization methods used)

type

residual type to be decomposed (see sgdgmf.init for more details upon the residuals used)

values

list of custom initialization parameters fixed by the user

niter

number if refinement iterations in the "svd" method

normalize

if TRUE, normalize U and V to orthogonal U and lower triangular V

verbose

if TRUE, print the initialization state

parallel

if TRUE, use parallel computing for the "glm" method

nthreads

number of cores to be used in the "glm" method

Value

A list of control parameters for the initialization

See Also

set.control.alg, set.control.cv, sgdgmf.init

Examples

library(sgdGMF)

# Empty call
set.control.init()

# Parametrized call
set.control.init(method = "glm", type = "deviance", niter = 10)


Check and set the control parameters for the Newton algorithm

Description

Check if the input control parameters of the quasi-Newton algorithm are allowed and set them to default values if they are not. Returns a list of well-defined control parameters.

Usage

set.control.newton(
  normalize = TRUE,
  maxiter = 500,
  stepsize = 0.01,
  eps = 1e-08,
  nafill = 1,
  tol = 1e-05,
  damping = 0.001,
  verbose = FALSE,
  frequency = 50,
  parallel = FALSE,
  nthreads = 1
)

Arguments

normalize

if TRUE, normalize U and V to uncorrelated Gaussian U and upper triangular V with positive diagonal

maxiter

maximum number of iterations

stepsize

step-size parameter scaling each IRWLS step

eps

how much shrinkage has to be introduced on extreme predictions lying outside of the data range

nafill

how frequently the NA values are filled, by default NA values are filled at each iteration of the algorithm

tol

tolerance threshold for the stopping criterion

damping

regularization parameter which is added to the Hessian to ensure numerical stability

verbose

if TRUE, print the optimization status

frequency

how often the optimization status is printed (only if verbose=TRUE

parallel

if TRUE, allows for parallel computing using the C++ library OpenMP

nthreads

number of cores to be used in parallel (only if parallel=TTUE)

Value

A list of control parameters for the quasi-Newton algorithm

Examples

library(sgdGMF)

# Empty call
set.control.newton()

# Parametrized call
set.control.newton(maxiter = 1000, stepsize = 0.01, tol = 1e-04)


Check and set the model family

Description

Check if the model family is allowed and return it eventually with a different family name for compatibility with the C++ implementation

Usage

set.family(family)

Arguments

family

a glm family (see family for more details)


Check and set the covariate matrix X

Description

Check if the input covariate matrix X is well-defined and return the same matrix without attributes such as row and column names.

Usage

set.mat.X(X, n, m)

Check and set the response matrix Y

Description

Check if the input response matrix is well-defined and return the same matrix without attributes such as row and column names.

Usage

set.mat.Y(Y)

Check and set the covariate matrix X

Description

Check if the input covariate matrix X is well-defined and return the same matrix without attributes such as row and column names.

Usage

set.mat.Z(Z, n, m)

Check and set the offset matrix

Description

Check if the input offset matrix is well-defined and return the same matrix without attributes such as row and column names.

Usage

set.mat.offset(O, n, m)

Check and set the weighting matrix

Description

Check if the input weighting matrix is well-defined and return the same matrix without attributes such as row and column names.

Usage

set.mat.weights(W, n, m)

Check and set the penalty parameters

Description

Check if the input penalty parameters are allowed and set them to default values if they are not. Returns a list of well-defined penalty parameters.

Usage

set.penalty(B = 0, A = 0, U = 1, V = 0)

Arguments

B

penalty parameter of B

A

penalty parameter of A

U

penalty parameter of U

V

penalty parameter of V


Model selection via cross-validation for generalized matrix factorization models

Description

K-fold cross-validation for generalized matrix factorization (GMF) models.

Usage

sgdgmf.cv(
  Y,
  X = NULL,
  Z = NULL,
  family = gaussian(),
  ncomps = seq(from = 1, to = 10, by = 1),
  weights = NULL,
  offset = NULL,
  method = c("airwls", "newton", "sgd"),
  sampling = c("block", "coord", "rnd-block"),
  penalty = list(),
  control.init = list(),
  control.alg = list(),
  control.cv = list()
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

Z

matrix of column fixed effects (q \times m)

family

a glm family (see family for more details)

ncomps

ranks of the latent matrix factorization used in cross-validation (default 1 to 10)

weights

an optional matrix of weights (n \times m)

offset

an optional matrix of offset values (n \times m), that specify a known component to be included in the linear predictor.

method

estimation method to minimize the negative penalized log-likelihood

sampling

sub-sampling strategy to use if method = "sgd"

penalty

list of penalty parameters (see set.penalty for more details)

control.init

list of control parameters for the initialization (see set.control.init for more details)

control.alg

list of control parameters for the optimization (see set.control.alg for more details)

control.cv

list of control parameters for the cross-validation (see set.control.cv for more details)

Details

Cross-validation is performed by minimizing the estimated out-of-sample error, which can be measured in terms of averaged deviance, AIC or BIC calculated on fold-specific test sets. Within each fold, the test set is defined as a fixed proportion of entries in the response matrix which are held out from the estimation process. To this end, the test set entries are hidden by NA values when training the model. Then, the predicted, i.e. imputed, values are used to compute the fold-specific out-of-sample error.

Value

If refit = FALSE (see set.control.cv), the function returns a list containing control.init, control.alg, control.cv and summary.cv. The latter is a matrix collecting the cross-validation results for each combination of fold and latent dimension.

If refit = TRUE (see set.control.cv), the function returns an object of class sgdgmf, obtained by refitting the model on the whole data matrix using the latent dimension selected via cross-validation. The returned object also contains the summary.cv information along with the other standard output of the sgdgmf.fit function.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Set the data dimensions
n = 100; m = 20; d = 5

# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)

# Set RUN = TRUE to run the example, it may take some time. To speed up
# the computation it is possible to run CV in parallel specifying
# control.cv = list(parallel = TRUE, nthreads = <number_of_workers>)
# as an argument of sgdgmf.cv()
RUN = FALSE
if (RUN) {
  # Initialize the GMF parameters assuming 3 latent factors
  gmf_pois = sgdgmf.cv(data_pois$Y, ncomp = 1:10, family = poisson())
  gmf_bin = sgdgmf.cv(data_bin$Y, ncomp = 3, family = binomial())
  gmf_gam = sgdgmf.cv(data_gam$Y, ncomp = 3, family = Gamma(link = "log"))

  # Get the fitted values in the link and response scales
  mu_hat_pois = fitted(gmf_pois, type = "response")
  mu_hat_bin = fitted(gmf_bin, type = "response")
  mu_hat_gam = fitted(gmf_gam, type = "response")

  # Compare the results
  oldpar = par(no.readonly = TRUE)
  par(mfrow = c(1,3), mar = c(1,1,3,1))
  image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
  image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
  image(mu_hat_pois, axes = FALSE, main = expression(hat(mu)[Pois]))
  image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
  image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
  image(mu_hat_bin, axes = FALSE, main = expression(hat(mu)[Bin]))
  image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
  image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
  image(mu_hat_gam, axes = FALSE, main = expression(hat(mu)[Gam]))
  par(oldpar)
}


Single step of cross-validation for generalized matrix factorization models

Description

Internal function running a single step of cross-validation for generalized matrix factorization (GMF) models and calculating some goodness-of-fit measures on the train and test sets.

Usage

sgdgmf.cv.step(
  train,
  test,
  X,
  Z,
  family,
  ncomp,
  maxcomp,
  fold,
  nfolds,
  weights,
  offset,
  method,
  sampling,
  penalty,
  control.init,
  control.alg,
  control.cv
)

Arguments

train

train-set matrix of responses (n \times m)

test

test-set matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

Z

matrix of column fixed effects (q \times m)

family

a glm family (see family for more details)

ncomp

ranks of the latent matrix factorization used in cross-validation (default 1 to 10)

maxcomp

maximum rank allowed in the cross-validation exploration

fold

integer number identifying the current fold

nfolds

maximum number of folds in the cross-validation

weights

an optional matrix of weights (n \times m)

offset

an optional matrix of offset values (n \times m), that specify a known component to be included in the linear predictor.

method

estimation method to minimize the negative penalized log-likelihood

sampling

sub-sampling strategy to use if method = "sgd"

penalty

list of penalty parameters (see set.penalty for more details)

control.init

list of control parameters for the initialization (see set.control.init for more details)

control.alg

list of control parameters for the optimization (see set.control.alg for more details)

control.cv

list of control parameters for the cross-validation (see set.control.cv for more details)

Value

Returns a data.frame containing the current number of latent factors in the model (ncomp), the fold identifier (fold), the degrees of freedom, i.e. the number of parameters, of the model (df), the AIC, BIC and deviance (respectively, aic, bic, dev) calculated on the train and test sets.


Factorize a matrix of non-Gaussian observations using GMF

Description

Fit a generalized matrix factorization (GMF) model for non-Gaussian data using either deterministic or stochastic optimization methods. It is an alternative to PCA when the observed data are binary, counts, and positive scores or, more generally, when the conditional distribution of the observations can be appropriately described using a dispersion exponential family or a quasi-likelihood model. Some examples are Gaussian, Gamma, Binomial and Poisson probability laws.

The dependence among the observations and the variables in the sample can be taken into account through appropriate row- and column-specific regression effects. The residual variability is then modeled through a low-rank matrix factorization.

For the estimation, the package implements two deterministic optimization methods, (AIRWLS and Newton) and two stochastic optimization algorithms (adaptive SGD with coordinate-wise and block-wise sub-sampling).

Usage

sgdgmf.fit(
  Y,
  X = NULL,
  Z = NULL,
  family = gaussian(),
  ncomp = 2,
  weights = NULL,
  offset = NULL,
  method = c("airwls", "newton", "sgd"),
  sampling = c("block", "coord", "rnd-block"),
  penalty = list(),
  control.init = list(),
  control.alg = list()
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row fixed effects (n \times p)

Z

matrix of column fixed effects (q \times m)

family

a glm family (see family for more details)

ncomp

rank of the latent matrix factorization (default 2)

weights

an optional matrix of weights (n \times m)

offset

an optional matrix of offset values (n \times m), that specify a known component to be included in the linear predictor

method

estimation method to minimize the negative penalized log-likelihood

sampling

sub-sampling strategy to use if method = "sgd"

penalty

list of penalty parameters (see set.penalty for more details)

control.init

list of control parameters for the initialization (see set.control.init for more details)

control.alg

list of control parameters for the optimization (see set.control.alg for more details)

Details

Model specification

The model we consider is defined as follows. Let Y = (y_{ij}) be a matrix of observed data of dimension n \times m. We assume for the (i,j)th observation in the matrix a dispersion exponential family law (y_{ij} \mid \theta_{ij}) \sim EF(\theta_{ij}, \phi), where \theta_{ij} is the natural parameter and \phi is the dispersion parameter. Recall that the conditional probability density function of y_{ij} is given by

f (y_{ij}; \psi) = \exp \big[ w_{ij} \{(y_{ij} \theta_{ij} - b(\theta_{ij})\} / \phi - c(y_{ij}, \phi / w_{ij}) \big],

where \psi is the vector of unknown parameters to be estimated, b(\cdot) is a convex twice differentiable log-partition function, and c(\cdot,\cdot) is the cumulant function of the family.

The conditional mean of y_{ij}, say \mu_{ij}, is then modeled as

g(\mu_{ij}) = \eta_{ij} = x_i^\top \beta_j + \alpha_i^\top z_j + u_i^\top v_j,

where g(\cdot) is a bijective twice differentiable link function, \eta_{ij} is a linear predictor, x_i \in \mathbb{R}^p and z_j \in \mathbb{R}^q are observed covariate vectors, \beta_j \in \mathbb{R}^p and \alpha_j \in \mathbb{R}^q are unknown regression parameters and, finally, u_i \in \mathbb{R}^d and v_j \in \mathbb{R}^d are latent vector explaining the residual varibility not captured by the regression effects. Equivalently, in matrix form, we have g(\mu) = \eta = X B^\top + A Z^\top + U V^\top.

The natural parameter \theta_{ij} is linked to the conditional mean of y_{ij} through the equation E(y_{ij}) = \mu_{ij} = b'(\theta_{ij}). Similarly, the variance of y_{ij} is given by \text{Var}(y_{ij}) = (\phi / w_{ij}) \,\nu(\mu_{ij}) = (\phi / w_{ij}) \,b''(\mu_{ij}), where \nu(\cdot) is the so-called variance function of the family. Finally, we denote by D_\phi(y,\mu) the deviance function of the family, which is defined as D_\phi(y,\mu) = - 2 \log\{ f(y, \psi) / f_0 (y) \}, where f_0(y) is the likelihood of the saturated model.

The estimation of the model parameters is performed by minimizing the penalized deviance function

\displaystyle \ell_\lambda (\psi; y) = - \sum_{i = 1}^{n} \sum_{j = 1}^{m} D_\phi(y_{ij}, \mu_{ij}) + \frac{\lambda_{\scriptscriptstyle U}}{2} \| U \|_F^2 + \frac{\lambda_{\scriptscriptstyle V}}{2} \| V \|_F^2,

where \lambda_{\scriptscriptstyle U} > 0 and \lambda_{\scriptscriptstyle V} > 0 are regularization parameters and \|\cdot\|_F is the Frobenious norm. Additional \ell_2 penalization terms can be introduced to regularize B and A. Quasi-likelihood models can be considered as well, where D_\phi(y, \mu) is substituted by Q_\phi(y, \mu) = - \log (\phi/w) - (w / \phi) \int_y^\mu \{(y - t) / \nu(t) \} \,dt, under an appropriate specification of mean, variance and link functions.

Identifiability constraints

The GMF model is not identifiable being invariant with respect to rotation, scaling and sign-flip transformations of U and V. To enforce the uniqueness of the solution, we impose the following identifiability constraints:

where I_n and 1_n are, respectively, the n-dimensional identity matrix and unitary vector.

Alternative identifiability constraints on U and V can be easily obtained by post processing. For instance, a PCA-like solution, say U^\top U is diagonal and V^\top V = I_d, can by obtained by applying the truncated SVD decomposition U V^\top = \tilde{U} \tilde{D} \tilde{V}^\top, and setting U = \tilde{U} \tilde{D} and V = \tilde{V}.

Estimation algorithms

To obtain the penalized maximum likelihood estimate, we here employs four different algorithms

Likelihood families

Currently, all standard glm families are supported, including neg.bin and negative.binomial families from the MASS package. In such a case, the deviance function we consider takes the form D_\phi(y, \mu) = 2 w \big[ y \log(y / \mu) - (y + \phi) \log\{ (y + \phi) / (\mu + \phi) \} \big]. This corresponds to a Negative Binomial model with variance function \nu(\mu) = \mu + \mu^2 / \phi. Then, for \phi \rightarrow \infty, the Negative Binomial likelihood converges to a Poisson likelihood, having linear variance function, say \nu(\mu) = \mu. Notice that the over-dispersion parameter, that here is denoted as \phi, in the MASS package is referred to as \theta. If the Negative Binomial family is selected, a global over-dispersion parameter \phi is estimated from the data using the method of moments.

Parallelization

Parallel execution is implemented in C++ using OpenMP. When installing and compiling the sgdGMF package, the compiler check whether OpenMP is installed in the system. If it is not, the package is compiled excluding all the OpenMP functionalities and no parallel execution is allowed at C++ level.

Notice that OpenMP is not compatible with R parallel computing packages, such as parallel and foreach. Therefore, when parallel=TRUE, it is not possible to run the sgdgmf.fit function within R level parallel functions, e.g., foreach loop.

Value

An sgdgmf object, namely a list, containing the estimated parameters of the GMF model. In particular, the returned object collects the following information:

References

Kidzinnski, L., Hui, F.K.C., Warton, D.I. and Hastie, J.H. (2022). Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays. Journal of Machine Learning Research, 23: 1-29.

Wang, L. and Carvalho, L. (2023). Deviance matrix factorization. Electronic Journal of Statistics, 17(2): 3762-3810.

Castiglione, C., Segers, A., Clement, L, Risso, D. (2024). Stochastic gradient descent estimation of generalized matrix factorization models with application to single-cell RNA sequencing data. arXiv preprint: arXiv:2412.20509.

See Also

set.control.init, set.control.alg, sgdgmf.init, sgdgmf.rank, refit.sgdgmf, coef.sgdgmf, resid.sgdgmf, fitted.sgdgmf, predict.sgdgmf, plot.sgdgmf, screeplot.sgdgmf, biplot.sgdgmf, image.sgdgmf

Examples

# Load the sgdGMF package
library(sgdGMF)

# Set the data dimensions
n = 100; m = 20; d = 5

# Generate data using Poisson, Binomial and Gamma models
data = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())

# Estimate the GMF parameters assuming 3 latent factors
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson(), method = "airwls")

# Get the fitted values in the link and response scales
mu_hat = fitted(gmf, type = "response")

# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(1,3), mar = c(1,1,3,1))
image(data$Y, axes = FALSE, main = expression(Y))
image(data$mu, axes = FALSE, main = expression(mu))
image(mu_hat, axes = FALSE, main = expression(hat(mu)))
par(oldpar)


Initialize the parameters of a generalized matrix factorization model

Description

Provide four initialization methods to set the initial values of a generalized matrix factorization (GMF) model identified by a glm family and a linear predictor of the form g(\mu) = \eta = X B^\top + A Z^\top + U V^\top, with bijective link function g(\cdot). See sgdgmf.fit for more details on the model specification.

Usage

sgdgmf.init(
  Y,
  X = NULL,
  Z = NULL,
  ncomp = 2,
  family = gaussian(),
  weights = NULL,
  offset = NULL,
  method = c("ols", "glm", "random", "values"),
  type = c("deviance", "pearson", "working", "link"),
  niter = 0,
  values = list(),
  verbose = FALSE,
  parallel = FALSE,
  nthreads = 1,
  savedata = TRUE
)

sgdgmf.init.ols(
  Y,
  X = NULL,
  Z = NULL,
  ncomp = 2,
  family = gaussian(),
  weights = NULL,
  offset = NULL,
  type = c("deviance", "pearson", "working", "link"),
  verbose = FALSE
)

sgdgmf.init.glm(
  Y,
  X = NULL,
  Z = NULL,
  ncomp = 2,
  family = gaussian(),
  weights = NULL,
  offset = NULL,
  type = c("deviance", "pearson", "working", "link"),
  verbose = FALSE,
  parallel = FALSE,
  nthreads = 1
)

sgdgmf.init.random(
  Y,
  X = NULL,
  Z = NULL,
  ncomp = 2,
  family = gaussian(),
  weights = NULL,
  offset = NULL,
  sigma = 1
)

sgdgmf.init.custom(
  Y,
  X = NULL,
  Z = NULL,
  ncomp = 2,
  family = gaussian(),
  values = list(),
  verbose = FALSE
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row-specific fixed effects (n \times p)

Z

matrix of column-specific fixed effects (q \times m)

ncomp

rank of the latent matrix factorization

family

a model family, as in the glm interface

weights

matrix of constant weights (n \times m)

offset

matrix of constant offset (n \times m)

method

optimization method to be used for the initial fit

type

type of residuals to be used for initializing U via incomplete SVD decomposition

niter

number of iterations to refine the initial estimate (only if method="ols" or "svd")

values

a list of custom initial values for B, A, U and V

verbose

if TRUE, prints the status of the initialization process

parallel

if TRUE, allows for parallel computing using the foreach package (only if method="glm")

nthreads

number of cores to be used in parallel (only if parallel=TRUE and method="glm")

savedata

if TRUE, stores a copy of the input data

Details

If method = "ols", the initialization is performed fitting a sequence of linear regressions followed by a residual SVD decomposition. To account for non-Gaussian distribution of the data, regression and decomposition are applied on the transformed response matrix Y_h = (g \circ h)(Y), where h(\cdot) is a function which prevent Y_h to take infinite values. For instance, in the Binomial case h(y) = 2 (1-\epsilon) y + \epsilon, while in the Poisson case h(y) = y + \epsilon, where \epsilon is a small positive constant, typically 0.1 or 0.01.

If method = "glm", the initialization is performed by fitting a sequence of generalized linear models followed by a residual SVD decomposition. In particular, to set \beta_j, we use independent GLM fit with y_j \sim X \beta_j. Similarly, to set \alpha_i, we fit the model y_i \sim Z \alpha_i + o_i, with offset o_i = B x_i. Then, we obtain U via SVD on the residuals. Finally, we obtain V via independent GLM fit under the model y_j \sim U v_j + o_j, with offset o_i = X \beta_j + A z_j.

Both under method = "ols" and method = "glm", it is possible to specify the parameter type to change the type of residuals used for the SVD decomposition.

If method = "random", the initialization is performed using independent Gaussian random values for all the parameters in the model.

If method = "values", the initialization is performed using user-specified values provided as an input, which must have compatible dimensions.

Value

An initgmf object, namely a list, containing the initial estimates of the GMF parameters. In particular, the returned object collects the following information:

Examples

library(sgdGMF)

# Set the data dimensions
n = 100; m = 20; d = 5

# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)

# Initialize the GMF parameters assuming 3 latent factors
init_pois = sgdgmf.init(data_pois$Y, ncomp = 3, family = poisson(), method = "ols")
init_bin = sgdgmf.init(data_bin$Y, ncomp = 3, family = binomial(), method = "ols")
init_gam = sgdgmf.init(data_gam$Y, ncomp = 3, family = Gamma(link = "log"), method = "ols")

# Get the fitted values in the link and response scales
mu_hat_pois = fitted(init_pois, type = "response")
mu_hat_bin = fitted(init_bin, type = "response")
mu_hat_gam = fitted(init_gam, type = "response")

# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(3,3), mar = c(1,1,3,1))
image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
image(mu_hat_pois, axes = FALSE, main = expression(hat(mu)[Pois]))
image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
image(mu_hat_bin, axes = FALSE, main = expression(hat(mu)[Bin]))
image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
image(mu_hat_gam, axes = FALSE, main = expression(hat(mu)[Gam]))
par(oldpar)


Rank selection via eigenvalue-gap methods

Description

Select the number of significant principal components of a GMF model via exploitation of eigenvalue-gap methods

Usage

sgdgmf.rank(
  Y,
  X = NULL,
  Z = NULL,
  maxcomp = ncol(Y),
  family = gaussian(),
  weights = NULL,
  offset = NULL,
  method = c("evr", "onatski", "act", "oht"),
  type.reg = c("ols", "glm"),
  type.res = c("deviance", "pearson", "working", "link"),
  normalize = FALSE,
  maxiter = 10,
  parallel = FALSE,
  nthreads = 1,
  return.eta = FALSE,
  return.mu = FALSE,
  return.res = FALSE,
  return.cov = FALSE
)

Arguments

Y

matrix of responses (n \times m)

X

matrix of row-specific fixed effects (n \times p)

Z

matrix of column-specific fixed effects (q \times m)

maxcomp

maximum number of eigenvalues to compute

family

a family as in the glm interface (default gaussian())

weights

matrix of optional weights (n \times m)

offset

matrix of optional offsets (n \times m)

method

rank selection method

type.reg

regression method to be used to profile out the covariate effects

type.res

residual type to be decomposed

normalize

if TRUE, standardize column-by-column the residual matrix

maxiter

maximum number of iterations

parallel

if TRUE, allows for parallel computing using foreach

nthreads

number of cores to be used in parallel (only if parallel=TRUE)

return.eta

if TRUE, return the linear predictor martix

return.mu

if TRUE, return the fitted value martix

return.res

if TRUE, return the residual matrix

return.cov

if TRUE, return the covariance matrix of the residuals

Value

A list containing the method, the selected latent rank ncomp, and the eigenvalues used to select the latent rank lambdas. Additionally, if required, in the output list will also provide the linear predictor eta, the predicted mean matrix mu, the residual matrix res, and the implied residual covariance matrix covmat.

References

Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. Review of Economics and Statistics, 92(4): 1004-1016

Ahn, S.C., Horenstein, A.R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81, 1203-1227

Gavish, M., Donoho, D.L. (2014) The optimal hard thresholding for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8): 5040–5053

Fan, J., Guo, J. and Zheng, S. (2020). Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, 117(538): 852–861

Wang, L. and Carvalho, L. (2023). Deviance matrix factorization. Electronic Journal of Statistics, 17(2): 3762-3810

Examples

library(sgdGMF)

# Set the data dimensions
n = 100; m = 20; d = 5

# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)

# Initialize the GMF parameters assuming 3 latent factors
ncomp_pois = sgdgmf.rank(data_pois$Y, family = poisson(), normalize = TRUE)
ncomp_bin = sgdgmf.rank(data_bin$Y, family = binomial(), normalize = TRUE)
ncomp_gam = sgdgmf.rank(data_gam$Y, family = Gamma(link = "log"), normalize = TRUE)

# Get the selected number of components
print(paste("Poisson:", ncomp_pois$ncomp))
print(paste("Binomial:", ncomp_bin$ncomp))
print(paste("Gamma:", ncomp_gam$ncomp))

# Plot the screeplot used for the component determination
oldpar = par(no.readonly = TRUE)
par(mfrow = c(3,1))
barplot(ncomp_pois$lambdas, main = "Poisson screeplot")
barplot(ncomp_bin$lambdas, main = "Binomial screeplot")
barplot(ncomp_gam$lambdas, main = "Gamma screeplot")
par(oldpar)


Simulate non-Gaussian data from a GMF model

Description

Simulate synthetic non-Gaussian data from a generalized matrix factorization (GMF) model.

Usage

sim.gmf.data(n = 100, m = 20, ncomp = 5, family = gaussian(), dispersion = 1)

Arguments

n

number of observations

m

number of variables

ncomp

rank of the latent matrix factorization

family

a glm family (see family for more details)

dispersion

a positive dispersion parameter

Details

The loadings, V, are independently sampled from a standard normal distribution. The scores, U, are simulated according to sinusoidal signals evaluated at different phases, frequencies and amplitudes. These parameters are randomly sampled from independent uniform distributions.

Value

A list containing the following objects:

Examples

library(sgdGMF)

# Set the data dimensions
n = 100; m = 20; d = 5

# Generate data using Poisson, Binomial and Gamma models
data_pois = sim.gmf.data(n = n, m = m, ncomp = d, family = poisson())
data_bin = sim.gmf.data(n = n, m = m, ncomp = d, family = binomial())
data_gam = sim.gmf.data(n = n, m = m, ncomp = d, family = Gamma(link = "log"), dispersion = 0.25)

# Compare the results
oldpar = par(no.readonly = TRUE)
par(mfrow = c(3,3), mar = c(1,1,3,1))
image(data_pois$Y, axes = FALSE, main = expression(Y[Pois]))
image(data_pois$mu, axes = FALSE, main = expression(mu[Pois]))
image(data_pois$U, axes = FALSE, main = expression(U[Pois]))
image(data_bin$Y, axes = FALSE, main = expression(Y[Bin]))
image(data_bin$mu, axes = FALSE, main = expression(mu[Bin]))
image(data_bin$U, axes = FALSE, main = expression(U[Bin]))
image(data_gam$Y, axes = FALSE, main = expression(Y[Gam]))
image(data_gam$mu, axes = FALSE, main = expression(mu[Gam]))
image(data_gam$U, axes = FALSE, main = expression(U[Gam]))
par(oldpar)


Simulate new data

Description

Generic function to simulate new data from a statistical model

Usage

simulate(object, ...)

Arguments

object

an object from which simulate new data

...

additional arguments passed to or from other methods

Value

An array containing the simulated data.


Simulate method for GMF models

Description

Simulate new data from a fitted generalized matrix factorization models

Usage

## S3 method for class 'sgdgmf'
simulate(object, ..., nsim = 1)

Arguments

object

an object of class sgdgmf

...

further arguments passed to or from other methods

nsim

number of samples

Value

An 3-fold array containing the simulated data.

Examples

# Load the sgdGMF package
library(sgdGMF)

# Generate data from a Poisson model
data = sim.gmf.data(n = 100, m = 20, ncomp = 5, family = poisson())

# Fit a GMF model
gmf = sgdgmf.fit(data$Y, ncomp = 3, family = poisson())

# Simulate new data from a GMF model
str(simulate(gmf))


Estimate the coefficients of a vector generalized linear model

Description

Estimate the coefficients of a vector generalized linear model via parallel iterative re-weighted least squares. Computations can be performed in parallel to speed up the execution.

Usage

vglm.fit.coef(
  Y,
  X,
  family = gaussian(),
  weights = NULL,
  offset = NULL,
  parallel = FALSE,
  nthreads = 1,
  clust = NULL
)

Arguments

Y

n \times m matrix of response variables

X

n \times p matrix of covariates

family

a glm family (see family for more details)

weights

n \times m matrix of weighting values

offset

n \times m matrix of offset values

parallel

if TRUE, allows for parallel computing using the foreach package

nthreads

number of cores to be used in parallel (only if parallel=TRUE)

clust

registered cluster to be used for distributing the computations (only if parallel=TRUE)


Compute the whitening matrix from a given covariance matrix

Description

Compute the whitening matrix from a given covariance matrix

Usage

whitening.matrix(
  sigma,
  method = c("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky")
)

Arguments

sigma

covariance matrix.

method

determines the type of whitening transformation.

Details

This function is an internal re-implementation of the function whiteningMatrix in the whitening package. See the original documentation to get more details.