Title: Flexible Clustering of Ordinal and Mixed-with-Ordinal Data
Version: 1.0.0
Description: Extends the capabilities for flexible partitioning and model-based clustering available in the packages 'flexclust' and 'flexmix' to handle ordinal and mixed-with-ordinal data types via new distance, centroid and driver functions that make various assumptions regarding ordinality. Using them within the flex-scheme allows for easy comparisons across methods.
Imports: flexclust (≥ 1.5.0), methods, mvtnorm, flexmix
Suggests: rmarkdown, knitr
License: GPL-2
Encoding: UTF-8
URL: https://zettlchen.github.io/flexord/
BugReports: https://github.com/dernst/flexord/issues
RoxygenNote: 7.3.2
VignetteBuilder: rmarkdown, knitr
Depends: R (≥ 4.3)
NeedsCompilation: no
Packaged: 2025-03-26 12:02:37 UTC; zettlchen
Author: Lena Ortega Menjivar
Maintainer: Lena Ortega Menjivar <lena.ortega-menjivar@boku.ac.at>
Repository: CRAN
Date/Publication: 2025-03-27 17:50:02 UTC
FlexMix Driver for Regularized Beta-Binomial Mixtures
Description
This model driver can be used to cluster data using the beta-binomial distribution.
Usage
FLXMCregbetabinom(
formula = . ~ .,
size = NULL,
alpha = 0,
eps = sqrt(.Machine$double.eps)
)
Arguments
formula: A formula which is interpreted relative to the formula specified in the call to flexmix().
size: Number of trials (one or more). Defaults to NULL.
alpha: A non-negative scalar acting as regularization parameter. Can be regarded as adding alpha observations equal to the population mean to each component.
eps: Lower threshold for the shape parameters a and b.
Details
Using a regularization parameter alpha greater than zero can be viewed as adding alpha observations equal to the population mean to each component. This can be used to avoid degenerate solutions (i.e., probabilities of 0 or 1). It also has the effect that clusters become more similar to each other the larger alpha is chosen; for small values, however, this effect is mostly negligible.
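For reference, the beta-binomial probability mass function fitted per variable and component can be written down directly. The following is a hedged sketch (the helper name dbetabinom_sketch is illustrative, not part of the package) showing the role of the shape parameters a and b that eps bounds from below:
dbetabinom_sketch <- function(x, size, a, b) {
  # P(X = x) = choose(size, x) * B(x + a, size - x + b) / B(a, b)
  choose(size, x) * beta(x + a, size - x + b) / beta(a, b)
}
dbetabinom_sketch(0:4, size = 4, a = 2, b = 5)
sum(dbetabinom_sketch(0:4, size = 4, a = 2, b = 5))  # probabilities sum to 1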
Value
An object of class "FLXC"
.
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Kondofersky, I (2008). Modellbasiertes Clustern mit der Beta-Binomialverteilung. Bachelor's thesis, Ludwig-Maximilians-Universität München.
Examples
library("flexmix")
library("flexord")
library("flexclust")
# Sample data
k <- 4 # nr of clusters
size <- 4 # nr of trials
N <- 100 # obs. per cluster
set.seed(0xdeaf)
# random probabilities per component
probs <- lapply(seq_len(k), \(ki) runif(10, 0.01, 0.99))
# sample data
dat <- lapply(probs, \(p) {
lapply(p, \(p_i) {
rbinom(N, size, p_i)
}) |> do.call(cbind, args=_)
}) |> do.call(rbind, args=_)
true_clusters <- rep(1:4, rep(N, k))
# The sample data are drawn from a binomial distribution, while we fit a
# beta-binomial model. This is a slight mis-specification, but the
# beta-binomial can be seen as a generalization of the binomial.
m <- flexmix(dat~1, model=FLXMCregbetabinom(size=size, alpha=0),
cluster = true_clusters)
# Cluster without regularization
m1 <- stepFlexmix(dat~1, model=FLXMCregbetabinom(size=size, alpha=0), k=k)
# Cluster with regularization
m2 <- flexmix(dat~1, model=FLXMCregbetabinom(size=size, alpha=1), k=k,
cluster = posterior(m1))
# Both models are mostly able to reconstruct the true clusters (ARI ~ 0.95)
# (it's a very easy clustering problem)
# Small values for the regularization don't seem to affect the ARI (much)
randIndex(clusters(m1), true_clusters)
randIndex(clusters(m2), true_clusters)
FlexMix Driver for Regularized Binomial Mixtures
Description
This model driver can be used to cluster data using the binomial distribution.
Usage
FLXMCregbinom(formula = . ~ ., size = NULL, hasNA = FALSE, alpha = 0, eps = 0)
Arguments
formula: A formula which is interpreted relative to the formula specified in the call to flexmix().
size: Number of trials (one or more). Defaults to NULL.
hasNA: Logical indicating whether the data set may contain NA values. Default is FALSE. For data sets without NAs the results are identical, but the computation is slightly faster when the absence of NAs can be assumed.
alpha: A non-negative scalar acting as regularization parameter. Can be regarded as adding alpha observations equal to the population mean to each component.
eps: A numeric value in [0, 1). When greater than zero, probabilities are truncated to lie within [eps, 1 - eps].
Details
Using a regularization parameter alpha greater than zero can be viewed as adding alpha observations equal to the population mean to each component. This can be used to avoid degenerate solutions (i.e., probabilities of 0 or 1). It also has the effect that clusters become more similar to each other the larger alpha is chosen; for small values, however, this effect is mostly negligible.
Parameter estimation is achieved via the MAP estimator for each component and variable, using a Beta prior.
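The following is a minimal sketch of this shrinkage idea for a single variable and component (illustrative only, and not necessarily the exact internal computation of FLXMCregbinom): the weighted estimate of the success probability is pulled towards the population mean, as if alpha pseudo-observations at that mean had been added.
shrink_binom <- function(x, w, size, alpha) {
  p_pop <- mean(x) / size                      # population mean probability
  p_ml  <- sum(w * x) / (size * sum(w))        # weighted ML estimate
  p_map <- (sum(w * x) + alpha * size * p_pop) /
           (size * (sum(w) + alpha))           # alpha pseudo-observations at p_pop
  c(ml = p_ml, regularized = p_map)
}
x <- rbinom(50, size = 4, prob = 0.9)
w <- runif(50)                                 # hypothetical posterior weights
shrink_binom(x, w, size = 4, alpha = 0)        # no shrinkage
shrink_binom(x, w, size = 4, alpha = 10)       # pulled towards mean(x)/4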
Value
An object of class "FLXC"
.
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Examples
library("flexmix")
library("flexord")
library("flexclust")
# Sample data
k <- 4 # nr of clusters
size <- 4 # nr of trials
N <- 100 # obs. per cluster
set.seed(0xdeaf)
# random probabilities per component
probs <- lapply(seq_len(k), \(ki) runif(10, 0.01, 0.99))
# sample data
dat <- lapply(probs, \(p) {
lapply(p, \(p_i) {
rbinom(N, size, p_i)
}) |> do.call(cbind, args=_)
}) |> do.call(rbind, args=_)
true_clusters <- rep(1:4, rep(N, k))
# Cluster without regularization
m1 <- stepFlexmix(dat~1, model=FLXMCregbinom(size=size, alpha=0), k=k)
# Cluster with regularization
m2 <- stepFlexmix(dat~1, model=FLXMCregbinom(size=size, alpha=1), k=k)
# Both models are mostly able to reconstruct the true clusters (ARI ~ 0.96)
# (it's a very easy clustering problem)
# Small values for the regularization don't seem to affect the ARI (much)
randIndex(clusters(m1), true_clusters)
randIndex(clusters(m2), true_clusters)
FlexMix Driver for Regularized Multinomial Mixtures
Description
This model driver can be used to cluster data using a multinomial distribution.
Usage
FLXMCregmultinom(formula = . ~ ., r = NULL, alpha = 0)
Arguments
formula: A formula which is interpreted relative to the formula specified in the call to flexmix().
r: Number of different categories. Values are assumed to be integers in 1:r.
alpha: A non-negative scalar acting as regularization parameter. Can be regarded as adding alpha observations conforming to the population mean to each component.
Details
Using a regularization parameter alpha greater than zero acts as adding alpha observations conforming to the population mean to each component. This can be used to avoid degenerate solutions. It also has the effect that clusters become more similar to each other the larger alpha is chosen; for small values, however, this effect is mostly negligible.
For regularization we compute the MAP estimates for the multinomial distribution using the Dirichlet distribution as prior, which is the conjugate prior. The parameters of this prior are selected to correspond to the marginal distribution of the variable across all observations.
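A minimal sketch of this idea for a single variable and component (illustrative only, and not necessarily the exact internal computation of FLXMCregmultinom): alpha pseudo-observations distributed like the marginal category frequencies are added to the weighted counts before normalizing.
shrink_multinom <- function(x, w, r, alpha) {
  f_marg <- tabulate(x, nbins = r) / length(x)                      # marginal category frequencies
  counts <- vapply(seq_len(r), function(cat) sum(w[x == cat]), 0)   # weighted category counts
  (counts + alpha * f_marg) / (sum(w) + alpha)                      # regularized probabilities
}
x <- sample(1:3, 50, replace = TRUE, prob = c(0.7, 0.2, 0.1))
w <- runif(50)                                                      # hypothetical posterior weights
shrink_multinom(x, w, r = 3, alpha = 0)                             # weighted ML estimate
shrink_multinom(x, w, r = 3, alpha = 10)                            # pulled towards the marginal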
Value
An object of class "FLXC"
.
References
Galindo Garre, F, Vermunt, JK (2006). Avoiding Boundary Estimates in Latent Class Analysis by Bayesian Posterior Mode Estimation. Behaviormetrika, 33, 43-59.
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Examples
library("flexmix")
library("flexord")
library("flexclust")
set.seed(0xdeaf)
# Sample data
k <- 4 # nr of clusters
nvar <- 10 # nr of variables
r <- sample(2:7, size=nvar, replace=TRUE) # nr of categories
N <- 100 # obs. per cluster
# random probabilities per component
probs <- lapply(seq_len(k), \(ki) runif(nvar, 0.01, 0.99))
# sample data by drawing from a binomial distribution with size = r - 1;
# values are expected to lie inside 1:r, hence we add +1
dat <- lapply(probs, \(p) {
mapply(\(p_i, r_i) {
rbinom(N, r_i, p_i) + 1
}, p, r-1, SIMPLIFY=FALSE) |> do.call(cbind, args=_)
}) |> do.call(rbind, args=_)
true_clusters <- rep(1:4, rep(N, k))
# Cluster without regularization
m1 <- stepFlexmix(dat~1, model=FLXMCregmultinom(r=r, alpha=0), k=k)
# Cluster with regularization
m2 <- stepFlexmix(dat~1, model=FLXMCregmultinom(r=r, alpha=1), k=k)
# Both models are mostly able to reconstruct the true clusters (ARI ~ 0.95)
# (it's a very easy clustering problem)
# Small values for the regularization don't seem to affect the ARI (much)
randIndex(clusters(m1), true_clusters)
randIndex(clusters(m2), true_clusters)
FlexMix Driver for Regularized Multivariate Normal Mixtures
Description
This model driver implements the regularization method as introduced by Fraley and Raftery (2007) for univariate normal mixtures. Default parameters for the regularization according to that paper may be obtained using FLXMCregnorm_defaults(). We extend this to the multivariate case assuming independence between variables within components, i.e., we only implement the special case where the covariance matrix is diagonal. For more general applications of normal mixtures see package mclust.
Usage
FLXMCregnorm(formula = . ~ ., params)
Arguments
formula: A formula which is interpreted relative to the formula specified in the call to flexmix().
params: Prior parameters for normal mixtures. You may obtain default values according to Fraley and Raftery (2007) using FLXMCregnorm_defaults().
Details
For the regularization the conjugate prior distributions for the normal distribution are used, which are:
- a Normal prior with parameters mu_p and sigma^2/kappa_p for the mean,
- an Inverse Gamma prior with parameters nu_p/2 and zeta_p^2/2 for the variance.
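As a hedged illustration of the effect of the Normal prior on a component mean (the variance regularization via the Inverse Gamma prior is omitted here, and this is not necessarily the exact internal computation of FLXMCregnorm):
shrink_mean <- function(x, w, mu_p, kappa_p) {
  # posterior mode of the mean under a Normal prior with parameters mu_p and sigma^2/kappa_p
  (sum(w * x) + kappa_p * mu_p) / (sum(w) + kappa_p)
}
x <- rnorm(50, mean = 2)
w <- runif(50)                            # hypothetical posterior weights
shrink_mean(x, w, mu_p = 0, kappa_p = 0)  # weighted ML estimate
shrink_mean(x, w, mu_p = 0, kappa_p = 5)  # shrunk towards mu_p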
Value
An object of class "FLXC"
.
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Fraley, C, Raftery, AE (2007) Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. Journal of Classification, 24(2), 155-181
See Also
FLXMCregnorm_defaults
Examples
library("flexmix")
library("flexord")
library("flexclust")
# example data
data("iris", package = "datasets")
my_iris <- subset(iris, select=setdiff(colnames(iris), "Species")) |>
as.matrix()
# cluster one model with a scale parameter similar to the default for 3 components
params <- FLXMCregnorm_defaults(my_iris, zeta_p = c(0.23, 0.06, 1.04, 0.19))
m1 <- stepFlexmix(my_iris ~ 1, k = 3,
model=FLXMCregnorm(params = params))
summary(m1)
# rand index of clusters vs species
randIndex(clusters(m1), iris$Species)
# cluster one model with default scale parameter
params <- FLXMCregnorm_defaults(my_iris, k = 3)
m2 <- stepFlexmix(my_iris ~ 1, k = 3,
model = FLXMCregnorm(params = params))
summary(m2)
# rand index of clusters vs species
randIndex(clusters(m2), iris$Species)
# rand index between both models (should be >= 0.8)
randIndex(clusters(m1), clusters(m2))
Data-Driven Default Parameters for Regularized Normal Mixtures
Description
Determines the default values for regularized univariate normal mixtures as proposed by Fraley and Raftery (2007) based on the data set to be clustered and the number of components in the mixture model.
Usage
FLXMCregnorm_defaults(x, zeta_p = NULL, kappa_p = 0.01, nu_p = 3, k = NULL)
Arguments
x: The data set to be clustered. Should be the same data set as is used in the call to flexmix().
zeta_p: Scale (hyperparameter for the IG prior). If not given, the empirical variance divided by the square of the number of components is used, as per Fraley and Raftery (2007); alternatively, values for the scale hyperparameter can be supplied directly (see the Examples for FLXMCregnorm).
kappa_p: Shrinkage parameter. Corresponds to adding kappa_p pseudo-observations to each component when estimating the mean.
nu_p: Degrees of freedom (hyperparameter for the IG prior).
k: Number of components assumed for the mixture model (not used if zeta_p is specified).
Value
A named list with values for mu_p, kappa_p, nu_p and zeta_p.
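A brief usage sketch, following the iris example for FLXMCregnorm above:
library("flexord")
data("iris", package = "datasets")
x <- as.matrix(iris[, 1:4])
# data-driven default prior parameters for a 3-component mixture
params <- FLXMCregnorm_defaults(x, k = 3)
str(params)   # named list with mu_p, kappa_p, nu_p and zeta_p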
Centroid Functions for K-Centroids Clustering of (Ordinal) Categorical/Mixed Data
Description
Functions to calculate cluster centroids for K-centroids clustering that extend the options available in package flexclust.
centMode calculates centroids based on the mode of each variable. centMin determines centroids within a specified range which minimize the supplied distance metric. centOptimNA replicates the behaviour of flexclust::centOptim() but removes missing values.
These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily().
Usage
centMode(x)
centMin(x, dist, xrange = NULL)
centOptimNA(x, dist)
Arguments
x: A numeric matrix or data frame.
dist: The distance measure function used in centMin() and centOptimNA().
xrange: The range of the data in x, which determines the possible levels considered (in the Examples, xrange = 'all' is used).
Details
- centMode: Column-wise modes are used as centroids, and ties are broken randomly. In combination with Simple Matching Distance (distSimMatch), this results in the kModes algorithm (a hedged sketch of the column-wise mode idea follows this list).
- centMin: Column-wise centroids are calculated by minimizing the specified distance measure between the values in x and all possible levels of x.
- centOptimNA: Column-wise centroids are calculated by minimizing the specified distance measure via a general purpose optimizer. Unlike in flexclust::centOptim(), NAs are removed from the starting search values and disregarded in the distance calculation.
Value
A named numeric vector containing the centroid values for each column of x.
See Also
kccaExtendedFamily()
,
flexclust::kcca()
Examples
# Example: Mode as centroid
dat <- data.frame(A = rep(2:5, 2),
B = rep(1:4, 2),
C = rep(c(1, 2, 4, 5), 2))
centMode(dat)
## within kcca
flexclust::kcca(dat, 3, family=kccaExtendedFamily('kModes')) #default centroid
# Example: Centroid is level for which distance is minimal
centMin(dat, flexclust::distManhattan, xrange = 'all')
## within kcca
flexclust::kcca(dat, 3,
family=flexclust::kccaFamily(dist=flexclust::distManhattan,
cent=\(y) centMin(y, flexclust::distManhattan,
xrange='all')))
# Example: Centroid calculated by general purpose optimizer with NA removal
nas <- sample(c(TRUE, FALSE), prod(dim(dat)),
replace=TRUE, prob=c(0.1,0.9)) |>
matrix(nrow=nrow(dat))
dat[nas] <- NA
centOptimNA(dat, flexclust::distManhattan)
## within kcca
flexclust::kcca(dat, 3, family=kccaExtendedFamily('kGower')) #default centroid
Distance Functions for K-Centroids Clustering of (Ordinal) Categorical/Mixed Data
Description
Functions to calculate the distance between a matrix x and a matrix c, which can be used for K-centroids clustering via flexclust::kcca().
distSimMatch implements Simple Matching Distance (most frequently used for categorical, or symmetric binary, data) for K-centroids clustering.
distGower implements Gower's Distance after Gower (1971) and Kaufman & Rousseeuw (1990) for mixed-type data with missings for K-centroids clustering.
distGDM2 implements the GDM2 distance for ordinal data introduced by Walesiak et al. (1993) and adapted to K-centroids clustering by Ernst et al. (2025).
These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily(). However, they can also easily be used to obtain a distance matrix of x, see Examples.
Usage
distGDM2(x, centers, genDist, xrange = NULL)
distGower(x, centers, genDist)
distSimMatch(x, centers)
Arguments
x: A numeric matrix or data frame.
centers: A numeric matrix of cluster centers with the same number of columns as x.
genDist: Additional information on x needed for the distance computation (supplied automatically when the functions are used via kccaExtendedFamily()).
xrange: Range specification for the variables. Currently only used for distGDM2.
Details
- distSimMatch: Simple Matching Distance between two observations is calculated as the proportion of disagreements across all variables. Described, e.g., in Kaufman & Rousseeuw (1990), p. 24. If this is used in K-centroids analysis in combination with mode centroids (as implemented in centMode), this results in the kModes algorithm. A wrapper for this algorithm is obtained with kccaExtendedFamily(which='kModes').
- distGower: Distances are calculated for each column (Euclidean distance, distEuclidean, is recommended for numeric, Manhattan distance, distManhattan, for ordinal, Simple Matching Distance, distSimMatch, for categorical, and Jaccard distance, distJaccard, for asymmetric binary variables), and they are summed up as:
  d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij}, x_{kj})}{\sum_{j=1}^p \delta_{ikj}}
  where p is the number of variables and the weight \delta_{ikj} is 1 if both values x_{ij} and x_{kj} are non-missing, and, in the case of asymmetric binary variables, at least one of them is not 0. Please note that for calculating Gower's distance, scaling of numeric/ordered variables is required (as, for instance, by .ScaleVarSpecific). A wrapper for K-centroids analysis using Gower's distance in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGower').
- distGDM2: GDM2 distance for ordinal variables conducts only relational operations on the variables, such as \leq, \geq and =. By translating x to its relative frequencies and empirical cumulative distributions, we are able to extend this principle to compare two arbitrary values, and thus use it within K-centroids clustering. For more details, see Ernst et al. (2025). A wrapper for this algorithm in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGDM2').
The distance functions presented here can also be used in clustering algorithms that rely on distance matrices (such as hierarchical clustering and PAM), if applied accordingly, see Examples.
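For instance, a hedged sketch of combining distSimMatch with hierarchical clustering:
library("flexord")
set.seed(1)
datmat <- matrix(sample(1:4, 40, replace = TRUE), nrow = 10)
d <- as.dist(distSimMatch(datmat, datmat))   # full pairwise distance matrix
hc <- hclust(d, method = "average")
cutree(hc, k = 2)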
Value
A matrix of dimensions c(nrow(x), nrow(centers)) that contains the distance between each row of x and each row of centers.
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Gower, JC (1971). A General Coefficient for Similarity and Some of Its Properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
Leisch, F (2006). A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 17(3), 526-544. doi:10.1016/j.csda.2005.10.006
Kaufman, L, Rousseeuw, P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, New York: John Wiley & Sons. doi:10.1002/9780470316801
Walesiak, M (1993). Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej, 44-46.
Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005). klaR Analyzing German Business Cycles. In Baier D, Decker, R, Schmidt-Thieme, L (eds.). Data Analysis and Decision Support, 335-343. Berlin: Springer-Verlag. doi:10.1007/3-540-28397-8_36
See Also
flexclust::kcca()
,
klaR::kmodes()
,
cluster::daisy()
,
clusterSim::dist.GDM()
Examples
# Example 1: Simple Matching Distance
set.seed(123)
dat <- data.frame(question1 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
question2 = factor(sample(LETTERS[1:6], 10, replace=TRUE)),
question3 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
question4 = factor(sample(LETTERS[1:5], 10, replace=TRUE)),
state = factor(sample(state.name[1:10], 10, replace=TRUE)),
gender = factor(sample(c('M', 'F', 'N'), 10, replace=TRUE,
prob=c(0.45, 0.45, 0.1))))
datmat <- data.matrix(dat)
initcenters <- datmat[sample(1:10, 3),]
distSimMatch(datmat, initcenters)
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
## as a distance matrix
as.dist(distSimMatch(datmat, datmat))
# Example 2: GDM2 distance
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2'))
# Example 3: Gower's distance
# Ex. 3.1: single variable type case with no missings:
flexclust::kcca(datmat, 3, kccaExtendedFamily('kGower'))
# Ex. 3.2: single variable type case with missing values:
nas <- sample(c(TRUE, FALSE), prod(dim(dat)), replace = TRUE,
prob=c(0.1, 0.9)) |>
matrix(nrow = nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower', cent=centMode))
# Ex. 3.3: mixed variable types (with or without missings):
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
ord_levmis = factor(sample(1:5, 10, replace=TRUE),
levels=1:6, ordered=TRUE),
ord_levfull = factor(sample(1:4, 10, replace=TRUE),
levels=1:4, ordered=TRUE),
nom = factor(sample(letters[1:4], 10, replace=TRUE),
levels=letters[1:4]))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))
Extending K-Centroids Clustering to (Mixed-with-)Ordinal Data
Description
This wrapper creates objects of class "kccaFamily", which can be used with flexclust::kcca() to conduct K-centroids clustering using the following methods:
- kModes (after Weihs et al., 2005)
- kGower (Gower's distance after Kaufman & Rousseeuw, 1990, and a user-specified centroid)
- kGDM2 (GDM2 distance after Walesiak et al., 1993, and a user-specified centroid)
Usage
kccaExtendedFamily(which = c('kModes', 'kGDM2', 'kGower'),
cent = NULL,
preproc = NULL,
xrange = NULL,
xmethods = NULL,
trim = 0, groupFun = 'minSumClusters')
Arguments
which: One of 'kModes', 'kGDM2' or 'kGower'.
cent: Function for determining cluster centroids. This argument is ignored for which='kModes', where centMode is used. For 'kGDM2' and 'kGower', cent=NULL defaults to a general purpose optimizer.
preproc: Preprocessing function applied to the data before clustering. This argument is ignored for which='kGower', in which case the default preprocessing proposed by Gower (1971) and Kaufman & Rousseeuw (1990) is conducted. For 'kGDM2' and 'kModes', users can specify preprocessing steps here, though this is not recommended.
xrange: The range of the data in x, e.g., 'all' (one range for the whole data set) or 'columnwise' (per-column ranges), as used in the Examples. This argument is ignored for which='kModes'.
xmethods: An optional character vector of length p (the number of variables) specifying the distance measure to use for each column when which='kGower', see Details.
trim: Proportion of points trimmed in robust clustering, see flexclust::kccaFamily().
groupFun: A character string specifying the function for clustering. Default is 'minSumClusters', see flexclust::kccaFamily().
Details
Wrappers for defining families are obtained by specifying which using:
- which='kModes' creates an object for kModes clustering, i.e., K-centroids clustering using Simple Matching Distance (counts of disagreements) and modes as centroids. Argument cent is ignored for this method.
- which='kGower' creates an object for performing clustering using Gower's method as described in Kaufman & Rousseeuw (1990):
  - Numeric and/or ordinal variables are scaled by \frac{\mathbf{x}-\min{\mathbf{x}}}{\max{\mathbf{x}}-\min{\mathbf{x}}}. Note that for ordinal variables the internal coding with values from 1 up to their maximum level is used.
  - Distances are calculated for each column (Euclidean distance, distEuclidean, is recommended for numeric, Manhattan distance, distManhattan, for ordinal, Simple Matching Distance, distSimMatch, for categorical, and Jaccard distance, distJaccard, for asymmetric binary variables), and they are summed up as: d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij}, x_{kj})}{\sum_{j=1}^p \delta_{ikj}} where p is the number of variables and the weight \delta_{ikj} is 1 if both values x_{ij} and x_{kj} are non-missing, and, in the case of asymmetric binary variables, at least one of them is not 0.
  - The columnwise distances used can be influenced in two ways: by passing a character vector of length p to xmethods that specifies the distance for each column (options are distEuclidean, distManhattan, distJaccard, and distSimMatch); or by not specifying any methods within kccaExtendedFamily, but rather passing a "data.frame" as argument x in kcca, where the class of each column is used to infer the distance measure. distEuclidean is used on numeric and integer columns, distManhattan on columns that are coded as ordered factors, distSimMatch is the default for categorically coded columns, and distJaccard is the default for binary coded columns.
  - For this method, if cent=NULL, a general purpose optimizer with NA omission is applied for centroid calculation.
- which='kGDM2' creates an object for clustering using the GDM2 distance for ordinal variables. The GDM2 distance was first introduced by Walesiak et al. (1993), and adapted in Ernst et al. (2025) as the distance measure within flexclust::kcca(). This distance respects the ordinal nature of a variable by conducting only relational operations to compare values, such as \leq, \geq and =. By obtaining the relative frequencies and empirical cumulative distributions of x, we allow for comparison of two arbitrary values, and thus are able to conduct K-centroids clustering. For more details, see Ernst et al. (2025). Also for this method, if cent=NULL, a general purpose optimizer with NA omission will be applied for centroid calculation.
Scale handling.
In 'kModes'
, all variables are treated as unordered factors.
In 'kGDM2'
, all variables are treated as ordered factors, with strict assumptions
regarding their ordinality.
'kGower'
is currently the only method designed to handle mixed-type data. For ordinal
variables, the assumptions are more lax than with GDM2 distance.
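A hedged sketch of the min-max scaling applied to numeric and ordinal columns for 'kGower' (ordered factors enter through their integer coding from 1 up to the maximum level):
scale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
scale01(c(2, 5, 3, 10, NA))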
NA handling.
NA handling via omission and upweighting non-missing variables is currently
only implemented for 'kGower'
. Within 'kModes'
, the omission of NA responses
can be avoided by coding missings as valid factor levels. For 'kGDM2'
, currently
the only option is to omit missing values completely.
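A hedged sketch of coding missings as an explicit factor level so that 'kModes' treats NA as a regular category rather than omitting it (the column names q1 and q2 are hypothetical):
dat <- data.frame(q1 = factor(c("A", "B", NA, "A")),
                  q2 = factor(c("x", NA, "y", "y")))
dat[] <- lapply(dat, addNA)   # NA becomes a valid level in every column
sapply(dat, levels)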
Value
An object of class "kccaFamily"
.
References
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Gower, JC (1971). A General Coefficient for Similarity and Some of Its Properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
Kaufman, L, Rousseeuw, P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. doi:10.1002/9780470316801
Leisch, F (2006). A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 17(3), 526-544. doi:10.1016/j.csda.2005.10.006
Walesiak, M (1993). Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej, 44-46.
Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005). klaR Analyzing German Business Cycles. In: Data Analysis and Decision Support, Springer: Berlin. 335-343. doi:10.1007/3-540-28397-8_36
See Also
flexclust::kcca()
,
flexclust::stepFlexclust()
,
flexclust::bootFlexclust()
Examples
# Example 1: kModes
set.seed(123)
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
ord_levmis = factor(sample(1:5, 10, replace=TRUE),
levels=1:6, ordered=TRUE),
ord_levfull = factor(sample(1:4, 10, replace=TRUE),
levels=1:4, ordered=TRUE),
nom = factor(sample(letters[1:4], 10, replace=TRUE),
levels=letters[1:4]))
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
# Example 2: kGDM2
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2',
xrange='columnwise'))
# Example 3: kGower
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))
nas <- sample(c(TRUE,FALSE), prod(dim(dat)), replace=TRUE, prob=c(0.1,0.9)) |>
matrix(nrow=nrow(dat))
dat[nas] <- NA
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
xrange='all'))
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower',
xmethods=c('distEuclidean',
'distEuclidean',
'distJaccard',
'distManhattan',
'distManhattan',
'distSimMatch')))
# The specification above covers the case where column 2 is a binary variable,
# but symmetric, so distEuclidean is used for it instead of distJaccard.
Low Back Pain Diagnoses and Diagnosis Criteria
Description
In a 2011 study, Smart et al. (2011) collected information on 464 Irish and British patients suffering from low back pain regarding:
- the type of low back pain (classified into "nociceptive", "peripheral neuropathic", and "central neuropathic"),
- the presence/absence of 38 clinical criteria and symptoms relating to low back pain.
Fop et al. (2017) conducted Latent Class Analysis on this data set to retrieve the experts' classifications; by comparing nested models they were able to select 11 of the 38 criteria which contain most of the relevant grouping information while avoiding redundancy.
Usage
data('lowbackpain')
Format
A list containing:
- data:
A 464x11 binary matrix indicating the presence/absence of the 11 selected criteria for each of the 464 patients.
- group:
A factor of length 464 indicating the diagnosis each patient received, numerically coded (order has no meaning).
- index:
The index for the criteria explaining which symptom they refer to.
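A brief usage sketch (hypothetical settings, e.g., the number of clusters): cluster the binary criteria with kModes and compare against the expert diagnoses.
library("flexord")
library("flexclust")
data("lowbackpain")
set.seed(1)
cl <- kcca(lowbackpain$data, k = 3, family = kccaExtendedFamily("kModes"))
randIndex(clusters(cl), lowbackpain$group)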
Source
Supplemental Content for Fop et al. (2017): doi:10.1214/17-AOAS1061SUPP
References
Fop, M, Smart, K, Murphy, TB (2017). Variable Selection for Latent Class Analysis with Application to Low Back Pain Diagnosis. The Annals of Applied Statistics. 11(4), 2080-2110. doi:10.1214/17-aoas1061
Smart, K, Blake, C, Staines, A, Doody, C (2011). The Discriminative Validity of "Nociceptive", "Peripheral Neuropathic", and "Central Sensitization" as Mechanisms-Based Classifications of Musculoskeletal Pain. The Clinical Journal of Pain. 27, 655-663. doi:10.1097/AJP.0b013e318215f16a
Risk Aversion
Description
Survey data from 563 respondents on the frequency of risk taking for six different types of risk. Taken from the companion package to Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful (Dolnicar et al., 2018, doi:10.1007/978-981-10-8818-6).
Usage
data('risk')
Format
A matrix with 563 respondents (rows) and 6 variables (columns) named Recreational, Health, Career, Financial, Safety and Social.
Details
The data was collected by academic researchers using a permission based online panel.
The sample was taken from adult Australian residents who have undertaken at least one holiday in the last year which involved staying away from home for at least four nights.
The respondents were asked: "Which risks have you taken in the past?" and answered on a 5-point scale with options:
Never (1)
Rarely (2)
Quite often (3)
Often (4)
Very often (5)
The six types of risk were:
Recreational: e.g. rock-climbing, scuba diving
Health: e.g., smoking, poor diet, high alcohol consumption
Career: e.g., quitting a job without another to go to
Financial: e.g., gambling, risky investments
Safety: e.g., speeding
Social: e.g., standing for election, publicly challenging a rule or decision
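A brief usage sketch (hypothetical settings): treating the six items as ordinal and clustering with the GDM2-based family.
library("flexord")
library("flexclust")
data("risk")
set.seed(1)
cl <- kcca(risk, k = 3, family = kccaExtendedFamily("kGDM2"))
table(clusters(cl))   # cluster sizes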
Source
Sara Dolnicar.
Data and help page are taken from the companion package to Market Segmentation Analysis: Understanding It, Doing It, and Making It Useful (Dolnicar et al., 2018, doi:10.1007/978-981-10-8818-6).
URL: https://statistik.boku.ac.at/nachlass_leisch/MSA/
References
Hajibaba H, Dolnicar S (2017). Helping When Disaster Hits. In: Dolnicar S (ed) Peer-to-Peer Accommodation Networks: Pushing the Boundaries, Goodfellow Publishers, Oxford, chap. 21, 235-243. doi:10.23912/9781911396512-3619
Hajibaba H, Karlsson L, Dolnicar S (2017). Residents Open Their Homes to Tourists When Disaster Strikes. Journal of Travel Research. 58(8), 1065-1078. doi:10.1177/0047287516677167