Type: | Package |
Title: | Automated Classification of Mass Spectra |
Version: | 0.4.0 |
Maintainer: | Alexandre Godmer <alexandre.godmer@aphp.fr> |
Description: | Functions to classify mass spectra in known categories, and to determine discriminant mass-to-charge values. It includes easy-to-use functions for preprocessing mass spectra, functions to determine discriminant mass-to-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-to-charge values. If you use this package in your research, please cite the associated publication (<doi:10.1016/j.eswa.2025.128796>). For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository (https://github.com/agodmer/MSclassifR_examples). |
URL: | https://github.com/agodmer/MSclassifR_examples, https://doi.org/10.1016/j.eswa.2025.128796 |
BugReports: | https://github.com/agodmer/MSclassifR_examples/issues |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
LazyData: | true |
Depends: | R (≥ 4.0), cp4p, caret, statmod, MALDIquant, MALDIrppa, |
Imports: | e1071, MALDIquantForeign, mixOmics, reshape2, ggplot2, nnet, dplyr, VSURF, metap, xgboost, glmnet, Boruta, mltools, mclust, stats, limma, car, vita, randomForest |
Suggests: | knitr, rmarkdown, |
NeedsCompilation: | no |
RoxygenNote: | 7.1.1 |
Packaged: | 2025-07-02 09:54:43 UTC; qgiaigia |
Author: | Alexandre Godmer [aut, cre], Quentin Giai Gianetto [aut], Karen Druart [aut] |
Repository: | CRAN |
Date/Publication: | 2025-07-02 16:20:16 UTC |
Metadata of mass spectra corresponding to the bacterial species Citrobacter sp. from The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra
Description
Metadada of the CitrobacterRKIspectra
list of mass spectra.
Usage
data("CitrobacterRKImetadata", package = "MSclassifR")
Format
A data frame with 14 rows (each corresponding to a mass spectrum), and five columns that contain (in order): the strain name, the species name, the spot, a sample number and the name of the strain associated with the spot.
Details
The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra contains raw mass spectra. Only mass spectra of the Citrobacter bacterial species were collected. Metadata were manually reported from raw data.
Source
The raw data were downloaded from this link : https://zenodo.org/record/163517#.YIkWiNZuJCp. The dataset focuses only on mass spectra from Citrobacter.
References
Lasch, Peter, Stammler, Maren, & Schneider, Andy. (2018). Version 3 (20181130) of the MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI) [Data set]. Zenodo.doi:10.5281/zenodo.163517
Mass spectra corresponding to the bacterial species Citrobacter sp. from The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra
Description
Mass spectra of the CitrobacterRKIspectra
dataset.
Usage
data("CitrobacterRKIspectra", package = "MSclassifR")
#####
#Plotting the first mass spectrum
#library("MSclassifR")
#PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]],absx = "ALL", Peaks = NULL,
# Peaks2 = NULL, col_spec = 1, col_peak = 2, shape_peak = 3,
# col_peak2 = 2, shape_peak2 = 2)
Format
A list that contains 14 objects of class S4 corresponding each to a each mass spectrum.
Details
The Robert Koch-Institute (RKI) database of microbial MALDI-TOF mass spectra contains raw mass spectra. Only mass spectra of the Citrobacter bacterial species were collected.
Source
The raw data were downloaded from this link : https://zenodo.org/record/163517#.YIkWiNZuJCp. The dataset focuses only on mass spectra from Citrobacter.
References
Lasch, Peter, Stammler, Maren, & Schneider, Andy. (2018). Version 3 (20181130) of the MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch-Institute (RKI) [Data set]. Zenodo.doi:10.5281/zenodo.163517
Estimation of a multinomial regression to predict the category to which a mass spectrum belongs
Description
This function estimates a multinomial regression using cross-validation to predict the category (species, phenotypes...) to which a mass spectrum belongs from a set of shortlisted mass-to-charge values corresponding to discriminant peaks. Two main kinds of models can be estimated: linear or nonlinear (with neural networks, random forests, support vector machines with linear kernel, or eXtreme Gradient Boosting). Hyperparameters are randomly searched, except for the eXtreme Gradient Boosting where a grid search is performed.
Usage
LogReg(X,
moz,
Y,
number = 2,
repeats = 2,
Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"),
kind="linear",
Sampling = c("no", "up", "down", "smote"))
Arguments
X |
|
moz |
|
Y |
|
number |
|
Metric |
a |
repeats |
|
kind |
If |
Sampling |
a |
Details
This function estimates a model from a library of mass spectra for which we already know the category to which they belong (ex.: species, etc). This model can next be used to predict the category of a new coming spectrum for which the category is unknown (see PredictLogReg
).
The estimation is performed using the train
function of the caret
R package. For each kind of model, random parameters are tested to find a model according to the best metric
. The formulas for the metric
are as follows:
Accuracy = Number Of Correct Predictions/Total Number Of Predictions
Kappa coefficient = (Observed Agreement-Chance Agreement)/(1-Chance Agreement)
F1 = True Positive/(True Positive + 1/2 (False Positive + False Negative))
The adjusted Rand index ("AdjRankIndex"
) is defined as the corrected-for-chance version of the Rand index which allows comparing two groups (see mclust
package and adjustedRandIndex()
function for more details). The Matthews correlation coefficient ("MatthewsCorrelation"
) is estimated using mcc
function in the mltools
R package.
The Sampling
methods available for imbalanced data are: "up"
to the up-sampling method which consists of random sampling (with replacement) so that the minority class is the same size as the majority class; "down"
to the down-sampling method which consists of random sampling (without replacement) of the majority class so that their class frequencies match the minority class; "smote"
to the Synthetic Minority Over sampling Technique (SMOTE) algorithm for data augmentation which consists of creating new data from minority class using the K Nearest Neighbor algorithm.
Godmer et al. (2025) presents a comparison of different pipelines using LogReg that can help you to optimize your workflow. For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
Returns a list
with four items:
train_mod |
a |
conf_mat |
a confusion matrix containing percentages classes of predicted categories in function of actual categories, resulting from repeated cross-validation. |
stats_global |
a |
boxplot |
a |
References
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
L. Hubert and P. Arabie (1985) Comparing Partitions, Journal of the Classification, 2, pp. 193-218.
Scrucca L, Fop M, Murphy TB, Raftery AE (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal.
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1.
Matthews, B. W. (1975). "Comparison of the predicted and observed secondary structure of T4 phage lysozyme". Biochimica et Biophysica Acta (BBA) - Protein Structure. PMID 1180967.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-to-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Selection of discriminant mass-to-charge values using RFERF
# with 5 to 10 variables,
# up sampling method and
# trained with the Accuracy coefficient metric
a <- MSclassifR::SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Accuracy",
Sizes = c(2:5),
Sampling = "up")
sel_moz=a$sel_moz
###############################################################################
## 3. Perform LogReg from shortlisted discriminant mass-to-charge values
# linear multinomial regression
# without sampling mehod
# and trained with the Kappa coefficient metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
Metric = "Kappa")
# Estimated model:
model_lm
# nonlinear multinomial regression using neural networks
# with up-sampling method and
# trained with the Kappa coefficient metric
model_nn=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="nnet",
Metric = "Kappa",
Sampling = "up")
# Estimated model:
model_nn
# nonlinear multinomial regression using random forests
# without down-sampling method and
# trained with the Kappa coefficient metric
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="rf",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_rf
# nonlinear multinomial regression using xgboost
# with down-sampling method and
# trained with the Kappa coefficient metric
model_xgb=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="xgb",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_xgb
# nonlinear multinomial regression using svm
# with down-sampling method and
# trained with the Kappa coefficient metric
model_svm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="svm",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_svm
##########
# Of note, step 3 can be performed several times
# to find optimal models
# because of random hyperparameter search
###############################################################################
## 4. Select best models in term of average Kappa and saving it for reuse
Kappa_model=c(model_lm$stats_global[1,2],model_nn$stats_global[1,2],
model_rf$stats_global[1,2],model_xgb$stats_global[1,2],model_svm$stats_global[1,2])
names(Kappa_model)=c("lm","nn","rf","xgb","svm")
#Best models in term of accuracy
Kappa_model[which(Kappa_model==max(Kappa_model))]
#save best models for reuse
#models=list(model_lm$train_mod,model_nn$train_mod,model_rf$train_mod,
#model_xgb$train_mod,model_svm$train_mod)
#models_best=models[which(Kappa_model==max(Kappa_model))]
#for (i in 1:length(models_best)){
#save(models_best[[i]], file = paste0("model_best_",i,".rda",collapse="")
#}
#load a saved model
#load("model_best_1.rda")
###############################################################################
## 5. Try other metrics to select the best model
# linear multinomial regression
# with up-sampling method and
# trained with the Adjusted Rank index metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=3,
Metric = "AdjRankIndex",
Sampling = "up")
Automated classification of mass spectra
Description
This package provides R functions to classify mass spectra in known categories, and to determine discriminant mass-to-charge values. It was developed with the aim of identifying very similar species or phenotypes of bacteria from mass spectra obtained by Matrix Assisted Laser Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF MS). However, the different functions of this package can also be used to classify other categories associated to mass spectra; or from mass spectra obtained with other mass spectrometry techniques. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-to-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-to-charge values.
If you use this package in your research, please cite the associated publication: doi:10.1016/j.eswa.2025.128796.
For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
No return value. Package description.
Author(s)
Alexandre Godmer, Quentin Giai Gianetto
References
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Detection of peaks in MassSpectrum
objects.
Description
This function performs a data analysis pipeline to pre-process mass spectra. It provides average intensities and detects peaks using functions of R packages MALDIquant
and MALDIrppa
.
Usage
PeakDetection(x,
averageMassSpec = TRUE,
labels = NULL,
averageMassSpectraMethod = "median",
SNRdetection = 3,
binPeaks = TRUE,
PeakDetectionMethod = "MAD",
halfWindowSizeDetection = 11,
AlignMethod = "strict",
Tolerance = 0.002,
...)
Arguments
x |
a |
averageMassSpec |
a |
labels |
a |
averageMassSpectraMethod |
a |
PeakDetectionMethod |
a |
SNRdetection |
a |
binPeaks |
a |
halfWindowSizeDetection |
a |
AlignMethod |
a |
Tolerance |
a |
... |
other arguments from |
Details
The PeakDetection
function provides an analysis pipeline for MassSpectrum
objects including peaks detection and binning.
All the methods used for PeakDetection
functions are selected from MALDIquant
and MALDIrppa
packages.
Value
Returns a list of MassPeaks
objects (see MALDIquant
R package) for each mass spectrum in x
.
References
Gibb S, Strimmer K. MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics. 2012 Sep 1;28(17):2270-1. doi:10.1093/bioinformatics/bts447. Epub 2012 Jul 12. PMID: 22796955.
Javier Palarea-Albaladejo, Kevin Mclean, Frank Wright, David G E Smith, MALDIrppa: quality control and robust analysis for mass spectrometry data, Bioinformatics, Volume 34, Issue 3, 01 February 2018, Pages 522 - 523, doi:10.1093/bioinformatics/btx628
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MALDIquant")
library("MSclassifR")
## Load mass spectra and metadata
data("CitrobacterRKIspectra", "CitrobacterRKImetadata", package = "MSclassifR")
## Pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
## Detection of peaks in pre-processed mass spectra
peaks <- PeakDetection(x = spectra,
averageMassSpec = FALSE,
labels = CitrobacterRKImetadata$Strain_name_spot,
averageMassSpectraMethod = "median",
SNRdetection = 3,
binPeaks = TRUE,
halfWindowSizeDetection = 11,
AlignFrequency = 0.20,
AlignMethod = "strict",
Tolerance = 0.002)
# Plot peaks on a pre-processed mass spectrum
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
Plot mass spectra with detected peaks
Description
This function performs a plot of a AbstractMassObject
object (see the MALDIquant
R package). It can be used to highlight peaks in a mass spectrum.
Usage
PlotSpectra(SpectralData, absx="ALL", Peaks=NULL, Peaks2=NULL, col_spec=1,
col_peak=2, shape_peak=3, col_peak2=2, shape_peak2=2)
Arguments
SpectralData |
|
absx |
|
Peaks |
|
Peaks2 |
numeric |
col_spec |
color of the mass spectrum. |
col_peak |
color of the peak points corresponding to |
shape_peak |
shape of the peak points corresponding to |
col_peak2 |
color of the peak points corresponding to |
shape_peak2 |
Shape of the peak points corresponding to |
Value
A ggplot
object (see ggplot2
R package). mass-to-charge values are in x-axis and intensities in y-axis.
Examples
library("MSclassifR")
# Load mass spectra
data("CitrobacterRKIspectra", package = "MSclassifR")
# Plot raw mass spectrum
PlotSpectra(SpectralData = CitrobacterRKIspectra[[1]])
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# Plot pre-processed mass spectrum
PlotSpectra(SpectralData=spectra[[1]])
# detection of peaks in pre-processed mass spectra
peaks <- PeakDetection(x = spectra, averageMassSpec=FALSE)
# Plot peaks on pre-processed mass spectrum
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
Prediction of the category to which a mass spectrum belongs using linear regressions of mass spectra.
Description
For each mass peak in a list of mass peaks, a linear regression is performed between the mass spectrum and mass spectra corresponding to a category. This is performed for each category and associated to an Akaike Information Criterium. Next, the AIC are used to determine the belonging of a mass spectrum to a category. It also provides a probability that the mass spectrum does not belong to any of the input categories.
Usage
PredictFastClass(peaks,
mod_peaks,
Y_mod_peaks,
moz="ALL",
tolerance = 6,
toleranceStep = 2,
normalizeFun = TRUE,
noMatch = 0)
Arguments
peaks |
a list of |
mod_peaks |
an intensity matrix corresponding to mass spectra for which the category is known. Each column is a mass-to-charge value, each row corresponds to a mass spectrum. |
Y_mod_peaks |
a |
moz |
a |
tolerance |
a |
toleranceStep |
a |
normalizeFun |
a |
noMatch |
a |
Details
Godmer et al. (2025) presents a comparison of different pipelines using PredictFastClass that can help you to optimize your workflow. For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
Returns a dataframe
containing AIC criteria by category for each mass spectrum in peaks
. The AIC criterion should be minimal for the most probable category. The pred_cat
column is the predicted category for each mass spectrum in peaks
. The p_not_in_DB
is the minimal p-value of several Fisher tests testing if all the linear coefficients associated to mass spectra of a category are null. It can be interpreted as a p-value that the mass spectrum is not present in the input database.
References
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MSclassifR")
library("MALDIquant")
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-to-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
#Predict species without peak selection using a tolerance of 1 Da
res = PredictFastClass(peaks=peaks[1:5],
mod_peaks=X,
Y_mod_peaks=Y,
tolerance = 1)
#comparing predicted categories (species) and the truth
cbind(res$pred_cat,as.character(Y[1:5]))
# The method can be applied after a peak selection step
a <- SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(20:40),
Sampling = "up")
#Predict species from selected peaks using a tolerance of 1 Da
res = PredictFastClass(peaks=peaks[1:5],
moz = a$sel_moz,
mod_peaks=X,
Y_mod_peaks=Y, tolerance = 1)
#comparing predicted categories (species) and the truth
cbind(res$pred_cat,as.character(Y[1:5]))
Prediction of the category to which a mass spectrum belongs from a multinomial logistic regression model
Description
This function predicts the category (species, phenotypes...) to which a mass spectrum belongs from a set of shortlisted mass-to-charge values of interest and a short-listed multinomial logistic regression model (see LogReg
).
Usage
PredictLogReg(peaks,
model,
moz,
tolerance = 6,
toleranceStep = 2,
normalizeFun = TRUE,
noMatch=0,
Reference = NULL)
Arguments
peaks |
a list of |
model |
a model or a list of models estimated from a set of shortlisted mass-to-charge values (output of the |
moz |
a |
tolerance |
a |
toleranceStep |
a |
normalizeFun |
a |
noMatch |
a |
Reference |
a |
Details
The PredictLogReg
function allows predicting the membership of a mass spectrum to a category from a multinomial regression model. The mass spectrum from the peaks
object will be matched to the discriminant mass-to-chage (m/z) values (sel_moz
object from the SelectionVar
or SelectionVarStat
functions) with a tolerance between 2 m/z and defined by the tolerance
parameter (by default this value is 6 Da). If a repetition of a same m/z occurs in the selection, only the m/z that is closest in mass peaks (moz
) is used. When no match, intensity values are replaced by the noMatch
argument. If no m/z values from peaks
object matched with the m/z in the object moz
, the tolerance will be increased according to a numeric value defined in the toleranceStep
parameter and a warning will be notified. Note that it is possible to not perform the SelectionVar
function prior to the PredictLogReg
function, and to replace the argument moz
by all the m/z values present in a mass spectrum.
Godmer et al. (2025) presents a comparison of different pipelines using PredictLogReg that can help you to optimize your workflow. For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
Returns a dataframe
containing probabilities of membership by category for each mass spectrum in peaks
. The method used is provided in the method
column. The comb_fisher
method is the result of the Fisher's method when merging probabilities of membership of used prediction models.The max_vote
method is the result of the maximum voting from used prediction models.
If the Reference
parameter is not null, the function returns:
Confusion.Matrix |
a |
Gobal.stat |
a |
Details.stat |
a |
Correct.ClassificationFreq |
a |
Incorrect.ClassificationFreq |
a |
References
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-to-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Selection of discriminant mass-to-charge values using RFERF
# with 5 to 10 variables,
# without sampling method and trained
# with the Accuracy coefficient metric
a <- MSclassifR::SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(5:10))
sel_moz=a$sel_moz
###############################################################################
## 3. Perform LogReg from shortlisted discriminant mass-to-charge values
# linear multinomial regression
# without sampling mehod and
# trained with the Kappa coefficient metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
Metric = "Kappa")
# Estimated model:
model_lm
# nonlinear multinomial regression using neural networks
# with up-sampling method and
# trained with the Kappa coefficient metric
model_nn=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="nnet",
Metric = "Kappa",
Sampling = "up")
# Estimated model:
model_nn
# nonlinear multinomial regression using random forests
# without down-sampling method and
# trained with the Kappa coefficient metric
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="rf",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_rf
# nonlinear multinomial regression using xgboost
# with down-sampling method and
# trained with the Kappa coefficient metric
model_xgb=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="xgb",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_xgb
# nonlinear multinomial regression using svm
# with down-sampling method and
# trained with the Kappa coefficient metric
model_svm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="svm",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_svm
# Of note, you can also load a model already saved
# (see example in LogReg function) for the next step
###############################################################################
## 4. Probabilities of belonging to each category for the mass spectra
## and associated statitics
# Collect all the estimated models in a list
Models <- list(model_lm$train_mod,
model_nn$train_mod,
model_rf$train_mod,
model_xgb$train_mod,
model_svm$train_mod)
# Predict classes of mass spectra with 6 Da of tolerance for matching peaks.
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[c(1:5)],
model = Models,
moz = sel_moz,
tolerance = 6,
Reference = Y[c(1:5)])
################################################################################
## 5. Example of meta-classifiers based on several random forest models
## to optimize a Kappa value using the SMOTE method for imbalanced datasets.
## -> a merge of the prediction probabilities using the Fisher's method
## leads generally to robust prediction models.
#Selecting peaks with mda method
a=SelectionVar(X,Y,MethodSelection="mda",Ntree=5*ncol(X))
sel_moz=a$sel_moz
#Building 4 Random Forest models
models=NULL;nbm=4;
for (i in 1:nbm){
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=5,
repeats=5,
kind="rf",
Metric = "Kappa",
Sampling = "smote")
models <- c(models,list(model_rf$train_mod))
}
#Combining their prediction probabilities
prob_cat=MSclassifR::PredictLogReg(peaks = peaks,model = models,moz = sel_moz,
tolerance = 6,Reference = Y)
Variable selection using random forests, logistic regression methods or sparse partial least squares discriminant analysis (sPLS-DA).
Description
This function performs variable selection (i.e. selection of discriminant mass-to-charge values) using either recursive feature elimination (RFE) algorithm with Random Forest, or logistic regression model, or sparse partial least squares discriminant analysis (sPLS-DA) or methods based on the distribution of variable importances of random forests.
Usage
SelectionVar(X,
Y,
MethodSelection = c("RFERF", "RFEGlmnet", "VSURF", "sPLSDA", "mda",
"cvp", "boruta"),
MethodValidation = c("cv", "repeatedcv", "LOOCV"),
PreProcessing = c("center", "scale", "nzv", "corr"),
Metric = c("Kappa", "Accuracy"),
Sampling = c("no", "up","down", "smote"),
NumberCV = NULL,
RepeatsCV = NULL,
Sizes,
Ntree = 1000,
ncores = 2,
threshold = 0.01,
ncomp.max = 10,
nbf=0)
Arguments
X |
a numeric |
Y |
a |
MethodSelection |
a |
MethodValidation |
a |
NumberCV |
a |
RepeatsCV |
a |
PreProcessing |
a |
Metric |
a |
Sampling |
a |
Sizes |
a numeric |
Ntree |
a |
ncores |
a |
ncomp.max |
a |
threshold |
a |
nbf |
a |
Details
The selection of variables can be carried out with two different objectives: either to find a minimum number of variables allowing to obtain the highest possible accuracy (or Kappa coefficient), which involves the possible elimination of variables correlated between them (i.e. not bringing any additional predictive power with respect to some other variables); or to find all the variables in the dataset with a potential predictive power ("discriminant" variables).
The VSURF
method attempts to accomplish only the first objective.
The mda
and cvp
methods attempt to accomplish the second objective, as do the methods available in the SelectionVarStat
function of our MSclassifR
R package.
The RFERF
, RFEGlmnet
and sPLSDA
methods take as input a number of variables to be selected(Sizes
argument), and can therefore be used with both objectives.
Within the framework of the second objective, either the mda
or cvp
methods can be used to estimate a number of discriminant variables from the importances of variables. The SelectionVarStat
function can also be used to estimate this number from distributions of p-values. Of note, be sure that the Ntree
argument is high enough to get a robust estimation with the mda
or cvp
methods.
The "RFEGlmnet"
and "RFERF"
methods are based on recursive feature elimination and can either optimize the kappa coefficient or the accuracy as metrics when selecting variables.
The "sPLSDA"
method selects variables from the ones kept in latent components of the sparse PLS-DA model using an automatic choice of the number of components (when the balanced classification error rate (BER) reaches a plateau - see argument threshold
).
The "mda"
and "cvp"
methods use the distribution of variable importances to estimate the number of discriminant features (mass-to-charge values). Briefly, the distribution of variable importances for useless (not discriminant) features is firstly estimated from negative importance variables by the method proposed in section 2.6 of Janitza et al.(2018). Next, the following mixture model is assumed:
F(x)=\pi\times F_u(x)+(1-\pi)\times F_d(x)
where F
is the empirical cumulative distribution of variable importances of all the features, F_u
the one of the useless features, F_d
the one of the discriminative features, and \pi
is the proportion of useless features in the dataset.
From the estimated distribution of useless features, we can estimate quantile values x_q
and compute \epsilon_q=min(F(x_q)/q;1)
for each quantile q
. The minimum of the \epsilon_q
corresponds to the estimated proportion of useless features in the dataset, what allows estimating the number of discriminant features by N_d=floor(N\times (1 - \pi))
where N is the total number of features. Next, the N_d
features with the highest variable importances are selected.
The "VSURF"
and "sPLSDA"
methods use the minimum mean out-of-bag (OOB) and balanced classification error rate (BER) metrics respectively.
The "boruta"
method selects variables from the Boruta algorithm (see Kursa and Rudnicki (2010)). The maxRuns
argument of the Boruta
function is fixed to 3*ncol(X)
to perform the selection of variables.
For Sampling
methods available for unbalanced data: "up"
corresponds to the up-sampling method which consists of random sampling (with replacement) so that the minority class is the same size as the majority class; "down"
corresponds to the down-sampling method randomly which consists of random sampling (without replacement) of the majority class so that their class frequencies match the minority class; "smote"
corresponds to the Synthetic Minority Over sampling Technique (SMOTE) specific algorithm for data augmentation which consist of creates new data from minority class using the K Nearest Neighbor algorithm.
See rfe
in the caret
R package, VSURF
in the VSURF
R package, splsda
in the mixOmics
R package, importance
function in the randomForest
R package, CVPVI
function in the vita
R package, and Boruta
function in the Boruta
R package for more details.
Godmer et al. (2025) presents a comparison of different pipelines using SelectionVar that can help you to optimize your workflow. For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
A list composed of:
sel_moz |
a |
For the "RFERF"
and "RFEGlmnet"
methods, it also returns the results of the rfe
function of the caret
R package.
For the "VSURF"
method, it also returns the results of the results of the VSURF
function of the VSURF
R package.
For the "sPLSDA"
method, it also returns the following items:
Raw_data |
a horizontal bar plot and containing the contribution of features on each component. |
selected_variables |
|
For the "mda"
and "cvp"
methods, it also returns the following items:
nb_to_sel |
a numeric value corresponding to an estimated number of mass-over-chage values where the intensities are significantly different between categories (see details). |
imp_sel |
a vector containing the variable importances for the selected features. |
References
Kuhn, Max. (2012). The caret Package. Journal of Statistical Software. 28.
Genuer, Robin, Jean-Michel Poggi and Christine Tuleau-Malot. VSURF : An R Package for Variable Selection Using Random Forests. R J. 7 (2015): 19.
Friedman J, Hastie T, Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.
Kim-Anh Le Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois, Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics. Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1 (January 2002), 321–357.
Branco P, Ribeiro R, Torgo L (2016). “UBL: an R Package for Utility-Based Learning.” CoRR, abs/1604.08079.
Janitza, S., Celik, E., Boulesteix, A. L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12, 885-915.
Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), p. 1-13.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-to-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Perform variables selection using SelectionVar with RFE and random forest
# with 5 to 10 variables,
# up sampling method and trained with the Kappa coefficient metric
a <- SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(5:10),
Sampling = "up")
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-to-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=a$sel_moz,col_spec="blue",col_peak="black")
###############################################################################
## 3. Perform variables selection using SelectionVar with VSURF
# This function can last a few minutes
b <- SelectionVar(X, Y, MethodSelection = c("VSURF"))
summary(b$result)
###############################################################################
## 4. Perform variables selection using SelectionVar with "mda" or "cvp"
# option 1: Using mean decrease in accuracy
# with no sampling method
c <- SelectionVar(X,Y,MethodSelection="mda",Ntree=10*ncol(X))
# Estimation of the number of peaks to discriminate species
c$nb_to_sel
# Discriminant mass-to-charge values
c$sel_moz
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-to-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=c$sel_moz,col_spec="blue",col_peak="black")
# option 2: Using cross-validated permutation variable importance measures (more "time-consuming")
# with no sampling method
d <- SelectionVar(X,Y,MethodSelection="cvp",NumberCV=2,ncores=2,Ntree=1000)
# Estimation of the number of peaks to discriminate species
d$nb_to_sel
# Discriminant mass-to-charge values
d$sel_moz
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-to-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=d$sel_moz,col_spec="blue",col_peak="black")
# Mass-over charge values found with both methods ("mda" and "cvp")
intersect(c$sel_moz,d$sel_moz)
Variable selection using multiple statistical tests.
Description
This function performs a statistical test for each mass-to-charge value to determine which are discriminants between categories. Using the distribution of resulting multiple p-values, it determines an expected number of discriminant features, and adjusted p-values that can be used to control a false discovery rate threshold.
Usage
SelectionVarStat(X,
Y,
stat.test = "Limma",
pi0.method="abh",
fdr=0.05,
Sampling = c("no", "up","down", "smote"))
Arguments
X |
a |
Y |
a |
stat.test |
a |
pi0.method |
a |
fdr |
a |
Sampling |
a |
Details
The SelectionVarStat
function allows performing "quick" classification of mass-to-charge values. It tries to find all the mass-to-charge values (or the number of mass-to-charge values) that are discriminant between categories. This can conduct to select "correlated" mass-to-charge values (i.e. associated to intensities evolving similarly between categories). By default, multiple moderated t-tests using the limma
R package (bayesian regularization of variances) are performed and the p-values are corrected using an adaptive Benjamini and Hochberg procedure to control the false discovery rate. Different ways to estimate the proportion of true null hypotheses (object pi0
returned by the function - see the cp4p
R package for details) can be used for the adaptive Benjamini-Hochberg procedure ("abh
" by defaut).
Godmer et al. (2025) presents a comparison of different pipelines using SelectionVarStat that can help you to optimize your workflow. For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
A list composed of:
nb_to_sel |
a |
sel_moz |
a |
ap |
a |
References
Gianetto, Quentin & Combes, Florence & Ramus, Claire & Bruley, Christophe & Coute, Yohann & Burger, Thomas. (2015). Technical Brief Calibration Plot for Proteomics (CP4P): A graphical tool to visually check the assumptions underlying FDR control in quantitative experiments. Proteomics. 16. 10.1002/pmic.201500189.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot)
# matrix with intensities of peaks arranged in rows (each column is a mass-to-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Estimate the optimal number of peaks to discriminate the different species
OptiPeaks <- SelectionVarStat(X,
Y,
stat.test = "Limma",
pi0.method="abh",
fdr=0.05,
Sampling="smote")
## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter)
OptiPeaks$nb_to_sel
## discriminant mass-to-chage values estimated using a 5 per cent false discovery rate
OptiPeaks$sel_moz
## p-values and adjusted p-values estimated for all the tested mass-to-charge values
OptiPeaks$ap$adjp
Function performing post acquisition signal processing
Description
This function performs post acquisition signal processing for list
of MassSpectrum
objects using commonly used methods : transform intensities ("sqrt"), smoothing ("Wavelet"), remove baseline ("SNIP"), calibrate intensities ("TIC") and align spectra. Methods used are selected from the MALDIquant
and MALDIrppa
R packages.
Usage
SignalProcessing(x,
transformIntensity_method = "sqrt",
smoothing_method = "Wavelet",
removeBaseline_method = "SNIP",
removeBaseline_iterations = 25,
calibrateIntensity_method = "TIC",
alignSpectra_NoiseMethod = "MAD",
alignSpectra_method = "lowess",
alignSpectra_halfWs = 11,
alignSpectra_SN = 3,
tolerance_align = 0.002,
referenceSpectra = NULL,
minFrequency= 0.5,
binPeaks_method = "strict",
keepReferenceSpectra = FALSE,
...)
Arguments
x |
a |
transformIntensity_method |
a |
smoothing_method |
a |
removeBaseline_method |
a |
removeBaseline_iterations |
a |
calibrateIntensity_method |
a |
alignSpectra_NoiseMethod |
a |
alignSpectra_method |
a |
alignSpectra_halfWs |
a |
alignSpectra_SN |
a |
tolerance_align |
a |
referenceSpectra |
a |
minFrequency |
a |
binPeaks_method |
a |
keepReferenceSpectra |
a |
... |
other arguments from |
Details
The SignalProcessing
function provides an analysis pipeline for MassSpectrum
objects including intensity transformation, smoothing, removing baseline.
The Wavelet
method relies on the wavShrink
function of the wmtsa
package and its dependencies (now archived by CRAN). The original C code by William Constantine and Keith L. Davidson, in turn including copyrighted routines by Insightful Corp., has been revised and included into MALDIrppa
for the method to work.
All the methods used for SignalProcessing
functions are selected from MALDIquant
and MALDIrppa
packages.
Godmer et al. (2025) presents a comparison of different pre-processing workflows that can help you to optimize your workflow.For a comprehensive guide, additional applications, and detailed examples of using this package, please visit our GitHub repository: here.
Value
A list of modified MassSpectrum
objects (see MALDIquant
R package) according to chosen arguments. If the argument referenceSpectra
is not completed and the argument keepReferenceSpectra
is TRUE
, a list containing the MassSpectrum
objects modified named "spectra"
and the created reference spectrum named "RefS"
is returned.
References
Gibb S, Strimmer K. MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics. 2012 Sep 1;28(17):2270-1. doi:10.1093/bioinformatics/bts447. Epub 2012 Jul 12. PMID: 22796955.
Javier Palarea-Albaladejo, Kevin Mclean, Frank Wright, David G E Smith, MALDIrppa: quality control and robust analysis for mass spectrometry data, Bioinformatics, Volume 34, Issue 3, 01 February 2018, Pages 522 - 523, doi:10.1093/bioinformatics/btx628
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
Examples
library("MALDIquant")
library("MSclassifR")
## Load mass spectra
data("CitrobacterRKIspectra", package = "MSclassifR")
# plot first unprocessed mass spectrum
PlotSpectra(SpectralData=CitrobacterRKIspectra[[1]], col_spec="blue")
## spectral treatment
spectra <- SignalProcessing(CitrobacterRKIspectra,
transformIntensity_method = "sqrt",
smoothing_method = "Wavelet",
removeBaseline_method = "SNIP",
removeBaseline_iterations = 25,
calibrateIntensity_method = "TIC",
alignSpectra_Method = "MAD",
alignSpectra_halfWs = 11,
alignSpectra_SN = 3,
tolerance_align = 0.002)
# plot first processed mass spectrum
PlotSpectra(SpectralData=spectra[[1]], col_spec="blue")
Function calculating the distance between two vectors.
Description
This function calculates the distance between two vectors using the specified distance metric. Distance metrics available are "p-norm", "Chebyshev", "Canberra", "Overlap", "HEOM" (Heterogeneous Euclidean-Overlap Metric), "HVDM" (Heterogeneous Value Difference Metric). Used in the fast_find_neighbors
function of our package.
Usage
calculate_distance(x, y, nominal_indices, p_code)
Arguments
x |
Vector of numeric and/or categorical values. |
y |
Vector of numeric and/or categorical values. |
nominal_indices |
Vector indicating which positions in
|
p_code |
Numeric code representing the distance metric to use:
|
Details
Different distance metrics handle nominal variables differently:
For pure numeric metrics (p >= 1, p = 0, p = -1), nominal features are ignored
For the Overlap metric (p = -2), only nominal features are considered
For HEOM (p = -3), numeric features use normalized Euclidean distance while nominal features use overlap distance (1 if different, 0 if same)
For HVDM (p = -4), a specialized metric combines normalized differences for numeric features and value difference metric for nominal features
Value
A numeric value representing the distance between the two vectors.
Function joining two tables based not on exact matches
Description
This function joins two tables based on a distance metric of one or more columns. It gives the same results that the distance_left_join
function of the fuzzyjoin
R package, but its execution time is faster. It is used to match mass spectra to short-listed mass-to-charge values in PredictLogReg
and PredictFastClass
functions.
Usage
d_left_join (x, y, by = NULL, method = "euclidean", max_dist = 1,
distance_col = NULL)
Arguments
x |
A |
y |
A |
by |
Columns by which to join the two tables. |
method |
Method to use for computing distance, either "euclidean" (default) or "manhattan". |
max_dist |
A |
distance_col |
A |
Value
Returns a data frame that contains the results of joining the x
dataset onto the y
datasets, where all rows from x
are preserved and matching data from y
is added according to the specified criteria.
Function finding k Nearest Neighbors for each row of a matrix
Description
This function finds the k nearest neighbors for each row in a matrix using the specified distance metric. Distance metrics available are "p-norm", "Chebyshev", "Canberra", "Overlap", "HEOM" (Heterogeneous Euclidean-Overlap Metric), "HVDM" (Heterogeneous Value Difference Metric). See the calculate_distance
function of our package for more details on the distances.
Usage
fast_find_neighbors(data, nominal_indices, p_code, k)
Arguments
data |
Matrix where the k nearest neighbors for each row are searched. |
nominal_indices |
Vector of column indices indicating which features are
categorical (nominal) variables. This is crucial for proper distance calculation
as nominal and numeric features require different handling. For example, if columns
2 and 5 contain categorical variables, nominal_indices should be |
p_code |
Numeric code representing the distance metric to use:
|
k |
Number of nearest neighbors to find. |
Value
A matrix where each row contains the indices of the k nearest neighbors for the corresponding example.
See Also
Function generating synthetic examples using SMOTE
Description
This function generates synthetic examples using the SMOTE algorithm.
For each example in the minority class, this function generates synthetic examples by interpolating between the example and its k nearest neighbors. This function is used in the smote_classif
function.
Usage
fast_generate_synthetic(dat, k, n, p_code)
Arguments
dat |
A |
k |
Number of nearest neighbors to consider. |
n |
Number of synthetic examples to generate. |
p_code |
Numeric code representing the distance metric to use:
|
Value
A data frame containing the generated synthetic examples.
See Also
SMOTE for Classification Problems
Description
This function performs Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance in classification problems. This implementation supports various distance metrics, different balancing strategies, and handles mixed data types (numeric and categorical). Its execution time is faster than the SmoteClassif
of the UBL
R package.
Usage
smote_classif (form, dat, C.perc = "balance", k = 5, repl = FALSE,
dist = "Euclidean", p = 2)
Arguments
form |
A model formula identifying the target variable (e.g., |
dat |
A data frame containing the imbalanced dataset. |
C.perc |
Either |
k |
Integer specifying the number of nearest neighbors to use when generating synthetic examples (default: 5). |
repl |
Logical, whether to allow sampling with replacement when under-sampling (default: FALSE). |
dist |
Distance metric to use for nearest neighbor calculations. Supported metrics: |
p |
Parameter used when |
Details
If you use this package in your research, please cite the associated publication (doi:10.1016/j.eswa.2025.128796).
Value
A data frame with the same structure as the input, but with rebalanced classes according to the specified strategy.
References
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.doi:10.1613/jair.953.
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, doi:10.1016/j.eswa.2025.128796.
See Also
fast_generate_synthetic
for the implementation of
synthetic example generation
Examples
# Load the iris dataset
data(iris)
# Create an imbalanced dataset by taking a subset
imbal_iris <- iris[c(1:40, 51:100, 101:110), ]
table(imbal_iris$Species) # Show class distribution
# Balance classes using the default "balance" strategy
balanced_iris <- smote_classif(Species ~ ., imbal_iris)
table(balanced_iris$Species) # Show balanced distribution
# Custom over/under-sampling
custom_iris <- smote_classif(Species ~ ., imbal_iris,
C.perc = list(setosa = 0.8,
versicolor = 1,
virginica = 3))
table(custom_iris$Species)