Version: 1.0.2
Date: 2025-04-09
Title: Machine Learning Immunogenicity and Vaccine Response Analysis
Author: Ivan Tomic
Description: Used for analyzing immune responses and predicting vaccine efficacy using machine learning and advanced data processing techniques. 'Immunaut' integrates both unsupervised and supervised learning methods, managing outliers and capturing immune response variability. It performs multiple rounds of predictive model testing to identify robust immunogenicity signatures that can predict vaccine responsiveness. The platform is designed to handle high-dimensional immune data, enabling researchers to uncover immune predictors and refine personalized vaccination strategies across diverse populations.
Maintainer: Ivan Tomic <info@ivantomic.com>
Packaged: 2025-04-09 14:50:39 UTC; login
Imports: cluster, plyr, dplyr, caret, pROC, PRROC, stats, rlang, Rtsne, dbscan, FNN, igraph, fpc, mclust, ggplot2, grDevices, RColorBrewer, R.utils, clusterSim, parallel, doParallel
Depends: R (≥ 3.4.0)
URL: https://github.com/atomiclaboratory/immunaut, https://atomic-lab.org
BugReports: https://github.com/atomiclaboratory/immunaut/issues
License: GPL-3
Encoding: UTF-8
LazyLoad: yes
LazyData: yes
RoxygenNote: 7.3.2.9000
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2025-04-09 17:10:02 UTC
Automated Machine Learning Model Building
Description
This function automates the process of building machine learning models using the caret package. It supports both binary and multi-class classification and allows users to specify a list of machine learning algorithms to be trained on the dataset. The function splits the dataset into training and testing sets, applies preprocessing steps, and trains models using cross-validation. It computes relevant performance metrics such as the confusion matrix and, for binary classification, AUROC and prAUC.
Usage
auto_simon_ml(dataset_ml, settings)
Arguments
dataset_ml |
A data frame containing the dataset for training. All columns except the outcome column should contain the features. |
settings |
A list containing the model-training parameters, for example selectedPackages (the caret algorithms to train), selectedPartitionSplit (the train/test split), preProcessDataset (preprocessing steps), excludedColumns, and trainingTimeout (see Details and the example below). |
Details
The function performs preprocessing (e.g., centering, scaling, and imputation of missing values) on the dataset based on the provided settings. It splits the data into training and testing sets using the specified partition, trains models using cross-validation, and computes performance metrics.
For binary classification problems, the function calculates AUROC and prAUC. For multi-class classification, it calculates macro-averaged AUROC, though prAUC is not used.
The function returns a list of trained models along with their performance metrics, including confusion matrix, variable importance, and post-resample metrics.
Value
A list where each element corresponds to a trained model for one of the algorithms specified in settings$selectedPackages. Each element contains:
- info: General information about the model, including resampling indices, problem type, and outcome mapping.
- training: The trained model object and variable importance.
- predictions: Predictions on the test set, including probabilities, confusion matrix, post-resample statistics, AUROC (binary classification), and prAUC (binary classification).
Examples
## Not run:
dataset <- read.csv("fc_wo_noise.csv", header = TRUE, row.names = 1)
# Generate a file header for the dataset to use in downstream analysis
file_header <- generate_file_header(dataset)
settings <- list(
fileHeader = file_header,
# Columns selected for analysis
selectedColumns = c("ExampleColumn1", "ExampleColumn2"),
clusterType = "Louvain",
removeNA = TRUE,
preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
target_clusters_range = c(3,4),
resolution_increments = c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
min_modularities = c(0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9),
pickBestClusterMethod = "Modularity",
seed = 1337
)
result <- immunaut(dataset, settings)
dataset_ml <- result$dataset$original
dataset_ml$pandora_cluster <- result$tsne_clust$info.norm$pandora_cluster # cluster assignments from immunaut()
dataset_ml <- dplyr::rename(dataset_ml, immunaut = pandora_cluster)
dataset_ml <- dataset_ml[, c("immunaut", setdiff(names(dataset_ml), "immunaut"))]
settings_ml <- list(
excludedColumns = c("ExampleColumn0"),
preProcessDataset = c("scale", "center", "medianImpute", "corr", "zv", "nzv"),
selectedPartitionSplit = 0.7, # train/test partition split (illustrative value)
selectedPackages = c("rf", "RRF", "RRFglobal", "rpart2", "c5.0", "sparseLDA",
"gcvEarth", "cforest", "gaussPRPoly", "monmlp", "slda", "spls"),
trainingTimeout = 180 # Timeout 3 minutes
)
ml_results <- auto_simon_ml(dataset_ml, settings_ml)
## End(Not run)
Perform t-Distributed Stochastic Neighbor Embedding (t-SNE)
Description
The calculate_tsne
function reduces high-dimensional data into a 2-dimensional space using
t-SNE for visualization and analysis. This function dynamically adjusts t-SNE parameters
based on the characteristics of the dataset, ensuring robust handling of edge cases.
It also performs data validation, such as checking for sufficient data, removing zero variance
columns, and adjusting perplexity for optimal performance.
Usage
calculate_tsne(dataset, settings, removeGroups = TRUE)
Arguments
dataset |
A data frame or matrix containing the dataset to be processed. Must contain numeric columns. |
settings |
A list of settings for t-SNE, which may include parameters such as perplexity, max_iter, theta, eta, exaggeration_factor, and initial_dims. |
removeGroups |
Logical, indicating whether to remove grouping variables before performing t-SNE. Default is TRUE. |
Value
A list containing:
- info.norm: The dataset with the t-SNE coordinates (tsne1, tsne2) added.
- tsne.norm: The output from the Rtsne function.
- tsne_columns: The names of the t-SNE columns used.
- initial_dims: The number of dimensions used in the initial PCA step.
- perplexity: The perplexity parameter used.
- exaggeration_factor: The exaggeration factor used.
- max_iter: The number of iterations used.
- theta: The Barnes-Hut approximation parameter used.
- eta: The learning rate used.
Cast All Strings to NA
Description
This function processes the columns of a given dataset, converting all non-numeric string values
(including factor columns converted to character) to NA
. It excludes specified columns from
this transformation. Columns that are numeric or of other types are left unchanged.
Usage
castAllStringsToNA(dataset, excludeColumns = c())
Arguments
dataset |
A data frame containing the dataset to be processed. |
excludeColumns |
A character vector specifying the names of columns to be excluded from processing.
These columns will not have any values converted to NA. |
Details
The function iterates through the specified columns (excluding those listed in excludeColumns
),
converts factors to character, and then attempts to convert character values to numeric.
Any non-numeric strings will be converted to NA
. This is useful for cleaning datasets that may contain
mixed data types.
Value
A data frame where non-numeric strings in the included columns are replaced with NA
, and all other columns remain unchanged.
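A small illustrative example (assuming the function is exported by the package):
library(immunaut)
df <- data.frame(id = c("A1", "A2", "A3"),   # character identifiers
                 dose = c("10", "x", "30"),  # mixed numeric and non-numeric strings
                 titre = c(1.2, 3.4, 5.6))   # numeric column, left unchanged
clean <- castAllStringsToNA(df, excludeColumns = c("id"))
# "id" is excluded and keeps its strings; in "dose", the non-numeric "x" becomes NA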
Perform Density-Based Clustering on t-SNE Results Using DBSCAN
Description
This function applies Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
on t-SNE results to identify clusters and detect noise points. It dynamically calculates the
MinPts
and eps
parameters based on the t-SNE results and settings provided. Additionally,
the function computes silhouette scores to evaluate cluster quality and returns cluster centroids
along with cluster sizes.
Usage
cluster_tsne_density(info.norm, tsne.norm, settings)
Arguments
info.norm |
A data frame containing the normalized data on which the t-SNE analysis was carried out. |
tsne.norm |
The t-SNE results object, including the 2D t-SNE coordinates (the Y matrix). |
settings |
A list of settings for the DBSCAN clustering. These settings include minPtsAdjustmentFactor and epsQuantile (see Details). |
Details
The function first calculates MinPts
based on the dimensionality of the t-SNE data and adjusts
it using the provided minPtsAdjustmentFactor
. The eps
value is determined dynamically from the
k-nearest neighbors distance using the quantile specified by epsQuantile
. DBSCAN is then applied
to the t-SNE data, and any NA values in the cluster assignments are replaced with a predefined
outlier cluster ID (100). Finally, the function calculates cluster centroids, sizes, and silhouette
scores to evaluate cluster separation and quality.
Value
A list containing:
- info.norm: The input data frame with an additional pandora_cluster column for cluster assignments.
- cluster_data: A data frame with cluster centroids and labeled clusters.
- avg_silhouette_score: The average silhouette score, providing a measure of clustering quality.
Perform Hierarchical Clustering on t-SNE Results
Description
This function applies hierarchical clustering to t-SNE results, allowing for the identification of clusters in a reduced-dimensional space. The function also handles outliers by using DBSCAN for initial noise detection, and provides options to include or exclude outliers from the clustering process. Silhouette scores are computed to evaluate clustering quality, and cluster centroids are returned for visualization.
Usage
cluster_tsne_hierarchical(info.norm, tsne.norm, settings)
Arguments
info.norm |
A data frame containing the normalized data on which the t-SNE analysis was carried out. |
tsne.norm |
The t-SNE results object, including the 2D t-SNE coordinates (the Y matrix). |
settings |
A list of settings for the clustering analysis, including options that control whether DBSCAN-detected outliers are excluded from the clustering (see Details). |
Details
The function first uses DBSCAN to detect outliers (marked as cluster "100") and then applies hierarchical clustering on the t-SNE results, either including or excluding the outliers depending on the settings. Silhouette scores are computed to assess the quality of the clustering. Cluster centroids are calculated and returned, along with the sizes of each cluster. Outliers, if detected, are handled separately in the final centroid calculation.
Value
A list containing:
- info.norm: The input data frame with an additional pandora_cluster column for cluster assignments.
- cluster_data: A data frame with cluster centroids and labeled clusters.
- avg_silhouette_score: The average silhouette score, providing a measure of clustering quality.
Perform KNN and Louvain Clustering on t-SNE Results
Description
This function performs clustering on t-SNE results by first applying K-Nearest Neighbors (KNN) to construct a graph, and then using the Louvain method for community detection. The function dynamically adjusts KNN parameters based on the size of the dataset, ensuring scalability. Additionally, it computes the silhouette score to evaluate cluster quality and calculates cluster centroids for visualization.
Usage
cluster_tsne_knn_louvain(
info.norm,
tsne.norm,
settings,
resolution_increment = 0.1,
min_modularity = 0.5
)
Arguments
info.norm |
A data frame containing the normalized data on which the t-SNE analysis was carried out. |
tsne.norm |
A list containing the t-SNE results, including a 2D t-SNE coordinate matrix in the Y element. |
settings |
A list of settings for the analysis, including the clustering options used to build the KNN graph and evaluate the Louvain partition (see Details). |
resolution_increment |
The step size for incrementing the Louvain clustering resolution. Defaults to 0.1. |
min_modularity |
The minimum modularity score allowed for a valid clustering. Defaults to 0.5. |
Details
This function begins by constructing a KNN graph from the t-SNE results, then applies the Louvain algorithm for community detection. The KNN parameter is dynamically adjusted based on the size of the dataset to ensure scalability. The function evaluates clustering quality using silhouette scores and calculates cluster centroids for visualization. NA cluster assignments are handled by assigning them to a separate cluster labeled as "100."
Value
A list containing the following elements:
- info.norm: The input data frame with an additional pandora_cluster column for cluster assignments.
- cluster_data: A data frame containing cluster centroids and cluster labels.
- avg_silhouette_score: The average silhouette score, a measure of clustering quality.
- modularity: The modularity score of the Louvain clustering.
- num_clusters: The number of clusters found.
Apply Mclust Clustering on t-SNE Results
Description
This function performs Mclust clustering on the 2D t-SNE results, which are derived from high-dimensional data. It includes an initial outlier detection step using DBSCAN, and the user can specify whether to exclude outliers from the clustering process. Silhouette scores are computed to evaluate the quality of the clustering, and cluster centroids are returned for visualization, with outliers handled separately.
Usage
cluster_tsne_mclust(info.norm, tsne.norm, settings)
Arguments
info.norm |
A data frame containing the normalized data on which the t-SNE analysis was carried out. |
tsne.norm |
The t-SNE results object, including the 2D t-SNE coordinates (the Y matrix). |
settings |
A list of settings for the clustering analysis, including options that control whether DBSCAN-detected outliers are excluded from the clustering (see Details). |
Details
The function first uses DBSCAN to detect outliers (marked as cluster "100") and then applies Mclust clustering on the t-SNE results. Outliers can be either included or excluded from the clustering, depending on the settings. Silhouette scores are calculated to assess the quality of the clustering. Cluster centroids are returned, along with the sizes of each cluster, and outliers are handled separately in the centroid calculation.
Value
A list containing:
- info.norm: The input data frame with an additional pandora_cluster column for cluster assignments.
- cluster_data: A data frame with cluster centroids and labeled clusters.
- avg_silhouette_score: The average silhouette score, providing a measure of clustering quality.
Find Optimal Resolution for Louvain Clustering
Description
This function iterates over a range of resolution values to find the optimal resolution for Louvain clustering, balancing the number of clusters and modularity. It aims to identify a resolution that results in a reasonable number of clusters while maintaining a high modularity score.
Usage
find_optimal_resolution(
graph,
start_resolution = 0.1,
end_resolution = 10,
resolution_increment = 0.1,
min_modularity = 0.3,
target_clusters_range = c(3, 6)
)
Arguments
graph |
An igraph graph object on which Louvain clustering will be performed. |
start_resolution |
Numeric. The starting resolution for the Louvain algorithm. Default is 0.1. |
end_resolution |
Numeric. The maximum resolution to test. Default is 10. |
resolution_increment |
Numeric. The increment to adjust the resolution at each step. Default is 0.1. |
min_modularity |
Numeric. The minimum acceptable modularity for valid clusterings. Default is 0.3. |
target_clusters_range |
Numeric vector of length 2. Specifies the acceptable range for the number of clusters (inclusive). Default is c(3, 6). |
Details
The function performs Louvain clustering at different resolutions, starting from start_resolution
and
ending at end_resolution
, incrementing by resolution_increment
at each step. At each resolution,
the function calculates the number of clusters and modularity. The results are filtered to select those
where modularity exceeds min_modularity
and the number of clusters falls within the specified range
target_clusters_range
. The optimal resolution is chosen based on the most frequent number of clusters and
the median resolution that satisfies these criteria.
Value
A list containing:
selected |
A list with the optimal resolution, best modularity, and number of clusters. |
frequent_clusters_results |
A data frame containing results for resolutions that yielded the most frequent number of clusters. |
all_results |
A data frame with the resolution, number of clusters, and modularity for all tested resolutions. |
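The sweep described in Details can be sketched roughly as follows (illustrative; the actual selection combines the most frequent cluster count with the median qualifying resolution):
library(igraph)
g <- sample_gnp(200, 0.05)                       # stand-in graph; in practice the KNN graph built from t-SNE
results <- data.frame()
for (res in seq(0.1, 2, by = 0.1)) {
  comm <- cluster_louvain(g, resolution = res)
  results <- rbind(results, data.frame(resolution = res,
                                       clusters = length(unique(membership(comm))),
                                       modularity = modularity(comm)))
}
valid <- subset(results, modularity >= 0.3 & clusters >= 3 & clusters <= 6)
if (nrow(valid) > 0) {
  common_k <- as.integer(names(which.max(table(valid$clusters))))  # most frequent cluster count
  best_res <- median(valid$resolution[valid$clusters == common_k]) # median resolution achieving it
}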
Generate a Demo Dataset with Specified Number of Clusters and Overlap
Description
This function generates a demo dataset with a specified number of subjects, features,
and desired number of clusters, ensuring that the generated clusters are not too far apart
and have some degree of overlap to simulate real-world data.
The generated dataset includes demographic information (outcome
, age
, and gender
),
as well as numeric features with a specified probability of missing values.
Usage
generate_demo_data(
n_subjects = 1000,
n_features = 200,
missing_prob = 0.1,
desired_number_clusters = 3,
cluster_overlap_sd = 15
)
Arguments
n_subjects |
Integer. The number of subjects (rows) to generate. Defaults to 1000. |
n_features |
Integer. The number of features (columns) to generate. Defaults to 200. |
missing_prob |
Numeric. The probability of introducing missing values (NA) in the feature columns. Defaults to 0.1. |
desired_number_clusters |
Integer. The approximate number of clusters to generate in the feature space. Defaults to 3. |
cluster_overlap_sd |
Numeric. The standard deviation to control cluster overlap. Defaults to 15 for more overlap. |
Details
The function generates n_features
numeric columns based on Gaussian clusters
with some overlap between clusters to simulate more realistic data. Missing values are
introduced in each feature column based on the missing_prob
.
Value
A data frame containing the generated demo dataset, with columns:
- outcome: A categorical variable with values "low" or "high".
- age: A numeric variable representing the age of the subject (range 18-90).
- gender: A categorical variable with values "male" or "female".
- Feature X: Numeric feature columns with random values and some missing data.
Examples
# Generate a demo dataset with 1000 subjects, 200 features, and 3 clusters
demo_data <- generate_demo_data(n_subjects = 1000, n_features = 200,
desired_number_clusters = 3,
cluster_overlap_sd = 15, missing_prob = 0.1)
# View the first few rows of the dataset
head(demo_data)
Generate a File Header
Description
This function generates a fileHeader object from a given data frame which includes original names and remapped names of the data frame columns.
Usage
generate_file_header(dataset)
Arguments
dataset |
The input data frame. |
Value
A data frame containing original and remapped column names.
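A short usage sketch (illustrative):
library(immunaut)
df <- data.frame("Antibody Titre (day 28)" = c(1.2, 3.4),
                 "Age" = c(34, 51), check.names = FALSE)
header <- generate_file_header(df)
header   # original column names alongside their remapped, analysis-safe names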
Main function to carry out Immunaut Analysis
Description
This function performs clustering and dimensionality reduction analysis on a dataset using user-defined settings. It handles various preprocessing steps, dimensionality reduction via t-SNE, multiple clustering methods, and generates associated plots based on user-defined or default settings.
Usage
immunaut(dataset, settings = list())
Arguments
dataset |
A data frame representing the dataset on which the analysis will be performed. The dataset must contain numeric columns for dimensionality reduction and clustering. |
settings |
A named list containing settings for the analysis. If NULL, defaults will be used. The settings list may contain options such as fileHeader, selectedColumns, excludedColumns, clusterType, removeNA, preProcessDataset, target_clusters_range, resolution_increments, min_modularities, pickBestClusterMethod, and seed (see the examples). |
Value
A list containing the following:
- tsne_calc: The t-SNE results object.
- tsne_clust: The clustering results.
- dataset: A list containing the original dataset, the preprocessed dataset, and a dataset with machine learning-ready data.
- clusters: The final cluster assignments.
- settings: The list of settings used for the analysis.
Examples
data <- matrix(runif(2000), ncol=20)
settings <- list(clusterType = "Louvain",
resolution_increments = c(0.05, 0.1),
min_modularities = c(0.3, 0.5))
result <- immunaut(data.frame(data), settings)
print(result$clusters)
Demo data set from the immunaut package, used in the package examples. It consists of a 4x4 feature matrix plus additional dummy columns that can be used for testing.
Description
Demo data set from the immunaut package, used in the package examples. It consists of a 4x4 feature matrix plus additional dummy columns that can be used for testing.
Usage
data(immunautDemo)
Format
An object of class data.frame
with 4 rows and 7 columns.
Examples
## Not run:
data(immunautDemo)
## define settings variable
settings <- list()
settings$fileHeader <- generate_file_header(immunautDemo)
# ... and other settings
results <- immunaut(immunautDemo, settings)
## End(Not run)
Demo data set from the immunaut package, used in the package examples.
Description
Demo data set from the immunaut package, used in the package examples.
Usage
data(immunautLAIV)
Format
An object of class data.frame
with 244 rows and 32 columns.
Examples
## Not run:
data(immunautLAIV)
## define settings variable
settings <- list()
settings$fileHeader <- generate_file_header(immunautLAIV)
# ... and other settings
results <- immunaut(immunautLAIV, settings)
## End(Not run)
Is Numeric
Description
Determines whether a variable is a number or a numeric string
Usage
isNumeric(x)
Arguments
x |
Variable to be checked |
Value
Logical indicating whether x is numeric and non-NA
Check if a Request Variable is Empty
Description
Checks if the given variable is empty and optionally logs the variable name.
Usage
is_var_empty(variable)
Arguments
variable |
The variable to check. |
Value
Logical: TRUE if the variable is considered empty, FALSE otherwise.
Pick Best Cluster by Modularity
Description
This function selects the best cluster from a list of clustering results based on the highest modularity score.
Usage
pick_best_cluster_modularity(tsne_clust)
Arguments
tsne_clust |
A list of clustering results where each element contains clustering information, including the modularity score. |
Details
The function iterates over a list of clustering results (tsne_clust
) and
selects the cluster with the highest modularity score. If no clusters are valid or
the tsne_clust
list is empty, the function will stop with an error.
Value
Returns the clustering result with the highest modularity score.
Pick the Best Clustering Result Based on Multiple Metrics
Description
This function evaluates multiple clustering results based on various metrics such as modularity, silhouette score, Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CH). It normalizes the scores across all metrics, calculates a combined score for each clustering result, and selects the best clustering result.
Usage
pick_best_cluster_overall(tsne_clust, tsne_calc)
Arguments
tsne_clust |
A list of clustering results. Each result should contain metrics such as modularity, silhouette score, and cluster assignments for the dataset. |
tsne_calc |
A list containing the t-SNE results. It includes the t-SNE coordinates of the dataset used for clustering. |
Details
The function computes four different metrics for each clustering result:
- Modularity: A measure of the quality of the division of the network into clusters.
- Silhouette score: A measure of how similar data points are to their own cluster compared to other clusters.
- Davies-Bouldin Index (DBI): A ratio of within-cluster distances to between-cluster distances, with lower values being better.
- Calinski-Harabasz Index (CH): The ratio of between-cluster dispersion to within-cluster dispersion, with higher values being better.
The scores for each metric are normalized between 0 and 1, and an overall score is calculated for each clustering result. The clustering result with the highest overall score is selected as the best.
Value
The clustering result with the highest combined score based on modularity, silhouette score, Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CH).
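A worked sketch of the scoring logic (illustrative values and equal weighting; DBI is inverted because lower values are better):
# Example metrics for three candidate clusterings (made-up numbers)
scores <- data.frame(modularity = c(0.45, 0.60, 0.52),
                     silhouette = c(0.30, 0.25, 0.40),
                     dbi        = c(1.8, 1.2, 1.5),   # lower is better
                     ch         = c(150, 210, 180))   # higher is better
norm01 <- function(x) (x - min(x)) / (max(x) - min(x))   # rescale each metric to [0, 1]
combined <- norm01(scores$modularity) + norm01(scores$silhouette) +
            (1 - norm01(scores$dbi)) + norm01(scores$ch)
which.max(combined)   # index of the clustering with the best combined score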
Pick Best Cluster by Silhouette Score
Description
This function selects the best cluster from a list of clustering results based on the highest average silhouette score.
Usage
pick_best_cluster_silhouette(tsne_clust)
Arguments
tsne_clust |
A list of clustering results where each element contains clustering information, including the average silhouette score. |
Details
The function iterates over a list of clustering results (tsne_clust
) and
selects the cluster with the highest average silhouette score. If no clusters are valid or
the tsne_clust
list is empty, the function will stop with an error.
Value
Returns the clustering result with the highest average silhouette score.
Select the Best Clustering Based on Weighted Scores: AUROC, Modularity, and Silhouette
Description
This function selects the optimal clustering configuration from a list of t-SNE
clustering results
by evaluating each configuration's AUROC, modularity, and silhouette scores. These scores are combined
using a weighted average, allowing for a more comprehensive assessment of each configuration's relevance.
Usage
pick_best_cluster_simon(dataset, tsne_clust, tsne_calc, settings)
Arguments
dataset |
A data frame representing the original dataset, where each observation will be assigned cluster labels
from each clustering configuration in tsne_clust. |
tsne_clust |
A list of clustering results from different t-SNE configurations, with each element containing
|
tsne_calc |
An object containing the t-SNE results for the dataset (as returned by calculate_tsne). |
settings |
A list of settings for machine learning model training and scoring, including the weights used to combine the AUROC, modularity, and silhouette scores. |
Details
For each clustering configuration in tsne_clust, this function:
- Assigns cluster labels to the dataset.
- Trains the machine learning models specified in settings on the dataset with cluster labels.
- Evaluates each model based on AUROC, modularity, and silhouette scores.
- Selects the clustering configuration with the highest weighted average score as the best clustering result.
Value
A list containing the best clustering configuration (with the highest weighted score) and its associated information.
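The weighted combination can be sketched as follows (the weights shown are hypothetical, not package defaults):
# Hypothetical per-configuration scores
auroc      <- c(0.78, 0.84, 0.81)
modularity <- c(0.55, 0.48, 0.60)
silhouette <- c(0.32, 0.41, 0.35)
w <- c(auroc = 0.5, modularity = 0.25, silhouette = 0.25)   # assumed weights
weighted_score <- w["auroc"] * auroc + w["modularity"] * modularity + w["silhouette"] * silhouette
which.max(weighted_score)   # configuration with the highest weighted average score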
Plot Clustered t-SNE Results
Description
This function generates a t-SNE plot with cluster assignments using consistent color mappings. It includes options for plotting points based on their t-SNE coordinates and adding cluster labels at the cluster centroids. The plot is saved as an SVG file in a temporary directory.
Usage
plot_clustered_tsne(info.norm, cluster_data, settings)
Arguments
info.norm |
A data frame containing the t-SNE coordinates (tsne1, tsne2) along with the cluster assignments. |
cluster_data |
A data frame containing the cluster centroids and labels, with columns for the centroid coordinates and the cluster labels. |
settings |
A list of settings controlling the appearance and output of the plot. |
Value
A ggplot2 object representing the clustered t-SNE plot.
Examples
## Not run:
# Example usage
plot <- plot_clustered_tsne(info.norm, cluster_data, settings)
print(plot)
## End(Not run)
Preprocess a Dataset Using Specified Methods
Description
This function preprocesses a dataset by applying a variety of transformation methods, such as centering, scaling, or imputation. Users can also specify columns to exclude from preprocessing. The function supports a variety of preprocessing methods, including dimensionality reduction and imputation techniques, and ensures proper method application order.
Usage
preProcessData(
data,
outcome,
excludeClasses,
methods = c("center", "scale"),
settings
)
Arguments
data |
A data frame or matrix representing the dataset to be preprocessed. |
outcome |
A character string representing the outcome variable, if any, for outcome-based transformations. |
excludeClasses |
A character vector specifying the column names to exclude from preprocessing. |
methods |
A character vector specifying the preprocessing methods to apply.
Default methods are c("center", "scale"). |
settings |
A named list containing settings for the analysis. If NULL, defaults will be used. |
Details
The function applies various transformations to the dataset as specified by the user. It ensures
that methods are applied in the correct order to maintain data integrity and consistency. If fewer
than two columns remain after excluding specified columns, the function halts and returns NULL
.
The function also handles categorical columns by skipping their transformation. Users can also
specify outcome variables for specialized preprocessing.
Value
A list containing:
- processedMat: The preprocessed dataset.
- preprocessParams: The preprocessing parameters that were applied to the dataset.
Pre-process and Resample Dataset
Description
This function applies pre-processing transformations to the dataset, then resamples it.
Usage
preProcessResample(
datasetData,
preProcess,
selectedOutcomeColumns,
outcome_and_classes,
settings
)
Arguments
datasetData |
Dataframe to be pre-processed |
preProcess |
Vector of pre-processing methods to apply |
selectedOutcomeColumns |
Character vector of outcome columns |
outcome_and_classes |
List of outcomes and their classes |
settings |
A named list containing settings for the analysis. If NULL, defaults will be used. |
Value
A list containing the pre-processing mapping and the processed dataset
Remove Outliers Based on Cluster Information
Description
The remove_outliers
function removes rows from a dataset based on the presence
of outliers marked by a specific cluster ID (typically 100) in the pandora_cluster
column.
This function is meant to be used internally during downstream dataset analysis
to filter out data points that have been identified as outliers during clustering.
Usage
remove_outliers(dataset, settings)
Arguments
dataset |
A data frame that includes clustering results, particularly a pandora_cluster column with cluster assignments. |
settings |
A list of settings. Must contain a logical value indicating whether detected outliers should be removed. |
Value
A filtered data frame with outliers removed if applicable.
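The filtering step amounts to dropping rows flagged with the outlier cluster ID (100); a minimal sketch, in which the excludeOutliers flag name is hypothetical:
remove_outliers_sketch <- function(dataset, settings) {
  # settings$excludeOutliers is a hypothetical flag name used here for illustration
  if (isTRUE(settings$excludeOutliers) && "pandora_cluster" %in% names(dataset)) {
    dataset <- dataset[dataset$pandora_cluster != 100, , drop = FALSE]
  }
  dataset
}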