Title: | Analysing Chemodiversity of Phytochemical Data |
Version: | 0.3.1 |
Description: | Quantify and visualise various measures of chemical diversity and dissimilarity, for phytochemical compounds and other sets of chemical composition data. Importantly, these measures can incorporate biosynthetic and/or structural properties of the chemical compounds, resulting in a more comprehensive quantification of diversity and dissimilarity. For details, see Petrén, Köllner and Junker (2023) <doi:10.1111/nph.18685>. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Depends: | R (≥ 2.10) |
Imports: | jsonlite, httr, vegan, webchem, fmcsR, ChemmineR, hillR, ape, GUniFrac, tidygraph, igraph, ggraph, ggplot2, gridExtra, ggdendro, tidyr, rlang, curl |
VignetteBuilder: | knitr |
URL: | https://github.com/hpetren/chemodiv |
BugReports: | https://github.com/hpetren/chemodiv/issues |
NeedsCompilation: | no |
Packaged: | 2025-06-18 17:43:34 UTC; Hampus |
Author: | Hampus Petrén |
Maintainer: | Hampus Petrén <hampus.petren@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-06-18 18:30:01 UTC |
chemodiv: A package for analysing phytochemical diversity
Description
chemodiv is an R package for analysing the chemodiversity of phytochemical data. The package includes a number of functions that enables quantification and visualization of phytochemical diversity and dissimilarity for any type of phytochemical (and similar) samples, such as herbivore defence compounds, volatiles and similar. Importantly, calculations of diversity and dissimilarity can incorporate biosynthetic and/or structural properties of the phytochemical compounds, resulting in more comprehensive quantifications of diversity and dissimilarity. Functions in the R-package will work best for sets of data, commonly generated by chemical ecologists using GC-MS, LC-MS or similar, where all or most compounds in the samples have been confidently identified. See Petren et al. 2023 for a detailed description of the package, and Petren et al. 2024 for a more in-depth discussion and review of plant chemodiversity.
Details
Two datasets are needed to use the full set of analyses included in the package.
The first dataset should contain data on the relative
abundance/concentration (i.e. proportion) of different compounds (columns)
in different samples (rows). See the included
dataset minimalSampData
for a basic example.
Note that all calculations of diversity, and most calculations of
dissimilarity, are only performed on relative, rather than absolute, values.
The second dataset should contain, in each of three columns in a data frame,
the compound name, SMILES and InChIKey IDs of all the compounds
present in the first dataset. See the included dataset
minimalCompData
for a basic example. SMILES and InChIKey
are chemical identifiers that are easily obtained for each compound
by searching for it in PubChem https://pubchem.ncbi.nlm.nih.gov/.
Here, a search with a common name will bring up the compound's
record in the database, where the (isomeric/canonical) SMILES and
InChIKey are included. Various automated tools such as
the PubChem Identifier Exchange Service
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi or
The Chemical Translation Service https://cts.fiehnlab.ucdavis.edu/
can also be used. The user is intentionally required to compile the
chemical identifiers manually to ensure these are correct,
as lists of compounds very often contain compounds wrongly named,
wrongly formatted, under various synonyms etc. which prevents easy
automatic translation of compound names to SMILES and InChIKey.
Note that SMILES IDs might contain the character combination "\C"
.
If SMILES are entered manually directly in R, this is interpreted as an
unrecognized escape and results in an error. In this case, an extra
backslash has to be added: "\\C"
. If the dataset is instead
imported into R as a csv-file or txt-file (recommended), this is done
automatically and no manual edits has to be done.
The second dataset with the chemical IDs is primarily used to construct
one or more dissimilarity matrices with pairwise dissimilarities between
chemical compounds, which can then be used in calculations of phytochemical
diversity and dissimilarity. As noted above, to do this, the compounds
in the samples have to be identified and their chemical IDs listed.
If some compounds in a dataset are unknown, these can be handled in
different ways decided by the user, see compDis
for details.
If many or all compounds are unknown, as is common for more metabolomic
type datasets, phytochemical diversity and dissimilarity can still be
calculated using indices that do not consider compound dissimilarities.
Alternatively, other ways to calculate compound dissimilarities,
not based on knowing compound identities, can be used.
For example, cosine dissimilarities between tandem (MS/MS) mass spectra of
metabolomic features can be calculated in the GNPS
framework https://gnps.ucsd.edu (Wang et al. 2016).
A dissimilarity matrix of such dissimilarities can then be used
for the compDisMat
argument in various functions in the package,
thereby enabling comprehensive quantification of phytochemical diversity
and dissimilarity also for datasets consisting of unidentified compounds.
Once the dataset with samples and the dataset with compounds are prepared, these should be imported/constructed as separate data frames in R, and all analyses in the package can then be performed, in largely the same order as they appear in the list below.
Data format checks
Compound classification and dissimilarity
Diversity calculations
calcDiv
calcBetaDiv
calcDivProf
Sample dissimilarities
Molecular network and properties
Chemodiversity and network plots
Shortcut function
Author(s)
Hampus Petren, Tobias G. Koellner, Robert R. Junker
References
Petren H, Koellner TG, Junker RR. 2023. Quantifying chemodiversity considering biochemical and structural properties of compounds with the R package chemodiv. New Phytologist 237: 2478-2492.
Petren H, Anaia RA, Aragam KS, Braeutigam A, Eckert S, Heinen R, Jakobs R, Ojeda L, Popp M, Sasidharan R, Schnitzler J-P, Steppuhn A, Thon F, Unsicker SB, van Dam NM, Weisser WW, Wittmann MJ, Yepes S, Ziaja D, Meuller C, Junker RR. 2024. Understanding the chemodiversity of plants: Quantification, variation and ecological function. Ecological Monographs 94: e1635.
Wang M, Carver JJ, Phelan VV, et al. 2016. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology 34: 828-837.
See Also
https://github.com/hpetren/chemodiv
Generate NPClassifier classification
Description
Function to classify compounds with NPClassifier, and put the results in a data frame containing the pathway, superclass and class for each compound.
Usage
NPCTable(compoundData)
Arguments
compoundData |
Data frame with the chemical compounds of interest, usually the compounds found in the sample dataset. Should include a column named "compound" with common names of the compounds and a column named "smiles" with SMILES IDs of the compounds. |
Details
NPClassifier (Kim et al. 2021) is a deep-learning tool that automatically
classifies natural products (i.e. phytochemical compounds) into a
hierarchical classification of three levels: pathway, superclass and class.
This classification largely corresponds to the biosynthetic groups/pathways
the compounds are produced in. The NPCTable
function
conveniently performs this classification directly in R on the
compounds in compoundData
, by accessing the tool
at https://npclassifier.gnps2.org/ and downloading the classifications.
Value
Data frame with the NPClassifier classification for each compound as pathway, superclass and class. Note that compounds may be classified in more than one group, or no group, at each level of classification.
References
Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB, van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021. NPClassifier: A Deep Neural Network-Based Structural Classification Tool for Natural Products. Journal of Natural Products 84: 2795-2807.
Examples
data(minimalCompData)
NPCTable(minimalCompData)
data(alpinaCompData)
NPCTable(compoundData = alpinaCompData[1:3,]) # First three compounds only
Arabis alpina floral scent compounds
Description
A dataset listing the compounds in alpinaSampData
.
Usage
alpinaCompData
Format
A data frame with 15 rows and 3 columns. Each row is a compound. First column is a common name of the compound, second column is the SMILES (Simplified Molecular-Input Line-Entry System) specification, third column is the InChIKey (International Chemical Identifier).
Source
Petren H, Torang P, Agren J, Friberg M. 2021. Evolution of floral scent in relation to self-incompatibility and capacity for autonomous self-pollination in the perennial herb Arabis alpina. Annals of Botany 127: 737-747.
Arabis alpina floral scent compound dissimilarity matrix
Description
A matrix with compound dissimilarities calculated using
compDis
with type = "PubChemFingerprint"
,
for the compounds in alpinaCompData
.
Usage
alpinaCompDis
Format
A 15x15 compound dissimilarity matrix.
Arabis alpina floral scent molecular network
Description
A molecular network. Generated by the molNet
function
used on the alpinaCompDis
dataset, with
cutOff = 0.75
.
Usage
alpinaMolNet
Format
A tbl_graph
object with 15 nodes and 56 edges.
Arabis alpina floral scent NPClassifier table
Description
A table with the NPClassifier pathways, superclasses and classes,
along with the compound names, smiles and inchikey. Generated by
the NPCTable
function used on the
alpinaCompData
dataset.
Usage
alpinaNPCTable
Format
A dataframe with 15 compounds and their NPClassifier classifications.
Arabis alpina populations
Description
A dataset listing what population each sample
in alpinaSampData
comes from.
Usage
alpinaPopData
Format
A data frame with 87 rows and 1 column. Each row represents
the population each sample in alpinaSampData
comes from
(It8, S1 or G1).
Source
Petren H, Torang P, Agren J, Friberg M. 2021. Evolution of floral scent in relation to self-incompatibility and capacity for autonomous self-pollination in the perennial herb Arabis alpina. Annals of Botany 127: 737-747.
Arabis alpina floral scent data
Description
A dataset with proportional floral scent data from three populations of the plant Arabis alpina.
Usage
alpinaSampData
Format
A data frame with 87 rows and 15 columns. Each row is a sample, each column is a floral scent compound.
Source
Petren H, Torang P, Agren J, Friberg M. 2021. Evolution of floral scent in relation to self-incompatibility and capacity for autonomous self-pollination in the perennial herb Arabis alpina. Annals of Botany 127: 737-747.
Arabis alpina floral scent sample dissimilarity matrix
Description
A matrix with sample dissimilarities calculated from
alpinaSampData
and alpinaCompDis
using sampDis
with type = "GenUniFrac"
and
alpha = 0.5
.
Usage
alpinaSampDis
Format
A 87x87 sample dissimilarity matrix.
Calculate beta diversity
Description
Function to calculate beta diversity in the Hill diversity framework. This can be calculated as Hill beta diversity or Functional Hill beta diversity.
Usage
calcBetaDiv(sampleData, compDisMat = NULL, type = "HillDiv", q = 1)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
compDisMat |
Compound dissimilarity matrix, as calculated by
|
type |
Type(s) of Hill beta diversity to calculate. |
q |
Diversity order to use for the calculation of beta diversity.
See |
Details
The function calculates a single beta diversity value for the supplied
sampleData
. This is calculated as beta = gamma / alpha. Gamma
diversity represents the diversity of the pooled data set, alpha diversity
represents the mean diversity across individual samples, and
beta diversity represents turnover or variability among samples.
With type = "HillDiv"
and q = 0
the calculated beta diversity
is equal to the well-known and most simple measure of beta diversity
introduced by Whittaker 1960, where beta = gamma / alpha, based only
on the number of species (here compounds).
Value
Data frame with type of Hill beta diversity calculated, q, and values for gamma diversity, mean alpha diversity and beta diversity.
References
Chao A, Chiu C-H, Jost L. 2014. Unifying Species Diversity, Phylogenetic Diversity, Functional Diversity, and Related Similarity and Differentiation Measures Through Hill Numbers. Annual Review of Ecology, Evolution, and Systematics 45: 297-324.
Jost L. 2007. Partitioning diversity into independent alpha and beta components. Ecology 88: 2427-2439.
Whittaker RH. 1960. Vegetation of the Siskiyou Mountains, Oregon and California. Ecological Monographs 30: 279-338.
Examples
data(minimalSampData)
data(minimalCompDis)
calcBetaDiv(sampleData = minimalSampData)
calcBetaDiv(sampleData = minimalSampData, compDisMat = minimalCompDis,
type = c("HillDiv", "FuncHillDiv"), q = 2)
data(alpinaSampData)
data(alpinaCompDis)
calcBetaDiv(sampleData = alpinaSampData, compDisMat = alpinaCompDis,
type = "FuncHillDiv")
Calculate various diversity measures
Description
Function to calculate different common measures of diversity, and components (richness, evenness, disparity) thereof. Types of measures that can be calculated includes Hill diversity, Functional Hill diversity, Mean Pairwise Dissimilarity (MPD), Shannon's diversity, Simpson diversity, Rao's Q, Pielou's evenness and Hill evenness.
Usage
calcDiv(sampleData, compDisMat = NULL, type = "HillDiv", q = 1)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
compDisMat |
Compound dissimilarity matrix, as calculated by
|
type |
Type(s) of diversity or component(s) thereof to calculate.
Any of |
q |
Diversity order to use for Hill diversity, Functional Hill
Diversity and Hill Evenness. q should be equal to or larger than zero.
This parameter determines the sensitivity of the (Functional) Hill Diversity
measure to the relative frequencies of compounds. Commonly set to 0, 1 or 2,
although any value > 0 may be used. For |
Details
The function calculates diversity and/or components thereof for each
sample in sampleData
. It can calculate the following indices:
-
Shannon
. Shannon's Diversity. -
Simpson
. Simpson Diversity, often referred to as the Inverse Simpson Index. -
HillDiv
. Hill Diversity. Equation 4a/4b in Chao et al. 2014. Also referred to as the Hill number or the effective number of species (here compounds). The parameter q determines the sensitivity of the measure to the relative frequencies of compounds (see above for details). Forq = 0
, this equals the number of compounds in a sample, i.e. the richness. Forq = 1
, this equals the exponential of Shannon's Diversity. Forq = 2
, this equals the Simpson Diversity. -
FuncHillDiv
. Functional Hill Diversity. There are different versions of this. Here, the most common version is calculated. This is the FD(Q), called "total functional diversity", calculated in equation 4b/6b in Chiu & Chao 2014. This measure quantifies the effective total dissimilarity between compounds in the sample. Calculation of Functional Hill Diversity requires a compound dissimilarity matrix. The parameter q determines the sensitivity of the measure to the relative frequencies of compounds (see above for details). Forq = 1
, this is a measure sensitive to compound richness, evenness and dissimilarity, and is therefore the most comprehensive measure of diversity. Forq = 0
, this is equal to Functional Attribute Diversity (FAD) which is the sum of all dissimilarities in the dissimilarity matrix. -
MPD
. Mean Pairwise Dissimilarity. As the name suggests, this is equal to the mean of the pairwise dissimilarities in the compound dissimilarity matrix (excluding the 0 values in the diagonal). Therefore, in contrast to FAD (see above) this measure is not dependent on the number of compounds, and hence represents the disparity component of diversity. Practically, MPD is calculated as Functional Hill Diversity atq = 0
, divided by n(n-1), where n is the number of compounds, i.e. Hill Diversity atq = 0
. -
RaoQ
. Rao's quadratic entropy index Q. The perhaps most common measure of functional diversity. Requires a compound dissimilarity matrix. Rao's Q represents the average dissimilarity of two randomly selected (weighed by their proportions) compounds in the sample. -
PielouEven
. Pielou's Evenness, also referred to as Shannon's equitability. This is perhaps the most common evenness measure. Equal to the Shannon's Diversity divided by the natural logarithm of the number of compounds. In other words, it expresses evenness with the observed Shannon's diversity as a proportion of the maximum Shannon's diversity where all compounds are equally abundant. Therefore, this is a relative measure with a minimum value of 0 and a maximum value of 1. This measure of evenness is not replication invariant. -
HillEven
. Hill Evenness, as defined by equation 8 in Tuomisto 2012. This is equal to the Hill Diversity, at a given value of q, divided by the number of compounds, and therefore has a minimum value of 1 / number of compounds and maximum value of 1. This measure of evenness is replication invariant. This measure can be normalized to range from 0 to 1 (equation 13 in Tuomisto 2012).
Value
Data frame with calculated diversity values for each sample.
References
Chao A, Chiu C-H, Jost L. 2014. Unifying Species Diversity, Phylogenetic Diversity, Functional Diversity, and Related Similarity and Differentiation Measures Through Hill Numbers. Annual Review of Ecology, Evolution, and Systematics 45: 297-324.
Chiu C-H, Chao A. 2014. Distance-Based Functional Diversity Measures and Their Decomposition: A Framework Based on Hill Numbers. PLoS ONE 9: e100014.
Hill MO. 1973. Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology 54: 427-432.
Petren H, Koellner TG, Junker RR. 2023. Quantifying chemodiversity considering biochemical and structural properties of compounds with the R package chemodiv. New Phytologist 237: 2478-2492.
Petren H, Anaia RA, Aragam KS, Braeutigam A, Eckert S, Heinen R, Jakobs R, Ojeda L, Popp M, Sasidharan R, Schnitzler J-P, Steppuhn A, Thon F, Unsicker SB, van Dam NM, Weisser WW, Wittmann MJ, Yepes S, Ziaja D, Meuller C, Junker RR. 2024. Understanding the chemodiversity of plants: Quantification, variation and ecological function. Ecological Monographs 94: e1635.
Tuomisto H. 2012. An updated consumer's guide to evenness and related indices. Oikos 121: 1203-1218
Examples
data(minimalSampData)
data(minimalCompDis)
calcDiv(sampleData = minimalSampData)
calcDiv(sampleData = minimalSampData, type = c("HillDiv", "HillEven"))
calcDiv(sampleData = minimalSampData, compDisMat = minimalCompDis,
type = "FuncHillDiv", q = 2)
data(alpinaSampData)
data(alpinaCompDis)
calcDiv(sampleData = alpinaSampData, compDisMat = alpinaCompDis,
type = "FuncHillDiv")
Calculate a diversity profile
Description
Function to calculate a diversity profile, i.e. calculate Hill diversity or Functional Hill Diversity for a range of q values.
Usage
calcDivProf(
sampleData,
compDisMat = NULL,
type = "HillDiv",
qMin = 0,
qMax = 3,
step = 0.1
)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
compDisMat |
Compound distance matrix, as calculated by
|
type |
Type of Hill Diversity to calculate for the diversity profile.
|
qMin |
Minimum value of q. |
qMax |
Maximum value of q. |
step |
Increment by which q will be calculated between |
Details
The function calculates a diversity profile for each sample
in sampleData
. A diversity profile is a calculation of
Hill Diversity or Functional Hill Diversity for a range of
different values of q. This function performs the calculations,
while chemoDivPlot
can be used to conveniently
create the diversity profile plot, where Hill Diversity is
plotted as a function of q within the chosen range.
The shape of the diversity profile curve reflects the evenness
of compound proportions in the sample. For a perfectly even sample
the curve is flat. The more uneven the compound proportions are,
the more steep is the decline of the curve. A common range,
used as default, of q values is between qMin = 0
and
qMax = 3
, as diversity should change little beyond qMax = 3
.
See calcDiv
for further details on q.
Value
List with a diversity profile data frame with samples as rows
and the Hill diversity or Functional Hill diversity for different q
values as columns; and values for type
, qMin
,
qMax
and step
.
References
Chao A, Chiu C-H, Jost L. 2014. Unifying Species Diversity, Phylogenetic Diversity, Functional Diversity, and Related Similarity and Differentiation Measures Through Hill Numbers. Annual Review of Ecology, Evolution, and Systematics 45: 297-324.
Examples
data(minimalSampData)
data(minimalCompDis)
calcDivProf(sampleData = minimalSampData)
calcDivProf(sampleData = minimalSampData, compDisMat = minimalCompDis,
type = "FuncHillDiv")
data(alpinaCompData)
data(alpinaCompDis)
calcDivProf(sampleData = alpinaSampData, compDisMat = alpinaCompDis,
type = "FuncHillDiv", qMin = 1, qMax = 2, step = 0.2)
Check data formatting
Description
Function to check that the datasets used by other functions in the chemodiv package are correctly formatted.
Usage
chemoDivCheck(sampleData, compoundData)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
compoundData |
Data frame with the compounds in |
Details
The function performs a number of checks on the two main datasets used as input data, to make sure datasets are formatted in a way suitable for the other functions in the package. This should make it easier for users to correctly construct datasets before starting with analyses.
Two datasets are needed to use the full set of analyses included in
the package, and these can be checked for formatting issues.
The first dataset should contain data on the proportions
of different compounds (columns) in different samples (rows).
Note that all calculations of diversity, and most calculations of
dissimilarity, are only performed on relative, rather than absolute,
values. The second dataset should contain, in each of three
columns in a data frame, the compound name, SMILES and InChIKey IDs of
all the compounds present in the first dataset. See
chemodiv
for details on obtaining SMILES and InChIKey IDs.
For compound names, avoid starting with a number, avoid using Greek letters
or special characters, and ensure there are no trailing spaces.
Value
One or several messages pointing out problems with data formatting, or a message informing that the datasets appear to be correctly formatted.
Examples
data(minimalSampData)
data(minimalCompData)
chemoDivCheck(minimalSampData, minimalCompData) # Correct format
chemoDivCheck(minimalSampData, minimalCompData[c(2,3,1),]) # Incorrect format
data(alpinaSampData)
data(alpinaCompData)
chemoDivCheck(sampleData = alpinaSampData, compoundData = alpinaCompData)
Plot chemodiversity
Description
Function to conveniently create basic plots of the different types of
chemodiversity measurements calculated by functions in the package.
This function exists to provide an easy way to make basic chemodiversity
plots. As functions in the package output data in standard formats,
customized plots are easily created with ggplot2
.
Usage
chemoDivPlot(
compDisMat = NULL,
divData = NULL,
divProfData = NULL,
sampDisMat = NULL,
groupData = NULL
)
Arguments
compDisMat |
Compound dissimilarity matrix, generated by
the |
divData |
Diversity/evenness data frame,
generated by the |
divProfData |
Diversity profile, generated by
the |
sampDisMat |
Sample dissimilarity matrix, generated by
the |
groupData |
Grouping data. Should be either a vector or a data frame with a single column. |
Details
The function can create four different types of plots,
(using ggplot2
) depending on which input data
is supplied:
Function argument
compDisMat
. A compound dissimilarity matrix will be plotted as a dendrogram visualizing how structurally/biosynthetically similar different compounds are to each other.Function argument
divData
. Diversity/evenness values will be plotted as a boxplot.Function argument
divProfData
. A diversity profile, plotting (Functional) Hill diversity at different values of q will be plotted as a line plot.Function argument
sampDisMat
. A sample dissimilarity matrix will be plotted as an NMDS plot.Function argument
groupData
. Grouping data (e.g. population, species etc.) may be supplied, to plot each group in different boxes/lines/colours.
Note that this function can take any combination of the four arguments
as input, and argument names should always be specified to ensure
each dataset is correctly plotted. If including the function
argument sampDisMat
, a Nonmetric Multidimensional Scaling (NMDS)
will be performed, which may take time for larger datasets.
Value
The specified chemodiversity plots.
Examples
minimalDiv <- calcDiv(minimalSampData, minimalCompDis, type = "FuncHillDiv")
groups <- c("A", "A", "B", "B")
chemoDivPlot(divData = minimalDiv, groupData = groups)
data(alpinaCompDis)
data(alpinaSampDis)
data(alpinaPopData)
alpinaDiv <- calcDiv(sampleData = alpinaSampData, compDisMat = alpinaCompDis,
type = "FuncHillDiv")
alpinaDivProf <- calcDivProf(sampleData = alpinaSampData,
compDisMat = alpinaCompDis, type = "FuncHillDiv",
qMin = 0, qMax = 2, step = 0.2)
chemoDivPlot(compDisMat = alpinaCompDis, divData = alpinaDiv,
divProfData = alpinaDivProf, sampDisMat = alpinaSampDis,
groupData = alpinaPopData)
Calculate compound dissimilarities
Description
Function to quantify dissimilarities between phytochemical compounds.
Usage
compDis(
compoundData,
type = "PubChemFingerprint",
npcTable = NULL,
unknownCompoundsMean = FALSE
)
Arguments
compoundData |
Data frame with the chemical compounds of interest, usually the compounds found in the sample dataset. Should have a column named "compound" with common names of the compounds, a column named "smiles" with SMILES IDs of the compounds, and a column named "inchikey" with the InChIKey IDs for the compounds. |
type |
Type of data compound dissimilarity calculations will be
based on: |
npcTable |
A data frame already generated by |
unknownCompoundsMean |
If unknown compounds, i.e. ones without SMILES or InChIKey, should be given mean dissimilarity values. If not, these will have dissimilarity 1 to all other compounds. |
Details
This function calculates matrices with pairwise dissimilarities between
the chemical compounds in compoundData
, to quantify how
different the molecules are to each other. It does so in three
different ways, based on the biosynthetic classification or
molecular structure of the molecules:
Using the classification from the NPClassifier tool,
type = "NPClassifier"
. NPClassifier (Kim et al. 2021) is a deep-learning tool that automatically classifies natural products (i.e. phytochemical compounds) into a hierarchical classification of three levels: pathway, superclass and class. This classification largely corresponds to the biosynthetic groups/pathways the compounds are produced in. Classifications are downloaded from https://npclassifier.gnps2.org/. NPClassifier does not always manage to classify every compound into all three hierarchical levels. In such cases, it might be beneficial to first runNPCTable
, manually edit the resulting data frame with probable classifications if possible (with help from the Supporting Information in Kim et al. 2021), and then supply this classification to thecompDis
function with thenpcTable
argument. This will ensure that compound dissimilarities are computed optimally.Using PubChem Fingerprints,
type = "PubChemFingerprint"
. This is a binary substructure fingerprint with 881 binary variables describing the chemical structure of a compound. With this method, compounds are therefore compared based on how structurally dissimilar the molecules are. See https://pubchem.ncbi.nlm.nih.gov/docs/data-specification for more information. (There are many other types of fingerprints, and ways of calculating compound dissimilarities based on them, see e.g. packagesfingerprint
andrcdk
). Fingerprint data for molecules is downloaded from PubChem. In association with this, there might be a Warning message about closing unused connections, which is not important.fMCS, flexible Maximum Common Substructure,
type = "fMCS"
. This is a pairwise graph matching concept. The fMCS of two compounds is the largest substructure that occurs in both compounds allowing for atom and/or bond mismatches (Wang et al 2013). As with the fingerprints, compounds are compared based on how structurally dissimilar the molecules are. While potentially a very accurate similarity measure, fMCS is much more computationally demanding than the other methods, and will take a significant amount of time for larger data sets. Data on molecules is downloaded from PubChem. In association with this, there might be a Warning message about closing unused connections, which is not important.
Dissimilarities using NPClassifier and PubChem Fingerprints are generated by calculating Jaccard (Tanimoto) dissimilarities from a 0/1 table with compounds as rows and group (NPClassifier) or binary fingerprint variable (PubChem Fingerprints) as columns. fMCS generates dissimilarity values by calculating Jaccard dissimilarities based on the number of atoms in the maximum common substructure, allowing for one atom and one bond mismatch. Dissimilarities are outputted as dissimilarity matrices.
If dissimilarities are calculated with more than one method,
the function will output additional dissimilarity matrices.
This always includes a matrix with the mean dissimilarity values of the
selected methods. If "NPClassifier"
is included in type
,
a matrix of "mix" values is also calculated. The values in this matrix
are the dissimilarities from NPClassifier when these are > 0.
For pairs of compounds where dissimilarities from NPClassifier
equals 0 (i.e. when the compounds belong to the same pathway, superclass
and class), values are equal to half of the (mean) value(s) of the
structural dissimilarity/-ies from PubChem Fingerprints and/or fMCS.
With this method, compound dissimilarities are primarily based on
NPClassifier, but instead of compounds with identical classification having
0 dissimilarity, these have a dissimilarity based on PubChem Fingerprints
and/or fMCS, scaled to always be less (< 0.5) than compounds being in the
same pathway and superclass, but different class.
If there are unknown compounds, which do not have a
corresponding SMILES or InChIKey, this can be handled in three
different ways. First, these can be completely removed from the list
of compounds and the sample data set, and hence excluded from all analyses.
Second, if unknownCompoundsMean = FALSE
, unknown compounds will
be given a dissimilarity value of 1 to all other compounds. Third, if
unknownCompoundsMean = TRUE
, unknown compounds will be given
a dissimilarity value to all other compounds which equals the mean
dissimilarity value between all known compounds. See chemodiv
for alternative methods that can be used when most or all compounds
are unknown.
Value
List with compound dissimilarity matrices. A list is always
outputted, even if only one matrix is calculated. Downstream functions,
including calcDiv
, calcBetaDiv
,
calcDivProf
, sampDis
, molNet
and chemoDivPlot
require only the matrix as
input (e.g. as fullList$specificMatrix
) rather than the whole list.
References
Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB, van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021. NPClassifier: A Deep Neural Network-Based Structural Classification Tool for Natural Products. Journal of Natural Products 84: 2795-2807.
Wang Y, Backman TWH, Horan K, Girke T. 2013. fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics 29: 2792-2794.
Examples
data(minimalCompData)
data(minimalNPCTable)
compDis(minimalCompData, type = "NPClassifier",
npcTable = minimalNPCTable) # Dissimilarity based on NPClassifier
## Not run: compDis(minimalCompData) # Dissimilarity based on Fingerprints
data(alpinaCompData)
data(alpinaNPCTable)
compDis(compoundData = alpinaCompData, type = "NPClassifier",
npcTable = alpinaNPCTable) # Dissimilarity based on NPClassifier
Minimal compound dataset
Description
A small dataset with three phytochemical compounds.
Usage
minimalCompData
Format
A data frame with 3 rows and 3 columns. Each row is a phytochemical compound. First column is a common name of the compound, second column is the SMILES (Simplified Molecular-Input Line-Entry System) specification, third column is the InChIKey (International Chemical Identifier).
Minimal compound dissimilarity matrix
Description
A matrix with compound dissimilarities calculated using
compDis
with type = "PubChemFingerprint"
,
for the compounds in minimalCompData
.
Usage
minimalCompDis
Format
A 3x3 compound dissimilarity matrix.
Minimal molecular network
Description
A molecular network. Generated by the molNet
function
used on the minimalCompDis
dataset, with
cutOff = "median"
.
Usage
minimalMolNet
Format
A tbl_graph
object with 3 nodes and 4 edges.
Minimal NPClassifier table
Description
A table with the NPClassifier pathways, superclasses and classes,
along with the compound names, smiles and inchikey. Generated by
the NPCTable
function used on the
minimalCompData
dataset.
Usage
minimalNPCTable
Format
A dataframe with 3 compounds and their NPClassifier classifications.
Minimal sample dataset
Description
A small made up dataset with phytochemical data.
Usage
minimalSampData
Format
A data frame with 4 rows and 3 columns. Each row is a sample, each column is a phytochemical compound.
Minimal sample dissimilarity matrix
Description
A matrix with sample dissimilarities calculated from
minimalSampData
and minimalCompDis
using sampDis
with type = "GenUniFrac"
and
alpha = 0.5
.
Usage
minimalSampDis
Format
A 4x4 sample dissimilarity matrix.
Generate a molecular network with some properties
Description
Function to generate a molecular network object, and some basic properties of the network.
Usage
molNet(compDisMat, npcTable = NULL, cutOff = "median")
Arguments
compDisMat |
Compound dissimilarity matrix, as calculated by
|
npcTable |
A data frame generated by |
cutOff |
Cut-off value for compound similarities. Any similarity
lower than this value will be set to zero when the network is generated,
which strongly affects the look of the network. The value can be set
manually to any value between 0 and 1; to the |
Details
Molecular networks can be used to illustrate the biosynthetic/structural
similarity of phytochemical compounds in a sample, while
simultaneously visualizing their relative concentrations. molNet
creates the network, and molNetPlot
can subsequently be
used to create a plot of the network.
Value
List with a (tbl_graph) graph object, the number of compounds,
number of NPC pathways and a measure of the modularity of the network
(see modularity
).
Examples
data(minimalCompDis)
data(minimalNPCTable)
molNet(minimalCompDis, minimalNPCTable, cutOff = 0)
data(alpinaCompDis)
molNet(compDisMat = alpinaCompDis, cutOff = 0.75)
Plot molecular networks
Description
Function to conveniently create a basic plot of the molecular network
created by the molNet
function. Molecular networks can be
used to illustrate the biosynthetic/structural similarity of
phytochemical compounds in a sample, while simultaneously visualizing
their relative concentrations. In the network, nodes are compounds,
with node sizes or node colours representing the relative concentrations
of compounds. Edges connects nodes, with edge widths representing
compound similarity. This function exists to provide an easy way to
make basic molecular network plots. Customized network plots can be
created with e.g. the ggraph
package or the
Cytoscape software platform.
Usage
molNetPlot(
sampleData,
networkObject,
groupData = NULL,
npcTable = NULL,
plotNames = FALSE,
layout = "kk"
)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
networkObject |
A network object, as created by the
|
groupData |
Grouping data (e.g. population, species etc.). If supplied, a separate network will be created for each group. Should be either a vector, or a data frame with a single column. |
npcTable |
It is optional but recommended to supply an
|
plotNames |
Indicates if compounds names should be included in the molecular network plot. |
layout |
Layout used by |
Details
The network object from molNet
and sampleData
have to
be supplied. In addition, groupData
and/or an NPCTable
can be supplied. If an NPCTable is supplied, which is recommended,
node colours will represent NPC pathways, and node sizes the relative
concentration of the compounds. Edge widths represent compound similarity,
and only edges with similarity values above the cutOff
value
in the molNet
function will be plotted.
If groupData
is supplied, one network will be created for each group.
When groupData
but not an NPCTable
is supplied,
compounds missing (i.e. having a mean of 0) from
specific groups are plotted as triangles. When groupData
and an NPCTable
is supplied, compounds missing from
specific groups have a white fill. Additionally, in both cases, edges
connecting to missing compounds are lighter coloured. These graphical
styles are done so that networks are more easy to compare across groups.
Value
A plot with one or more molecular networks.
Examples
data(minimalSampData)
data(minimalNPCTable)
data(minimalMolNet)
groups <- c("A", "A", "B", "B")
molNetPlot(minimalSampData, minimalMolNet)
molNetPlot(minimalSampData, minimalMolNet, groups)
molNetPlot(minimalSampData, minimalMolNet, plotNames = TRUE)
data(alpinaSampData)
data(alpinaPopData)
data(alpinaMolNet)
data(alpinaNPCTable)
molNetPlot(sampleData = alpinaSampData, networkObject = alpinaMolNet,
npcTable = alpinaNPCTable)
Quickly calculate or plot chemodiversity
Description
This function is a shortcut that makes use of many of the other functions in the package. In one simple step chemodiversity is calculated, and if requested also plotted, for users wanting to quickly explore their data using standard parameters.
Usage
quickChemoDiv(
sampleData,
compoundData = NULL,
groupData = NULL,
outputType = "plots"
)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
compoundData |
Data frame with the compounds in |
groupData |
Grouping data (e.g. population, species etc.). Should be either a vector, or a data frame with a single column. |
outputType |
Type of output that should be returned: either
|
Details
The function requires sample data as input, and can also include
compound data. chemoDivCheck
can be used to ensure these
datasets are correctly formatted, see chemodiv
for further
details on data formatting. If only sample data is supplied,
phytochemical diversity and dissimilarity will be calculated
as Hill diversity and Bray-Curtis dissimilarity, respectively.
If sample data and compound data is supplied, phytochemical diversity
and dissimilarity will be calculated as Functional Hill diversity
and Generalized UniFrac dissimilarity, respectively.
This function then uses the following other functions in the package:
-
compDis
is used to calculate compound dissimilarities using PubChem Fingerprints, if compound data is supplied. -
calcDiv
is used to calculate (Functional) Hill Diversity for q = 1. -
calcDivProf
is used to calculate a diversity profile with (Functional) Hill Diversity for q = 0-3. -
sampDis
is used to calculate Bray-Curtis or Generalized UniFrac dissimilarities between samples. -
chemoDivPlot
is used to create different chemodiversity plots if requested.
quickChemoDiv
is designed to provide an easy way to calculate and
visualize the most central measures of phytochemical diversity. It uses
default parameters to do so, which should be reasonable in most cases.
However, for detailed analyses it is recommended to use the separate
functions to allow for full control of function input, arguments and output.
Value
Different types of chemodiversity measures, either as elements in
a list or as separate plots. If outputType = "data"
, function returns
a compound dissimilarity matrix (if compound data was supplied),
a data frame with (Functional) Hill Diversity at q = 1,
a data frame with a (Functional) Hill Diversity profile for q = 0-3,
and a sample dissimilarity matrix. If outputType = "plots"
,
these data sets are plotted as a dendrogram (if compound data was supplied),
a boxplot, a diversity profile plot and an NMDS plot, respectively.
Examples
data(minimalCompData)
data(minimalSampData)
groups <- c("A", "A", "B", "B")
quickChemoDiv(sampleData = minimalSampData, groupData = groups,
outputType = "data") # Without compound data
data(alpinaSampData)
data(alpinaPopData)
quickChemoDiv(sampleData = alpinaSampData, outputType = "plots",
groupData = alpinaPopData) # Without compound data
Calculate sample dissimilarities
Description
Function to calculate dissimilarities between samples. Either Bray-Curtis dissimilarities and/or Generalized UniFrac dissimilarities are calculated.
Usage
sampDis(sampleData, compDisMat = NULL, type = "BrayCurtis", alpha = 1)
Arguments
sampleData |
Data frame with the relative concentration of each compound (column) in every sample (row). |
compDisMat |
Compound dissimilarity matrix, as calculated by
|
type |
Type of sample dissimilarities to be calculated. This is
Bray-Curtis dissimilarities, |
alpha |
Parameter used in calculations of Generalized UniFrac
dissimilarities. alpha can be set between 0 and 1.
With |
Details
The function calculates a dissimilarity matrix for all the samples
in sampleData
, for the given dissimilarity index/indices.
Bray-Curtis dissimilarities are calculated using only
the sampleData
. This is the most commonly calculated dissimilarity
index used for phytochemical data (other types of dissimilarities are
easily calculated using the vegdist
function in
the vegan
package).
If a compound dissimilarity matrix, compDisMat
, is supplied,
Generalized UniFrac dissimilarities can be calculated, which also
use the compound dissimilarity matrix for the sample dissimilarity
calculations. For the calculation of Generalized UniFrac
dissimilarities (Chen et al. 2012), the compound dissimilarity matrix is
transformed into a dendrogram using hierarchical clustering (with the
UPGMA method). Calculations of UniFrac dissimilarities quantifies the
fraction of the total branch length of the dendrogram that leads to
compounds present in either sample, but not both. The (weighted) Generalized
UniFrac dissimilarities implemented here additionally take compound
abundances into account. In this way, both the relative proportions of
compounds and the biosynthetic/structural dissimilarities of the compounds
are accounted for in the calculations of sample dissimilarities, such that
two samples containing more biosynthetically/structurally different
compounds have a higher pairwise dissimilarity than two samples
containing more biosynthetically/structurally similar compounds.
As with Bray-Curtis dissimilarities, Generalized UniFrac dissimilarities
range in value from 0 to 1.
Value
List with sample dissimilarity matrices. A list is always outputted, even if only one matrix is calculated.
References
Bray JR, Curtis JT. 1957. An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecological Monographs 27: 325-349.
Chen J, Bittinger K, Charlson ES, et al. 2012. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 28: 2106-2113.
Lozupone C, Knight R. 2005. UniFrac: a New Phylogenetic Method for Comparing Microbial Communities. Applied and Environmental Microbiology 71: 8228-8235.
Examples
data(minimalSampData)
data(minimalCompDis)
sampDis(minimalSampData)
sampDis(sampleData = minimalSampData, compDisMat = minimalCompDis,
type = c("BrayCurtis", "GenUniFrac"), alpha = 0.5)
data(alpinaSampData)
data(alpinaCompDis)
sampDis(sampleData = alpinaSampData, compDisMat = alpinaCompDis,
type = "GenUniFrac")