Title: | A Glimpse at the Diversity of Peru's Endemic Plants |
Version: | 0.1.9 |
Description: | An updated database of Peru's endemic plants. This compiled and revised botanical collection comprises 7,898 distinct species. The data were sourced from Govaerts, R., Nic Lughadha, E., Black, N. et al., 'The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity', published in Sci Data 8, 215 (2021) <doi:10.1038/s41597-021-00997-6>. |
License: | MIT + file LICENSE |
URL: | https://github.com/PaulESantos/ppendemic/ |
BugReports: | https://github.com/PaulESantos/ppendemic/issues/ |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
Depends: | R (≥ 4.1.0) |
Config/testthat/edition: | 3 |
Maintainer: | Paul E. Santos Andrade <paulefrens@gmail.com> |
Imports: | assertthat, dplyr, fuzzyjoin, memoise, progress, purrr, readr, stringr, tibble, tidyr |
NeedsCompilation: | no |
Packaged: | 2025-06-09 04:14:50 UTC; PC |
Author: | Paul E. Santos Andrade |
Repository: | CRAN |
Date/Publication: | 2025-06-09 04:30:02 UTC |
Direct Match
Description
This function performs a direct (exact) match of species names. For a binomial name it matches the genus and specific epithet; for a name with an infraspecific epithet (for example, a subspecies) it matches the genus, specific epithet, and infraspecies.
Usage
direct_match(df, target_df = NULL)
Arguments
df |
A tibble containing the species data to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
A tibble with an additional logical column direct_match indicating whether the binomial or trinomial name was successfully matched (TRUE) or not (FALSE).
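The idea of a direct match can be illustrated with a minimal base R sketch. It is not the package's implementation: the reference rows are toy data, and the column names genus and species are assumptions chosen for illustration, not the internal column names used by direct_match().

```r
# Toy reference list standing in for the ppendemic database.
# The rows are illustrative only, not taken from the real data.
target <- data.frame(
  genus = c("Aa", "Examplea"),
  species = c("aurantiaca", "demo"),
  stringsAsFactors = FALSE
)

# Names to check, already split into genus and specific epithet
# (hypothetical column names).
query <- data.frame(
  genus = c("Aa", "Aa", "Puya"),
  species = c("aurantiaca", "aurantiaaia", "raimondii"),
  stringsAsFactors = FALSE
)

# A binomial matches directly when its genus + epithet pair
# occurs verbatim in the reference list.
key <- function(d) paste(d$genus, d$species)
query$direct_match <- key(query) %in% key(target)

query
```

Only the first name matches here; the misspelled second name would be left for a later fuzzy-matching step.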
Direct Match Species within Genus
Description
This function performs a direct match of specific epithets within an already matched genus from the list of endemic species in the ppendemic database.
Usage
direct_match_species_within_genus_helper(df, target_df)
Arguments
df |
A tibble containing the species data to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
A tibble with an additional logical column indicating whether the specific epithet was successfully matched within the matched genus (TRUE) or not (FALSE).
Fuzzy Match Genus Name
Description
This function performs a fuzzy match of genus names against the ppendemic database, using the string-distance joins provided by the fuzzyjoin package (the stringdist_join() family) to account for slight variations in spelling.
Usage
fuzzy_match_genus(df, target_df = NULL)
Arguments
df |
A tibble containing the genus names to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
A tibble with two additional columns:
- fuzzy_match_genus
A logical column indicating whether the genus was successfully matched (TRUE) or not (FALSE).
- fuzzy_genus_dist
A numeric column representing the distance for each match.
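As a rough illustration of how the two output columns relate, the sketch below does a fuzzy genus lookup in base R. It swaps in utils::adist(), which computes a generalized Levenshtein distance, rather than the fuzzyjoin machinery the package uses. The genus lists, the threshold max_dist, and the extra matched_genus column are assumptions made for the example.

```r
# Toy genus list standing in for the genera of the reference database.
target_genera <- c("Aa", "Werneria", "Puya")

# Submitted genera: one typo, one exact name, one unrelated name.
query_genera <- c("Wernera", "Puya", "Quercus")

# adist() returns a matrix of edit distances:
# rows correspond to queries, columns to reference genera.
d <- utils::adist(query_genera, target_genera)

best <- apply(d, 1, min)            # smallest distance per query
best_idx <- apply(d, 1, which.min)  # position of the closest genus

max_dist <- 1  # assumed threshold for accepting a fuzzy match

result <- data.frame(
  genus = query_genera,
  fuzzy_match_genus = best <= max_dist,
  fuzzy_genus_dist = ifelse(best <= max_dist, best, NA),
  matched_genus = ifelse(best <= max_dist, target_genera[best_idx], NA),
  stringsAsFactors = FALSE
)

result
```

Recording the distance alongside the logical flag lets a user tell an exact hit (distance 0) from a corrected spelling (distance 1).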
Fuzzy Match Infraspecies within Species
Description
This function performs a fuzzy match of infraspecific epithets within an already matched species from the list of endemic species in the ppendemic database.
Usage
fuzzy_match_infraspecies_within_species(df, target_df = NULL)
Arguments
df |
A tibble containing the species data to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
A tibble with an additional logical column fuzzy_match_infraspecies_within_species, indicating whether the infraspecific epithet was successfully fuzzy matched within the matched species (TRUE) or not (FALSE).
Fuzzy Match Species within Genus
Description
This function attempts to fuzzy match specific epithets within an already matched genus against the ppendemic database, using the string-distance joins provided by the fuzzyjoin package.
Usage
fuzzy_match_species_within_genus_helper(df, target_df)
Arguments
df |
A tibble containing the species data to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
A tibble with an additional logical column fuzzy_match_species_within_genus, indicating whether the specific epithet was successfully fuzzy matched within the matched genus (TRUE) or not (FALSE).
Match Genus Name
Description
This function performs a direct match of genus names against the genus names listed in the ppendemic database.
Usage
genus_match(df, target_df = NULL)
Arguments
df |
A tibble containing the genus names to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
A tibble with an additional logical column genus_match indicating whether the genus was successfully matched (TRUE) or not (FALSE).
Check if species are endemic in the ppendemic database
Description
This function checks whether each name in a list of species names is endemic according to the ppendemic database. Fuzzy matching with a maximum distance threshold is used to handle potential typos or spelling variations in the submitted names.
Usage
is_ppendemic(splist)
Arguments
splist |
A character vector containing the species names to be checked for endemism against the ppendemic database. |
Value
A character vector indicating if each species is endemic or not endemic.
Examples
is_ppendemic(c("Aa aurantiaca", "Aa aurantiaaia", "Werneria nubigena"))
Match Species Names to Endemic Plant List of Peru
Description
This function matches given species names against the internal database of endemic plant species in Peru.
Usage
matching_ppendemic(splist)
Arguments
splist |
A vector containing the species list. |
Details
The function first attempts an exact match of each name against the database (genus and specific epithet, or genus, specific epithet, and infraspecies). If no exact match is found, it performs a fuzzy match using the fuzzyjoin package with an optimal string alignment distance of one, as implemented in stringdist.
The maximum edit distance is intentionally set to one.
The function matching_ppendemic returns a tibble with new columns Matched.Genus, Matched.Species, and Matched.Infraspecies, containing the matched names or NA if no match was found.
Additionally, a logical column is added for each function called, allowing users to see which functions were applied to each name during the matching process. If a process column shows NA, the corresponding function was not called for that name because it was already matched by a preceding function.
Value
Returns a tibble with the matched names in Matched.Genus and Matched.Species for binomial names, and in Matched.Infraspecies for valid infraspecific names.
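The exact-then-fuzzy strategy described in the Details can be sketched in base R. This is a simplified illustration, not the package's code. It compares whole names rather than matching genus, epithet, and infraspecies in turn. The helper names osa_distance() and match_name() are hypothetical, and the reference list is toy data. The distance function is a plain implementation of optimal string alignment, in which a swap of two adjacent characters counts as one edit.

```r
# Optimal string alignment (OSA) distance between two strings.
# Hypothetical helper written for this example.
osa_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution
      )
      # Transposition of two adjacent characters counts as one edit.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

# Stage 1: exact match. Stage 2: closest reference name within max_dist.
match_name <- function(name, reference, max_dist = 1L) {
  if (name %in% reference) return(name)
  d <- vapply(reference, function(r) osa_distance(name, r), integer(1))
  if (min(d) <= max_dist) return(reference[[which.min(d)]])
  NA_character_
}

reference <- c("Aa aurantiaca")  # toy reference list

match_name("Aa aurantiaca", reference)   # exact match
match_name("Aa aurantaica", reference)   # adjacent letters swapped
match_name("Puya raimondii", reference)  # no match within distance one
```

Keeping the maximum distance at one limits corrections to single typos, which lowers the risk of mapping a submitted name onto a different but similarly spelled taxon.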
ppendemic_tab14: Endemic Plant Database of Peru
Description
The ppendemic_tab14 dataset is a tibble (data frame) that provides easy access to a comprehensive database of Peru's endemic plant species. It contains a total of 7,898 records with essential botanical information, including the accepted name, accepted family, genus, species, infraspecific information, taxon authors, primary author, place of publication, volume and page, publication years, and version details.
Usage
ppendemic_tab14
Format
A tibble (data frame) with 7,898 rows and 18 columns:
- taxon_name
Character vector. The accepted name of the endemic plant species.
- taxon_status
Character vector. The taxonomic status of the species (e.g., "Accepted").
- family
Character vector. The family of the accepted name of the endemic plant species.
- genus
Character vector. The genus of the endemic plant species.
- species
Character vector. The specific epithet of the endemic plant species.
- infraspecific_rank
Character vector. The infraspecific rank (e.g., "subsp.", "var.") when applicable.
- infraspecies
Character vector. The infraspecific epithet when applicable.
- taxon_authors
Character vector. The author(s) of the accepted name of the endemic plant species.
- primary_author
Character vector. The primary author(s) of the publication containing the endemic plant species information.
- place_of_publication
Character vector. The place of publication of the endemic plant species information.
- volume_and_page
Character vector. The volume and page number of the publication containing the endemic plant species information.
- first_published
Character vector. The first published year of the publication containing the endemic plant species information.
- year_actual
Numeric vector. The actual year of publication extracted from first_published.
- year_nominal
Numeric vector. The nominal year of publication extracted from first_published.
- both_years
Character vector. Both actual and nominal years when different, extracted from first_published.
- has_different_years
Logical vector. Indicates whether the actual and nominal publication years differ (TRUE when both_years contains the pattern "YYYY|YYYY").
- version
Character vector. The version identifier "V-14" of the ppendemic database.
- version_date
Character vector. The version date "28-05-2025" indicating when this version was created.
Details
The dataset provides a curated and up-to-date collection of Peru's endemic plant species, gathered from reputable botanical sources and publications. The data for this database was extracted and compiled from the World Checklist of Vascular Plants (WCVP) database, which is a comprehensive and reliable repository of botanical information.
This version (ppendemic_tab14) includes enhanced temporal information with separate numeric fields for actual and nominal publication years. This allows for more precise bibliographic tracking and citation accuracy. The dataset also includes improved infraspecific taxonomy handling with dedicated fields for ranks and epithets.
The year extraction process uses sophisticated pattern matching to distinguish between actual publication years and nominal years, with the has_different_years field automatically flagging records where these differ. This is particularly important for historical botanical publications where publication delays were common.
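The relationship between both_years and has_different_years can be illustrated with a small base R sketch. The input values are invented for the example, and the use of NA for records with a single year is an assumption. Because the source does not state which side of the "YYYY|YYYY" pattern is the actual year and which is the nominal year, the two parts are given neutral names here.

```r
# Hypothetical both_years values; NA is assumed to mark records
# whose actual and nominal years coincide.
both_years <- c("1846|1847", NA, "1905|1906")

# Flag records that follow the "YYYY|YYYY" pattern.
has_different_years <- !is.na(both_years) &
  grepl("^[0-9]{4}\\|[0-9]{4}$", both_years)

# Split the flagged records into their two component years.
parts <- strsplit(both_years[has_different_years], "|", fixed = TRUE)
first_year <- as.numeric(vapply(parts, `[`, character(1), 1))
second_year <- as.numeric(vapply(parts, `[`, character(1), 2))

has_different_years
data.frame(first_year, second_year)
```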
Source
The dataset has been carefully compiled and updated to offer the latest insights into Peru's endemic plant species. The data is sourced from the World Checklist of Vascular Plants (WCVP) database, an international collaborative programme initiated in 1988 by Rafaël Govaerts that provides high-quality expert-reviewed taxonomic data on all vascular plants.
For detailed methodology, see Govaerts et al. (2021) "The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity" in Scientific Data 8, 215.
Examples
# Load the package
library(ppendemic)
# Access the dataset
data("ppendemic_tab14")
# View the structure of the dataset
str(ppendemic_tab14)
# View first few rows
head(ppendemic_tab14)
# Check for species with different actual and nominal years
different_years <- subset(ppendemic_tab14, has_different_years == TRUE)
nrow(different_years)
# View records with both years information
head(ppendemic_tab14$both_years[ppendemic_tab14$has_different_years])
Suffix Match Species within Genus
Description
Function to match the specific epithet by exchanging common suffixes within an already matched genus in the ppendemic database.
Usage
suffix_match_species_within_genus_helper(df, target_df)
Arguments
df |
A tibble containing the species data to be matched. |
target_df |
A tibble representing the ppendemic database containing the reference list of endemic species. |
Value
Returns a tibble with the additional logical column suffix_match_species_within_genus, indicating whether the specific epithet was successfully matched within the matched genus (TRUE) or not (FALSE).
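Suffix exchange can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the suffix list is a guess at typical Latin endings, the function name swap_suffix() is hypothetical, and the candidate epithets are toy data standing in for the epithets listed under an already matched genus.

```r
# Assumed list of interchangeable Latin endings.
suffixes <- c("a", "us", "um", "is", "e", "i", "ae")

# Try to match an epithet by replacing its ending with each alternative.
# Returns the matching candidate, or NA if none is found.
swap_suffix <- function(epithet, candidates, suffixes) {
  for (s in suffixes) {
    if (endsWith(epithet, s)) {
      stem <- substr(epithet, 1, nchar(epithet) - nchar(s))
      variants <- paste0(stem, suffixes)
      hit <- candidates[candidates %in% variants]
      if (length(hit) > 0) return(hit[1])
    }
  }
  NA_character_
}

# Toy epithets for one matched genus.
candidates <- c("peruviana", "andina")

swap_suffix("peruvianus", candidates, suffixes)  # gender ending differs
swap_suffix("alba", candidates, suffixes)        # no variant in the list
```

This step targets a common class of error, an epithet written with the wrong gender ending, which a distance threshold of one would miss when the endings differ by two characters.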